# gooddata-cn
p
We're seeing a pod scheduling issue with Zookeeper in our dev environment that is preventing Pulsar from starting. There are two events on our Zookeeper pod, "FailedScheduling" and "NotTriggerScaleUp", suggesting the pod cannot be scheduled onto any worker node and is therefore stuck in Pending status (see attached pod describe output). We're only seeing this in dev; Zookeeper comes up fine in stage and prod. How should we debug this? Is there a workaround? Thanks for any suggestions.
r
Please review the PersistentVolumeClaim called
pulsar-zookeeper-data-pulsar-zookeeper-0
kubectl describe pvc -n pulsar pulsar-zookeeper-data-pulsar-zookeeper-0
There might be events related to this problem. Also, check whether the storage class defined in that PVC actually exists.
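For example, assuming the same namespace and PVC name as above:
# Check which storage class the PVC asks for, and whether it exists
kubectl get pvc -n pulsar pulsar-zookeeper-data-pulsar-zookeeper-0 -o jsonpath='{.spec.storageClassName}'
kubectl get storageclass

# Scheduling-related events usually show up on the pending pod itself
kubectl describe pod -n pulsar pulsar-zookeeper-0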
p
Here's the result of the above describe:
kubectl describe pvc -n pulsar pulsar-zookeeper-data-pulsar-zookeeper-0
Name:          pulsar-zookeeper-data-pulsar-zookeeper-0
Namespace:     pulsar
StorageClass:  gp2
Status:        Bound
Volume:        pvc-55d0867e-a3e5-4f05-8dfd-481687445f31
Labels:        app=pulsar
               component=zookeeper
               release=pulsar
Annotations:   <http://ebs.csi.aws.com/volumeType|ebs.csi.aws.com/volumeType>: gp3
               <http://pv.kubernetes.io/bind-completed|pv.kubernetes.io/bind-completed>: yes
               <http://pv.kubernetes.io/bound-by-controller|pv.kubernetes.io/bound-by-controller>: yes
               <http://volume.beta.kubernetes.io/storage-provisioner|volume.beta.kubernetes.io/storage-provisioner>: <http://ebs.csi.aws.com|ebs.csi.aws.com>
               <http://volume.kubernetes.io/selected-node|volume.kubernetes.io/selected-node>: ip-10-161-143-251.us-west-2.compute.internal
               <http://volume.kubernetes.io/storage-provisioner|volume.kubernetes.io/storage-provisioner>: <http://ebs.csi.aws.com|ebs.csi.aws.com>
Finalizers:    [<http://kubernetes.io/pvc-protection|kubernetes.io/pvc-protection>]
Capacity:      2Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       pulsar-zookeeper-0
Events:        <none>
The storage class "gp2" appears to exist:
r
And the volume
pvc-55d0867e-a3e5-4f05-8dfd-481687445f31
exists? Does its storage class match the PVC's storage class, and which driver is it using? I can see the ebs.csi.aws.com/volumeType annotation on the PVC is set to gp3 - did you try to change the volume type? Also, since you already have the AWS CSI driver installed and its gp3 storage class created, it may be better to use that instead of the gp2 storage class backed by the old Kubernetes "in-tree" driver (kubernetes.io/aws-ebs). The CSI driver offers more functionality, like volume expansion, snapshotting, etc.
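For reference, a quick way to cross-check the PV against the PVC (volume name taken from the describe output above):
# Does the PV exist, and what storage class / driver does it report?
kubectl get pv pvc-55d0867e-a3e5-4f05-8dfd-481687445f31
kubectl describe pv pvc-55d0867e-a3e5-4f05-8dfd-481687445f31

# Compare provisioners: in-tree kubernetes.io/aws-ebs vs. CSI ebs.csi.aws.com
kubectl get storageclass gp2 gp3 -o wide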
p
No, the volume does not exist; we removed it as a debugging step so we could perform a fresh deployment. We were assuming the init jobs would provision the required volumes.
ok, we'll try with "gp3".
r
Since you deleted the PV, delete the PVC as well (the command may hang because the PVC is still attached to a pod). Meanwhile, delete the pod
pulsar-zookeeper-0
- no worries, it will be recreated from the StatefulSet, including the PVC
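A rough sketch of that cleanup, assuming the names from above (run the two deletes in separate terminals, or background the first one):
# This will hang on the pvc-protection finalizer until the pod is gone
kubectl delete pvc -n pulsar pulsar-zookeeper-data-pulsar-zookeeper-0 &

# Deleting the pod lets the PVC deletion finish; the StatefulSet then
# recreates both the pod and a fresh PVC
kubectl delete pod -n pulsar pulsar-zookeeper-0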
but definitely go to gp3 as well
I wonder if changing the storage class is as simple as setting it in the StatefulSet's PVC template... 🤔 Maybe not
I never tried it
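If the template can't be changed in place (volumeClaimTemplates are immutable on an existing StatefulSet), the usual workaround would be something like this - untested sketch, StatefulSet name assumed from the pod name:
# Remove only the StatefulSet object, leaving pods and PVCs untouched
kubectl delete statefulset -n pulsar pulsar-zookeeper --cascade=orphan

# Then re-apply the StatefulSet (e.g. via helm upgrade) with the new
# storageClassName; existing PVCs keep gp2, newly created ones get gp3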
And other gp2-class volumes work fine?
p
yes, gp2 is working for us in stage and prod, but we could try gp3 in dev and (if all goes well) switch to gp3 everywhere
r
did you manage to delete the PVC and the pod?
it should fix your deployment
p
will try, we'll need an admin to perform the delete on our cluster ... I will ping them when they're free
We resolved our issue with Pulsar (having to do with quorums in ZK), now GD.CN is deployed on our dev environment. 🎉 However, we are noticing frequent pod restarts in GD.CN. I've attached the results of kubectl describe pod -n gooddata-cn gooddata-cn-export-controller-5d48966c4b-xdb9z for one of the restarting pods, which suggests some issue with the liveness probes. Because of this issue, the pods are continually terminated and recreated. We're still investigating but perhaps you have some ideas that can help. Thanks so much! @Robert Moucha
Some additional logs for reference, I'm seeing messages about GRPC and HTTP health checks.
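For context, we pulled those roughly like this (namespace and pod name as in our deployment):
# Recent probe-related events across the namespace
kubectl get events -n gooddata-cn --sort-by=.lastTimestamp | grep -i probe

# Logs of the previous (killed) container instance of a restarting pod
kubectl logs -n gooddata-cn gooddata-cn-export-controller-5d48966c4b-xdb9z --previous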
r
Hi, the calcique pods behave correctly: their liveness checks are OK, but the readiness checks fail because the service gooddata-cn-metadata-api-headless (which they depend on) is not available. The calcique pods are therefore running but marked as not ready, so the root cause is in the metadata-api pods.
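A quick way to confirm that is to check whether the headless service has any ready endpoints (service name as above):
# No addresses listed here means no ready metadata-api pods
kubectl get endpoints -n gooddata-cn gooddata-cn-metadata-api-headless

# And the pods behind it
kubectl get pods -n gooddata-cn | grep metadata-api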
Please check whether the metadata-api pods are working correctly. Not sure about the export-controller - are those pods still restarting? And one (unrelated) note: I recommend configuring the export controller with an S3 bucket for exporter file storage.
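To see at a glance which pods keep restarting and why, plain kubectl is usually enough (the pod name is a placeholder):
# Restart counters for the whole namespace
kubectl get pods -n gooddata-cn

# Why the last container instance died (exit code, OOMKilled, probe failure, ...)
kubectl describe pod -n gooddata-cn <pod-name> | grep -A 5 'Last State'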
p
Thanks Robert. metadata-api is another service that restarts continually, apparently because its startup probe keeps failing. Attaching the logs and the pod describe output.
r
Sorry for the late response, I was out of the office last week. I checked the attached metadata-api log, but there are no signs of failing probes. There are other issues in your log, however: I can see a lot of "No organization found for hostname 10.163.188.224" errors. "10.163.188.224" is actually a pod IP, not an organization hostname (as specified in your Organization custom resources), so I wonder where those requests are coming from. Other, possibly related errors suggest that something identifying itself as "Go-http-client/1.1" is calling the
OPTIONS
method on the gRPC interface:
,"logger":"io.grpc.netty.NettyServerTransport.connections","thread":"grpc-default-worker-ELG-13-2","msg":"Transport failed","exc":"io.netty.handler.codec.http2.Http2Exception: Unexpected HTTP/1.x request: OPTIONS /"...