# gooddata-cn
p
We're seeing a pod scheduling issue with Zookeeper in our dev environment that is preventing Pulsar from starting. There are two events on our Zookeeper pod, "FailedScheduling" and "NotTriggerScaleUp", suggesting the pod cannot be scheduled onto any worker node and is therefore stuck in Pending status (see attached pod describe output). We're only seeing this in dev; Zookeeper comes up fine in stage and prod. How should we debug this? Is there a workaround? Thanks for any suggestions.
r
Please review the PersistentVolumeClaim called
pulsar-zookeeper-data-pulsar-zookeeper-0
kubectl describe pvc -n pulsar pulsar-zookeeper-data-pulsar-zookeeper-0
There might be events related to this problem. Also, check whether the storage class defined in that PVC actually exists.
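For example, assuming the same namespace and PVC name as above:
# Check which storage class the PVC asks for, and whether it exists
kubectl get pvc -n pulsar pulsar-zookeeper-data-pulsar-zookeeper-0 -o jsonpath='{.spec.storageClassName}'
kubectl get storageclass

# Scheduling-related events usually show up on the pending pod itself
kubectl describe pod -n pulsar pulsar-zookeeper-0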
p
Here's the result of the above describe:
kubectl describe pvc -n pulsar pulsar-zookeeper-data-pulsar-zookeeper-0
Name:          pulsar-zookeeper-data-pulsar-zookeeper-0
Namespace:     pulsar
StorageClass:  gp2
Status:        Bound
Volume:        pvc-55d0867e-a3e5-4f05-8dfd-481687445f31
Labels:        app=pulsar
               component=zookeeper
               release=pulsar
Annotations:   <http://ebs.csi.aws.com/volumeType|ebs.csi.aws.com/volumeType>: gp3
               <http://pv.kubernetes.io/bind-completed|pv.kubernetes.io/bind-completed>: yes
               <http://pv.kubernetes.io/bound-by-controller|pv.kubernetes.io/bound-by-controller>: yes
               <http://volume.beta.kubernetes.io/storage-provisioner|volume.beta.kubernetes.io/storage-provisioner>: <http://ebs.csi.aws.com|ebs.csi.aws.com>
               <http://volume.kubernetes.io/selected-node|volume.kubernetes.io/selected-node>: ip-10-161-143-251.us-west-2.compute.internal
               <http://volume.kubernetes.io/storage-provisioner|volume.kubernetes.io/storage-provisioner>: <http://ebs.csi.aws.com|ebs.csi.aws.com>
Finalizers:    [<http://kubernetes.io/pvc-protection|kubernetes.io/pvc-protection>]
Capacity:      2Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       pulsar-zookeeper-0
Events:        <none>
The storage class "gp2" appears to exist:
r
And the volume
pvc-55d0867e-a3e5-4f05-8dfd-481687445f31
exists? Does its storage class match the PVC's storage class, and which driver is it using? I can see the ebs.csi.aws.com/volumeType annotation on the PVC is set to gp3 - did you try to change the volume type? Also, since you already have the AWS CSI driver installed and its gp3 storage class created, it may be better to use that instead of the gp2 storage class backed by the old Kubernetes "in-tree" driver (kubernetes.io/aws-ebs). The CSI driver offers more functionality, like volume expansion, snapshotting, etc.
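For reference, a quick way to cross-check the PV against the PVC (volume name taken from the describe output above):
# Does the PV exist, and what storage class / driver does it report?
kubectl get pv pvc-55d0867e-a3e5-4f05-8dfd-481687445f31
kubectl describe pv pvc-55d0867e-a3e5-4f05-8dfd-481687445f31

# Compare provisioners: in-tree kubernetes.io/aws-ebs vs. CSI ebs.csi.aws.com
kubectl get storageclass gp2 gp3 -o wide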
p
No, the volume does not exist; we removed it as a debugging step so we could perform a fresh deployment. We were assuming the init jobs would provision the required volumes.
ok, we'll try with "gp3".
r
Since you deleted the PV, delete the PVC as well (the command may hang because the PVC is still attached to a pod). Meanwhile, delete the pod
pulsar-zookeeper-0
- no worries, it will be recreated from the StatefulSet, including the PVC
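A rough sketch of that cleanup, assuming the names from above (run the two deletes in separate terminals, or background the first one):
# This will hang on the pvc-protection finalizer until the pod is gone
kubectl delete pvc -n pulsar pulsar-zookeeper-data-pulsar-zookeeper-0 &

# Deleting the pod lets the PVC deletion finish; the StatefulSet then
# recreates both the pod and a fresh PVC
kubectl delete pod -n pulsar pulsar-zookeeper-0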
but definitely go to gp3 as well
I wonder if changing the storage class is as simple as setting it in the StatefulSet's PVC template... 🤔 Maybe not
I never tried it
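If the template can't be changed in place (volumeClaimTemplates are immutable on an existing StatefulSet), the usual workaround would be something like this - untested sketch, StatefulSet name assumed from the pod name:
# Remove only the StatefulSet object, leaving pods and PVCs untouched
kubectl delete statefulset -n pulsar pulsar-zookeeper --cascade=orphan

# Then re-apply the StatefulSet (e.g. via helm upgrade) with the new
# storageClassName; existing PVCs keep gp2, newly created ones get gp3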
And other gp2-class volumes work fine?
p
yes, gp2 is working for us in stage and prod, but we could try gp3 in dev and (if all goes well) switch to gp3 everywhere
r
did you manage to delete the PVC and the pod?
it should fix your deployment
p
will try, we'll need an admin to perform the delete on our cluster ... I will ping them when they're free
We resolved our issue with Pulsar (having to do with quorums in ZK), now GD.CN is deployed on our dev environment. 🎉 However, we are noticing frequent pod restarts in GD.CN. I've attached the results of kubectl describe pod -n gooddata-cn gooddata-cn-export-controller-5d48966c4b-xdb9z for one of the restarting pods, which suggests some issue with the liveness probes. Because of this issue, the pods are continually terminated and recreated. We're still investigating but perhaps you have some ideas that can help. Thanks so much! @Robert Moucha
Some additional logs for reference, I'm seeing messages about GRPC and HTTP health checks.
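For context, we pulled those roughly like this (namespace and pod name as in our deployment):
# Recent probe-related events across the namespace
kubectl get events -n gooddata-cn --sort-by=.lastTimestamp | grep -i probe

# Logs of the previous (killed) container instance of a restarting pod
kubectl logs -n gooddata-cn gooddata-cn-export-controller-5d48966c4b-xdb9z --previous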
r
Hi, the calcique pods behave correctly: their liveness checks are OK, but the readiness checks fail because the service gooddata-cn-metadata-api-headless (which they depend on) is not available. The calcique pods are therefore running but marked as not ready, so the root cause is in the metadata-api pods.
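A quick way to confirm that is to check whether the headless service has any ready endpoints (service name as above):
# No addresses listed here means no ready metadata-api pods
kubectl get endpoints -n gooddata-cn gooddata-cn-metadata-api-headless

# And the pods behind it
kubectl get pods -n gooddata-cn | grep metadata-api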
Please check whether the metadata-api pods are working correctly. Not sure about the export-controller - are those pods still restarting? And one (unrelated) note: I recommend configuring the export controller with an S3 bucket for exporter file storage.
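To see at a glance which pods keep restarting and why, plain kubectl is usually enough (the pod name is a placeholder):
# Restart counters for the whole namespace
kubectl get pods -n gooddata-cn

# Why the last container instance died (exit code, OOMKilled, probe failure, ...)
kubectl describe pod -n gooddata-cn <pod-name> | grep -A 5 'Last State'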
p
Thanks Robert. metadata-api is another service that restarts continually, apparently because its startup probe keeps failing. Attaching the logs and the pod describe output.
r
Sorry for the late response, I was out of the office last week. I checked the attached metadata-api log, but there are no signs of failing probes. There are other issues in your log, however: I can see a lot of "No organization found for hostname 10.163.188.224" errors. "10.163.188.224" is actually a pod IP, not an organization hostname (as specified in your Organization custom resources), so I wonder where those requests are coming from. Other, possibly related errors suggest that something identifying itself as "Go-http-client/1.1" is calling the
OPTIONS
method on the gRPC interface:
,"logger":"io.grpc.netty.NettyServerTransport.connections","thread":"grpc-default-worker-ELG-13-2","msg":"Transport failed","exc":"io.netty.handler.codec.http2.Http2Exception: Unexpected HTTP/1.x request: OPTIONS /"...