# gooddata-cn
Pete:
It seems the zookeeper pods on two of our GCP deployments cannot be scheduled due to the following error:
```
0/31 nodes are available: 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1707115371}, 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1707115382}, 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 18 node(s) had volume node affinity conflict, 9 node(s) didn't match Pod's node affinity/selector. preemption: 0/31 nodes are available: 31 Preemption is not helpful for scheduling.
```
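For anyone triaging a similar message, a quick way to map each failure count to concrete nodes is to list node taints and zones, then read the scheduler's events for the stuck pod. A minimal sketch (the pod and namespace names below are placeholders, not necessarily the ones from this deployment):

```sh
# Which taints are currently on each node
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Which availability zone each node is in
kubectl get nodes -L topology.kubernetes.io/zone

# The scheduler's full reasoning for the pending pod (names are placeholders)
kubectl describe pod gooddata-cn-zookeeper-0 -n gooddata-cn
```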
We are getting similar errors for the Pulsar bookie pods, so Pulsar and, in turn, our core GD pods are stuck in CrashLoopBackOff. We haven't made any changes to the chart or deployment, so it's not clear how there could suddenly be affinity issues preventing the pods from being scheduled. Nevertheless, the issue is making our GoodData deployments unusable due to the dependency on Pulsar. Are there any workarounds we can follow to get our clusters back up and running? Thanks for any help.
Boris:
Hello Pete, it seems that this issue is not directly related to the GD.CN deployment, but is rather a GKE issue. The `18 node(s) had volume node affinity conflict` message suggests that you have GD.CN deployed across multiple availability zones, and `9 node(s) didn't match Pod's node affinity/selector` suggests that you have dedicated nodes for your GD.CN installation. It could be that cluster autoscaling broke somehow and four worker nodes in the cluster are in a bad state, so it would be good to fix the cluster autoscaling and heal the nodes carrying the `ToBeDeletedByClusterAutoscaler`, `node.cloudprovider.kubernetes.io/uninitialized`, or `node.kubernetes.io/not-ready` taints. Is it correct that you deploy GD.CN in multiple availability zones and pin GD.CN to specific GKE worker nodes? If yes, please first heal the Kubernetes cluster (maybe with help from Google support?) and ensure that all the worker nodes in the cluster are healthy.
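A volume node affinity conflict typically means a PersistentVolume is pinned to one zone while the only schedulable nodes sit in another. A minimal sketch for comparing the two sides (output formats may vary by provisioner):

```sh
# Zone(s) each PersistentVolume is pinned to via its nodeAffinity
kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values}{"\n"}{end}'

# Zones of the nodes the scheduler can actually use
kubectl get nodes -L topology.kubernetes.io/zone
```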
Pete:
Thank you, Boris. Investigating ...
To close this thread: it was an issue with the storage driver specific to our GCP clusters (with a misleading error message), which we resolved by enabling Google's own driver (the Compute Engine Persistent Disk CSI driver) rather than the default in-tree Kubernetes driver. A support ticket with Google helped us get there. Thanks so much for pointing us in the right direction, @Boris.
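For readers with the same symptom, enabling Google's CSI driver on an existing GKE cluster is a one-line update; the cluster name and region below are placeholders, and the exact fix for your cluster may differ:

```sh
# Enable the Compute Engine Persistent Disk CSI driver add-on on a GKE cluster
# (cluster name and region are placeholders)
gcloud container clusters update my-cluster \
  --update-addons=GcePersistentDiskCsiDriver=ENABLED \
  --region=us-central1
```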
Boris:
Thanks for sharing the resolution, Pete!