# gooddata-cn
Pete:
It seems the zookeeper pods on two of our GCP deployments cannot be scheduled due to the following error:
```
0/31 nodes are available: 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1707115371}, 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1707115382}, 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 18 node(s) had volume node affinity conflict, 9 node(s) didn't match Pod's node affinity/selector. preemption: 0/31 nodes are available: 31 Preemption is not helpful for scheduling.
```
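For anyone triaging a similar message, a quick way to map each failure count to concrete nodes is to list node taints and zones, then read the scheduler's events for the stuck pod. A minimal sketch (the pod and namespace names below are placeholders, not necessarily the ones from this deployment):

```sh
# Which taints are currently on each node
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Which availability zone each node is in
kubectl get nodes -L topology.kubernetes.io/zone

# The scheduler's full reasoning for the pending pod (names are placeholders)
kubectl describe pod gooddata-cn-zookeeper-0 -n gooddata-cn
```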
We are getting similar errors for the Pulsar bookie pods, so Pulsar and, in turn, our core GD pods are stuck in CrashLoopBackOff. We haven't made any changes to the chart or deployment, so it's not clear how there could suddenly be affinity issues preventing the pods from being scheduled. Nevertheless, the issue is making our GoodData deployments unusable due to the dependency on Pulsar. Are there any workarounds we can follow to get our clusters back up and running? Thanks for any help.
Boris:
Hello Pete, it seems that this issue is not directly related to the GD.CN deployment, but is rather a GKE issue. The `18 node(s) had volume node affinity conflict` message suggests that you have GD.CN deployed across multiple availability zones, and `9 node(s) didn't match Pod's node affinity/selector` suggests that you have dedicated nodes for your GD.CN installation. It could be that cluster autoscaling broke somehow and four worker nodes in the cluster are in a bad state, so it would be good to fix the cluster autoscaling and heal the nodes carrying the `ToBeDeletedByClusterAutoscaler`, `node.cloudprovider.kubernetes.io/uninitialized`, or `node.kubernetes.io/not-ready` taints. Is it correct that you deploy GD.CN in multiple availability zones and pin GD.CN to specific GKE worker nodes? If yes, please first heal the Kubernetes cluster (maybe with help from Google support?) and ensure that all the worker nodes in the cluster are healthy.
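A volume node affinity conflict typically means a PersistentVolume is pinned to one zone while the only schedulable nodes sit in another. A minimal sketch for comparing the two sides (output formats may vary by provisioner):

```sh
# Zone(s) each PersistentVolume is pinned to via its nodeAffinity
kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values}{"\n"}{end}'

# Zones of the nodes the scheduler can actually use
kubectl get nodes -L topology.kubernetes.io/zone
```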
Pete:
Thank you, Boris. Investigating ...
To close this thread: it was an issue with the storage driver specific to our GCP clusters (with a misleading error message), which we resolved by enabling Google's own driver (the Compute Engine Persistent Disk CSI driver) rather than the default in-tree Kubernetes driver. A support ticket with Google helped us get there. Thanks so much for pointing us in the right direction, @Boris.
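For readers with the same symptom, enabling Google's CSI driver on an existing GKE cluster is a one-line update; the cluster name and region below are placeholders, and the exact fix for your cluster may differ:

```sh
# Enable the Compute Engine Persistent Disk CSI driver add-on on a GKE cluster
# (cluster name and region are placeholders)
gcloud container clusters update my-cluster \
  --update-addons=GcePersistentDiskCsiDriver=ENABLED \
  --region=us-central1
```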
Boris:
Thanks for sharing the resolution, Pete!