Pete Lorenz
02/05/2024, 6:58 AM
0/31 nodes are available: 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1707115371}, 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1707115382}, 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 18 node(s) had volume node affinity conflict, 9 node(s) didn't match Pod's node affinity/selector. preemption: 0/31 nodes are available: 31 Preemption is not helpful for scheduling.
We are getting similar errors for the Pulsar bookie, and as a result Pulsar and our core GD pods are in a CrashLoopBackOff state. We haven't made any changes to the chart or deployment, so it's not clear how affinity issues could suddenly prevent the pods from being scheduled. Regardless, the issue is making our GoodData deployments unusable due to the dependency on Pulsar. Are there any workarounds we can follow to get our clusters back up and running? Thanks for any help.

Boris
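A quick way to see which taints and node states are blocking scheduling is the sketch below (namespace and pod names are assumptions; adjust to your deployment):

```shell
# Show the scheduler's reasoning for a pending pod (e.g. a Pulsar bookie)
kubectl -n pulsar describe pod pulsar-bookie-0 | grep -A5 Events

# List the taints currently set on every node
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Nodes stuck in NotReady or mid-scale-down show up here
kubectl get nodes
```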
02/05/2024, 2:49 PM
From the "18 node(s) had volume node affinity conflict" message it seems that you have GD.CN deployed in multiple availability zones. Based on "9 node(s) didn't match Pod's node affinity/selector" it can be assumed that you have dedicated nodes for your GD.CN installation.
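A "volume node affinity conflict" usually means a PersistentVolume is pinned to one availability zone while the only schedulable nodes are in another. A sketch for checking this (the PV name is a placeholder you would take from the PVC binding):

```shell
# Find which PV backs each claim of the pending pods
kubectl get pvc -n pulsar

# The "Node Affinity" section shows the zone the volume is pinned to
kubectl describe pv <pv-name>

# Compare against the zones of the nodes that are still schedulable
kubectl get nodes -L topology.kubernetes.io/zone
```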
It could be that cluster autoscaling somehow got broken and 4 worker nodes in the cluster are in bad shape, so it would be good to fix the cluster autoscaling and heal the nodes carrying the ToBeDeletedByClusterAutoscaler, uninitialized, and not-ready taints.
Is it correct that you deploy GD.CN in multiple availability zones and that you are trying to pin GD.CN to specific GKE worker nodes? If yes, please first heal the Kubernetes cluster (maybe with help from Google support?) and ensure that all the worker nodes in the cluster are healthy.

Pete Lorenz
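If the autoscaler has abandoned nodes mid-scale-down, one possible workaround (a sketch; verify the node is actually healthy before keeping it, and note `<node-name>` is a placeholder) is to remove the stale taint so pods can schedule again, or to drain and delete the node so the node pool replaces it:

```shell
# Remove a stale scale-down taint (the trailing '-' deletes the taint)
kubectl taint nodes <node-name> ToBeDeletedByClusterAutoscaler-

# Or retire the broken node entirely and let the node pool recreate it
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```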
02/05/2024, 3:12 PM

Pete Lorenz
02/05/2024, 11:49 PM

Boris
02/06/2024, 7:28 AM