# gooddata-cn
p
We've observed that the bitnami etcd chart is deployed as part of our GD.CN 3.1.0 deployment and creates a statefulset with 3 pods. It seems that one of the 3 pods is consistently down in each of our environments with the following error:
```
Updating member in existing cluster
2024-01-19T15:36:31.311396926Z {"level":"warn","ts":"2024-01-19T15:36:31.311232Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000322c40/gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = NotFound desc = etcdserver: member not found"}
2024-01-19T15:36:31.311425474Z Error: etcdserver: member not found
```
The pod enters a CrashLoopBackOff state and repeatedly restarts. Any ideas on what we can do to resolve this? Is the third-party etcd deployment even necessary and, if not, can we safely remove it? Thanks for any suggestions.
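For reference, the failing pod and its crash logs can be checked with something like this (a sketch; the namespace is taken from the endpoint in the log above, and the failing pod name is just for illustration):
```
# List the etcd pods; one of the three shows CrashLoopBackOff
kubectl get pods -n gooddata-cn | grep etcd

# Logs of the previously crashed container of the failing pod
kubectl logs -n gooddata-cn gooddata-cn-etcd-1 --previous
```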
j
This is our internal procedure for this case: Remove the broken ETCD pod from the list of members in the ETCD cluster (the removal has to be done from a healthy pod, e.g. `etcd-0`):
```
BROKEN_POD_MEMBER_ID=$(kubectl exec -it -n quiver etcd-0 -- etcdctl member list | grep etcd-1 | cut -d, -f1)
echo $BROKEN_POD_MEMBER_ID
# example output: 365963a78ee27498
kubectl exec -it -n quiver etcd-0 -- etcdctl member remove $BROKEN_POD_MEMBER_ID
```
Delete the affected ETCD pod's persistent volume claim (asynchronously) to let the new pod initialize a fresh configuration:
```
kubectl delete -n quiver pvc data-etcd-1 --wait=false
```
Trigger a rolling update of the ETCD statefulset:
```
kubectl describe sts -n quiver etcd | grep CLUSTER_STATE
# !!! WARNING: The resulting STATE must be "existing". If not, please ping us in this thread !!!
# If state == "existing", then:
kubectl rollout restart -n quiver sts etcd
```
Now all the ETCD pods should be healthy again. If not, please ping us in this thread!
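A quick sanity check (a sketch using the same internal `quiver` namespace and pod names as in the commands above):
```
# All three etcd pods should be Running and Ready after the rollout
kubectl get pods -n quiver | grep etcd

# Optionally check cluster health from any healthy member
kubectl exec -it -n quiver etcd-0 -- etcdctl endpoint health --cluster
```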
r
Don't forget to update the kubectl command-line parameters to match your environment, particularly the `-n <namespace>` option; the etcd pod names may also be different. For example:
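Based on the namespace and pod names visible in the logs earlier in this thread, the adapted commands would look roughly like this (a sketch; the PVC name in particular is an assumption and should be confirmed with `kubectl get pvc -n gooddata-cn`):
```
# Namespace and pod names taken from the gooddata-cn logs above
BROKEN_POD_MEMBER_ID=$(kubectl exec -it -n gooddata-cn gooddata-cn-etcd-0 -- etcdctl member list | grep gooddata-cn-etcd-1 | cut -d, -f1)
kubectl exec -it -n gooddata-cn gooddata-cn-etcd-0 -- etcdctl member remove $BROKEN_POD_MEMBER_ID

# PVC name assumed to follow the data-<pod-name> pattern used by the chart
kubectl delete -n gooddata-cn pvc data-gooddata-cn-etcd-1 --wait=false

# Same CLUSTER_STATE check as above, then restart
kubectl describe sts -n gooddata-cn gooddata-cn-etcd | grep CLUSTER_STATE
kubectl rollout restart -n gooddata-cn sts gooddata-cn-etcd
```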
p
Unfortunately, we're not able to run these commands as written since our cluster is administered through Rancher and we don't have kubectl and etcdctl available in the same shell. Nevertheless, I tried to piece together what these commands are doing and I see some issues. First, our failing pod doesn't appear to have a member ID, so we're not able to remove it by ID (the failing pod is gooddata-cn-etcd-1, which doesn't appear in the member list):
```
etcdctl member list
568dabbac362697e, started, gooddata-cn-etcd-2, http://gooddata-cn-etcd-2.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2380, http://gooddata-cn-etcd-2.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379,http://gooddata-cn-etcd.gooddata-cn.svc.cluster.local:2379, false
c4affa16810d4ac9, started, gooddata-cn-etcd-0, http://gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2380, http://gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379,http://gooddata-cn-etcd.gooddata-cn.svc.cluster.local:2379, false
```
Second, as mentioned in the procedure, the CLUSTER_STATE should be "existing", but it appears to be "new" (or rather non-existent: the variable on our statefulset is ETCD_INITIAL_CLUSTER_STATE, and we don't seem to have a CLUSTER_STATE field at all). Our biggest question is whether this etcd issue is merely a nuisance or whether it may be causing the timeout we're seeing in Analytical Designer (described in a different thread), which is a real blocker for us, since we're currently unable to generate any insights for this organization.
Attaching the results of `kubectl describe sts`.
Attaching the results of `kubectl describe pod` on the failing pod; it seems to have an issue attaching the volume.
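For reference, the ETCD_INITIAL_CLUSTER_STATE value mentioned above can be read straight from the statefulset spec with something like this (statefulset name and namespace assumed from our environment):
```
kubectl get sts -n gooddata-cn gooddata-cn-etcd \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="ETCD_INITIAL_CLUSTER_STATE")].value}'
```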