# gooddata-cn
p
We've observed that the Bitnami etcd chart is deployed as part of our GD.CN 3.1.0 deployment and creates a statefulset with 3 pods. It seems that one of the 3 pods is consistently down in each of our environments with the following error:
```
Updating member in existing cluster
2024-01-19T15:36:31.311396926Z {"level":"warn","ts":"2024-01-19T15:36:31.311232Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000322c40/gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = NotFound desc = etcdserver: member not found"}
2024-01-19T15:36:31.311425474Z Error: etcdserver: member not found
```
The pod enters CrashLoopBackOff state and repeatedly restarts. Any ideas on what we can do to resolve this? Is the third-party etcd deployment even necessary and, if not, can we safely remove it? Thanks for suggestions.
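(A minimal sketch for confirming the crash loop and capturing the failing pod's logs; the `gooddata-cn` namespace, pod name, and Bitnami's default label are assumptions taken from the log above, so adjust them to your environment:)
```
# Assumed namespace and label; Bitnami charts usually label pods with app.kubernetes.io/name=etcd.
kubectl get pods -n gooddata-cn -l app.kubernetes.io/name=etcd

# Logs of the previous (crashed) container of the broken pod:
kubectl logs -n gooddata-cn gooddata-cn-etcd-1 --previous
```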
j
This is our internal procedure for this case: Remove the broken etcd pod from the list of members in the etcd cluster (the removal has to be done from a healthy pod, e.g. `etcd-0`):
```
# Find the member ID of the broken pod (here etcd-1), as reported by a healthy pod:
BROKEN_POD_MEMBER_ID=`kubectl exec -it -n quiver etcd-0 -- etcdctl member list | grep etcd-1 | cut -d, -f 1`
echo $BROKEN_POD_MEMBER_ID
# 365963a78ee27498   <- example output
# Remove that member from the cluster:
kubectl exec -it -n quiver etcd-0 -- etcdctl member remove $BROKEN_POD_MEMBER_ID
```
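(Adapted to the environment from the original question, the same step would look roughly like this; the `gooddata-cn` namespace and pod names are assumptions based on the logs above and should be double-checked first:)
```
BROKEN_POD_MEMBER_ID=$(kubectl exec -it -n gooddata-cn gooddata-cn-etcd-0 -- etcdctl member list | grep gooddata-cn-etcd-1 | cut -d, -f 1)
echo $BROKEN_POD_MEMBER_ID
# If grep returned nothing, the broken pod is not a registered member and this removal step can be skipped.
kubectl exec -it -n gooddata-cn gooddata-cn-etcd-0 -- etcdctl member remove $BROKEN_POD_MEMBER_ID
```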
Delete the affected etcd pod's persistent volume claim (asynchronously) so that the new pod initializes a fresh configuration:
```
kubectl delete -n quiver pvc data-etcd-1 --wait=false
```
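(In the gooddata-cn deployment the PVC name follows the statefulset naming, so it is likely `data-gooddata-cn-etcd-1`; that name is an assumption, so list the PVCs first to confirm:)
```
kubectl get pvc -n gooddata-cn
kubectl delete -n gooddata-cn pvc data-gooddata-cn-etcd-1 --wait=false
```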
Trigger ETCD statefulset rolling update:
```
kubectl describe sts -n quiver etcd | grep CLUSTER_STATE
# !!! WARNING: Result STATE must be "existing". If not, please, ping us in this thread !!!
# If state == "existing", then:
kubectl rollout restart -n quiver sts etcd
```
Now all the ETCD pods should be healthy again. If not, please, ping us in this thread!
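(A quick way to verify the cluster recovered, again assuming the namespace, label, and pod names used above; `endpoint health` and `member list` should report three started members:)
```
kubectl get pods -n quiver -l app.kubernetes.io/name=etcd
kubectl exec -it -n quiver etcd-0 -- etcdctl endpoint health --cluster
kubectl exec -it -n quiver etcd-0 -- etcdctl member list
```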
r
don't forget to update the kubectl command-line parameters to match your environment, particularly `-n namespace`; the etcd pod names may also be different
p
Unfortunately, we're not able to run these commands as written, since our cluster is administered through Rancher and we don't have kubectl and etcdctl available in the same shell. Nevertheless, I tried to piece together what these commands are doing and I see some issues. First, our failing pod doesn't appear to have a member ID, so we're not able to remove it by ID (the failing pod is gooddata-cn-etcd-1, which doesn't appear in the member list):
```
etcdctl member list
568dabbac362697e, started, gooddata-cn-etcd-2, http://gooddata-cn-etcd-2.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2380, http://gooddata-cn-etcd-2.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379,http://gooddata-cn-etcd.gooddata-cn.svc.cluster.local:2379, false
c4affa16810d4ac9, started, gooddata-cn-etcd-0, http://gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2380, http://gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379,http://gooddata-cn-etcd.gooddata-cn.svc.cluster.local:2379, false
```
Second, as mentioned in the procedure, the CLUSTER_STATE should be "existing", but it appears to be "new" (or arguably non-existent: the variable we have is ETCD_INITIAL_CLUSTER_STATE, and we don't seem to have a literal CLUSTER_STATE field). Our biggest question is whether this etcd issue is simply a nuisance or whether it may be causing the timeout we're seeing in Analytical Designer (described in a different thread), which is a real blocker for us, since we're currently unable to generate any insights for this organization.
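(For reference, one way to see what the statefulset actually renders for this variable; the statefulset name `gooddata-cn-etcd` and namespace are assumptions:)
```
# Assumed statefulset name and namespace; adjust if yours differ.
kubectl describe sts -n gooddata-cn gooddata-cn-etcd | grep ETCD_INITIAL_CLUSTER_STATE
```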
Attaching the results of `kubectl describe sts`.
Attaching the results of `kubectl describe pod` on the failing pod; it seems to have an issue attaching the volume.
n
@Jan Soubusta we faced the same issue. After following the steps you suggested, we got a different result, as shown below. Please suggest how to fix it.
```
$ kubectl describe sts -n gooddata-cn etcd | grep CLUSTER_STATE
      ETCD_INITIAL_CLUSTER_STATE:        new
```
j
@Robert Moucha please, could you assist here? I am not an expert in this area...
r
@Noushad Ali There is a similar reported issue with Bitnami's etcd Helm chart. The issue seems to be somehow related to the `ETCD_INITIAL_CLUSTER_STATE` variable, which should be set to `new` on chart install and to `existing` on subsequent chart upgrades. This approach is not very friendly for any declarative deployment 😕 In our case, when we introduced the dependency on etcd, we had to add `etcd.initialClusterState=new` to the gooddata-cn chart values. Without it, it would have been impossible to upgrade from an older CN version to a newer one (the one that added etcd). Unfortunately, the presence of this setting in the default chart values prevents the etcd cluster from switching to "existing" mode without overriding the value; from the current perspective, it was not an ideal design decision 🤷🏻 Anyway, we will remove this setting in some future release. For now, please set `etcd.initialClusterState=existing` and make sure the variable `ETCD_INITIAL_CLUSTER_STATE` is set to this value in all 3 etcd pods.
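(A rough sketch of how that override could be applied; the release name, chart reference, and namespace below are assumptions, so substitute your actual release name and values handling:)
```
# Assumed release name, chart, and namespace; keep your existing values and only add the override.
helm upgrade gooddata-cn gooddata/gooddata-cn \
  -n gooddata-cn \
  --reuse-values \
  --set etcd.initialClusterState=existing

# Confirm the variable in all 3 etcd pods (pod names are assumptions):
for i in 0 1 2; do
  kubectl exec -n gooddata-cn gooddata-cn-etcd-$i -- env | grep ETCD_INITIAL_CLUSTER_STATE
done
```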
n
Thanks a lot @Robert Moucha and @Jan Soubusta, the suggested solution worked 🙂
r
I'm glad to hear that.