# gooddata-cn
p
We've observed that the bitnami etcd chart is deployed as part of our GD.CN 3.1.0 deployment and creates a statefulset with 3 pods. It seems that one of the 3 pods is consistently down in each of our environments with the following error:
```
Updating member in existing cluster
2024-01-19T15:36:31.311396926Z {"level":"warn","ts":"2024-01-19T15:36:31.311232Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000322c40/gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = NotFound desc = etcdserver: member not found"}
2024-01-19T15:36:31.311425474Z Error: etcdserver: member not found
```
The pod enters a CrashLoopBackOff state and repeatedly restarts. Any ideas on what we can do to resolve this? Is the third-party etcd deployment even necessary and, if not, can we safely remove it? Thanks for any suggestions.
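For reference, the failing pod and its crash logs can be checked with something like this (a sketch; the namespace is taken from the endpoint in the log above, and the failing pod name is just for illustration):
```
# List the etcd pods; one of the three shows CrashLoopBackOff
kubectl get pods -n gooddata-cn | grep etcd

# Logs of the previously crashed container of the failing pod
kubectl logs -n gooddata-cn gooddata-cn-etcd-1 --previous
```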
j
This is our internal procedure for this case: Remove the broken ETCD pod from the list of members in the ETCD cluster (the removal has to be done from a healthy pod, e.g. `etcd-0`):
```
BROKEN_POD_MEMBER_ID=$(kubectl exec -it -n quiver etcd-0 -- etcdctl member list | grep etcd-1 | cut -d, -f1)
echo $BROKEN_POD_MEMBER_ID
# example output: 365963a78ee27498
kubectl exec -it -n quiver etcd-0 -- etcdctl member remove $BROKEN_POD_MEMBER_ID
```
Delete the affected ETCD pod's persistent volume claim (asynchronously) to let the new pod initialize a fresh configuration:
```
kubectl delete -n quiver pvc data-etcd-1 --wait=false
```
Trigger a rolling update of the ETCD statefulset:
```
kubectl describe sts -n quiver etcd | grep CLUSTER_STATE
# !!! WARNING: The resulting STATE must be "existing". If not, please ping us in this thread !!!
# If state == "existing", then:
kubectl rollout restart -n quiver sts etcd
```
Now all the ETCD pods should be healthy again. If not, please ping us in this thread!
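A quick sanity check (a sketch using the same internal `quiver` namespace and pod names as in the commands above):
```
# All three etcd pods should be Running and Ready after the rollout
kubectl get pods -n quiver | grep etcd

# Optionally check cluster health from any healthy member
kubectl exec -it -n quiver etcd-0 -- etcdctl endpoint health --cluster
```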
r
Don't forget to update the kubectl command-line parameters to match your environment, particularly the `-n <namespace>` option; the etcd pod names may also be different. For example:
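Based on the namespace and pod names visible in the logs earlier in this thread, the adapted commands would look roughly like this (a sketch; the PVC name in particular is an assumption and should be confirmed with `kubectl get pvc -n gooddata-cn`):
```
# Namespace and pod names taken from the gooddata-cn logs above
BROKEN_POD_MEMBER_ID=$(kubectl exec -it -n gooddata-cn gooddata-cn-etcd-0 -- etcdctl member list | grep gooddata-cn-etcd-1 | cut -d, -f1)
kubectl exec -it -n gooddata-cn gooddata-cn-etcd-0 -- etcdctl member remove $BROKEN_POD_MEMBER_ID

# PVC name assumed to follow the data-<pod-name> pattern used by the chart
kubectl delete -n gooddata-cn pvc data-gooddata-cn-etcd-1 --wait=false

# Same CLUSTER_STATE check as above, then restart
kubectl describe sts -n gooddata-cn gooddata-cn-etcd | grep CLUSTER_STATE
kubectl rollout restart -n gooddata-cn sts gooddata-cn-etcd
```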
p
Unfortunately, we're not able to run these commands as written since our cluster is administered through Rancher and we don't have kubectl and etcdctl available in the same shell. Nevertheless, I tried to piece together what these commands are doing and I see some issues. First, our failing pod doesn't appear to have a member ID, so we're not able to remove it by ID (the failing pod is gooddata-cn-etcd-1, which doesn't appear in the member list):
```
etcdctl member list
568dabbac362697e, started, gooddata-cn-etcd-2, http://gooddata-cn-etcd-2.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2380, http://gooddata-cn-etcd-2.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379,http://gooddata-cn-etcd.gooddata-cn.svc.cluster.local:2379, false
c4affa16810d4ac9, started, gooddata-cn-etcd-0, http://gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2380, http://gooddata-cn-etcd-0.gooddata-cn-etcd-headless.gooddata-cn.svc.cluster.local:2379,http://gooddata-cn-etcd.gooddata-cn.svc.cluster.local:2379, false
```
Second, as mentioned in the procedure, the CLUSTER_STATE should be "existing", but it appears to be "new" (or rather non-existent: the variable on our statefulset is ETCD_INITIAL_CLUSTER_STATE, and we don't seem to have a CLUSTER_STATE field at all). Our biggest question is whether this etcd issue is merely a nuisance or whether it may be causing the timeout we're seeing in Analytical Designer (described in a different thread), which is a real blocker for us, since we're currently unable to generate any insights for this organization.
Attaching the results of `kubectl describe sts`.
Attaching the results of `kubectl describe pod` on the failing pod; it seems to have an issue attaching the volume.
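For reference, the ETCD_INITIAL_CLUSTER_STATE value mentioned above can be read straight from the statefulset spec with something like this (statefulset name and namespace assumed from our environment):
```
kubectl get sts -n gooddata-cn gooddata-cn-etcd \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="ETCD_INITIAL_CLUSTER_STATE")].value}'
```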