We re seeing an issue in our dev cluster where the calcique GoodData #gooddata-cn

Pete Lorenz

10/30/2023, 3:55 PM

We're seeing an issue in our dev cluster where the calcique and afm-exec-api pods (2 replicas) appear as running but unready. In the calcique logs, we see messages that the pods cannot resolve gooddata-cn-result-cache-headless:

"level":"ERROR","logger":"com.gooddata.tiger.grpc.healthcheck.GrpcHealthCheck","thread":"boundedElastic-2","traceId":"9e2839743de46645","spanId":"9e2839743de46645","msg":"Error during GRPC Healthcheck call","action":"grpcHealthCheck","exc":"io.grpc.StatusRuntimeException: UNAVAILABLE: Unable to resolve host gooddata-cn-result-cache-headless

And warnings that the host is not resolvable:

"level":"WARN","logger":"io.grpc.internal.ManagedChannelImpl","thread":"grpc-default-executor-4","msg":"[Channel<3>: (gooddata-cn-result-cache-headless:6567)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host gooddata-cn-result-cache-headless, cause=java.lang.RuntimeException: java.net.UnknownHostException: gooddata-cn-result-cache-headless: Name or service not known\n\tat io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)

There are similar errors regarding gooddata-cn-metadata-api-headless:

"level":"ERROR","logger":"com.gooddata.tiger.grpc.healthcheck.GrpcHealthCheck","thread":"boundedElastic-1","traceId":"756f885e0c24d41d","spanId":"756f885e0c24d41d","msg":"Error during GRPC Healthcheck call","action":"grpcHealthCheck","exc":"io.grpc.StatusRuntimeException: UNAVAILABLE: Unable to resolve host gooddata-cn-metadata-api-headless

The afm-exec-api pods have errors that they cannot resolve gooddata-cn-calcique-headless. Attaching the pod logs for the relevant services. I've tried restarting the pods but the issue remains. Any suggestions as to what might be the cause or how to resolve?

gooddata-cn-calcique-65697c6f89-ngbdw_calcique.log gooddata-cn-result-cache-7f6bdbcf6f-h5s4w_result-cache.log gooddata-cn-metadata-api-84dc4d7b7d-9znld_metadata-api.log gooddata-cn-afm-exec-api-7fdf4899dc-plhm7_afm-exec-api.log

Marley Bross

10/31/2023, 8:52 PM

Hi @Pete Lorenz have you found a resolution for this issue? I’ve noticed today that I’m unable to spin up the gooddata container edition docker image and am running into the same problem with the healthcheck error.

Pete Lorenz

10/31/2023, 9:07 PM

Thanks for getting back, Marley. We still haven't found a resolution.

Boris

11/01/2023, 9:06 AM

Hi, we are looking into this

Robert Moucha

11/01/2023, 4:24 PM

"Unable to resolve host gooddata-cn-result-cache-headless" means that this service has no DNS records (no IP addresses). This happens when none of the pods belonging to the deployment gooddata-cn-result-cache are Ready. You sent a log from one of the result-cache pods (gooddata-cn-result-cache-7f6bdbcf6f-h5s4w_result-cache.log), but no errors are visible in it. According to the timestamps, the pod was recently restarted, so I assume it crashed some time ago. If both pods are repeatedly crashing, that would explain why the headless service doesn't serve any addresses. Please check the following:

1. How often the pods are restarting.

kubectl -n gooddata-cn get pod --selector app.kubernetes.io/component=resultCache

NAME                                       READY   STATUS    RESTARTS   AGE
gooddata-cn-result-cache-584b5d67b-287gw   1/1     Running   0          31h
gooddata-cn-result-cache-584b5d67b-w6t75   1/1     Running   0          31h

If the RESTARTS column is non-zero, it should also show when the last restart occurred.

2. Why the pod restarted. The command "kubectl describe pod --selector app.kubernetes.io/component=resultCache" will show details for both pods. If the pod crashed, you can see valuable details in the Events section, and also in the Containers/result-cache/Last State section. For example:

Containers:
  result-cache:
    Container ID:   containerd://1dd40a3efc53369eea5a45cb30f5f14538f827b4a263fb3e6fc606e332192e5b
    Image:          xxxx/sql-executor:VVVV
    Image ID:       xxxx/sql-executor@sha256:92901e08171c11f5a11f407a9d18a3b6a2c2703e52bb32bfcee6c9c26ea7ffcd
    Ports:          6567/TCP, 9040/TCP, 9041/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Wed, 01 Nov 2023 01:38:39 +0100    
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 01 Nov 2023 01:40:29 +0100
      Finished:     Wed, 01 Nov 2023 16:23:24 +0100
    Ready:          True
    Restart Count:  1

👍 1

Robert Moucha

11/01/2023, 4:26 PM

In the example above, the reason is OOMKilled, meaning the JVM process ran out of the memory reserved for it by Kubernetes (the container hit its memory limit). But there may be other reasons as well.
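
A side note on the "Exit Code: 137" in the example above: by the usual convention, an exit code above 128 means the process was terminated by a signal, and the signal number is the code minus 128. The sketch below only illustrates that arithmetic. The function name and the small signal table (Linux signal numbers) are illustrative and not part of any Kubernetes tooling.

```python
# Common Linux signal numbers (illustrative subset, not exhaustive).
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited successfully"
    if code > 128:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, "unknown signal")
        return f"killed by {name} (signal {signum})"
    return f"exited with error status {code}"


print(describe_exit_code(137))  # killed by SIGKILL (signal 9)
```

An OOM kill is delivered as SIGKILL, which is why OOMKilled and exit code 137 appear together in the describe output.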

Robert Moucha

11/01/2023, 4:29 PM

3. It may be useful to get the previous logs, showing what happened BEFORE the container was restarted. To get the previous logs, add -p to the kubectl logs command, like:

kubectl -n gooddata-cn logs -p gooddata-cn-result-cache-7f6bdbcf6f-h5s4w

Robert Moucha

11/01/2023, 4:31 PM

This is extremely useful if there was an issue with the application itself, e.g. to discover and fix Java OutOfMemoryError exceptions.

Pete Lorenz

11/01/2023, 4:44 PM

The pods do seem to be restarting frequently:

> kubectl -n gooddata-cn get pod --selector app.kubernetes.io/component=resultCache
NAME                                        READY   STATUS             RESTARTS         AGE
gooddata-cn-result-cache-7f6bdbcf6f-9hphl   0/1     CrashLoopBackOff   13 (83s ago)     51m
gooddata-cn-result-cache-7f6bdbcf6f-bh9mg   0/1     CrashLoopBackOff   72 (3m54s ago)   6h17m
>

The CrashLoopBackOff is new, however. We were not seeing this yesterday. The cause seems to be a startup probe failure. Attaching the result of "kubectl describe pod --selector app.kubernetes.io/component=resultCache -n gooddata-cn"

result-cache-describe-2023-11-01.txt

Pete Lorenz

11/01/2023, 4:49 PM

Attaching the results of kubectl logs -p for one of the result-cache pods

result-cache-logs-2023-11-01.txt

Pete Lorenz

11/01/2023, 4:50 PM

These errors seem new as of today. The situation seems a bit more unstable than yesterday.

Pete Lorenz

11/01/2023, 4:55 PM

One thing we observed yesterday is that none of our GD services have endpoints in our dev cluster:

kubectl get endpoints -n gooddata-cn
NAME                                    ENDPOINTS                            AGE
gooddata-cn-afm-exec-api                <none>                               54d
gooddata-cn-analytical-designer         <none>                               54d
gooddata-cn-api-gateway                 <none>                               54d
gooddata-cn-api-gateway-headless        <none>                               54d
gooddata-cn-apidocs                     <none>                               54d
gooddata-cn-auth-service                <none>                               54d
gooddata-cn-auth-service-headless       <none>                               54d
gooddata-cn-calcique-headless           <none>                               54d
gooddata-cn-dashboards                  <none>                               54d
gooddata-cn-export-controller           <none>                               54d
gooddata-cn-home-ui                     <none>                               54d
gooddata-cn-ldm-modeler                 <none>                               54d
gooddata-cn-measure-editor              <none>                               54d
gooddata-cn-metadata-api                <none>                               54d
gooddata-cn-metadata-api-headless       <none>                               54d
gooddata-cn-result-cache-headless       <none>                               54d
gooddata-cn-scan-model                  <none>                               54d
gooddata-cn-sql-executor-headless       <none>                               54d
gooddata-cn-tabular-exporter-headless   <none>                               54d
gooddata-cn-web-components              <none>                               54d
ingress-nginx-controller                10.163.145.73:443,10.163.145.73:80   54d
ingress-nginx-controller-admission      10.163.145.73:8443                   54d

Perhaps this is a symptom of the other issues but it's concerning to us that even the services that are running without errors do not have endpoints.

😱 1

Robert Moucha

11/01/2023, 5:00 PM

If NONE of the services have endpoints, this is definitely wrong. All these services should have 2 endpoints. Did you modify pod labels or service selectors somehow?

Robert Moucha

11/01/2023, 5:02 PM

Regarding the log, there's an ugly exception that happened during Pulsar client initialization. If the pod restart count is low, I would not care much about it.

Pete Lorenz

11/01/2023, 5:02 PM

Yes, we added an owner label "app.kubernetes.io/owner: gooddata-cn", which our ops team says is required to integrate with Dynatrace. However, we added the same label in our staging environment without issues.

Robert Moucha

11/01/2023, 5:08 PM

Adding labels is OK, but the service selector must match the pod labels. For example, the service has a selector for the following labels:

app.kubernetes.io/component
app.kubernetes.io/instance
app.kubernetes.io/name

The pods must carry all three of these labels, each with the same value the service expects. Otherwise, the service will not have any endpoints.

Pete Lorenz

11/01/2023, 5:10 PM

We did not change the selector labels and they are still present:

kubectl describe deployment -n gooddata-cn gooddata-cn-afm-exec-api
Name:                   gooddata-cn-afm-exec-api
Namespace:              gooddata-cn
CreationTimestamp:      Thu, 07 Sep 2023 21:52:12 +0000
Labels:                 app.kubernetes.io/component=afmExecApi
                        app.kubernetes.io/instance=gooddata-cn
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=gooddata-cn
                        app.kubernetes.io/owner=gooddata-cn
                        app.kubernetes.io/version=2.5.1
                        helm.sh/chart=gooddata-cn-2.5.1
                        objectset.rio.cattle.io/hash=6183899f57b94d51df0180e32d29a0b356e8a441

Robert Moucha

11/01/2023, 5:10 PM

Check using:

# service selector
kubectl get service -n  gooddata-cn gooddata-cn-result-cache-headless -o jsonpath='{.spec.selector}'

# deployment (pod template)
kubectl get deployments.apps -n  gooddata-cn gooddata-cn-result-cache  -o jsonpath='{.spec.template.metadata.labels}'

Robert Moucha

11/01/2023, 5:11 PM

Labels on the deployment are not important. Pod labels are important.

Pete Lorenz

11/01/2023, 5:11 PM

kubectl get service -n  gooddata-cn gooddata-cn-result-cache-headless -o jsonpath='{.spec.selector}'
{"app.kubernetes.io/component":"resultCache","app.kubernetes.io/instance":"gooddata-cn","app.kubernetes.io/name":"gooddata-cn","app.kubernetes.io/owner":"gooddata-cn"}

Robert Moucha

11/01/2023, 5:11 PM

So you added this extra label to the service selector, not to the service itself.

Pete Lorenz

11/01/2023, 5:12 PM

kubectl get service -n  gooddata-cn gooddata-cn-result-cache-headless -o jsonpath='{.spec.selector}'
{"app.kubernetes.io/component":"resultCache","app.kubernetes.io/instance":"gooddata-cn","app.kubernetes.io/name":"gooddata-cn","app.kubernetes.io/owner":"gooddata-cn"}

Robert Moucha

11/01/2023, 5:12 PM

can you please show the output of the 2nd command?

kubectl get deployments.apps -n  gooddata-cn gooddata-cn-result-cache  -o jsonpath='{.spec.template.metadata.labels}'

👍 1

Pete Lorenz

11/01/2023, 5:13 PM

We added it to _helpers.tpl:

{{/*
Common labels
*/}}
{{- define "gooddata-cn.labels" -}}
helm.sh/chart: {{ include "gooddata-cn.chart" . }}
app.kubernetes.io/owner: {{ include "gooddata-cn.name" . }}
{{ include "gooddata-cn.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end -}}

Pete Lorenz

11/01/2023, 5:13 PM

kubectl get deployments.apps -n  gooddata-cn gooddata-cn-result-cache  -o jsonpath='{.spec.template.metadata.labels}'
{"app.kubernetes.io/component":"resultCache","app.kubernetes.io/instance":"gooddata-cn","app.kubernetes.io/name":"gooddata-cn"}

Robert Moucha

11/01/2023, 5:14 PM

that's it - pod labels != service selector labels

👍 1
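
To make the mismatch concrete, here is a minimal sketch that compares the two jsonpath outputs posted above. It assumes plain equality-based selection, where a service selects a pod only if every key/value pair in its selector is present on the pod and extra pod labels are ignored. The function name is hypothetical; the label values are copied from the thread.

```python
# Service selector, as printed by the first jsonpath command in the thread.
service_selector = {
    "app.kubernetes.io/component": "resultCache",
    "app.kubernetes.io/instance": "gooddata-cn",
    "app.kubernetes.io/name": "gooddata-cn",
    "app.kubernetes.io/owner": "gooddata-cn",
}

# Pod template labels, as printed by the second jsonpath command.
pod_labels = {
    "app.kubernetes.io/component": "resultCache",
    "app.kubernetes.io/instance": "gooddata-cn",
    "app.kubernetes.io/name": "gooddata-cn",
}


def unmatched_selector_entries(selector: dict, labels: dict) -> dict:
    """Return the selector entries that the given labels do not satisfy."""
    return {key: value for key, value in selector.items() if labels.get(key) != value}


missing = unmatched_selector_entries(service_selector, pod_labels)
print(missing)  # {'app.kubernetes.io/owner': 'gooddata-cn'}
```

A non-empty result means the service selects no pods, which is consistent with the empty ENDPOINTS column shown earlier.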

Robert Moucha

11/01/2023, 5:18 PM

That still doesn't explain how the extra label got into the svc selector. You modified the "gooddata-cn.labels" template, but it is not used in selectors.

Pete Lorenz

11/01/2023, 5:20 PM

If I recall, we initially put the label in the wrong place; perhaps that's how it got into the service selectors. But it should have been removed from the service selectors when we redeployed.

Robert Moucha

11/01/2023, 5:21 PM

Apparently not. 😕

Pete Lorenz

11/01/2023, 5:22 PM

maybe we should redeploy

Robert Moucha

11/01/2023, 5:24 PM

Just remove the extra label from the selector (by any means) and the deployment should start working again.

👍 1
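
The thread does not say which method was used to remove the label. One possible way, sketched here as an assumption rather than as what was actually done, is a JSON Patch applied with "kubectl patch --type=json". The helper name below is hypothetical. The namespace, service name and label key are taken from the thread. Each affected service would need the same change, and a later Helm deploy could presumably reintroduce the label if the chart still renders it into the selector.

```python
import json
import shlex


def selector_removal_patch(label_key: str) -> str:
    """Build a JSON Patch (RFC 6902) that removes one key from .spec.selector.

    JSON Pointer (RFC 6901) escapes "~" as "~0" and "/" as "~1", so the
    slash inside the label key has to be escaped in the path.
    """
    escaped = label_key.replace("~", "~0").replace("/", "~1")
    return json.dumps([{"op": "remove", "path": f"/spec/selector/{escaped}"}])


patch = selector_removal_patch("app.kubernetes.io/owner")

# Assemble the command for one service; print it for review instead of running it.
command = [
    "kubectl", "-n", "gooddata-cn", "patch", "service",
    "gooddata-cn-result-cache-headless", "--type=json", "-p", patch,
]
print(shlex.join(command))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; the endpoints can be rechecked afterwards with the "kubectl get endpoints -n gooddata-cn" command used earlier in the thread.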

Pete Lorenz

11/01/2023, 5:24 PM

ok, will give it a try

Robert Moucha

11/01/2023, 5:25 PM

fine, let us know and have a nice day, Pete.

Pete Lorenz

11/01/2023, 5:25 PM

Thanks Robert, you too

Pete Lorenz

11/02/2023, 2:28 PM

Removing the extra service selector resolved the issue, thanks so much @Robert Moucha!

👍 1

✅ 1
