# gooddata-cn
We're seeing an issue in our dev cluster where the calcique and afm-exec-api pods (2 replicas) appear as running but unready. In the calcique logs, we see messages that the pods cannot resolve gooddata-cn-result-cache-headless:
"level":"ERROR","logger":"com.gooddata.tiger.grpc.healthcheck.GrpcHealthCheck","thread":"boundedElastic-2","traceId":"9e2839743de46645","spanId":"9e2839743de46645","msg":"Error during GRPC Healthcheck call","action":"grpcHealthCheck","exc":"io.grpc.StatusRuntimeException: UNAVAILABLE: Unable to resolve host gooddata-cn-result-cache-headless
And warnings that the host is not resolvable:
"level":"WARN","logger":"io.grpc.internal.ManagedChannelImpl","thread":"grpc-default-executor-4","msg":"[Channel<3>: (gooddata-cn-result-cache-headless:6567)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host gooddata-cn-result-cache-headless, cause=java.lang.RuntimeException: java.net.UnknownHostException: gooddata-cn-result-cache-headless: Name or service not known\n\tat io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
There are similar errors regarding gooddata-cn-metadata-api-headless:
"level":"ERROR","logger":"com.gooddata.tiger.grpc.healthcheck.GrpcHealthCheck","thread":"boundedElastic-1","traceId":"756f885e0c24d41d","spanId":"756f885e0c24d41d","msg":"Error during GRPC Healthcheck call","action":"grpcHealthCheck","exc":"io.grpc.StatusRuntimeException: UNAVAILABLE: Unable to resolve host gooddata-cn-metadata-api-headless
The afm-exec-api pods have errors that they cannot resolve gooddata-cn-calcique-headless. Attaching the pod logs for the relevant services. I've tried restarting the pods but the issue remains. Any suggestions as to what might be the cause or how to resolve?
Hi @Pete Lorenz have you found a resolution for this issue? I’ve noticed today that I’m unable to spin up the gooddata container edition docker image and am running into the same problem with the healthcheck error.
Thanks for getting back, Marley. We still haven't found a resolution.
Hi, we are looking into this
"Unable to resolve host gooddata-cn-result-cache-headless" means that this service has no DNS records (no IP addresses). It happens when none of the pods belonging to the result-cache deployment are Ready. You sent a log from one of the result-cache pods (gooddata-cn-result-cache-7f6bdbcf6f-h5s4w_result-cache.log), but no errors are visible in it. According to the timestamps, the pod was recently restarted, so I assume it crashed some time ago. If both pods are repeatedly crashing, it would explain why the headless service doesn't serve any addresses. Please check the following:
1. How often the pods are restarting:
kubectl -n gooddata-cn get pod --selector app.kubernetes.io/component=resultCache

NAME                                       READY   STATUS    RESTARTS   AGE
gooddata-cn-result-cache-584b5d67b-287gw   1/1     Running   0          31h
gooddata-cn-result-cache-584b5d67b-w6t75   1/1     Running   0          31h
If the RESTARTS column is non-zero, it also shows when the last restart occurred.
2. Why the pod restarted. The command
kubectl -n gooddata-cn describe pod --selector app.kubernetes.io/component=resultCache
will show details for both pods. If the pod crashed, you can see valuable details in the Events section, and also in the Containers/result-cache/Last State section. For example:
    Container ID:   containerd://1dd40a3efc53369eea5a45cb30f5f14538f827b4a263fb3e6fc606e332192e5b
    Image:          xxxx/sql-executor:VVVV
    Image ID:       xxxx/sql-executor@sha256:92901e08171c11f5a11f407a9d18a3b6a2c2703e52bb32bfcee6c9c26ea7ffcd
    Ports:          6567/TCP, 9040/TCP, 9041/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Wed, 01 Nov 2023 01:38:39 +0100    
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 01 Nov 2023 01:40:29 +0100
      Finished:     Wed, 01 Nov 2023 16:23:24 +0100
    Ready:          True
    Restart Count:  1
In the example above, the reason is OOMKilled, meaning the JVM process ran out of the memory reserved for it by Kubernetes. But there may be other reasons as well.
3. It may be useful to get the previous logs, which show what happened BEFORE the container was restarted. To get them, add -p to the kubectl logs command, like:
kubectl -n gooddata-cn logs -p gooddata-cn-result-cache-7f6bdbcf6f-h5s4w
This is extremely useful if there was an issue with the application itself, e.g. to discover and fix Java OutOfMemoryError exceptions.
The pods do seem to be restarting frequently:
> kubectl -n gooddata-cn get pod --selector app.kubernetes.io/component=resultCache
NAME                                        READY   STATUS             RESTARTS         AGE
gooddata-cn-result-cache-7f6bdbcf6f-9hphl   0/1     CrashLoopBackOff   13 (83s ago)     51m
gooddata-cn-result-cache-7f6bdbcf6f-bh9mg   0/1     CrashLoopBackOff   72 (3m54s ago)   6h17m
The CrashLoopBackOff is new however. We were not seeing this yesterday. The cause seems to be a startup probe failure. Attaching the result of "kubectl describe pod --selector app.kubernetes.io/component=resultCache -n gooddata-cn"
Attaching the results of kubectl logs -p for one of the result-cache pods
These errors seem new as of today. The situation seems a bit more unstable than yesterday.
One thing we observed yesterday is that none of our GD services have endpoints in our dev cluster:
kubectl get endpoints -n gooddata-cn
NAME                                    ENDPOINTS                            AGE
gooddata-cn-afm-exec-api                <none>                               54d
gooddata-cn-analytical-designer         <none>                               54d
gooddata-cn-api-gateway                 <none>                               54d
gooddata-cn-api-gateway-headless        <none>                               54d
gooddata-cn-apidocs                     <none>                               54d
gooddata-cn-auth-service                <none>                               54d
gooddata-cn-auth-service-headless       <none>                               54d
gooddata-cn-calcique-headless           <none>                               54d
gooddata-cn-dashboards                  <none>                               54d
gooddata-cn-export-controller           <none>                               54d
gooddata-cn-home-ui                     <none>                               54d
gooddata-cn-ldm-modeler                 <none>                               54d
gooddata-cn-measure-editor              <none>                               54d
gooddata-cn-metadata-api                <none>                               54d
gooddata-cn-metadata-api-headless       <none>                               54d
gooddata-cn-result-cache-headless       <none>                               54d
gooddata-cn-scan-model                  <none>                               54d
gooddata-cn-sql-executor-headless       <none>                               54d
gooddata-cn-tabular-exporter-headless   <none>                               54d
gooddata-cn-web-components              <none>                               54d
ingress-nginx-controller      ,   54d
ingress-nginx-controller-admission                   54d
Perhaps this is a symptom of the other issues but it's concerning to us that even the services that are running without errors do not have endpoints.
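A quick way to list only the services that have no endpoints is to filter the table output. This is a minimal sketch, not an official tool: the helper name services_without_endpoints is made up for illustration, and it assumes the default column layout of "kubectl get endpoints" (NAME, ENDPOINTS, AGE).

```shell
# Print the names of services whose ENDPOINTS column is "<none>".
# Reads the table printed by `kubectl get endpoints` on stdin.
# (Hypothetical helper; assumes the default NAME / ENDPOINTS / AGE columns.)
services_without_endpoints() {
  # NR > 1 skips the header row; $2 is the ENDPOINTS column.
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

# Usage against a live cluster (not run here):
#   kubectl get endpoints -n gooddata-cn | services_without_endpoints
```

If this prints every gooddata-cn service, the problem is likely common to all of them (for example a shared selector label) rather than specific to one crashing pod.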
If NONE of the services have endpoints, this is definitely wrong. All these services should have 2 endpoints. Did you modify pod labels or service selectors somehow?
Regarding the log, there's an ugly exception that happened during Pulsar client initialization. If the pod restart count is low, I would not care much about it.
Yes, we added an owner label "app.kubernetes.io/owner : gooddata-cn", which our ops team says is required to integrate with Dynatrace. However, we added the same label in our staging environment without issues.
Adding labels is OK, but the service selector must match the pod labels. For example: In the service, there's a selector for the following labels:
In the pods, all three of these labels must be present and must have the same values the service expects. Otherwise, the service will not have any endpoints.
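To make the matching rule concrete: a pod is selected only if every key=value pair in the service selector also appears on the pod, while extra labels on the pod are harmless. The sketch below only models that rule in plain shell. The function name selector_matches and the sample label values are illustrative assumptions, not output from this cluster.

```shell
# Exit status 0 when every key=value pair of the selector is present in the
# pod labels, 1 otherwise. Both arguments are comma-separated "key=value" lists.
# The body runs in a subshell, so the IFS and set -f changes stay local.
selector_matches() (
  selector=$1 labels=$2
  set -f        # no filename expansion while splitting
  IFS=','       # split the selector on commas
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;     # this selector pair is on the pod
      *) exit 1 ;;        # missing key or different value: no match
    esac
  done
  exit 0
)

# Illustrative values: the selector carries an extra owner label
# that the pod template does not have.
svc_selector='app.kubernetes.io/component=resultCache,app.kubernetes.io/owner=gooddata-cn'
pod_labels='app.kubernetes.io/component=resultCache,pod-template-hash=7f6bdbcf6f'

if selector_matches "$svc_selector" "$pod_labels"; then
  echo "pod selected"
else
  echo "pod NOT selected"   # printed here: the owner label is missing on the pod
fi
```

The reverse case (an extra label on the pod only) still matches, which is why adding labels to pods is safe but adding them to a selector is not.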
We did not change the selector labels and they are still present:
kubectl describe deployment -n gooddata-cn gooddata-cn-afm-exec-api
Name:                   gooddata-cn-afm-exec-api
Namespace:              gooddata-cn
CreationTimestamp:      Thu, 07 Sep 2023 21:52:12 +0000
Labels:                 app.kubernetes.io/component=afmExecApi
Check using:
# service selector
kubectl get service -n  gooddata-cn gooddata-cn-result-cache-headless -o jsonpath='{.spec.selector}'

# deployment (pod template)
kubectl get deployments.apps -n  gooddata-cn gooddata-cn-result-cache  -o jsonpath='{.spec.template.metadata.labels}'
labels on deployment are not important. pod labels are important
kubectl get service -n  gooddata-cn gooddata-cn-result-cache-headless -o jsonpath='{.spec.selector}'
so you added this extra label to service selector, not to service itself
kubectl get service -n  gooddata-cn gooddata-cn-result-cache-headless -o jsonpath='{.spec.selector}'
can you please show the output of the 2nd command?
kubectl get deployments.apps -n  gooddata-cn gooddata-cn-result-cache  -o jsonpath='{.spec.template.metadata.labels}'
We added it to _helpers.tpl:
Common labels
{{- define "gooddata-cn.labels" -}}
helm.sh/chart: {{ include "gooddata-cn.chart" . }}
app.kubernetes.io/owner: {{ include "gooddata-cn.name" . }}
{{ include "gooddata-cn.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end -}}
kubectl get deployments.apps -n  gooddata-cn gooddata-cn-result-cache  -o jsonpath='{.spec.template.metadata.labels}'
that's it - pod labels != service selector labels
It still doesn't explain how the extra label got into the svc selector. You modified the "gooddata-cn.labels" template, but it is not used in selectors.
If I recall, we initially put the label in the wrong place, perhaps that's how it got in the service selectors. But it should have been removed from the service selectors when we redeployed.
Apparently not. 😕
maybe we should redeploy
just remove the extra label from selector (by any means) and the deployment should start working again
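One possible way to do that without redeploying is a JSON Patch via kubectl patch. This is only a sketch: it assumes the stray selector key is app.kubernetes.io/owner (the label discussed above), and the helper name selector_remove_patch is made up for illustration. The kubectl call itself is left commented out so nothing is changed by accident.

```shell
# Build a JSON Patch document that removes one key from .spec.selector.
# JSON Pointer (RFC 6901) requires "~" to be written as "~0" and "/" as "~1",
# so "app.kubernetes.io/owner" becomes "app.kubernetes.io~1owner" in the path.
# (Hypothetical helper name.)
selector_remove_patch() {
  escaped=$(printf '%s' "$1" | sed -e 's/~/~0/g' -e 's,/,~1,g')
  printf '[{"op":"remove","path":"/spec/selector/%s"}]\n' "$escaped"
}

# Show the patch that would be sent:
selector_remove_patch app.kubernetes.io/owner

# Apply it to one service (assumed key and service name; check the selector first):
#   kubectl -n gooddata-cn patch service gooddata-cn-result-cache-headless \
#     --type=json -p "$(selector_remove_patch app.kubernetes.io/owner)"
```

The same patch would have to be applied to each affected service, and the endpoints can be re-checked afterwards with "kubectl get endpoints -n gooddata-cn".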
ok, will give it a try
fine, let us know and have a nice day, Pete.
Thanks Robert, you too
Removing the extra service selector resolved the issue, thanks so much @Robert Moucha!
