# gooddata-cn
p
We're seeing an issue in our dev cluster where the calcique and afm-exec-api pods (2 replicas) appear as running but unready. In the calcique logs, we see messages that the pods cannot resolve gooddata-cn-result-cache-headless:
"level":"ERROR","logger":"com.gooddata.tiger.grpc.healthcheck.GrpcHealthCheck","thread":"boundedElastic-2","traceId":"9e2839743de46645","spanId":"9e2839743de46645","msg":"Error during GRPC Healthcheck call","action":"grpcHealthCheck","exc":"io.grpc.StatusRuntimeException: UNAVAILABLE: Unable to resolve host gooddata-cn-result-cache-headless
And warnings that the host is not resolvable:
"level":"WARN","logger":"io.grpc.internal.ManagedChannelImpl","thread":"grpc-default-executor-4","msg":"[Channel<3>: (gooddata-cn-result-cache-headless:6567)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host gooddata-cn-result-cache-headless, cause=java.lang.RuntimeException: java.net.UnknownHostException: gooddata-cn-result-cache-headless: Name or service not known\n\tat io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
There are similar errors regarding gooddata-cn-metadata-api-headless:
"level":"ERROR","logger":"com.gooddata.tiger.grpc.healthcheck.GrpcHealthCheck","thread":"boundedElastic-1","traceId":"756f885e0c24d41d","spanId":"756f885e0c24d41d","msg":"Error during GRPC Healthcheck call","action":"grpcHealthCheck","exc":"io.grpc.StatusRuntimeException: UNAVAILABLE: Unable to resolve host gooddata-cn-metadata-api-headless
The afm-exec-api pods have errors that they cannot resolve gooddata-cn-calcique-headless. Attaching the pod logs for the relevant services. I've tried restarting the pods but the issue remains. Any suggestions as to what might be the cause or how to resolve?
m
Hi @Pete Lorenz have you found a resolution for this issue? I’ve noticed today that I’m unable to spin up the gooddata container edition docker image and am running into the same problem with the healthcheck error.
p
Thanks for getting back, Marley. We still haven't found a resolution.
b
Hi, we are looking into this
r
`Unable to resolve host gooddata-cn-result-cache-headless` means that this service has no `A` records (no IP addresses). It happens when none of the pods belonging to the deployment `gooddata-cn-result-cache` are Ready. You sent a log from one of the result-cache pods (gooddata-cn-result-cache-7f6bdbcf6f-h5s4w_result-cache.log), but there are no errors visible in it. According to the timestamps, the pod was recently restarted, so I assume it crashed some time ago. If both pods are repeatedly crashing, it would explain why the headless service doesn't serve any addresses. Please check the following:
1. How often the pods are restarting:
kubectl -n gooddata-cn get pod --selector app.kubernetes.io/component=resultCache

NAME                                       READY   STATUS    RESTARTS   AGE
gooddata-cn-result-cache-584b5d67b-287gw   1/1     Running   0          31h
gooddata-cn-result-cache-584b5d67b-w6t75   1/1     Running   0          31h
If the `RESTARTS` column is non-zero, it also shows when the last restart occurred.
2. Why the pod restarted. Running `kubectl -n gooddata-cn describe pod --selector app.kubernetes.io/component=resultCache` will show details for both pods. If a pod crashed, you can find valuable details in the Events section, and also in the Containers/result-cache/Last State section. For example:
Containers:
  result-cache:
    Container ID:   containerd://1dd40a3efc53369eea5a45cb30f5f14538f827b4a263fb3e6fc606e332192e5b
    Image:          xxxx/sql-executor:VVVV
    Image ID:       xxxx/sql-executor@sha256:92901e08171c11f5a11f407a9d18a3b6a2c2703e52bb32bfcee6c9c26ea7ffcd
    Ports:          6567/TCP, 9040/TCP, 9041/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Wed, 01 Nov 2023 01:38:39 +0100    
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 01 Nov 2023 01:40:29 +0100
      Finished:     Wed, 01 Nov 2023 16:23:24 +0100
    Ready:          True
    Restart Count:  1
In the example above, the reason is `OOMKilled`, meaning the JVM process exceeded the memory limit Kubernetes set for the container. But there may be other reasons as well.
3. It may be useful to get the previous logs, showing what happened BEFORE the container was restarted. To get them, add `-p` to the `kubectl logs` command, like:
kubectl -n gooddata-cn logs -p gooddata-cn-result-cache-7f6bdbcf6f-h5s4w
This is extremely useful if there was an issue with the application itself, e.g. to discover and fix Java OutOfMemoryError exceptions.
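Going back to the `Exit Code: 137` in the describe output above: by the usual convention, an exit code above 128 means the process was terminated by signal number (code - 128), and 137 - 128 = 9 is SIGKILL, which is what the kernel OOM killer sends. A minimal Python sketch of that arithmetic, assuming a Linux host for the signal names (the helper name `describe_exit_code` is made up for illustration):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable hint."""
    if code > 128:
        # Convention: 128 + N means "terminated by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(137))
```

An exit code of 137 on its own only says the process received SIGKILL; it is the `Reason: OOMKilled` field that ties it to the memory limit.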
p
The pods do seem to be restarting frequently:
kubectl -n gooddata-cn get pod --selector app.kubernetes.io/component=resultCache
NAME                                        READY   STATUS             RESTARTS         AGE
gooddata-cn-result-cache-7f6bdbcf6f-9hphl   0/1     CrashLoopBackOff   13 (83s ago)     51m
gooddata-cn-result-cache-7f6bdbcf6f-bh9mg   0/1     CrashLoopBackOff   72 (3m54s ago)   6h17m
The CrashLoopBackOff is new, however. We were not seeing this yesterday. The cause seems to be a startup probe failure. Attaching the output of `kubectl describe pod --selector app.kubernetes.io/component=resultCache -n gooddata-cn`.
Attaching the results of kubectl logs -p for one of the result-cache pods
These errors seem new as of today. The situation seems a bit more unstable than yesterday.
One thing we observed yesterday is that none of our GD services have endpoints in our dev cluster:
kubectl get endpoints -n gooddata-cn
NAME                                    ENDPOINTS                            AGE
gooddata-cn-afm-exec-api                <none>                               54d
gooddata-cn-analytical-designer         <none>                               54d
gooddata-cn-api-gateway                 <none>                               54d
gooddata-cn-api-gateway-headless        <none>                               54d
gooddata-cn-apidocs                     <none>                               54d
gooddata-cn-auth-service                <none>                               54d
gooddata-cn-auth-service-headless       <none>                               54d
gooddata-cn-calcique-headless           <none>                               54d
gooddata-cn-dashboards                  <none>                               54d
gooddata-cn-export-controller           <none>                               54d
gooddata-cn-home-ui                     <none>                               54d
gooddata-cn-ldm-modeler                 <none>                               54d
gooddata-cn-measure-editor              <none>                               54d
gooddata-cn-metadata-api                <none>                               54d
gooddata-cn-metadata-api-headless       <none>                               54d
gooddata-cn-result-cache-headless       <none>                               54d
gooddata-cn-scan-model                  <none>                               54d
gooddata-cn-sql-executor-headless       <none>                               54d
gooddata-cn-tabular-exporter-headless   <none>                               54d
gooddata-cn-web-components              <none>                               54d
ingress-nginx-controller                10.163.145.73:443,10.163.145.73:80   54d
ingress-nginx-controller-admission      10.163.145.73:8443                   54d
Perhaps this is a symptom of the other issues but it's concerning to us that even the services that are running without errors do not have endpoints.
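For anyone who wants to check a listing like the one above without reading it row by row, here is a rough Python sketch that picks out the services whose ENDPOINTS column is `<none>`. It parses the plain-text table, which is an assumption about the output layout; asking kubectl for JSON output would be more robust. The function name is made up for illustration, and the sample is two rows copied from the listing above.

```python
def services_without_endpoints(listing: str) -> list[str]:
    """Return names from `kubectl get endpoints` text output whose ENDPOINTS column is <none>."""
    names = []
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        columns = line.split()
        if len(columns) >= 2 and columns[1] == "<none>":
            names.append(columns[0])
    return names


sample = """\
NAME                                    ENDPOINTS                            AGE
gooddata-cn-result-cache-headless       <none>                               54d
ingress-nginx-controller                10.163.145.73:443,10.163.145.73:80   54d
"""
print(services_without_endpoints(sample))
```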
r
If NONE of the services have endpoints, this is definitely wrong. All these services should have 2 endpoints. Did you modify pod labels or service selectors somehow?
Regarding the log, there's an ugly exception that happened during Pulsar client initialization. If the pod restart count is low, I would not worry much about it.
p
Yes, we added an owner label `app.kubernetes.io/owner: gooddata-cn`, which our ops team says is required to integrate with Dynatrace. However, we added the same label in our staging environment without issues.
r
Adding labels is OK, but the service selector must match the pod labels. For example, the service has a selector for the following labels:
app.kubernetes.io/component
app.kubernetes.io/instance
app.kubernetes.io/name
The pods must carry all three of these labels, with the same values the service expects. Otherwise, the service will not have any endpoints.
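To illustrate the matching rule: an equality-based service selector picks a pod only if every key in the selector is present in the pod's labels with the same value, while extra labels on the pod are ignored. A minimal Python sketch of that rule (the function name and the sample label values are illustrative, not taken from the chart):

```python
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Return True if every selector key/value pair is present in the pod labels.

    Assumes a non-empty, equality-based selector.
    """
    return all(pod_labels.get(key) == value for key, value in selector.items())


selector = {
    "app.kubernetes.io/component": "resultCache",
    "app.kubernetes.io/instance": "gooddata-cn",
    "app.kubernetes.io/name": "gooddata-cn",
}

# A pod with extra labels still matches: extra pod labels are ignored.
pod_ok = {**selector, "pod-template-hash": "584b5d67b"}

# A pod missing one of the selected labels does not match.
pod_missing = {k: v for k, v in selector.items() if k != "app.kubernetes.io/name"}

print(selector_matches(selector, pod_ok))       # True
print(selector_matches(selector, pod_missing))  # False
```

The asymmetry matters: an extra label on the pod is harmless, but an extra key in the selector excludes every pod that lacks it.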
p
We did not change the selector labels and they are still present:
kubectl describe deployment -n gooddata-cn gooddata-cn-afm-exec-api
Name:                   gooddata-cn-afm-exec-api
Namespace:              gooddata-cn
CreationTimestamp:      Thu, 07 Sep 2023 21:52:12 +0000
Labels:                 app.kubernetes.io/component=afmExecApi
                        app.kubernetes.io/instance=gooddata-cn
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=gooddata-cn
                        app.kubernetes.io/owner=gooddata-cn
                        app.kubernetes.io/version=2.5.1
                        helm.sh/chart=gooddata-cn-2.5.1
                        objectset.rio.cattle.io/hash=6183899f57b94d51df0180e32d29a0b356e8a441
r
Check using:
# service selector
kubectl get service -n  gooddata-cn gooddata-cn-result-cache-headless -o jsonpath='{.spec.selector}'

# deployment (pod template)
kubectl get deployments.apps -n  gooddata-cn gooddata-cn-result-cache  -o jsonpath='{.spec.template.metadata.labels}'
Labels on the deployment itself are not important. The pod labels (from the pod template) are what matter.
p
kubectl get service -n  gooddata-cn gooddata-cn-result-cache-headless -o jsonpath='{.spec.selector}'
{"app.kubernetes.io/component":"resultCache","app.kubernetes.io/instance":"gooddata-cn","app.kubernetes.io/name":"gooddata-cn","app.kubernetes.io/owner":"gooddata-cn"}
r
So you added this extra label to the service selector, not to the service itself.
can you please show the output of the 2nd command?
kubectl get deployments.apps -n  gooddata-cn gooddata-cn-result-cache  -o jsonpath='{.spec.template.metadata.labels}'
p
We added it to _helpers.tpl:
{{/*
Common labels
*/}}
{{- define "gooddata-cn.labels" -}}
helm.sh/chart: {{ include "gooddata-cn.chart" . }}
app.kubernetes.io/owner: {{ include "gooddata-cn.name" . }}
{{ include "gooddata-cn.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end -}}
kubectl get deployments.apps -n  gooddata-cn gooddata-cn-result-cache  -o jsonpath='{.spec.template.metadata.labels}'
{"app.kubernetes.io/component":"resultCache","app.kubernetes.io/instance":"gooddata-cn","app.kubernetes.io/name":"gooddata-cn"}
r
that's it - pod labels != service selector labels
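For the record, the two outputs pasted above can be compared mechanically. A small Python sketch (the helper name `unmatched_selector_keys` is made up here; the two JSON strings are the selector and pod template labels from the outputs above) that lists the selector entries the pod template does not satisfy:

```python
import json

# Output of the service selector query from the thread.
service_selector = json.loads(
    '{"app.kubernetes.io/component":"resultCache",'
    '"app.kubernetes.io/instance":"gooddata-cn",'
    '"app.kubernetes.io/name":"gooddata-cn",'
    '"app.kubernetes.io/owner":"gooddata-cn"}'
)
# Output of the deployment pod template labels query from the thread.
pod_template_labels = json.loads(
    '{"app.kubernetes.io/component":"resultCache",'
    '"app.kubernetes.io/instance":"gooddata-cn",'
    '"app.kubernetes.io/name":"gooddata-cn"}'
)


def unmatched_selector_keys(selector, labels):
    """Selector keys whose value is missing from, or different in, the labels."""
    return sorted(k for k, v in selector.items() if labels.get(k) != v)


print(unmatched_selector_keys(service_selector, pod_template_labels))
# ['app.kubernetes.io/owner']
```

Any non-empty result means the service selects no pods created from that template.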
It still doesn't explain how the extra label got into the service selector. You modified the "gooddata-cn.labels" template, but it is not used in selectors.
p
If I recall, we initially put the label in the wrong place; perhaps that's how it got into the service selectors. But it should have been removed from the service selectors when we redeployed.
r
Apparently not. 😕
p
maybe we should redeploy
r
Just remove the extra label from the selector (by any means) and the deployment should start working again.
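One possible way to do the removal without a full redeploy is a JSON Patch against each affected service. The Python sketch below only builds the patch document and does not talk to the cluster; the helper name is hypothetical. Note that `/` inside a key has to be escaped as `~1` in a JSON Pointer path (RFC 6901):

```python
import json


def selector_removal_patch(extra_keys):
    """Build JSON Patch operations that remove the given keys from .spec.selector."""
    ops = []
    for key in extra_keys:
        # JSON Pointer escaping: "~" -> "~0" first, then "/" -> "~1".
        escaped = key.replace("~", "~0").replace("/", "~1")
        ops.append({"op": "remove", "path": f"/spec/selector/{escaped}"})
    return ops


patch = selector_removal_patch(["app.kubernetes.io/owner"])
print(json.dumps(patch))
```

The printed document could then be passed to something like `kubectl -n gooddata-cn patch service gooddata-cn-result-cache-headless --type=json -p '<patch>'`, repeated for each service with the extra selector key. Treat that as a suggestion to verify against your own cluster and deployment tooling rather than a confirmed procedure.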
p
ok, will give it a try
r
fine, let us know and have a nice day, Pete.
p
Thanks Robert, you too
Removing the extra service selector resolved the issue, thanks so much @Robert Moucha!