We are seeing a lot of errors in afm-exec-api pod aft...
# gooddata-cn
k
We are seeing a lot of errors in the afm-exec-api pod after the 2.3.2 upgrade. Here are the logs from one execution.
ERROR 2023-08-21T14:17:51.234126113Z [resource.labels.containerName: afm-exec-api] {"action":"grpcClientCall", "exc":"errorType=com.gooddata.tiger.grpc.error.GrpcPropagatedServerException, message=UNAVAILABLE: Network closed for unknown reason,<no detail> at com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertFromUnknownException(ErrorPropagation.kt:222) Suppressed: The stacktrace has been enhanced by Reactor, refer to additional information below: Error has been observed at the following site(s): *__checkpoint ⇢ Handler com.gooddata.tiger.afm.controller.ExecuteController#processAfmRequest(String, AfmExecution, boolean, String, ServerHttpRequest, Continuation) [DispatcherHandler] *__checkpoint ⇢ com.gooddata.tiger.tracing.TracingWebFilter [DefaultWebFilterChain] *__checkpoint ⇢ com.gooddata.tiger.oapi.validator.OpenApiRequestValidatorFilter [DefaultWebFilterChain] *__checkpoint ⇢ com.gooddata.tiger.license.reactive.LicenseCheckFilter [DefaultWebFilterChain] Original Stack Trace: at com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertFromUnknownException(ErrorPropagation.kt:222) at com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertToTransferableException(ErrorPropagation.kt:208) at com.gooddata.tiger.grpc.error.ErrorPropagationKt.clientCatching(ErrorPropagation.kt:105) at com.gooddata.tiger.grpc.error.ErrorPropagationKt$clientCatching$1.invokeSuspend(ErrorPropagation.kt) at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) at kotlinx.coroutines.DispatchedTaskKt.resume(DispatchedTask.kt:175) at kotlinx.coroutines.DispatchedTaskKt.resumeUnconfined(DispatchedTask.kt:137) at kotlinx.coroutines.DispatchedTaskKt.dispatch(DispatchedTask.kt:108) at kotlinx.coroutines.CancellableContinuationImpl.dispatchResume(CancellableContinuationImpl.kt:308) at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl(CancellableContinuationImpl.kt:318) at 
kotlinx.coroutines.CancellableContinuationImpl.resumeWith(CancellableContinuationImpl.kt:250) at com.github.marcoferrer.krotoplus.coroutines.client.SuspendingUnaryObserver.onError(SuspendingUnaryObserver.kt:34) at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) at brave.grpc.TracingClientInterceptor$TracingClientCallListener.onClose(TracingClientInterceptor.java:202) at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) at io.grpc.ForwardingClien…
{ "insertId": "s9fskk7574qmboxx", "jsonPayload": { "ts": "2023-08-21 14:17:51.225", "userId": "", "spanId": "c00fc674d4a6e2c4", "level": "ERROR", "thread": "grpc-default-executor-221", "exc": "errorType=com.gooddata.tiger.grpc.error.GrpcPropagatedServerException, message=UNAVAILABLE: Network closed for unknown reason,<no detail>\n\tat com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertFromUnknownException(ErrorPropagation.kt:222)\n\tSuppressed: The stacktrace has been enhanced by Reactor, refer to additional information below: \nError has been observed at the following site(s):\n\t*__checkpoint ⇢ Handler com.gooddata.tiger.afm.controller.ExecuteController#processAfmRequest(String, AfmExecution, boolean, String, ServerHttpRequest, Continuation) [DispatcherHandler]\n\t*__checkpoint ⇢ com.gooddata.tiger.tracing.TracingWebFilter [DefaultWebFilterChain]\n\t*__checkpoint ⇢ com.gooddata.tiger.oapi.validator.OpenApiRequestValidatorFilter [DefaultWebFilterChain]\n\t*__checkpoint ⇢ com.gooddata.tiger.license.reactive.LicenseCheckFilter [DefaultWebFilterChain]\nOriginal Stack Trace:\n\t\tat com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertFromUnknownException(ErrorPropagation.kt:222)\n\t\tat com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertToTransferableException(ErrorPropagation.kt:208)\n\t\tat com.gooddata.tiger.grpc.error.ErrorPropagationKt.clientCatching(ErrorPropagation.kt:105)\n\t\tat com.gooddata.tiger.grpc.error.ErrorPropagationKt$clientCatching$1.invokeSuspend(ErrorPropagation.kt)\n\t\tat kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)\n\t\tat kotlinx.coroutines.DispatchedTaskKt.resume(DispatchedTask.kt:175)\n\t\tat kotlinx.coroutines.DispatchedTaskKt.resumeUnconfined(DispatchedTask.kt:137)\n\t\tat kotlinx.coroutines.DispatchedTaskKt.dispatch(DispatchedTask.kt:108)\n\t\tat kotlinx.coroutines.CancellableContinuationImpl.dispatchResume(CancellableContinuationImpl.kt:308)\n\t\tat 
kotlinx.coroutines.CancellableContinuationImpl.resumeImpl(CancellableContinuationImpl.kt:318)\n\t\tat kotlinx.coroutines.CancellableContinuationImpl.resumeWith(CancellableContinuationImpl.kt:250)\n\t\tat com.github.marcoferrer.krotoplus.coroutines.client.SuspendingUnaryObserver.onError(SuspendingUnaryObserver.kt:34)\n\t\tat io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)\n\t\tat brave.grpc.TracingClientInterceptor$TracingClientCallListener.onClose(TracingClientInterceptor.java:202)\n\t\tat io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)\n\t\tat io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)\n\t\tat io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)\n\t\tat net.devh.boot.grpc.client.metric.MetricCollectingClientCallListener.onClose(MetricCollectingClientCallListener.java:59)\n\t\tat io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)\n\t\tat io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)\n\t\tat io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)\n\t\tat io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)\n\t\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\t\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)\n\t\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\t\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\t\tat java.base/java.lang.Thread.run(Unknown Source)\n", "logger": "com.gooddata.tiger.grpc.client.calcique.CalciqueClient", "traceId": "c00fc674d4a6e2c4", "action": "grpcClientCall", "msg": "Failed to start computation", "orgId": "9b94409d-ecd2-4dba-926f-e607b29e488d" }, "resource": { "type": "k8s_container", 
"labels": { "location": "us-west4", "container_name": "afm-exec-api", "namespace_name": "gooddata-cn", "project_id": "", "pod_name": "gooddata-cn-afm-exec-api-65555b869f-h4f7c", "cluster_name": "primary" } }, "timestamp": "2023-08-21T14:17:51.234126113Z", "severity": "ERROR", "labels": { "k8s-pod/app_kubernetes_io/name": "gooddata-cn", "k8s-pod/app_kubernetes_io/component": "afmExecApi", "k8s-pod/pod-template-hash": "65555b869f", "k8s-pod/app_kubernetes_io/instance": "gooddata-cn", "compute.googleapis.com/resource_name": "gke-primary-prd-irc-gd-1-9f76-8190b555-s2iz" }, "logName": "projects/logs/stdout", "receiveTimestamp": "2023-08-21T14:17:52.043475567Z" }
Around that time we also see another error in the gooddata-cn-visual-exporter pod log. Are these two related?
ERROR 2023-08-21T14:17:37.404396100Z [resource.labels.containerName: chromium] [0821/141737.404269:ERROR:zygote_host_impl_linux.cc(272)] Failed to adjust OOM score of renderer with pid 348521: Permission denied (13)
{ "textPayload": "[0821/141737.404269:ERROR:zygote_host_impl_linux.cc(272)] Failed to adjust OOM score of renderer with pid 348521: Permission denied (13)", "insertId": "2f8v1e2006xqipc2", "resource": { "type": "k8s_container", "labels": { "namespace_name": "gooddata-cn", "pod_name": "gooddata-cn-visual-exporter-service-695ff86d54-9jdhh", "project_id": "", "container_name": "chromium", "location": "", "cluster_name": "primary" } }, "timestamp": "2023-08-21T14:17:37.404396100Z", "severity": "ERROR", "labels": { "compute.googleapis.com/resource_name": "", "k8s-pod/app_kubernetes_io/name": "gooddata-cn", "k8s-pod/pod-template-hash": "695ff86d54", "k8s-pod/app_kubernetes_io/component": "visualExporterService", "k8s-pod/app_kubernetes_io/instance": "gooddata-cn" }, "logName": "projects//logs/stderr", "receiveTimestamp": "2023-08-21T14:17:41.194601935Z" }
j
Hi, the second error looks like an access-related issue. Has anything else changed in your setup aside from the upgrade? Also, the latest version is GoodData CN 2.4.0.
r
Errors like "Failed to adjust OOM score of renderer" are harmless. They originate from the chromium container, which attempts to modify oom_score_adj, but that is not possible inside the container. Chromium works fine even without this setting. Can you please check the logs from the calcique pods? The real error happens there.
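To keep the harmless Chromium message from hiding the real errors while going through the logs, something like the following could be used. This is only a minimal sketch, assuming the logs are exported as JSON entries shaped like the ones pasted in this thread (the "severity", "textPayload" and "resource.labels.container_name" fields come from those entries). The function names and the trimmed sample entries are made up for illustration.

```python
from typing import Any

# Message fragments treated as known-harmless noise (taken from this thread).
HARMLESS_FRAGMENTS = ("Failed to adjust OOM score of renderer",)


def is_harmless(entry: dict[str, Any]) -> bool:
    """True for the known-harmless message emitted by the chromium container."""
    labels = entry.get("resource", {}).get("labels", {})
    text = entry.get("textPayload", "")
    return labels.get("container_name") == "chromium" and any(
        fragment in text for fragment in HARMLESS_FRAGMENTS
    )


def actionable_errors(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep ERROR entries that are not on the harmless list."""
    return [
        entry
        for entry in entries
        if entry.get("severity") == "ERROR" and not is_harmless(entry)
    ]


# Two entries trimmed down from the logs in this thread (illustration only).
sample_entries = [
    {
        "severity": "ERROR",
        "textPayload": "Failed to adjust OOM score of renderer with pid 348521: Permission denied (13)",
        "resource": {"labels": {"container_name": "chromium"}},
    },
    {
        "severity": "ERROR",
        "jsonPayload": {"msg": "Failed to start computation"},
        "resource": {"labels": {"container_name": "afm-exec-api"}},
    },
]

for entry in actionable_errors(sample_entries):
    container = entry["resource"]["labels"]["container_name"]
    message = entry.get("jsonPayload", {}).get("msg", entry.get("textPayload", ""))
    print(container, message)
```

With the two sample entries, only the afm-exec-api record is left, which is the one worth following into the calcique logs.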
k
Thanks @Joseph Heun and @Robert Moucha. Here are the logs for a traceId.
r
Thank you, unfortunately the log snippet doesn't contain the relevant records. Are there any other records in the "gooddata-cn-calcique-b7b7b9667-ll6w8" pod close to the timestamp "2023-08-21 14:17:51"? The "Network closed for unknown reason" message suggests that the pod abruptly stopped working or was restarted. Can you check whether that is the case? If the problem persists and is reproducible, you can watch for container restart events to identify the root cause. The usual suspect is insufficient memory assigned to the JVM or to the container. I also recommend upgrading to 2.4.0, which substantially decreased memory requirements for some types of operations running in the Calcique service.
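One way to check for restarts is to look at the pod status that Kubernetes reports. The sketch below assumes the pod description was saved as JSON (for example with "kubectl get pod gooddata-cn-calcique-b7b7b9667-ll6w8 -n gooddata-cn -o json") and reads the standard "status.containerStatuses" fields. The function name, the container name "calcique" and all sample values are hypothetical and not taken from your cluster.

```python
import json
from typing import Any


def find_restarted_containers(pod: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one record per container that has restarted at least once."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        if status.get("restartCount", 0) == 0:
            continue
        # lastState.terminated describes how the previous instance ended.
        terminated = status.get("lastState", {}).get("terminated", {})
        findings.append(
            {
                "container": status.get("name"),
                "restarts": status["restartCount"],
                "reason": terminated.get("reason"),
                "exit_code": terminated.get("exitCode"),
                "finished_at": terminated.get("finishedAt"),
            }
        )
    return findings


# Hypothetical pod status, made up to show the shape of the data.
sample_pod = json.loads(
    """
    {
      "status": {
        "containerStatuses": [
          {
            "name": "calcique",
            "restartCount": 2,
            "lastState": {
              "terminated": {
                "reason": "OOMKilled",
                "exitCode": 137,
                "finishedAt": "2023-08-21T14:17:50Z"
              }
            }
          }
        ]
      }
    }
    """
)

for finding in find_restarted_containers(sample_pod):
    print(finding)
```

A reason of "OOMKilled" in the last terminated state would point to the container memory limit. If the container restarted for a different reason, the JVM memory settings and the logs of the previous container instance are the next place to look.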
k
Hi @Robert Moucha, thanks for the quick response. We will apply these changes today and will get back to you with our findings in the next 2 days.