# gooddata-cn
p
We're sometimes seeing this exception in our metadata-api logs:
errorType=com.gooddata.tiger.common.exception.NotFoundException, message=No organization found for hostname 10.163.131.200
	at com.gooddata.tiger.grpc.error.ExceptionsKt.buildClientException(Exceptions.kt:34)
	at com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertFromKnownException(ErrorPropagation.kt:244)
	at com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertToTransferableException(ErrorPropagation.kt:210)
	at com.gooddata.tiger.grpc.error.ErrorPropagationKt.clientCatching(ErrorPropagation.kt:105)
	at com.gooddata.tiger.grpc.error.ErrorPropagationKt$clientCatching$1.invokeSuspend(ErrorPropagation.kt)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:55)
	at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:274)
	at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:84)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
	at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
	at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
	at com.gooddata.tiger.apigateway.cors.OrganizationCorsConfigurationSource.organizationAllowedOrigins(OrganizationCorsConfigurationSource.kt:67)
	at com.gooddata.tiger.apigateway.cors.OrganizationCorsConfigurationSource.handleNonNullHostname(OrganizationCorsConfigurationSource.kt:41)
	at com.gooddata.tiger.apigateway.cors.OrganizationCorsConfigurationSource.getCorsConfiguration(OrganizationCorsConfigurationSource.kt:33)
However, the organization we're calling against exists. Any insights into why we might see this error anyway, or how to debug this further?
m
Hello Pete, could you please confirm whether you have an organization with the following hostname: "10.163.131.200"? Does this exception cause any issues with GD.CN, or do you just see it in the logs without any visible problems?
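For example, if you have kubectl access to the cluster, listing the Organization resources should show the hostnames the backend knows about. This is only a sketch; the CRD plural name and the .spec.hostname field path are assumptions based on a typical gooddata-cn Helm install and may differ in your version:
# List all Organization resources and the hostname each one is registered under (field path is an assumption)
kubectl get organizations -A \
  -o custom-columns=NAME:.metadata.name,HOSTNAME:.spec.hostname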
p
Hi Martin, our orgs do not have IP hostnames; they have DNS hostnames that map to our load balancer.
m
For some reason we see lookups for an organization with the hostname 10.163.131.200, which obviously does not exist.
Do you know what 10.163.131.200 is in your network?
p
That particular IP seems to be assigned at the moment, but from the CIDR range it appears to be a node in our cluster. Let me look for a more recent message and confirm.
I spoke too soon; we're still seeing the org-not-found error. Also, yesterday when we observed the error in our logs, we saw the error "Sorry, we can't display this insight" in the GD UI, which I'm still seeing today (not sure if it's related):
This error results from a 400 Bad Request from the afm service:
Attaching logs for afmExecApi, which show the org-not-found error at 10:06 am. About 5 minutes later, at 10:11 am, there appears to be a successful actuator check to this IP, which suggests the endpoint exists and is assigned to a host running afm-exec-api:
uri:http://10.163.226.87:9001/actuator/health/readiness @timestamp:Sep 20, 2023 @ 10:11:46.724 remote:/10.163.230.101:60496 state:200 time:Sep 20, 2023 @ 10:11:46.724 user-agent:kube-probe/1.24+ eks_index_prefix:beta-eks _p:F level:INFO thread:reactor-http-epoll-4 logger:org.zalando.logbook.Logbook traceId:95150cc8db2813d6 spanId:95150cc8db2813d6 msg:HTTP response accept:*/* action:httpResponse tier:beta correlationId:c46d8dc8a234f3a5 stream:stdout durationMs:7 kubernetes.annotations.prometheus_io/port:9001 kubernetes.annotations.prometheus_io/scrape:true kubernetes.annotations.prometheus_io/path:/actuator/prometheus kubernetes.annotations.kubernetes_io/psp:eks.privileged kubernetes.namespace_name:gooddata-cn
m
Hi Pete, I consulted with our infrastructure engineers and there has to be some misconfiguration in your environment - specifically in the ingress controller or the load balancer - because the IP of an internal service is leaking as the hostname instead of the real hostname. Could you please check your config? Was there any recent change that could have started generating this error? About the error "Sorry, we can't display this insight": does it happen for all insights or only a particular one? Did you start seeing it after a change, or did insights that computed fine before suddenly stop computing for no apparent reason? I recommend checking the trace ID in the logs; it will help to understand what the issue is about. Is this a production environment or a testing one? I am a little bit confused by the hostname generated by AWS.
Or do you use 10.163.131.200 in the OIDC configuration by any chance?
p
We're not using any IPs explicitly in our configuration; the nodes in our cluster are ephemeral anyway, since they're auto-scaled via Karpenter, so we have no guarantee that any IP will continue to exist.
This is our dev environment. Our stage and prod environments have identical configuration and are working with no issues.
r
As I wrote in another thread, the addresses from the 10.163.x.x range are pod IPs - something running within the cluster is trying to access a pod's port directly, without sending a valid "Host:" header with an existing organization hostname.
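For illustration, this is easy to reproduce from a debug pod inside the cluster. This is only a sketch: the port, the /api/v1/profile path and the analytics.example.com hostname are placeholders, not your actual values:
# Hitting a backend pod by IP without a Host header - the pod IP becomes the effective hostname,
# so the organization lookup fails with "No organization found for hostname 10.163.x.x"
curl -si http://10.163.131.200:9092/api/v1/profile
# The same request with a real organization hostname in the Host header should resolve the org
curl -si -H "Host: analytics.example.com" http://10.163.131.200:9092/api/v1/profile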
As far as the 400 Bad Request errors mentioned above are concerned, please check the "Response" tab in the browser's developer tools; it explains what was wrong with the given request. Alternatively, you may check the backend logs for the traceId stored in the "x-gdc-trace-id" response header.
p
This is what we see in the Response tab:
{
  "title": "Bad Request",
  "status": 400,
  "detail": "A result cache error has occurred during the calculation of the result",
  "resultId": "3b490d40b32e399c9938e194efa63bf51d10d7c3",
  "reason": "General error",
  "traceId": "d074e62da630b523"
}
r
and now try to find relevant backend logs with traceId=d074e62da630b523
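For example, a brute-force search across the backend pods would look something like this (a sketch; adjust the namespace if you installed the chart elsewhere):
# Grep every pod's logs in the gooddata-cn namespace for the trace id from the 400 response
for pod in $(kubectl -n gooddata-cn get pods -o name); do
  kubectl -n gooddata-cn logs "$pod" --all-containers 2>/dev/null \
    | grep d074e62da630b523 && echo "^ found in $pod"
done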
👍 1
p
We seem to have an issue with how our logs are ingested by OpenSearch that's causing messages with the traceId to be dropped. I followed up with our ops team, and they're saying they'll need a custom fluentbit parser for GD.CN logs to convert them to a format that's ingestible into OpenSearch.
The ops team is asking if it's possible for GD services to log in JSON format rather than plaintext. I looked in the Helm chart and can't find a setting for log format. Is it possible to put the log output in JSON?
r
Strange, we're logging in JSON format, at least in most of the services (all Java services). There's an env variable LOGGING_APPENDER set to json in all Java pods (afm-exec-api, api-gateway, auth-service, calcique, export-controller, metadata-api, result-cache, scan-model, sql-executor, pdf-stapler-service, visual-exporter-service). Python services like tabular-exporter or quiver also log in JSON format (unfortunately, quiver logs trace_id instead of traceId). There's no direct way to customize the LOGGING_APPENDER value, so as long as it is set in the pods, the logs are emitted in JSON format. If you remove this variable, logs will use the logfmt format (basically key=value pairs).
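A quick way to verify this in a running cluster (a sketch; the deployment name depends on your Helm release, so treat gooddata-cn-metadata-api as a placeholder):
# Confirm the appender setting inside a Java pod; expected output: LOGGING_APPENDER=json
kubectl -n gooddata-cn exec deploy/gooddata-cn-metadata-api -- env | grep LOGGING_APPENDER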
p
I'll follow up with the ops team about why/whether they're missing some JSON logs. For us, the main issue is that we're not seeing any logs that contain the "traceId", so I assume some of the Java JSON logs carrying it are being dropped. Would you happen to know if the Pulsar logs are also emitted in JSON?
r
Hi, I'm afraid Apache Pulsar doesn't emit JSON-formatted logs. I briefly checked, and it seems this image doesn't support simply changing the log4j message format.
👍 1
p
thanks for checking 😀