# gooddata-cn
p
We're sometimes seeing this exception in our metadata-api logs:
errorType=com.gooddata.tiger.common.exception.NotFoundException, message=No organization found for hostname 10.163.131.200
	at com.gooddata.tiger.grpc.error.ExceptionsKt.buildClientException(Exceptions.kt:34)
	at com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertFromKnownException(ErrorPropagation.kt:244)
	at com.gooddata.tiger.grpc.error.ErrorPropagationKt.convertToTransferableException(ErrorPropagation.kt:210)
	at com.gooddata.tiger.grpc.error.ErrorPropagationKt.clientCatching(ErrorPropagation.kt:105)
	at com.gooddata.tiger.grpc.error.ErrorPropagationKt$clientCatching$1.invokeSuspend(ErrorPropagation.kt)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:55)
	at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:274)
	at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:84)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
	at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
	at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
	at com.gooddata.tiger.apigateway.cors.OrganizationCorsConfigurationSource.organizationAllowedOrigins(OrganizationCorsConfigurationSource.kt:67)
	at com.gooddata.tiger.apigateway.cors.OrganizationCorsConfigurationSource.handleNonNullHostname(OrganizationCorsConfigurationSource.kt:41)
	at com.gooddata.tiger.apigateway.cors.OrganizationCorsConfigurationSource.getCorsConfiguration(OrganizationCorsConfigurationSource.kt:33)
However, the organization we're calling against exists. Any insights into why we might see this error anyway, or how to debug this further?
m
Hello Pete, could you please confirm whether you have an organization with the following hostname: "10.163.131.200"? Does this exception cause any issues with GD.CN, or do you just see it in the logs without any visible problems?
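For example, if you have kubectl access to the cluster, listing the Organization resources should show the hostnames the backend knows about. This is only a sketch; the CRD plural name and the .spec.hostname field path are assumptions based on a typical gooddata-cn Helm install and may differ in your version:
# List all Organization resources and the hostname each one is registered under (field path is an assumption)
kubectl get organizations -A \
  -o custom-columns=NAME:.metadata.name,HOSTNAME:.spec.hostname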
p
Hi Martin, our orgs do not have IP hostnames; they have DNS hostnames that map to our load balancer.
m
For some reason we see lookups for an organization with the hostname 10.163.131.200, which obviously does not exist.
Do you know what 10.163.131.200 is in your network?
p
That particular IP seems to be assigned at the moment, but from the CIDR range it appears to be a node in our cluster. Let me look for a more recent message and confirm.
I spoke too soon; we're still seeing the org-not-found error. Also, yesterday when we observed the error in our logs, we saw the error "Sorry, we can't display this insight" in the GD UI, which I'm still seeing today (not sure if it's related):
This error results from a 400 Bad Request from the afm service:
Attaching logs for afmExecApi, which show the org-not-found error at 10:06 am. About 5 minutes later, at 10:11 am, there appears to be a successful actuator check to this IP, which suggests the endpoint exists and is assigned to a host running afm-exec-api:
uri:http://10.163.226.87:9001/actuator/health/readiness @timestamp:Sep 20, 2023 @ 10:11:46.724 remote:/10.163.230.101:60496 state:200 time:Sep 20, 2023 @ 10:11:46.724 user-agent:kube-probe/1.24+ eks_index_prefix:beta-eks _p:F level:INFO thread:reactor-http-epoll-4 logger:org.zalando.logbook.Logbook traceId:95150cc8db2813d6 spanId:95150cc8db2813d6 msg:HTTP response accept:*/* action:httpResponse tier:beta correlationId:c46d8dc8a234f3a5 stream:stdout durationMs:7 kubernetes.annotations.prometheus_io/port:9001 kubernetes.annotations.prometheus_io/scrape:true kubernetes.annotations.prometheus_io/path:/actuator/prometheus kubernetes.annotations.kubernetes_io/psp:eks.privileged kubernetes.namespace_name:gooddata-cn
m
Hi Pete, I consulted with our infrastructure engineers and there has to be some misconfiguration in your environment - specifically in the ingress controller or the load balancer - because the IP of an internal service is leaking as the hostname instead of the real hostname. Could you please check your config? Was there any recent change that could have started generating this error? About the error "Sorry, we can't display this insight": does it happen for all insights or only a particular one? Did you start seeing it after a change, or did insights that computed fine before suddenly stop computing for no apparent reason? I recommend checking the trace ID in the logs; it will help to understand what the issue is about. Is this a production environment or a testing one? I am a little bit confused by the hostname generated by AWS.
Or do you use 10.163.131.200 in the OIDC configuration by any chance?
p
We're not using any IPs explicitly in our configuration; the nodes in our cluster are ephemeral anyway, since they're auto-scaled via Karpenter, so we have no guarantee that any IP will continue to exist.
This is our dev environment. Our stage and prod environments have identical configuration and are working with no issues.
r
As I wrote in another thread, the addresses from the 10.163.x.x range are pod IPs - something running within the cluster is trying to access a pod's port directly, without sending a valid "Host:" header with an existing organization hostname.
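For illustration, this is easy to reproduce from a debug pod inside the cluster. This is only a sketch: the port, the /api/v1/profile path and the analytics.example.com hostname are placeholders, not your actual values:
# Hitting a backend pod by IP without a Host header - the pod IP becomes the effective hostname,
# so the organization lookup fails with "No organization found for hostname 10.163.x.x"
curl -si http://10.163.131.200:9092/api/v1/profile
# The same request with a real organization hostname in the Host header should resolve the org
curl -si -H "Host: analytics.example.com" http://10.163.131.200:9092/api/v1/profile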
As far as the 400 Bad Request errors mentioned above are concerned, please check the "Response" tab in the browser's developer tools; it explains what was wrong with the given request. Alternatively, you may check the backend logs for the traceId stored in the "x-gdc-trace-id" response header.
p
This is what we see in the Response tab:
{
  "title": "Bad Request",
  "status": 400,
  "detail": "A result cache error has occurred during the calculation of the result",
  "resultId": "3b490d40b32e399c9938e194efa63bf51d10d7c3",
  "reason": "General error",
  "traceId": "d074e62da630b523"
}
r
and now try to find relevant backend logs with traceId=d074e62da630b523
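For example, a brute-force search across the backend pods would look something like this (a sketch; adjust the namespace if you installed the chart elsewhere):
# Grep every pod's logs in the gooddata-cn namespace for the trace id from the 400 response
for pod in $(kubectl -n gooddata-cn get pods -o name); do
  kubectl -n gooddata-cn logs "$pod" --all-containers 2>/dev/null \
    | grep d074e62da630b523 && echo "^ found in $pod"
done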
👍 1
p
We seem to have an issue with how our logs are ingested by OpenSearch that's causing messages with the traceId to be dropped. I followed up with our ops team, and they're saying they'll need a custom fluentbit parser for GD.CN logs to convert them to a format that's ingestible into OpenSearch.
The ops team is asking if it's possible for GD services to log in JSON format rather than plaintext. I looked in the Helm chart and can't find a setting for log format. Is it possible to put the log output in JSON?
r
Strange, we're logging in JSON format, at least in most of the services (all Java services). There's an env variable LOGGING_APPENDER set to json in all Java pods (afm-exec-api, api-gateway, auth-service, calcique, export-controller, metadata-api, result-cache, scan-model, sql-executor, pdf-stapler-service, visual-exporter-service). Python services like tabular-exporter or quiver also log in JSON format (unfortunately, quiver logs trace_id instead of traceId). There's no direct way to customize the LOGGING_APPENDER value, so as long as it is set in the pods, the logs are emitted in JSON format. If you remove this variable, logs will use the logfmt format (basically key=value pairs).
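A quick way to verify this in a running cluster (a sketch; the deployment name depends on your Helm release, so treat gooddata-cn-metadata-api as a placeholder):
# Confirm the appender setting inside a Java pod; expected output: LOGGING_APPENDER=json
kubectl -n gooddata-cn exec deploy/gooddata-cn-metadata-api -- env | grep LOGGING_APPENDER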
p
I'll follow up with the ops team about why/whether they're missing some JSON logs. For us, the main issue is that we're not seeing any logs that contain the "traceId", so I assume some of the Java JSON logs carrying it are being dropped. Would you happen to know if the Pulsar logs are also emitted in JSON?
r
Hi, I'm afraid Apache Pulsar doesn't emit JSON-formatted logs. I briefly checked, and it seems this image doesn't support simply changing the log4j message format.
👍 1
p
thanks for checking 😀