We re seeing an error in the afm exec api logs related to ti GoodData #gooddata-cn

We're seeing an error in the afm-exec-api logs rel...

Pete Lorenz

01/22/2024, 9:19 PM

We're seeing an error in the afm-exec-api logs related to timing out requests in the GD.CN UI:

Copy code

{
  "ts": "2024-01-22 20:14:55.724",
  "level": "ERROR",
  "logger": "com.gooddata.tiger.grpc.client.resultcache.ResultCacheClient",
  "thread": "DefaultDispatcher-worker-1",
  "traceId": "2098d9d948e8cfb5",
  "spanId": "2098d9d948e8cfb5",
  "userId": "james.lee",
  "orgId": "eb6f13c6-4b73-4a09-af89-adfb572d972e",
  "tokenId": "",
  "msg": "getResultCacheResponse",
  "action": "grpcClientCall",
  "exc": "kotlinx.coroutines.JobCancellationException: MonoCoroutine was cancelled; job=MonoCoroutine{Cancelling}@77b61a6c\n"
}

Repro steps: 1. Log into a particular tenant with LDM referencing our Snowflake data source 2. Click the "Analyze" tab, observe "Untitled insight" and a list of fields on the left 3. Drag one of the columns with data to the "ROWS" field in the new insight. In the display area, observe the "Computing" progress indicator 4. The operation appears to time out with the above error in the logs and 404 response from the following endpoint:

Copy code

https://[hostname]/api/v1/actions/workspaces/b922ec833042440287f8e8837607fb6b/execution/afm/execute/result/6a0c80d11e875a869be6dd2f3c8b727b05046bb4?offset=0%2C0&limit=100%2C1000

The 404 response has the following content:

Copy code

{
  "title": "Not Found",
  "status": 404,
  "detail": "Result not found in result cache",
  "resultId": "6a0c80d11e875a869be6dd2f3c8b727b05046bb4",
  "traceId": "76ad754f13aaf80a"
}

Note: this procedure works with the same data source on our GCP instance of the same tenant / organization, but times out on our AWS instance of this tenant / organization. The insights were migrated from v2.3.2 on GCP to v3.1.0 on AWS using the AIO method provided in the above thread. The timeout occurs at exactly 3 minutes. I'm attaching the logs for afm-exec-api and result-cache for the relevant time period. Please let us know if any additional info would be helpful to debug this issue. Thanks for any help.

gooddata-cn-afm-exec-api-fc765849c-srz5v_afm-exec-api.log gooddata-cn-result-cache-5d8f9cb8b-xnjvf_result-cache.log

Pete Lorenz

01/22/2024, 9:30 PM

Timeout after 3 minutes:

Jan Kos

01/23/2024, 1:53 PM

Hi Pete, I believe 3 minutes is timeout set on nginx and when appropriate response is not received the request will timeout. Question is why the insight took that long to compute. To that I’m not sure though. Result ID (

6a0c80d11e875a869be6dd2f3c8b727b05046bb4

in this case) is generated by afm-exec-api during insight execution and it is registered into result cache. If a result is already in result cache in it’ll get picked up immediately and isight is displayed in the UI. If result is not in the result cache sql queries are generated and executed against a datasource. In the attached logs there is only the afm execution registering the result id into result cache and polling for result which ends up in 404 because it can’t be found after the said 3 minutes. I would recommend to: • for the sake of completeness clear datasource caches hitting uploadNotification API to clear caches • Try to execute the insight in analytical designer again (clearing caches often times to resolve various issues) and you can replace

/edit

part of URL by

/debug

. It will download a zip file with afm explanation resources including sql file containing query which is being executed in datasource. • You can try to execute the query directly in the datasource to see how long it takes to finish and/or compare it with the explain resource in your other deployment where it works • It might be worth to check SQL_EXECUTOR pods logs if there are any timeouts on sql queries.

Pete Lorenz

01/23/2024, 3:21 PM

Thanks, Jan. We will investigate.

Pete Lorenz

01/23/2024, 3:40 PM

I called the "uploadNotification" to clear caches, then tried again. Well before the 3 minute nginx timeout, we're seeing this "upstream timed out" error:

Copy code

2024/01/23 15:33:22 [error] 38#38: *13805101 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.101.3.89, server: stg1-gooddata-cn-zwi2zjezyz.clearwateranalytics.com, request: "GET /api/v1/actions/workspaces/b922ec833042440287f8e8837607fb6b/execution/afm/execute/result/d1392dfba7cd1a12615cfedc61e82e315108039d?offset=0%2C0&limit=100%2C1000 HTTP/1.1", upstream: "<http://10.162.4.4:9000/api/v1/actions/workspaces/b922ec833042440287f8e8837607fb6b/execution/afm/execute/result/d1392dfba7cd1a12615cfedc61e82e315108039d?offset=0%2C0&limit=100%2C1000>", host: "stg1-gooddata-cn-zwi2zjezyz.clearwateranalytics.com", referrer: "<https://stg1-gooddata-cn-zwi2zjezyz.clearwateranalytics.com/analyze/>"

Trying with /debug but not seeing where I can download a zip file, nothing is downloaded, just getting 404 ...

Pete Lorenz

01/23/2024, 4:55 PM

@Jan Kos the /debug url in analytical designer is not downloading a zip for me, is there a special step I need to take to download the zip file? Does it download automatically? Never mind, I can see the zip file downloaded on our GCP instance, but there's no zip being generated on AWS ... it appears there is something wrong with our AWS instance

Pete Lorenz

01/23/2024, 5:15 PM

I'm not sure if /api/v1/actions/dataSources/d54e5267-1cc2-4f0a-b85d-c7ab2ea2ddfa/uploadNotification is actually clearing the caches as expected, because after I call this endpoint, I don't see any query being executed in the Snowflake query log when I drag the "Account Id" field into the analytical designer. However, when I hover over the question mark next to "Account Id" on the left pane, I do see a query in the Snowflake query logs and the values are populated in the small popup. The query does not take much time, just 2.4 seconds to scan 8.5K rows. So it appears that the "Untitled insight" pane is not triggering any queries in Snowflake and the 3-minute timeout is occurring without any queries being executed. Looking at our GCP instance, I can see the queries in the Snowflake query log when I drag "Account Id" to the insight, but when I perform the same task on the AWS instance, I see no query (and there is no zip file generated).

Jan Kos

01/23/2024, 5:32 PM

under the normal circumstances the debug file is downloaded automatically and no other steps are needed, but it looks like there is something else going on… by the nginx log entry it looks like the requests are timing out further down on upstream server in this case I guess result cache service.. but to be honest I’m not sure what could be causing it, I’ll check internally.

👍 1

Pete Lorenz

01/23/2024, 5:36 PM

Right, we're seeing the debug file on GCP but not on AWS, trying to figure out which part of our AWS deploy is not configured properly

Pete Lorenz

01/23/2024, 5:44 PM

We were able to generate an insight successfully on AWS 2 weeks ago, but it seems something has changed. The same insight that generated results 2 weeks ago when we tested is now timing out. So it seems no insights are generating on our AWS deploy and they are all encountering this timeout issue. (Note: our GCP version is 2.3.2, but our AWS version is 3.1.0)

Pete Lorenz

04/01/2024, 3:53 PM

@Jan Kos we're seeing this issue again in production where we are not able to load any insights. The network tab in the browser debugger times out after 3 minutes and there is the error "Result not found in result cache" in the afmExecApi logs. Once again, the /debug endpoint in the analytical designer does not download a zip file. We've tried invalidating the caches, but this hasn't resolved the issue. Would you have any ideas on what we should try?

Pete Lorenz

04/01/2024, 4:45 PM

I should also note this is happening for two organizations, for another two organizations it is not happening

Pete Lorenz

04/01/2024, 10:26 PM

It seems this can happen with an error in sql-executor that prevents the result from populating in redis. The error seems to have been caused by a library conflict with an (unrelated) Dynatrace agent running on the k8s node. Once we mitigated this issue, our results are rendering again.

Jan Kos

04/02/2024, 12:22 PM

Hi Pete, sorry for being late here, I’m glad you were able to resolve the issue.

2 Views

Open in Slack

Previous Next