# gooddata-cn
k
Hi Team, all of a sudden we are seeing the below error in the afm-exec-api log under our average load on GoodData-CN, and it is causing a lot of slowness. Any idea what this error is and whether there is a quick fix for it? The full Cloud Logging entry:
{
  insertId: "3rgglj1fewcaj2m0"
  jsonPayload: {
    exc: "kotlinx.coroutines.JobCancellationException: MonoCoroutine was cancelled; job=MonoCoroutine{Cancelling}@690f491"
    level: "ERROR"
    logger: "com.gooddata.tiger.grpc.client.resultcache.ResultCacheClient"
    msg: "getResultCacheResponse"
    spanId: "ef4d670de5af5ff0"
    thread: "reactor-http-epoll-1"
    traceId: "ef4d670de5af5ff0"
    ts: "2022-10-17 23:55:19.975"
  }
  labels: {5}
  logName: "xxxx/logs/stdout"
  receiveTimestamp: "2022-10-17T23:55:24.874567330Z"
  resource: {2}
  severity: "ERROR"
  timestamp: "2022-10-17T23:55:19.976262607Z"
}
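(Side note for anyone tracing this later: a minimal sketch of pulling the related entries out of Cloud Logging, assuming the jsonPayload fields are exactly as in the entry above; the filter expressions are illustrative, not an official GoodData procedure.)
# All ERROR entries from the result-cache gRPC client logger
gcloud logging read 'severity=ERROR AND jsonPayload.logger="com.gooddata.tiger.grpc.client.resultcache.ResultCacheClient"' --limit=50 --format=json
# Everything that shares the same traceId, across all services
gcloud logging read 'jsonPayload.traceId="ef4d670de5af5ff0"' --limit=200 --format=json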
j
@Martin Blazik please help
m
@Kshirod Mohanty would you give me a clue where you run it? I need more detailed logs to investigate the issue
k
We have installed GoodData-CN in a GCP K8s environment. Our client is a GCP Cloud Function which makes a series of calls to get insights from GoodData-CN. Please let me know which logs you are looking for; I can provide them.
r
The error message above is just a gRPC timeout while retrieving data from the result-cache component, which retrieves data from Redis and also performs the final data transformation (cross-tabulation). Is it possible that this error is somehow related to the Redis error you had 3 days ago? BTW, did you check or set the Redis maxmemory-policy to allkeys-lru as I suggested? Please check the logs from the result-cache pods for exceptions. It's also possible to trace down the request path using the same traceId that is present in most of the logs.
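A minimal sketch of those two checks, assuming a gooddata-cn namespace and default deployment names (both are guesses, adjust to your install):
# Exceptions in the result-cache pods around the failure window
kubectl -n gooddata-cn logs deploy/gooddata-cn-result-cache --since=2h | grep -i exception
# Follow one request end-to-end using the traceId from the error above
kubectl -n gooddata-cn logs deploy/gooddata-cn-result-cache --since=2h | grep ef4d670de5af5ff0
kubectl -n gooddata-cn logs deploy/gooddata-cn-afm-exec-api --since=2h | grep ef4d670de5af5ff0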
k
Here you go. @Martin Blazik @Robert Moucha
@Robert Moucha Yes, it's related to the Redis cache. We increased Redis memory and deleted all the keys manually. After that we hardly see any errors like Redis Command TimedOut or Invalid Cache state. We keep deleting keys from Redis manually whenever the number of keys exceeds 100K. But this particular error never disappears, even after restarting the afm service deployment. Our earlier load was handling up to 500~700 test scenarios with the number of keys in Redis below 100K, but now it is reduced to 100~200 test scenarios. The client service that uses the GD-CN API/SDK is extremely slow as the response time from GD-CN increases.
@rohit mathur @Vivek M ^^
r
@Kshirod Mohanty Do you have the Redis maxmemory-policy set to allkeys-lru as I suggested? It configures the Redis instance to delete the least recently used keys, so you will not need to delete keys manually. According to the log, it looks like there is an inconsistent state among several keys in Redis that coordinate report computation and result delivery. In order to recover from this error, perform the following steps: 1. Flush all keys in Redis (a redis-cli sketch follows after the commands below). 2. Exec into the pulsar/pulsar-pulsar-0 pod and run:
apache-pulsar/bin/pulsar-admin topics delete "persistent://gooddata-cn/gooddata-cn/sql.select"
apache-pulsar/bin/pulsar-admin topics delete "persistent://gooddata-cn/gooddata-cn/compute.calcique"
apache-pulsar/bin/pulsar-admin topics delete "persistent://gooddata-cn/gooddata-cn/result.xtab"
The Pulsar topics will be recreated immediately. Unprocessed messages will be lost, which will cause all running computations to fail. But when you reload your dashboard, the data will be recalculated and the Redis cache repopulated.
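A minimal sketch of step 1 plus a post-deletion check; <redis-host> is a placeholder and the tenant/namespace path is the one used in the commands above:
# Step 1: flush all keys in the Redis instance used by GoodData-CN
redis-cli -h <redis-host> FLUSHALL
# After the topics are deleted (and recreated on first use), confirm they are back
apache-pulsar/bin/pulsar-admin topics list gooddata-cn/gooddata-cn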
k
@Robert Moucha thanks for the quick response. I could not find any backlog for these topics, but I can see a sizable difference between msgInCounter and msgOutCounter for both the result.xtab and compute.calcique topics. What is the reason behind deleting/recreating these topics? Just curious, can we skip-all the old messages instead?
Precondition failed while trying to delete the topics:
root@pulsar-broker-0:/pulsar# bin/pulsar-admin topics delete "persistent://gooddata-cn/gooddata-cn/sql.select"
2022-10-20T00:58:25,357+0000 [AsyncHttpClient-7-1] WARN org.apache.pulsar.client.admin.internal.BaseResource - [http://localhost:8080/admin/v2/persistent/gooddata-cn/gooddata-cn/sql.select?force=false&deleteSchema=false] Failed to perform http delete request: javax.ws.rs.ClientErrorException: HTTP 412 Precondition Failed
Topic has active producers/subscriptions
Reason: Topic has active producers/subscriptions
r
HTTP 412 Precondition Failed
Right, when apps are connected to the topics, the force flag needs to be set (apache-pulsar/bin/pulsar-admin topics delete -f <some-topic>). msgInCounter and msgOutCounter may differ because a single incoming message may be redelivered when an error occurred during the first processing attempt. But if backlogSize is zero, it's not necessary to recreate the topic. It's possible to clear the backlog of a topic subscription (without actual topic deletion) using pulsar-admin topics clear-backlog, but you need to pass the subscription name as well (these can be seen in the "topics stats" output). As long as the backlog is empty, it's not necessary to truncate it by any means. Please respond to my question: did you check or set the Redis maxmemory-policy to allkeys-lru as I suggested?
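A minimal sketch of the stats/clear-backlog path described above; <subscription-name> is a placeholder you would read from the stats output:
# Inspect backlogSize, msgInCounter/msgOutCounter and the subscription names
apache-pulsar/bin/pulsar-admin topics stats "persistent://gooddata-cn/gooddata-cn/result.xtab"
# Clear the backlog of one subscription without deleting the topic
apache-pulsar/bin/pulsar-admin topics clear-backlog -s <subscription-name> "persistent://gooddata-cn/gooddata-cn/result.xtab"
# If deletion is still needed while producers/consumers are connected, force it
apache-pulsar/bin/pulsar-admin topics delete -f "persistent://gooddata-cn/gooddata-cn/sql.select"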
k
@Robert Moucha I have changed maxmemory-policy to allkeys-lru. I did a very small load test and I don't see any errors in the afm service log. Here is our Redis configuration: single node, capacity 150 GB, max memory 150 GB. The eviction policy only takes effect when memory usage reaches the maxmemory-gb limit, which is 150 GB. I am sure this is not optimal. We bumped the memory up from 5 GB to 50 GB, then to 150 GB, because of the recent gooddata-cn slowness we had during load testing (500 test scenarios). From the metrics below, for a test of 20 scenarios, I see the number of keys is 1.5K. I ran the test at 5 PM San Francisco time. I can still see (6 hours after the test run) that the keys are active and only a few with smaller TTLs were evicted. Our client creates 75 keys for each test scenario, so running a load of 500 will create 35K keys in a span of 30 minutes. With this trend (keys are not actually evicted), and from our recent experience, we will be hit by the gooddata-cn slowness again when the number of keys is around 20K. Can you please recommend the Redis capacity we should use for gooddata-cn?
[attachment: image.png, Redis key count metrics]
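A minimal sketch of verifying the policy and watching key counts during a test; <redis-host> is a placeholder, and managed Redis services (e.g. Memorystore) may restrict the CONFIG command:
# Confirm the eviction policy and memory limit actually in effect
redis-cli -h <redis-host> CONFIG GET maxmemory-policy
redis-cli -h <redis-host> CONFIG GET maxmemory
# Watch key count and memory usage while the load test runs
redis-cli -h <redis-host> DBSIZE
redis-cli -h <redis-host> INFO memory | grep -E 'used_memory_human|maxmemory_human'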
r
Our application uses only basic Hash and Set operations in Redis; most of them (maybe all of them) have O(1) algorithmic complexity, so processing time is not affected by the number of stored keys. So there's no reason why the number of keys would affect performance. For your reference, here are charts from one regular perf test run in our environment: even while memory is "full" and the number of keys reaches 700K, the Redis response time usually remains below 20μs.
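A minimal sketch of measuring this on your side; redis-cli's latency mode measures round trips from wherever it runs, so expect higher numbers than the server-side figure quoted above:
# Round-trip latency sampled continuously (Ctrl-C to stop)
redis-cli -h <redis-host> --latency
# Server-side average microseconds per call for the hash/set commands
redis-cli -h <redis-host> INFO commandstats | grep -E 'cmdstat_(hget|hset|sadd|smembers):'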
The recommended Redis size depends on multiple factors. You can start small and watch the hit rate and eviction rate based on your traffic. When a user generates a report, both the raw result set and the cross-tabulated result are stored in the cache, so if a user wants to see the same or a similar report again, these cached results are checked for presence and returned immediately, without the need to query the datasource. On the other hand, if every report is used exactly once and never retrieved again, the cache can be small at the expense of a performance impact: every new report will query the data source and cached results will never be reused (the cache hit rate will be zero). But this is not a usual behavioral pattern.
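A minimal sketch of watching the hit rate and eviction rate mentioned above; the hit-rate formula is standard Redis bookkeeping, nothing GoodData-specific:
# Hits vs. misses, plus keys evicted under maxmemory pressure
redis-cli -h <redis-host> INFO stats | grep -E 'keyspace_hits|keyspace_misses|evicted_keys'
# Hit rate = keyspace_hits / (keyspace_hits + keyspace_misses); recheck after each load test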
We also discovered some bugs in the cache-managing code, but those were fixed in the 2.x versions. Maybe you hit them, but I can't tell more without additional context. In order to proceed with resolving this issue, I suggest you follow the steps described in https://community.gooddata.com/administration-kb-articles-49/gooddata-cn-collecting-diagnostics-support-bundle-363 and send us the resulting logs package, collected right after the slowdown happens.