Hello, we're on GoodData.CN 3.29. We're seeing t...
# gooddata-cn
p
Hello, we're on GoodData.CN 3.29. We're seeing this failure in the sqlExecutor logs which seems linked to certain insights not rendering:
action=nonRetryableCallFailed location=Location{uri=grpc://10.161.93.82:16001} flightAction=GetFlightInfo message="Call failed. Reason: Flight 'cache/raw_tmp/93ddf4b6-9448-45c5-af8d-a51bde47f676/21248fc8a6ff45a9b19445b64cbb76be/31eff337-aaf7-4738-89f2-f8008ff63009/2723cb2c-65d7-4fd8-b124-ad2ec64bfe2c/535c5a3c8308d2cfb5762ee85ce587a1' does not exist.. Detail: Failed"
To help us debug this issue, we'd like to know what causes this failure and how serious it is. Is there anything we can do to make our calls retryable (instead of nonretryable) so we can avoid these failures? 🙏
r
Hello Pete, may I ask which durableStorageType you use for the quiver pods? https://www.gooddata.com/docs/cloud-native/3.35/install/installation-configuration/#storage We support AWS S3 or a file system. If you use the file system (FS), you need to attach a volume with accessType ReadWriteMany (a single volume can be mounted to multiple pods, e.g. NFS, AWS EFS, or similar). If you don't specify a storage class supporting the ReadWriteMany access type, files will be kept locally in the pod that handled the request. When a subsequent request tries to access such a file, you may face this issue if the request is handled by a different pod.
p
Hi Robert, thanks for getting back. We use "S3" as the durableStorageType.
d
Hi Pete, is this issue persistent or does it resolve on retry? Are you running any cache invalidations at the same time by any chance? We have seen such errors occasionally when calling the uploadNotification API while a report execution is running 😕
p
Hi Dan, no, we are triggering no active cache invalidations at the time when we see these errors. The error does seem to resolve on retry, but this is very disconcerting for our clients who are generating large batches of reports. When they find that several reports fail due to this exception, they then have to hunt through a long list of reports to identify which ones failed and re-run a smaller batch of failed reports. They do not know that we are using GoodData, so they complain to us that our system is not reliable.
d
I see 😕 we have had this issue crop up from time to time and especially when invalidations were at play at the same time (that is why I asked about that). However, since it was relatively rare and solvable by a refresh, it has not been prioritized yet, unfortunately
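Since the error reportedly clears on a retry, one possible stopgap is to retry on the client side when generating report batches. Below is a minimal Python sketch under stated assumptions: `run_report` is a hypothetical callable that executes one report, and `FlightNotFoundError` is a hypothetical exception standing in for the "Flight ... does not exist" failure. Neither is part of any GoodData API, and this does not make the server-side call itself retryable.

```python
import time


class FlightNotFoundError(Exception):
    """Hypothetical error standing in for the 'Flight ... does not exist' failure."""


def run_with_retry(run_report, report_id, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call run_report(report_id), retrying on FlightNotFoundError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return run_report(report_id)
        except FlightNotFoundError:
            if attempt == attempts:
                raise  # out of attempts: let the caller record the failure
            sleep(base_delay * 2 ** (attempt - 1))


def run_batch(run_report, report_ids, **retry_kwargs):
    """Run every report and return (results, failed_ids)."""
    results, failed = {}, []
    for report_id in report_ids:
        try:
            results[report_id] = run_with_retry(run_report, report_id, **retry_kwargs)
        except FlightNotFoundError:
            failed.append(report_id)
    return results, failed
```

Returning the list of failed IDs alongside the results means a batch run reports exactly which reports still failed after retrying, so nobody has to hunt through the full list.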
p
This is not a rare or sporadic occurrence for us. We are seeing thousands of such events in our logs. We are seeing failures in reports where this is the only issue visible in the logs with the associated traceIds. We will follow up in a support request.
r
Hi Pete, Radek from L2 Technical Support here - when you do open the support ticket, can you include any log examples of this happening you have, with the complete log context (following one traceId start to finish)? As Dan mentioned, since this is relatively rare, it's a little hard to catch - so the more we have, the easier it will be to get to the bottom of this. Many thanks! 🙂
p
Hi Radek, absolutely. I will include the full logs with our request.