# gooddata-cloud
Evangelos Malandrakis:
Hey team, I’m getting a `500 internal server error` when using the `requests` library in Python to hit the `/api/v1/layout/workspaces/:workspace_id/logicalModel` endpoint. The trace ID for this is `3b3ad7ae-1228484`. Weirdly, when I run the exact same request in Postman, it takes a while but eventually succeeds. This difference is causing issues on our end since our deployment process relies on this API working smoothly. Could this be related to the issue mentioned here? https://gooddataconnect.slack.com/archives/C04S1MSLEAW/p1730391356547959?thread_ts=1730103507.185409&cid=C04S1MSLEAW Thanks in advance!
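For context, a minimal sketch of the kind of call the deployment makes. The host, token variable, workspace ID, and payload file below are placeholders rather than the real values from the thread:

```python
import os

import requests

# Placeholder values; the real host, token, and workspace ID come from the
# deployment configuration.
HOST = "https://example.gooddata.cloud"
WORKSPACE_ID = "my_workspace"
TOKEN = os.environ["GOODDATA_API_TOKEN"]

url = f"{HOST}/api/v1/layout/workspaces/{WORKSPACE_ID}/logicalModel"
headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

# The logical data model being deployed, loaded from a file in the repo.
with open("ldm.json", encoding="utf-8") as f:
    ldm_payload = f.read()

# The same request that eventually succeeds in Postman; here it sometimes
# returns a 500. A generous timeout rules out client-side timeouts.
response = requests.put(url, data=ldm_payload, headers=headers, timeout=300)
response.raise_for_status()
```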
Branislav Slávik:
Hello @Evangelos Malandrakis, Thank you for sharing the trace ID. Searching for the error in our logs, I can see that the error occurred several times yesterday (11.11.2024) between ~ 12:30 and ~ 13:50 CET. Also, it occurred only in your organisation and nowhere else. With that in mind, have you tried the same PUT call after this time frame? If so, did the call finish successfully or do you get the same 500 error every time? 🤔
Evangelos Malandrakis:
Hey @Branislav Slávik, I was performing deployments in that timeframe and here is what I noticed:
• some of the API calls were successful
• others failed at first, but succeeded after several retries
I didn't make any other similar API calls after this timeframe. Let me know if you need more details 😊
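The retry behaviour described above amounts to a loop like the following sketch. The attempt count and delay are illustrative, not the exact deployment settings:

```python
import time

import requests


def put_with_retries(url, payload, headers, attempts=5):
    """Retry the PUT a few times with a growing delay between attempts,
    mirroring the observation that some calls returned 500 at first but
    succeeded on a later attempt."""
    for attempt in range(1, attempts + 1):
        response = requests.put(url, data=payload, headers=headers, timeout=300)
        if response.status_code < 500:
            return response
        # Possibly transient 5xx: wait a bit longer each time before retrying.
        time.sleep(5 * attempt)
    # Still failing after the last attempt; surface the error to the caller.
    response.raise_for_status()
    return response
```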
Branislav Slávik:
Thank you for the additional details. As far as I can see, the related error is the following:
```
org.postgresql.util.PSQLException: ERROR: canceling statement due to conflict with recovery
  Detail: User query might have needed to see row versions that must be removed.
```
I found the following explanations of when such issues usually happen:
> This error will occur when the standby server gets updates/deletes in the WAL stream that will invalidate data currently being accessed by a running query. You will primarily see this with long-running queries accessing tables with significant activity on the primary.
as well as:
> Some old row versions were removed by `VACUUM` on the primary server, but these row versions could still be required by the query on the standby. So there is a conflict between the running query and the startup process applying changes that come from the primary.
and:
> Running queries on a hot-standby server is somewhat tricky: they can fail, because during querying some needed rows might be updated or deleted on the primary. As the primary does not know that a query was started on the secondary, it thinks it can clean up (vacuum) old versions of its rows. The secondary then has to replay this cleanup, and has to forcibly cancel all queries which may use these rows.
Based on the above, I would say that this might be some sort of a "race condition" that happened at the time, all the more so since you mentioned you were doing some deployments. I am glad the issue is not persistent and you were able to get a successful response. With that in mind, I hope that it won't occur again with your next deployment(s). If by any chance it does, please let us know and we can dig deeper into your deployment procedure, etc. in order to investigate this further.
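If it does come back, one option on the client side is to record whatever detail a failed response carries, including the trace ID, so there is something concrete to share for further investigation. A rough sketch follows; note that the `traceId` field name in the JSON body is an assumption rather than a confirmed part of the error schema:

```python
import logging

import requests

log = logging.getLogger("gooddata-deployment")


def log_api_error(response: requests.Response) -> None:
    """Log the status, error body, and trace ID of a failed API call so it
    can be reported for follow-up."""
    trace_id = "<not present>"
    try:
        body = response.json()
        # Assumption: the error payload exposes the trace ID as "traceId".
        if isinstance(body, dict):
            trace_id = body.get("traceId", trace_id)
    except ValueError:
        # Not JSON; keep the raw text instead.
        body = response.text
    log.error(
        "PUT %s failed with %s (traceId=%s): %s",
        response.url,
        response.status_code,
        trace_id,
        body,
    )
```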
Evangelos Malandrakis:
Hey @Branislav Slávik, Thank you for the detailed breakdown! As I understand it, the issue could indeed be tied to the "race condition" you mentioned, as users logged in and performing queries might be triggering this conflict. This would align with the behavior we’re seeing, where some requests succeed while others require retries. To help avoid this in the future, we'll make sure to perform deployments during non-active hours. Thanks again for your help!