# gooddata-cn
c
We're seeing an issue with the Python SDK (v1.7.0) when calling `sdk.insights.get_insight`:
```
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='<gooddata-host>.com', port=443): Read timed out. (read timeout=None)")': /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL
```
We're making the call as one might expect:
```python
# Create the SDK client once, then fetch a single insight by ID.
sdk = GoodDataSdk.create(host, token)
insight = sdk.insights.get_insight(workspace, insight_id)
```
It doesn't happen consistently with every request we make, so we're trying to understand this error a bit more. It also occurred at times when our GD CN (3.1.0; we are a paying customer) services had no substantial load. cc: @James Lee, @Phanindra, @Pete Lorenz, @Kshirod Mohanty, @Sunil Kumar Vanapalli
b
Hello, this is strange... the Python SDK just generates a simple API call (https://<gooddata-host>.com/api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL) and runs it against your deployment. The error message suggests that the connection could not be established. I believe this isn't strictly related to the Python SDK; you would get the same error when calling the same API from the host where you run the Python SDK.
However, it's possible that there is a typo or mistake in one of the parameters/variables that makes the generated endpoint non-existent. I'd check whether the incoming API call appears in the logs and is cut off for some reason, or whether it never even makes it to the deployment.
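A quick way to rule the SDK out is to call the endpoint directly from the same host with an explicit timeout. A minimal sketch with plain `requests` (the bearer-token header and timeout values here are illustrative, reusing `host`, `token`, `workspace`, and `insight_id` from your snippet above):
```python
import requests

# Hit the same REST endpoint the SDK generates, with an explicit
# (connect, read) timeout so a hang surfaces quickly as an exception.
# Assumes `host` includes the scheme (e.g. "https://..."), as passed
# to GoodDataSdk.create above.
url = (
    f"{host}/api/v1/entities/workspaces/{workspace}"
    f"/visualizationObjects/{insight_id}"
)
resp = requests.get(
    url,
    params={"include": "ALL"},
    headers={"Authorization": f"Bearer {token}"},
    timeout=(5, 30),  # illustrative: 5 s to connect, 30 s to read
)
resp.raise_for_status()
print(resp.json())
```
If this direct call times out the same way, the SDK is off the hook; if it always succeeds, the problem more likely sits in how the SDK's connection is being used.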
c
is there any timeout for this API?
k
and can we override the timeout?
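in case it can't be overridden, a pure-stdlib sketch that at least bounds how long our code waits (the `get_insight_with_timeout` helper is hypothetical; the underlying request is not cancelled and keeps running in its worker thread):
```python
from concurrent.futures import ThreadPoolExecutor

def get_insight_with_timeout(sdk, workspace, insight_id, timeout_s=30):
    # Run the blocking SDK call in a worker thread and stop waiting
    # after timeout_s seconds (raises concurrent.futures.TimeoutError).
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(sdk.insights.get_insight, workspace, insight_id)
        return future.result(timeout=timeout_s)
    finally:
        # wait=False so a hung call doesn't block us on shutdown.
        pool.shutdown(wait=False)
```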
c
it's worth noting that we may get a different message when this happens:
```
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL
```
also, I find it strange that it immediately logs the `total=2` retry message but we never see the other two (`total=1` and `total=0`). When we've had URL issues in the past due to params/vars, it shows us multiple retry attempts with decrementing totals, such as this:
```
[2024-03-18 00:01:44.408 warning] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125eedf70>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL [urllib3.connectionpool - connectionpool.py:824 - urlopen()]
[2024-03-18 00:01:44.410 warning] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125f3bf70>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL [urllib3.connectionpool - connectionpool.py:824 - urlopen()]
[2024-03-18 00:01:44.411 warning] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125f3be80>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL [urllib3.connectionpool - connectionpool.py:824 - urlopen()]
[2024-03-18 00:01:44.413 error] Failed to query gooddata table insight <insightId>: HTTPSConnectionPool(host='<gooddata-host>', port=443): Max retries exceeded with url: /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125f3bbe0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')) [gooddata_tables.py:120 - get_table_insight()]
```
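for reference, urllib3 logs the *remaining* retry budget: with `total=3` configured, the warnings count down through `total=2`, `total=1`, and `total=0` before `MaxRetryError` is raised. A standalone sketch of that setup with plain `requests` (illustrative only, not how the SDK configures its internal client):
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# total=3 allows three retries; each failed attempt logs a warning
# whose total= value is the *remaining* budget: 2, then 1, then 0.
retry = Retry(total=3, backoff_factor=0.5)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

# (same placeholder URL as in the logs above)
resp = session.get(
    "https://<gooddata-host>.com/api/v1/entities/workspaces/<workspaceId>"
    "/visualizationObjects/<insightId>?include=ALL",
    timeout=(5, 30),
)
```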
and when we check these URLs manually (via browser or curl), we do get the JSON data as expected. It is the same request being made several times in concurrent processes/threads, as we are generating multiple documents that contain the same insights. In some cases it succeeds just fine within 100ms, while other times we see this message and it can hang for upwards of 15 minutes (and for some odd reason, when it hangs we see the 15-minute case more often).
some additional info about our Python service: we are using gunicorn, configured with 3 workers of 8 threads each (we've tried lowering it to 2 workers but still see the issue). Our EKS containers are configured with 2 CPUs and 2Gi of memory, and we're running 12 pods.
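one thing worth checking in this setup: whether the `GoodDataSdk` client (and its underlying urllib3 connection pool) is created before gunicorn forks its workers; a pool inherited across forked processes is a classic source of `Connection reset by peer`. A sketch of creating the client lazily per thread instead (the `get_sdk` helper is hypothetical):
```python
import threading

from gooddata_sdk import GoodDataSdk

# One SDK instance (and thus one connection pool) per thread, created
# lazily after gunicorn has forked, so no socket is ever shared
# across worker processes or threads.
_local = threading.local()

def get_sdk(host: str, token: str) -> GoodDataSdk:
    if not hasattr(_local, "sdk"):
        _local.sdk = GoodDataSdk.create(host, token)
    return _local.sdk
```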
here's what shows up in the logs: just a single retry message with `total=2`
b
there is a default timeout for all APIs, but I don't think it's relevant here (definitely not the root cause). Since you mention that you are calling the same API repeatedly, it seems to me it could be some built-in rate limiting/DoS protection in the ingress controller. We don't have any rate limiting implemented at the application level (yet).
c
@Boris just an FYI, I am resurrecting this thread 🙂 We've spent a lot of resources attempting to determine the root cause of the above issue, to no avail. We've checked Cloudflare and the nginx-ingress-controller logs, but the problem is that the request never makes it outside of our application and seems to get hung inside it. We've attempted various configurations via `gunicorn` and different worker classes (such as `gevent` and `gthread`), but no matter what configuration we try, we run into the same issue. My guess is some type of urllib3 connection pool issue with threading, but I can't be sure. I've explained all of this in greater detail in a support case I filed today: https://support.gooddata.com/hc/en-us/requests/121068 We've been handling the issue manually, but it is starting to block our progress as we scale up our statement generation for clients
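in the meantime, a minimal standard-library sketch one can drop into the service to surface where the call stalls, by turning on urllib3's own logging:
```python
import logging

# Surface urllib3's connection pool activity (new connections,
# retries, resets) with timestamps, to pinpoint where a call hangs.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
```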
also wanted to ping @Radek Novacek here as well just in case, as this has turned into an urgent matter. We are available to meet between 7am and 11pm PST if that helps