# gooddata-cn
c
We're seeing an issue with the Python SDK (v1.7.0) when calling `sdk.insights.get_insight`:
```
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='<gooddata-host>.com', port=443): Read timed out. (read timeout=None)")': /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL
```
We're making the call as one might expect:
```python
# Create the SDK client once, then fetch a single insight by ID.
sdk = GoodDataSdk.create(host, token)
insight = sdk.insights.get_insight(workspace, insight_id)
```
It doesn't happen consistently with every request we make, so we're trying to understand this error a bit more. It also occurred at times when our GD CN (3.1.0; we are a paying customer) services had no substantial load. cc: @James Lee, @Phanindra, @Pete Lorenz, @Kshirod Mohanty, @Sunil Kumar Vanapalli
b
Hello, this is strange... the Python SDK just generates a simple API call (https://<gooddata-host>.com/api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL) and runs it against your deployment. The error message suggests that the connection could not be established. I believe this isn't strictly related to the Python SDK; you would get the same error when calling the same API from the host where you run the Python SDK.
However, it's possible that there is a typo or mistake in one of the parameters/variables that makes the generated endpoint non-existent. I'd check whether the incoming API call appears in the logs and is cut off for some reason, or whether it never even makes it to the deployment.
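A quick way to rule the SDK out is to call the endpoint directly from the same host with an explicit timeout. A minimal sketch with plain `requests` (the bearer-token header and timeout values here are illustrative, reusing `host`, `token`, `workspace`, and `insight_id` from your snippet above):
```python
import requests

# Hit the same REST endpoint the SDK generates, with an explicit
# (connect, read) timeout so a hang surfaces quickly as an exception.
# Assumes `host` includes the scheme (e.g. "https://..."), as passed
# to GoodDataSdk.create above.
url = (
    f"{host}/api/v1/entities/workspaces/{workspace}"
    f"/visualizationObjects/{insight_id}"
)
resp = requests.get(
    url,
    params={"include": "ALL"},
    headers={"Authorization": f"Bearer {token}"},
    timeout=(5, 30),  # illustrative: 5 s to connect, 30 s to read
)
resp.raise_for_status()
print(resp.json())
```
If this direct call times out the same way, the SDK is off the hook; if it always succeeds, the problem more likely sits in how the SDK's connection is being used.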
c
is there any timeout for this API?
k
and can we override the timeout?
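in case it can't be overridden, a pure-stdlib sketch that at least bounds how long our code waits (the `get_insight_with_timeout` helper is hypothetical; the underlying request is not cancelled and keeps running in its worker thread):
```python
from concurrent.futures import ThreadPoolExecutor

def get_insight_with_timeout(sdk, workspace, insight_id, timeout_s=30):
    # Run the blocking SDK call in a worker thread and stop waiting
    # after timeout_s seconds (raises concurrent.futures.TimeoutError).
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(sdk.insights.get_insight, workspace, insight_id)
        return future.result(timeout=timeout_s)
    finally:
        # wait=False so a hung call doesn't block us on shutdown.
        pool.shutdown(wait=False)
```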
c
it's worth noting that we may get a different message when this happens:
```
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL
```
also, I find it strange that it immediately logs the `total=2` retry message but we never see the other two (`total=1` and `total=0`). When we've had URL issues in the past due to params/vars, it shows us multiple retry attempts with decrementing totals, such as this:
```
[2024-03-18 00:01:44.408 warning] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125eedf70>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL [urllib3.connectionpool - connectionpool.py:824 - urlopen()]
[2024-03-18 00:01:44.410 warning] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125f3bf70>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL [urllib3.connectionpool - connectionpool.py:824 - urlopen()]
[2024-03-18 00:01:44.411 warning] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125f3be80>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL [urllib3.connectionpool - connectionpool.py:824 - urlopen()]
[2024-03-18 00:01:44.413 error] Failed to query gooddata table insight <insightId>: HTTPSConnectionPool(host='<gooddata-host>', port=443): Max retries exceeded with url: /api/v1/entities/workspaces/<workspaceId>/visualizationObjects/<insightId>?include=ALL (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x125f3bbe0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')) [gooddata_tables.py:120 - get_table_insight()]
```
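for reference, urllib3 logs the *remaining* retry budget: with `total=3` configured, the warnings count down through `total=2`, `total=1`, and `total=0` before `MaxRetryError` is raised. A standalone sketch of that setup with plain `requests` (illustrative only, not how the SDK configures its internal client):
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# total=3 allows three retries; each failed attempt logs a warning
# whose total= value is the *remaining* budget: 2, then 1, then 0.
retry = Retry(total=3, backoff_factor=0.5)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

# (same placeholder URL as in the logs above)
resp = session.get(
    "https://<gooddata-host>.com/api/v1/entities/workspaces/<workspaceId>"
    "/visualizationObjects/<insightId>?include=ALL",
    timeout=(5, 30),
)
```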
and when we check these URLs manually (via browser or curl), we do get the JSON data as expected. It is the same request being made several times in concurrent processes/threads, as we are generating multiple documents that contain the same insights. In some cases it succeeds just fine within 100ms, while other times we see this message and it can hang for upwards of 15 minutes (and for some odd reason, when it hangs we see the 15-minute case more often).
some additional info about our Python service: we are using gunicorn, configured with 3 workers of 8 threads each (we've tried lowering it to 2 workers but still see the issue). Our EKS containers are configured with 2 CPUs and 2Gi of memory, and we're running 12 pods.
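one thing worth checking in this setup: whether the `GoodDataSdk` client (and its underlying urllib3 connection pool) is created before gunicorn forks its workers; a pool inherited across forked processes is a classic source of `Connection reset by peer`. A sketch of creating the client lazily per thread instead (the `get_sdk` helper is hypothetical):
```python
import threading

from gooddata_sdk import GoodDataSdk

# One SDK instance (and thus one connection pool) per thread, created
# lazily after gunicorn has forked, so no socket is ever shared
# across worker processes or threads.
_local = threading.local()

def get_sdk(host: str, token: str) -> GoodDataSdk:
    if not hasattr(_local, "sdk"):
        _local.sdk = GoodDataSdk.create(host, token)
    return _local.sdk
```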
here's what shows up in the logs: just a single retry message with `total=2`
b
there is a default timeout for all APIs, but I don't think it's relevant here (definitely not the root cause). Since you mention that you are calling the same API repeatedly, it seems to me it could be some built-in rate limiting/DoS protection in the ingress controller. We don't have any rate limiting implemented at the application level (yet).
c
@Boris just an FYI, I am resurrecting this thread 🙂 We've spent a lot of resources attempting to determine the root cause of the above issue, to no avail. We've checked Cloudflare and the nginx-ingress-controller logs, but the problem is that the request never makes it outside of our application and seems to get hung inside it. We've attempted various configurations via `gunicorn` and different worker classes (such as `gevent` and `gthread`), but no matter what configuration we try, we run into the same issue. My guess is some type of urllib3 connection pool issue with threading, but I can't be sure. I've explained all of this in greater detail in a support case I filed today: https://support.gooddata.com/hc/en-us/requests/121068 We've been handling the issue manually, but it is starting to block our progress as we scale up our statement generation for clients
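in the meantime, a minimal standard-library sketch one can drop into the service to surface where the call stalls, by turning on urllib3's own logging:
```python
import logging

# Surface urllib3's connection pool activity (new connections,
# retries, resets) with timestamps, to pinpoint where a call hangs.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
```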
also wanted to ping @Radek Novacek here as well just in case, as this has turned into an urgent matter. We are available to meet between 7am and 11pm PST if that helps