# gooddata-cn
m
Hi GoodData! We are currently testing GoodData CN and we have run into a peculiar issue. After an attempt to create a new dashboard via the API, our deployment has completely broken down. First, this error started to occur:
```
2023-05-05T09:09:37,307+0000 [pulsar-io-18-7] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://public/default/compute.calcique.DLQ / Calcique listener-dead-letter] Error reading entries at 263:2 : Cursor was already closed, Read Type Normal - Retrying to read in 58.336 seconds
```
Later on, more errors appeared repeatedly:
```
ts="2023-05-05 14:46:43.043" level=ERROR msg="gRPC server call" logger=com.gooddata.tiger.metadata.grpc.MetadataStoreGrpcService thread=DefaultDispatcher-worker-21 action=grpcServerCall orgId=<undefined> spanId=f53219cd692f24ae traceId=d395d0b2d670736c userId=<undefined> exc="org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30304ms.

ts="2023-05-05 14:49:14.024" level=ERROR msg="Internal Server Error" logger=com.gooddata.tiger.web.exception.ProblemExceptionHandling thread=grpc-default-executor-8399 orgId=default spanId=11fde5c0b85fb8cc traceId=11fde5c0b85fb8cc userId=33e2c1a3-f34b-4855-b187-3739f04a43db exc="errorType=com.gooddata.tiger.grpc.error.GrpcPropagatedServerException, message=org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30811ms.,<no detail>

2023-05-05T14:49:57,221+0000 [pulsar-io-18-3] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://public/default/compute.calcique.DLQ / Calcique listener-dead-letter] Error reading entries at 263:2 : Cursor was already closed, Read Type Normal - Retrying to read in 58.369 seconds
```
That is when we started to observe 500 responses. A few minutes later we started to receive:
```
2023-05-05T14:51:13,372+0000 [pulsar-web-48-8] ERROR org.apache.pulsar.broker.admin.impl.PersistentTopicsBase - [null] Topic persistent://public/default/data-source.change already exists

2023-05-05 14:51:25.374 UTC [1394] mduser@md ERROR:  type "dashboard_permissions" already exists
```
After some time we tried to restart the container, but that made things even worse: now the container cannot boot up, reporting
```
2023-05-08 09:02:44.820 UTC [13622] PANIC:  could not locate a valid checkpoint record
2023-05-08 09:02:45.926 UTC [13645] postgres@postgres FATAL:  the database system is starting up
```
Each time we attempt to start a container, it allocates about 20 GB of new space on the shared EFS. The container starts, however it is not capable of handling requests and gets killed by the load balancer. Our deployment is on AWS, using ECS (Fargate), EFS and RDS (PostgreSQL).
j
Hi Martin. Could you let me know what exactly you did before this started happening? From experience, creating a dashboard incorrectly via API shouldn’t result in anything more than that particular dashboard failing with an error. Which API endpoint did you use?
t
Hi Jan. The only requests we were sending were POST and GET to `api/v1/entities/workspaces/{workspaceId}/analyticalDashboards`. I don’t think that should be the cause of the issue, but that’s just what we were doing that day.
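For reference, a rough sketch of the calls in question. `HOST`, `TOKEN` and `WS` are placeholders, the POST payload is abbreviated (not a complete dashboard definition), and the vendor content type is an assumption about what the entities API expects:

```shell
# Placeholders — substitute your own deployment values
HOST="https://analytics.example.com"
TOKEN="<api_token>"
WS="myWorkspaceId"

# GET: list existing dashboards in the workspace
curl -s -H "Authorization: Bearer $TOKEN" \
  "$HOST/api/v1/entities/workspaces/$WS/analyticalDashboards"

# POST: create a new dashboard entity (payload abbreviated)
curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/vnd.gooddata.api+json" \
  -d '{"data":{"type":"analyticalDashboard","attributes":{"title":"My dashboard","content":{}}}}' \
  "$HOST/api/v1/entities/workspaces/$WS/analyticalDashboards"
```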
j
At first glance this looks like an issue with connecting to the data source, but I’ll need to get a second opinion on this and on how you could resolve it.
Are you using the community edition Docker image?
m
We are using community edition 2.3.0. We have replaced the shared drive (`/data`) with a clean state. The container boots up, but we get an `unregistered_redirect_url` error; I can send you the link in a DM if you would like. However, it seems to be a different error.
j
That’s an error coming from the Dex OIDC configuration.
m
Looks so. We are starting with a clean state, so we did not expect this.
Previously we migrated from v2.2.1 to 2.3.0, so we had not seen a clean state on this version. Is there any additional configuration we need to provide for this to work? The auth service has started successfully, showing the demo@example.com user credentials.
j
I am not sure, so I need to look into it a bit longer. I’ll update you as soon as I can.
m
Thanks. We are dealing with a two-step scenario: first we need to recover to a working state, then we would like to figure out what went wrong to prevent future issues. We have kept the shared drive, which managed to allocate about 221 GB of space, even though the data we had in the original deployment was quite small.
j
One question: Can you let me know what your ALLOW_REDIRECT_URL env variable is set to?
m
Not defined at the moment.
Shall I provide it? What should be the value?
I also have big issues with the mounted volume `/data`, as there are some state locks that prevent a correct boot if I replace the container.
OK, I've managed to pinpoint some of the issues. If the container undergoes a "graceful shutdown", the replacement can safely pick up the EFS-mounted drive files and continue. However, if the termination is abrupt, the ECS deployment ends up with a PANIC error from Postgres. I am not sure what we are doing wrong here, but we can overcome this issue for now. We still have issues with the `unregistered_redirect_url` though.
We also do not have visibility into why the original issue occurred.
After a bit of searching and poking around in the container, I've found that, contrary to a local docker run, the ECS deployment lacks an entry in the `auth_request` table in `/data/dex.db`. Redirect URL is one of the columns there, so I suppose that is the culprit. I've so far failed to pinpoint what went wrong on the clean bootstrap. The current deployment has logged:
```
Empty volume detected, creating data directory
```
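For inspection, the Dex state can be queried directly with the sqlite3 CLI. The `auth_request` table name and path come from the observation above; the exact schema is otherwise an assumption:

```shell
# List all tables in the Dex database on the mounted volume
sqlite3 /data/dex.db '.tables'

# Show the schema of the auth_request table mentioned above
sqlite3 /data/dex.db '.schema auth_request'

# Count rows — on the broken ECS deployment this table was reportedly empty
sqlite3 /data/dex.db 'SELECT COUNT(*) FROM auth_request;'
```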
j
Hello Martin. I’m sorry about the delay, but I am trying to find you some help from colleagues who are more familiar with the internals of setting up a CN deployment and what could have gone wrong with the Dex config.
m
If needed, I can SSH into the container.
I can confirm that a clean start in our environment with v2.2.1 does not manifest the issue. I will try to upgrade.
After the upgrade we are in a running state. We are, however, unsure what causes the issue. We have also lost the logs from the previous incident due to short log retention. We will attempt to break things the same way as before and will keep you posted.
j
It is good to hear that you are running. It’d be good to understand what happened. If you manage to reproduce the issue with exact steps, could you please send them over to us?
v
It is kind of strange that 2.3.0 does not start from a blank state, but installing 2.2.1 and then upgrading to 2.3.0 works. It seems like the migration handles something that a clean 2.3.0 does not.
m
The only difference on startup is actually the instance URL on HTTPS, apart of course from the Docker setup that provides the start. The `/data` volume is persistent in both cases. I cannot figure out what could go wrong on boot. I can export the logs if you are interested.
r
Sorry for being late to the party; please allow me to add a few notes explaining the behaviour:
1. As you already noticed, all persistent data is stored in the `/data` directory. This data contains (among other things) the PostgreSQL data dir and the Dex db in sqlite3 format (`dex.db`).
2. When running directly from Docker, it's possible to mount a volume into this directory using `-v somevolume:/data` so data will survive various container lifecycle events (including stop and delete). This is the only way to support gooddata-cn-ce upgrades: simply stop the old container and start a new one with the data volume mounted.
3. Downgrades are currently not possible: some components perform an upgrade of the db schema, and if you start an older image version with such an updated volume, it will not work (in most cases). You can copy the Docker volume data to a safe place before running an upgrade, to make sure you still have an older data copy you can use with the previous image version in case of trouble.
4. As far as ECS is concerned, I don't know your exact configuration, but remember the data volume contains databases. The errors you're describing suggest the volume was forcefully detached while the container was running.
5. I don't have in-depth experience with EFS and how it allocates space when used with ECS, but 20 GB right after container start looks really suspicious. An empty PG db has less than 100 MB, and even with a big data model the size hardly exceeds 500 MB.
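The stop-old/back-up/start-new flow described above can be sketched roughly like this (container, volume and image names are examples, not your actual setup; `busybox` is only a throwaway container used to reach the volume):

```shell
# 1. Stop and remove the old container; the data lives in the named volume
docker stop gooddata && docker rm gooddata

# 2. Back up the data volume before upgrading, in case a rollback is needed
#    (downgrades against an upgraded schema will not work)
docker run --rm -v gooddata-data:/data -v "$PWD":/backup busybox \
  tar czf /backup/gooddata-data-backup.tgz -C /data .

# 3. Start the new image version with the same volume mounted
docker run -d --name gooddata -p 3000:3000 \
  -v gooddata-data:/data \
  -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES \
  gooddata/gooddata-cn-ce:2.3.0
```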
t
Hi @Jan Rehanek. We ended up with the same problem after setting up OIDC and then reverting back to the default one with a PUT request to `/api/v1/entities/admin/organizations/{organization}` with:
```json
{
  "data": {
    "id": "default",
    "type": "organization",
    "attributes": {
      "name": "Default Organization",
      "hostname": "<our_hostname>"
    }
  }
}
```
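As a sketch, that PUT expressed as a curl call might look like the following. The hostname and token are placeholders, and the vendor content type is an assumption about what the entities API expects:

```shell
# Placeholders — substitute your own deployment values
HOST="https://<our_hostname>"
TOKEN="<api_token>"

# Revert the organization entity to the default (internal) OIDC setup
curl -s -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/vnd.gooddata.api+json" \
  -d '{
    "data": {
      "id": "default",
      "type": "organization",
      "attributes": {
        "name": "Default Organization",
        "hostname": "<our_hostname>"
      }
    }
  }' \
  "$HOST/api/v1/entities/admin/organizations/default"
```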
j
What exactly does the ‘same problem’ mean in this case? Is it the `unregistered_redirect_url` error popping up when you’re trying to access the hostname, or is it more?
t
yes, exactly that.
j
All right. I’ll see if I can reproduce the issue.
🙏 1
Let me just make sure that I’m understanding the steps correctly.
1. Run the community edition from the Docker image provided on our website.
2. Update the OIDC config and hostname at `/api/v1/entities/admin/organizations/default` with a PUT to some new OIDC provider.
3. Update `/api/v1/entities/admin/organizations/default` with a PUT that only contains:
```json
{
  "data": {
    "id": "default",
    "type": "organization",
    "attributes": {
      "name": "Default Organization",
      "hostname": "{{custom_hostname}}"
    }
  }
}
```
Is that all or am I missing some intermediate step?
t
Yes, that’s exactly what’s happened. @Martin Váňa, anything extra to add?
j
Out of curiosity, how did you manage to change the hostname in the first place? I’m ending up with:
```json
{
  "detail": "Organization hostname cannot be changed",
  "status": 400,
  "title": "Bad Request",
  "traceId": "14b5c6b83d65da75"
}
```
👀 1
j
Curious as well, as I want to change the hostname for HTTPS 🙂
r
The only way to change the organization hostname on GoodData CN Community Edition is by setting the `GDCN_PUBLIC_URL` environment variable on container start. Unfortunately, version 2.3.0 contains a bug that generates an invalid redirect_uri for the Dex OAuth2 client. This bug makes it practically impossible to use a public URL with the default port for the given protocol (80 for HTTP, 443 for HTTPS) 😞 The error has already been fixed and will not be present in the next release. Or you may use some recent development build (April 12th is the first containing the fix).
j
My error was actually that I was including a line break in the command, but the issue I'm having now is that when I start the Docker container it throws errors and ends up force-stopping. That may be related to the error you're mentioning here, so I'll attempt to use the more recent preview build to see if that gets me around it.
r
If you're using a released version (like 2.3.0) AND using GDCN_PUBLIC_URL with a URL without a port (like https://whatever.com), THEN you're affected.
j
Got it, so the force stop is probably related. When I get home I'll try with a different declared version.
r
The container starts, but you can't log in, getting an error like `Unregistered redirect url https://whatever.com:/login/oauth2/code/whatever.com`
j
Ah it's not getting that far, the container stops and the logs indicate errors.
r
If the container doesn't start at all, this is a different issue. Please share the log output and we'll check it.
This is the very end of the logs; I need to know what happened earlier. You can redirect the whole `docker logs` output to a file and send it to me. I will check what's wrong with your container.
j
@Robert Moucha
this was my command:
```shell
sudo docker run -i -t -p 3000:3000 -p 5432:5432 -v gooddata-dev:/data \
  -e GDCN_TOKEN_SECRET=XXXXXXXXXXXX \
  -e GDCN_PUBLIC_URL=https://analytics.novelcx.com \
  -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES \
  gooddata/gooddata-cn-ce:dev_20230517.a881988f
```
r
You pressed Ctrl-C in the container's terminal. This action sends a SIGINT signal to the supervisor, which shuts down the whole application stack:
```
============= All services of GoodData.CN are ready =============

127.0.0.1 - - [22/May/2023:11:54:26 +0000] "GET / HTTP/1.1" 200 5639 "-" "curl/7.74.0"
Nginx: ready
s6-rc: info: service nginx successfully started
s6-rc: info: service legacy-services: starting
s6-rc: info: service legacy-services successfully started
172.31.90.40 - - [22/May/2023:11:54:38 +0000] "GET / HTTP/1.1" 200 2587 "-" "ELB-HealthChecker/2.0"
^C
exiting...
s6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service nginx: stopping
```
^^^ Note the `^C` in the output.
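One way to avoid an accidental Ctrl-C killing the stack is to run the container detached and follow the logs separately. A sketch, reusing the flags from the command above:

```shell
# Run detached (-d) instead of interactive (-i -t), so ^C in your
# terminal never reaches the supervisor inside the container
sudo docker run -d --name gooddata-dev \
  -p 3000:3000 -p 5432:5432 -v gooddata-dev:/data \
  -e GDCN_TOKEN_SECRET=XXXXXXXXXXXX \
  -e GDCN_PUBLIC_URL=https://analytics.novelcx.com \
  -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES \
  gooddata/gooddata-cn-ce:dev_20230517.a881988f

# Follow the logs; Ctrl-C here only stops the log stream, not the container
sudo docker logs -f gooddata-dev

# Stop gracefully when needed, giving the services time to shut down
sudo docker stop -t 60 gooddata-dev
```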
j
I noticed that, thinking it was part of the script.
Made it to step 4: using the bootstrap token, upload the Organization layout with data source passwords (PUT /api/v1/layout/organization). Since the organization still has the internal OIDC set up, UI logins won't work at this moment. However, I get a series of errors from an incompatible PUT, which is a copy-paste of what I backed up from the last org. So far, using my old org layout, I have replaced the authId with my current one, deleted oauthIssuerLocation since I'm attempting to revert to the internal OIDC, and updated my old org layout field oauthClientId to match the current layout. I've also had to replace my hostname with my updated https domain, since that also threw an error. I'm out of ideas for resolving the error below. For anyone else trying to follow: I had to add "password": "database_password_here" under the type, url and username fields within the JSON schema.
```json
{
  "detail": "Invalid combination of auth properties. Specify either none or user + password or token. Datasource: XXXX, user: <yes>, password: <no>, token: <no>",
  "status": 400,
  "title": "Bad Request",
  "traceId": "7df800d37c2eea44"
}
```
@Robert Moucha I think I'm almost back, using gooddata-cn-ce:dev_20230517.a881988f since I changed the domain to https://analytics.novelcx.com. I have completed the PUT restoring my Auth0 info. I then updated the Auth0 client to the values from the guides I used when this worked before. After saving and waiting several minutes, I am getting this error, which did not happen before, with a redirect to https://analytics.novelcx.com/oauth2/authorization/analytics.novelcx.com:
```json
{
  "title": "Unauthorized",
  "status": 401,
  "detail": "401 UNAUTHORIZED \"Authorization failed for given issuer \"https://ncx-prod.auth0.com/authorize/\"\"",
  "traceId": "4b70fd5992b55eca"
}
```
r
the culprit is the issuer is set to
Copy code
<https://ncx-prod.auth0.com/authorize/>
oauthIssuerLocation
needs to be set to:
Copy code
<https://ncx-prod.auth0.com/>
in oidc configuration. Do not forget adding the trailing slash, Auth0 requires it.
Explanation: The configuration of OIDC is taken from well-known cfg endpoint (https://ncx-prod.auth0.com/.well-known/openid-configuration in your case). Issuer is
<https://ncx-prod.auth0.com/>
and authorization endpoint is retrieved automatically from openid-configuration document. So do not append
/authorize/
or whatever else to Issuer URL. See https://www.gooddata.com/developers/cloud-native/doc/2.3/manage-organization/set-up-authentication/#SetUpAuthenticationUsi[…]tIdentityProvider-Auth0 for Auth0-specific comments.
❤️ 1
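The correct issuer value can be verified directly from the discovery document mentioned above (`python3 -m json.tool` is used here only for pretty-printing; `jq` would work equally well):

```shell
# Fetch the OIDC discovery document and check the advertised
# issuer and authorization endpoint
curl -s https://ncx-prod.auth0.com/.well-known/openid-configuration \
  | python3 -m json.tool \
  | grep -E '"(issuer|authorization_endpoint)"'
```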
Regarding the database credentials, see https://www.gooddata.com/developers/cloud-native/doc/2.3/manage-organization/organization-api/backups/. You need to add either `username` and `password` OR `token` (for a db like BigQuery) to every exported datasource in your Organization layout.
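For illustration, re-adding credentials to one exported data source and putting the layout back might look like the sketch below. All values, file names, and the `POSTGRESQL` type string are placeholders or assumptions; `HOST` and `TOKEN` are your deployment's URL and API token:

```shell
# Example shape of a single data source entry in the exported layout,
# with "username" and "password" re-added (token-based sources would
# carry "token" instead). Values here are placeholders.
cat > datasource-fragment.json <<'EOF'
{
  "id": "my-datasource",
  "type": "POSTGRESQL",
  "url": "jdbc:postgresql://db.example.com:5432/mydb",
  "username": "db_user",
  "password": "database_password_here"
}
EOF

# PUT the full organization layout file (with the fixed data sources) back
curl -s -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @organization-layout.json \
  "$HOST/api/v1/layout/organization"
```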