# gooddata-cn
m
Hi GoodData! We are currently testing GoodData CN and we have run into a peculiar issue. After an attempt to create a new dashboard via the API, our deployment has completely broken down. First, this error started to occur:
```
2023-05-05T09:09:37,307+0000 [pulsar-io-18-7] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://public/default/compute.calcique.DLQ / Calcique listener-dead-letter] Error reading entries at 263:2 : Cursor was already closed, Read Type Normal - Retrying to read in 58.336 seconds
```
Later on, more errors appeared repeatedly:
```
ts="2023-05-05 14:46:43.043" level=ERROR msg="gRPC server call" logger=com.gooddata.tiger.metadata.grpc.MetadataStoreGrpcService thread=DefaultDispatcher-worker-21 action=grpcServerCall orgId=<undefined> spanId=f53219cd692f24ae traceId=d395d0b2d670736c userId=<undefined> exc="org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30304ms.

ts="2023-05-05 14:49:14.024" level=ERROR msg="Internal Server Error" logger=com.gooddata.tiger.web.exception.ProblemExceptionHandling thread=grpc-default-executor-8399 orgId=default spanId=11fde5c0b85fb8cc traceId=11fde5c0b85fb8cc userId=33e2c1a3-f34b-4855-b187-3739f04a43db exc="errorType=com.gooddata.tiger.grpc.error.GrpcPropagatedServerException, message=org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30811ms.,<no detail>

2023-05-05T14:49:57,221+0000 [pulsar-io-18-3] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://public/default/compute.calcique.DLQ / Calcique listener-dead-letter] Error reading entries at 263:2 : Cursor was already closed, Read Type Normal - Retrying to read in 58.369 seconds
```
That is when we started to observe 500 responses. A few minutes later we started to receive:
```
2023-05-05T14:51:13,372+0000 [pulsar-web-48-8] ERROR org.apache.pulsar.broker.admin.impl.PersistentTopicsBase - [null] Topic persistent://public/default/data-source.change already exists

2023-05-05 14:51:25.374 UTC [1394] mduser@md ERROR:  type "dashboard_permissions" already exists
```
After some time we tried to restart the container, but that made things even worse: now the container cannot boot up, reporting
```
2023-05-08 09:02:44.820 UTC [13622] PANIC:  could not locate a valid checkpoint record
2023-05-08 09:02:45.926 UTC [13645] postgres@postgres FATAL:  the database system is starting up
```
Each time we attempt to start a container, it allocates about 20 GB of new space on the shared EFS. The container starts, however it is not capable of handling requests and gets killed by the load balancer. Our deployment is on AWS, using ECS (Fargate), EFS and RDS (PostgreSQL).
j
Hi Martin. Could you let me know what exactly you did before this started happening? From experience, creating a dashboard incorrectly via API shouldn’t result in anything more than that particular dashboard failing with an error. Which API endpoint did you use?
t
Hi Jan. The only requests we were sending were POST and GET to `api/v1/entities/workspaces/{workspaceId}/analyticalDashboards`. I don’t think that should be the cause of the issue, but that’s just what we were doing that day.
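For reference, a rough sketch of the calls in question. `HOST`, `TOKEN` and `WS` are placeholders, the POST payload is abbreviated (not a complete dashboard definition), and the vendor content type is an assumption about what the entities API expects:

```shell
# Placeholders — substitute your own deployment values
HOST="https://analytics.example.com"
TOKEN="<api_token>"
WS="myWorkspaceId"

# GET: list existing dashboards in the workspace
curl -s -H "Authorization: Bearer $TOKEN" \
  "$HOST/api/v1/entities/workspaces/$WS/analyticalDashboards"

# POST: create a new dashboard entity (payload abbreviated)
curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/vnd.gooddata.api+json" \
  -d '{"data":{"type":"analyticalDashboard","attributes":{"title":"My dashboard","content":{}}}}' \
  "$HOST/api/v1/entities/workspaces/$WS/analyticalDashboards"
```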
j
At first glance this looks like an issue with connecting to the data source, but I’ll need to get a second opinion on this and on how you could resolve it.
Are you using the community edition Docker image?
m
We are using community edition 2.3.0. We have replaced the shared drive (`/data`) with a clean state. The container boots up, but we get an `unregistered_redirect_url` error; I can send you the link in a DM if you would like. However, it seems to be a different error.
j
That’s an error coming from the Dex OIDC configuration.
m
Looks so. We are starting with a clean state, so we did not expect this.
Previously we migrated from v2.2.1 to 2.3.0, so we had not seen a clean state on this version. Is there any additional configuration we need to provide for this to work? The auth service has started successfully, showing the demo@example.com user credentials.
j
I am not sure, so I need to look into it a bit longer. I’ll update you as soon as I can.
m
Thanks. We are dealing with a two-step scenario: first we need to recover to a working state, then we would like to figure out what went wrong to prevent future issues. We have kept the shared drive, which managed to allocate about 221 GB of space, even though the data we had in the original deployment was quite small.
j
One question: Can you let me know what your ALLOW_REDIRECT_URL env variable is set to?
m
Not defined at the moment.
Shall I provide it? What should be the value?
I also have big issues with the mounted volume `/data`, as there are some state locks that prevent a correct boot if I replace the container.
OK, I've managed to pinpoint some of the issues. If the container undergoes a "graceful shutdown", the replacement can safely pick up the EFS-mounted drive files and continue. However, if the termination is abrupt, the ECS deployment ends up with a PANIC error from Postgres. I am not sure what we are doing wrong here, but we can overcome this issue for now. We still have issues with the `unregistered_redirect_url` though.
We also do not have visibility into why the original issue occurred.
After a bit of searching and poking around in the container, I've found that, contrary to a local docker run, the ECS deployment lacks an entry in the `auth_request` table in `/data/dex.db`. Redirect URL is one of the columns there, so I suppose that is the culprit. I've so far failed to pinpoint what went wrong on the clean bootstrap. The current deployment has logged:
```
Empty volume detected, creating data directory
```
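For inspection, the Dex state can be queried directly with the sqlite3 CLI. The `auth_request` table name and path come from the observation above; the exact schema is otherwise an assumption:

```shell
# List all tables in the Dex database on the mounted volume
sqlite3 /data/dex.db '.tables'

# Show the schema of the auth_request table mentioned above
sqlite3 /data/dex.db '.schema auth_request'

# Count rows — on the broken ECS deployment this table was reportedly empty
sqlite3 /data/dex.db 'SELECT COUNT(*) FROM auth_request;'
```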
j
Hello Martin. I’m sorry about the delay, but I am trying to find you some help from colleagues who are more familiar with the internals of setting up a CN deployment and what could have gone wrong with the Dex config.
m
If needed, I can SSH into the container.
I can confirm that a clean start in our environment with v2.2.1 does not manifest the issue. I will try to upgrade.
After the upgrade we are in a running state. We are, however, unsure what causes the issue. We have also lost the logs from the previous incident due to short log retention. We will attempt to break things the same way as before and will keep you posted.
j
It is good to hear that you are running. It’d be good to understand what happened. If you manage to reproduce the issue with exact steps, could you please send them over to us?
v
It is kind of strange that 2.3.0 does not start from a blank state, but installing 2.2.1 and then upgrading to 2.3.0 works. It seems like the migration handles something that a clean 2.3.0 does not.
m
The only difference on startup is actually the instance URL on HTTPS, apart of course from the Docker setup that provides the start. The `/data` volume is persistent in both cases. I cannot figure out what could go wrong on boot. I can export the logs if you are interested.
r
Sorry for being late to the party; please allow me to add a few notes explaining the behaviour:
1. As you already noticed, all persistent data is stored in the `/data` directory. This data contains (among other things) the PostgreSQL data dir and the Dex db in sqlite3 format (`dex.db`).
2. When running directly from Docker, it's possible to mount a volume into this directory using `-v somevolume:/data` so data will survive various container lifecycle events (including stop and delete). This is the only way to support gooddata-cn-ce upgrades: simply stop the old container and start a new one with the data volume mounted.
3. Downgrades are currently not possible: some components perform an upgrade of the db schema, and if you start an older image version with such an updated volume, it will not work (in most cases). You can copy the Docker volume data to a safe place before running an upgrade, to make sure you still have an older data copy you can use with the previous image version in case of trouble.
4. As far as ECS is concerned, I don't know your exact configuration, but remember the data volume contains databases. The errors you're describing suggest the volume was forcefully detached while the container was running.
5. I don't have in-depth experience with EFS and how it allocates space when used with ECS, but 20 GB right after container start looks really suspicious. An empty PG db has less than 100 MB, and even with a big data model the size hardly exceeds 500 MB.
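The stop-old/back-up/start-new flow described above can be sketched roughly like this (container, volume and image names are examples, not your actual setup; `busybox` is only a throwaway container used to reach the volume):

```shell
# 1. Stop and remove the old container; the data lives in the named volume
docker stop gooddata && docker rm gooddata

# 2. Back up the data volume before upgrading, in case a rollback is needed
#    (downgrades against an upgraded schema will not work)
docker run --rm -v gooddata-data:/data -v "$PWD":/backup busybox \
  tar czf /backup/gooddata-data-backup.tgz -C /data .

# 3. Start the new image version with the same volume mounted
docker run -d --name gooddata -p 3000:3000 \
  -v gooddata-data:/data \
  -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES \
  gooddata/gooddata-cn-ce:2.3.0
```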
t
Hi @Jan Rehanek. We ended up with the same problem after setting up OIDC and then reverting back to the default one with a PUT request to `/api/v1/entities/admin/organizations/{organization}` with:
```json
{
  "data": {
    "id": "default",
    "type": "organization",
    "attributes": {
      "name": "Default Organization",
      "hostname": "<our_hostname>"
    }
  }
}
```
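As a sketch, that PUT expressed as a curl call might look like the following. The hostname and token are placeholders, and the vendor content type is an assumption about what the entities API expects:

```shell
# Placeholders — substitute your own deployment values
HOST="https://<our_hostname>"
TOKEN="<api_token>"

# Revert the organization entity to the default (internal) OIDC setup
curl -s -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/vnd.gooddata.api+json" \
  -d '{
    "data": {
      "id": "default",
      "type": "organization",
      "attributes": {
        "name": "Default Organization",
        "hostname": "<our_hostname>"
      }
    }
  }' \
  "$HOST/api/v1/entities/admin/organizations/default"
```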
j
What exactly does the ‘same problem’ mean in this case? Is it the `unregistered_redirect_url` error popping up when you’re trying to access the hostname, or is it more?
t
yes, exactly that.
j
All right. I’ll see if I can reproduce the issue.
🙏 1
Let me just make sure that I’m understanding the steps correctly.
1. Run the community edition from the Docker image provided on our website.
2. Update the OIDC config and hostname at `/api/v1/entities/admin/organizations/default` with a PUT to some new OIDC provider.
3. Update `/api/v1/entities/admin/organizations/default` with a PUT that only contains:
```json
{
  "data": {
    "id": "default",
    "type": "organization",
    "attributes": {
      "name": "Default Organization",
      "hostname": "{{custom_hostname}}"
    }
  }
}
```
Is that all or am I missing some intermediate step?
t
Yes, that’s exactly what’s happened. @Martin Váňa, anything extra to add?
j
Out of curiosity, how did you manage to change the hostname in the first place? I’m ending up with:
```json
{
  "detail": "Organization hostname cannot be changed",
  "status": 400,
  "title": "Bad Request",
  "traceId": "14b5c6b83d65da75"
}
```
👀 1
j
Curious as well, as I want to change the hostname for HTTPS 🙂
r
The only way to change the organization hostname on GoodData CN Community Edition is by setting the `GDCN_PUBLIC_URL` environment variable on container start. Unfortunately, version 2.3.0 contains a bug that generates an invalid redirect_uri for the Dex OAuth2 client. This bug makes it practically impossible to use a public URL with the default port for the given protocol (80 for HTTP, 443 for HTTPS) 😞 The error has already been fixed and will not be present in the next release. Or you may use some recent development build (April 12th is the first containing the fix).
j
My error was actually that I was including a line break in the command, but the issue I'm having now is that when I start the Docker container it throws errors and ends up force-stopping. That may be related to the error you're mentioning here, so I'll attempt to use the more recent preview build to see if that gets me around it.
r
If you're using a released version (like 2.3.0) AND using GDCN_PUBLIC_URL with a URL without a port (like https://whatever.com), THEN you're affected.
j
Got it, so the force stop is probably related. When I get home I'll try with a different declared version.
r
The container starts, but you can't log in, getting an error like `Unregistered redirect url https://whatever.com:/login/oauth2/code/whatever.com`
j
Ah it's not getting that far, the container stops and the logs indicate errors.
r
If the container doesn't start at all, this is a different issue. Please share the log output and we'll check it.
This is the very end of the logs; I need to know what happened earlier. You can redirect the whole `docker logs` output to a file and send it to me. I will check what's wrong with your container.
j
@Robert Moucha
this was my command:
```shell
sudo docker run -i -t -p 3000:3000 -p 5432:5432 -v gooddata-dev:/data \
  -e GDCN_TOKEN_SECRET=XXXXXXXXXXXX \
  -e GDCN_PUBLIC_URL=https://analytics.novelcx.com \
  -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES \
  gooddata/gooddata-cn-ce:dev_20230517.a881988f
```
r
You pressed Ctrl-C in the container's terminal. This action sends a SIGINT signal to the supervisor, which shuts down the whole application stack:
```
============= All services of GoodData.CN are ready =============

127.0.0.1 - - [22/May/2023:11:54:26 +0000] "GET / HTTP/1.1" 200 5639 "-" "curl/7.74.0"
Nginx: ready
s6-rc: info: service nginx successfully started
s6-rc: info: service legacy-services: starting
s6-rc: info: service legacy-services successfully started
172.31.90.40 - - [22/May/2023:11:54:38 +0000] "GET / HTTP/1.1" 200 2587 "-" "ELB-HealthChecker/2.0"
^C
exiting...
s6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service nginx: stopping
```
^^^ Note the `^C` in the output.
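One way to avoid an accidental Ctrl-C killing the stack is to run the container detached and follow the logs separately. A sketch, reusing the flags from the command above:

```shell
# Run detached (-d) instead of interactive (-i -t), so ^C in your
# terminal never reaches the supervisor inside the container
sudo docker run -d --name gooddata-dev \
  -p 3000:3000 -p 5432:5432 -v gooddata-dev:/data \
  -e GDCN_TOKEN_SECRET=XXXXXXXXXXXX \
  -e GDCN_PUBLIC_URL=https://analytics.novelcx.com \
  -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES \
  gooddata/gooddata-cn-ce:dev_20230517.a881988f

# Follow the logs; Ctrl-C here only stops the log stream, not the container
sudo docker logs -f gooddata-dev

# Stop gracefully when needed, giving the services time to shut down
sudo docker stop -t 60 gooddata-dev
```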
j
I noticed that, thinking it was part of the script.
Made it to step 4: using the bootstrap token, upload the Organization layout with data source passwords (PUT /api/v1/layout/organization). Since the organization still has the internal OIDC set up, UI logins won't work at this moment. However, I get a series of errors from an incompatible PUT, which is a copy-paste of what I backed up from the last org. So far, using my old org layout, I have replaced the authId with my current one, deleted oauthIssuerLocation since I'm attempting to revert to the internal OIDC, and updated my old org layout field oauthClientId to match the current layout. I've also had to replace my hostname with my updated https domain, since that also threw an error. I'm out of ideas for resolving the error below. For anyone else trying to follow: I had to add "password": "database_password_here" under the type, url and username fields within the JSON schema.
```json
{
  "detail": "Invalid combination of auth properties. Specify either none or user + password or token. Datasource: XXXX, user: <yes>, password: <no>, token: <no>",
  "status": 400,
  "title": "Bad Request",
  "traceId": "7df800d37c2eea44"
}
```
@Robert Moucha I think I'm almost back, using gooddata-cn-ce:dev_20230517.a881988f since I changed the domain to https://analytics.novelcx.com. I have completed the PUT restoring my Auth0 info. I then updated the Auth0 client to the values from the guides I used when this worked before. After saving and waiting several minutes, I am getting this error, which did not happen before, with a redirect to https://analytics.novelcx.com/oauth2/authorization/analytics.novelcx.com:
```json
{
  "title": "Unauthorized",
  "status": 401,
  "detail": "401 UNAUTHORIZED \"Authorization failed for given issuer \"https://ncx-prod.auth0.com/authorize/\"\"",
  "traceId": "4b70fd5992b55eca"
}
```
r
the culprit is the issuer is set to
Copy code
<https://ncx-prod.auth0.com/authorize/>
oauthIssuerLocation
needs to be set to:
Copy code
<https://ncx-prod.auth0.com/>
in oidc configuration. Do not forget adding the trailing slash, Auth0 requires it.
Explanation: The configuration of OIDC is taken from well-known cfg endpoint (https://ncx-prod.auth0.com/.well-known/openid-configuration in your case). Issuer is
<https://ncx-prod.auth0.com/>
and authorization endpoint is retrieved automatically from openid-configuration document. So do not append
/authorize/
or whatever else to Issuer URL. See https://www.gooddata.com/developers/cloud-native/doc/2.3/manage-organization/set-up-authentication/#SetUpAuthenticationUsi[…]tIdentityProvider-Auth0 for Auth0-specific comments.
❤️ 1
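The correct issuer value can be verified directly from the discovery document mentioned above (`python3 -m json.tool` is used here only for pretty-printing; `jq` would work equally well):

```shell
# Fetch the OIDC discovery document and check the advertised
# issuer and authorization endpoint
curl -s https://ncx-prod.auth0.com/.well-known/openid-configuration \
  | python3 -m json.tool \
  | grep -E '"(issuer|authorization_endpoint)"'
```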
Regarding the database credentials, see https://www.gooddata.com/developers/cloud-native/doc/2.3/manage-organization/organization-api/backups/. You need to add either `username` and `password` OR `token` (for a db like BigQuery) to every exported datasource in your Organization layout.
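For illustration, re-adding credentials to one exported data source and putting the layout back might look like the sketch below. All values, file names, and the `POSTGRESQL` type string are placeholders or assumptions; `HOST` and `TOKEN` are your deployment's URL and API token:

```shell
# Example shape of a single data source entry in the exported layout,
# with "username" and "password" re-added (token-based sources would
# carry "token" instead). Values here are placeholders.
cat > datasource-fragment.json <<'EOF'
{
  "id": "my-datasource",
  "type": "POSTGRESQL",
  "url": "jdbc:postgresql://db.example.com:5432/mydb",
  "username": "db_user",
  "password": "database_password_here"
}
EOF

# PUT the full organization layout file (with the fixed data sources) back
curl -s -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @organization-layout.json \
  "$HOST/api/v1/layout/organization"
```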