# gooddata-cn
t
We are failing to run the docker container `gooddata/gooddata-cn-ce:3.x` orchestrated in a k8s cluster exposing the gdcn api service (this way we are trying to simplify the deployment and resource consumption). It works if we start a fresh new k8s pod with a fresh new persistent volume (PVC provisioned as a GKE GCE persistent disk), however it starts to fail after we restart the deployment and never reaches the "running" state (it keeps on restarting). I suspect the issue is in the state that is persisted to the disk, however I couldn't find where the problem could be. I can see some errors during start in the bookkeeper/pulsar logs:
```
2024-04-29T09:58:05,754+0000 [BookKeeperClientWorker-OrderedExecutor-3-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - While readLastConfirmed ledger: 31 did not hear success responses from all quorums, QuorumCoverage(e:1,w:1,a:1) = [-8]
2024-04-29T09:58:05,754+0000 [BookKeeperClientWorker-OrderedExecutor-2-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/standalone/localhost:8080/persistent/healthcheck] Opened ledger 31: Error while recovering ledger
2024-04-29T09:58:05,754+0000 [BookKeeperClientWorker-OrderedExecutor-2-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/standalone/localhost:8080/persistent/healthcheck] Failed to open ledger 31: Error while recovering ledger
2024-04-29T09:58:05,755+0000 [BookKeeperClientWorker-OrderedExecutor-2-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - [pulsar/standalone/localhost:8080/persistent/healthcheck] Failed to initialize managed ledger: Error while recovering ledger error code: -10
2024-04-29T09:58:05,755+0000 [BookKeeperClientWorker-OrderedExecutor-2-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/standalone/localhost:8080/persistent/healthcheck] Closing managed ledger
2024-04-29T09:58:05,755+0000 [BookKeeperClientWorker-OrderedExecutor-2-0] WARN  org.apache.pulsar.broker.service.BrokerService - Failed to create topic <persistent://pulsar/standalone/localhost:8080/healthcheck>
```
This scenario works on older versions such as 2.5, but after moving to version 3.x (e.g. 3.7.0) it doesn't work. I'm attaching the k8s deployment snippet and logs of the pod (with PULSAR_LOGLEVEL=DEBUG). Can you please suggest how to debug it and what we could do to fix it? Maybe adjust the initialization script? We are OK with a temporary service outage (a few minutes).
k8s deployment snippet.yaml
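(The attached snippet isn't reproduced in this thread. For orientation, here is a minimal sketch of the kind of setup being described - a single-replica StatefulSet running the CE image with a GCE-PD-backed PVC. All names, sizes, and the storage class are hypothetical:)

```yaml
# Hypothetical sketch only - not the attached snippet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gooddata-cn-ce            # hypothetical name
spec:
  serviceName: gooddata-cn-ce
  replicas: 1
  selector:
    matchLabels:
      app: gooddata-cn-ce
  template:
    metadata:
      labels:
        app: gooddata-cn-ce
    spec:
      containers:
        - name: gooddata-cn-ce
          image: gooddata/gooddata-cn-ce:3.7.0
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: data
              mountPath: /data    # the CE image keeps its state under /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard-rwo   # GKE PD-backed class; hypothetical choice
        resources:
          requests:
            storage: 20Gi                # hypothetical size
```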
here are logs with PULSAR_LOGLEVEL=DEBUG
logs of successful first deploy with PULSAR_LOGLEVEL=WARN
logs of failing restarted deployment - keeps on restarting and never reaches running state
r
Hi Tomas, Radek from the GoodData technical team here! We're currently having a look at this, and will come back to you with guidance as soon as we have more 🙂
👍 1
t
Thank you Radek, I will wait then 🙂 btw I noticed that for a kubernetes deploy via helm chart installation it is recommended to set `securityContext.fsGroupChangePolicy` to `Always` together with `fsGroup` for the pulsar pods (zookeeper and bookkeeper), so I did the same for "my docker k8s deployment", but it failed since there are more users with different GIDs (e.g. for postgresql and redis), so it even failed on permissions. It was just a blind/naive shot at a fix, but I thought I would let you know..
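(For reference, this is roughly the securityContext that was tried, following the Helm chart recommendation - the GID is hypothetical, and as noted it does not work for the all-in-one CE pod:)

```yaml
# Hypothetical GID; this breaks the all-in-one CE image because the single
# volume holds data owned by several different users (root, postgres, redis, ...).
spec:
  securityContext:
    fsGroup: 10000
    fsGroupChangePolicy: Always
```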
r
`fsGroupChangePolicy` needs to be set to `Always` only if you upgrade an existing pulsar chart from a 2.x to a 3.x image version (the new images run the app as a non-root user, so the persistent data had to change ownership). For running the CE image as a k8s Pod, this setting should not be used - there's one volume that contains data belonging to multiple users (root, postgres, ...), so changing group ownership is not desired.
One thing I noticed while checking your pod template: the variable LICENSE_AND_PRIVACY_POLICY_ACCEPTED is deprecated, and since version 3.0.0 the CE image requires a license key. See https://www.gooddata.com/docs/cloud-native/3.8/deploy-and-install/community-edition/#InstallGoodData.CNContainerEdition-Installation
t
yes, LICENSE_AND_PRIVACY_POLICY_ACCEPTED is a leftover, we use GDCN_LICENSE_KEY loaded from a secret, it's not included in the snippet but it works, as I said, the first run works..
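(A sketch of how the key can be wired in from a secret - the secret name and key are hypothetical, since the actual secret isn't shown in the thread:)

```yaml
# Hypothetical secret name and key.
- name: GDCN_LICENSE_KEY
  valueFrom:
    secretKeyRef:
      name: gooddata-license
      key: license-key
```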
re `fsGroupChangePolicy` - I just tried it, thought it would help here, but I see it's not the way to go, so I don't specify it anymore.
is there a way to run the CE image as a k8s pod then? or would you recommend going with the full kubernetes install via helm?
r
We regularly run CE with a volume that keeps data. There's no objective reason why it shouldn't work in k8s. I will spin up GKE and try it, then let you know.
I managed to reproduce your problem, even on a locally running k3d cluster. The root cause is that rocksdb uses IP addresses for nodes - this data is persisted on the volume, and when the pod is recreated it gets a different IP address, so bookkeeper can't recover a ledger presumably located on a different node (IP address) and fails. The fix is rather simple and will be included in a future release. Meanwhile, you can make it work in k8s if you add the following env variable to your statefulset running the CE image:
```yaml
- name: PULSAR_STANDALONE_USE_ZOOKEEPER
  value: '1'
```
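For context, a minimal sketch of where this variable sits in the container spec (the container and image names are assumptions based on the thread):

```yaml
# Sketch only - container/image names assumed from the thread.
containers:
  - name: gooddata-cn-ce
    image: gooddata/gooddata-cn-ce:3.7.0
    env:
      - name: PULSAR_STANDALONE_USE_ZOOKEEPER
        value: '1'   # keeps standalone Pulsar on ZooKeeper for metadata instead of rocksdb
```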
🙌 1
@Tomáš Kačur In the attached file you can find the complete setup I used for tests. It's very similar to your setup with a few tweaks:
• fixed probes
• added service and ingress
• defined GDCN_PUBLIC_URL
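(The attached file isn't reproduced here; a rough sketch of what the GDCN_PUBLIC_URL and probe tweaks might look like - the host, probe path, and timings are all hypothetical:)

```yaml
# Hypothetical values - the actual attachment isn't shown in the thread.
env:
  - name: GDCN_PUBLIC_URL
    value: https://gooddata.example.com
readinessProbe:
  httpGet:
    path: /          # hypothetical probe endpoint
    port: 3000
  initialDelaySeconds: 60
  periodSeconds: 10
```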
👍 1
t
@Robert Moucha I have tried it and it is working, thank you! Can I also set this env PULSAR_STANDALONE_USE_ZOOKEEPER on the 2.5 deployment? I'm just wondering if I can do it as part of the preparation before the actual upgrade.
r
the older gooddata-cn-ce images use pulsar 2.10.5. The change that made pulsar start using rocksdb instead of ZK was introduced in 2.11.0, so it's safe to set this variable on older deployments; it will simply be ignored.
👍 1
I wonder how Pulsar developers will handle this situation in the future - they want to remove the ZK dependency, but reliance on static IP addresses or static hostnames in rocksdb will cause issues in dockerized standalone pulsar installations... But for now, setting PULSAR_STANDALONE_USE_ZOOKEEPER solves this issue.