# gooddata-cn
t
We are failing to run the docker container `gooddata/gooddata-cn-ce:3.x` orchestrated in a k8s cluster exposing the gdcn api service (this way we are trying to simplify the deployment and resource consumption). It works if we start a fresh new k8s pod with a fresh new persistent volume (PVC provisioned as a GKE GCE persistent disk), however it starts to fail after we restart the deployment and never reaches the "running" state (it keeps on restarting). I suspect the issue is in the state that is persisted to the disk, however I couldn't find where the problem could be. I can see some errors during start in the bookkeeper/pulsar logs:
```
2024-04-29T09:58:05,754+0000 [BookKeeperClientWorker-OrderedExecutor-3-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - While readLastConfirmed ledger: 31 did not hear success responses from all quorums, QuorumCoverage(e:1,w:1,a:1) = [-8]
2024-04-29T09:58:05,754+0000 [BookKeeperClientWorker-OrderedExecutor-2-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/standalone/localhost:8080/persistent/healthcheck] Opened ledger 31: Error while recovering ledger
2024-04-29T09:58:05,754+0000 [BookKeeperClientWorker-OrderedExecutor-2-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/standalone/localhost:8080/persistent/healthcheck] Failed to open ledger 31: Error while recovering ledger
2024-04-29T09:58:05,755+0000 [BookKeeperClientWorker-OrderedExecutor-2-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - [pulsar/standalone/localhost:8080/persistent/healthcheck] Failed to initialize managed ledger: Error while recovering ledger error code: -10
2024-04-29T09:58:05,755+0000 [BookKeeperClientWorker-OrderedExecutor-2-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/standalone/localhost:8080/persistent/healthcheck] Closing managed ledger
2024-04-29T09:58:05,755+0000 [BookKeeperClientWorker-OrderedExecutor-2-0] WARN  org.apache.pulsar.broker.service.BrokerService - Failed to create topic <persistent://pulsar/standalone/localhost:8080/healthcheck>
```
This scenario works on older versions such as 2.5, but after moving to version 3.x (e.g. 3.7.0) it doesn't work. I'm attaching the k8s deployment snippet and logs of the pod (with PULSAR_LOGLEVEL=DEBUG). Can you please suggest how to debug it and what we could do to fix it? Maybe adjust the initialization script? We are OK with a temporary service outage (a few minutes).
k8s deployment snippet.yaml
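(The attached snippet isn't reproduced in this thread. For orientation, here is a minimal sketch of the kind of setup being described - a single-replica StatefulSet running the CE image with a GCE-PD-backed PVC. All names, sizes, and the storage class are hypothetical:)

```yaml
# Hypothetical sketch only - not the attached snippet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gooddata-cn-ce            # hypothetical name
spec:
  serviceName: gooddata-cn-ce
  replicas: 1
  selector:
    matchLabels:
      app: gooddata-cn-ce
  template:
    metadata:
      labels:
        app: gooddata-cn-ce
    spec:
      containers:
        - name: gooddata-cn-ce
          image: gooddata/gooddata-cn-ce:3.7.0
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: data
              mountPath: /data    # the CE image keeps its state under /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard-rwo   # GKE PD-backed class; hypothetical choice
        resources:
          requests:
            storage: 20Gi                # hypothetical size
```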
here are logs with PULSAR_LOGLEVEL=DEBUG
logs of successful first deploy with PULSAR_LOGLEVEL=WARN
logs of failing restarted deployment - keeps on restarting and never reaches running state
r
Hi Tomas, Radek from the GoodData technical team here! We're currently having a look at this, and will come back to you with guidance as soon as we have more 🙂
👍 1
t
Thank you Radek, I will wait then 🙂 btw I noticed that for a kubernetes deploy via helm chart installation it is recommended to set `securityContext.fsGroupChangePolicy` to `Always` together with `fsGroup` for the pulsar pods (zookeeper and bookkeeper), so I did the same for "my docker k8s deployment", but it failed since there are more users with different GIDs (e.g. for postgresql and redis), so it even failed on permissions. It was just a blind/naive shot at a fix, but I thought I would let you know..
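(For reference, this is roughly the securityContext that was tried, following the Helm chart recommendation - the GID is hypothetical, and as noted it does not work for the all-in-one CE pod:)

```yaml
# Hypothetical GID; this breaks the all-in-one CE image because the single
# volume holds data owned by several different users (root, postgres, redis, ...).
spec:
  securityContext:
    fsGroup: 10000
    fsGroupChangePolicy: Always
```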
r
`fsGroupChangePolicy` needs to be set to `Always` only if you upgrade an existing pulsar chart from a 2.x to a 3.x image version (the new images run the app as a non-root user, so the persistent data had to change ownership). For running the CE image as a k8s Pod, this setting should not be used - there's one volume that contains data belonging to multiple users (root, postgres, ...), so changing group ownership is not desired.
One thing I noticed while checking your pod template: the variable LICENSE_AND_PRIVACY_POLICY_ACCEPTED is deprecated, and since version 3.0.0 the CE image requires a license key. See https://www.gooddata.com/docs/cloud-native/3.8/deploy-and-install/community-edition/#InstallGoodData.CNContainerEdition-Installation
t
yes, LICENSE_AND_PRIVACY_POLICY_ACCEPTED is a leftover, we use GDCN_LICENSE_KEY loaded from a secret, it's not included in the snippet but it works, as I said, the first run works..
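(A sketch of how the key can be wired in from a secret - the secret name and key are hypothetical, since the actual secret isn't shown in the thread:)

```yaml
# Hypothetical secret name and key.
- name: GDCN_LICENSE_KEY
  valueFrom:
    secretKeyRef:
      name: gooddata-license
      key: license-key
```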
re `fsGroupChangePolicy` - I just tried it, thought it would help here, but I see it's not the way to go, so I don't specify it anymore.
is there a way to run the CE image as a k8s pod then? or would you recommend going with the full kubernetes install via helm?
r
We regularly run CE with a volume that keeps data. There's no objective reason why it shouldn't work in k8s. I will spin up GKE and try it, then let you know.
I managed to reproduce your problem, even on a locally running k3d cluster. The root cause is that rocksdb uses IP addresses for nodes - this data is persisted on the volume, and when the pod is recreated it gets a different IP address, so bookkeeper can't recover a ledger presumably located on a different node (IP address) and fails. The fix is rather simple and will be included in a future release. Meanwhile, you can make it work in k8s if you add the following env variable to your statefulset running the CE image:
```yaml
- name: PULSAR_STANDALONE_USE_ZOOKEEPER
  value: '1'
```
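For context, a minimal sketch of where this variable sits in the container spec (the container and image names are assumptions based on the thread):

```yaml
# Sketch only - container/image names assumed from the thread.
containers:
  - name: gooddata-cn-ce
    image: gooddata/gooddata-cn-ce:3.7.0
    env:
      - name: PULSAR_STANDALONE_USE_ZOOKEEPER
        value: '1'   # keeps standalone Pulsar on ZooKeeper for metadata instead of rocksdb
```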
🙌 1
@Tomáš Kačur In the attached file you can find the complete setup I used for tests. It's very similar to your setup with a few tweaks:
• fixed probes
• added service and ingress
• defined GDCN_PUBLIC_URL
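(The attached file isn't reproduced here; a rough sketch of what the GDCN_PUBLIC_URL and probe tweaks might look like - the host, probe path, and timings are all hypothetical:)

```yaml
# Hypothetical values - the actual attachment isn't shown in the thread.
env:
  - name: GDCN_PUBLIC_URL
    value: https://gooddata.example.com
readinessProbe:
  httpGet:
    path: /          # hypothetical probe endpoint
    port: 3000
  initialDelaySeconds: 60
  periodSeconds: 10
```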
👍 1
t
@Robert Moucha I have tried it and it is working, thank you! Can I also set this env PULSAR_STANDALONE_USE_ZOOKEEPER on the 2.5 deployment? I'm just wondering if I can do it as part of the preparation before the actual upgrade.
r
the older gooddata-cn-ce images use pulsar 2.10.5. The change that made pulsar start using rocksdb instead of ZK was introduced in 2.11.0, so it's safe to set this variable on older deployments; it will simply be ignored.
👍 1
I wonder how Pulsar developers will handle this situation in the future - they want to remove the ZK dependency, but reliance on static IP addresses or static hostnames in rocksdb will cause issues in dockerized standalone pulsar installations... But for now, setting PULSAR_STANDALONE_USE_ZOOKEEPER solves this issue.