# gooddata-cn
Pete Lorenz:
The pulsar-bookie StatefulSet on our dev cluster has recently become unstable with one of the 3 pods in the StatefulSet stuck in CrashLoopBackoff, but the other 2 pods are running without error. We're not sure why it is just one of the pods that's unable to attain Ready status. One of the errors in the logs is "Space left on device /pulsar/data/bookkeeper/ledgers/current : 0". If this is the cause, how would we increase the space and/or clean the existing space? I'm attaching the pod describe and logs of the 2 containers in the failing pod.
Tomas Rohrer:
Hello @Pete Lorenz 👋, I took a look, and the reason your pod is endlessly restarting/crashing is the error you found, good catch 👍. One way to solve this is to manually expand that volume. The generic command is `kubectl edit pvc -n pulsar <PersistentVolumeClaim_name>`. In your case, the PersistentVolumeClaim name should be `pulsar-bookie-ledgers-pulsar-bookie-2`, based on the output of your `describe` command. This simple article may be useful if you are struggling with resizing your PVC. I hope it helps 🤞.
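If it helps, here is a rough sketch of what that resize could look like with `kubectl patch` instead of `kubectl edit` (the 20Gi target is only an example value, and expansion only works if the PVC's StorageClass has `allowVolumeExpansion: true`):

```bash
# Check the current size of the ledgers PVC
kubectl -n pulsar get pvc pulsar-bookie-ledgers-pulsar-bookie-2

# Bump spec.resources.requests.storage (the same field you would change via `kubectl edit`);
# 20Gi is just an example target size
kubectl -n pulsar patch pvc pulsar-bookie-ledgers-pulsar-bookie-2 \
  --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

# Watch until the new capacity is reported, then check that the bookie pod becomes Ready again
kubectl -n pulsar get pvc pulsar-bookie-ledgers-pulsar-bookie-2 -w
```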
Martin Burian:
Hey @Pete Lorenz, this seems to be an issue isolated to Pulsar rather than related to GD.CN. I hope you will be able to resolve it with the advice provided above.
Pete Lorenz:
Thank you @Tomas Rohrer and @Martin Burian, increasing the PVC capacity has resolved the issue
r
Hi, the error:

```
ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 0, Used space fraction: 1.0 > threshold 0.95.
```

means the ledgers volume of one of your bookie servers is full. That should never happen under normal conditions, because GoodData CN doesn't store data in Pulsar for a long time. It's possible that some Pulsar topic contains many messages that are not being picked up by the application. You need to identify which Pulsar topic is growing.

1. Connect to one of the Pulsar brokers (e.g. `pulsar-broker-0`) using a `bash` shell: `kubectl -n pulsar exec -it pulsar-broker-0 -- bash`
2. Using the `bin/pulsar-admin` command, discover all available topics:
   a. `bin/pulsar-admin topics list <<gdcn-namespace>>/<<gdcn-release>>` (replace `<<gdcn-namespace>>` with the namespace where the gooddata-cn chart is installed, typically `gooddata-cn`, and replace `<<gdcn-release>>` with the name of the Helm release used when the GoodData CN chart was installed, typically also `gooddata-cn`). So the command should look similar to `bin/pulsar-admin topics list gooddata-cn/gooddata-cn`
   b. The `topics list` subcommand will return a list of Pulsar topics that will look like `persistent://gooddata-cn/gooddata-cn/caches.garbage-collect`
3. For each of these topics, EXCEPT the system topic `__change_events`, run the `topics stats` subcommand to see the backlog length. The full command will look like `bin/pulsar-admin topics stats persistent://gooddata-cn/gooddata-cn/caches.garbage-collect` (repeat for every topic).
   a. The `topics stats` output is a JSON-formatted report. Look for the `backlogSize` top-level key; it should be zero or close to zero. If it is not, please send us the full output of the `topics stats` command that shows the high backlog size.
4. Truncate topics with `backlogSize` >> 0 using the `topics truncate` subcommand, e.g. `bin/pulsar-admin topics truncate persistent://gooddata-cn/gooddata-cn/caches.garbage-collect`. A helper-script sketch combining steps 2-4 follows below.
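If it helps, here is a rough helper-script sketch that strings steps 2-4 together. It assumes you run it inside the broker pod, that your tenant/namespace is `gooddata-cn/gooddata-cn`, and that `jq` is available in the image (it may not be; without it, just read the JSON output by eye):

```bash
#!/usr/bin/env bash
# Sketch: report the backlog size of every application topic, optionally truncate the big ones.
set -euo pipefail

TENANT_NS="gooddata-cn/gooddata-cn"   # <<gdcn-namespace>>/<<gdcn-release>>

for topic in $(bin/pulsar-admin topics list "$TENANT_NS"); do
  # Skip the system topic, as noted above
  case "$topic" in *__change_events*) continue ;; esac

  backlog=$(bin/pulsar-admin topics stats "$topic" | jq -r '.backlogSize')
  echo "$topic backlogSize=$backlog"

  # Uncomment to truncate topics whose backlog exceeds an arbitrary 1 MiB threshold
  # if [ "$backlog" -gt 1048576 ]; then
  #   bin/pulsar-admin topics truncate "$topic"
  # fi
done
```

Running it with the truncate part commented out first shows which topic is the culprit before anything is deleted.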