# gooddata-cn
Pete Lorenz:
The pulsar-bookie StatefulSet on our dev cluster has recently become unstable with one of the 3 pods in the StatefulSet stuck in CrashLoopBackoff, but the other 2 pods are running without error. We're not sure why it is just one of the pods that's unable to attain Ready status. One of the errors in the logs is "Space left on device /pulsar/data/bookkeeper/ledgers/current : 0". If this is the cause, how would we increase the space and/or clean the existing space? I'm attaching the pod describe and logs of the 2 containers in the failing pod.
Tomas Rohrer:
Hello @Pete Lorenz 👋, I took a look, and the reason your pod is endlessly restarting/crashing is the error you found, good catch 👍. One way to solve this is to manually expand that volume. The generic command is `kubectl edit pvc -n pulsar <PersistentVolumeClaim_name>`. In your case, the PersistentVolumeClaim name should be `pulsar-bookie-ledgers-pulsar-bookie-2`, based on the output of your `describe` command. This simple article may be useful if you are struggling with resizing your PVC. I hope it helps 🤞.
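If it helps, here is a rough sketch of what that resize could look like with `kubectl patch` instead of `kubectl edit` (the 20Gi target is only an example value, and expansion only works if the PVC's StorageClass has `allowVolumeExpansion: true`):

```bash
# Check the current size of the ledgers PVC
kubectl -n pulsar get pvc pulsar-bookie-ledgers-pulsar-bookie-2

# Bump spec.resources.requests.storage (the same field you would change via `kubectl edit`);
# 20Gi is just an example target size
kubectl -n pulsar patch pvc pulsar-bookie-ledgers-pulsar-bookie-2 \
  --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

# Watch until the new capacity is reported, then check that the bookie pod becomes Ready again
kubectl -n pulsar get pvc pulsar-bookie-ledgers-pulsar-bookie-2 -w
```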
Martin Burian:
Hey @Pete Lorenz, this seems to be an issue isolated to Pulsar rather than related to GD.CN. I hope you will be able to resolve it with the advice provided above.
Pete Lorenz:
Thank you @Tomas Rohrer and @Martin Burian, increasing the PVC capacity has resolved the issue
r
Hi, the error:

```
ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 0, Used space fraction: 1.0 > threshold 0.95.
```

means the ledgers volume of one of your bookie servers is full. That should never happen under normal conditions, because GoodData CN doesn't store data in Pulsar for a long time. It's possible that some Pulsar topic contains many messages that are not being picked up by the application. You need to identify which Pulsar topic is growing.

1. Connect to one of the Pulsar brokers (e.g. `pulsar-broker-0`) using a `bash` shell: `kubectl -n pulsar exec -it pulsar-broker-0 -- bash`
2. Using the `bin/pulsar-admin` command, discover all available topics:
   a. `bin/pulsar-admin topics list <<gdcn-namespace>>/<<gdcn-release>>` (replace `<<gdcn-namespace>>` with the namespace where the gooddata-cn chart is installed, typically `gooddata-cn`, and replace `<<gdcn-release>>` with the name of the Helm release used when the GoodData CN chart was installed, typically also `gooddata-cn`). So the command should look similar to `bin/pulsar-admin topics list gooddata-cn/gooddata-cn`
   b. The `topics list` subcommand will return a list of Pulsar topics that will look like `persistent://gooddata-cn/gooddata-cn/caches.garbage-collect`
3. For each of these topics, EXCEPT the system topic `__change_events`, run the `topics stats` subcommand to see the backlog length. The full command will look like `bin/pulsar-admin topics stats persistent://gooddata-cn/gooddata-cn/caches.garbage-collect` (repeat for every topic).
   a. The `topics stats` output is a JSON-formatted report. Look for the `backlogSize` top-level key; it should be zero or close to zero. If it is not, please send us the full output of the `topics stats` command that shows the high backlog size.
4. Truncate topics with `backlogSize` >> 0 using the `topics truncate` subcommand, e.g. `bin/pulsar-admin topics truncate persistent://gooddata-cn/gooddata-cn/caches.garbage-collect`. A helper-script sketch combining steps 2-4 follows below.
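If it helps, here is a rough helper-script sketch that strings steps 2-4 together. It assumes you run it inside the broker pod, that your tenant/namespace is `gooddata-cn/gooddata-cn`, and that `jq` is available in the image (it may not be; without it, just read the JSON output by eye):

```bash
#!/usr/bin/env bash
# Sketch: report the backlog size of every application topic, optionally truncate the big ones.
set -euo pipefail

TENANT_NS="gooddata-cn/gooddata-cn"   # <<gdcn-namespace>>/<<gdcn-release>>

for topic in $(bin/pulsar-admin topics list "$TENANT_NS"); do
  # Skip the system topic, as noted above
  case "$topic" in *__change_events*) continue ;; esac

  backlog=$(bin/pulsar-admin topics stats "$topic" | jq -r '.backlogSize')
  echo "$topic backlogSize=$backlog"

  # Uncomment to truncate topics whose backlog exceeds an arbitrary 1 MiB threshold
  # if [ "$backlog" -gt 1048576 ]; then
  #   bin/pulsar-admin topics truncate "$topic"
  # fi
done
```

Running it with the truncate part commented out first shows which topic is the culprit before anything is deleted.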