# gooddata-cn
p
We're seeing a PulsarAdminException:ServerSideErrorException in our calcique, metadata-api, afm-exec, sql-executor, and other service logs:
{"ts":"2024-01-02 20:56:00.496","level":"ERROR","logger":"org.springframework.boot.SpringApplication","thread":"main","msg":"Application run failed","exc":"org.apache.pulsar.client.admin.PulsarAdminException$ServerSideErrorException: HTTP 500 Internal Server Error\n\tat org.apache.pulsar.client.admin.PulsarAdminException.wrap(PulsarAdminException.java:252)\n\tat org.apache.pulsar.client.admin.internal.BaseResource.sync(BaseResource.java:302)\n\tat org.apache.pulsar.client.admin.internal.TopicsImpl.createNonPartitionedTopic(TopicsImpl.java:340)\n\tat org.apache.pulsar.client.admin.Topics.createNonPartitionedTopic(Topics.java:482)\n\tat com.gooddata.tiger.pulsar.PulsarAutoConfiguration.producerBeanFactory$lambda-2(PulsarAutoConfiguration.kt:131)\n\tat org.springframework.context.support.PostProcessorRegistrationDelegate.invokeBeanFactoryPostProcessors(PostProcessorRegistrationDelegate.java:325)\n\tat
...
In the pulsar namespace, the pulsar-bookie-1 pod is in CrashLoopBackOff, but pulsar-bookie-0 and pulsar-bookie-2 are running and available. I've tried restarting the pulsar broker and bookie pods as well as the affected gooddata pods, but the same error occurs. I'm attaching the logs from the failing bookie pod as well as from a failing zookeeper pod. Please let us know any ideas we can try to resolve this. Thanks so much!
r
There's an exception in the bookie log:
Caused by: org.rocksdb.RocksDBException: While appending to file: /pulsar/data/bookkeeper/ledgers/current/ledgers/023002.dbtmp: No space left on device
This error means that the volume bound to the PVC pulsar-bookie-ledgers-pulsar-bookie-1 is full. It usually has 5Gi capacity, which should be sufficient. A bookie running out of space suggests that messages are not being dispatched correctly and are accumulating in topics. I recommend a deeper investigation to see which topics are causing trouble. You can connect to one of the brokers and use the bin/pulsar-admin command to inspect the backlog size of topics. Refer to https://pulsar.apache.org/reference/#/2.11.x/pulsar-admin/topics for information on how to use the pulsar-admin CLI tool. The most important subcommands are
bin/pulsar-admin topics list <tenant>/<namespace>
and
bin/pulsar-admin topics stats persistent://<tenant>/<namespace>/<topic>
(where <tenant>/<namespace> are the Pulsar tenant and namespace, typically gooddata-cn/gooddata-cn). The stats subcommand returns JSON-formatted output for the given topic; look for a backlogSize that is non-zero (or much higher than zero).
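The backlog check can be sketched in Python. This is a minimal sketch, assuming the JSON printed by the stats subcommand has already been captured for each topic. The helper name, the threshold parameter, and the sample numbers are illustrative and not taken from this deployment.

```python
import json


def topics_with_backlog(stats_json_by_topic, threshold_bytes=0):
    """Return {topic: backlogSize} for topics whose backlog exceeds the threshold.

    stats_json_by_topic maps a topic name to the raw JSON text printed by
    `bin/pulsar-admin topics stats <topic>`. Only the backlogSize field is
    read; treating a missing field as 0 is an assumption of this sketch.
    """
    flagged = {}
    for topic, raw in stats_json_by_topic.items():
        stats = json.loads(raw)
        backlog = int(stats.get("backlogSize", 0))
        if backlog > threshold_bytes:
            flagged[topic] = backlog
    return flagged


# Illustrative input only; these numbers are made up for the example.
sample = {
    "persistent://gooddata-cn/gooddata-cn/compute.calcique": '{"backlogSize": 0}',
    "persistent://gooddata-cn/gooddata-cn/sql.select": '{"backlogSize": 734003200}',
}
for topic, size in topics_with_backlog(sample).items():
    print(f"{topic}: {size} bytes of backlog")
```

Raising threshold_bytes filters out topics that hold only a small, transient backlog.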
p
Thanks Robert. Will investigate.
r
Please refer to https://gooddataconnect.slack.com/archives/C01P3H2HTDL/p1700322114104059?thread_ts=1699982207.551759&cid=C01P3H2HTDL where I already suggested these steps a few months ago. As I wrote, there's something fishy in your deployment that causes the Pulsar bookies to fill up. I recommend resolving the root cause rather than expanding the ledger volume.
p
Thank you, Robert. I've listed the topics on our single broker instance and run the following command for each topic except the system topic:
bin/pulsar-admin topics stats persistent://gooddata-cn/gooddata-cn/[topic]
It appears that "backlogSize" is 0 for every topic (as of now). The list of topics also seems to change over time. Is this expected? For example, I ran topics list at first and got:
root@pulsar-broker-0:/pulsar# bin/pulsar-admin topics list gooddata-cn/gooddata-cn
"persistent://gooddata-cn/gooddata-cn/__change_events"
"persistent://gooddata-cn/gooddata-cn/metadata.model.calcique.DLQ"
"persistent://gooddata-cn/gooddata-cn/export-tabular.request.DLQ"
"persistent://gooddata-cn/gooddata-cn/cache-settings.change"
"persistent://gooddata-cn/gooddata-cn/export-visual.request"
"persistent://gooddata-cn/gooddata-cn/metadata.model.DLQ"
"persistent://gooddata-cn/gooddata-cn/metadata.model"
"persistent://gooddata-cn/gooddata-cn/export-visual.request.DLQ"
"persistent://gooddata-cn/gooddata-cn/export-tabular.request"
"persistent://gooddata-cn/gooddata-cn/cache-settings.bootstrap"
"persistent://gooddata-cn/gooddata-cn/compute.calcique.DLQ"
"persistent://gooddata-cn/gooddata-cn/result.xtab.DLQ"
"persistent://gooddata-cn/gooddata-cn/cache-settings.change.DLQ"
"persistent://gooddata-cn/gooddata-cn/metadata.cache-command"
"persistent://gooddata-cn/gooddata-cn/caches.garbage-collect.DLQ"
"persistent://gooddata-cn/gooddata-cn/data-source.change.DLQ"
"persistent://gooddata-cn/gooddata-cn/data-source.change.calcique.DLQ"
"persistent://gooddata-cn/gooddata-cn/compute.calcique"
About 20 minutes later, I ran the same command and the list of topics is different:
root@pulsar-broker-0:/pulsar# bin/pulsar-admin topics list gooddata-cn/gooddata-cn
"persistent://gooddata-cn/gooddata-cn/__change_events"
"persistent://gooddata-cn/gooddata-cn/export-tabular.request.DLQ"
"persistent://gooddata-cn/gooddata-cn/cache-settings.change"
"persistent://gooddata-cn/gooddata-cn/export-visual.request"
"persistent://gooddata-cn/gooddata-cn/export-visual.request.DLQ"
"persistent://gooddata-cn/gooddata-cn/export-tabular.request"
"persistent://gooddata-cn/gooddata-cn/cache-settings.bootstrap"
"persistent://gooddata-cn/gooddata-cn/cache-settings.change.DLQ"
"persistent://gooddata-cn/gooddata-cn/metadata.cache-command"
"persistent://gooddata-cn/gooddata-cn/sql.select.DLQ"
"persistent://gooddata-cn/gooddata-cn/caches.garbage-collect.DLQ"
"persistent://gooddata-cn/gooddata-cn/data-source.change.DLQ"
"persistent://gooddata-cn/gooddata-cn/data-source.change.calcique.DLQ"
"persistent://gooddata-cn/gooddata-cn/compute.calcique"
Note that the "metadata.model" topic is in the first list but not the second. I'm wondering if this is expected.
Another thing, since our volume claims were 1Gi and you mentioned that 5Gi is normal, we've increased the size of the claims to 5Gi for journal and ledger volumes.
Our pulsar deployment is now up with full availability. I'll keep an eye on the storage usage on its volumes for any anomalies.
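One way to watch the volumes for anomalies is a small usage check run wherever the volume is mounted. This is a rough sketch, assuming a Python interpreter is available there. The function names and the 80% warning threshold are illustrative choices, and the demo runs against the current directory so that it works anywhere.

```python
import shutil


def usage_ratio(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem that holds path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report zero size
    return usage.used / usage.total


def check_volume(path, warn_at=0.8):
    """Return (ratio, over_threshold) for the filesystem that holds path.

    warn_at is a hypothetical threshold; pick one that suits the volume size.
    """
    ratio = usage_ratio(path)
    return ratio, ratio >= warn_at


# Demo against the current directory. Inside a bookie pod the path would be
# the ledger directory seen in the error, e.g. /pulsar/data/bookkeeper/ledgers.
ratio, over = check_volume(".")
print(f"used: {ratio:.1%}, over threshold: {over}")
```

A steadily rising ratio while every backlogSize stays at 0 would point at something other than undelivered messages filling the disk.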
r
Topics should not go missing; it may be a consequence of the bookie recovering after the disk-full condition. From the output you provided, I can see that multiple topics are missing:
persistent://gooddata-cn/gooddata-cn/caches.garbage-collect
persistent://gooddata-cn/gooddata-cn/compute.calcique.DLQ
persistent://gooddata-cn/gooddata-cn/data-source.change
persistent://gooddata-cn/gooddata-cn/metadata.model
persistent://gooddata-cn/gooddata-cn/result.xtab
persistent://gooddata-cn/gooddata-cn/sql.select
Topics are created dynamically by the application. To make sure the messaging stack is working correctly, please restart both Pulsar brokers first, and once they are up, perform a rolling restart of the following gooddata-cn deployments: calcique, sql-executor, result-cache. This is the list of topics that should exist:
persistent://gooddata-cn/gooddata-cn/cache-settings.bootstrap
persistent://gooddata-cn/gooddata-cn/cache-settings.change
persistent://gooddata-cn/gooddata-cn/cache-settings.change.DLQ
persistent://gooddata-cn/gooddata-cn/caches.garbage-collect
persistent://gooddata-cn/gooddata-cn/caches.garbage-collect.DLQ
persistent://gooddata-cn/gooddata-cn/compute.calcique
persistent://gooddata-cn/gooddata-cn/compute.calcique.DLQ
persistent://gooddata-cn/gooddata-cn/data-source.change
persistent://gooddata-cn/gooddata-cn/data-source.change.calcique.DLQ
persistent://gooddata-cn/gooddata-cn/data-source.change.DLQ
persistent://gooddata-cn/gooddata-cn/export-tabular.request
persistent://gooddata-cn/gooddata-cn/export-tabular.request.DLQ
persistent://gooddata-cn/gooddata-cn/export-visual.request
persistent://gooddata-cn/gooddata-cn/export-visual.request.DLQ
persistent://gooddata-cn/gooddata-cn/metadata.cache-command
persistent://gooddata-cn/gooddata-cn/metadata.model
persistent://gooddata-cn/gooddata-cn/metadata.model.calcique.DLQ
persistent://gooddata-cn/gooddata-cn/metadata.model.DLQ
persistent://gooddata-cn/gooddata-cn/result.xtab
persistent://gooddata-cn/gooddata-cn/result.xtab.DLQ
persistent://gooddata-cn/gooddata-cn/sql.select
persistent://gooddata-cn/gooddata-cn/sql.select.DLQ
(I've left out the system topic __change_events, which is created by Pulsar itself.)
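After the restarts, the comparison against this list can be done mechanically. The following Python is a sketch under assumptions: the helper names are made up, the expected list is shortened to three entries for brevity, and the observed text stands in for pasted output of the topics list subcommand.

```python
def normalize(line):
    """Strip whitespace, surrounding quotes, and angle brackets from a listing line."""
    return line.strip().strip('"').strip("<>")


def missing_topics(expected, observed_output):
    """Return the expected topics that do not appear in the pasted listing, sorted."""
    observed = {
        normalize(line) for line in observed_output.splitlines() if line.strip()
    }
    return sorted(topic for topic in expected if topic not in observed)


PREFIX = "persistent://gooddata-cn/gooddata-cn/"

# Shortened for the example; in practice use the full list given above.
expected = [
    PREFIX + "compute.calcique",
    PREFIX + "metadata.model",
    PREFIX + "sql.select",
]

# Stand-in for the output of `bin/pulsar-admin topics list gooddata-cn/gooddata-cn`.
observed_output = """
"persistent://gooddata-cn/gooddata-cn/compute.calcique"
"persistent://gooddata-cn/gooddata-cn/sql.select.DLQ"
"""

for topic in missing_topics(expected, observed_output):
    print("missing:", topic)
```

An empty result means every expected topic is present; extra topics in the listing, such as __change_events, are ignored.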
p
ok, thanks Robert. Will give this a try.