We're seeing a PulsarAdminException:ServerSideErrorException | GoodData #gooddata-cn

Pete Lorenz

01/02/2024, 10:03 PM

We're seeing a PulsarAdminException:ServerSideErrorException in our calcique, metadata-api, afm-exec, sql-executor, and other service logs:

{"ts":"2024-01-02 20:56:00.496","level":"ERROR","logger":"org.springframework.boot.SpringApplication","thread":"main","msg":"Application run failed","exc":"org.apache.pulsar.client.admin.PulsarAdminException$ServerSideErrorException: HTTP 500 Internal Server Error\n\tat org.apache.pulsar.client.admin.PulsarAdminException.wrap(PulsarAdminException.java:252)\n\tat org.apache.pulsar.client.admin.internal.BaseResource.sync(BaseResource.java:302)\n\tat org.apache.pulsar.client.admin.internal.TopicsImpl.createNonPartitionedTopic(TopicsImpl.java:340)\n\tat org.apache.pulsar.client.admin.Topics.createNonPartitionedTopic(Topics.java:482)\n\tat com.gooddata.tiger.pulsar.PulsarAutoConfiguration.producerBeanFactory$lambda-2(PulsarAutoConfiguration.kt:131)\n\tat org.springframework.context.support.PostProcessorRegistrationDelegate.invokeBeanFactoryPostProcessors(PostProcessorRegistrationDelegate.java:325)\n\tat
...

In the pulsar namespace, we're seeing the pulsar-bookie-1 pod in CrashLoopBackOff, but pulsar-bookie-0 and pulsar-bookie-2 are running and available. I've tried restarting the pulsar broker and bookie pods as well as the affected gooddata pods, but the same error occurs. I'm attaching the logs from the failing bookie pod as well as from a failing zookeeper pod. Please let us know any ideas we can try to resolve this. Thanks so much!

pulsar-bookie-1_pulsar-bookie (1).log pulsar-zookeeper-0_pulsar-zookeeper (1).log

Robert Moucha

01/03/2024, 8:23 AM

There's an exception in the bookie log:

Caused by: org.rocksdb.RocksDBException: While appending to file: /pulsar/data/bookkeeper/ledgers/current/ledgers/023002.dbtmp: No space left on device

This error says that the volume bound to the PVC pulsar-bookie-ledgers-pulsar-bookie-1 is full. Usually it has 5Gi capacity, which should be sufficient. If the bookie ran out of space, it suggests that messages are not being dispatched correctly and are staying in topics. I recommend a deeper investigation to see which topics are causing trouble. You can connect to one of the brokers and use the bin/pulsar-admin command to inspect the backlog size of topics. Refer to https://pulsar.apache.org/reference/#/2.11.x/pulsar-admin/topics for information on how to use the pulsar-admin CLI tool. The most important subcommands are

bin/pulsar-admin topics list <tenant>/<namespace>

and

bin/pulsar-admin topics stats persistent://<tenant>/<namespace>/<topic>

(where <tenant>/<namespace> are the Pulsar tenant and namespace, typically gooddata-cn/gooddata-cn). The stats subcommand returns JSON-formatted output for the given topic; look for a backlogSize that is non-zero (or much higher than zero).
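The per-topic check described above could be scripted roughly as follows. This is a minimal sketch, not something from the thread: the helper names max_backlog and report_backlogs are made up for illustration, and it assumes the script runs inside a broker pod from /pulsar, where bin/pulsar-admin and the usual grep, sed, and sort are available. It relies only on the topics list and topics stats subcommands mentioned above.

```shell
#!/bin/sh
# Sketch only: max_backlog and report_backlogs are hypothetical helper names.

# Read `topics stats` JSON on stdin and print the largest backlogSize value
# found in it (the topic-level field and any per-subscription ones).
max_backlog() {
  grep -oE '"backlogSize"[[:space:]]*:[[:space:]]*-?[0-9]+' \
    | sed 's/.*:[[:space:]]*//' \
    | sort -n \
    | tail -n 1
}

# Print "<backlogSize><TAB><topic>" for every topic in a namespace,
# largest backlog first. Assumes the working directory is /pulsar in a broker pod.
report_backlogs() {
  ns="$1"
  bin/pulsar-admin topics list "$ns" | tr -d '"' | while read -r topic; do
    [ -n "$topic" ] || continue
    size=$(bin/pulsar-admin topics stats "$topic" | max_backlog)
    printf '%s\t%s\n' "${size:-unknown}" "$topic"
  done | sort -rn
}
```

Calling report_backlogs gooddata-cn/gooddata-cn would then put the topics with the largest backlog at the top of the output, which is where to start looking.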

Pete Lorenz

01/03/2024, 3:42 PM

Thanks Robert. Will investigate.

Robert Moucha

01/04/2024, 7:59 AM

Please refer to https://gooddataconnect.slack.com/archives/C01P3H2HTDL/p1700322114104059?thread_ts=1699982207.551759&cid=C01P3H2HTDL where I already suggested these steps a few months ago. As I wrote, there's something fishy in your deployment that causes the Pulsar bookies to fill up. I recommend resolving the root cause rather than expanding the ledger volume.

Pete Lorenz

01/04/2024, 8:35 PM

Thank you, Robert. I've listed the topics on our single instance of the broker and run the following command for each topic except the system topic:

bin/pulsar-admin topics stats persistent://gooddata-cn/gooddata-cn/[topic]

It appears that "backlogSize" is 0 for every topic (as of now). It appears that the list of topics changes. Is this expected? For example, I ran topics list at first and got:

root@pulsar-broker-0:/pulsar# bin/pulsar-admin topics list gooddata-cn/gooddata-cn
"persistent://gooddata-cn/gooddata-cn/__change_events"
"persistent://gooddata-cn/gooddata-cn/metadata.model.calcique.DLQ"
"persistent://gooddata-cn/gooddata-cn/export-tabular.request.DLQ"
"persistent://gooddata-cn/gooddata-cn/cache-settings.change"
"persistent://gooddata-cn/gooddata-cn/export-visual.request"
"persistent://gooddata-cn/gooddata-cn/metadata.model.DLQ"
"persistent://gooddata-cn/gooddata-cn/metadata.model"
"persistent://gooddata-cn/gooddata-cn/export-visual.request.DLQ"
"persistent://gooddata-cn/gooddata-cn/export-tabular.request"
"persistent://gooddata-cn/gooddata-cn/cache-settings.bootstrap"
"persistent://gooddata-cn/gooddata-cn/compute.calcique.DLQ"
"persistent://gooddata-cn/gooddata-cn/result.xtab.DLQ"
"persistent://gooddata-cn/gooddata-cn/cache-settings.change.DLQ"
"persistent://gooddata-cn/gooddata-cn/metadata.cache-command"
"persistent://gooddata-cn/gooddata-cn/caches.garbage-collect.DLQ"
"persistent://gooddata-cn/gooddata-cn/data-source.change.DLQ"
"persistent://gooddata-cn/gooddata-cn/data-source.change.calcique.DLQ"
"persistent://gooddata-cn/gooddata-cn/compute.calcique"

About 20 minutes later, I ran the same command and the list of topics is different:

root@pulsar-broker-0:/pulsar# bin/pulsar-admin topics list gooddata-cn/gooddata-cn
"persistent://gooddata-cn/gooddata-cn/__change_events"
"persistent://gooddata-cn/gooddata-cn/export-tabular.request.DLQ"
"persistent://gooddata-cn/gooddata-cn/cache-settings.change"
"persistent://gooddata-cn/gooddata-cn/export-visual.request"
"persistent://gooddata-cn/gooddata-cn/export-visual.request.DLQ"
"persistent://gooddata-cn/gooddata-cn/export-tabular.request"
"persistent://gooddata-cn/gooddata-cn/cache-settings.bootstrap"
"persistent://gooddata-cn/gooddata-cn/cache-settings.change.DLQ"
"persistent://gooddata-cn/gooddata-cn/metadata.cache-command"
"persistent://gooddata-cn/gooddata-cn/sql.select.DLQ"
"persistent://gooddata-cn/gooddata-cn/caches.garbage-collect.DLQ"
"persistent://gooddata-cn/gooddata-cn/data-source.change.DLQ"
"persistent://gooddata-cn/gooddata-cn/data-source.change.calcique.DLQ"
"persistent://gooddata-cn/gooddata-cn/compute.calcique"

Note that the "metadata.model" topic is in the first list but not the second. I'm wondering if this is expected.

Pete Lorenz

01/04/2024, 8:38 PM

Another thing, since our volume claims were 1Gi and you mentioned that 5Gi is normal, we've increased the size of the claims to 5Gi for journal and ledger volumes.

Pete Lorenz

01/04/2024, 9:41 PM

Our pulsar deployment is now up with full availability. I'll keep an eye on the storage usage on its volumes for any anomalies.

Robert Moucha

01/05/2024, 8:26 AM

Missing topics should not happen; maybe it's a consequence of bookie recovery after the disk-full condition. From the output you provided, I can see that multiple topics are missing:

persistent://gooddata-cn/gooddata-cn/caches.garbage-collect
persistent://gooddata-cn/gooddata-cn/compute.calcique.DLQ
persistent://gooddata-cn/gooddata-cn/data-source.change
persistent://gooddata-cn/gooddata-cn/metadata.model
persistent://gooddata-cn/gooddata-cn/result.xtab
persistent://gooddata-cn/gooddata-cn/sql.select

Topics are created dynamically by the application. To make sure the messaging stack is working correctly, please restart both Pulsar brokers first and, once they are up, perform a rolling restart of the following gooddata-cn deployments: calcique, sql-executor, result-cache. This is the list of topics that should exist:

persistent://gooddata-cn/gooddata-cn/cache-settings.bootstrap
persistent://gooddata-cn/gooddata-cn/cache-settings.change
persistent://gooddata-cn/gooddata-cn/cache-settings.change.DLQ
persistent://gooddata-cn/gooddata-cn/caches.garbage-collect
persistent://gooddata-cn/gooddata-cn/caches.garbage-collect.DLQ
persistent://gooddata-cn/gooddata-cn/compute.calcique
persistent://gooddata-cn/gooddata-cn/compute.calcique.DLQ
persistent://gooddata-cn/gooddata-cn/data-source.change
persistent://gooddata-cn/gooddata-cn/data-source.change.calcique.DLQ
persistent://gooddata-cn/gooddata-cn/data-source.change.DLQ
persistent://gooddata-cn/gooddata-cn/export-tabular.request
persistent://gooddata-cn/gooddata-cn/export-tabular.request.DLQ
persistent://gooddata-cn/gooddata-cn/export-visual.request
persistent://gooddata-cn/gooddata-cn/export-visual.request.DLQ
persistent://gooddata-cn/gooddata-cn/metadata.cache-command
persistent://gooddata-cn/gooddata-cn/metadata.model
persistent://gooddata-cn/gooddata-cn/metadata.model.calcique.DLQ
persistent://gooddata-cn/gooddata-cn/metadata.model.DLQ
persistent://gooddata-cn/gooddata-cn/result.xtab
persistent://gooddata-cn/gooddata-cn/result.xtab.DLQ
persistent://gooddata-cn/gooddata-cn/sql.select
persistent://gooddata-cn/gooddata-cn/sql.select.DLQ

(I don't mention the system topic __change_events, which is created by Pulsar itself.)
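One way to check a broker against the expected list after the restarts is sketched below. It is only an illustration, not something from the thread: missing_topics is a made-up helper name, expected.txt is a hypothetical file with one expected topic per line, and mktemp is assumed to be available in the shell you run it from.

```shell
# Sketch only: missing_topics is a hypothetical helper name.
# stdin: output of `bin/pulsar-admin topics list <tenant>/<namespace>`
# $1:    file with one expected topic name per line
missing_topics() {
  expected_file="$1"
  actual_file=$(mktemp)
  # `topics list` prints each topic wrapped in double quotes; strip them.
  tr -d '"' > "$actual_file"
  # -F fixed strings, -x whole-line match, -v keep expected lines that
  # have no match in the actual list, i.e. the missing topics.
  grep -vxFf "$actual_file" "$expected_file"
  rm -f "$actual_file"
}
```

For example, bin/pulsar-admin topics list gooddata-cn/gooddata-cn | missing_topics expected.txt prints every expected topic that the broker does not currently list, and prints nothing when all of them exist.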

Pete Lorenz

01/05/2024, 6:49 PM

ok, thanks Robert. Will give this a try.
