[Bug] Brokers keep restarting when unable to connect to a read-only global ZK #23838
Version
Pulsar 3.0.7, which bundles ZooKeeper 3.9.2.
Brokers and configstore are both on the same Pulsar version.
Minimal reproduce step
Let's say the global ZK spans 2 regions: R1 has 3 participants and 2 observers, R2 has 2 participants and 2 observers, and the ZK leader is in R2.
Now, in a situation such as a network partition, the global ZK loses quorum and the R2 ZK servers go into read-only (RO) mode.
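For context, a minimal sketch of such a two-region ensemble (hostnames, ports, and paths are illustrative, not our actual configuration):

```properties
# zoo.cfg sketch for the global (configuration-store) ZooKeeper ensemble
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181

# Region R1: 3 participants + 2 observers
server.1=zk-r1-1:2888:3888
server.2=zk-r1-2:2888:3888
server.3=zk-r1-3:2888:3888
server.4=zk-r1-4:2888:3888:observer
server.5=zk-r1-5:2888:3888:observer

# Region R2: 2 participants + 2 observers (the leader happened to be here)
server.6=zk-r2-1:2888:3888
server.7=zk-r2-2:2888:3888
server.8=zk-r2-3:2888:3888:observer
server.9=zk-r2-4:2888:3888:observer

# As we understand it, read-only mode itself is enabled on each server via the
# readonlymode.enabled JVM system property (e.g. -Dreadonlymode.enabled=true).
```

With 5 participants, quorum needs 3; after the partition R2 only has 2 participants, so its servers cannot form a quorum and (with read-only mode enabled) serve reads only.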
What did you expect to see?
What did you see instead?
Observations from our testing with the above global ZK setup, while the global ZK is operating in RO mode:
Brokers on Pulsar v2.9.3, configstore on ZooKeeper v3.9.2 (Pulsar v3.0.7):
The zookeeperStoreAllowReadOnlyOperations flag is not set on the brokers, yet the cluster stays stable: existing reads/writes work, and a few admin GET calls also work. However, the configstore still logs exceptions like "refusing the connection from not RO clients".
Brokers on Pulsar v3.0.7, configstore on ZooKeeper v3.9.2 (Pulsar v3.0.7):
With the metadataStoreAllowReadOnlyOperations flag not set on the brokers: when the global ZK loses quorum and is in RO mode, the brokers cannot connect to the configstore and keep restarting.
If we enable metadataStoreAllowReadOnlyOperations on the brokers and local sessions on the configstore, the RO session is established and existing reads/writes work, but any admin call, even a simple get-tenants, fails with a KeeperException SessionExpired.
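Roughly what we enabled for that second case, as a sketch rather than a verified config (keys as we understand them):

```properties
# broker.conf on the Pulsar 3.0.7 brokers
metadataStoreAllowReadOnlyOperations=true

# zoo.cfg on the configstore servers ("local session" above refers, as we understand it,
# to ZooKeeper's local-session support)
localSessionsEnabled=true
localSessionsUpgradingEnabled=true
```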
This appears to be because the broker sends a close-session call to the configstore on any call made via the Pulsar client or pulsar-admin, and fetching the admin policies then fails.
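For illustration, a minimal ZooKeeper client sketch (not Pulsar code; hostname and znode path are made up) of what a read-only server accepts and refuses, which matches the behavior we see: reads succeed, while anything that must be committed through the quorum, including the session close the broker issues, is refused.

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ReadOnlyConfigstoreProbe {
    public static void main(String[] args) throws Exception {
        // canBeReadOnly=true lets the client attach to a server that is serving in read-only mode;
        // as far as we can tell, this is roughly what metadataStoreAllowReadOnlyOperations enables
        // in the broker's ZK client.
        ZooKeeper zk = new ZooKeeper("configstore-r2-1:2181", 30_000, event -> { }, true);

        // Reads are served locally by the read-only server.
        List<String> children = zk.getChildren("/", false);
        System.out.println("read ok: " + children);

        try {
            // Any operation that must be committed through the quorum is refused while the
            // server is read-only; the broker's close-session falls into the same category.
            zk.create("/ro-probe", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        } catch (KeeperException.NotReadOnlyException e) {
            System.out.println("write refused by read-only server: " + e.getMessage());
        }

        zk.close();
    }
}
```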
Once quorum is restored, the session upgrades automatically and everything works as expected.
Anything else?
No response
Are you willing to submit a PR?