[Bug] Brokers keep restarting when unable to connect to a read-only global ZK #23838
Version
Pulsar 3.0.7, which bundles ZooKeeper 3.9.2.
Brokers and configstore are both on the same Pulsar version.
Minimal reproduce step
Let's say the global ZK spans 2 regions: R1 has 3 participants and 2 observers, R2 has 2 participants and 2 observers, and the ZK leader is in R2.
Now, in a situation such as a network partition, the global ZK loses quorum and the R2 ZK servers go into read-only (RO) mode.
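For context, a minimal sketch of such a two-region ensemble (hostnames, ports, and paths are illustrative, not our actual configuration):

```properties
# zoo.cfg sketch for the global (configuration-store) ZooKeeper ensemble
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181

# Region R1: 3 participants + 2 observers
server.1=zk-r1-1:2888:3888
server.2=zk-r1-2:2888:3888
server.3=zk-r1-3:2888:3888
server.4=zk-r1-4:2888:3888:observer
server.5=zk-r1-5:2888:3888:observer

# Region R2: 2 participants + 2 observers (the leader happened to be here)
server.6=zk-r2-1:2888:3888
server.7=zk-r2-2:2888:3888
server.8=zk-r2-3:2888:3888:observer
server.9=zk-r2-4:2888:3888:observer

# As we understand it, read-only mode itself is enabled on each server via the
# readonlymode.enabled JVM system property (e.g. -Dreadonlymode.enabled=true).
```

With 5 participants, quorum needs 3; after the partition R2 only has 2 participants, so its servers cannot form a quorum and (with read-only mode enabled) serve reads only.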
What did you expect to see?
What did you see instead?
Observations from our testing with the above global ZK setup, while the global ZK is operating in RO mode:
Brokers on Pulsar v2.9.3, configstore on ZooKeeper v3.9.2 (Pulsar v3.0.7):
The zookeeperStoreAllowReadOnlyOperations flag is not set on the brokers, yet the cluster stays stable: existing reads/writes work, and a few admin GET calls also work. However, the configstore still logs exceptions like "refusing the connection from not RO clients".
Brokers on Pulsar v3.0.7, configstore on ZooKeeper v3.9.2 (Pulsar v3.0.7):
With the metadataStoreAllowReadOnlyOperations flag not set on the brokers: when the global ZK loses quorum and is in RO mode, the brokers cannot connect to the configstore and keep restarting.
If we enable metadataStoreAllowReadOnlyOperations on the brokers and local sessions on the configstore, the RO session is established and existing reads/writes work, but any admin call, even a simple get-tenants, fails with a KeeperException SessionExpired.
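Roughly what we enabled for that second case, as a sketch rather than a verified config (keys as we understand them):

```properties
# broker.conf on the Pulsar 3.0.7 brokers
metadataStoreAllowReadOnlyOperations=true

# zoo.cfg on the configstore servers ("local session" above refers, as we understand it,
# to ZooKeeper's local-session support)
localSessionsEnabled=true
localSessionsUpgradingEnabled=true
```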
This appears to be because the broker sends a close-session call to the configstore on any call made via the Pulsar client or pulsar-admin, and fetching the admin policies then fails.
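For illustration, a minimal ZooKeeper client sketch (not Pulsar code; hostname and znode path are made up) of what a read-only server accepts and refuses, which matches the behavior we see: reads succeed, while anything that must be committed through the quorum, including the session close the broker issues, is refused.

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ReadOnlyConfigstoreProbe {
    public static void main(String[] args) throws Exception {
        // canBeReadOnly=true lets the client attach to a server that is serving in read-only mode;
        // as far as we can tell, this is roughly what metadataStoreAllowReadOnlyOperations enables
        // in the broker's ZK client.
        ZooKeeper zk = new ZooKeeper("configstore-r2-1:2181", 30_000, event -> { }, true);

        // Reads are served locally by the read-only server.
        List<String> children = zk.getChildren("/", false);
        System.out.println("read ok: " + children);

        try {
            // Any operation that must be committed through the quorum is refused while the
            // server is read-only; the broker's close-session falls into the same category.
            zk.create("/ro-probe", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        } catch (KeeperException.NotReadOnlyException e) {
            System.out.println("write refused by read-only server: " + e.getMessage());
        }

        zk.close();
    }
}
```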
Once quorum is restored, the session upgrades automatically and everything works as expected.
Anything else?
No response
Are you willing to submit a PR?