PostgreSQL cluster split-brain when network is unstable #712

Open
alex-ramanau opened this issue Jan 7, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@alex-ramanau

Steps to reproduce

Prerequisites:

  1. Install Ubuntu from the ubuntu/jammy64 box on Vagrant (2.4.3) + VirtualBox (7.0.22-165102Ubuntujammy).
  2. Install Juju inside the Ubuntu VM.
  3. Install LXD inside the Ubuntu VM.
  4. Install the PostgreSQL charm according to the tutorial and scale it up to 2 replicas (see the deployment sketch below).
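
For reference, a minimal sketch of the deployment commands (the channel and unit count are assumptions inferred from the juju status output below):

juju bootstrap localhost
juju add-model postgresql
juju deploy postgresql --channel 14/stable
juju add-unit postgresql -n 2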

Initial status of the system:

Model       Controller           Cloud/Region         Version  SLA          Timestamp
postgresql  localhost-localhost  localhost/localhost  3.4.6    unsupported  07:53:40Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.12    active      3  postgresql  14/stable  468  no       

Unit           Workload  Agent  Machine  Public address  Ports     Message
postgresql/0   active    idle   0        10.232.17.70    5432/tcp  
postgresql/1*  active    idle   1        10.232.17.207   5432/tcp  
postgresql/2   active    idle   2        10.232.17.155   5432/tcp  Primary

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.232.17.70   juju-e99525-0  [email protected]      Running
1        started  10.232.17.207  juju-e99525-1  [email protected]      Running
2        started  10.232.17.155  juju-e99525-2  [email protected]      Running
  1. Connect the serial-vault app to the DB.
  2. Run serial-vault-perf-tests at a rate of 200 RPS.
  3. Wait 10-15 minutes to be sure that the app + DB work as expected (200 OKs are returned, and there are enough CPU/RAM resources for this workload).
  4. Partially break the network on the PostgreSQL primary:
juju ssh postgresql/2
sudo tc qdisc add dev eth0 root netem loss 80%
  5. Wait 30-60 minutes and inspect the cluster behavior (see below for how to revert the netem rule afterwards).
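
To restore the network on the affected unit after the test, the netem qdisc can be removed again (a sketch, assuming eth0 is still the impaired interface):

juju ssh postgresql/2
sudo tc qdisc del dev eth0 root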

Expected behavior

The failed primary node is detected and kicked out of the cluster. The cluster works well in degraded mode with 1 replica until manual intervention.
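
In that state, a quick cross-check on the two surviving units (a sketch mirroring the psql checks below; only the remaining primary should return f, the replica should return t):

psql -h 10.232.17.70 -U operator -p 5432 -c "select pg_is_in_recovery();"
psql -h 10.232.17.207 -U operator -p 5432 -c "select pg_is_in_recovery();"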

Actual behavior

The cluster had 2 primary nodes, see juju status:

Model       Controller           Cloud/Region         Version  SLA          Timestamp
postgresql  localhost-localhost  localhost/localhost  3.4.6    unsupported  21:43:32Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.12    active      3  postgresql  14/stable  468  no       

Unit           Workload  Agent  Machine  Public address  Ports     Message
postgresql/0   active    idle   0        10.232.17.70    5432/tcp  Primary
postgresql/1*  active    idle   1        10.232.17.207   5432/tcp  
postgresql/2   active    idle   2        10.232.17.155   5432/tcp  Primary

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.232.17.70   juju-e99525-0  [email protected]      Running
1        started  10.232.17.207  juju-e99525-1  [email protected]      Running
2        started  10.232.17.155  juju-e99525-2  [email protected]      Running

The PostgreSQL cluster nodes themselves also reported two DB masters, postgresql/2 and postgresql/0:

$ psql -h 10.232.17.70 -U operator -p 5432 
postgres=# select pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 f

# with long delay due to slow network
$ psql -h 10.232.17.155 -U operator -p 5432 
postgres=# select pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 f

So it looks like a cluster split-brain.
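
As an additional cross-check (a sketch; port 8008 is Patroni's default REST API port and may differ in this charm's configuration), each node's own view of cluster membership and roles can be queried from Patroni directly:

curl http://10.232.17.70:8008/cluster
curl http://10.232.17.155:8008/cluster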

Versions

Operating system: Ubuntu 22.04.4 LTS

Juju CLI: 3.6.1-genericlinux-amd64

Juju agent: 3.4.6

Charm revision: 468

LXD: 5.0.4

Log output

I've attached the Juju, PostgreSQL and Patroni logs:
database_logs_new_primary.tar.gz
database_logs_old_primary.tar.gz
juju_status_two_primaries.log
juju-debug-log.log

Additional context

Please note that the cluster self-repaired after rebooting the problem node.

A previous run on my local machine had a different outcome:

One of the PostgreSQL replicas was promoted to primary, the old primary became a replica, and the status was green from the charm's point of view. However, the actual replication status was broken.

I'll reduce the case using pgbench and test it again soon.
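
For the reduced case, a rate-limited pgbench run along these lines should generate a comparable 200 TPS load (a sketch; the scale factor, client count and database name are assumptions, and the target host should be whichever unit is currently the primary, 10.232.17.155 in the initial status above):

pgbench -i -s 50 -h 10.232.17.155 -U operator postgres
pgbench -c 20 -j 4 -T 3600 -R 200 -h 10.232.17.155 -U operator postgres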

alex-ramanau added the bug label on Jan 7, 2025

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6272.

This message was autogenerated
