PostgreSQL cluster split-brain when network is unstable #712

Open
alex-ramanau opened this issue Jan 7, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@alex-ramanau

Steps to reproduce

Prerequisites:

  1. Install Ubuntu from the ubuntu/jammy64 box on Vagrant (2.4.3) + VirtualBox (7.0.22-165102Ubuntujammy).
  2. Install Juju inside the Ubuntu VM.
  3. Install LXD inside the Ubuntu VM.
  4. Install the PostgreSQL charm according to the tutorial and scale it up to 2 replicas (see the deployment sketch below).
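
For reference, a minimal sketch of the deployment commands (the channel and unit count are assumptions inferred from the juju status output below):

juju bootstrap localhost
juju add-model postgresql
juju deploy postgresql --channel 14/stable
juju add-unit postgresql -n 2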

Initial status of the system:

Model       Controller           Cloud/Region         Version  SLA          Timestamp
postgresql  localhost-localhost  localhost/localhost  3.4.6    unsupported  07:53:40Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.12    active      3  postgresql  14/stable  468  no       

Unit           Workload  Agent  Machine  Public address  Ports     Message
postgresql/0   active    idle   0        10.232.17.70    5432/tcp  
postgresql/1*  active    idle   1        10.232.17.207   5432/tcp  
postgresql/2   active    idle   2        10.232.17.155   5432/tcp  Primary

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.232.17.70   juju-e99525-0  [email protected]      Running
1        started  10.232.17.207  juju-e99525-1  [email protected]      Running
2        started  10.232.17.155  juju-e99525-2  [email protected]      Running
  1. Connect the serial-vault app to the DB.
  2. Run serial-vault-perf-tests at a rate of 200 RPS.
  3. Wait 10-15 minutes to be sure that the app + DB work as expected (200 OKs are returned, and there are enough CPU/RAM resources for this workload).
  4. Partially break the network on the PostgreSQL primary:
juju ssh postgresql/2
sudo tc qdisc add dev eth0 root netem loss 80%
  5. Wait 30-60 minutes and inspect the cluster behavior (see below for how to revert the netem rule afterwards).
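
To restore the network on the affected unit after the test, the netem qdisc can be removed again (a sketch, assuming eth0 is still the impaired interface):

juju ssh postgresql/2
sudo tc qdisc del dev eth0 root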

Expected behavior

The failed primary node is detected and kicked out of the cluster. The cluster works well in degraded mode with 1 replica until manual intervention.
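
In that state, a quick cross-check on the two surviving units (a sketch mirroring the psql checks below; only the remaining primary should return f, the replica should return t):

psql -h 10.232.17.70 -U operator -p 5432 -c "select pg_is_in_recovery();"
psql -h 10.232.17.207 -U operator -p 5432 -c "select pg_is_in_recovery();"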

Actual behavior

The cluster had 2 primary nodes, see juju status:

Model       Controller           Cloud/Region         Version  SLA          Timestamp
postgresql  localhost-localhost  localhost/localhost  3.4.6    unsupported  21:43:32Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.12    active      3  postgresql  14/stable  468  no       

Unit           Workload  Agent  Machine  Public address  Ports     Message
postgresql/0   active    idle   0        10.232.17.70    5432/tcp  Primary
postgresql/1*  active    idle   1        10.232.17.207   5432/tcp  
postgresql/2   active    idle   2        10.232.17.155   5432/tcp  Primary

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.232.17.70   juju-e99525-0  [email protected]      Running
1        started  10.232.17.207  juju-e99525-1  [email protected]      Running
2        started  10.232.17.155  juju-e99525-2  [email protected]      Running

The PostgreSQL cluster nodes themselves also reported two DB masters, postgresql/2 and postgresql/0:

$ psql -h 10.232.17.70 -U operator -p 5432 
postgres=# select pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 f

# with long delay due to slow network
$ psql -h 10.232.17.155 -U operator -p 5432 
postgres=# select pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 f

So it looks like a cluster split-brain.
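
As an additional cross-check (a sketch; port 8008 is Patroni's default REST API port and may differ in this charm's configuration), each node's own view of cluster membership and roles can be queried from Patroni directly:

curl http://10.232.17.70:8008/cluster
curl http://10.232.17.155:8008/cluster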

Versions

Operating system: Ubuntu 22.04.4 LTS

Juju CLI: 3.6.1-genericlinux-amd64

Juju agent: 3.4.6

Charm revision: 468

LXD: 5.0.4

Log output

I've attached the Juju, PostgreSQL and Patroni logs:
database_logs_new_primary.tar.gz
database_logs_old_primary.tar.gz
juju_status_two_primaries.log
juju-debug-log.log

Additional context

Please note that the cluster self-repaired after rebooting the problem node.

A previous run on my local machine had a different outcome:

One of the PostgreSQL replicas was promoted to primary, the old primary became a replica, and the status was green from the charm's point of view. However, the actual replication status was broken.

I'll reduce the case using pgbench and test it again soon.
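
For the reduced case, a rate-limited pgbench run along these lines should generate a comparable 200 TPS load (a sketch; the scale factor, client count and database name are assumptions, and the target host should be whichever unit is currently the primary, 10.232.17.155 in the initial status above):

pgbench -i -s 50 -h 10.232.17.155 -U operator postgres
pgbench -c 20 -j 4 -T 3600 -R 200 -h 10.232.17.155 -U operator postgres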

alex-ramanau added the bug label on Jan 7, 2025

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6272.

This message was autogenerated
