Steps to reproduce
Prerequisites:
Install Ubuntu from the ubuntu/jammy64 box on Vagrant (2.4.3) + VirtualBox (7.0.22-165102Ubuntujammy).
Install Juju inside the Ubuntu VM.
Install LXD inside the Ubuntu VM.
Install the PostgreSQL charm according to the tutorial and scale it up to 2 replicas (see the command sketch below).
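A minimal sketch of the setup commands inside the VM, assuming the standard Charmhub tutorial flow; the model name, channel, and unit count below are assumptions on my side:
sudo snap install lxd
lxd init --auto                      # accept the default LXD configuration
sudo snap install juju
juju bootstrap localhost lxd         # controller on the local LXD cloud
juju add-model postgresql            # model name is an assumption
juju deploy postgresql --channel 14/stable
juju add-unit postgresql -n 2        # 3 units total: primary + 2 replicas
juju status --watch 5s               # wait until all units are active/idle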
Initial status of the system:
Model Controller Cloud/Region Version SLA Timestamp
postgresql localhost-localhost localhost/localhost 3.4.6 unsupported 07:53:40Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.12 active 3 postgresql 14/stable 468 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0 active idle 0 10.232.17.70 5432/tcp
postgresql/1* active idle 1 10.232.17.207 5432/tcp
postgresql/2 active idle 2 10.232.17.155 5432/tcp Primary
Machine State Address Inst id Base AZ Message
0 started 10.232.17.70 juju-e99525-0 [email protected] Running
1 started 10.232.17.207 juju-e99525-1 [email protected] Running
2 started 10.232.17.155 juju-e99525-2 [email protected] Running
Wait 10-15 minutes to make sure that the app and DB work as expected (200 OKs are returned, and there are enough CPU/RAM resources for this workload).
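As a rough liveness check (a sketch only; the application that returns the 200 OKs is specific to my setup and not shown here, and PGPASSWORD is assumed to hold the operator password):
export PGPASSWORD=<operator-password>   # assumption: obtained beforehand from the charm
while true; do
  psql -h 10.232.17.155 -U operator -p 5432 -c 'SELECT 1;' postgres   # primary answers simple queries
  sleep 30
done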
Partially break the network on the PostgreSQL primary:
juju ssh postgresql/2
sudo tc qdisc add dev eth0 root netem loss 80%
Wait 30-60 minutes and inspect the cluster's behavior.
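To confirm the impairment is active (and to remove it again later when repairing the node), the usual tc commands apply; this is a sketch assuming eth0 is the unit's interface, as above:
sudo tc qdisc show dev eth0              # should list the netem qdisc with loss 80%
sudo tc qdisc del dev eth0 root netem    # removes the packet loss again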
Expected behavior
The failed primary node is detected and kicked out of the cluster. The cluster works well in degraded mode with 1 replica until manual intervention.
Actual behavior
The cluster had 2 primary nodes; see juju status:
Model Controller Cloud/Region Version SLA Timestamp
postgresql localhost-localhost localhost/localhost 3.4.6 unsupported 21:43:32Z
App Version Status Scale Charm Channel Rev Exposed Message
postgresql 14.12 active 3 postgresql 14/stable 468 no
Unit Workload Agent Machine Public address Ports Message
postgresql/0 active idle 0 10.232.17.70 5432/tcp Primary
postgresql/1* active idle 1 10.232.17.207 5432/tcp
postgresql/2 active idle 2 10.232.17.155 5432/tcp Primary
Machine State Address Inst id Base AZ Message
0 started 10.232.17.70 juju-e99525-0 [email protected] Running
1 started 10.232.17.207 juju-e99525-1 [email protected] Running
2 started 10.232.17.155 juju-e99525-2 [email protected] Running
The PostgreSQL cluster nodes also reported two DB masters, postgresql/2 and postgresql/0:
$ psql -h 10.232.17.70 -U operator -p 5432
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
# with long delay due to slow network
$ psql -h 10.232.17.155 -U operator -p 5432
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
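For comparison, Patroni's own view of the member roles can be queried through its REST API (a sketch, assuming the charm exposes the Patroni REST API on the default port 8008 without TLS; the port and scheme may differ in the charm's configuration):
curl -s http://10.232.17.70:8008/cluster    # member roles as seen by the new primary
curl -s http://10.232.17.155:8008/cluster   # the slow node's view (may be stale or time out)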
So it looks like a cluster split-brain.
Versions
Operating system: Ubuntu 22.04.4 LTS
Juju CLI: 3.6.1-genericlinux-amd64
Juju agent: 3.4.6
Charm revision: 468
LXD: 5.0.4
Log output
I've attached the Juju, PostgreSQL, and Patroni logs:
database_logs_new_primary.tar.gz
database_logs_old_primary.tar.gz
juju_status_two_primaries.log
juju-debug-log.log
Additional context
Please note that the cluster self-repaired after rebooting the problem node.
A previous run on my local machine had a different outcome: one of the PostgreSQL replicas was promoted to primary, the old primary became a replica, and the status looked green from the charm's point of view; however, the actual replication state was broken.
I'll reduce the case using pgbench and test it again soon.
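The write load for that reduced case can be generated with pgbench along these lines (a sketch; scale factor, client count, and duration are arbitrary, and PGPASSWORD is assumed to hold the operator password):
pgbench -i -s 10 -h 10.232.17.155 -p 5432 -U operator postgres             # initialize pgbench tables on the primary
pgbench -c 8 -j 2 -T 1800 -h 10.232.17.155 -p 5432 -U operator postgres    # ~30 minutes of mixed read/write load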