
2/3 PG units stuck in waiting/idle state, not moving to active/idle #668

Closed
ethanmye-rs opened this issue Nov 11, 2024 · 16 comments
Labels: bug (Something isn't working as expected)

@ethanmye-rs
Member

Steps to reproduce

a. I do not have a firm reproducer, but I ran into this issue while upgrading from rev 429 to rev 468 in a charmed Landscape deployment. I originally encountered the issue on rev 429 and, based on a prior bug, expected that refreshing to 468 would fix it. However, my PG units still do not start and remain stuck in an "awaiting for member to start" state.
b. I did not encounter this issue on another cluster in an identical environment, so it seems somewhat random. The machines in the Juju model are manual machines in Azure.

  1. Essentially, 2/3 postgres units stay stuck in an "awaiting for member to start" state. They cycle through various waiting and executing states, but the PG units never actually start.

Expected behavior

I expect the other 2 units to start and enter an active/idle state. They have been stuck for >48 hours.

Actual behavior

(screenshot of unit status omitted)

See logs below; the machines cycle through waiting/executing states but never enter active/idle as expected.

Versions

Operating system: 22.04.4

Juju CLI: 3.5.4

Juju agent: 3.5.4

Charm revision: 468

LXD: n/a

Log output

juju debug log: https://paste.ubuntu.com/p/FzXnjMpNYz/
snap logs from one unit failing to start: https://paste.ubuntu.com/p/St8WZNn4GT/ (restart at the end of the log file)
snap logs from other unit failing to start: https://paste.ubuntu.com/p/BH3RXfZrTW/
snap logs from healthy unit: https://paste.ubuntu.com/p/b6bgSVZKYm/
pg snap services config: https://paste.ubuntu.com/p/xJJq6ktXm9/

Happy to provide more logs, details or access to the environment. Thanks.

@ethanmye-rs ethanmye-rs added the bug Something isn't working as expected label Nov 11, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5931.

This message was autogenerated

@dragomirp
Contributor

Hi, @ethanmye-rs, it seems like Patroni's health endpoint is failing, likely due to the "rejecting connections" errors in the snap logs. This appears to be caused by pg_isready, which Patroni calls, failing with exit code 1. Can you check whether you have any PostgreSQL logs on the failing units from around the time they failed? They should be located in /var/snap/charmed-postgresql/common/var/log/postgresql/. There may be more details in the Patroni logs located in /var/snap/charmed-postgresql/common/var/log/patroni/.

@ethanmye-rs
Member Author

ethanmye-rs commented Nov 12, 2024

Pulled the logs for the two failed units, located in Google Drive here. Not sure if there is anything sensitive in them (they are ~200 MB total, uncompressed), so I've made them Canonical-internal. I will also pull the logs for the active/idle unit, but they are much larger, about 2.4 GB uncompressed.

@ethanmye-rs
Member Author

ethanmye-rs commented Nov 12, 2024

Thanks @marceloneppel very much for the help. The core issue was missing pg_wal data, which prevented the other two units from moving past the starting state. A few misc things:

The core reason the replica machines could not start was missing pg_wal data. The data was missing because, at an earlier point, PostgreSQL exhausted the 64 GB VM disk and I was forced to restart the machine; I believe PG clears pg_wal data on reboot. However, the main database is not very large, maybe 6-7 GB based on the data in /var/snap/charmed-postgresql/common/var/lib/postgresql/, so it is surprising that pg_wal grew so large. It would be nice to set a charm limit on pg_wal to avoid getting into this situation.

Based on the logs, it seems that when upgrading from 429 -> 468, you will still see superfluous log entries for charmed-postgresql.pgbackrest-service and similar services failing to start.

It would also be nice to surface a warning (either in Juju or COS) if one of the Patroni members is in anything but a streaming or running state. Having machines stuck in starting surfaced no errors.
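Until a charm-level limit exists, pg_wal growth like that described above can be capped at the PostgreSQL level. A sketch of the relevant postgresql.conf parameters (the values here are purely illustrative, not recommendations for this deployment):

```ini
# postgresql.conf -- illustrative values only
max_wal_size = 2GB             # soft limit before a checkpoint is forced
wal_keep_size = 1GB            # WAL retained for standbys (PostgreSQL 13+)
max_slot_wal_keep_size = 4GB   # cap WAL retained by replication slots
```

Note that wal_keep_size is a trade-off: a replica that falls further behind than the retained WAL must be reinitialized from a base backup, but the disk can no longer be filled without bound.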

For future reference, this is how we checked the patroni status:

curl <current active pg unit ip>:8008/cluster

and for reiniting the followers, to be run on one of the follower machines:

sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml reinit postgresql

You can query the state by looking at either the /cluster endpoint or the log at /var/snap/charmed-postgresql/common/var/log/patroni/patroni.log.
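The manual check above could be automated into the kind of alert suggested earlier. A minimal sketch, assuming the JSON shape of Patroni's documented `/cluster` endpoint (the sample payload below is hypothetical, standing in for `curl <leader-ip>:8008/cluster` output):

```python
import json

# Patroni member states considered healthy.
HEALTHY_STATES = {"running", "streaming"}

def unhealthy_members(cluster):
    """Return (name, state) pairs for members not running/streaming."""
    return [(m["name"], m["state"])
            for m in cluster.get("members", [])
            if m.get("state") not in HEALTHY_STATES]

# Hypothetical payload mirroring the /cluster endpoint's response.
sample = json.loads("""
{"members": [
  {"name": "postgresql-0", "role": "replica", "state": "starting"},
  {"name": "postgresql-1", "role": "leader", "state": "running"},
  {"name": "postgresql-2", "role": "sync_standby", "state": "streaming"}
]}
""")

print(unhealthy_members(sample))  # [('postgresql-0', 'starting')]
```

A check like this, run periodically, would have flagged the units stuck in starting long before the >48-hour mark.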

@ben-ballot

My 2 cents, as I was hit by this bug as well, using this version of the charm:

postgresql                        14.9     active      3  postgresql        14/candidate   331  yes

I confirm that doing a reinit of the member fixed the "issue".

I still have a different role for my 2 standby units though:

$ sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql ---------+--------------+---------+-----+-----------+-----------------+
| Member       | Host          | Role         | State   |  TL | Lag in MB | Pending restart |
+--------------+---------------+--------------+---------+-----+-----------+-----------------+
| postgresql-3 | 10.141.31.28  | Leader       | running | 184 |           | *               |
| postgresql-4 | 10.141.31.216 | Replica      | running | 184 |         0 | *               |
| postgresql-5 | 10.141.31.15  | Sync Standby | running | 184 |         0 | *               |
+--------------+---------------+--------------+---------+-----+-----------+-----------------+

@alexdlukens

A user also experienced this issue on revision 545, in the is-managed-database-prod-marketing-airbyte-marketo environment, although the above command to reinit the replica units does not seem to work.

In the Patroni logs, I see:

charmed-postgresql.patroni[180289]: pg_controldata: fatal: could not open file "/var/snap/charmed-postgresql/common/var/lib/postgresql/global/pg_control" for reading: No such file or directory

Before the first occurrence of this line in the logs, there was additional context:

Jan 09 13:54:45 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[980280]: pg_basebackup: error: could not get COPY data stream: ERROR:  the standby was promoted during online backup
Jan 09 13:54:45 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[980280]: HINT:  This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
Jan 09 13:54:45 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[980280]: pg_basebackup: removing data directory "/var/snap/charmed-postgresql/common/var/lib/postgresql"
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopping Service for snap application charmed-postgresql.patroni...
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Deactivated successfully.
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopped Service for snap application charmed-postgresql.patroni.
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Consumed 9.929s CPU time.
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Started Service for snap application charmed-postgresql.patroni.
Jan 09 13:59:20 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopping Service for snap application charmed-postgresql.patroni...
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: State 'final-sigterm' timed out. Killing.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 981390 (python3) with signal SIGKILL.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 984912 (pg_basebackup) with signal SIGKILL.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 984915 (pg_basebackup) with signal SIGKILL.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Failed with result 'timeout'.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Unit process 981390 (python3) remains running after unit stopped.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Unit process 984912 (pg_basebackup) remains running after unit stopped.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Unit process 984915 (pg_basebackup) remains running after unit stopped.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopped Service for snap application charmed-postgresql.patroni.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Consumed 1min 11.395s CPU time.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Started Service for snap application charmed-postgresql.patroni.
Jan 09 13:59:55 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[985096]: pg_controldata: fatal: could not open file "/var/snap/charmed-postgresql/common/var/lib/postgresql/global/pg_control" for reading: No such file or directory

So pg_basebackup removed the data directory, was killed, and then, when Patroni restarted, it could not find the directory?

This error has been persistent in the logs since Jan 09, when this happened:

is-managed-database-prod-marketing-airbyte-marketo@is-bastion-ps6:~$ cat patroni.log | grep "could not open file" | wc -l
268331

@marceloneppel
Member

marceloneppel commented Feb 27, 2025

Hi, @alexdlukens! Thanks for the logs.

I did some investigation and reproduced the issue locally, but I'd like to check it in your environment to see how it's happening there. The issue is slightly different from the one in the first comment of this GitHub issue.

I'll be on PTO until March 14th. Can I contact you from March 17th onwards to check the issue with you? Thanks.

@alexdlukens

alexdlukens commented Feb 27, 2025

Hello, @marceloneppel

I think this was worked around during our IS / Data Platform sync meeting. I will share what was done here:

sudo systemctl stop jujud-machine-<X>.service
sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml reinit postgresql postgresql-4
sudo systemctl start jujud-machine-<X>.service

Replacing postgresql-4 with the respective unit name.

Please do reach out when you are back. We have several deployments; I am sure we can find an additional one with a similar issue.

@hloeung
Contributor

hloeung commented Mar 3, 2025

Confirmed #668 (comment) fixed it for me:

Was:

postgresql/0              waiting   idle   3        10.146.64.196   5432/tcp  awaiting for member to start
  landscape-client/3      active    idle            10.146.64.196             Client registered!
  telegraf/3              active    idle            10.146.64.196   9103/tcp  Monitoring postgresql/0 (source version/commit 23.10)
  ubuntu-advantage/3      active    idle            10.146.64.196             Attached (esm-apps,esm-infra,livepatch)

Then ran:

ubuntu@juju-69c4ae-dbaas-stg-data-mesh-el-netsuite-3:~$ sudo systemctl stop jujud-machine-3.service
ubuntu@juju-69c4ae-dbaas-stg-data-mesh-el-netsuite-3:~$ sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml reinit postgresql postgresql-0
+ Cluster: postgresql (7468654271286863093) --+-----------+----+-----------+
| Member       | Host          | Role         | State     | TL | Lag in MB |
+--------------+---------------+--------------+-----------+----+-----------+
| postgresql-0 | 10.146.64.196 | Replica      | stopped   |    |   unknown |
| postgresql-1 | 10.146.64.206 | Sync Standby | streaming |  6 |         0 |
| postgresql-2 | 10.146.64.203 | Leader       | running   |  6 |           |
+--------------+---------------+--------------+-----------+----+-----------+
Are you sure you want to reinitialize members postgresql-0? [y/N]: y
Success: reinitialize for member postgresql-0
ubuntu@juju-69c4ae-dbaas-stg-data-mesh-el-netsuite-3:~$ sudo systemctl start jujud-machine-3.service

Now:

postgresql/0              active    idle   3        10.146.64.196   5432/tcp
  landscape-client/3      active    idle            10.146.64.196             Client registered!
  telegraf/3              active    idle            10.146.64.196   9103/tcp  Monitoring postgresql/0 (source version/commit 23.10)
  ubuntu-advantage/3      active    idle            10.146.64.196             Attached (esm-apps,esm-infra,livepatch)

@hloeung
Contributor

hloeung commented Mar 3, 2025

I went around fixing the stuck units on a few environments/models, but on two of them it is failing with start failed:

ubuntu@juju-30681d-is-managed-database-prod-support-airbyte-jira-3:~$ sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql (7446670001525216935) -+--------------+----+-----------+-----------------+
| Member       | Host         | Role         | State        | TL | Lag in MB | Pending restart |
+--------------+--------------+--------------+--------------+----+-----------+-----------------+
| postgresql-0 | 10.146.64.7  | Replica      | start failed |    |   unknown |                 |
| postgresql-1 | 10.146.64.11 | Leader       | running      | 56 |           | *               |
| postgresql-2 | 10.146.64.5  | Sync Standby | streaming    | 56 |         0 | *               |
+--------------+--------------+--------------+--------------+----+-----------+-----------------+
ubuntu@juju-f46785-is-managed-database-prod-marketing-airbyte-marketo:~$ sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql (7443567519097256572) -+--------------+-----+-----------+-----------------+
| Member       | Host         | Role         | State        |  TL | Lag in MB | Pending restart |
+--------------+--------------+--------------+--------------+-----+-----------+-----------------+
| postgresql-0 | 10.146.64.39 | Sync Standby | streaming    | 127 |         0 | *               |
| postgresql-4 | 10.146.64.36 | Leader       | running      | 127 |           | *               |
| postgresql-5 | 10.146.64.43 | Replica      | start failed |     |   unknown |                 |
+--------------+--------------+--------------+--------------+-----+-----------+-----------------+

Any ideas what to do for these cases?

@taurus-forever
Contributor

Dear @hloeung and @alexdlukens, can you please contact @marceloneppel on MM/Matrix to check your issue?
We suspect several independent issues are mixed in this bug report due to the common UI status "awaiting for member to start".
Let's try to separate these cases, starting from your subcases. Thanks!

@cmisare

cmisare commented Mar 21, 2025

@taurus-forever we met with @marceloneppel to troubleshoot the 2 environments mentioned in #668 (comment). In both cases there was a mismatch between the charm revision (553) and the installed charmed-postgresql snap revision (120, should have been 143 per https://github.com/canonical/postgresql-operator/blob/rev553/src/constants.py#L37) that prevented Patroni from starting:

2025-03-19T18:33:50Z charmed-postgresql.patroni[4056206]: 10.146.64.36:5432 - no response
2025-03-19T18:33:50Z charmed-postgresql.patroni[4056205]: 2025-03-19 18:33:50 UTC [4056205]: user=,db=,app=,client=,line=1 FATAL:  could not access file "pgaudit": No such file or directory
2025-03-19T18:33:50Z charmed-postgresql.patroni[4056205]: 2025-03-19 18:33:50 UTC [4056205]: user=,db=,app=,client=,line=2 LOG:  database system is shut down

We believe this may have been due to the way we refreshed the charm revision with Terraform, but we need to investigate further.

We fixed the unhappy state in this case by running the following on all units:

juju exec --application postgresql -- sudo snap refresh --unhold charmed-postgresql
juju exec --application postgresql -- sudo snap refresh --revision 143 charmed-postgresql
juju exec --application postgresql -- sudo snap refresh --hold charmed-postgresql

We then started Patroni manually on the most recently primary unit first, followed by the remaining units:

juju exec --unit postgresql/4 -- sudo snap start charmed-postgresql.patroni
# wait for the process to come up
juju exec --unit postgresql/0 -- sudo snap start charmed-postgresql.patroni
juju exec --unit postgresql/5 -- sudo snap start charmed-postgresql.patroni

One of the environments looks to be back to full health. The other is now seeing behavior like that in #784, which needs to be investigated further.

@taurus-forever
Contributor

Thank you @cmisare! JFYI, we are in the process of migrating to a new, heavily refactored upgrade library, 'refresh v3'.

  1. Massive UI and UX improvements.
  2. It will NOT require a manual pre-upgrade-check (it will be executed internally IF missed by the user).
  3. It will also check internal resource versions and inform users about mismatches between the snap/OCI and the charm.
    ...but that is a story for other ticket(s). Stay tuned!

Regarding this ticket: @ethanmye-rs, is it still reproducible on your side with the latest 14/stable? If so, let's check it together! Thank you!

@ethanmye-rs
Member Author

Hi @taurus-forever

I am sorry, but I no longer have access to the environment. @marceloneppel was very helpful in getting the streaming started again; the steps are documented in this comment.

I would still support having a limit on WAL archive size (so it cannot eat up a whole disk and bring down a machine) and surfacing this in the charm or in COS as an alert/issue.

IMO, this can be closed.

@taurus-forever
Contributor

@ethanmye-rs thank you! The wal_keep_size option is already part of PostgreSQL VM and K8s. Is that what you were looking for?

Did I read your comment correctly that we no longer have a place to reproduce the issue? Should the ticket be resolved?

@ethanmye-rs
Member Author

Sure, once it is added to both, I think it will keep issues like this from coming up in the future.

The original environment is no longer available, but you can reproduce the issue easily enough by letting the WAL grow, e.g. by knocking Patroni cluster members out of streaming/running status. I have marked this issue as closed.

Thanks for working on this and adding the option!


8 participants