
2/3 PG units stuck in waiting/idle state, not moving to active/idle #668

Closed
ethanmye-rs opened this issue Nov 11, 2024 · 16 comments
Labels: bug (Something isn't working as expected)

@ethanmye-rs
Member

Steps to reproduce

a. I do not have a firm reproducer, but I ran into this issue while upgrading from rev 429 to rev 468 in a charmed Landscape deployment. I originally encountered the issue on rev 429 and, based on a prior bug, expected that refreshing to 468 would fix it. However, my PG units still do not start and remain stuck in an "awaiting for member to start" state.
b. I did not encounter this issue on another cluster in an identical environment, so it seems somewhat random. The machines in the Juju model are manual machines in Azure.

  1. Essentially, 2/3 postgres units stay stuck in an "awaiting for member to start" state. They cycle through various waiting and executing states, but the PG units never actually start.

Expected behavior

I expect the other 2 units to start and enter an active/idle state. They have been stuck for >48 hours.

Actual behavior

(screenshot of unit status omitted)

See logs below; the machines cycle through waiting/executing states but never enter active/idle as expected.

Versions

Operating system: 22.04.4

Juju CLI: 3.5.4

Juju agent: 3.5.4

Charm revision: 468

LXD: n/a

Log output

juju debug log: https://paste.ubuntu.com/p/FzXnjMpNYz/
snap logs from one unit failing to start: https://paste.ubuntu.com/p/St8WZNn4GT/ (restart at the end of the log file)
snap logs from other unit failing to start: https://paste.ubuntu.com/p/BH3RXfZrTW/
snap logs from healthy unit: https://paste.ubuntu.com/p/b6bgSVZKYm/
pg snap services config: https://paste.ubuntu.com/p/xJJq6ktXm9/

Happy to provide more logs, details or access to the environment. Thanks.

@ethanmye-rs ethanmye-rs added the bug Something isn't working as expected label Nov 11, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5931.

This message was autogenerated

@dragomirp
Contributor

Hi, @ethanmye-rs, it seems like Patroni's health endpoint is failing, likely due to the "rejecting connections" errors in the snap logs. This appears to be caused by pg_isready, which Patroni calls, failing with exit code 1. Can you check whether you have any PostgreSQL logs on the failing units from around the time they failed? They should be located in /var/snap/charmed-postgresql/common/var/log/postgresql/. There may be more details in the Patroni logs located in /var/snap/charmed-postgresql/common/var/log/patroni/.

@ethanmye-rs
Member Author

ethanmye-rs commented Nov 12, 2024

Pulled the logs for the two failed units, located in Google Drive here. Not sure if there is anything sensitive in them (they are ~200 MB total, uncompressed), so I've made them Canonical-internal. I will also pull the logs for the active/idle unit, but they are much larger, about 2.4 GB uncompressed.

@ethanmye-rs
Member Author

ethanmye-rs commented Nov 12, 2024

Thanks @marceloneppel very much for the help. The core issue was missing pg_wal data, which prevented the other two units from moving past the starting state. A few misc things:

The core reason the replica machines could not start was missing pg_wal data. The data was missing because, at an earlier point, PostgreSQL exhausted the 64 GB VM disk and I was forced to restart the machine; I believe PG clears pg_wal data on reboot. However, the main database is not very large, maybe 6-7 GB based on the data in /var/snap/charmed-postgresql/common/var/lib/postgresql/, so it is surprising that pg_wal grew so large. It would be nice to set a charm limit on pg_wal to avoid getting into this situation.

Based on the logs, it seems that when upgrading from 429 -> 468, you will still see superfluous log entries for charmed-postgresql.pgbackrest-service and similar services failing to start.

It would also be nice to surface a warning (either in Juju or COS) if one of the Patroni members is in anything but a streaming or running state. Having machines stuck in starting surfaced no errors.
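Until a charm-level limit exists, pg_wal growth like that described above can be capped at the PostgreSQL level. A sketch of the relevant postgresql.conf parameters (the values here are purely illustrative, not recommendations for this deployment):

```ini
# postgresql.conf -- illustrative values only
max_wal_size = 2GB             # soft limit before a checkpoint is forced
wal_keep_size = 1GB            # WAL retained for standbys (PostgreSQL 13+)
max_slot_wal_keep_size = 4GB   # cap WAL retained by replication slots
```

Note that wal_keep_size is a trade-off: a replica that falls further behind than the retained WAL must be reinitialized from a base backup, but the disk can no longer be filled without bound.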

For future reference, this is how we checked the patroni status:

curl <current active pg unit ip>:8008/cluster

and for reiniting the followers, to be run on one of the follower machines:

sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml reinit postgresql

You can query the state by looking at either the /cluster endpoint or the log at /var/snap/charmed-postgresql/common/var/log/patroni/patroni.log.
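The manual check above could be automated into the kind of alert suggested earlier. A minimal sketch, assuming the JSON shape of Patroni's documented `/cluster` endpoint (the sample payload below is hypothetical, standing in for `curl <leader-ip>:8008/cluster` output):

```python
import json

# Patroni member states considered healthy.
HEALTHY_STATES = {"running", "streaming"}

def unhealthy_members(cluster):
    """Return (name, state) pairs for members not running/streaming."""
    return [(m["name"], m["state"])
            for m in cluster.get("members", [])
            if m.get("state") not in HEALTHY_STATES]

# Hypothetical payload mirroring the /cluster endpoint's response.
sample = json.loads("""
{"members": [
  {"name": "postgresql-0", "role": "replica", "state": "starting"},
  {"name": "postgresql-1", "role": "leader", "state": "running"},
  {"name": "postgresql-2", "role": "sync_standby", "state": "streaming"}
]}
""")

print(unhealthy_members(sample))  # [('postgresql-0', 'starting')]
```

A check like this, run periodically, would have flagged the units stuck in starting long before the >48-hour mark.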

@ben-ballot

My 2 cents, as I was hit by this bug as well, using this version of the charm:

postgresql                        14.9     active      3  postgresql        14/candidate   331  yes

I confirm that doing a reinit of the member fixed the "issue".

I still have a different role for my 2 standby units though:

$ sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql ---------+--------------+---------+-----+-----------+-----------------+
| Member       | Host          | Role         | State   |  TL | Lag in MB | Pending restart |
+--------------+---------------+--------------+---------+-----+-----------+-----------------+
| postgresql-3 | 10.141.31.28  | Leader       | running | 184 |           | *               |
| postgresql-4 | 10.141.31.216 | Replica      | running | 184 |         0 | *               |
| postgresql-5 | 10.141.31.15  | Sync Standby | running | 184 |         0 | *               |
+--------------+---------------+--------------+---------+-----+-----------+-----------------+

@alexdlukens

A user also experienced this issue on revision 545, in the is-managed-database-prod-marketing-airbyte-marketo environment, although the above command to reinit the replica units does not seem to work.

In the Patroni logs, I see:

charmed-postgresql.patroni[180289]: pg_controldata: fatal: could not open file "/var/snap/charmed-postgresql/common/var/lib/postgresql/global/pg_control" for reading: No such file or directory

Before the first occurrence of this line in the logs, there was additional context:

Jan 09 13:54:45 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[980280]: pg_basebackup: error: could not get COPY data stream: ERROR:  the standby was promoted during online backup
Jan 09 13:54:45 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[980280]: HINT:  This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
Jan 09 13:54:45 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[980280]: pg_basebackup: removing data directory "/var/snap/charmed-postgresql/common/var/lib/postgresql"
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopping Service for snap application charmed-postgresql.patroni...
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Deactivated successfully.
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopped Service for snap application charmed-postgresql.patroni.
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Consumed 9.929s CPU time.
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Started Service for snap application charmed-postgresql.patroni.
Jan 09 13:59:20 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopping Service for snap application charmed-postgresql.patroni...
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: State 'final-sigterm' timed out. Killing.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 981390 (python3) with signal SIGKILL.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 984912 (pg_basebackup) with signal SIGKILL.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 984915 (pg_basebackup) with signal SIGKILL.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Failed with result 'timeout'.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Unit process 981390 (python3) remains running after unit stopped.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Unit process 984912 (pg_basebackup) remains running after unit stopped.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Unit process 984915 (pg_basebackup) remains running after unit stopped.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopped Service for snap application charmed-postgresql.patroni.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Consumed 1min 11.395s CPU time.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Started Service for snap application charmed-postgresql.patroni.
Jan 09 13:59:55 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[985096]: pg_controldata: fatal: could not open file "/var/snap/charmed-postgresql/common/var/lib/postgresql/global/pg_control" for reading: No such file or directory

So pg_basebackup removed the data directory, was killed, and then, when Patroni restarted, it could not find the directory?

This error has been persistent in the logs since Jan 09, when this happened:

is-managed-database-prod-marketing-airbyte-marketo@is-bastion-ps6:~$ cat patroni.log | grep "could not open file" | wc -l
268331

@marceloneppel
Member

marceloneppel commented Feb 27, 2025

Hi, @alexdlukens! Thanks for the logs.

I did some investigation and reproduced the issue locally, but I'd like to check it in your environment to see how it's happening there. The issue is slightly different from the one in the first comment of this GitHub issue.

I'll be on PTO until March 14th. Can I contact you from March 17th onwards to check the issue with you? Thanks.

@alexdlukens

alexdlukens commented Feb 27, 2025

Hello, @marceloneppel

I think this was worked around during our IS / Data Platform sync meeting. I will share what was done here:

sudo systemctl stop jujud-machine-<X>.service
sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml reinit postgresql postgresql-4
sudo systemctl start jujud-machine-<X>.service

Replacing postgresql-4 with the respective unit name.

Please do reach out when you are back. We have several deployments; I am sure we can find an additional one with a similar issue.

@hloeung
Contributor

hloeung commented Mar 3, 2025

Confirmed #668 (comment) fixed it for me:

Was:

postgresql/0              waiting   idle   3        10.146.64.196   5432/tcp  awaiting for member to start
  landscape-client/3      active    idle            10.146.64.196             Client registered!
  telegraf/3              active    idle            10.146.64.196   9103/tcp  Monitoring postgresql/0 (source version/commit 23.10)
  ubuntu-advantage/3      active    idle            10.146.64.196             Attached (esm-apps,esm-infra,livepatch)

Then ran:

ubuntu@juju-69c4ae-dbaas-stg-data-mesh-el-netsuite-3:~$ sudo systemctl stop jujud-machine-3.service
ubuntu@juju-69c4ae-dbaas-stg-data-mesh-el-netsuite-3:~$ sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml reinit postgresql postgresql-0
+ Cluster: postgresql (7468654271286863093) --+-----------+----+-----------+
| Member       | Host          | Role         | State     | TL | Lag in MB |
+--------------+---------------+--------------+-----------+----+-----------+
| postgresql-0 | 10.146.64.196 | Replica      | stopped   |    |   unknown |
| postgresql-1 | 10.146.64.206 | Sync Standby | streaming |  6 |         0 |
| postgresql-2 | 10.146.64.203 | Leader       | running   |  6 |           |
+--------------+---------------+--------------+-----------+----+-----------+
Are you sure you want to reinitialize members postgresql-0? [y/N]: y
Success: reinitialize for member postgresql-0
ubuntu@juju-69c4ae-dbaas-stg-data-mesh-el-netsuite-3:~$ sudo systemctl start jujud-machine-3.service

Now:

postgresql/0              active    idle   3        10.146.64.196   5432/tcp
  landscape-client/3      active    idle            10.146.64.196             Client registered!
  telegraf/3              active    idle            10.146.64.196   9103/tcp  Monitoring postgresql/0 (source version/commit 23.10)
  ubuntu-advantage/3      active    idle            10.146.64.196             Attached (esm-apps,esm-infra,livepatch)

@hloeung
Contributor

hloeung commented Mar 3, 2025

I went around fixing the stuck units on a few environments/models, but on two of them it is failing with start failed:

ubuntu@juju-30681d-is-managed-database-prod-support-airbyte-jira-3:~$ sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql (7446670001525216935) -+--------------+----+-----------+-----------------+
| Member       | Host         | Role         | State        | TL | Lag in MB | Pending restart |
+--------------+--------------+--------------+--------------+----+-----------+-----------------+
| postgresql-0 | 10.146.64.7  | Replica      | start failed |    |   unknown |                 |
| postgresql-1 | 10.146.64.11 | Leader       | running      | 56 |           | *               |
| postgresql-2 | 10.146.64.5  | Sync Standby | streaming    | 56 |         0 | *               |
+--------------+--------------+--------------+--------------+----+-----------+-----------------+
ubuntu@juju-f46785-is-managed-database-prod-marketing-airbyte-marketo:~$ sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
+ Cluster: postgresql (7443567519097256572) -+--------------+-----+-----------+-----------------+
| Member       | Host         | Role         | State        |  TL | Lag in MB | Pending restart |
+--------------+--------------+--------------+--------------+-----+-----------+-----------------+
| postgresql-0 | 10.146.64.39 | Sync Standby | streaming    | 127 |         0 | *               |
| postgresql-4 | 10.146.64.36 | Leader       | running      | 127 |           | *               |
| postgresql-5 | 10.146.64.43 | Replica      | start failed |     |   unknown |                 |
+--------------+--------------+--------------+--------------+-----+-----------+-----------------+

Any ideas what to do for these cases?

@taurus-forever
Contributor

Dear @hloeung and @alexdlukens, can you please contact @marceloneppel on MM/Matrix to check your issue?
We suspect several independent issues are mixed in this bug report due to the common UI status "awaiting for member to start".
Let's try to separate these cases, starting from your subcases. Thanks!

@cmisare

cmisare commented Mar 21, 2025

@taurus-forever we met with @marceloneppel to troubleshoot the 2 environments mentioned in #668 (comment). In both cases there was a mismatch between the charm revision (553) and the installed charmed-postgresql snap revision (120, should have been 143 per https://github.com/canonical/postgresql-operator/blob/rev553/src/constants.py#L37) that prevented Patroni from starting:

2025-03-19T18:33:50Z charmed-postgresql.patroni[4056206]: 10.146.64.36:5432 - no response
2025-03-19T18:33:50Z charmed-postgresql.patroni[4056205]: 2025-03-19 18:33:50 UTC [4056205]: user=,db=,app=,client=,line=1 FATAL:  could not access file "pgaudit": No such file or directory
2025-03-19T18:33:50Z charmed-postgresql.patroni[4056205]: 2025-03-19 18:33:50 UTC [4056205]: user=,db=,app=,client=,line=2 LOG:  database system is shut down

We believe this may have been due to the way we refreshed the charm revision with Terraform, but we need to investigate further.

We fixed the unhappy state in this case by running the following on all units:

juju exec --application postgresql -- sudo snap refresh --unhold charmed-postgresql
juju exec --application postgresql -- sudo snap refresh --revision 143 charmed-postgresql
juju exec --application postgresql -- sudo snap refresh --hold charmed-postgresql

We then started Patroni manually on the most recently primary unit first, followed by the remaining units:

juju exec --unit postgresql/4 -- sudo snap start charmed-postgresql.patroni
# wait for the process to come up
juju exec --unit postgresql/0 -- sudo snap start charmed-postgresql.patroni
juju exec --unit postgresql/5 -- sudo snap start charmed-postgresql.patroni

One of the environments looks to be back to full health. The other is now seeing behavior like that in #784, which needs to be investigated further.

@taurus-forever
Contributor

Thank you @cmisare! JFYI, we are in the process of migrating to a new, heavily refactored upgrade library, 'refresh v3'.

  1. Massive UI and UX improvements.
  2. It will NOT require a manual pre-upgrade-check (it will be executed internally IF missed by the user).
  3. It will also check internal resource versions and inform users about mismatches between the snap/OCI and the charm.
    ...but that is a story for other ticket(s). Stay tuned!

Regarding this ticket: @ethanmye-rs, is it still reproducible on your side with the latest 14/stable? If so, let's check it together! Thank you!

@ethanmye-rs
Member Author

Hi @taurus-forever

I am sorry, but I no longer have access to the environment. @marceloneppel was very helpful in getting the streaming started again; the steps are documented in this comment.

I would still support having a limit on WAL archive size (so it cannot eat up a whole disk and bring down a machine) and surfacing this in the charm or in COS as an alert/issue.

IMO, this can be closed.

@taurus-forever
Contributor

@ethanmye-rs thank you! The wal_keep_size option is already part of PostgreSQL VM and K8s. Is that what you were looking for?

Did I read your comment correctly that we no longer have a place to reproduce the issue? Should the ticket be resolved?

@ethanmye-rs
Member Author

Sure, once it is added to both, I think it will keep issues like this from coming up in the future.

The original environment is no longer available, but you can reproduce the issue easily enough by letting the WAL grow, e.g. by knocking Patroni cluster members out of streaming/running status. I have marked this issue as closed.

Thanks for working on this and adding the option!


8 participants