2/3 PG units stuck in waiting/idle state, not moving to active/idle #668
Thank you for reporting your feedback! The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5931.
Hi, @ethanmye-rs, it seems like Patroni's health endpoint is failing, likely due to the
Pulled the logs for the two failed units, located in Google Drive here. Not sure if there is anything sensitive in them (they are ~200 MB total, uncompressed) so I've made them Canonical internal. I will also pull the logs for the active/idle unit, but they are much larger, about 2.4 GB uncompressed.
Thanks @marceloneppel very much for the help. The core issue was missing pg_wal data, which prevented the other two units from moving past the starting state. A few misc things:
- The core reason the replica machines could not start was missing pg_wal data. The data went missing because, at an earlier point, Postgres exhausted the 64 GB VM disk and I was forced to restart the machine. I believe pg clears pg_wal data on reboot, which is probably why the replicas are missing it.
- The main database is not very large, maybe 6-7 GB based on the data in /var/snap/charmed-postgresql/common/var/lib/postgresql/, so it is surprising that pg_wal grew so large. It would be nice to set a charm limit on pg_wal to avoid getting into this issue.
- Based on the logs, it seems like if upgrading from 429 -> 468, you will still see superfluous log entries for charmed-postgresql.pgbackrest-service and similar failing to start.
- It would also be nice to surface a warning (either in Juju or COS) if one of the Patroni members is in anything but a streaming or running state. Having machines stuck in starting surfaced no errors.

For future reference, this is how we checked the Patroni status:
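The actual command was lost in this thread's formatting; below is a hedged sketch of how member states are typically checked with the snap's bundled patronictl. The config path mirrors the reinit command quoted later in this thread; the wrapper function name is mine.

```shell
# Sketch only: list Patroni member states via the snap's patronictl.
# The config path matches the one used elsewhere in this thread.
check_patroni_status() {
  sudo -H -u snap_daemon charmed-postgresql.patronictl \
    -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
}
```

Healthy clusters typically show the leader as running and replicas as streaming; a member stuck in starting matches the symptom described above.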
and for reiniting the followers, to be run on one of the follower machines:
You can query the state by looking at either the cluster/ endpoint or by catting the corresponding state file on disk.
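For the endpoint route, a minimal sketch, assuming Patroni's default REST port 8008 and a reachable unit address (the helper name is mine):

```shell
# Sketch only: fetch cluster-wide member state from Patroni's REST API.
# 8008 is Patroni's default REST port; pass a unit IP, or it defaults to localhost.
patroni_cluster_state() {
  curl -s "http://${1:-localhost}:8008/cluster"
}
```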
My 2 cents, as I was hit by this bug as well, using this version of the charm:
I confirm that doing a reinit of the member fixed the "issue". I still have a different role for my 2 standby units though:
A user also experienced this issue on revision 545. In the patroni logs, I see
Before the first occurrence of this line in the logs, there was additional context:

Jan 09 13:54:45 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[980280]: pg_basebackup: error: could not get COPY data stream: ERROR: the standby was promoted during online backup
Jan 09 13:54:45 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[980280]: HINT: This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
Jan 09 13:54:45 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[980280]: pg_basebackup: removing data directory "/var/snap/charmed-postgresql/common/var/lib/postgresql"
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopping Service for snap application charmed-postgresql.patroni...
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Deactivated successfully.
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopped Service for snap application charmed-postgresql.patroni.
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Consumed 9.929s CPU time.
Jan 09 13:55:15 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Started Service for snap application charmed-postgresql.patroni.
Jan 09 13:59:20 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopping Service for snap application charmed-postgresql.patroni...
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: State 'final-sigterm' timed out. Killing.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 981390 (python3) with signal SIGKILL.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 984912 (pg_basebackup) with signal SIGKILL.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Killing process 984915 (pg_basebackup) with signal SIGKILL.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Failed with result 'timeout'.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Unit process 981390 (python3) remains running after unit stopped.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Unit process 984912 (pg_basebackup) remains running after unit stopped.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Unit process 984915 (pg_basebackup) remains running after unit stopped.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Stopped Service for snap application charmed-postgresql.patroni.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: snap.charmed-postgresql.patroni.service: Consumed 1min 11.395s CPU time.
Jan 09 13:59:50 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo systemd[1]: Started Service for snap application charmed-postgresql.patroni.
Jan 09 13:59:55 juju-f46785-is-managed-database-prod-marketing-airbyte-marketo charmed-postgresql.patroni[985096]: pg_controldata: fatal: could not open file "/var/snap/charmed-postgresql/common/var/lib/postgresql/global/pg_control" for reading: No such file or directory

So pg_basebackup removed the directory, was killed, and then, when Patroni restarted, it could not find the directory? This error has been persistent in the logs since Jan 09, when this happened:

is-managed-database-prod-marketing-airbyte-marketo@is-bastion-ps6:~$ cat patroni.log | grep "could not open file" | wc -l
268331
Hi, @alexdlukens! Thanks for the logs. I did some investigation and reproduced the issue locally, but I'd like to check it in your environment to see how it's happening there. The issue is a little different from the one in the first comment of this GH issue. I'll be on PTO until March 14th. Can I contact you from March 17th onwards to check the issue with you? Thanks.
Hello @marceloneppel, I think this was worked around during our IS - Data Platform sync meeting. I will share what was done here:

sudo systemctl stop jujud-machine-<X>.service
sudo -H -u snap_daemon charmed-postgresql.patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml reinit postgresql postgresql-4
sudo systemctl start jujud-machine-<X>.service

Replacing postgresql-4 with the respective unit name. Please do reach out when you are back. We have several deployments; I am sure we can find an additional one with a similar issue.
Confirmed #668 (comment) fixed it for me. Was:
Then ran:
Now:
I went around repairing the stuck units on a few environments/models, but on two it is failing with
Any ideas what to do for these cases?
Dear @hloeung and @alexdlukens, could you please contact @marceloneppel on MM/Matrix to look into your issue?
@taurus-forever we met with @marceloneppel to troubleshoot the 2 environments mentioned in #668 (comment). In both cases there was a mismatch between the charm revision (553) and the installed charmed-postgresql snap revision (120; it should have been 143 per https://github.com/canonical/postgresql-operator/blob/rev553/src/constants.py#L37) that prevented Patroni from starting:
We believe this may have been due to the way we refreshed the charm revision with Terraform, but we need to investigate further. We fixed the unhappy state in this case by running the following on all units:
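The exact commands were not captured in this thread; a minimal sketch of re-aligning the snap revision, assuming revision 143 from the constants file linked above (the helper name and the stop-before-refresh ordering are my assumptions):

```shell
# Sketch only: pin the charmed-postgresql snap to the revision the charm expects.
# Stopping Patroni first (assumption) avoids refreshing under a live service.
realign_snap_revision() {
  sudo snap stop charmed-postgresql.patroni
  sudo snap refresh charmed-postgresql --revision=143
}
```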
We then started Patroni manually, first on the unit that was most recently primary, followed by the remaining units:
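The start commands were also not captured; a sketch assuming the snap service name that appears in the systemd logs earlier in this thread:

```shell
# Sketch only: start the snap-managed Patroni service on one unit.
# Run on the unit that was most recently primary first, then on the rest.
start_patroni() {
  sudo snap start charmed-postgresql.patroni
}
```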
One of the environments looks to be back to full health. The other environment is now seeing behavior like that in #784, which needs to be investigated further there.
Thank you @cmisare! JFYI, we are in the process of migrating to the new, heavily refactored upgrade library 'refresh v3'.
Regarding this ticket, @ethanmye-rs, is it still reproducible on your side with the latest 14/stable? If so, let's check it together! Thank you!
I am sorry, but I no longer have access to the environment. @marceloneppel was very helpful in getting the streaming started again; the steps are documented in this comment. I would still support having a limit on WAL archive size (so it cannot eat up a whole disk and bring down a machine) and surfacing this in the charm or in COS as an alert/issue. IMO, this can be closed.
@ethanmye-rs thank you! Did I read your comment correctly: we do not have a place to reproduce the issue? Should the ticket be resolved?
Sure, once it is added in both, I think it will keep issues like this from coming up in the future. The original environment is no longer available, but you can reproduce the issue easily enough by letting the WAL grow, e.g. by knocking Patroni cluster members out of streaming/running status. I have marked this issue as closed. Thanks for working on this and adding the option!
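A hedged sketch of that reproduction idea. The pg_wal path is the one reported earlier in this thread; the slot-retention behavior assumes Patroni's default use of replication slots, and the helper name is mine.

```shell
# Sketch only: provoke WAL growth on the primary by taking a replica out of streaming.
# On a replica:  sudo snap stop charmed-postgresql.patroni
# With replication slots, the primary then retains WAL for the stopped member.
# Watch pg_wal on the primary grow toward disk exhaustion:
watch_wal_size() {
  sudo du -sh /var/snap/charmed-postgresql/common/var/lib/postgresql/pg_wal
}
```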
Steps to reproduce
a. I do not have a firm reproducer, but I ran into this issue upgrading from rev 429 to rev 468 in a Charmed Landscape deployment. I originally encountered the issue on rev 429 and, based on a prior bug, expected refreshing to 468 would fix it. However, I still see my PG units not starting, stuck in an "awaiting for member to start" state.
b. I did not encounter this issue on another cluster in an identical environment, so it seems somewhat random. The machines in the juju model are manual machines in Azure.
Expected behavior
I expect the other 2 units to start and enter an active/idle state. They have been stuck for >48 hours.
Actual behavior
See the logs below: the machines cycle through waiting/executing states but never enter active/idle as expected.
Versions
Operating system: 22.04.4
Juju CLI: 3.5.4
Juju agent: 3.5.4
Charm revision: 468
LXD: n/a
Log output
juju debug log: https://paste.ubuntu.com/p/FzXnjMpNYz/
snap logs from one unit failing to start: https://paste.ubuntu.com/p/St8WZNn4GT/ (restart at the end of the log file)
snap logs from other unit failing to start: https://paste.ubuntu.com/p/BH3RXfZrTW/
snap logs from healthy unit: https://paste.ubuntu.com/p/b6bgSVZKYm/
pg snap services config: https://paste.ubuntu.com/p/xJJq6ktXm9/
Happy to provide more logs, details or access to the environment. Thanks.