2/3 PG units stuck in waiting/idle state, not moving to active/idle #668
Comments
Thank you for reporting us your feedback! The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5931.
Hi, @ethanmye-rs, it seems like Patroni's health endpoint is failing, likely due to the |
I pulled the logs for the two failed units and put them in Google Drive here. I am not sure whether there is anything sensitive in them (they are about 200MB total, uncompressed), so I've made them Canonical internal. I will also pull the logs for the active/idle unit, but they are much larger, about 2.4GB uncompressed.
Thanks very much for the help, @marceloneppel. The core issue was missing pg_wal data, which prevented the other two units from moving past the starting state. A few miscellaneous things:
- The replica machines could not start because pg_wal data was missing. At an earlier point, postgres exhausted the 64GB VM disk and I was forced to restart the machine, which is probably why the replicas are missing pg_wal data. I believe pg clears the pg_wal data on reboot. However, the main database is not very large, maybe 6-7GB based on the data in /var/snap/charmed-postgresql/common/var/lib/postgresql/, so it is surprising that pg_wal is so large. It would be nice to set a charm limit on pg_wal to avoid getting into this situation.
- Based on the logs, when upgrading from rev 429 to rev 468 you will still see superfluous log entries for charmed-postgresql.pgbackrest-service and similar services failing to start.
- It would also be nice to surface a warning (either in Juju or COS) if any Patroni member is in anything other than a streaming or running state. Having machines stuck in starting surfaced no errors.
For future reference, this is how we checked the patroni status:
and for reiniting the followers, to be run on one of the follower machines:
You can query the state by looking at either the cluster/ endpoint or catting |
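As a rough illustration of the state check and the suggested warning, here is a minimal Python sketch. It is not the set of commands used in this thread. It assumes the Patroni REST API is reachable over plain HTTP at a base URL you supply, and that the `cluster` endpoint returns JSON with a `members` list whose entries carry `name` and `state` fields. The helper names, the example URL, and the sample member names are hypothetical.

```python
import json
import urllib.request

# States treated as healthy, per the comment above ("streaming or running").
HEALTHY_STATES = {"running", "streaming"}


def fetch_cluster(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the cluster view from a Patroni REST API.

    base_url is an assumption, e.g. "http://10.0.0.5:8008"; adjust the
    scheme, host, and port for your deployment (TLS is not handled here).
    """
    url = f"{base_url.rstrip('/')}/cluster"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_members(cluster: dict) -> list:
    """Return (name, state) pairs for members that are not running/streaming."""
    return [
        (member.get("name", "?"), member.get("state", "unknown"))
        for member in cluster.get("members", [])
        if member.get("state") not in HEALTHY_STATES
    ]


if __name__ == "__main__":
    # Hypothetical sample payload; replace with fetch_cluster("<base url>").
    sample = {
        "members": [
            {"name": "postgresql-0", "role": "leader", "state": "running"},
            {"name": "postgresql-1", "role": "replica", "state": "starting"},
            {"name": "postgresql-2", "role": "replica", "state": "starting"},
        ]
    }
    for name, state in unhealthy_members(sample):
        print(f"WARNING: Patroni member {name} is in state {state!r}")
```

A check like this could run periodically and feed an alert, so that members stuck in starting are reported instead of passing silently.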
Steps to reproduce
a. I do not have a firm reproducer, but I ran into this issue when upgrading from rev 429 to rev 468 in a Charmed Landscape deployment. I originally encountered the issue in rev 429 and, based on a prior bug, expected that refreshing to 468 would fix it. However, my pg units still do not start and remain in an "awaiting for member to start" state.
b. I did not encounter this issue on another cluster in an identical environment, so it seems somewhat random. The machines in the juju model are manual machines in Azure.
Expected behavior
I expect the other 2 units to start and enter an active/idle state. Instead, they have been stuck for more than 48 hours.
Actual behavior
See the logs below. The machines cycle through waiting/executing states but never enter active/idle as expected.
Versions
Operating system: 22.04.4
Juju CLI: 3.5.4
Juju agent: 3.5.4
Charm revision: 468
LXD: n/a
Log output
juju debug log: https://paste.ubuntu.com/p/FzXnjMpNYz/
snap logs from one unit failing to start: https://paste.ubuntu.com/p/St8WZNn4GT/ (restart at the end of the log file)
snap logs from other unit failing to start: https://paste.ubuntu.com/p/BH3RXfZrTW/
snap logs from healthy unit: https://paste.ubuntu.com/p/b6bgSVZKYm/
pg snap services config: https://paste.ubuntu.com/p/xJJq6ktXm9/
Happy to provide more logs, details or access to the environment. Thanks.