Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PS-9453: percona_telemetry causes a long wait on COND_thd_list due to the absence of the root user #5452

Merged
merged 2 commits into from
Oct 16, 2024

Conversation

kamil-holubicki
Copy link
Contributor

@kamil-holubicki kamil-holubicki commented Oct 11, 2024

https://perconadev.atlassian.net/browse/PS-9453

Problem:
If there is no 'root' user (it was renamed) and Percona Telemetry is
enabled, server shutdown stuck.
Additionally SHOW PROCESSLIST reports increasing number of processes
with root user in state 'login'.

Cause:
Percona Telemetry component used hard codded 'root' user used for
querying the server for metrics. If the user is not present,
mysql_command_factory service connect() method fails, however it leaves
opened/orphaned internal MYSQL_SESSION. It is because after opening
MYSQL_SESSION we check if the user exists. It doesn't exist, so the
method returns with STATE_MACHINE_FAILED error.
Caller does not know anything about underlying session, does cleanup,
frees memory. Moreover, the same problem potentially exists even if
the user exists, but cssm_begin_connect() exits with error for any
reason.
Creation of session registers THD object in Global_THD_manager. That's
why increasing number of processes is visible in SHOW PROCESSLIST.

Why it hangs during shutdown:
During shutdown, the server waits for signal handler thread
(signal_hand()) to join the main thread. Then the component
infrastructure is deinitialized. At the same time signal_hand() waits
for all threads to finish Global_THD_manager::wait_till_no_thd().

The above described bug related to orphaned mysql sessions cause that
there are orphaned THDs that never end. This causes server to wait
infinitively.

Solution:

  1. Handle the error in cssm_begin_connect() and close opened
    MYSQL_SESSION. This part fixes hangs caused by nonexistent user.
  2. Percona Telemetry Component uses mysql.session user. When Percona
    Telemetry Component is enabled, the user is granted a few more
    permissions during the server startup. Note that even without this
    permissions, the component is able to work, but not able to report some
    metrics.

Additionally fixed potential memory leak if connection setup in Percona
Telemetry Component fails.

Copy link
Collaborator

@percona-ysorokin percona-ysorokin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

file permissions and typos need to be fixed, everything else - just suggestions for improvements.

mysql-test/r/information_schema_ci.result Show resolved Hide resolved
cleaned up immediately by the caller. The caller does not take care of
session_svc, because it is not aware of this structure.
*/
std::shared_ptr<MYSQL_SESSION> mysql_session_close_guard(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was struggling with the same double indirection problem some time ago.
MYSQL_SESSION is by itself a pointer to an opaque structure. So, I did this trick with
std::remove_pointer_t<>;
https://github.com/percona/percona-server/blob/8.0/components/masking_functions/src/masking_functions/sql_context.cpp#L105
In my opinion this make the code more readable.
The only problem I see if you go my way, you won't be able to do mysql_session = nullptr. You will need
to call something like mysql_session_close_guard.release() but for this, in turns you will need std::unique_ptr<> with more boilerplate instead of std::shared_ptr<>.
In any case this is just a butification comment.

&mysql_h, [&srv = command_factory_service_](MYSQL_H *ptr) {
srv.close(*ptr);
});

mysql_service_status_t sstatus = command_factory_service_.init(&mysql_h);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another improvement I discovered recently, which may help simplify things a bit.
Oracle recently introduced a new MYSQL_COMMAND_LOCAL_THD_HANDLE option for mysql_command_service.
Basically, then you set this option with nullptr value, it will create an internal THD object, initialize its session vars, calls my_thread_init() / my_thread_end(), sets current_thd, etc. This THD object will be destroyed when you call mysql_command_service::close().
percona-ysorokin@c9a98d3#diff-3706ffbbf71661ebd0414b16d27fd6a1cc8f6960488d43d5ad51d9b15fe9ac00R51

… the absence of the root user

https://perconadev.atlassian.net/browse/PS-9453

Problem:
If there is no 'root' user (it was renamed) and Percona Telemetry is
enabled, server shutdown stuck.
Additionally SHOW PROCESSLIST reports increasing number of processes
with root user in state 'login'.

Cause:
Percona Telemetry component used hard codded 'root' user used for
querying the server for metrics. If the user is not present,
mysql_command_factory service connect() method fails, however it leaves
opened/orphaned internal MYSQL_SESSION. It is because after opening
MYSQL_SESSION we check if the user exists. It doesn't exist, so the
method returns with STATE_MACHINE_FAILED error.
Caller does not know anything about underlying session, does cleanup,
frees memory. Moreover, the same problem potentially exists even if
the user exists, but cssm_begin_connect() exits with error for any
reason.
Creation of session registers THD object in Global_THD_manager. That's
why increasing number of processes is visible in SHOW PROCESSLIST.

Why it hangs during shutdown:
During shutdown, the server waits for signal handler thread
(signal_hand()) to join the main thread. Then the component
infrastructure is deinitialized. At the same time signal_hand() waits
for all threads to finish Global_THD_manager::wait_till_no_thd().

The above described bug related to orphaned mysql sessions cause that
there are orphaned THDs that never end. This causes server to wait
infinitively.

Solution:
1. Handle the error in cssm_begin_connect() and close opened
MYSQL_SESSION. This part fixes hangs caused by nonexistent user.
2. Percona Telemetry Component uses mysql.session user. When Percona
Telemetry Component is enabled, the user is granted a few more
permissions during the server startup. Note that even without this
permissions, the component is able to work, but not able to report some
metrics.

Additionally fixed potential memory leak if connection setup in Percona
Telemetry Component fails.
@kamil-holubicki kamil-holubicki merged commit 60a65b3 into percona:8.0 Oct 16, 2024
25 of 29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants