Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix crashes on exit caused by wlroots listener checks #8578

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

Ferdi265
Copy link
Contributor

@Ferdi265 Ferdi265 commented Feb 17, 2025

This PR fixes sway crashing on exit due to event listeners not being removed from wlroots objects on exit.

Listeners fixed:

  • server: all objects created during server_init(), reworked handle_renderer_lost() to use an idle callback
  • input/input-manager: wlr_backend.new_input, wlr_virtual_keyboard, wlr_virtual_pointer, wlr_keyboard_shortcuts_inhibit, wlr_transient_seat_manager
  • desktop/idle-inhibit: wlr_idle_inhibit_manager
  • input/text-input: wlr_text_input_manager and wlr_input_method_manager
  • tree/container: wlr_scene_buffer.output_enter and wlr_scene_buffer.output_leave

This doesn't fix all wlroots objects with listener checks, just those that lead to immediate crashes on exit. I'll wait for a review of my general method of fixing this before continuing.

@Ferdi265
Copy link
Contributor Author

Ferdi265 commented Feb 18, 2025

On second thought, it might be more useful to remove the bulk of the listeners in server_fini() before calling destroy, which avoids adding dozens of destroy listeners.

Destroy listeners are still needed for a bunch of other components that aren't directly created in server.c, but that reduces the clutter by a lot.

@kennylevinsen
Copy link
Member

On second thought, it might be more useful to remove the bulk of the listeners in server_fini() before calling destroy, which avoids adding dozens of destroy listeners.

If we could have a 1:1 mapping between what is set up in server_init and what is torn down in server_fini, that might be fine - but it is probably still easier to understand if all structs with listeners must have an appropriate destroy handler to remove them.

One thing is what is easiest to implement, another is what is easiest to maintain and catch errors within in the future.

@Ferdi265 Ferdi265 force-pushed the fix-wlr-listener-checks branch from 8ea9677 to 307ce0c Compare February 18, 2025 21:30
@Ferdi265
Copy link
Contributor Author

Ferdi265 commented Feb 18, 2025

On second thought, it might be more useful to remove the bulk of the listeners in server_fini() before calling destroy, which avoids adding dozens of destroy listeners.

If we could have a 1:1 mapping between what is set up in server_init and what is torn down in server_fini, that might be fine - but it is probably still easier to understand if all structs with listeners must have an appropriate destroy handler to remove them.

One thing is what is easiest to implement, another is what is easiest to maintain and catch errors within in the future.

IMO components that have _init() and _finish() functions that create protocol objects that can't go away on their own should probably avoid extra destroy handlers and just remove the listeners in finish. Components where objects are created in reference to an outside lifetime (e.g. in response to a seat being created) should probably create destroy listeners to remove those. input-manager has a different problem where it can't remove its own listeners in destroy handlers due to it not having access to the main server struct, needing a new _finish() method that is called from server_fini().

This is sometimes hard to distinguish though; text_input.c for example uses global objects initialized in server_init(), but is created in response to seat objects which are destroyed after the wlroots ones. This lifetime mismatch makes this hard to deal with, since it needs to be cleaned up both in response to a seat going away and in response to the wlroots objects being destroyed, with both happening at exit (which would cause a crash due to double wl_list_remove()).

I'm currently handling these by making the _finish() function robust against being called twice, but this doesn't feel optimal.

I think I'll try to get this PR far enough to exit sway without crashes with this method, keeping as much as possible in separate commits for reviewability, and we'll sort out what the best path forward is during review.

@Ferdi265 Ferdi265 force-pushed the fix-wlr-listener-checks branch 2 times, most recently from adfe709 to 8ee1fc7 Compare February 18, 2025 22:25
@Ferdi265
Copy link
Contributor Author

The current state makes sway exit correctly from a desktop containing a few konsole windows and a firefox window.

I'm not completely sure about the correct handling of all of these cases, and I'm sure there are a lot more objects with listeners that need to be removed before destroy, but those don't trigger on a regular exit in my setup. I've already started writing a script to search for all objects in wlroots that have listener checks and their corresponding usages in sway, but that might take a while.

In the meantime, I think it's better to mark this as reviewable now and see where we go afterwards.

@Ferdi265 Ferdi265 marked this pull request as ready for review February 18, 2025 22:28
@Ferdi265 Ferdi265 changed the title WIP: Fix crashes on exit caused by wlroots listener checks Fix crashes on exit caused by wlroots listener checks Feb 18, 2025
@Ferdi265 Ferdi265 force-pushed the fix-wlr-listener-checks branch from 8ee1fc7 to 6df3682 Compare February 18, 2025 22:31
@Ferdi265
Copy link
Contributor Author

Ferdi265 commented Feb 18, 2025

If we could have a 1:1 mapping between what is set up in server_init and what is torn down in server_fini, that might be fine

Addendum to this: this works fine for server_init, (and saves a ton of destroy handlers) since all those objects are created/destroyed only at init and fini, but this doesn't work for some of the more dynamic objects or listeners created elsewhere in sway.

The current implementation has this 1:1 mapping for objects created in server_init() in the first commit.

@Ferdi265
Copy link
Contributor Author

Diff to state before last force-push:

diff --git a/sway/server.c b/sway/server.c
index 683ce169..79c8f542 100644
--- a/sway/server.c
+++ b/sway/server.c
@@ -458,6 +458,7 @@ bool server_init(struct sway_server *server) {
 
 void server_fini(struct sway_server *server) {
        // remove listeners
+       wl_list_remove(&server->renderer_lost.link);
        wl_list_remove(&server->new_output.link);
        wl_list_remove(&server->layer_shell_surface.link);
        wl_list_remove(&server->xdg_shell_toplevel.link);
@@ -474,14 +475,12 @@ void server_fini(struct sway_server *server) {
        wl_list_remove(&server->xdg_activation_v1_request_activate.link);
        wl_list_remove(&server->xdg_activation_v1_new_token.link);
        wl_list_remove(&server->request_set_cursor_shape.link);
-#if WLR_HAS_XWAYLAND
-       wl_list_remove(&server->xwayland_surface.link);
-       wl_list_remove(&server->xwayland_ready.link);
-#endif
        input_manager_finish(server);
 
        // TODO: free sway-specific resources
 #if WLR_HAS_XWAYLAND
+       wl_list_remove(&server->xwayland_surface.link);
+       wl_list_remove(&server->xwayland_ready.link);
        wlr_xwayland_destroy(server->xwayland.wlr_xwayland);
 #endif
        wl_display_destroy_clients(server->wl_display);

@Ferdi265 Ferdi265 force-pushed the fix-wlr-listener-checks branch from ca19d20 to 8932123 Compare March 1, 2025 22:49
@Ferdi265
Copy link
Contributor Author

Ferdi265 commented Mar 10, 2025

I found an interesting crash related to the renderer_lost listener when testing this PR today that seems non-trivial to fix:

The error occurs while sway is trying to destroy the old renderer after a GPU reset and also exists before this PR.

Call Trace: wlroots begin_gles2_buffer_pass() -> libwayland wl_signal_emit_mutable() -> sway handle_renderer_lost() -> wlroots wlr_renderer_destroy()

Detailed Stack Trace

wlroots begin_gles2_buffer_pass()

struct wlr_gles2_render_pass *begin_gles2_buffer_pass(...) {
	...
	if (...) {
                ...
		if (...) {
			wlr_log(WLR_ERROR, "GPU reset (%s)", reset_status_str(status));
			wl_signal_emit_mutable(&renderer->wlr_renderer.events.lost, NULL); // <--- HERE
			return NULL;
		}
	}
        ...
}

libwayland wl_signal_emit_mutable()

WL_EXPORT void
wl_signal_emit_mutable(...)
{
	struct wl_listener cursor;
	struct wl_listener end;

	/* Add two special markers: one cursor and one end marker. This way, we
	 * know that we've already called listeners on the left of the cursor
	 * and that we don't want to call listeners on the right of the end
	 * marker. The 'it' function can remove any element it wants from the
	 * list without troubles.
	 *
	 * There was a previous attempt that used to steal the whole list of
	 * listeners but then that broke wl_signal_get().
	 *
	 * wl_list_for_each_safe tries to be safe but it fails: it works fine
	 * if the current item is removed, but not if the next one is. */
	wl_list_insert(&signal->listener_list, &cursor.link); // <-- additional listeners outside of sway's control
	cursor.notify = handle_noop;
	wl_list_insert(signal->listener_list.prev, &end.link); // <-- additional listeners outside of sway's control
	end.notify = handle_noop;

	while (cursor.link.next != &end.link) {
		struct wl_list *pos = cursor.link.next;
		struct wl_listener *l = wl_container_of(pos, l, link);

		wl_list_remove(&cursor.link);
		wl_list_insert(pos, &cursor.link);

		l->notify(l, data); // <--- HERE
	}
        ...
}

sway handle_renderer_lost()

static void handle_renderer_lost(...) {
	...
	struct wlr_renderer *old_renderer = server->renderer;
        ...
	wlr_renderer_destroy(old_renderer); // <--- HERE
}

wlroots wlr_renderer_destroy()

void wlr_renderer_destroy(struct wlr_renderer *r) {
	if (!r) {
		return;
	}

	wl_signal_emit_mutable(&r->events.destroy, r);

	assert(wl_list_empty(&r->events.destroy.listener_list));
	assert(wl_list_empty(&r->events.lost.listener_list)); // <-- ASSERTION FAILS HERE

	if (r->impl && r->impl->destroy) {
		r->impl->destroy(r);
	} else {
		free(r);
	}
}

Basically, when the old wlr_renderer is destroyed in response to the renderer_lost event, there will always still be (fake) listeners added by wl_signal_emit_mutable() in the call stack below, triggering the assertion, regardless of what sway does. The wlr_renderer would also be free'd, which I'm not sure works out without any UAF in the return path of wl_signal_emit_mutable(), I still need to check that. which leads (when assertions are disabled) to a definite UAF in the return path of wl_signal_emit_mutable() when it attempts to remove the cursor from the list. Using non-mutable wl_signal_emit() in the GPU reset path instead would also not fix it since the wl_list_for_each_safe() would UAF while trying to read the next property of the already free'd listener_list. Generally (if I read this correctly), in a wl_listener callback it is not safe to destroy the object that contains the corresponding wl_signal.

The only way I see that this can be solved by sway without wlroots changes is that sway somehow schedules an event in the future where the old wlr_renderer can be destroyed without having a wl_signal_emit_mutable() on its own events in the call stack below.

EDIT: clarified UAF in the return path of wl_signal_emit_mutable() and added note on the safety of destroying objects in wl_listener callbacks.

@kennylevinsen
Copy link
Member

I found an interesting crash related to the renderer_lost listener when testing this PR today that seems non-trivial to fix:

We can fix this in sway if we can break the stack without causing issues elsewhere by either:

  1. Make the listener just queue an idle callback for the renderer lost handling.
  2. Make the listener queue an idle callback just for destroying the old renderer at the end.

The first one is cleaner and is hopefully fine, but I'm not sure if there will be a bunch of things failing in between the renderer being lost and the idle callback running.

@Ferdi265
Copy link
Contributor Author

Ferdi265 commented Mar 10, 2025

I found an interesting crash related to the renderer_lost listener when testing this PR today that seems non-trivial to fix:

We can fix this in sway if we can break the stack without causing issues elsewhere by either:

1. Make the listener just queue an idle callback for the renderer lost handling.

2. Make the listener queue an idle callback just for destroying the old renderer at the end.

The first one is cleaner and is hopefully fine, but I'm not sure if there will be a bunch of things failing in between the renderer being lost and the idle callback running.

These seem like sensible approaches. I'll look into it and try to see if the first approach works without complications. IMO this could be done in a separate PR, since it doesn't really touch the exit logic and is more than just removing listeners.

EDIT: implemented this as a separate commit in this PR

@Ferdi265 Ferdi265 force-pushed the fix-wlr-listener-checks branch from 8932123 to 9122400 Compare March 13, 2025 17:21
@Ferdi265 Ferdi265 force-pushed the fix-wlr-listener-checks branch from 9122400 to e53f77a Compare March 21, 2025 17:23
@Ferdi265
Copy link
Contributor Author

Ferdi265 commented Mar 21, 2025

  1. Make the listener just queue an idle callback for the renderer lost handling.

I just implemented this in the latest push. I haven't fully tested it yet since I don't know of a way to force a GPU reset or renderer loss. I'm still searching if there's a simple way to do this (for info: I'm using amdgpu on this laptop), otherwise I'll hack something in to make sway think a GPU reset happened to test this.

I've already tried /sys/kernel/debug/dri/<pci id>/amdgpu_gpu_recover, but that doesn't seem to trigger the gpu reset code path on my system.

@Ferdi265 Ferdi265 force-pushed the fix-wlr-listener-checks branch from 2126919 to 9f0e3c0 Compare March 25, 2025 19:51
@Ferdi265
Copy link
Contributor Author

Ferdi265 commented Mar 25, 2025

Rebased onto latest master. I've also tested the new delayed renderer recreation by faking the return value of glGetGraphicsResetStatusKHR() in wlroots. The new reset handling works and doesn't crash. It's also robust against triggering multiple times before the idle handler gets called.

@kennylevinsen / @llyyr: could you look at the new renderer lost handling and confirm that nothing looks out of the ordinary?

Otherwise I think this PR is good to go. I've been running this for weeks now and haven't had any more crashes on exit. There might be a few listeners still somewhere in sway, but for daily use this should be fine, and the rest can be handled when we come across them (IMO anything is better than the current state of master where sway just crashes on exit).

This fixes a crash in wlroots listener checks. See swaywm#8509.
This fixes a crash in wlroots listener checks. See swaywm#8509.
This fixes a crash in wlroots listener checks. See swaywm#8509.
sway_input_method_relay can be destroyed from two sources, either the
seat is destroyed or the manager protocol objects are destroyed due
compositor exit. Therefore, finish must check whether it has already
been called.

This fixes a crash in wlroots listener checks. See swaywm#8509.
Change begin_destroy to remove event listeners before the final destroy,
since otherwise event listeners would be removed twice, which crashes.

This fixes a crash in wlroots listener checks. See swaywm#8509.
Destroying the wlr_renderer in a callback to its own renderer_lost event
is unsafe due to wl_signal_emit*() still accessing it after it was
destroyed.

Delegate recreation of renderer to an idle callback and ensure that only
one such idle callback is scheduled at a time by storing the returned
event source.
@Ferdi265 Ferdi265 force-pushed the fix-wlr-listener-checks branch from 9f0e3c0 to 95117e8 Compare March 29, 2025 21:28
@Ferdi265
Copy link
Contributor Author

Rebased again onto latest master.

I also did a quick grep over the codebase and from a first look it seems like every remaining wl_signal_add() has at least one corresponding wl_signal_remove(), and at least for the ones I've looked at in more detail those seem to correctly handle exit (mostly via destroy listeners), but I haven't individually tested them. This seems to match with the fact that I haven't encountered any exit crashes these past weeks.

@llyyr
Copy link
Contributor

llyyr commented Mar 29, 2025

I've also tested this a bit and haven't seen any exit crashes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants