Fix bugs in async code #753

romkatv · 2023-07-26T11:35:06Z

This PR fixes two bugs in async code.

Insert - in echo -nE "$suggestion". This is necessary to prevent "$suggestion" from being treated as an option for echo.
Close file descriptors only in _zsh_autosuggest_async_response to ensure that each file descriptor is closed only once.

It's the second bug that prompted the fix. The original code in some cases could close the same file descriptor twice. The code relied on an invalid assumption that _zsh_autosuggest_async_response cannot fire after the file descriptor is closed. Here's a demo that shows this assumption being violated:

() {
  emulate -L zsh

  function callback1() {
    zle -I
    emulate -L zsh -o xtrace
    : "$@"
    zle -F $fd1
    exec {fd1}>&-
    zle -F $fd2
    exec {fd2}>&-
  }

  function callback2() {
    zle -I
    emulate -L zsh -o xtrace
    : "$@"
  }

  exec {fd1} </dev/null
  exec {fd2} </dev/null
  zle -F $fd1 callback1
  zle -F $fd2 callback2
}

And here's the output I get if the code is pasted into an interactive zsh:

+callback1:3> : 12
+callback1:4> zle -F 12
+callback1:6> zle -F 13
+callback2:3> : 13

Note that callback2 fires after its file descriptor has been closed by callback1.

This bug was the culprit of several issues filed against powerlevel10k. In a nutshell:

_zsh_autosuggest_async_request opens a file.
_zsh_autosuggest_async_request closes the file descriptor.
powerlevel10k opens a file and gets the same file descriptor as above.
_zsh_autosuggest_async_response fires and closes the same file descriptor.
powerlevel10k encounters errors when trying to read from the file descriptor.

1. Insert `-` in `echo -nE "$suggestion"`. This is necessary to prevent `"$suggestion"` from being treated as an option for `echo`. 2. Close file descriptors only in `_zsh_autosuggest_async_response` to ensure that each file descriptor is closed only once. It's the second bug that prompted the fix. The original code in some cases could close the same file descriptor twice. The code relied on an invalid assumption that `_zsh_autosuggest_async_response` cannot fire after the file descriptor is closed. Here's a demo that shows this assumption being violated: () { emulate -L zsh function callback1() { zle -I emulate -L zsh -o xtrace : "$@" zle -F $fd1 exec {fd1}>&- zle -F $fd2 exec {fd2}>&- } function callback2() { zle -I emulate -L zsh -o xtrace : "$@" } exec {fd1} </dev/null exec {fd2} </dev/null zle -F $fd1 callback1 zle -F $fd2 callback2 } And here's the output I get if the code is pasted into an interactive zsh: +callback1:3> : 12 +callback1:4> zle -F 12 +callback1:6> zle -F 13 +callback2:3> : 13 Note that `callback2` fires after its file descriptor has been closed by `callback1`. This bug was the culprit of several issues filed against powerlevel10k. In a nutshell: 1. `_zsh_autosuggest_async_request` opens a file. 2. `_zsh_autosuggest_async_request` closes the file descriptor. 3. powerlevel10k opens a file and gets the same file descriptor as above. 4. `_zsh_autosuggest_async_response` fires and closes the same file descriptor. 5. powerlevel10k encounters errors when trying to read from the file descriptor.

ericfreese · 2023-07-28T20:05:36Z

src/async.zsh

-				# have been forked by the suggestion strategy
-				kill -TERM -$_ZSH_AUTOSUGGEST_CHILD_PID 2>/dev/null
+	if (( _ZSH_AUTOSUGGEST_CHILD_PID )); then
+		kill -TERM -- $_ZSH_AUTOSUGGEST_CHILD_PID 2>/dev/null


I see you've also removed the check of [[ -o MONITOR ]]. Was that intentional and if so can you give some reasoning for it? I had never fully tested the different cases there.

The check for MONITOR is still there. It used to be performed before invoking kill, which isn't quite right. I moved it to the proper point (when forking). This makes a difference if the option is changed between forking and killing. I should've this change in the description.

Friendly ping.

ericfreese · 2023-08-25T20:28:35Z

src/async.zsh

+			local suggestion
+			_zsh_autosuggest_fetch_suggestion "$1"
+			echo -nE - "$suggestion"
+		) || return


@romkatv Can you elaborate on the addition of the || return added here and after read and echo? In what situations can exec/read/echo have non-zero exit status?

I suppose one could read the source code of the builtins exec/read/echo and find out which syscalls they rely on. Then read the source code or docs of all operating systems to figure out when those system calls can fail.

I think a better question is what the code should do when exec/read/echo fail. Which variant of this code do you think behaves better in case of errors?

ericfreese · 2023-08-26T03:54:22Z

I have been trying to wrap my head around this today. I'm having trouble with the specifics of the situation you've outlined, but I do think I've found a (different?) way that this stuff can get borked stemming from _zsh_autosuggest_async_response closing the fd but not clearing _ZSH_AUTOSUGGEST_ASYNC_FD.

As to the scenario you've outlined:

_zsh_autosuggest_async_request opens a file.

_zsh_autosuggest_async_request closes the file descriptor.

powerlevel10k opens a file and gets the same file descriptor as above.

_zsh_autosuggest_async_response fires and closes the same file descriptor.

powerlevel10k encounters errors when trying to read from the file descriptor.

I'm not understanding how step 3 can occur here. The scenario I'm imagining from these steps is a user typing two characters in relatively quick succession. The first character fires off a child process to fetch a suggestion (1. above), which does not complete before the second character is typed (2. above). This would have _zsh_autosuggest_async_request open the fd (1.) and then close it (2.) in preparation for making the next suggestion request.

But then I don't understand how p10k could get the same fd that _zsh_autosuggest_async_request is dealing with in steps 1 and 2. After closing the fd, _zsh_autosuggest_async_request immediately opens another, which as I understand it will take the same number that it had just closed. That number should then be unavailable for p10k to take without some other action occurring first.

Would you be able to provide a more in-depth explanation?

While I was looking into this, as mentioned above, I did see a problem that could arise in a different but related scenario. I was able to trigger it by enabling ZSH_AUTOSUGGEST_MANUAL_REBIND and then having a widget that opens a file descriptor defined after the first precmd after zsh-autosuggestions is sourced where it does its widget binding. This scenario results in _zsh_autosuggest_async_request closing the fd opened by the widget. I believe a fix for this is simply clearing _ZSH_AUTOSUGGEST_ASYNC_FD in _zsh_autosuggest_async_response so that the next time into _zsh_autosuggest_async_request it doesn't mistakenly try to close the fd again.

ericfreese · 2023-08-26T04:18:05Z

I've broken out some of your changes into a separate branch here: develop...fixes/romkatv-async-fixes

I tacked the _ZSH_AUTOSUGGEST_ASYNC_FD fix mentioned above on the end. I'm curious if that fixes the issues you were having?

romkatv · 2023-08-27T12:47:11Z

I'm not understanding how step 3 can occur here.

It might help if you take a look at the self-contained code snippet I posted together with its output. It basically shows that the main zle loop works as follows (pseudo code):

while (true) {
  // Get a list of all file descriptors that are ready for reading.
  // This includes STDIN and all file descriptors registered via `zle -F`.
  ready_fds = select();
  if (ready_fds contains STDIN) {
    // STDIN has input. Read it and invoke bound widgets.
    read_keys_and_invoke_widgets();
  }
  // Invoke `zle -F` handlers for all ready file descriptors.
  watches = get_watches_for_fds(ready_fds);
  foreach (watch in watches) {
    invoke_fd_watch(watch);
  }
}

If several file descriptors become ready for reading simultaneously, and the handler for the first of them closes the second file descriptor, the second handler will still get called. That's what the code snippet in the PR description demonstrates.

Thus, when _zsh_autosuggest_async_request closes a file descriptor, the handler for this file descriptor will still fire if the fd is ready at that point. If in the interim some other code opens a file descriptor, it may be the same numerical value. It is in fact quite likely beucase the C runtime always allocates the smallest unused number when opening a file descriptor. In this case _zsh_autosuggest_async_response will close a file descriptor that zsh-autosuggestions did not open. Or, more precisely, zsh-autosuggestions has already closed it, and now a different piece of code "owns" a file descriptor with the same numerical value.

I took a cursory look at your changes and I don't think they fix this bug, or at least it's difficult for me to see that they do. In my code the correctness is easier to assess because 1) whenever a file descriptor is opened, a handler for it is registered; 2) the file descriptor is closed only from the handler. There is no room for a double close or fd leak: one function opens an fd and ensures that another function is eventually invoked; the second function closes the fd.

ericfreese · 2023-09-07T15:45:03Z

I just realized that #630 contained this _ZSH_AUTOSUGGEST_ASYNC_FD fix and I've merged that to develop. I'm fairly sure that this will fix the issues you're seeing, and is a less expansive change.

Please let me know if you see more issues and if so, it would help if you could provide some specific reproduction steps. Another thing that would help me in the future is to break the PR changes into more commits to make it clearer which changes are addressing which problems.

romkatv · 2023-09-07T18:18:33Z

I am positive that #630 does not fix the bug. I understand that it would be easier for you if I gave you a reproducible test case but it would take me a long time to build one given that we are talking about a race condition.

Can you see that the existing code does not provide a guarantee that each fd is closed only once? Do you see that the same fd can be first closed by this line and then again by this line? For that to happen, _zsh_autosuggest_async_request needs to be invoked when the fd is ready for reading.

ericfreese · 2023-09-09T15:33:19Z

Ok, thanks. I'm still trying to get my head around this.

Is this an example of what you're thinking? Please correct/extend this example to illustrate the scenario you're thinking of and help me check my own assumptions:

Some keystroke invokes async_request forking pid 1000 and opening fd 12

And then some time later, simultaneously:

user types a character
fd 12 becomes readable

Then in one iteration of the main zle loop:

async_request fires. It cancels the current request by signalling pid 1000 and closing fd 12. It then immediately forks a new pid 1001 (or whatever) and re-opens fd 12 to read from this new process.
Immediately after, async_response is called for fd 12. The fd is the same but it now points to a different process than originally intended (1001 instead of 1000). I'm not sure exactly what would happen here but I think the read would block until the process 1001 finishes. Probably works but I think would end up blocking rather than being async. Definitely not intentional, and into sketchy territory.

I think I see your point here, and something should be done to fix this 👍

Though I still don't quite see how something else could open fd 12 in between points 1 and 2 above. I wonder if we have pivoted from the originally reported bug to something slightly different?

romkatv · 2023-09-09T16:11:24Z

Spot on! Arbitrary code can be executed between 1 and 2 by a widget wrapper.

If you look at the code of my version rather than the diff, it might be easier to see its correctness. It should be possible to convince yourself that each fd is always closed exactly once.

minusfive · 2024-09-19T16:26:34Z

@ericfreese Wondering whether this will be merged, definitely still hitting this bug periodically.

egeogretmen · 2024-10-10T09:02:57Z

I can also confirm that the problem still exists.

The scenario I'm imagining from these steps is a user typing two characters in relatively quick succession.

This happens to me specifically.

ericfreese reviewed Jul 28, 2023

View reviewed changes

ericfreese reviewed Aug 25, 2023

View reviewed changes

romkatv mentioned this pull request Aug 31, 2023

_p9k_worker_stop:zle:4: No handler installed for fd 1 romkatv/powerlevel10k#1554

Closed

ericfreese mentioned this pull request Sep 7, 2023

Always reset file descriptor after consuming it #630

Merged

romkatv mentioned this pull request Feb 23, 2024

_p9k_deschedule_redraw error after paste romkatv/powerlevel10k#2584

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bugs in async code #753

Fix bugs in async code #753

romkatv commented Jul 26, 2023

ericfreese Jul 28, 2023

romkatv Jul 29, 2023

romkatv Aug 8, 2023

ericfreese Aug 25, 2023

romkatv Aug 27, 2023

ericfreese commented Aug 26, 2023

ericfreese commented Aug 26, 2023 •

edited

Loading

romkatv commented Aug 27, 2023

ericfreese commented Sep 7, 2023

romkatv commented Sep 7, 2023

ericfreese commented Sep 9, 2023

romkatv commented Sep 9, 2023

minusfive commented Sep 19, 2024

egeogretmen commented Oct 10, 2024

Fix bugs in async code #753

Are you sure you want to change the base?

Fix bugs in async code #753

Conversation

romkatv commented Jul 26, 2023

ericfreese Jul 28, 2023

Choose a reason for hiding this comment

romkatv Jul 29, 2023

Choose a reason for hiding this comment

romkatv Aug 8, 2023

Choose a reason for hiding this comment

ericfreese Aug 25, 2023

Choose a reason for hiding this comment

romkatv Aug 27, 2023

Choose a reason for hiding this comment

ericfreese commented Aug 26, 2023

ericfreese commented Aug 26, 2023 • edited Loading

romkatv commented Aug 27, 2023

ericfreese commented Sep 7, 2023

romkatv commented Sep 7, 2023

ericfreese commented Sep 9, 2023

romkatv commented Sep 9, 2023

minusfive commented Sep 19, 2024

egeogretmen commented Oct 10, 2024

ericfreese commented Aug 26, 2023 •

edited

Loading