Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bugs in async code #753

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

romkatv
Copy link
Contributor

@romkatv romkatv commented Jul 26, 2023

This PR fixes two bugs in async code.

  1. Insert - in echo -nE "$suggestion". This is necessary to prevent "$suggestion" from being treated as an option for echo.
  2. Close file descriptors only in _zsh_autosuggest_async_response to ensure that each file descriptor is closed only once.

It's the second bug that prompted the fix. The original code in some cases could close the same file descriptor twice. The code relied on an invalid assumption that _zsh_autosuggest_async_response cannot fire after the file descriptor is closed. Here's a demo that shows this assumption being violated:

() {
  emulate -L zsh

  function callback1() {
    zle -I
    emulate -L zsh -o xtrace
    : "$@"
    zle -F $fd1
    exec {fd1}>&-
    zle -F $fd2
    exec {fd2}>&-
  }

  function callback2() {
    zle -I
    emulate -L zsh -o xtrace
    : "$@"
  }

  exec {fd1} </dev/null
  exec {fd2} </dev/null
  zle -F $fd1 callback1
  zle -F $fd2 callback2
}

And here's the output I get if the code is pasted into an interactive zsh:

+callback1:3> : 12
+callback1:4> zle -F 12
+callback1:6> zle -F 13
+callback2:3> : 13

Note that callback2 fires after its file descriptor has been closed by callback1.

This bug was the culprit of several issues filed against powerlevel10k. In a nutshell:

  1. _zsh_autosuggest_async_request opens a file.
  2. _zsh_autosuggest_async_request closes the file descriptor.
  3. powerlevel10k opens a file and gets the same file descriptor as above.
  4. _zsh_autosuggest_async_response fires and closes the same file descriptor.
  5. powerlevel10k encounters errors when trying to read from the file descriptor.

1. Insert `-` in `echo -nE "$suggestion"`. This is necessary to prevent
   `"$suggestion"` from being treated as an option for `echo`.
2. Close file descriptors only in `_zsh_autosuggest_async_response` to
   ensure that each file descriptor is closed only once.

It's the second bug that prompted the fix. The original code in some
cases could close the same file descriptor twice. The code relied on
an invalid assumption that `_zsh_autosuggest_async_response` cannot
fire after the file descriptor is closed. Here's a demo that shows
this assumption being violated:

    () {
      emulate -L zsh

      function callback1() {
        zle -I
        emulate -L zsh -o xtrace
        : "$@"
        zle -F $fd1
        exec {fd1}>&-
        zle -F $fd2
        exec {fd2}>&-
      }

      function callback2() {
        zle -I
        emulate -L zsh -o xtrace
        : "$@"
      }

      exec {fd1} </dev/null
      exec {fd2} </dev/null
      zle -F $fd1 callback1
      zle -F $fd2 callback2
    }

And here's the output I get if the code is pasted into an interactive zsh:

    +callback1:3> : 12
    +callback1:4> zle -F 12
    +callback1:6> zle -F 13
    +callback2:3> : 13

Note that `callback2` fires after its file descriptor has been closed
by `callback1`.

This bug was the culprit of several issues filed against powerlevel10k.
In a nutshell:

1. `_zsh_autosuggest_async_request` opens a file.
2. `_zsh_autosuggest_async_request` closes the file descriptor.
3. powerlevel10k opens a file and gets the same file descriptor as above.
4. `_zsh_autosuggest_async_response` fires and closes the same file descriptor.
5. powerlevel10k encounters errors when trying to read from the file descriptor.
# have been forked by the suggestion strategy
kill -TERM -$_ZSH_AUTOSUGGEST_CHILD_PID 2>/dev/null
if (( _ZSH_AUTOSUGGEST_CHILD_PID )); then
kill -TERM -- $_ZSH_AUTOSUGGEST_CHILD_PID 2>/dev/null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you've also removed the check of [[ -o MONITOR ]]. Was that intentional and if so can you give some reasoning for it? I had never fully tested the different cases there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check for MONITOR is still there. It used to be performed before invoking kill, which isn't quite right. I moved it to the proper point (when forking). This makes a difference if the option is changed between forking and killing. I should've this change in the description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Friendly ping.

local suggestion
_zsh_autosuggest_fetch_suggestion "$1"
echo -nE - "$suggestion"
) || return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romkatv Can you elaborate on the addition of the || return added here and after read and echo? In what situations can exec/read/echo have non-zero exit status?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose one could read the source code of the builtins exec/read/echo and find out which syscalls they rely on. Then read the source code or docs of all operating systems to figure out when those system calls can fail.

I think a better question is what the code should do when exec/read/echo fail. Which variant of this code do you think behaves better in case of errors?

@ericfreese
Copy link
Member

I have been trying to wrap my head around this today. I'm having trouble with the specifics of the situation you've outlined, but I do think I've found a (different?) way that this stuff can get borked stemming from _zsh_autosuggest_async_response closing the fd but not clearing _ZSH_AUTOSUGGEST_ASYNC_FD.

As to the scenario you've outlined:

  1. _zsh_autosuggest_async_request opens a file.
  2. _zsh_autosuggest_async_request closes the file descriptor.
  3. powerlevel10k opens a file and gets the same file descriptor as above.
  4. _zsh_autosuggest_async_response fires and closes the same file descriptor.
  5. powerlevel10k encounters errors when trying to read from the file descriptor.

I'm not understanding how step 3 can occur here. The scenario I'm imagining from these steps is a user typing two characters in relatively quick succession. The first character fires off a child process to fetch a suggestion (1. above), which does not complete before the second character is typed (2. above). This would have _zsh_autosuggest_async_request open the fd (1.) and then close it (2.) in preparation for making the next suggestion request.

But then I don't understand how p10k could get the same fd that _zsh_autosuggest_async_request is dealing with in steps 1 and 2. After closing the fd, _zsh_autosuggest_async_request immediately opens another, which as I understand it will take the same number that it had just closed. That number should then be unavailable for p10k to take without some other action occurring first.

Would you be able to provide a more in-depth explanation?

While I was looking into this, as mentioned above, I did see a problem that could arise in a different but related scenario. I was able to trigger it by enabling ZSH_AUTOSUGGEST_MANUAL_REBIND and then having a widget that opens a file descriptor defined after the first precmd after zsh-autosuggestions is sourced where it does its widget binding. This scenario results in _zsh_autosuggest_async_request closing the fd opened by the widget. I believe a fix for this is simply clearing _ZSH_AUTOSUGGEST_ASYNC_FD in _zsh_autosuggest_async_response so that the next time into _zsh_autosuggest_async_request it doesn't mistakenly try to close the fd again.

@ericfreese
Copy link
Member

ericfreese commented Aug 26, 2023

I've broken out some of your changes into a separate branch here: develop...fixes/romkatv-async-fixes

I tacked the _ZSH_AUTOSUGGEST_ASYNC_FD fix mentioned above on the end. I'm curious if that fixes the issues you were having?

@romkatv
Copy link
Contributor Author

romkatv commented Aug 27, 2023

I'm not understanding how step 3 can occur here.

It might help if you take a look at the self-contained code snippet I posted together with its output. It basically shows that the main zle loop works as follows (pseudo code):

while (true) {
  // Get a list of all file descriptors that are ready for reading.
  // This includes STDIN and all file descriptors registered via `zle -F`.
  ready_fds = select();
  if (ready_fds contains STDIN) {
    // STDIN has input. Read it and invoke bound widgets.
    read_keys_and_invoke_widgets();
  }
  // Invoke `zle -F` handlers for all ready file descriptors.
  watches = get_watches_for_fds(ready_fds);
  foreach (watch in watches) {
    invoke_fd_watch(watch);
  }
}

If several file descriptors become ready for reading simultaneously, and the handler for the first of them closes the second file descriptor, the second handler will still get called. That's what the code snippet in the PR description demonstrates.

Thus, when _zsh_autosuggest_async_request closes a file descriptor, the handler for this file descriptor will still fire if the fd is ready at that point. If in the interim some other code opens a file descriptor, it may be the same numerical value. It is in fact quite likely beucase the C runtime always allocates the smallest unused number when opening a file descriptor. In this case _zsh_autosuggest_async_response will close a file descriptor that zsh-autosuggestions did not open. Or, more precisely, zsh-autosuggestions has already closed it, and now a different piece of code "owns" a file descriptor with the same numerical value.

I took a cursory look at your changes and I don't think they fix this bug, or at least it's difficult for me to see that they do. In my code the correctness is easier to assess because 1) whenever a file descriptor is opened, a handler for it is registered; 2) the file descriptor is closed only from the handler. There is no room for a double close or fd leak: one function opens an fd and ensures that another function is eventually invoked; the second function closes the fd.

@ericfreese
Copy link
Member

I just realized that #630 contained this _ZSH_AUTOSUGGEST_ASYNC_FD fix and I've merged that to develop. I'm fairly sure that this will fix the issues you're seeing, and is a less expansive change.

Please let me know if you see more issues and if so, it would help if you could provide some specific reproduction steps. Another thing that would help me in the future is to break the PR changes into more commits to make it clearer which changes are addressing which problems.

@romkatv
Copy link
Contributor Author

romkatv commented Sep 7, 2023

I am positive that #630 does not fix the bug. I understand that it would be easier for you if I gave you a reproducible test case but it would take me a long time to build one given that we are talking about a race condition.

Can you see that the existing code does not provide a guarantee that each fd is closed only once? Do you see that the same fd can be first closed by this line and then again by this line? For that to happen, _zsh_autosuggest_async_request needs to be invoked when the fd is ready for reading.

@ericfreese
Copy link
Member

Ok, thanks. I'm still trying to get my head around this.

Is this an example of what you're thinking? Please correct/extend this example to illustrate the scenario you're thinking of and help me check my own assumptions:

Some keystroke invokes async_request forking pid 1000 and opening fd 12

And then some time later, simultaneously:

  • user types a character
  • fd 12 becomes readable

Then in one iteration of the main zle loop:

  1. async_request fires. It cancels the current request by signalling pid 1000 and closing fd 12. It then immediately forks a new pid 1001 (or whatever) and re-opens fd 12 to read from this new process.
  2. Immediately after, async_response is called for fd 12. The fd is the same but it now points to a different process than originally intended (1001 instead of 1000). I'm not sure exactly what would happen here but I think the read would block until the process 1001 finishes. Probably works but I think would end up blocking rather than being async. Definitely not intentional, and into sketchy territory.

I think I see your point here, and something should be done to fix this 👍

Though I still don't quite see how something else could open fd 12 in between points 1 and 2 above. I wonder if we have pivoted from the originally reported bug to something slightly different?

@romkatv
Copy link
Contributor Author

romkatv commented Sep 9, 2023

Spot on! Arbitrary code can be executed between 1 and 2 by a widget wrapper.

If you look at the code of my version rather than the diff, it might be easier to see its correctness. It should be possible to convince yourself that each fd is always closed exactly once.

@minusfive
Copy link

@ericfreese Wondering whether this will be merged, definitely still hitting this bug periodically.

@egeogretmen
Copy link

I can also confirm that the problem still exists.

The scenario I'm imagining from these steps is a user typing two characters in relatively quick succession.

This happens to me specifically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants