
Help debugging a seemingly random segfault in wasmtime #10283

Open
imsnif opened this issue Feb 24, 2025 · 22 comments

@imsnif

imsnif commented Feb 24, 2025

Hey,

In the context of Zellij, ever since a recent (still unreleased) upgrade of wasmtime, I've been getting intermittent and seemingly random segfaults.

I am afraid I don't have a way to reproduce them, but I have been gathering some information about them and am hoping the maintainers can help pinpoint the issue or provide more insight.

Version info:
wasmtime version: 29.0.1
(probably) did not occur in wasmtime version: 21.0.2

What happens: Zellij crashes, without a panic, and syslog tells me that it got a segfault:

* [343995.595910] async-std/runti[2487041]: segfault at 705a8ea73970 ip 0000705a8ea73970 sp 0000705a8ea73948 error 15 likely on CPU 3 (core 1, socket 0)

With some debug logging, I managed to narrow the crash down to this block in our app: https://github.com/zellij-org/zellij/blob/main/zellij-server/src/plugins/wasm_bridge.rs#L1589-L1596

I have a core dump, but unfortunately not one with debug symbols so I'm not sure it's a lot of help.

This happens about once or twice a day for me and so is extremely gnarly to pinpoint. I'm not 100% sure this is a wasmtime issue, but the crash location would seem to indicate that it is.

Any help in troubleshooting or mitigating this somehow would be appreciated. On my part I'm going to try to get a coredump with some debug symbols so we'll hopefully be less blind.

Thanks!

@alexcrichton
Member

segfault at 705a8ea73970 ip 0000705a8ea73970

it's suspicious that the segfaulting address and the ip are the same, meaning that this looks like it's trying to execute code that's either not mapped or not executable. What OS are you on? That may be some cache-coherency thing and/or a bug in Wasmtime perhaps? (not that anything about that has changed recently, so I don't know what would have changed in the 21->29 update other than perhaps making it more likely one way or another)

@imsnif
Author

imsnif commented Feb 24, 2025

I'm on arch linux

@alexcrichton
Member

Can you run uname -a and paste that? In addition to Linux I'd be curious what architecture you're using (e.g. x64, aarch64, etc) and possibly what kernel version too

@imsnif
Author

imsnif commented Feb 24, 2025

Sure:

Linux green 6.12.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 22 Nov 2024 16:04:27 +0000 x86_64 GNU/Linux

@alexcrichton
Member

alexcrichton commented Feb 24, 2025

Ah ok, x64 probably doesn't have any cache coherency issues or things like that. Can you open up the coredump and get the exact faulting instruction? E.g. if you open it up and run disas does it print anything? If nothing can you do disas $pc-40,$pc+30 to see the window around the faulting instruction?

EDIT: also if possible a stack trace via bt would be helpful too
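For reference, pulling this information out of the dump on an Arch setup looks roughly like this, assuming systemd-coredump is managing dumps (the default on Arch); if the kernel wrote a plain core file instead, substitute `gdb ./zellij ./core`:

```shell
# List recent core dumps for the crashing binary (name is illustrative):
coredumpctl list zellij

# Open the newest matching dump in gdb:
coredumpctl gdb zellij

# Inside gdb, inspect the faulting instruction and the stack:
#   (gdb) disas                   # disassemble around $pc, if in a known function
#   (gdb) disas $pc-40,$pc+30     # window around the faulting instruction
#   (gdb) bt                      # stack trace
#   (gdb) info registers          # register state at the fault
```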

@imsnif
Author

imsnif commented Feb 24, 2025

Sure, here you go (I hope I did it right).

Core was generated by `/home/aram/backup/dev-zellij --server /run/user/1000/zellij/0.42.0/zellij'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007b103cdb1870 in ?? ()
[Current thread is 1 (LWP 800590)]
(gdb) disas
No function contains program counter for selected frame.
(gdb) disas $pc-40,$pc+30
Dump of assembler code from 0x7b103cdb1848 to 0x7b103cdb188e:
   0x00007b103cdb1848:	mov    $0xcedceff9,%ecx
   0x00007b103cdb184d:	pop    %rcx
   0x00007b103cdb184e:	add    %al,(%rax)
   0x00007b103cdb1850:	jo     0x7b103cdb186a
   0x00007b103cdb1852:	fstpt  (%rax,%rdx,1)
   0x00007b103cdb1855:	jnp    0x7b103cdb1857
   0x00007b103cdb1857:	add    %bl,%bl
   0x00007b103cdb1859:	lea    (bad),%edx
   0x00007b103cdb185a:	rcr    $1,%ah
   0x00007b103cdb185c:	(bad)
   0x00007b103cdb185d:	pop    %rcx
   0x00007b103cdb185e:	add    %al,(%rax)
   0x00007b103cdb1860:	call   0x7b104d17f37d
   0x00007b103cdb1865:	jnp    0x7b103cdb1867
   0x00007b103cdb1867:	add    %al,(%rax)
   0x00007b103cdb1869:	add    %al,(%rax)
   0x00007b103cdb186b:	add    %al,(%rax)
   0x00007b103cdb186d:	add    %al,(%rax)
   0x00007b103cdb186f:	add    %al,(%rax)
   0x00007b103cdb1871:	add    %al,(%rax)
   0x00007b103cdb1873:	add    %al,(%rax)
   0x00007b103cdb1875:	add    %al,(%rax)
   0x00007b103cdb1877:	add    %ch,(%rax)
   0x00007b103cdb1879:	call   0x7b10b7eb0c83
   0x00007b103cdb187e:	add    %al,(%rax)
   0x00007b103cdb1880:	add    %eax,(%rax)
   0x00007b103cdb1882:	add    %al,(%rax)
   0x00007b103cdb1884:	add    %al,(%rax)
   0x00007b103cdb1886:	add    %al,(%rax)
   0x00007b103cdb1888:	clc
   0x00007b103cdb1889:	push   %rbx
   0x00007b103cdb188a:	sbb    %bl,%dh
   0x00007b103cdb188c:	(bad)
   0x00007b103cdb188d:	pop    %rcx
End of assembler dump.
(gdb) bt
#0  0x00007b103cdb1870 in ?? ()
#1  0x000059cedceff9b9 in ?? ()
#2  0x00007b103cdb1870 in ?? ()
#3  0x000059cedcd08ddb in ?? ()
#4  0x00007b103cdb18e8 in ?? ()
#5  0x0000000000000000 in ?? ()
(gdb)

I'm also going to start working with an unstripped version so that next time this happens we can hopefully get some more info.

@bjorn3
Contributor

bjorn3 commented Feb 24, 2025

Looks like the disassembly got mangled due to starting at the wrong offset. disas/r $pc-40,$pc+30 shows the raw bytes for each instruction too. Also try varying the -40 a bit until one instruction has 0x00007b103cdb1870 as its start address.

@imsnif
Author

imsnif commented Feb 24, 2025

Is this better?

(gdb) disas $pc-10,$pc+30
Dump of assembler code from 0x7b103cdb1866 to 0x7b103cdb188e:
   0x00007b103cdb1866:	add    %al,(%rax)
   0x00007b103cdb1868:	add    %al,(%rax)
   0x00007b103cdb186a:	add    %al,(%rax)
   0x00007b103cdb186c:	add    %al,(%rax)
   0x00007b103cdb186e:	add    %al,(%rax)
=> 0x00007b103cdb1870:	add    %al,(%rax)
   0x00007b103cdb1872:	add    %al,(%rax)
   0x00007b103cdb1874:	add    %al,(%rax)
   0x00007b103cdb1876:	add    %al,(%rax)
   0x00007b103cdb1878:	sub    %ch,%al
   0x00007b103cdb187a:	add    $0x7b0ff4,%eax
   0x00007b103cdb187f:	add    %al,(%rcx)
   0x00007b103cdb1881:	add    %al,(%rax)
   0x00007b103cdb1883:	add    %al,(%rax)
   0x00007b103cdb1885:	add    %al,(%rax)
   0x00007b103cdb1887:	add    %bh,%al
   0x00007b103cdb1889:	push   %rbx
   0x00007b103cdb188a:	sbb    %bl,%dh
   0x00007b103cdb188c:	(bad)
   0x00007b103cdb188d:	pop    %rcx
End of assembler dump.
(gdb) bt
#0  0x00007b103cdb1870 in ?? ()
#1  0x000059cedceff9b9 in ?? ()
#2  0x00007b103cdb1870 in ?? ()
#3  0x000059cedcd08ddb in ?? ()
#4  0x00007b103cdb18e8 in ?? ()
#5  0x0000000000000000 in ?? ()
(gdb)

@cfallin
Member

cfallin commented Feb 24, 2025

It appears execution fell into bad data somehow: add %al, (%rax) is the disassembly of all zero bytes.

Perhaps try starting further back -- did execution slide into this from legitimate code just prior, somehow?

Or perhaps a jump landed here. Are you able to capture the crash under rr (the reversible debugger) and backwards-step (rsi, reverse-step-instruction) to see where we came from?
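For reference, capturing and reverse-stepping a crash under rr looks roughly like this (assuming rr is installed and the kernel permits the needed perf counters, i.e. kernel.perf_event_paranoid is at most 1; the binary path and arguments are illustrative):

```shell
# Record a run of the server under rr:
rr record ./zellij --server /run/user/1000/zellij/0

# For intermittent bugs, chaos mode randomizes scheduling to make
# rare interleavings more likely to reproduce:
rr record --chaos ./zellij

# Once a recorded run has crashed, replay it deterministically:
rr replay
#   (rr) continue             # run forward until the SIGSEGV
#   (rr) reverse-stepi        # step backwards one instruction (alias: rsi)
#   (rr) reverse-continue     # or run backwards to a breakpoint/watchpoint
```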

@imsnif
Copy link
Author

imsnif commented Feb 24, 2025

Would it be helpful if I attach the coredump here?

@cfallin
Member

cfallin commented Feb 24, 2025

Probably not without the binary and other information -- in general it's difficult to jump into the context of another project embedding us and work things out. Capturing the crash in rr if you can is probably the most likely way to make progress: that would allow us to see exactly how things became corrupted (if they did) or went off the rails otherwise.

@imsnif
Author

imsnif commented Feb 24, 2025

I have unfortunately not managed to find a way to reproduce the crash. It happens about 1-2 times a day of heavy work with the application. I guess it would help some more if I provide a backtrace of the coredump with debug symbols?

@cfallin
Member

cfallin commented Feb 24, 2025

Landing in the middle of zero data is the sort of thing that requires a bit more info to debug unfortunately -- a backtrace might offer some clues, but the only way to really get enough information to zero in on a fix (if this is a Wasmtime bug) would be, again, an rr trace -- we really need to see register values, see how control flow reached a place, see how struct fields got corrupted if they did, etc.

@cfallin
Member

cfallin commented Feb 24, 2025

A few questions that might help triage as well:

  • Does your application that embeds Wasmtime include any unsafe Rust code, or have any libraries written in unsafe languages (e.g. C/C++) linked in, other than libc?
  • Do the failures correlate with heavy load? Wasm execution? Level of concurrency? Or do they happen randomly?
  • A bit of a stretch, but I recently encountered some perplexing kernel crashes on one of my machines and initially assumed the issue must have been some subtle race condition in the filesystem code that was crashing. It turned out that I had bad RAM. So:
    • Does this crashing issue occur only on one machine or on multiple?
    • If only one machine, has it had any other stability issues? Could you run memtest86 (if x86/PC hardware) as a baseline check?

@imsnif
Author

imsnif commented Feb 25, 2025

* Does your application that embeds Wasmtime include any unsafe Rust code, or have any libraries written in unsafe languages (e.g. C/C++) linked in, other than libc?

Yes and yes. Zellij has 500+ dependencies. I did debug it down to these lines though https://github.com/zellij-org/zellij/blob/main/zellij-server/src/plugins/wasm_bridge.rs#L1589-L1596 - it definitely happens here. The only unsafe thing happening in this range has to do with wasmtime (unless I'm missing something, which is always a possibility). Otherwise it's serialization, acquiring a lock on an Arc<Mutex> and pushing into a VecDeque.

* Do the failures correlate with heavy load? Wasm execution? Level of concurrency? Or do they happen randomly?

Anecdotally, it seems to me that this happens when the system itself is under heavy load (eg. compiling rust). This is a 10+ year old laptop, so it struggles often and these tend to be the times in which this happens - but definitely not only.

  * Does this crashing issue occur only on one machine or on multiple?

So far only on my machine. I unfortunately do not have access to others. I have not heard this reported, but that's because this only happens (to the best of my knowledge) in unreleased code. I'm currently very wary of releasing said code (I'm supposed to do so in the coming week or two) and finding out, because these sorts of crashes will definitely cause many users to rage quit the app. I'm hoping we get to the bottom of this one way or the other.

  * If only one machine, has it had any other stability issues? Could you run `memtest86` (if x86/PC hardware) as a baseline check?

This was a good idea, but unfortunately all is well. I ran memtest86+ overnight, 10 passes with no errors.

@imsnif
Author

imsnif commented Feb 25, 2025

And I have a proper backtrace with debug symbols!

(gdb) bt
#0  0x00007f134ab5f880 in ?? ()
#1  0x000057ceab37a049 in wasmtime_setjmp_29_0_1 ()
#2  0x000057ceaad40b6b in wasmtime::runtime::vm::traphandlers::catch_traps ()
#3  0x000057ceaadf72d1 in wasmtime::runtime::func::typed::TypedFunc<Params,Results>::call ()
#4  0x000057ceaad679ba in zellij_server::plugins::wasm_bridge::apply_event_to_plugin ()
#5  0x000057ceab032cc6 in async_task::raw::RawTask<F,T,S>::run ()
#6  0x000057ceaa2f9b71 in async_global_executor::threading::thread_main_loop ()
#7  0x000057ceaa2ff776 in std::sys::backtrace::__rust_begin_short_backtrace ()
#8  0x000057ceaa2f7a21 in core::ops::function::FnOnce::call_once{{vtable.shim}} ()
#9  0x000057ceaa71a09b in alloc::boxed::{impl#28}::call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at alloc/src/boxed.rs:1972
#10 alloc::boxed::{impl#28}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at alloc/src/boxed.rs:1972
#11 std::sys::pal::unix::thread::{impl#2}::new::thread_start () at std/src/sys/pal/unix/thread.rs:105
#12 0x00007f134c1ad39d in ?? () from /usr/lib/libc.so.6
#13 0x00007f134c23249c in ?? () from /usr/lib/libc.so.6

@alexcrichton
Member

My guess is that back-and-forth debugging over an asynchronous chat mechanism such as issue comments here is unlikely to turn up much of a solution, with most of the low-hanging-fruit possibilities already having been weeded out. I think @cfallin is right in that to make any progress on this you'd probably need to be able to upload an artifact of some kind here, e.g. a core dump or an rr trace or similar. That's of course not guaranteed to yield a solution, but otherwise only getting snippets here and there is a bit of a stunted debugging experience from our end.

Another thing that might be useful: is there one wasm module which is causing issues? Multiple? One particular export? Multiple? Basically getting a more detailed picture of the crash might yield insights with respect to the shape of what Wasmtime is doing. I realize this is probably difficult to gather from your end due to the non-reproducible nature of the crash but if you're able to install some verbose logging and correlate that with a crash it could perhaps prove valuable.

@imsnif
Author

imsnif commented Feb 25, 2025

I'm very happy to provide whichever information you need. I'm happy to send you the core dump as well if you think that will help (though it was compiled on my machine, so I'm not sure if you'd be able to work with it?)

This does not seem related to any specific wasm module, it happens with the built-in plugins in Zellij. I can point to the code if you'd like.

I'm in a bit of a precarious situation here. This is a very rare crash (as I mentioned, it happens about twice a day for me when working with the app as my daily driver and IDE) - and so as much as I would want to, I don't know how to give you what you're asking for. I simply can't reproduce it.

On the other hand - I also can't release the software in this case. This would be devastating for the application and its stability (assuming it's not somehow a problem local to my case). My only other recourse is to downgrade back to version 21, which is the currently released one and hope for the best. I don't think this would be a good solution.

Could you please help me, maybe with educated guesses? Maybe with a way to mitigate this error, now that we know it's happening in the call function? I would be fine with a try_call or some such that lets me totally restart the plugin. Not a great experience, but better than crashing the whole app.

@bjorn3
Contributor

bjorn3 commented Feb 25, 2025

@imsnif Do you have any plugins installed that are written in Go or another GC'ed language by the way? Or is this with just the default Zellij plugins?

@imsnif
Author

imsnif commented Feb 25, 2025

Just the default Zellij ones.

@bjorn3
Contributor

bjorn3 commented Feb 25, 2025

That at least eliminates the possibility that this is a bug in the GC support of Wasmtime as all default plugins are written in Rust which doesn't use Wasm GC.

@alexcrichton
Member

At least from my perspective I want to be able to help you @imsnif but the context here is very sparse. Others might have more context but the context at least I have is:

  • This is non-reproducible and spurious. Any requests for more information come with the understanding that it may take a while to collect the crash.
  • This crash is a segfault where the ip itself seems to be both the address of the faulting instruction and the address causing the fault. This was present in the OP's syslog logs, and I'm assuming that's the same for all other crashes shown here because the syslog hasn't been recaptured.
  • The stack trace points at compiled wasm code, which, as expected, doesn't have a symbol associated with it.

That's unfortunately not really much to go on. "Segfault in wasm code" could range anywhere from a critical security bug in the runtime to some other random thread unmapping memory unknowingly. Debugging is in general a pretty interactive experience insofar as we don't have a runbook which says "run this command and it'll file an issue"; instead, debugging issues like this requires a lot of back-and-forth with what's being debugged. This is all of course under the limitations you're describing, which is a spurious crash that is not easily reproducible.

It's also worth pointing out that, at least from my perspective, I'm no zellij expert myself; rather, I'm not familiar with the codebase at all. Not being familiar with a codebase can severely hinder debugging because there's so much unknown context about what's going on. Now of course you're in this bucket with respect to Wasmtime as well (I'm assuming you're not intimately familiar with Wasmtime's codebase), and that's something I personally very much want to respect. I don't expect you to be able to provide the perfect crash log/trace that narrows down this issue in a few minutes from our perspective, but at the same time I'd hope you can be sympathetic to our end as well in terms of "segfault in wasm code" not being much to go on in terms of debugging.

I also want to very much respect the pressures you have in play with respect to releasing. From my (probably naive) perspective it seems like you probably want to downgrade to Wasmtime 21 while this is debugged in parallel. Either that or have some sort of time box for investigating this and after that downgrade for a release. I'd love to give you an estimate of how long this will take, or some mitigation to apply to make it less likely, or a workaround, or something like that. With "segfault in wasm code", though, that unfortunately isn't a reality.


Some other information that could possibly help:

  • I see in Zellij there's support for both Winch and Cranelift. Which is being used?
  • I see in Zellij there's support for deserializing compiled modules from disk? Is that being used?
    • If so, is it possible that the file is being modified on disk while it's in use? There's some documentation about the unsafe here. Also if so, perhaps you could test disabling this to see if it improves things?
  • I've got no idea if I can inspect a core dump from your machine on my machine but I figure it can't hurt. I'm no core dump expert myself.
  • What are you currently doing to reproduce? Is there a sequence of things you do which after 15 minutes reproduces? Or do you just work in the background and once a week it crashes? This is somewhat rhetorical but if you do have a method of reproducing, even if it takes awhile, that's where again rr could be invaluable to reproduce the crash.
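One cheap way to test the file-modified-while-in-use theory from the second bullet: record checksums of the cached, precompiled module files at startup and re-verify them after a crash. A minimal sketch of the idea against a throwaway directory (the file names are hypothetical; the real check would point CACHE_DIR at wherever Zellij stores its precompiled modules):

```shell
# Demo of the integrity-check idea with a throwaway directory; in
# practice CACHE_DIR would be Zellij's real module cache.
CACHE_DIR="$(mktemp -d)"
SUM_FILE="$CACHE_DIR/manifest.sha256"
echo "compiled module bytes" > "$CACHE_DIR/plugin.cwasm"

# Snapshot checksums when the server starts...
sha256sum "$CACHE_DIR"/*.cwasm > "$SUM_FILE"

# ...simulate something rewriting a cached module while it's in use...
echo "different bytes" > "$CACHE_DIR/plugin.cwasm"

# ...and re-verify after a crash; a failed check means the file
# changed on disk while the mapping was live.
if sha256sum --status --check "$SUM_FILE"; then
  echo "cache unchanged"
else
  echo "cache modified"
fi
```

Here the second check fails and prints "cache modified"; a mismatch after a real crash would be strong evidence for the modified-on-disk theory.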
