Help debugging a seemingly random segfault in wasmtime #10283
Comments
It's suspicious that the segfaulting address and the ip are the same, meaning that this looks like it's trying to execute code that's either not mapped or not executable. What OS are you on? That may be some cache-coherency thing and/or bug in Wasmtime perhaps? (Not that anything about that has changed recently, so I don't know what would have changed in the 21->29 update other than perhaps making it more likely one way or another.) |
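The observation above (fault address equal to the instruction pointer) can be checked mechanically. The following is a minimal sketch, assuming the usual Linux kernel log shape `segfault at <addr> ip <ip> sp <sp> error <code>` with hexadecimal fields; the type and function names are hypothetical, and the sample line in the usage note is made up for illustration, not taken from this report.

```rust
/// Fields pulled out of a Linux kernel "segfault at ..." log line.
/// All names here are hypothetical helpers for illustration.
#[derive(Debug, PartialEq)]
struct SegfaultRecord {
    fault_addr: u64,
    ip: u64,
    error_code: u64,
}

/// Return the hex value of the token that follows `key`.
fn hex_after(tokens: &[&str], key: &str) -> Option<u64> {
    let pos = tokens.iter().position(|t| *t == key)?;
    let raw = tokens.get(pos + 1)?;
    u64::from_str_radix(raw.trim_start_matches("0x"), 16).ok()
}

/// Parse a line of the assumed form
/// "<name>[<pid>]: segfault at <addr> ip <ip> sp <sp> error <code> ...".
fn parse_segfault_line(line: &str) -> Option<SegfaultRecord> {
    let tokens: Vec<&str> = line.split_whitespace().collect();
    Some(SegfaultRecord {
        fault_addr: hex_after(&tokens, "at")?,
        ip: hex_after(&tokens, "ip")?,
        error_code: hex_after(&tokens, "error")?,
    })
}

impl SegfaultRecord {
    /// Fault address == ip: the CPU faulted while fetching the
    /// instruction itself, not while loading or storing data.
    fn fault_addr_equals_ip(&self) -> bool {
        self.fault_addr == self.ip
    }

    /// x86 page-fault error code, bit 4: the access was an instruction fetch.
    fn is_instruction_fetch(&self) -> bool {
        self.error_code & 0x10 != 0
    }

    /// x86 page-fault error code, bit 0: set for a protection violation on a
    /// present page, clear when the page was not present at all.
    fn page_was_present(&self) -> bool {
        self.error_code & 0x1 != 0
    }
}
```

For example, a made-up line such as `app[1234]: segfault at 7f0000001000 ip 00007f0000001000 sp 00007ffc00000000 error 14` parses to a record where `fault_addr_equals_ip()` and `is_instruction_fetch()` are true and `page_was_present()` is false. That combination means a jump into unmapped memory rather than into a mapped but non-executable page.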
I'm on Arch Linux |
Can you run |
Sure:
|
Ah ok, x64 probably doesn't have any cache coherency issues or things like that. Can you open up the coredump and get the exact faulting instruction? E.g. if you open it up and run EDIT: also if possible a stack trace via |
Sure, here you go (I hope I did it right).
I'm also going to start working with an unstripped version so that next time this happens we can hopefully get some more info. |
Looks like the disassembly got mangled due to starting at the wrong offset. |
Is this better?
|
It appears execution fell into bad data somehow: Perhaps try starting further back -- did execution slide into this from legitimate code just prior, somehow? Or perhaps a jump landed here. Are you able to capture the crash under |
Would it be helpful if I attach the coredump here? |
Probably not without the binary and other information -- in general it's difficult to jump into the context of another project embedding us and work things out. Capturing the crash in |
I have unfortunately not managed to find a way to reproduce the crash. It happens about 1-2 times a day of heavy work with the application. I guess it will help some more if I provide a backtrace of the coredump with the debug symbols? |
Landing in the middle of zero data is the sort of thing that requires a bit more info to debug unfortunately -- a backtrace might offer some clues, but the only way to really get enough information to zero in on a fix (if this is a Wasmtime bug) would be, again, an |
A few questions that might help triage as well:
|
Yes and yes. Zellij has 500+ dependencies. I did debug it down to these lines though https://github.com/zellij-org/zellij/blob/main/zellij-server/src/plugins/wasm_bridge.rs#L1589-L1596 - it definitely happens here. The only unsafe thing that's happening in this range has to do with wasmtime (unless I'm missing something which is always a possibility). Otherwise it's serialization, acquiring a lock on an
Anecdotally, it seems to me that this happens when the system itself is under heavy load (e.g. compiling Rust). This is a 10+ year old laptop, so it struggles often and these tend to be the times in which this happens - but definitely not only.
So far only on my machine. I unfortunately do not have access to others. I have not heard this reported, but that's because this only happens (to the best of my knowledge) in unreleased code. I'm currently very wary of releasing said code (supposed to do so in the coming week or two) and finding out, because these sorts of crashes will definitely cause many users to rage quit the app. I'm hoping we get to the bottom of this one way or the other.
This was a good idea, but unfortunately all is well. I ran memtest86+ overnight, 10 passes with no errors. |
And I have a proper backtrace with debug symbols!
|
My guess is that back-and-forth debugging over an asynchronous chat mechanism such as issue comments here is unlikely to turn up much of a solution, with most of the low-hanging-fruit possibilities already having been weeded out. I think @cfallin is right in that to make any progress on this you'd probably need to be able to upload an artifact of some kind here, e.g. a core dump or an Another thing that might be useful: is there one wasm module which is causing issues? Multiple? One particular export? Multiple? Basically, getting a more detailed picture of the crash might yield insights with respect to the shape of what Wasmtime is doing. I realize this is probably difficult to gather from your end due to the non-reproducible nature of the crash, but if you're able to install some verbose logging and correlate that with a crash it could perhaps prove valuable. |
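One way to approach the "verbose logging" suggestion above is a breadcrumb log: write one line before and one line after every call into a plugin, so that after a crash the last unmatched "enter" line names the module and export that were running. The following is a minimal sketch using only the Rust standard library; `CallTrace`, `traced`, `last_unfinished_call`, and the plugin and export labels are hypothetical and are not part of Zellij or Wasmtime.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Append-only breadcrumb log for calls into wasm plugins (hypothetical helper).
struct CallTrace {
    file: File,
}

impl CallTrace {
    fn open(path: &Path) -> io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file })
    }

    /// Write one breadcrumb line. `File` is unbuffered, so `write_all` hands
    /// the bytes to the kernel right away and they survive a SIGSEGV of this
    /// process. Call `sync_all` as well if the log must survive a power loss.
    fn record(&mut self, phase: &str, plugin: &str, export: &str) -> io::Result<()> {
        let line = format!("{phase} plugin={plugin} export={export}\n");
        self.file.write_all(line.as_bytes())
    }

    /// Run `call` between an "enter" and an "exit" breadcrumb.
    fn traced<T>(
        &mut self,
        plugin: &str,
        export: &str,
        call: impl FnOnce() -> T,
    ) -> io::Result<T> {
        self.record("enter", plugin, export)?;
        let out = call();
        self.record("exit", plugin, export)?;
        Ok(out)
    }
}

/// After a crash: if the last line is an "enter", that call never returned.
/// Assumes a single writer; with several threads, add a thread id to each
/// line and match "enter"/"exit" pairs per thread instead.
fn last_unfinished_call(log: &str) -> Option<&str> {
    log.lines().last().filter(|line| line.starts_with("enter "))
}
```

In an embedder, the closure passed to `traced` would wrap the actual call into the wasm export. Running `last_unfinished_call` on the log after each crash would then show whether the crashes cluster on one plugin or export or are spread across all of them.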
I'm very happy to provide whichever information you need. I'm happy to send you the core dump as well if you think that will help (though it was compiled to my machine, so I'm not sure if you'd be able to work with it?) This does not seem related to any specific wasm module, it happens with the built-in plugins in Zellij. I can point to the code if you'd like. I'm in a bit of a precarious situation here. This is a very rare crash (as I mentioned, it happens about twice a day for me when working with the app as my daily driver and IDE) - and so as much as I would want to, I don't know how to give you what you're asking for. I simply can't reproduce it. On the other hand - I also can't release the software in this case. This would be devastating for the application and its stability (assuming it's not somehow a problem local to my case). My only other recourse is to downgrade back to version 21, which is the currently released one and hope for the best. I don't think this would be a good solution. Could you please help me, maybe with educated guesses? Maybe with a way to mitigate this error, now that we know it's happening in the call function? I would be fine with a |
@imsnif Do you have any plugins installed that are written in Go or another GC'ed language by the way? Or is this with just the default Zellij plugins? |
Just the default Zellij ones. |
That at least eliminates the possibility that this is a bug in the GC support of Wasmtime as all default plugins are written in Rust which doesn't use Wasm GC. |
At least from my perspective I want to be able to help you @imsnif but the context here is very sparse. Others might have more context but the context at least I have is:
That's unfortunately not really much to go on. "Segfault in wasm code" could range anywhere from a critical security bug in the runtime to some other random thread unmapping memory unknowingly. Debugging is in general a pretty interactive experience insofar as we don't have a runbook which says "run this command and it'll file an issue"; instead, debugging issues like this requires a lot of back-and-forth with what's being debugged. This is all of course under the limitations you're describing, which is a spurious crash that is not easily reproducible. It's also worth pointing out that at least from my perspective I'm no zellij expert myself, rather I'm not familiar with the codebase at all. Not being familiar with a codebase can severely hinder debugging because there's so much unknown context of what's going on. Now of course you're in this bucket with respect to Wasmtime as well (I'm assuming you're not intimately familiar with Wasmtime's codebase), and that's something I personally very much want to respect. I don't expect you to be able to provide the perfect crash log/trace that narrows down this issue in a few minutes from our perspective, but at the same time I'd hope you can be sympathetic to our end as well, in that "segfault in wasm code" is not much to go on in terms of debugging. I also want to very much respect the pressures you have in play as well with respect to releasing. From my (probably naive) perspective it seems like you probably want to downgrade to Wasmtime 21 while this is debugged in parallel. Either that or have some sort of time box for investigating this and after that downgrade for a release. I'd love to give you an estimate of how long this will take or some mitigation to apply to make it less likely or workaround or something like that. With "segfault in wasm code" though that option unfortunately isn't a reality. Some other information that could possibly help:
|
Hey,
In the context of Zellij, ever since a recent (still unreleased) upgrade of wasmtime, I've been getting intermittent and seemingly random segfaults.
I am afraid I don't have a way to reproduce them, but I have been gathering some information about them and am hoping the maintainers can help pin-point the issue or provide more insight.
Version info:
wasmtime version: 29.0.1
(probably) did not occur in wasmtime version: 21.0.2
What happens: Zellij crashes, without a panic, and syslog tells me that it got a segfault:
With some debug logging, I managed to narrow the crash down to this block in our app: https://github.com/zellij-org/zellij/blob/main/zellij-server/src/plugins/wasm_bridge.rs#L1589-L1596
I have a core dump, but unfortunately not one with debug symbols so I'm not sure it's a lot of help.
This happens about once or twice a day for me and so is extremely gnarly to pin-point. I'm not 100% sure this is a wasmtime issue, but the crash location would seem to indicate that it is.
Any help in troubleshooting or mitigating this somehow would be appreciated. On my part I'm going to try to get a coredump with some debug symbols so we'll hopefully be less blind.
Thanks!