Intermittent deadlock when closing a channel using CloseAsync in 7.x #1751
Comments
Hi, thanks for the report. As I'm sure you're aware, there's not much to work with here 😸 Obviously, the gold standard is to provide code that reproduces this issue, or at least some idea of the steps to do so.
What does this mean? Do you have some way in your application to increase the frequency of channel closure?
We're running tests that create and close channels very frequently, and the test suite that does this the most is the one that usually gets stuck. Anyhow, I can try to dig into this further and see if I can provide something that will help you reproduce it. Thanks
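For concreteness, the kind of churn being described is roughly the loop below, sketched against the 7.x async API. This is not the actual test code; the host name and iteration count are placeholders.

```csharp
using RabbitMQ.Client;

// Sketch: repeatedly open and close channels on a single connection.
var factory = new ConnectionFactory { HostName = "localhost" };
await using IConnection connection = await factory.CreateConnectionAsync();

for (int i = 0; i < 1_000; i++)
{
    IChannel channel = await connection.CreateChannelAsync();
    // ... exercise the channel here ...
    await channel.CloseAsync();   // the call that intermittently hangs in 7.x
    await channel.DisposeAsync();
}
```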
@Andersso channel and connection churn are workloads explicitly recommended against.
It would be extremely helpful for you to share your test code. If you can't do that, describe the test as best you can:
My guess is that you could be hitting a
This is a related issue:
Also note that the management UI has connection and channel churn metrics, on the Overview page but also on the node page IIRC. So at the very least it should be easy to see the churn rate: is it 50 channels opened per second? Is it 200?
@Andersso @ZajacPiotr98 - I've modified a test app in this project to try and trigger the error in this issue, or the error in #1749, and it works fine every time in my environment:
Hi again, and sorry for the delayed response. I hope you guys had a good Christmas and New Year!

I've been working on reproducing the issue in a test project but haven't had any success. I've tried experimenting with different thread pool sizes, but it didn't seem to affect the outcome. Based on my investigation of my latest memory dump, there's no indication of thread pool starvation; all the threads in the pool are idle and waiting for work. It is also worth mentioning that my application is a console app, so it does not have a synchronization context.

Regarding the connection churn: wouldn't that have caused issues in the 6.x versions as well? We've had this setup running fine for years without any problems until the upgrade to 7.x.

I've done some additional digging by analyzing the memory dump. Specifically, I've looked at the tasks being awaited in the method that always seems to get stuck (according to the async dump):

It appears that the channel never gets completed, which prevents the method from ever completing.
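To make that symptom concrete: the shape of such a hang is an `await` on a `System.Threading.Channels` reader's `Completion` task that never finishes because nothing marks the writer complete. A standalone illustration only, not the client library's internal code:

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Note: this Channel<T> is System.Threading.Channels, not RabbitMQ's IChannel.
var channel = Channel.CreateUnbounded<int>();

async Task ConsumeAsync(ChannelReader<int> reader)
{
    while (reader.TryRead(out int item))
    {
        Console.WriteLine(item);
    }

    // Completion only completes after Writer.Complete()/TryComplete() has been
    // called and all items are drained. If that never happens, this await hangs.
    await reader.Completion;
}

await channel.Writer.WriteAsync(1);
// channel.Writer.Complete();   // without this line, ConsumeAsync never returns
await ConsumeAsync(channel.Reader);
```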
@Andersso I'm going to publish a 7.1.0 alpha release right now. When it's ready, I'll follow up here. There have been a couple of fixes merged that might help here. Any luck reproducing this issue reliably?
@Andersso please give this version a try! https://www.nuget.org/packages/RabbitMQ.Client/7.1.0-alpha.0
Hey,
I also performed the test with the alpha version, with the same results. I also tried a workaround of passing a cancellation token to the call. In my case it was around 500 close requests in 2 minutes from one instance of my application (6 instances overall, 5 connections each, 5 RabbitMQ nodes with a 3 GiB high watermark). A second instance of the app had the same issue at around 1000 close requests in 4 minutes.
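A timeout-guarded close along those lines might look like the sketch below. The guarded call is assumed to be `IChannel.CloseAsync`, and the 5-second timeout and fallback are arbitrary choices, not a recommendation from the library.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using RabbitMQ.Client;

static class ChannelCloseHelper
{
    // Sketch only: bound the close with a cancellation token so a hung
    // CloseAsync cannot block shutdown forever. The timeout value is arbitrary.
    public static async Task CloseWithTimeoutAsync(IChannel channel)
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
        try
        {
            await channel.CloseAsync(cts.Token);
        }
        catch (OperationCanceledException)
        {
            // The close did not finish in time; fall back to disposing the channel.
            await channel.DisposeAsync();
        }
    }
}
```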
Thanks for your reports. I'll try to reproduce this issue locally, though I've had no luck so far. |
Fixes #1751 Attempt to fix deadlock by waiting on channel dispatcher first, then channel reader.
@Andersso @ZajacPiotr98 I'm wondering if you're running into this condition: https://stackoverflow.com/a/66521303 Is it possible to test my PR branch in your environments? If not, I can publish another alpha release. Thank you!
Hey, I do not have the infrastructure to use your repo directly. A NuGet package would be perfect! Thanks
@Andersso - I built the packages locally on my branch and uploaded them here: https://www.myget.org/feed/rabbitmq-dotnet-client/package/nuget/RabbitMQ.Client/7.1.0-alpha.0.1
I tested this PR and the issue is still there. I added logs and it seems that for some reason
Thanks for the follow-up. I wish I could reproduce this! I think the best fix will be to either not await
@ZajacPiotr98 @Andersso I've uploaded a new version to MyGet: https://www.myget.org/feed/rabbitmq-dotnet-client/package/nuget/RabbitMQ.Client/7.1.0-alpha.0.2 When the
I will run 7.1.0-alpha.0.2 over the weekend, fingers crossed! Sorry for my ignorance, but where does the log end up?
You have to configure an event listener. Use that class as a starting point in your own project. Instead of writing to
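For anyone else who wants the same diagnostics, a minimal listener along those lines might look like this sketch. The "rabbitmq" name filter is an assumption; match it to the client's actual EventSource name, and keep one instance alive for the lifetime of the app (for example in a static field), since disposing the listener stops the events.

```csharp
using System;
using System.Diagnostics.Tracing;

// Sketch: subscribe to any EventSource whose name mentions "rabbitmq"
// (an assumed filter) and print events to the console instead of a log sink.
public sealed class ConsoleRabbitMqEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name.Contains("rabbitmq", StringComparison.OrdinalIgnoreCase))
        {
            EnableEvents(eventSource, EventLevel.Verbose);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        string payload = eventData.Payload is null
            ? string.Empty
            : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.EventSource.Name}/{eventData.EventName}: {payload}");
    }
}
```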
Hey again, sorry for the delayed response. Unfortunately, the issue is still present, and no log output has been observed (I did verify that the event listener is working). I will take another dive once I have a fresh memory dump. Thanks
@Andersso thanks for the report. Argh, I wish I could reproduce this issue here. I will try some other ideas and will publish a new release to MyGet. I REALLY appreciate you being willing to test and investigate.
Fixes #1751 See if not awaiting `_reader.Completion` fixes the issue.
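That PR's exact change is not shown here, but for context: one common alternative to awaiting a completion task unconditionally is to bound the wait, for example with `Task.WaitAsync`. A sketch with an arbitrary timeout, not the library's code:

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

static class ReaderShutdownSketch
{
    // Sketch only: drain the reader, then wait for Completion with a bound
    // instead of awaiting it unconditionally. The timeout is arbitrary.
    public static async Task DrainAndWaitAsync(ChannelReader<int> reader)
    {
        while (reader.TryRead(out _))
        {
            // discard remaining items
        }

        try
        {
            await reader.Completion.WaitAsync(TimeSpan.FromSeconds(10));
        }
        catch (TimeoutException)
        {
            // Give up waiting rather than hanging the close path forever.
        }
    }
}
```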
@Andersso @ZajacPiotr98 - please test version 7.1.0-alpha.1.1, which includes the code in this pull request: #1782 https://www.myget.org/feed/rabbitmq-dotnet-client/package/nuget/RabbitMQ.Client Thank you!
Hey, I can confirm that the memory leak fix did not solve the issue. I will try out the new package today. Thanks!
@lukebakken Would it be possible to release the memory leak fix as a minor version? We are also noticing a number of cancellation tokens and would greatly appreciate the update. Thank you!
@NathanielAB you probably mean "as a patch version"
I'll produce a new release once @Andersso and / or @ZajacPiotr98 confirm the fix in 7.1.0-alpha.1.1. It shouldn't take long. @NathanielAB you're more than welcome to use that version, of course! https://www.myget.org/feed/rabbitmq-dotnet-client/package/nuget/RabbitMQ.Client
Hey, I have run the new pre-release package over the weekend and I haven't observed it getting stuck; it looks promising!
@Andersso thanks for letting us know!
Hey everyone, hot off the presses: https://www.nuget.org/packages/RabbitMQ.Client/7.1.0
Tried version 7.1.0 but still encountered locks. The MonitorHeld metric keeps increasing while the thread remains unchanged. Some of the locks seem to be gone, but it looks like the issue is still present.
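As an aside, MonitorHeld is the column reported by SOS's `syncblk` command in `dotnet-dump analyze`. A much coarser, in-process proxy is the runtime's own contention counter; the snippet below is only a sketch of snapshotting it, not a replacement for a dump, and measures a related but different metric.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: snapshot the runtime's total monitor contention count over a window.
// A value that keeps rising while the app should be idle is a cheap hint of
// lock trouble, though it does not show which lock is held.
long before = Monitor.LockContentionCount;
await Task.Delay(TimeSpan.FromSeconds(10));
long after = Monitor.LockContentionCount;
Console.WriteLine($"Monitor contentions in the last 10 s: {after - before}");
```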
Tried this fix, and now those leaks are no longer detected.
@DenisMayorko - can you reproduce the locks every time? |
Yes, the application gradually accumulates locks with each startup. Preventing the arrival of new messages does not reduce the amount. |
@DenisMayorko - do you see the same symptoms as initially reported in this issue - what appears to be a deadlock in
@DenisMayorko - let's continue discussion here: #1784
Hmm… I see a large number of ShutdownEvent instances and a corresponding deadlock, an increase in the number of timers, etc., but I don't see any mentions of Close or Dispose methods in the StackTrace. I'm starting to doubt that the issues are related, although the problem also seems to occur when shutting down the channel, judging by the call stack.
@Andersso @ZajacPiotr98 @DenisMayorko - Please upgrade to this version! Thank you for all of your debugging!
Describe the bug
Hi there,
Ever since upgrading from 6.x to 7.x, I've been running into intermittent deadlocks whenever I try to close a channel via `CloseAsync`. I haven't been able to reproduce it locally, and while I've done some remote debugging, I could not get any insight (all thread pool threads are idle, waiting for work).
I did, however, manage to run `dotnet-dump dumpasync` during one of these deadlocks and got the following info:
First dump
Second dump (another instance)
I noticed that in both dumps the stacks aren't displayed with the usual `Awaiting:` notation you often see in async stack traces, but that might be normal.
Reproduction steps
I haven't pinned down a reliable way to reproduce this, but calling `CloseAsync` more frequently seems to increase the chances of hitting the deadlock. It also appears more common on Linux than on Windows, though that might just be due to hardware differences rather than OS behavior.
Expected behavior
When calling `CloseAsync`, I'd expect the channel to close normally without causing a deadlock.
Additional context
No response