
Issue with Timeout::Error from Channel#wait_for_confirms #351

Closed
justinmills opened this issue Sep 15, 2015 · 8 comments
@justinmills

We're experiencing an issue where various workers of ours (some using Sneakers, some using Resque) are seeing elevated rates of the exception above. We've tried this with various versions of Sneakers and Bunny, and it's pretty consistent. We're currently on the latest versions of both gems, and I recently cranked the heartbeat up from 2s to 10s. I'm not sure that had any effect, as we're continuing to see these.

We use a provider for our Rabbit cluster (CloudAMQP), and in their server log stream I occasionally see messages like this:

    =ERROR REPORT==== 2015-09-15 18:13:07 UTC ===
    closing AMQP connection <0.4284.1934> (54.161.3.57:53797 -> 10.50.145.163:5672):
    {heartbeat_timeout,running}

These correlate time-wise with when we start seeing the Timeout::Error from our publishers. At that point, at least one worker I was watching appeared to be stuck with a bad connection, and no auto-reconnect via Bunny was happening. If any of the publish code raises any exception other than Timeout::Error, we disconnect and reconnect just to be sure. I was under the impression that a Timeout::Error was not thrown in the case of an actual network error, and that Bunny would auto-reconnect if there were one.

Also, all of these workers that are doing the publishing run in Heroku. I'm not sure that has anything to do with it, but thought it worth mentioning.

Unfortunately, we've not been able to reproduce this outside of our production system, and the code is a bit too strewn across multiple classes to simply paste here, but we're establishing our connection with Bunny like this:

    # Memoized, shared connection; :heartbeat is in seconds
    @bunny ||= Bunny.new(RABBITMQ_URL,
                         :vhost => RABBITMQ_VHOST,
                         :heartbeat => 10).tap do |bunny|
      bunny.start
    end

And we're publishing using a block of code like this:

    # channel is assumed to already be in confirm mode (confirm_select)
    exchange.publish(msg, persistent: true)
    success = channel.wait_for_confirms
    raise UnconfirmedPublishError unless success

We are using thread-local channels based on this single Bunny connection but have also tuned the Sneakers worker down to 1 thread to minimize that as a cause.

Any insight into what to look for next, or confirmation that we should be disconnecting and reconnecting when we see these timeouts, would be great. I'm pretty sure reconnecting on Timeout has caused us issues in the past with leaked connections, but we'll give it another shot if that's the correct way to handle them.
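
For what it's worth, the handling we're debating would look roughly like this. This is only a sketch, not code we run today, and the `Thread.current[:rabbit_channel]` key stands in for however the thread-local channel cache is actually keyed:

    begin
      # Same publish path as above; channel is assumed to be in confirm mode.
      exchange.publish(msg, persistent: true)
      raise UnconfirmedPublishError unless channel.wait_for_confirms
    rescue Timeout::Error => e
      # The connection may be dead after a missed heartbeat, so drop the cached
      # connection and any thread-local channel built on it; the next publish
      # re-establishes everything. Re-raise so the worker can retry the job.
      @bunny.close rescue nil
      @bunny = nil
      Thread.current[:rabbit_channel] = nil
      raise e
    end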

Thanks, Justin

@michaelklishin
Member

Please use the mailing list for questions.

There are separate timeouts for continuations and heartbeats. The default these days is 15 or greater; 10 is not a particularly high value. See rabbitmq/rabbitmq-server#227 for one possible cause.

Or just use streaming confirms with callbacks instead of blocking waits.
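
For reference, a minimal sketch of setting the two timeouts independently when opening the connection. The values are illustrative only, and this assumes Bunny's `:heartbeat` option is given in seconds and `:continuation_timeout` in milliseconds:

    conn = Bunny.new(RABBITMQ_URL,
                     :heartbeat            => 60,      # heartbeat interval, in seconds
                     :continuation_timeout => 30_000)  # wait for protocol replies, in milliseconds
    conn.start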


@justinmills
Author

Thanks for the quick response. Forgot about the mailing list.

I'll check out that issue and try some higher timeout values to see if that helps. We looked into streaming confirms, but unfortunately they weren't very compatible with the various places we publish from. Might be time to re-investigate that, though.

Thanks again.

@tigluiz

tigluiz commented Sep 22, 2015

@justin-yesware can you post a link to the mailing list thread for this problem? I'm having the same trouble sporadically. Thanks.

@michaelklishin
Member

@tigluiz this can be caused by rabbitmq/rabbitmq-server#227; bumping the timeout is the only thing a client can do.

@tigluiz

tigluiz commented Sep 22, 2015

@michaelklishin thanks!

@justinmills
Author

FWIW, we wound up cranking the heartbeat up to 60s. That has helped, along with some hardware upgrades on our Rabbit cluster. We still see these occasionally, and when we do, no confirmed publish appears to work after the first one. Fortunately we're on Heroku, so the 24-hour process restart they do typically saves us. We also run into this via our Resque workers, but they re-fork every n jobs, which again clears out any connection to RabbitMQ that may be experiencing this issue. We haven't had much time (~1 week) to confirm this configuration will work long term, but so far so good.

Let me know if you'd like for me to summarize this issue and post to the mailing list.

@ionutzp
Contributor

ionutzp commented Jun 28, 2020

Sorry to revive an old issue, @michaelklishin, but how does one implement this?

Or just use streaming confirms with callbacks instead of blocking waits.

@michaelklishin
Member

Apparently streaming confirms were never implemented in Bunny (see #432). Contributions would be considered.
