
Issue with Timeout::Error from Channel#wait_for_confirms #351

Closed
justinmills opened this issue Sep 15, 2015 · 8 comments
@justinmills

We're experiencing an issue where various workers of ours (some using Sneakers, some using Resque) are seeing elevated rates of the exception above. We've tried this with various versions of Sneakers and Bunny, and it's pretty consistent. We're currently on the latest versions of both gems, and I recently cranked the heartbeat up from 2s to 10s. I'm not sure that had any effect, as we're continuing to see these.

We use a provider for our Rabbit cluster (CloudAMQP), and in their server log stream I occasionally see messages like this:

    =ERROR REPORT==== 2015-09-15 18:13:07 UTC ===
    closing AMQP connection <0.4284.1934> (54.161.3.57:53797 -> 10.50.145.163:5672):
    {heartbeat_timeout,running}

These correlate time-wise with when we start seeing the Timeout::Error from our publishers. At that point, at least one worker I was watching appeared to be stuck with a bad connection, and no auto-reconnect via Bunny was happening. If any of the publish code raises any exception other than Timeout::Error, we disconnect and reconnect just to be sure. I was under the impression that a Timeout::Error was not thrown in the case of an actual network error, and that Bunny would auto-reconnect if there were one.

Also, all of these workers that are doing the publishing run in Heroku. I'm not sure that has anything to do with it, but thought it worth mentioning.

Unfortunately, we've not been able to reproduce this outside of our production system, and the code is a bit too strewn across multiple classes to simply paste here, but we're establishing our connection with Bunny like this:

    # Memoized, shared connection; :heartbeat is in seconds
    @bunny ||= Bunny.new(RABBITMQ_URL,
                         :vhost => RABBITMQ_VHOST,
                         :heartbeat => 10).tap do |bunny|
      bunny.start
    end

And we're publishing using a block of code like this:

    # channel is assumed to already be in confirm mode (confirm_select)
    exchange.publish(msg, persistent: true)
    success = channel.wait_for_confirms
    raise UnconfirmedPublishError unless success

We are using thread-local channels based on this single Bunny connection but have also tuned the Sneakers worker down to 1 thread to minimize that as a cause.

Any insight into what to look for next, or confirmation that we should be disconnecting and reconnecting when we see these timeouts, would be great. I'm pretty sure reconnecting on Timeout has caused us issues in the past with leaked connections, but we'll give it another shot if that's the correct way to handle them.
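
For what it's worth, the handling we're debating would look roughly like this. This is only a sketch, not code we run today, and the `Thread.current[:rabbit_channel]` key stands in for however the thread-local channel cache is actually keyed:

    begin
      # Same publish path as above; channel is assumed to be in confirm mode.
      exchange.publish(msg, persistent: true)
      raise UnconfirmedPublishError unless channel.wait_for_confirms
    rescue Timeout::Error => e
      # The connection may be dead after a missed heartbeat, so drop the cached
      # connection and any thread-local channel built on it; the next publish
      # re-establishes everything. Re-raise so the worker can retry the job.
      @bunny.close rescue nil
      @bunny = nil
      Thread.current[:rabbit_channel] = nil
      raise e
    end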

Thanks, Justin

@michaelklishin
Member

Please use the mailing list for questions.

There are separate timeouts for continuations and heartbeats. The default these days is 15 or greater; 10 is not a particularly high value. See rabbitmq/rabbitmq-server#227 for one possible cause.

Or just use streaming confirms with callbacks instead of blocking waits.
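
For reference, a minimal sketch of setting the two timeouts independently when opening the connection. The values are illustrative only, and this assumes Bunny's `:heartbeat` option is given in seconds and `:continuation_timeout` in milliseconds:

    conn = Bunny.new(RABBITMQ_URL,
                     :heartbeat            => 60,      # heartbeat interval, in seconds
                     :continuation_timeout => 30_000)  # wait for protocol replies, in milliseconds
    conn.start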


@justinmills
Author

Thanks for the quick response. Forgot about the mailing list.

I'll check out that issue and try some higher timeout values to see if that helps. We looked into streaming confirms, but unfortunately they weren't very compatible with the various places we publish from. Might be time to re-investigate that, though.

Thanks again.

@tigluiz

tigluiz commented Sep 22, 2015

@justin-yesware can you post a link to the mailing list thread for this problem? I'm having the same trouble sporadically. Thanks.

@michaelklishin
Member

@tigluiz this can be caused by rabbitmq/rabbitmq-server#227; bumping the timeout is the only thing a client can do.

@tigluiz

tigluiz commented Sep 22, 2015

@michaelklishin thanks!

@justinmills
Author

FWIW, we wound up cranking the heartbeat up to 60s. That has helped, along with some hardware upgrades on our Rabbit cluster. We still see these occasionally, and when we do, no confirmed publish appears to work after the first one. Fortunately we're on Heroku, so the 24-hour process restart they do typically saves us. We also run into this via our Resque workers, but they re-fork every n jobs, which again clears out any connection to RabbitMQ that may be experiencing this issue. We haven't had much time (~1 week) to confirm this configuration will work long term, but so far so good.

Let me know if you'd like for me to summarize this issue and post to the mailing list.

@ionutzp
Contributor

ionutzp commented Jun 28, 2020

Sorry to revive an old issue, @michaelklishin, but how does one implement this?

Or just use streaming confirms with callbacks instead of blocking waits.

@michaelklishin
Member

Apparently streaming confirms were never implemented in Bunny (see #432). Contributions would be considered.
