Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Disconnect not always detected #307

Open
allada opened this issue Oct 28, 2024 · 6 comments
Open

[Bug] Disconnect not always detected #307

allada opened this issue Oct 28, 2024 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@allada
Copy link

allada commented Oct 28, 2024

We recently got all the configs going to handle reconnect in fred, but it appears that there is some state where disconnects are not detected by fred and end up hanging the connection forever.

I cannot get it to consistently reproduce, but I can usually get the reproduction case to happen after a few mins.

From what I can tell read timeouts do trigger everything properly, however writes and reconnects seemed to be more obscured and were where I started focusing my thoughts.

Before digging down into this, I figured I'd post this to see if there are any known issues around this and if there were any hints?

@allada allada added the bug Something isn't working label Oct 28, 2024
@aembke
Copy link
Owner

aembke commented Oct 28, 2024

Not that I'm aware of, no. Can you post the configuration details you're using and describe the server/cluster deployment model?

@allada
Copy link
Author

allada commented Oct 28, 2024

I can see this in end-to-end setups, so it is hard to pin point relevant parts/configs because there are so many moving parts, but here's the high level...

Here is the code that is relevant (can reproduce with default config to that store): https://github.com/TraceMachina/nativelink/blob/b2386fdd16ccc4d3330fcf91f593c7e9262a6197/nativelink-store/src/redis_store.rs#L212

Reproduction case:

  1. Spin up redis sentinel cluster & related nativelink config (easier to do over call than over github issue).
  2. Compile nativelink with remote execution pointing to deployed cluster.
  3. While compiling, kill the sentinel redis node.

Nativelink should take up to 10-20 seconds to realize something is wrong and reschedule the downloads, uploads and jobs... Sometimes it never recovers.

If possible it might be easier to show this in action over a call [email in my github user profile]. I started writing my own RedisPool to work around this issue to try and move fast, but if we can get some support here we can probably help everyone.

@aembke
Copy link
Owner

aembke commented Oct 29, 2024

Yeah no problem, happy to get on a call. I'll send you an email tomorrow and we can set something up.

@aembke
Copy link
Owner

aembke commented Nov 30, 2024

Hey @allada, did you have any luck repro'ing this on your side? I added sentinel support to the failover testing tool I usually use for this kind of thing (https://github.com/aembke/fred.rs/tree/main/bin/inf_loop) and tested sentinel failover for a few hours but wasn't able to repro the issue.

Also for what it's worth I just released version 10.0.0, which contains a ton of routing changes, so you might want to try against that version too.

@allada
Copy link
Author

allada commented Jan 30, 2025

Sorry for the delay on this, I just got back into looking at this.

I am able to reproduce it, but it's pretty tricky. I don't have a unit test to show, but I can reproduce it within a few mins of manual testing.

From what I can tell is happening is that the underlying std::net::TCPStream stalls out on read/write until the OS or other network interface to timeout the connection (usually around 10-16 mins in my kubernetes-land testing). Things do eventually get back to normal.

Right now I'm thinking the problem lies around the fact that tokio's TcpStream does not support:
https://doc.rust-lang.org/std/net/struct.TcpStream.html#method.set_write_timeout
https://doc.rust-lang.org/std/net/struct.TcpStream.html#method.set_read_timeout

I am still testing this theory, but the symptoms make sense.

Here's some logs (note I truncate after 200 chracters):

[2025-01-30T05:10:30Z TRACE fred::modules::response] FromValue(Bytes): Bytes(b"\xbc\xb31\r_\x9cMdw\xc1\x10S\xb4)\xb0\x11P\x12\xd5\x9b06A\xedAEb-\xf1\xe9\xd6\xb7\x9d\xb4)\xc8\xd5n\xf9\xd4\xd4\xb9}\xb8G
[2025-01-30T05:12:05Z TRACE fred::interfaces] fred-f14re7BdWm: Sending command SETRANGE (0) to router.
[2025-01-30T05:12:05Z TRACE fred::router::commands] fred-f14re7BdWm: Recv command: RouterCommand { kind: "Command", command: "SETRANGE" }
[2025-01-30T05:12:05Z TRACE fred::router::commands] fred-f14re7BdWm: Writing command: "SETRANGE" (0)
[2025-01-30T05:12:05Z TRACE fred::router::utils] fred-f14re7BdWm: Sending command SETRANGE (0) to redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379
[2025-01-30T05:12:05Z TRACE fred::protocol::codec] fred-f14re7BdWm: Encoded 1048737 bytes to redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379. Buffer len: 1048737 (RESP2)
[2025-01-30T05:12:05Z TRACE fred::protocol::debug] fred-f14re7BdWm: Encoded Array([String("SETRANGE"), String("temp-d54f930e-7321-4793-af7b-807613b968f1-{2ee4337e9529019bb1d17c33230c2d26f9edc0d6a778af
[2025-01-30T05:12:05Z TRACE fred::router::commands] fred-f14re7BdWm: Sent command to redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379. Flushed: true
[2025-01-30T05:12:05Z TRACE fred::protocol::connection] Flushing socket to redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379
[2025-01-30T05:12:05Z TRACE fred::protocol::codec] fred-f14re7BdWm: Recv 10 bytes from redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379 (RESP2).
[2025-01-30T05:12:05Z TRACE fred::protocol::codec] fred-f14re7BdWm: Parsed 10 bytes from redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379
[2025-01-30T05:12:05Z TRACE fred::protocol::debug] fred-f14re7BdWm: Decoded Integer(1048576)
[2025-01-30T05:12:05Z TRACE fred::router::utils] fred-f14re7BdWm: Recv read result from redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379
[2025-01-30T05:12:05Z TRACE fred::router::centralized] fred-f14re7BdWm: Parsing response frame from redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379
[2025-01-30T05:12:05Z TRACE fred::router::centralized] fred-f14re7BdWm: Checking response to SETRANGE (0)
[2025-01-30T05:12:05Z TRACE fred::router::centralized] fred-f14re7BdWm: Handling centralized response kind: Respond
[2025-01-30T05:12:05Z TRACE fred::protocol::responders] fred-f14re7BdWm: Respond to caller from redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379 for SETRANGE with Number
[2025-01-30T05:12:05Z TRACE fred::protocol::codec] fred-f14re7BdWm: Recv 0 bytes from redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379 (RESP2).
[2025-01-30T05:12:07Z TRACE fred::interfaces] fred-f14re7BdWm: Sending command SETRANGE (0) to router.
[2025-01-30T05:12:07Z TRACE fred::router::commands] fred-f14re7BdWm: Recv command: RouterCommand { kind: "Command", command: "SETRANGE" }
[2025-01-30T05:12:07Z TRACE fred::router::commands] fred-f14re7BdWm: Writing command: "SETRANGE" (0)
[2025-01-30T05:12:07Z TRACE fred::router::utils] fred-f14re7BdWm: Sending command SETRANGE (0) to redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379
[2025-01-30T05:12:07Z TRACE fred::protocol::codec] fred-f14re7BdWm: Encoded 1048737 bytes to redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379. Buffer len: 1048737 (RESP2)
[2025-01-30T05:12:07Z TRACE fred::protocol::debug] fred-f14re7BdWm: Encoded Array([String("SETRANGE"), String("temp-84e6779e-525f-402b-82a5-51243b6895a4-{2ee4337e9529019bb1d17c33230c2d26f9edc0d6a778af


// --- LONG PAUSE HERE ---



[2025-01-30T05:27:38Z DEBUG fred::router::utils] redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379: Error sending frame to socket: Error { details: "Os { code: 113, kind: HostUnr
[2025-01-30T05:27:38Z DEBUG fred::router::utils] fred-f14re7BdWm: Resetting connection to redis-node-1.redis-headless.nativelink-dev-drake.svc.cluster.local:6379 with error: Error { details: "Os { cod
[2025-01-30T05:27:38Z WARN  fred::router::responses] fred-f14re7BdWm: Broadcasting error Some(Error { details: "Os { code: 113, kind: HostUnreachable, message: \"No route to host\" }", kind: IO }) fro
[2025-01-30T05:27:38Z DEBUG fred::modules::inner] fred-f14re7BdWm: No `on_error` listener. The error was: Error { details: "Os { code: 113, kind: HostUnreachable, message: \"No route to host\" }", kin
[2025-01-30T05:27:38Z DEBUG fred::modules::inner] fred-f14re7BdWm: Checking reconnect state. Has policy: true, Is intentionally disconnecting: false
[2025-01-30T05:27:38Z DEBUG fred::router::utils] fred-f14re7BdWm: Defer reconnection to None after Error { details: "Os { code: 113, kind: HostUnreachable, message: \"No route to host\" }", kind: IO }
[2025-01-30T05:27:38Z TRACE fred::router::commands] fred-f14re7BdWm: Recv command: RouterCommand { kind: "Command", command: "SETRANGE" }
[2025-01-30T05:27:38Z TRACE fred::router::commands] fred-f14re7BdWm: Writing command: "SETRANGE" (0)
[2025-01-30T05:27:38Z DEBUG fred::modules::inner] fred-f14re7BdWm: Checking reconnect state. Has policy: true, Is intentionally disconnecting: false
[2025-01-30T05:27:38Z DEBUG fred::router::utils] fred-f14re7BdWm: Skip defer reconnection.
[2025-01-30T05:27:38Z TRACE fred::router::commands] fred-f14re7BdWm: Recv command: RouterCommand { kind: "Reconnect", server: None, force: false }
[2025-01-30T05:27:38Z DEBUG fred::router::commands] fred-f14re7BdWm: Maybe reconnecting to None (force: false)
[2025-01-30T05:27:38Z DEBUG fred::router::commands] fred-f14re7BdWm: Starting reconnection loop...
[2025-01-30T05:27:38Z DEBUG fred::router::utils] fred-f14re7BdWm: Sleeping for 128 ms.
[2025-01-30T05:27:38Z DEBUG fred::modules::inner] fred-f14re7BdWm: Sleeping for 128 ms
[2025-01-30T05:27:38Z TRACE fred::router::utils] fred-f14re7BdWm: Reconnecting...
[2025-01-30T05:27:38Z DEBUG fred::router::sentinel] fred-f14re7BdWm: Initializing sentinel connection.
[2025-01-30T05:27:38Z DEBUG fred::router::sentinel] fred-f14re7BdWm: Connecting to sentinel redis-node-0.redis-headless.nativelink-dev-drake.svc.cluster.local:26379
[2025-01-30T05:27:38Z TRACE fred::protocol::connection] fred-f14re7BdWm: Checking connection type. Native-tls: false, Rustls: false
[2025-01-30T05:27:38Z TRACE fred::protocol::types] fred-f14re7BdWm: Found 1 addresses for redis-node-0.redis-headless.nativelink-dev-drake.svc.cluster.local:26379
[2025-01-30T05:27:38Z DEBUG fred::protocol::connection] fred-f14re7BdWm: Creating TCP connection to redis-node-0.redis-headless.nativelink-dev-drake.svc.cluster.local at 10.0.131.247:26379
[2025-01-30T05:27:38Z TRACE fred::protocol::connection] fred-f14re7BdWm: Skip authentication without credentials.
[2025-01-30T05:27:38Z TRACE fred::protocol::codec] fred-f14re7BdWm: Encoded 63 bytes to redis-node-0.redis-headless.nativelink-dev-drake.svc.cluster.local:26379. Buffer len: 63 (RESP2)
[2025-01-30T05:27:38Z TRACE fred::protocol::debug] fred-f14re7BdWm: Encoded Array([String("SENTINEL"), String("GET-MASTER-ADDR-BY-NAME"), String("dev-drake")])
[2025-01-30T05:27:38Z TRACE fred::protocol::codec] fred-f14re7BdWm: Recv 87 bytes from redis-node-0.redis-headless.nativelink-dev-drake.svc.cluster.local:26379 (RESP2).
[2025-01-30T05:27:38Z TRACE fred::protocol::codec] fred-f14re7BdWm: Parsed 87 bytes from redis-node-0.redis-headless.nativelink-dev-drake.svc.cluster.local:26379

@allada
Copy link
Author

allada commented Feb 3, 2025

FYI, I tried playing around by manually setting set_write_timeout and set_read_timeout, but didn't seem to help. I'm thinking maybe using write_timeout internal calls might work, but it is possible that there's some kind of issue in how futures are being dispatched to blocking threads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants