roachtest: ldr/kv0/workload=both/network_partition failed #141483

Closed
cockroach-teamcity opened this issue Feb 14, 2025 · 9 comments
Labels
  • A-disaster-recovery
  • branch-master: Failures and bugs on the master branch.
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-roachtest
  • O-robot: Originated from a bot.
  • release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
  • T-sql-foundations: SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 14, 2025

roachtest.ldr/kv0/workload=both/network_partition failed with artifacts on master @ 22b262749c502d07ec7ccec5b76abd67c361ae4d. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/642.

(test_runner.go:1382).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/ldr/kv0/workload=both/network_partition/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • metamorphicBufferedSender=false
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-47851

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, and T-disaster-recovery labels on Feb 14, 2025
@cockroach-teamcity
Member Author

roachtest.ldr/kv0/workload=both/network_partition failed with artifacts on master @ 67a82bfc09766413d7f8ae8edb10035940b94c7e. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/644.

(test_runner.go:1382).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/ldr/kv0/workload=both/network_partition/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • metamorphicBufferedSender=false
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.ldr/kv0/workload=both/network_partition failed with artifacts on master @ 34e34c2d5b9598441b935f03a794277e9329a256:

(soon.go:60).SucceedsWithin: condition failed to evaluate within 5m0s: from cluster_to_cluster.go:1910: replicated time 2025-02-16 08:11:00 +0000 UTC not yet at 2025-02-16 08:20:08.346842171 +0000 UTC m=+6509.901081710
test artifacts and logs in: /artifacts/ldr/kv0/workload=both/network_partition/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • metamorphicBufferedSender=true
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=true
  • ssd=0

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.ldr/kv0/workload=both/network_partition failed with artifacts on master @ 34e34c2d5b9598441b935f03a794277e9329a256. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/651.

(test_runner.go:1382).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/ldr/kv0/workload=both/network_partition/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • metamorphicBufferedSender=true
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=true
  • ssd=0

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.ldr/kv0/workload=both/network_partition failed with artifacts on master @ f3193fa36f11d583dd7e7ba505a89e7e309996aa. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/657.

(test_runner.go:1382).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/ldr/kv0/workload=both/network_partition/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=false
  • metamorphicBufferedSender=true
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=true
  • ssd=0

msbutler self-assigned this on Feb 18, 2025
@msbutler
Collaborator

msbutler commented Feb 18, 2025

Looks like the catch-up scan could not complete in about 40 minutes:

2025/02/18 12:25:02 logical_data_replication.go:988: Waiting for replicated times to catchup before verifying left and right clusters
2025/02/18 13:13:54 test_impl.go:474: test failure #1: full stack retained in failure_1.log: (test_runner.go:1382).runTest: test timed out (1h0m0s)

We kept hitting this retryable error: failed to connect to any connection uri

6.unredacted/cockroach.teamcity-18851605-1739860110-120-n7cpu8-0006.ubuntu.2025-02-18T12_14_18Z.016964.log:I250218 12:19:14.673583 6374 crosscluster/logical/logical_replication_job.go:859 ⋮ [T1,Vsystem,n3,job=LOGICAL REPLICATION id=1047954941014310914] 475  hit retryable error failed to connect to any connection uri: ‹failed to connect to `host=35.231.162.36 user=roachprod database=kv`: server error (ERROR: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10.001s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded (SQLSTATE 28000))›

until we paused the stream:

6.unredacted/cockroach.teamcity-18851605-1739860110-120-n7cpu8-0006.ubuntu.2025-02-18T12_14_18Z.016964.log:I250218 13:02:49.653038 6374 crosscluster/logical/logical_replication_job.go:106 ⋮ [T1,Vsystem,n3,job=LOGICAL REPLICATION id=1047954941014310914] 833  pausing after error: ‹failed to connect to any connection uri: failed to connect to `host=35.231.162.36 user=roachprod database=kv`: server error (ERROR: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded (SQLSTATE 28000))›

@msbutler
Collaborator

These symptoms look identical to what @tbg investigated over in #141484 (comment).

Assigning to SQL Foundations.

msbutler removed their assignment on Feb 18, 2025
msbutler added the T-sql-foundations label and removed the T-disaster-recovery label on Feb 18, 2025
@cockroach-teamcity
Member Author

roachtest.ldr/kv0/workload=both/network_partition failed with artifacts on master @ 954761451e68c1c5db5fbe0f336225955dad1075:

(soon.go:60).SucceedsWithin: condition failed to evaluate within 5m0s: from cluster_to_cluster.go:1910: replicated time 2025-02-19 09:23:55 +0000 UTC not yet at 2025-02-19 09:33:08.262736832 +0000 UTC m=+10775.746287756
test artifacts and logs in: /artifacts/ldr/kv0/workload=both/network_partition/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • metamorphicBufferedSender=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.ldr/kv0/workload=both/network_partition failed with artifacts on master @ 3e38c3357dc026e41cf48d192611608b34fc064a. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/687.

(test_runner.go:1376).runTest: test timed out (1h0m0s)
test artifacts and logs in: /artifacts/ldr/kv0/workload=both/network_partition/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=false
  • metamorphicBufferedSender=true
  • metamorphicLeases=leader
  • runtimeAssertionsBuild=true
  • ssd=0

@rafiss
Collaborator

rafiss commented Feb 20, 2025

Closed by #141749.

rafiss closed this as completed on Feb 20, 2025

3 participants