roachtest: failover/non-system/pause/lease=expiration failed #141484

Closed
cockroach-teamcity opened this issue Feb 14, 2025 · 14 comments · Fixed by #141749
Labels: branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, T-sql-foundations, target-release-25.2.0

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 14, 2025

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ 22b262749c502d07ec7ccec5b76abd67c361ae4d:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-47852

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, and T-kv labels on Feb 14, 2025
@tbg
Member

tbg commented Feb 14, 2025

I investigated the sibling test in #141480 (comment). I'm unsure whether this failure is the same, but superficially it looks that way. I will wait for more nightlies to see if there are repeat failures.

@cockroach-teamcity
Member Author

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ 67a82bfc09766413d7f8ae8edb10035940b94c7e:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ 34e34c2d5b9598441b935f03a794277e9329a256:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ 34e34c2d5b9598441b935f03a794277e9329a256:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ 34e34c2d5b9598441b935f03a794277e9329a256:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ 34e34c2d5b9598441b935f03a794277e9329a256:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10.001s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0

@tbg
Member

tbg commented Feb 17, 2025

$ git log 22b2627 --not 9598211 --merges --oneline
22b2627 Merge #141371 #141439 🔴🔴🔴
c326442 Merge #141033 🔴🔴🔴🔴🔴🔴🔴🔴🔴
557a7c4 Merge #141436 🟢🟢🟢🟢🟢🟢
74067cb Merge #141211 #141420
076d719 Merge #141450
0681b73 Merge #141431 🟢
4c890df Merge #141402
68a90a5 Merge #141428
9892b54 Merge #141147 #141432
1b281fe Merge #141423
086fbd4 Merge #140685
a203a49 Merge #140601
7efd743 Merge #140511
ab7abe1 Merge #141414
83b38f5 Merge #140662

tbg added the T-sql-foundations label and removed the T-kv label on Feb 17, 2025
@cockroach-teamcity
Member Author

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ f3193fa36f11d583dd7e7ba505a89e7e309996aa:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ f3193fa36f11d583dd7e7ba505a89e7e309996aa:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ 954761451e68c1c5db5fbe0f336225955dad1075:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • metamorphicBufferedSender=true
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ 954761451e68c1c5db5fbe0f336225955dad1075:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0

@fqazi
Collaborator

fqazi commented Feb 19, 2025

We seem to get stuck reading the descriptor. I'll try to see where the remote end of this is stuck:

I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +goroutine 2602407 [chan receive]:
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges.func1()
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/dist_sender.go:1832 +0x11d
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc003593808, {0x8a53df8, 0xc00a849c20}, 0xc0036548c0, {{0xc010cfbe00, 0x4, 0x8}, {0xc010cfbe78, 0x4, 0x4}}, ...)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/dist_sender.go:2030 +0x14a8
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc003593808, {0x8a53df8, 0xc00a849b80}, 0xc0036548c0)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/dist_sender.go:1265 +0x695
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(0xc01483b010, {0x8a53df8, 0xc00a849b80}, 0xc0036548c0)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:77 +0x1d1
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(0xc01483afc0, {0x8a53df8?, 0xc00a849b80?}, 0xc0036548c0?)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:41 +0xd0
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc01483aee0, {0x8a53df8, 0xc00a849b80}, 0xc0036548c0, 0x5)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:234 +0x1a4
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(0xc01483aee0, {0x8a53df8, 0xc00a849b80}, 0x1?)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:162 +0xb3
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(0xc01483aea0, {0x8a53df8, 0xc00a849b80}, 0xc0036548c0)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:144 +0x47c
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(0xc01483ad48, {0x8a53df8, 0xc00a849b80}, 0x0?)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:334 +0x167
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnWriteBuffer).SendLocked(0x8a33c30?, {0x8a53df8?, 0xc00a849b80?}, 0x8?)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/txn_interceptor_write_buffer.go:121 +0x17a
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(0xc01483aca8, {0x8a53df8, 0xc00a849b80}, 0xc0036548c0)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:112 +0x248
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(0xc01483abf0, {0x8a53df8, 0xc00a849b80}, 0xc0036548c0)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:265 +0x494
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(0xc01483aa08, {0x8a53b90, 0xc001ade3c0}, 0xc0036548c0)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/kvclient/kvcoord/txn_coord_sender.go:549 +0x5d1
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc00359dcc0, {0x8a53b90, 0xc001ade3c0}, 0xc0036548c0, {0x7b608eeabb68, 0xc01483aa08})
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/db.go:1149 +0xe7
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(0xc00a849ae0, {0x8a53b90, 0xc001ade3c0}, 0xc0036548c0)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/txn.go:1320 +0x265
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv.sendAndFill({0x8a53b90, 0xc001ade3c0}, 0xc02b286848, 0xc0036b8008)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/db.go:941 +0x10a
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Run(0xc00a849ae0, {0x8a53b90, 0xc001ade3c0}, 0xc0036b8008)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/txn.go:815 +0x6a
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.catalogQuery.query({{{0xd702290, 0xd7022b0, {0x1}}, {0xd702290}, 0x0}, 0x1, {0x7110aec, 0x3}}, {0x8a53b90, 0xc001ade3c0}, ...)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/sql/catalog/internal/catkv/catalog_query.go:54 +0x17a
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.catalogReader.GetByIDs({{{0xd702290, 0xd7022b0, {0x1}}, {0xd702290}, 0x0}}, {0x8a53b90?, 0xc001ade3c0?}, 0xf2d189?, {0xc010cfbdf0, 0x1, ...}, ...)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/sql/catalog/internal/catkv/catalog_reader.go:356 +0x1d3
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.(*cachedCatalogReader).GetByIDs(0xc0125b91a0, {0x8a53b90, 0xc001ade3c0}, 0xc00a849ae0, {0xc010cfbdf0, 0x1, 0x1}, 0x1, {0x7110aec, 0x3})
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/sql/catalog/internal/catkv/catalog_reader_cached.go:357 +0x203
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.mustGetDescriptorByID({0xc002009a40, {0x8a2a250, 0xc0048e4ab0}, 0xc0012708c0, 0xc003270000, {{0xd702290, 0xd7022b0, {0x1}}, {0xd702290}, 0x0}, ...}, ...)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/sql/catalog/lease/storage.go:376 +0xdc
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire.func1({0x8a53b90, 0xc001ade3c0}, 0xc00a849ae0)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/sql/catalog/lease/storage.go:172 +0x2b7
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec(0xc00a849ae0, {0x8a53b90, 0xc001ade3c0}, 0xc003719860)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/txn.go:1082 +0x2d1
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv.runTxn({0x8a53b90, 0xc001ade3c0}, 0xc00a849ae0, 0x5?)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/db.go:1074 +0x3e
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnWithAdmissionControl(0xc002009a40?, {0x8a53b90, 0xc001ade3c0}, 0x12708c0?, 0xc0?, 0x0, 0xc019a19860)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/db.go:1037 +0xa5
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn(...)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/kv/db.go:1012
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire({0xc002009a40, {0x8a2a250, 0xc0048e4ab0}, 0xc0012708c0, 0xc003270000, {{0xd702290, 0xd7022b0, {0x1}}, {0xd702290}, 0x0}, ...}, ...)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/sql/catalog/lease/storage.go:220 +0x54a
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease.func1({0x8a53b58, 0xc02bf2cdc0})
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/sql/catalog/lease/lease.go:864 +0x1e5
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall.func1({0x8a53b58?, 0xc02bf2cdc0?})
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/util/syncutil/singleflight/singleflight.go:384 +0x2d
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask(0xc0032814d0, {0x8a53b58, 0xc02bf2cdc0}, {0xc010cfbde0, 0x10}, 0xc00440cf50)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/util/stop/stopper.go:317 +0x15d
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall(0xc00205f200, {0x8a53df8?, 0xc00f47caa0?}, 0xc00fa666e0, {0x73e1d4a, 0x2}, {0xc0032814d0?, 0x4c?}, 0xc02abe9680)
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/util/syncutil/singleflight/singleflight.go:383 +0x269
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +created by github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).DoChan in goroutine 64371
I250218 21:42:17.158680 2244413 sql/user.go:201 ⋮ [T1,Vsystem,n5,client=99.213.0.162:51079,hostssl,user=‹roachprod›] 1619 +	pkg/util/syncutil/singleflight/singleflight.go:353 +0x53c

fqazi added a commit to fqazi/cockroach that referenced this issue Feb 19, 2025
Previously, the lease manager did not properly handle cases where the
session ID changes. This can happen in failover scenarios, where a new
session ID is assigned. If the descriptor version did not change, the
lease manager would not update its in-memory or on-disk state to pick up
the new session ID.
This could lead to an infinite loop inside lease acquisition. To address
this, this patch allows upserting leases with a new session ID and
the same version.

Fixes: cockroachdb#141567
Fixes: cockroachdb#141556
Fixes: cockroachdb#141555
Fixes: cockroachdb#141554
Fixes: cockroachdb#141553
Fixes: cockroachdb#141552
Fixes: cockroachdb#141549
Fixes: cockroachdb#141548
Fixes: cockroachdb#141547
Fixes: cockroachdb#141546
Fixes: cockroachdb#141545
Fixes: cockroachdb#141544
Fixes: cockroachdb#141543
Fixes: cockroachdb#141542
Fixes: cockroachdb#141541
Fixes: cockroachdb#141540
Fixes: cockroachdb#141539
Fixes: cockroachdb#141538
Fixes: cockroachdb#141484
Fixes: cockroachdb#141481
Fixes: cockroachdb#141480
Fixes: cockroachdb#141473
Fixes: cockroachdb#141467
Fixes: cockroachdb#141685
Fixes: cockroachdb#141585
Fixes: cockroachdb#141566
Fixes: cockroachdb#141513
Fixes: cockroachdb#141479

Release note: None
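
The behavior change described in the commit message above can be illustrated with a minimal sketch. This is not the actual `catalog/lease` code: the `lease` and `leaseStore` types and both `upsert` functions are hypothetical names invented for illustration, and the real lease manager also persists state on disk, which is omitted here. The sketch only assumes what the message states, namely that a lease was not rewritten when the descriptor version was unchanged, and that the fix permits an upsert with a new session ID and the same version.

```go
package main

import "fmt"

// lease is a hypothetical, simplified stand-in for a descriptor lease record.
type lease struct {
	version   int
	sessionID string
}

// leaseStore is a hypothetical in-memory map from descriptor ID to lease.
type leaseStore map[int]lease

// upsertByVersionOnly models the behavior described as buggy: the stored
// lease is rewritten only when the descriptor version changes, so a new
// session ID with the same version is silently dropped.
func (s leaseStore) upsertByVersionOnly(descID, version int, sessionID string) bool {
	if cur, ok := s[descID]; ok && cur.version == version {
		return false // the stale session ID is kept
	}
	s[descID] = lease{version: version, sessionID: sessionID}
	return true
}

// upsert models the described fix: the lease is also rewritten when the
// version is unchanged but the session ID differs.
func (s leaseStore) upsert(descID, version int, sessionID string) bool {
	if cur, ok := s[descID]; ok && cur.version == version && cur.sessionID == sessionID {
		return false // nothing to change
	}
	s[descID] = lease{version: version, sessionID: sessionID}
	return true
}

func main() {
	old := leaseStore{1: {version: 7, sessionID: "session-a"}}
	fixed := leaseStore{1: {version: 7, sessionID: "session-a"}}

	// After a failover, the node holds a new session ID but the
	// descriptor version is still 7.
	old.upsertByVersionOnly(1, 7, "session-b")
	fixed.upsert(1, 7, "session-b")

	fmt.Println(old[1].sessionID)   // session-a
	fmt.Println(fixed[1].sessionID) // session-b
}
```

In the first variant, a caller that retries until the stored session ID matches its own would never make progress, which is one plausible shape for the loop the commit message mentions.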
@cockroach-teamcity

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ 3e38c3357dc026e41cf48d192611608b34fc064a:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters


@cockroach-teamcity

roachtest.failover/non-system/pause/lease=expiration failed with artifacts on master @ 3e38c3357dc026e41cf48d192611608b34fc064a:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/failover.go:1631
	            				pkg/cmd/roachtest/tests/failover.go:863
	            				pkg/cmd/roachtest/monitor.go:115
	            				external/org_golang_x_sync/errgroup/errgroup.go:78
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight acquire-lease:1: context deadline exceeded
	Test:       	failover/non-system/pause/lease=expiration
(require.go:1357).NoError: FailNow called
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/failover/non-system/pause/lease=expiration/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=2
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=false
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana


craig bot pushed a commit that referenced this issue Feb 20, 2025
141749: catalog/lease: properly handle session ID changes r=fqazi a=fqazi

Previously, the lease manager did not properly handle cases where the session ID changes. This can happen in failover scenarios, where a new session ID is assigned. If the descriptor version did not change, the lease manager would not update the in-memory or on-disk state to pick up the new session ID.
This could lead to an infinite loop inside lease acquisition. To address this, this patch allows upserting leases with a new session ID and the same version.

Fixes: #141567
Fixes: #141556
Fixes: #141555
Fixes: #141554
Fixes: #141553
Fixes: #141552
Fixes: #141549
Fixes: #141548
Fixes: #141547
Fixes: #141546
Fixes: #141545
Fixes: #141544
Fixes: #141543
Fixes: #141542
Fixes: #141541
Fixes: #141540
Fixes: #141539
Fixes: #141538
Fixes: #141484
Fixes: #141481
Fixes: #141480
Fixes: #141473
Fixes: #141467
Fixes: #141685
Fixes: #141585
Fixes: #141566
Fixes: #141513
Fixes: #141479

Release note: None

Co-authored-by: Faizan Qazi <[email protected]>
@craig craig bot closed this as completed in #141749 Feb 20, 2025
@craig craig bot closed this as completed in aeb5a7b Feb 20, 2025
sambhav-jain-16 pushed a commit to sambhav-jain-16/cockroach that referenced this issue Mar 10, 2025