roachtest: multitenant-upgrade failed #140507

Closed
cockroach-teamcity opened this issue Feb 5, 2025 · 5 comments · Fixed by #143055
Labels: B-runtime-assertions-enabled, branch-master (Failures and bugs on the master branch), branch-release-25.1, C-test-failure (Broken test, automatically or manually discovered), O-roachtest, O-robot (Originated from a bot), P-2 (Issues/test failures with a fix SLA of 3 months), T-db-server, target-release-25.1.4, v25.2.0-prerelease

Comments

cockroach-teamcity commented Feb 5, 2025

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.multitenant-upgrade failed with artifacts on master @ 2e7d5a6d0194411d0f91fe4f3eba23fcea63ccac:

(mixedversion.go:804).Run: mixed-version test failure while running step 8 (run "run workload on tenants"): full command output in run_130154.988668740_n5_v24111cockroach-work.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/multitenant-upgrade/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=system-only
  • mvtVersions=v24.1.11 → v24.2.9 → v24.3.4 → master
  • runtimeAssertionsBuild=true
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-47197

@cockroach-teamcity added the B-runtime-assertions-enabled, branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker (Indicates a release-blocker; use with a branch-release-2x.x label to denote which branch is blocked) labels on Feb 5, 2025
cockroach-teamcity commented:

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.multitenant-upgrade failed with artifacts on master @ 67a82bfc09766413d7f8ae8edb10035940b94c7e:

(mixedversion.go:804).Run: mixed-version test failure while running step 2 (start cluster at version "v24.1.12"): COMMAND_PROBLEM: exit status 1 [owner=test-eng]
test artifacts and logs in: /artifacts/multitenant-upgrade/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=true
  • mvtDeploymentMode=system-only
  • mvtVersions=v24.1.12 → v24.3.5 → v25.1.0-rc.1 → master
  • runtimeAssertionsBuild=true
  • ssd=0

cockroach-teamcity commented:

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.multitenant-upgrade failed with artifacts on master @ 34e34c2d5b9598441b935f03a794277e9329a256:

(mixedversion.go:804).Run: mixed-version test failure while running step 1 (start cluster at version "v24.2.10"): COMMAND_PROBLEM: exit status 1 [owner=test-eng]
test artifacts and logs in: /artifacts/multitenant-upgrade/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=true
  • mvtDeploymentMode=system-only
  • mvtVersions=v24.2.10 → v24.3.5 → v25.1.0-rc.1 → master
  • runtimeAssertionsBuild=true
  • ssd=0

cockroach-teamcity commented:

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.multitenant-upgrade failed with artifacts on master @ 04e40fe28382799df6e009574e94770d42d7afd4:

(mixedversion.go:804).Run: mixed-version test failure while running step 2 (start cluster at version "v24.3.5"): COMMAND_PROBLEM: exit status 1 [owner=test-eng]
test artifacts and logs in: /artifacts/multitenant-upgrade/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=true
  • mvtDeploymentMode=system-only
  • mvtVersions=v24.3.5 → v25.1.0-rc.1 → master
  • runtimeAssertionsBuild=true
  • ssd=0

cockroach-teamcity commented:

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.multitenant-upgrade failed with artifacts on master @ f3193fa36f11d583dd7e7ba505a89e7e309996aa:

(mixedversion.go:804).Run: mixed-version test failure while running step 1 (start cluster at version "v24.1.12"): COMMAND_PROBLEM: exit status 1 [owner=test-eng]
test artifacts and logs in: /artifacts/multitenant-upgrade/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=true
  • mvtDeploymentMode=system-only
  • mvtVersions=v24.1.12 → v24.3.5 → v25.1.0-rc.1 → master
  • runtimeAssertionsBuild=true
  • ssd=0

@shubhamdhama removed the release-blocker label on Mar 18, 2025
shubhamdhama added a commit to shubhamdhama/cockroach that referenced this issue Mar 18, 2025
Summary: In multitenant upgrade tests, the TPC-C workload may fail if the
required binary is missing on a node. This issue can occur when no tenant
is created on nodes with the previous binary version, and the workload
attempts to run using that binary for compatibility.

A sample excerpt from the upgrade plan illustrates the process:
```
├── start cluster at version "v23.2.20" (1)
├── wait for all nodes (:1-4) to acknowledge cluster version '23.2' on system tenant (2)
├── set cluster setting "storage.ingest_split.enabled" to 'false' on system tenant (3)
├── run "maybe create some tenants" (4)
├── upgrade cluster from "v23.2.20" to "v24.1.13"
│   ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (5)
│   ├── upgrade nodes :1-4 from "v23.2.20" to "v24.1.13"
│   │   ├── restart node 2 with binary version v24.1.13 (6)
│   │   ├── restart node 1 with binary version v24.1.13 (7)
│   │   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (8)
│   │   ├── restart node 3 with binary version v24.1.13 (9)
│   │   ├── restart node 4 with binary version v24.1.13 (10)
│   │   └── run "run workload on tenants" (11)
│   ├── run "run workload on tenants" (12)
```

Once all the nodes are upgraded (step 10), we enter the finalizing phase in
step 11. Our cluster configuration would then look like this:

```
[mixed-version-test/11_run-run-workload-on-tenants] 2025/03/13 10:47:21 runner.go:423: current cluster configuration:
                      n1           n2           n3           n4
released versions     v24.1.13     v24.1.13     v24.1.13     v24.1.13
binary versions       24.1         24.1         24.1         24.1
cluster versions      24.1         24.1         24.1         24.1
```

This implies that our tenant would also start with the target version as we
finalize (see cockroachdb#138233). Then we run the TPC-C workload on tenant nodes
using the version we are migrating from—likely for compatibility reasons.
However, the required binary may be absent if, during step 4, we did not
create any tenants with the previous version due to probabilistic
selection. The fix is simple: upload the binary used to run TPC-C. The
process first checks whether the binary is already present, so no extra
performance overhead occurs if it is.

Fixes: cockroachdb#140507
Informs: cockroachdb#142807
Release note: None
Epic: None
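
As a rough illustration of the check-then-upload approach the commit message describes, here is a minimal Go sketch. The names `ensureWorkloadBinary` and `uploadBinary`, the path layout, and the hard-coded version are hypothetical stand-ins, not the actual roachtest helpers used in the fix.

```
package main

import (
	"fmt"
	"os"
)

// uploadBinary is a hypothetical stand-in for the roachtest helper that
// copies the versioned cockroach binary onto a node.
func uploadBinary(version, path string) error {
	fmt.Printf("uploading cockroach %s to %s\n", version, path)
	return nil
}

// ensureWorkloadBinary mirrors the idea behind the fix: check whether the
// binary the workload needs is already present, and upload it only when it
// is not, so the common case adds no extra work.
func ensureWorkloadBinary(version, path string) error {
	if _, err := os.Stat(path); err == nil {
		return nil // binary already present; nothing to do
	}
	return uploadBinary(version, path)
}

func main() {
	if err := ensureWorkloadBinary("v24.1.11", "./cockroach-v24.1.11"); err != nil {
		fmt.Println("upload failed:", err)
	}
}
```

The point of the up-front existence check is that the upload becomes a no-op in runs where tenant creation already staged the binary, matching the "no extra performance overhead" claim above.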
@shubhamdhama added the P-2 (Issues/test failures with a fix SLA of 3 months) label on Mar 19, 2025
craig bot pushed a commit that referenced this issue Mar 20, 2025
143055: roachtest: fix missing binary for TPC-C in multitenant upgrade test r=rimadeodhar a=shubhamdhama

Co-authored-by: Shubham Dhama <[email protected]>
craig bot closed this as completed in 424437d on Mar 20, 2025

blathers-crl bot commented Mar 20, 2025

Based on the specified backports for linked PR #143055, I applied the following new label(s) to this issue: branch-release-25.1. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl bot pushed a commit that referenced this issue Mar 20, 2025