
Validate the resiliency of the ODF + OpenShift Virtualization system in case of Worker node failure #11552

Open
ayush-patni wants to merge 5 commits into master
Conversation

ayush-patni
Contributor

Ensure that both OpenShift Virtualization and ODF can recover from the failure of a worker node that hosts critical pods (such as OpenShift Virtualization VMs, OSD pods, or mon pods).

AYUSH-D-PATNI added 4 commits March 4, 2025 10:47
Signed-off-by: AYUSH-D-PATNI <[email protected]>
Signed-off-by: AYUSH-D-PATNI <[email protected]>
@ayush-patni ayush-patni requested review from a team as code owners March 4, 2025 10:02
@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines label Mar 4, 2025

openshift-ci bot commented Mar 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ayush-patni

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: apatni-cnv-test
Cluster Configuration:
PR Test Suite:
PR Test Path: tests/functional/workloads/cnv/test_vm_worker_node_fail.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: AYUSH-D-PATNI <[email protected]>

@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: apatni-cnv-test
Cluster Configuration:
PR Test Suite:
PR Test Path: tests/functional/workloads/cnv/test_vm_worker_node_fail.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master

Job PASSED.

@ayush-patni ayush-patni changed the title [WIP] Validate the resiliency of the ODF + OpenShift Virtualization system in case of Worker node failure Validate the resiliency of the ODF + OpenShift Virtualization system in case of Worker node failure Mar 6, 2025
can recover from a worker node failure that
hosts critical pods (such as OpenShift Virtualization VMs,
OSD pods, or mon pods)
"""
Contributor


Add test steps
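
A possible shape for the requested steps in the test docstring (wording is a sketch only; the numbering should track the final flow of the test):

    """
    Test to ensure that both OpenShift Virtualization and ODF can recover
    from a single worker node failure on a node hosting critical pods.

    Steps:
    1. Confirm ODF and CNV pods are running and Ceph health is OK.
    2. Create VMs on the default and aggregate storage classes and start IO.
    3. Pick a worker node hosting OSD/mon pods and at least one running VM.
    4. Stop the node through the platform node factory.
    5. Verify the VMs are rescheduled to healthy nodes and pods recover.
    6. Start the node again, verify data integrity, run fresh IO, and
       confirm Ceph health.
    """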

log.error(
f"Pods did not return to running state, attempting node restart: {e}"
)
nodes.restart_nodes(node.get_node_objs([node_name]))
Contributor


Don't we need to check the pod status again after the restart?
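
A sketch of re-checking pod status once the restart completes, reusing the helpers already used earlier in this test (wait_for_pods_to_be_running driven by TimeoutSampler); odf_namespace and cnv_namespace are the same variables used before the failure:

    nodes.restart_nodes(node.get_node_objs([node_name]))
    # Re-verify that pods settle in both namespaces after the restart
    for namespace in (odf_namespace, cnv_namespace):
        sample = TimeoutSampler(
            timeout=600,
            sleep=10,
            func=wait_for_pods_to_be_running,
            namespace=namespace,
        )
        assert sample.wait_for_func_status(
            result=True
        ), f"Pods in {namespace} did not recover after restarting node {node_name}"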

for vm_obj in vm_objs_def + vm_objs_aggr
}
log.info(f"Final VM states: {final_vm_states}")


Please add code to check data integrity after recovery
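
One possible shape for the integrity check, assuming checksum helpers along the lines of run_dd_io and cal_md5sum_vm from ocs_ci.helpers.cnv_helpers (names and signatures to be confirmed against the framework):

    # Before inducing the node failure: write a known file and record its checksum
    file_path = "/tmp/integrity_check"
    md5_before = {}
    for vm_obj in vm_objs_def + vm_objs_aggr:
        run_dd_io(vm_obj=vm_obj, file_path=file_path)
        md5_before[vm_obj.name] = cal_md5sum_vm(vm_obj=vm_obj, file_path=file_path)

    # After recovery: recompute the checksum and compare
    for vm_obj in vm_objs_def + vm_objs_aggr:
        md5_after = cal_md5sum_vm(vm_obj=vm_obj, file_path=file_path)
        assert md5_after == md5_before[vm_obj.name], (
            f"Data integrity check failed for VM {vm_obj.name} after node recovery"
        )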

f" on node {node_name}, still on the same node"
)

ceph_health_check(tries=80)
Contributor


You are already checking this at line 112, so why check it here again?

@magenta_squad
@workloads
@ignore_leftovers
@pytest.mark.polarion_id("OCS-")
Contributor


Please create a test case in Polarion and add the ID here.

(such as OpenShift Virtualization VMs, OSD pods, or mon pods)
"""

short_nw_fail_time = 300
Contributor


you are stopping and starting the node. This constant can be removed


@magenta_squad
@workloads
@ignore_leftovers
Contributor


What is the leftover here?

Comment on lines +62 to +82
sample = TimeoutSampler(
timeout=600,
sleep=10,
func=wait_for_pods_to_be_running,
namespace=odf_namespace,
)
assert sample.wait_for_func_status(
result=True
), f"Not all pods are running in {odf_namespace} before node failure"

sample = TimeoutSampler(
timeout=600,
sleep=10,
func=wait_for_pods_to_be_running,
namespace=cnv_namespace,
)
assert sample.wait_for_func_status(
result=True
), f"Not all pods are running in {cnv_namespace} before node failure"

ceph_health_check(tries=80)
Contributor


As discussed, this will be taken care of at the start of the test run by the framework, so it can be removed.

@workloads
@ignore_leftovers
@pytest.mark.polarion_id("OCS-")
class TestVmWorkerNodeResiliency(E2ETest):
Contributor


You are doing a single worker node failure; please rephrase it accordingly.

Comment on lines +90 to +91
if config.ENV_DATA["platform"].lower() == constants.GCP_PLATFORM:
nodes.restart_nodes_by_stop_and_start(node_obj, force=False)
Contributor


GCP is already handled inside the node factory, so wouldn't the restart behave according to the platform anyway?

f"VM {vm_name}: Rescheduling failed. Initially, VM is scheduled"
f" on node {node_name}, still on the same node"
)

Contributor


also write some IO after node recovery
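
A short sketch of post-recovery IO; run_ssh_cmd is assumed to be the command helper exposed by the CNV VM wrapper (adjust to whatever the VM object actually provides):

    # Assumption: vm_obj.run_ssh_cmd() runs a shell command inside the guest.
    # A small direct write confirms the ODF-backed disk is usable after recovery.
    for vm_obj in vm_objs_def + vm_objs_aggr:
        vm_obj.run_ssh_cmd(
            command="dd if=/dev/zero of=/tmp/post_recovery_io bs=1M count=100 oflag=direct"
        )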

Comment on lines +84 to +86
worker_nodes = node.get_osd_running_nodes()
node_name = random.sample(worker_nodes, 1)
node_name = node_name[0]
Contributor


How are you making sure that the randomly selected node has a VM running on it?
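
One way to make the selection deterministic: intersect the OSD-hosting nodes with the nodes the test VMs are scheduled on, then pick from that set. get_vm_node() below is a placeholder for however the VM wrapper reports the node of its VMI:

    # Sketch: choose a node that runs OSD pods *and* hosts at least one test VM
    osd_nodes = set(node.get_osd_running_nodes())
    vm_nodes = {get_vm_node(vm_obj) for vm_obj in vm_objs_def + vm_objs_aggr}  # placeholder helper
    candidate_nodes = list(osd_nodes & vm_nodes)
    assert candidate_nodes, "No worker node hosts both OSD pods and a test VM"
    node_name = random.choice(candidate_nodes)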


log = logging.getLogger(__name__)


Contributor


Teardown code is missing; please add it.
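
A minimal teardown sketch using a pytest finalizer; it reuses only helpers already present in the test (the nodes factory and ceph_health_check), and self.failed_node_objs is an assumption for whatever the test body records when it powers a node off:

    @pytest.fixture(autouse=True)
    def teardown(self, request, nodes):
        def finalizer():
            # Best effort: if the test left the node powered off, start it again
            # (self.failed_node_objs is assumed to be set by the test body)
            if getattr(self, "failed_node_objs", None):
                nodes.start_nodes(self.failed_node_objs)
            # Confirm the cluster is healthy before the next test
            ceph_health_check(tries=80)

        request.addfinalizer(finalizer)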

except ResourceWrongStatusException as e:
log.error(
f"Pods did not return to running state, attempting node restart: {e}"
)
Contributor


why do you need to restart the node again?

Labels: size/L (PR that changes 100-499 lines)
5 participants