
Validate the resiliency of the ODF + OpenShift Virtualization system in case of Worker node failure #11552

Open
ayush-patni wants to merge 5 commits into master
Conversation

ayush-patni
Contributor

Ensure that both OpenShift Virtualization and ODF can recover from the failure of a worker node that hosts critical pods (such as OpenShift Virtualization VMs, OSD pods, or mon pods).

AYUSH-D-PATNI added 4 commits March 4, 2025 10:47
Signed-off-by: AYUSH-D-PATNI <[email protected]>
Signed-off-by: AYUSH-D-PATNI <[email protected]>
@ayush-patni ayush-patni requested review from a team as code owners March 4, 2025 10:02
@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines label Mar 4, 2025

openshift-ci bot commented Mar 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ayush-patni

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: apatni-cnv-test
Cluster Configuration:
PR Test Suite:
PR Test Path: tests/functional/workloads/cnv/test_vm_worker_node_fail.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: AYUSH-D-PATNI <[email protected]>

@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: apatni-cnv-test
Cluster Configuration:
PR Test Suite:
PR Test Path: tests/functional/workloads/cnv/test_vm_worker_node_fail.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master

Job PASSED.

@ayush-patni ayush-patni changed the title [WIP] Validate the resiliency of the ODF + OpenShift Virtualization system in case of Worker node failure Validate the resiliency of the ODF + OpenShift Virtualization system in case of Worker node failure Mar 6, 2025
can recover from a worker node failure that
hosts critical pods (such as OpenShift Virtualization VMs,
OSD pods, or mon pods)
"""
Contributor


Add test steps
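
A possible shape for the requested steps in the test docstring (wording is a sketch only; the numbering should track the final flow of the test):

    """
    Test to ensure that both OpenShift Virtualization and ODF can recover
    from a single worker node failure on a node hosting critical pods.

    Steps:
    1. Confirm ODF and CNV pods are running and Ceph health is OK.
    2. Create VMs on the default and aggregate storage classes and start IO.
    3. Pick a worker node hosting OSD/mon pods and at least one running VM.
    4. Stop the node through the platform node factory.
    5. Verify the VMs are rescheduled to healthy nodes and pods recover.
    6. Start the node again, verify data integrity, run fresh IO, and
       confirm Ceph health.
    """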

log.error(
f"Pods did not return to running state, attempting node restart: {e}"
)
nodes.restart_nodes(node.get_node_objs([node_name]))
Contributor


Don't we need to check the pod status again after the restart?
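
A sketch of re-checking pod status once the restart completes, reusing the helpers already used earlier in this test (wait_for_pods_to_be_running driven by TimeoutSampler); odf_namespace and cnv_namespace are the same variables used before the failure:

    nodes.restart_nodes(node.get_node_objs([node_name]))
    # Re-verify that pods settle in both namespaces after the restart
    for namespace in (odf_namespace, cnv_namespace):
        sample = TimeoutSampler(
            timeout=600,
            sleep=10,
            func=wait_for_pods_to_be_running,
            namespace=namespace,
        )
        assert sample.wait_for_func_status(
            result=True
        ), f"Pods in {namespace} did not recover after restarting node {node_name}"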

for vm_obj in vm_objs_def + vm_objs_aggr
}
log.info(f"Final VM states: {final_vm_states}")


Please add code to check data integrity after recovery
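
One possible shape for the integrity check, assuming checksum helpers along the lines of run_dd_io and cal_md5sum_vm from ocs_ci.helpers.cnv_helpers (names and signatures to be confirmed against the framework):

    # Before inducing the node failure: write a known file and record its checksum
    file_path = "/tmp/integrity_check"
    md5_before = {}
    for vm_obj in vm_objs_def + vm_objs_aggr:
        run_dd_io(vm_obj=vm_obj, file_path=file_path)
        md5_before[vm_obj.name] = cal_md5sum_vm(vm_obj=vm_obj, file_path=file_path)

    # After recovery: recompute the checksum and compare
    for vm_obj in vm_objs_def + vm_objs_aggr:
        md5_after = cal_md5sum_vm(vm_obj=vm_obj, file_path=file_path)
        assert md5_after == md5_before[vm_obj.name], (
            f"Data integrity check failed for VM {vm_obj.name} after node recovery"
        )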

f" on node {node_name}, still on the same node"
)

ceph_health_check(tries=80)
Contributor


You are already checking this at line 112, so why check it here again?

@magenta_squad
@workloads
@ignore_leftovers
@pytest.mark.polarion_id("OCS-")
Contributor


Please create a test case in Polarion and add the ID here.

(such as OpenShift Virtualization VMs, OSD pods, or mon pods)
"""

short_nw_fail_time = 300
Contributor


you are stopping and starting the node. This constant can be removed


@magenta_squad
@workloads
@ignore_leftovers
Contributor


What is the leftover here?

Comment on lines +62 to +82
sample = TimeoutSampler(
timeout=600,
sleep=10,
func=wait_for_pods_to_be_running,
namespace=odf_namespace,
)
assert sample.wait_for_func_status(
result=True
), f"Not all pods are running in {odf_namespace} before node failure"

sample = TimeoutSampler(
timeout=600,
sleep=10,
func=wait_for_pods_to_be_running,
namespace=cnv_namespace,
)
assert sample.wait_for_func_status(
result=True
), f"Not all pods are running in {cnv_namespace} before node failure"

ceph_health_check(tries=80)
Contributor


As discussed, this will be taken care of at the start of the test run by the framework, so it can be removed.

@workloads
@ignore_leftovers
@pytest.mark.polarion_id("OCS-")
class TestVmWorkerNodeResiliency(E2ETest):
Contributor


You are doing a single worker node failure; please rephrase it accordingly.

Comment on lines +90 to +91
if config.ENV_DATA["platform"].lower() == constants.GCP_PLATFORM:
nodes.restart_nodes_by_stop_and_start(node_obj, force=False)
Contributor


GCP is already handled inside the node factory, so wouldn't the restart behave according to the platform anyway?

f"VM {vm_name}: Rescheduling failed. Initially, VM is scheduled"
f" on node {node_name}, still on the same node"
)

Contributor


also write some IO after node recovery
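
A short sketch of post-recovery IO; run_ssh_cmd is assumed to be the command helper exposed by the CNV VM wrapper (adjust to whatever the VM object actually provides):

    # Assumption: vm_obj.run_ssh_cmd() runs a shell command inside the guest.
    # A small direct write confirms the ODF-backed disk is usable after recovery.
    for vm_obj in vm_objs_def + vm_objs_aggr:
        vm_obj.run_ssh_cmd(
            command="dd if=/dev/zero of=/tmp/post_recovery_io bs=1M count=100 oflag=direct"
        )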

Comment on lines +84 to +86
worker_nodes = node.get_osd_running_nodes()
node_name = random.sample(worker_nodes, 1)
node_name = node_name[0]
Contributor


How are you making sure that the randomly selected node has a VM running on it?
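
One way to make the selection deterministic: intersect the OSD-hosting nodes with the nodes the test VMs are scheduled on, then pick from that set. get_vm_node() below is a placeholder for however the VM wrapper reports the node of its VMI:

    # Sketch: choose a node that runs OSD pods *and* hosts at least one test VM
    osd_nodes = set(node.get_osd_running_nodes())
    vm_nodes = {get_vm_node(vm_obj) for vm_obj in vm_objs_def + vm_objs_aggr}  # placeholder helper
    candidate_nodes = list(osd_nodes & vm_nodes)
    assert candidate_nodes, "No worker node hosts both OSD pods and a test VM"
    node_name = random.choice(candidate_nodes)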


log = logging.getLogger(__name__)


Contributor


Teardown code is missing; please add it.
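
A minimal teardown sketch using a pytest finalizer; it reuses only helpers already present in the test (the nodes factory and ceph_health_check), and self.failed_node_objs is an assumption for whatever the test body records when it powers a node off:

    @pytest.fixture(autouse=True)
    def teardown(self, request, nodes):
        def finalizer():
            # Best effort: if the test left the node powered off, start it again
            # (self.failed_node_objs is assumed to be set by the test body)
            if getattr(self, "failed_node_objs", None):
                nodes.start_nodes(self.failed_node_objs)
            # Confirm the cluster is healthy before the next test
            ceph_health_check(tries=80)

        request.addfinalizer(finalizer)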

except ResourceWrongStatusException as e:
log.error(
f"Pods did not return to running state, attempting node restart: {e}"
)
Contributor


why do you need to restart the node again?

Labels: size/L (PR that changes 100-499 lines)
5 participants