Validate the resiliency of the ODF + OpenShift Virtualization system in case of Worker node failure #11552

Open · wants to merge 5 commits into base: master
9 changes: 9 additions & 0 deletions ocs_ci/ocs/cnv/virtual_machine.py
@@ -695,6 +695,15 @@ def delete(self):
if self.ns_obj:
self.ns_obj.delete_project(project_name=self.namespace)

def get_vmi_instance(self):
"""
Get the VMI instance of the VM

Returns:
VMI object: the VMI instance associated with this VM
"""
return self.vmi_obj


class VMCloner(VirtualMachine):
"""
152 changes: 152 additions & 0 deletions tests/functional/workloads/cnv/test_vm_worker_node_fail.py
@@ -0,0 +1,152 @@
import logging
import random

import pytest

from ocs_ci.framework import config
from ocs_ci.framework.pytest_customization.marks import (
magenta_squad,
workloads,
ignore_leftovers,
)
from ocs_ci.framework.testlib import E2ETest
from ocs_ci.ocs import constants, node
from ocs_ci.ocs.resources import pod
from ocs_ci.ocs.resources.pod import wait_for_pods_to_be_running
from ocs_ci.utility.utils import TimeoutSampler, ceph_health_check
from ocs_ci.ocs.exceptions import ResourceWrongStatusException

log = logging.getLogger(__name__)


Contributor: teardown code is missing, please add
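One way the requested teardown could look. This is only a sketch reusing helpers the test already imports; get_nodes_in_statuses and start_nodes are assumptions based on similar teardowns elsewhere in ocs-ci, not part of this PR:

    @pytest.fixture(autouse=True)
    def teardown(self, request, nodes):
        def finalizer():
            # Best-effort recovery: bring back any worker that is still down
            # and confirm cluster health, even if the test failed midway.
            # get_nodes_in_statuses / start_nodes are assumed helper names.
            not_ready = node.get_nodes_in_statuses([constants.NODE_NOT_READY])
            if not_ready:
                nodes.start_nodes(not_ready)
                node.wait_for_nodes_status(
                    node_names=[n.name for n in not_ready],
                    status=constants.NODE_READY,
                )
            ceph_health_check(tries=80)

        request.addfinalizer(finalizer)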

@magenta_squad
@workloads
@ignore_leftovers
Contributor: what is the leftover here?

@pytest.mark.polarion_id("OCS-")
Contributor: Please create a test case in Polarion and add the ID here

class TestVmWorkerNodeResiliency(E2ETest):
Contributor: you are doing a single worker node failure, please rephrase it accordingly

"""
Test case for ensuring that both OpenShift Virtualization
and ODF can recover from a worker node failure that hosts critical pods
(such as OpenShift Virtualization VMs, OSD pods, or mon pods)
"""

short_nw_fail_time = 300
Contributor: you are stopping and starting the node. This constant can be removed


def test_vm_worker_node_failure(
self, setup_cnv, nodes, project_factory, multi_cnv_workload
):
"""
Test case to ensure that both OpenShift Virtualization and ODF
can recover from a worker node failure that
hosts critical pods (such as OpenShift Virtualization VMs,
OSD pods, or mon pods)
"""
Contributor: Add test steps
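A possible set of steps for the docstring, derived from what the test below already does (a sketch, not the author's wording):

    Steps:
    1. Deploy CNV VM workloads on both default and aggregate storage classes
    2. Record the initial state and hosting node of each VM
    3. Verify ODF and CNV pods are running and Ceph is healthy
    4. Stop and start one worker node that runs OSD pods
    5. Wait for the node to return to Ready and for all pods to recover
    6. Verify Ceph health, ODF/CNV pod status, and that VMs from the
       failed node were rescheduled while keeping their previous state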


odf_namespace = constants.OPENSHIFT_STORAGE_NAMESPACE
cnv_namespace = constants.CNV_NAMESPACE

proj_obj = project_factory()
vm_objs_def, vm_objs_aggr, sc_objs_def, sc_objs_aggr = multi_cnv_workload(
namespace=proj_obj.namespace
)
vm_list = vm_objs_def + vm_objs_aggr

log.info(f"Total VMs to process: {len(vm_list)}")

initial_vm_states = {
vm_obj.name: [vm_obj.printableStatus(), vm_obj.get_vmi_instance().node()]
for vm_obj in vm_objs_def + vm_objs_aggr
}
log.info(f"Initial VM states: {initial_vm_states}")

sample = TimeoutSampler(
timeout=600,
sleep=10,
func=wait_for_pods_to_be_running,
namespace=odf_namespace,
)
assert sample.wait_for_func_status(
result=True
), f"Not all pods are running in {odf_namespace} before node failure"

sample = TimeoutSampler(
timeout=600,
sleep=10,
func=wait_for_pods_to_be_running,
namespace=cnv_namespace,
)
assert sample.wait_for_func_status(
result=True
), f"Not all pods are running in {cnv_namespace} before node failure"

ceph_health_check(tries=80)
Contributor (on lines +62 to +82): as discussed, this will be taken care of at the start of the test run by the framework. It can be removed


worker_nodes = node.get_osd_running_nodes()
node_name = random.sample(worker_nodes, 1)
node_name = node_name[0]
Contributor (on lines +84 to +86): how are you making sure that the randomly selected node has a VM running on it?
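One way to address this, as a sketch that only reuses objects already defined in the test (the intersection logic is an illustration, not the author's code):

    # Restrict the random choice to OSD nodes that currently host a VM,
    # using the node recorded for each VMI in initial_vm_states.
    vm_nodes = {state[1] for state in initial_vm_states.values()}
    candidate_nodes = [n for n in node.get_osd_running_nodes() if n in vm_nodes]
    assert candidate_nodes, "No OSD-running worker node is hosting a VM"
    node_name = random.choice(candidate_nodes)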


log.info(f"Attempting to restart node: {node_name}")
node_obj = node.get_node_objs([node_name])
if config.ENV_DATA["platform"].lower() == constants.GCP_PLATFORM:
nodes.restart_nodes_by_stop_and_start(node_obj, force=False)
Contributor (on lines +90 to +91): GCP is already handled; wouldn't the restart run according to the platform anyway?

else:
nodes.restart_nodes_by_stop_and_start(node_obj)

log.info(f"Waiting for node {node_name} to return to Ready state")
try:
node.wait_for_nodes_status(
node_names=[node_name],
status=constants.NODE_READY,
)
log.info("Verifying all pods are running after node recovery")
if not pod.wait_for_pods_to_be_running(timeout=720):
raise ResourceWrongStatusException(
"Not all pods returned to running state after node recovery"
)
except ResourceWrongStatusException as e:
log.error(
f"Pods did not return to running state, attempting node restart: {e}"
)
Contributor: why do you need to restart the node again?

nodes.restart_nodes(node.get_node_objs([node_name]))
Contributor: Don't we need to check the pod status again after the restart?
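A minimal sketch of such a re-check after the extra restart, using the same helpers the test already imports:

    # After nodes.restart_nodes(...) above, wait for Ready again and
    # re-verify the pods before moving on.
    node.wait_for_nodes_status(node_names=[node_name], status=constants.NODE_READY)
    if not pod.wait_for_pods_to_be_running(timeout=720):
        raise ResourceWrongStatusException(
            "Pods did not return to Running even after the additional restart"
        )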


ceph_health_check(tries=80)

log.info("Performing post-failure health checks for ODF and CNV namespaces")
sample = TimeoutSampler(
timeout=600,
sleep=10,
func=wait_for_pods_to_be_running,
namespace=odf_namespace,
)
assert sample.wait_for_func_status(
result=True
), f"Not all pods are running in {odf_namespace} after node failure and recovery"

sample = TimeoutSampler(
timeout=600,
sleep=10,
func=wait_for_pods_to_be_running,
namespace=cnv_namespace,
)
assert sample.wait_for_func_status(
result=True
), f"Not all pods are running in {cnv_namespace} after node failure and recovery"

final_vm_states = {
vm_obj.name: [vm_obj.printableStatus(), vm_obj.get_vmi_instance().node()]
for vm_obj in vm_objs_def + vm_objs_aggr
}
log.info(f"Final VM states: {final_vm_states}")

Contributor: Please add code to check data integrity after recovery
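A rough sketch of such a check. It assumes the run_dd_io / cal_md5sum_vm helpers from ocs_ci.helpers.cnv_helpers that other CNV tests use (names and signatures are an assumption, not part of this PR): write a file with a known checksum before stopping the node and compare the checksum after recovery.

    from ocs_ci.helpers.cnv_helpers import cal_md5sum_vm, run_dd_io  # assumed helpers

    file_path = "/file_1.txt"
    # Before the node is stopped: write data inside each VM and record its md5sum
    md5_before = {
        vm_obj.name: run_dd_io(vm_obj=vm_obj, file_path=file_path, verify=True)
        for vm_obj in vm_list
    }

    # After recovery: the checksum of the same file should be unchanged
    for vm_obj in vm_list:
        md5_after = cal_md5sum_vm(vm_obj=vm_obj, file_path=file_path)
        assert md5_before[vm_obj.name] == md5_after, (
            f"Data integrity check failed for VM {vm_obj.name}"
        )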

for vm_name in initial_vm_states:
assert initial_vm_states[vm_name][0] == final_vm_states[vm_name][0], (
f"VM {vm_name}: State mismatch. Initial: {initial_vm_states[vm_name][0]}, "
f"Final: {final_vm_states[vm_name][0]}"
)
if initial_vm_states[vm_name][1] == node_name:
assert initial_vm_states[vm_name][1] != final_vm_states[vm_name][1], (
f"VM {vm_name}: Rescheduling failed. Initially, VM is scheduled"
f" on node {node_name}, still on the same node"
)

Contributor: also write some IO after node recovery
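A short sketch of post-recovery IO, again assuming the run_dd_io helper mentioned above:

    # Exercise the storage path after recovery by writing fresh data in each VM
    for vm_obj in vm_list:
        run_dd_io(vm_obj=vm_obj, file_path="/file_2.txt", verify=True)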

ceph_health_check(tries=80)
Contributor: You are checking this at line 112, so why are you checking it again here?