Validate the resiliency of the ODF + OpenShift Virtualization system in case of Worker node failure #11552
base: master
Conversation
Signed-off-by: AYUSH-D-PATNI <[email protected]>
Signed-off-by: AYUSH-D-PATNI <[email protected]>
…orker-node Signed-off-by: AYUSH-D-PATNI <[email protected]>
Signed-off-by: AYUSH-D-PATNI <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ayush-patni
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
PR validation on existing cluster
Cluster Name: apatni-cnv-test
Cluster Configuration:
PR Test Suite:
PR Test Path: tests/functional/workloads/cnv/test_vm_worker_node_fail.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master
Job UNSTABLE (some or all tests failed).
Signed-off-by: AYUSH-D-PATNI <[email protected]>
PR validation on existing cluster
Cluster Name: apatni-cnv-test
Cluster Configuration:
PR Test Suite:
PR Test Path: tests/functional/workloads/cnv/test_vm_worker_node_fail.py
Additional Test Params:
OCP VERSION: 4.18
OCS VERSION: 4.18
tested against branch: master
can recover from a worker node failure that
hosts critical pods (such as OpenShift Virtualization VMs,
OSD pods, or mon pods)
"""
Add test steps
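For reference, the docstring could be expanded along these lines (the step list below is a suggestion sketched from the rest of the diff, not the author's final wording):

"""
Ensure that both OpenShift Virtualization and ODF
can recover from a worker node failure that
hosts critical pods (such as OpenShift Virtualization VMs,
OSD pods, or mon pods)

Steps:
1. Verify ODF and CNV pods are running and the cluster is healthy.
2. Pick a worker node hosting critical pods (VMs, OSD, or mon pods).
3. Stop the node and wait for VMs and pods to reschedule.
4. Start the node again and wait for all pods to return to Running.
5. Verify Ceph health, final VM states, and data integrity after recovery.
"""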
log.error(
    f"Pods did not return to running state, attempting node restart: {e}"
)
nodes.restart_nodes(node.get_node_objs([node_name]))
Don't we need to check the pod status again after the restart?
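A minimal sketch of that re-check, reusing the wait_for_pods_to_be_running/TimeoutSampler pattern already in the test (namespace variables follow the diff; treat this as a suggestion rather than the author's code):

# After the fallback restart above, re-verify that pods actually recovered.
for ns in (odf_namespace, cnv_namespace):
    sample = TimeoutSampler(
        timeout=600,
        sleep=10,
        func=wait_for_pods_to_be_running,
        namespace=ns,
    )
    assert sample.wait_for_func_status(
        result=True
    ), f"Pods in {ns} did not recover even after restarting node {node_name}"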
    for vm_obj in vm_objs_def + vm_objs_aggr
}
log.info(f"Final VM states: {final_vm_states}")
Please add code to check data integrity after recovery
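One possible shape for that check, assuming the CNV helpers run_dd_io and cal_md5sum_vm from ocs_ci.helpers.cnv_helpers are available (their exact names and signatures should be confirmed against the framework):

from ocs_ci.helpers.cnv_helpers import cal_md5sum_vm, run_dd_io  # assumed import path

file_path = "/dd_file.txt"
# Write a file and record its checksum before the node failure.
md5_before = {
    vm_obj.name: run_dd_io(vm_obj=vm_obj, file_path=file_path, verify=True)
    for vm_obj in vm_objs_def + vm_objs_aggr
}

# ... induce the node failure and wait for recovery ...

# After recovery, recompute the checksum and compare.
for vm_obj in vm_objs_def + vm_objs_aggr:
    md5_after = cal_md5sum_vm(vm_obj=vm_obj, file_path=file_path)
    assert md5_after == md5_before[vm_obj.name], (
        f"Data integrity check failed for VM {vm_obj.name} after recovery"
    )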
f" on node {node_name}, still on the same node" | ||
) | ||
|
||
ceph_health_check(tries=80) |
You are already checking this at line 112, so why check it again here?
@magenta_squad
@workloads
@ignore_leftovers
@pytest.mark.polarion_id("OCS-")
Please create a test case in Polarion and add its ID here.
(such as OpenShift Virtualization VMs, OSD pods, or mon pods)
"""

short_nw_fail_time = 300
You are stopping and starting the node, so this constant can be removed.
@magenta_squad
@workloads
@ignore_leftovers
What is the leftover here?
sample = TimeoutSampler(
    timeout=600,
    sleep=10,
    func=wait_for_pods_to_be_running,
    namespace=odf_namespace,
)
assert sample.wait_for_func_status(
    result=True
), f"Not all pods are running in {odf_namespace} before node failure"

sample = TimeoutSampler(
    timeout=600,
    sleep=10,
    func=wait_for_pods_to_be_running,
    namespace=cnv_namespace,
)
assert sample.wait_for_func_status(
    result=True
), f"Not all pods are running in {cnv_namespace} before node failure"

ceph_health_check(tries=80)
As discussed, this is taken care of at the start of the test run by the framework, so it can be removed.
@workloads
@ignore_leftovers
@pytest.mark.polarion_id("OCS-")
class TestVmWorkerNodeResiliency(E2ETest):
You are testing a single worker node failure, so please rephrase it accordingly.
if config.ENV_DATA["platform"].lower() == constants.GCP_PLATFORM:
    nodes.restart_nodes_by_stop_and_start(node_obj, force=False)
GCP is already handled by the node helpers; wouldn't the restart behave according to the platform anyway?
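If the nodes factory already dispatches by platform, the branch could presumably collapse into a single unconditional call; a rough sketch of that idea:

# Sketch of the reviewer's suggestion: let the nodes factory handle the
# platform-specific behaviour instead of special-casing GCP in the test.
nodes.restart_nodes_by_stop_and_start(node.get_node_objs([node_name]))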
f"VM {vm_name}: Rescheduling failed. Initially, VM is scheduled" | ||
f" on node {node_name}, still on the same node" | ||
) | ||
|
Also write some IO after the node recovers.
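A short sketch of post-recovery IO, again assuming the run_dd_io helper mentioned above (name and signature are assumptions to verify against the framework):

# Run fresh IO inside each VM once the node is back and pods have recovered.
for vm_obj in vm_objs_def + vm_objs_aggr:
    run_dd_io(vm_obj=vm_obj, file_path="/io_after_recovery.txt", verify=True)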
worker_nodes = node.get_osd_running_nodes()
node_name = random.sample(worker_nodes, 1)
node_name = node_name[0]
How are you making sure that the randomly selected node has a VM running on it?
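One way to guarantee that, sketched with a hypothetical get_node_of_vm() accessor standing in for however the CNV VM wrapper exposes the node a VMI is scheduled on:

# Pick a node that runs OSD pods *and* hosts at least one of the test VMs.
osd_nodes = set(node.get_osd_running_nodes())
vm_nodes = {get_node_of_vm(vm_obj) for vm_obj in vm_objs_def + vm_objs_aggr}
candidate_nodes = list(osd_nodes & vm_nodes)
assert candidate_nodes, "No worker node hosts both OSD pods and a test VM"
node_name = random.choice(candidate_nodes)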
log = logging.getLogger(__name__)
Teardown code is missing; please add it.
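A possible teardown, sketched on the assumption that the standard node helpers (get_nodes_in_statuses, restart_nodes_by_stop_and_start) and a nodes fixture are available in this test:

@pytest.fixture(autouse=True)
def teardown(self, request, nodes):
    def finalizer():
        # Bring back any node left in NotReady state and re-check Ceph health.
        not_ready = node.get_nodes_in_statuses([constants.NODE_NOT_READY])
        if not_ready:
            nodes.restart_nodes_by_stop_and_start(not_ready)
        ceph_health_check(tries=80)

    request.addfinalizer(finalizer)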
except ResourceWrongStatusException as e:
    log.error(
        f"Pods did not return to running state, attempting node restart: {e}"
    )
Why do you need to restart the node again?
Ensure that both OpenShift Virtualization and ODF can recover from a worker node failure that hosts critical pods (such as OpenShift Virtualization VMs, OSD pods, or mon pods).