[release-4.18] OCPBUGS-49687: Add Readiness Probe to Router Status Tests #29513
Previously, the router was configured without a readiness probe, which caused racy startup conditions during the router status stress tests. Routers were marked as ready immediately upon starting, so the waitForReadyReplicaSet function proceeded prematurely and the next step, route creation, occurred before the routers had fully initialized. This often led the first two routers to fight over the route status while the third router was still starting. As a result, the third router missed these early status contentions, which led to more writes to the route status than expected.

Adding the readiness probe also revealed that HAProxy was failing to start due to insufficient permissions. The anyuid SCC was added to the router's service account to resolve the issue.
@openshift-cherrypick-robot: Jira Issue OCPBUGS-44238 has been cloned as Jira Issue OCPBUGS-49687. Will retitle bug to link to clone. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-49687, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
This change only modifies test code, so it is low risk. Furthermore, the test is currently erroneously making component readiness red.
/label backport-risk-assessed
@Miciah: Can not set label backport-risk-assessed: Must be member in one of these teams: [openshift-staff-engineers] In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Clean cherry-pick.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Miciah, openshift-cherrypick-robot The full list of commands accepted by this bot can be found here. The pull request process is described here
It's failing E2E with:
I'll try to figure out why it worked in 4.19, but not in 4.18:
Ah, this is my own fault. I made the test update depend on the default cert being updated to SHA256: openshift/router#646. So that needs to be backported too in order for this to be merged; otherwise you get this.
I think it's easier (and cleaner) to just backport openshift/router#646 instead of altering this cherry-pick to specify a SHA256 default cert explicitly.
Depends on openshift/router#648 being merged.
/hold cancel
Wrong PR 🫤
/hold
Releasing the hold as openshift/router#648 is expected to merge soon.
/hold cancel
/retest-required
/skip
Seems like an awful lot of disruption in aws-ovn-edge-zones, though I see the same outside of this PR: #29427
Was just about to comment the same thing. I dug through the last 8 or so of these jobs, and some have this massive disruption while some have none. More to the point, some of the jobs with the disruption are on different PRs, so I'm guessing it's not related to this PR at least.
/retest-required
The e2e-aws-ovn-edge-zones test is perma-failing due to an issue with the metrics-api-new-connections service, which never comes up:
time="2025-02-11T02:19:58Z" level=error msg="disruption sample failed: error running request: 503 Service Unavailable: error trying to reach service: context deadline exceeded\n" auditID=ca08b173-35cd-4da3-9430-05d99bc7741a backend=metrics-api-new-connections this-instance="{Disruption map[backend-disruption-name:metrics-api-new-connections connection:new disruption:openshift-tests]}" type=new
It appears to have passed on 2/1 but not since. I don't see other PRs hitting this when reviewing the history. The most recent pass outside this PR looks like 2/4.
And it wasn't a problem in the 4.19 version of this PR, FWIW.
I opened a noop 4.18 PR and kicked off pull-ci-openshift-origin-release-4.18-e2e-aws-ovn-edge-zones; curious to see what the results are.
@Miciah The test of
@openshift-cherrypick-robot: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Merged commit f89d72e into openshift:release-4.18
@openshift-cherrypick-robot: Jira Issue OCPBUGS-49687: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-49687 has been moved to the MODIFIED state. In response to this:
This is an automated cherry-pick of #29395
/assign gcs278