
OCPBUGS-44238: Add Readiness Probe to Router Status Tests #29395

Conversation


@gcs278 gcs278 commented Jan 3, 2025

Previously, the router was configured without a readiness probe, resulting in racy startup conditions during router status stress tests. Routers would be marked as ready immediately upon starting, causing the waitForReadyReplicaSet function to proceed prematurely. This allowed the next step of route creation to occur before the routers had fully initialized.

This often led the first two routers to fight over the route status while the third router was still starting. As a result, the third router missed observing these early status contentions, leading to more writes to the route status than expected.

Adding the readiness probe also revealed that HAProxy was failing to start due to insufficient permissions. The anyuid SCC was added to the router's service account to resolve the issue.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jan 3, 2025
@openshift-ci-robot

@gcs278: This pull request references Jira Issue OCPBUGS-44238, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jan 3, 2025
@openshift-ci openshift-ci bot requested review from frobware and knobunc January 3, 2025 01:58
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 3, 2025

gcs278 commented Jan 3, 2025

Still need some debugging.
/wip

@gcs278 gcs278 changed the title OCPBUGS-44238: Add Readiness Probe to Router Status Tests [WIP] OCPBUGS-44238: Add Readiness Probe to Router Status Tests Jan 3, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 3, 2025

gcs278 commented Jan 3, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 3, 2025
@openshift-ci-robot

@gcs278: This pull request references Jira Issue OCPBUGS-44238, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


@gcs278 gcs278 force-pushed the OCPBUGS-44238-status-readiness-probe branch from 671ecce to 98c3239 on January 3, 2025 18:27
@openshift-ci-robot

@gcs278: This pull request references Jira Issue OCPBUGS-44238, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:


WIP: I'd like to merge openshift/router#646 and remove the DEFAULT_CERTIFICATE override before merging this PR.



openshift-trt bot commented Jan 3, 2025

Job Failure Risk Analysis for sha: 98c3239

Job: pull-ci-openshift-origin-master-e2e-aws-ovn-single-node (Failure Risk: Medium)
[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router converges when multiple routers are writing conflicting status [Suite:openshift/conformance/parallel]
This test has passed 93.80% of 129 runs on jobs [periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node] in the last 14 days.

Open Bugs
[CI] Investigate: The HAProxy router converges when multiple routers are writing conflicting status

@gcs278 gcs278 force-pushed the OCPBUGS-44238-status-readiness-probe branch 2 times, most recently from 902c029 to c5fadf0 Compare January 6, 2025 18:38

gcs278 commented Jan 6, 2025

Unrelated
/test e2e-awn-ovn-fips


openshift-ci bot commented Jan 6, 2025

@gcs278: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws-jenkins
/test e2e-aws-ovn-edge-zones
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-image-registry
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial
/test e2e-gcp-ovn
/test e2e-gcp-ovn-builds
/test e2e-gcp-ovn-image-ecosystem
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test images
/test lint
/test unit
/test verify
/test verify-deps

The following commands are available to trigger optional jobs:

/test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
/test e2e-agnostic-ovn-cmd
/test e2e-aws
/test e2e-aws-csi
/test e2e-aws-disruptive
/test e2e-aws-etcd-certrotation
/test e2e-aws-etcd-recovery
/test e2e-aws-ovn
/test e2e-aws-ovn-cgroupsv2
/test e2e-aws-ovn-etcd-scaling
/test e2e-aws-ovn-ipsec-serial
/test e2e-aws-ovn-kube-apiserver-rollout
/test e2e-aws-ovn-kubevirt
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-single-node-serial
/test e2e-aws-ovn-single-node-techpreview
/test e2e-aws-ovn-single-node-techpreview-serial
/test e2e-aws-ovn-single-node-upgrade
/test e2e-aws-ovn-upgrade
/test e2e-aws-ovn-upgrade-rollback
/test e2e-aws-ovn-upi
/test e2e-aws-ovn-virt-techpreview
/test e2e-aws-proxy
/test e2e-azure
/test e2e-azure-ovn-etcd-scaling
/test e2e-azure-ovn-upgrade
/test e2e-baremetalds-kubevirt
/test e2e-external-aws
/test e2e-external-aws-ccm
/test e2e-external-vsphere-ccm
/test e2e-gcp-csi
/test e2e-gcp-disruptive
/test e2e-gcp-fips-serial
/test e2e-gcp-ovn-etcd-scaling
/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-ovn-techpreview
/test e2e-gcp-ovn-techpreview-serial
/test e2e-hypershift-conformance
/test e2e-metal-ipi-ovn
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-dualstack-local-gateway
/test e2e-metal-ipi-ovn-kube-apiserver-rollout
/test e2e-metal-ipi-serial
/test e2e-metal-ipi-serial-ovn-ipv6
/test e2e-metal-ipi-virtualmedia
/test e2e-metal-ovn-single-node-live-iso
/test e2e-metal-ovn-single-node-with-worker-live-iso
/test e2e-openstack-ovn
/test e2e-openstack-serial
/test e2e-vsphere
/test e2e-vsphere-ovn-dualstack-primaryv6
/test e2e-vsphere-ovn-etcd-scaling
/test okd-e2e-gcp
/test okd-scos-e2e-aws-ovn
/test okd-scos-images

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd
pull-ci-openshift-origin-master-e2e-aws-csi
pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2
pull-ci-openshift-origin-master-e2e-aws-ovn-edge-zones
pull-ci-openshift-origin-master-e2e-aws-ovn-fips
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift-serial
pull-ci-openshift-origin-master-e2e-aws-ovn-serial
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
pull-ci-openshift-origin-master-e2e-gcp-csi
pull-ci-openshift-origin-master-e2e-gcp-ovn
pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
pull-ci-openshift-origin-master-e2e-hypershift-conformance
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout
pull-ci-openshift-origin-master-e2e-openstack-ovn
pull-ci-openshift-origin-master-images
pull-ci-openshift-origin-master-lint
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-origin-master-unit
pull-ci-openshift-origin-master-verify
pull-ci-openshift-origin-master-verify-deps

In response to this:

Unrelated
/test e2e-awn-ovn-fips



openshift-trt bot commented Jan 6, 2025

Job Failure Risk Analysis for sha: c5fadf0

Job: pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn (Failure Risk: High)
[sig-arch] Only known images used by tests
This test has passed 100.00% of 95 runs on jobs [periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn] in the last 14 days.


candita commented Jan 8, 2025

/assign
/assign @Miciah


candita commented Jan 13, 2025

/test unit

@gcs278 gcs278 force-pushed the OCPBUGS-44238-status-readiness-probe branch from c5fadf0 to 4c53b85 on January 18, 2025 02:08

openshift-trt bot commented Jan 18, 2025

Job Failure Risk Analysis for sha: 4c53b85

Job: pull-ci-openshift-origin-master-e2e-aws-ovn-serial (Failure Risk: Medium)
[sig-imageregistry][Serial] Image signature workflow can push a signed image to openshift registry and verify it [apigroup:user.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/serial]
This test has passed 97.10% of 241 runs on release 4.19 [Overall] in the last week.

@gcs278 gcs278 changed the title [WIP] OCPBUGS-44238: Add Readiness Probe to Router Status Tests OCPBUGS-44238: Add Readiness Probe to Router Status Tests Jan 20, 2025

candita commented Jan 24, 2025

I'm not aware of any issues with this PR, but I'll defer to @alebedev87.
/unassign

@gcs278 gcs278 force-pushed the OCPBUGS-44238-status-readiness-probe branch 2 times, most recently from 3887d68 to c9b5c94 on January 25, 2025 01:00

gcs278 commented Jan 25, 2025

/assign @alebedev87

Ready for review @alebedev87


openshift-trt bot commented Jan 25, 2025

Job Failure Risk Analysis for sha: c9b5c94

Job: pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout (Failure Risk: Low)
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 14.29% of 7 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

@gcs278 gcs278 force-pushed the OCPBUGS-44238-status-readiness-probe branch 2 times, most recently from 226d0c7 to 7ebf2cc on January 27, 2025 15:10
test/extended/router/stress.go (outdated, resolved)
Comment on lines +612 to +616
SecurityContext: &corev1.SecurityContext{
	// Default is true, but explicitly specified here for clarity.
	AllowPrivilegeEscalation: ptr.To[bool](true),
},
This is not needed as I explained here.

@gcs278 gcs278 Jan 27, 2025
Sorry, still working on a response to your other comment; doing some experimenting.

But we do need it; if I set it to false I get:

sh-5.1$ ../reload-haproxy 
[NOTICE]   (20) : haproxy version is 2.8.10-f28885f
[NOTICE]   (20) : path to executable is /usr/sbin/haproxy
[ALERT]    (20) : Binding [/var/lib/haproxy/conf/haproxy.config:61] for frontend public: cannot bind socket (Permission denied) for [0.0.0.0:80]
[ALERT]    (20) : Binding [/var/lib/haproxy/conf/haproxy.config:91] for frontend public_ssl: cannot bind socket (Permission denied) for [0.0.0.0:443]

The production router deployment adds this for the same reason too, right?

I can fix this with:

	SecurityContext: &corev1.PodSecurityContext{
		Sysctls: []corev1.Sysctl{
			{
				Name:  "net.ipv4.ip_unprivileged_port_start",
				Value: "80", // allow unprivileged processes to bind to ports >= 80
			},
		},
	},

So this isn't a NET_BIND_SERVICE issue (it is enabled with the restricted SCC), but ip_unprivileged_port_start is still needed to allow an unprivileged user to bind to ports below 1024.

@gcs278 gcs278 Jan 27, 2025
Actually, maybe there is something I'm still not understanding about NET_BIND_SERVICE; I do think it should provide the ability to bind to port 80 without ip_unprivileged_port_start: 80. It's definitely enabled for restricted without explicitly providing NET_BIND_SERVICE in the pod spec:

capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_setpcap,cap_net_bind_service

EDIT: Here's my full example:

[gspence@gspence-thinkpadp1gen3 origin]$ oc get pods -n e2e-test-router-stress-qhv52 
NAME           READY   STATUS    RESTARTS   AGE
router-b27hs   0/1     Running   0          49s
router-bb4k6   0/1     Running   0          49s
router-fhx52   0/1     Running   0          49s
[gspence@gspence-thinkpadp1gen3 origin]$ oc get pods -n e2e-test-router-stress-qhv52 -o yaml | grep -i openshift.io/required-scc
      openshift.io/required-scc: restricted
      openshift.io/required-scc: restricted
      openshift.io/required-scc: restricted
[gspence@gspence-thinkpadp1gen3 origin]$ oc get pods -n e2e-test-router-stress-qhv52 -o yaml | grep -i allowPriv
        allowPrivilegeEscalation: false
        allowPrivilegeEscalation: false
        allowPrivilegeEscalation: false
[gspence@gspence-thinkpadp1gen3 origin]$ oc rsh -n e2e-test-router-stress-qhv52 router-b27hs
sh-5.1$ capsh --print | grep -i net_bind 
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_setpcap,cap_net_bind_service
sh-5.1$ ../reload-haproxy 
[NOTICE]   (21) : haproxy version is 2.8.10-f28885f
[NOTICE]   (21) : path to executable is /usr/sbin/haproxy
[ALERT]    (21) : Binding [/var/lib/haproxy/conf/haproxy.config:61] for frontend public: cannot bind socket (Permission denied) for [0.0.0.0:80]
[ALERT]    (21) : Binding [/var/lib/haproxy/conf/haproxy.config:91] for frontend public_ssl: cannot bind socket (Permission denied) for [0.0.0.0:443]
[ALERT]    (21) : [/usr/sbin/haproxy.main()] Some protocols failed to start their listeners! Exiting.
sh-5.1$ 

@alebedev87 alebedev87 Jan 28, 2025
The production router deployment adds this for the same reason too right?

I didn't see this; interesting. I'm wondering whether this was done to match the OpenShift-defined restricted SCC (instead of some custom user-defined one), or whether the router container really needs privilege escalation. The latter does not seem to be needed for the NET_BIND_SERVICE capability, which is what we need to be able to bind to privileged ports. The former should now be fixed by the required-scc annotation, whose goal was to do exact SCC matching when custom SCCs are defined on the cluster.

Actually, maybe there is something I'm still not understanding about NET_BIND_SERVICE, I do think it should provide the ability to bind to port 80 without ip_unprivileged_port_start: 80. It's definitely enabled for restricted without explicitly providing NET_BIND_SERVICE in the pod spec

Right, that's what the Linux manual says too: the NET_BIND_SERVICE capability is what you need to be able to bind to privileged ports:

$ man 7 capabilities | grep -A1 CAP_NET_BIND_SERVICE
       CAP_NET_BIND_SERVICE
              Bind a socket to Internet domain privileged ports (port numbers less than 1024).

Also, I agree that we don't need to set it explicitly in the container's securityContext because we set it during the image build.

What I meant here was not setting the securityContext at all. Like it was before your PR.

What puzzles me, though, is that not setting allowPrivilegeEscalation and setting it explicitly to false give different results. I tried it on the CIO-managed router: not setting allowPrivilegeEscalation works fine, while setting it to false gives the permission denied error, same as you showed above.


Also, I'm wondering how this router pod was even working before. Why does adding a readiness probe need an explicit match against the restricted SCC? I suppose that before, it was running in the "default" restricted-v2 SCC.


Also, I'm wondering how this router pod was even working before. Why does adding a readiness probe need an explicit match against the restricted SCC? I suppose that before, it was running in the "default" restricted-v2 SCC.

It wasn't working before. HAProxy failed to start, but the pod went ready anyway because there was no readiness probe. However, we didn't really need HAProxy to run, because this test only examines route status. I agree it's confusing.

I wanted to prevent the test from starting prematurely, before all routers were ready, which required me to add a readiness probe, which then exposed the fact that HAProxy was never getting started in the first place 😵
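For context, the kind of readiness probe added here looks roughly like the fragment below. This is a sketch, not the PR's exact code: the path, port, and timing values are illustrative assumptions (the OpenShift router conventionally serves a readiness endpoint such as /healthz/ready on its stats port, but verify against the test's pod template).

```go
// Sketch only: field values are illustrative assumptions, not the PR's
// exact settings. Assumes k8s.io/api/core/v1 (corev1) and
// k8s.io/apimachinery/pkg/util/intstr.
ReadinessProbe: &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/healthz/ready",      // router health endpoint (assumed)
			Port: intstr.FromInt(1936),  // router stats port (assumed)
		},
	},
	InitialDelaySeconds: 1,
	PeriodSeconds:       1, // poll frequently so the stress test proceeds promptly
},
```

With a probe like this, the pod only reports Ready once HAProxy is actually serving, which is what makes waitForReadyReplicaSet meaningful and what surfaced the permission failure discussed above.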


openshift-trt bot commented Jan 27, 2025

Job Failure Risk Analysis for sha: 7ebf2cc

Job: pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout (Failure Risk: Low)
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 28.57% of 7 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

@gcs278 gcs278 force-pushed the OCPBUGS-44238-status-readiness-probe branch from 7ebf2cc to f9b5bce on January 28, 2025 15:56
@alebedev87

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 28, 2025

openshift-ci bot commented Jan 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87, gcs278

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


gcs278 commented Jan 28, 2025

None of the failures are in the test I am fixing:
/retest


gcs278 commented Jan 28, 2025

This is a CI test improvement that I've tested extensively. Risk is low.

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Jan 28, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d30b7e1 and 2 for PR HEAD f9b5bce in total


gcs278 commented Jan 29, 2025

Failures not related to this test.
/retest

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD fcc0ea0 and 1 for PR HEAD f9b5bce in total


gcs278 commented Jan 29, 2025

I see some green for the microshift jobs now; maybe it's fixed:
/retest

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD a0d3376 and 0 for PR HEAD f9b5bce in total


openshift-ci bot commented Jan 30, 2025

@gcs278: all tests passed!

Full PR test history. Your PR dashboard.


@openshift-merge-bot openshift-merge-bot bot merged commit f14111c into openshift:master Jan 31, 2025
29 checks passed
@openshift-ci-robot

@gcs278: Jira Issue OCPBUGS-44238: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-44238 has been moved to the MODIFIED state.



gcs278 commented Jan 31, 2025

I'm going to start with 4.18 since it's affecting component readiness

/cherry-pick release-4.18

@openshift-cherrypick-robot

@gcs278: new pull request created: #29513

