-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-44238: Add Readiness Probe to Router Status Tests #29395
OCPBUGS-44238: Add Readiness Probe to Router Status Tests #29395
Conversation
@gcs278: This pull request references Jira Issue OCPBUGS-44238, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Still need some debugging. |
/jira refresh |
@gcs278: This pull request references Jira Issue OCPBUGS-44238, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
671ecce
to
98c3239
Compare
@gcs278: This pull request references Jira Issue OCPBUGS-44238, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Job Failure Risk Analysis for sha: 98c3239
|
902c029
to
c5fadf0
Compare
Unrelated |
@gcs278: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Job Failure Risk Analysis for sha: c5fadf0
|
/assign |
/test unit |
c5fadf0
to
4c53b85
Compare
Job Failure Risk Analysis for sha: 4c53b85
|
I'm not aware of any issues with this PR, but I'll defer to @alebedev87. |
3887d68
to
c9b5c94
Compare
/assign @alebedev87 Ready for review @alebedev87 |
Job Failure Risk Analysis for sha: c9b5c94
|
226d0c7
to
7ebf2cc
Compare
SecurityContext: &corev1.SecurityContext{ | ||
// Default is true, but explicitly specified here for clarity. | ||
AllowPrivilegeEscalation: ptr.To[bool](true), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed as I explained here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry still working on a response to your other comment, doing some experimenting.
But, we do need it, if I set it to false
I get:
sh-5.1$ ../reload-haproxy
[NOTICE] (20) : haproxy version is 2.8.10-f28885f
[NOTICE] (20) : path to executable is /usr/sbin/haproxy
[ALERT] (20) : Binding [/var/lib/haproxy/conf/haproxy.config:61] for frontend public: cannot bind socket (Permission denied) for [0.0.0.0:80]
[ALERT] (20) : Binding [/var/lib/haproxy/conf/haproxy.config:91] for frontend public_ssl: cannot bind socket (Permission denied) for [0.0.0.0:443]
The production router deployment adds this for the same reason too right?
I can fix this with:
SecurityContext: &corev1.PodSecurityContext{
Sysctls: []corev1.Sysctl{
{
Name: "net.ipv4.ip_unprivileged_port_start",
Value: "80", // Set the desired value
},
},
},
So this isn't a NET_BIND_SERVICE
issue (it is enabled with restricted
scc), but ip_unprivileged_port_start
is still needed to allow unprivileged user bind to < 1024.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, maybe there is something I'm still not understanding about NET_BIND_SERVICE
, I do think it should provide the ability to bind to port 80 without ip_unprivileged_port_start: 80
. It's definitely enabled for restricted
without explicitly providing NET_BIND_SERVICE
in the pod spec:
capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_setpcap,cap_net_bind_service
EDIT: Here's my full example:
[gspence@gspence-thinkpadp1gen3 origin]$ oc get pods -n e2e-test-router-stress-qhv52
NAME READY STATUS RESTARTS AGE
router-b27hs 0/1 Running 0 49s
router-bb4k6 0/1 Running 0 49s
router-fhx52 0/1 Running 0 49s
[gspence@gspence-thinkpadp1gen3 origin]$ oc get pods -n e2e-test-router-stress-qhv52 -o yaml | grep -i openshift.io/required-scc
openshift.io/required-scc: restricted
openshift.io/required-scc: restricted
openshift.io/required-scc: restricted
[gspence@gspence-thinkpadp1gen3 origin]$ oc get pods -n e2e-test-router-stress-qhv52 -o yaml | grep -i allowPriv
allowPrivilegeEscalation: false
allowPrivilegeEscalation: false
allowPrivilegeEscalation: false
[gspence@gspence-thinkpadp1gen3 origin]$ oc rsh -n e2e-test-router-stress-qhv52 router-b27hs
sh-5.1$ capsh --print | grep -i net_bind
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_setpcap,cap_net_bind_service
sh-5.1$ ../reload-haproxy
[NOTICE] (21) : haproxy version is 2.8.10-f28885f
[NOTICE] (21) : path to executable is /usr/sbin/haproxy
[ALERT] (21) : Binding [/var/lib/haproxy/conf/haproxy.config:61] for frontend public: cannot bind socket (Permission denied) for [0.0.0.0:80]
[ALERT] (21) : Binding [/var/lib/haproxy/conf/haproxy.config:91] for frontend public_ssl: cannot bind socket (Permission denied) for [0.0.0.0:443]
[ALERT] (21) : [/usr/sbin/haproxy.main()] Some protocols failed to start their listeners! Exiting.
sh-5.1$
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The production router deployment adds this for the same reason too right?
I didn't see this, interesting. I'm wondering whether this was done to match against OpenShift defined restricted
SCC (instead of some custom user defined) or the router container really needs the privilege escalation. The latter does not seem to be needed for NET_BIND_SERVICE capability which is what we need to be able to bind on privilege ports. The former should be now fixed by required-scc
annotation whose goal was to do the exact SCC matching when some custom SCCs are defined on the cluster.
Actually, maybe there is something I'm still not understanding about NET_BIND_SERVICE, I do think it should provide the ability to bind to port 80 without ip_unprivileged_port_start: 80. It's definitely enabled for restricted without explicitly providing NET_BIND_SERVICE in the pod spec
Right, that's what the Linux manual says too - NET_BIND_SERVICE capability is what you need to be able to bind on privileged ports:
$ man 7 capabilities | grep -A1 CAP_NET_BIND_SERVICE
CAP_NET_BIND_SERVICE
Bind a socket to Internet domain privileged ports (port numbers less than 1024).
Also, I agree that we don't need to set it explicitly in the container's securityContext because we set it during the image build.
What I meant here was not setting the securityContext at all. Like it was before your PR.
What puzzles me though is the fact that not setting allowPrivilegeEscalation
and setting it explicitly to false
give different results. I tried it on the CIO managed router and not setting allowPrivilegeEscalation
works fine while setting it to false
gives the permission denied, same as you showed above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'm wondering how this router pod was even working before? Why adding a readiness probe needs an explicit match against restricted
SCC? I suppose that before it was running in "default" restricted-v2
SCC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Xref: #29395 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'm wondering how this router pod was even working before? Why adding a readiness probe needs an explicit match against restricted SCC? I suppose that before it was running in "default" restricted-v2 SCC.
It wasn't working before. HAProxy failed to start, but the pod went ready anyways because there was no readiness probe. However, we didn't really need HAProxy to run because we are just examining route status in this test. I agree it's confusing.
I wanted to prevent the test from starting prematurely before all routers are ready, which required me to add a readiness probe, which then exposed the fact that HAProxy was never getting started in the first place 😵
Job Failure Risk Analysis for sha: 7ebf2cc
|
Previously, the router was configured without a readiness probe, resulting in racy startup conditions during router status stress tests. Routers would be marked as ready immediately upon starting, causing the waitForReadyReplicaSet function to proceed prematurely. This allowed the next step of route creation to occur before the routers had fully initialized. This often led to the first two routers to fight over the route status while the third router was still starting. As a result, the third router missed observing these early status contentions, leading to more writes to the route status than we were expecting. Adding the readiness probe also revealed that HAProxy was failing to start due to insufficient permissions. The anyuid SCC was added to the router's service account to resolve the issue.
7ebf2cc
to
f9b5bce
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alebedev87, gcs278 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
None of the failures are of the test I am fixing: |
This is a CI test improvement that I've tested extensively. Risk is low. /label acknowledge-critical-fixes-only |
Failures not related to this test. |
i see some green for microshift jobs now, maybe it's fixed: |
@gcs278: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
@gcs278: Jira Issue OCPBUGS-44238: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-44238 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
I'm going to start with 4.18 since it's affecting component readiness /cherry-pick release-4.18 |
@gcs278: new pull request created: #29513 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Previously, the router was configured without a readiness probe, resulting in racy startup conditions during router status stress tests. Routers would be marked as ready immediately upon starting, causing the waitForReadyReplicaSet function to proceed prematurely. This allowed the next step of route creation to occur before the routers had fully initialized.
This often led to the first two routers to fight over the route status while the third router was still starting. As a result, the third router missed observing these early status contentions, leading to more writes to the route status than we were expecting.
Adding the readiness probe also revealed that HAProxy was failing to start due to insufficient permissions. The anyuid SCC was added to the router's service account to resolve the issue.