Migrate CRI-O jobs away from kubernetes_e2e.py #32567

Open
saschagrunert opened this issue May 6, 2024 · 42 comments
Assignees
Labels
priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@saschagrunert
Member

saschagrunert commented May 6, 2024

The kubernetes_e2e.py script is deprecated and we should use kubetest2 instead.

All affected tests are listed in https://testgrid.k8s.io/sig-node-cri-o

cc @kubernetes/sig-node-cri-o-test-maintainers

Ref: https://github.com/kubernetes/test-infra/tree/master/scenarios, #20760

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 6, 2024
@haircommander
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 6, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 4, 2024
@saschagrunert
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 5, 2024
@kannon92
Contributor

/triage accepted
/priority important-longterm

@kannon92 kannon92 moved this from Triage to Issues - To do in SIG Node CI/Test Board Aug 21, 2024
@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Aug 21, 2024
@elieser1101
Contributor

Does this still need help? Can I start looking at it?

@saschagrunert
Member Author

@elieser1101 I'd appreciate your eyes on that. 🙏

@elieser1101
Contributor

/assign

@bart0sh
Contributor

bart0sh commented Dec 18, 2024

@elieser1101 I can see a lot of green kubetest2 jobs in the test grid. Is there anything that prevents replacing the kubernetes_e2e.py jobs with them? I did it for the splitfs and imagefs jobs as I was involved in fixing them. I can do it for the rest of the jobs if needed.

@elieser1101
Contributor

@bart0sh thank you very much for the splitfs/imagefs work, that was a great finding.


What would come next is to validate that the kubetest2 are actually working. Meaning, I noticed that some of the jobs are completing but are skipping all the specs. We would like to ensure we are running the jobs properly before replacing the kubernetes_e2e.py jobs.
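One way to spot a job that finishes green while skipping every spec is to count executed versus skipped test cases in its JUnit report. The following is only a minimal sketch: it assumes the report is JUnit-style XML with `<testcase>` elements and `<skipped/>` children, and the function name and sample report are hypothetical, not taken from any job in this thread.

```python
import xml.etree.ElementTree as ET


def count_specs(junit_xml: str):
    """Return (ran, skipped) testcase counts from a JUnit XML string."""
    root = ET.fromstring(junit_xml)
    ran = skipped = 0
    for case in root.iter("testcase"):
        # A testcase with a <skipped/> child was not executed.
        if case.find("skipped") is not None:
            skipped += 1
        else:
            ran += 1
    return ran, skipped


# Hypothetical report in which every spec was skipped.
SAMPLE = """<testsuites>
  <testsuite name="E2eNode Suite" tests="2" skipped="2">
    <testcase name="spec A"><skipped/></testcase>
    <testcase name="spec B"><skipped/></testcase>
  </testsuite>
</testsuites>"""

ran, skipped = count_specs(SAMPLE)
if ran == 0:
    print(f"suspicious: 0 specs ran, {skipped} skipped")
```

A job whose report yields `ran == 0` probably has a focus or skip expression that matches nothing, so it should not yet replace its kubernetes_e2e.py counterpart.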

At the moment I'm looking at the DRA ones, which were missing some kubetest2 features and this

@bart0sh
Contributor

bart0sh commented Jan 2, 2025

@elieser1101 pull-crio-cgroupv2-node-e2e-eviction-kubetest2 fails with Context was cancelled (cause: suite timeout occurred) after 235.856s., which is quite strange as I don't see this timeout specified anywhere. The corresponding non-kubetest2 test case has a longer timeout and passes, so this seems to be caused by kubetest2. Do you happen to know the reason? Did you see this error in other job logs?

@elieser1101
Contributor

I have not seen that before, but it seems like the kubetest2 job is missing the --timeout flag. We could try adding it.
@bart0sh

@bart0sh
Contributor

bart0sh commented Jan 3, 2025

@elieser1101 Thanks! I added the --timeout option to the job configs: #34067
However, kubetest2 seems to modify its value. I ran the eviction job locally this way:

kubetest2-gce --test=node --down=false -- --parallelism=1 --gcp-zone=us-west1-a  --repo-root=. --image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/crio/latest/image-config-cgroupv1.yaml --delete-instances=false --test-args='--container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"' --skip-regex='' --focus-regex='\[NodeFeature:Eviction\]' --timeout 300m 

And it runs ginkgo this way:

I0103 13:18:22.467151  250332 node_e2e.go:195] Starting tests on "test-fedora-coreos-41-20241122-3-0-gcp-x86-64"
I0103 13:18:22.467281  250332 ssh.go:146] Running the command ssh, with args: [-o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -i /home/ed/.ssh/google_compute_engine [email protected] -- sudo /bin/bash -c 'cd /tmp/node-e2e-20250103T131740 && set -o pipefail; timeout -k 30s 18000.000000s ./ginkgo -timeout=24h -focus="\[NodeFeature:Eviction\]"  --no-color -v --timeout=180m ./e2e_node.test -- --system-spec-name= --system-spec-file= --extra-envs= --runtime-config= --v 4 --node-name=test-fedora-coreos-41-20241122-3-0-gcp-x86-64 --report-dir=/tmp/node-e2e-20250103T131740/results --report-prefix=fedora --image-description="fedora-coreos-41-20241122-3-0-gcp-x86-64" --kubelet-flags="--cluster-domain=cluster.local" --dns-domain="cluster.local" --prepull-images=false  --container-runtime-endpoint=unix:///run/containerd/containerd.sock --container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}" 2>&1 | tee -i /tmp/node-e2e-20250103T131740/results/test-fedora-coreos-41-20241122-3-0-gcp-x86-64-ginkgo.log']

So, kubetest2 changes --timeout 300m to ginkgo's --timeout=180m for some reason. Do you have any idea why?
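To see at a glance which timeouts end up in the logged command, the values can be pulled out with a couple of regular expressions. This is a rough sketch, not part of any tool mentioned here: `LOGGED` is a shortened copy of the ginkgo invocation from the log above, and `find_timeouts` is a hypothetical helper name.

```python
import re

# Shortened copy of the ginkgo invocation from the log above.
LOGGED = (
    "timeout -k 30s 18000.000000s ./ginkgo -timeout=24h "
    '-focus="\\[NodeFeature:Eviction\\]"  --no-color -v --timeout=180m ./e2e_node.test'
)


def find_timeouts(command: str) -> dict:
    """Return the process-level timeout and every -timeout/--timeout flag."""
    # Duration argument of the coreutils `timeout -k <grace> <duration>` wrapper.
    process = re.search(r"\btimeout -k \S+ (\S+)", command)
    # Flags of the form -timeout=VALUE or --timeout=VALUE, in order of appearance.
    flags = re.findall(r"(?<!\S)(-{1,2}timeout)=(\S+)", command)
    return {
        "process": process.group(1) if process else None,
        "ginkgo_flags": flags,
    }


print(find_timeouts(LOGGED))
```

For the command above this reports three separate values: the 18000.000000s wrapper timeout, `-timeout=24h`, and `--timeout=180m`.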

@aojea
Member

aojea commented Jan 3, 2025

Actually, there are two timeouts there:

./ginkgo -timeout=24h -focus="\[NodeFeature:Eviction\]"  --no-color -v --timeout=180m

It seems one of them is added in https://github.com/kubernetes/kubernetes/blob/master/hack/make-rules/test-e2e-node.sh

EDIT

@bart0sh you are not passing the flag to kubetest2, IIUC; it has to be added before the --

@bart0sh
Contributor

bart0sh commented Jan 3, 2025

@aojea > you are not passing the flag to kubetest2, IIUC; it has to be added before the --

I'm not passing it to kubetest2 because kubetest2 doesn't have this flag:

$ kubetest2 gce --help 2>&1 |grep timeout
      --boskos-acquire-timeout-seconds int      How long (in seconds) to hang on a request to Boskos to acquire a resource before erroring. (default 300)

And I tested the fix, btw.
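The disagreement is about which side of the `--` separator the flag belongs on. As a minimal sketch of that split, assuming (as this thread suggests) that arguments before `--` are handled by kubetest2 and its deployer while arguments after it go to the tester selected with `--test`: the helper name is hypothetical and the argument list is abbreviated from the command above.

```python
def split_kubetest2_args(argv):
    """Split argv at the first bare "--" into (before, after)."""
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    # No separator: everything stays on the kubetest2/deployer side.
    return list(argv), []


# Abbreviated from the local invocation earlier in this thread.
argv = [
    "--test=node", "--down=false", "--",
    "--parallelism=1", "--focus-regex=\\[NodeFeature:Eviction\\]",
    "--timeout", "300m",
]
before, after = split_kubetest2_args(argv)
print("kubetest2/deployer:", before)
print("tester:", after)
```

With this split, `--timeout 300m` lands on the tester side, which is consistent with the deployer's `--help` output not listing a timeout flag.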

@aojea
Member

aojea commented Jan 3, 2025

I'm not passing it to kubetest2 because kubetest2 doesn't have this flag:

Isn't it this one?

https://github.com/kubernetes-sigs/kubetest2/blob/22d5b1410bef09ae679fa5813a5f0d196b6079de/pkg/testers/node/node.go#L73

Or are these changes not for e2e-node?

@bart0sh
Contributor

bart0sh commented Jan 3, 2025

They are for e2e-node, but I couldn't use --timeout for kubetest2 when I ran it manually. Am I missing something obvious here?

BTW, here are the job logs before and after adding the --timeout option to the job configuration. You can see there how the value of ginkgo's --timeout option has changed to 180m for some reason.

@elieser1101
Contributor

elieser1101 commented Jan 3, 2025

So, kubetest2 changes --timeout 300m to ginkgo's --timeout=180m for some reason. Do you have any idea why?

I have seen that before, and I can't point to why it happens, but I think the change comes from test-e2e-node.sh and e2e_node/remote/remote.go rather than from kubetest2.

is not this one ?
https://github.com/kubernetes-sigs/kubetest2/blob/22d5b1410bef09ae679fa5813a5f0d196b6079de/pkg/testers/node/node.go#L73

Yeah, that is the flag we are using (tester flags), but under the hood the rabbit hole transforms the timeout in several places.

When we pass --timeout=300m to kubetest2, we get this:

Running the command ssh, with args: [-o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -i /root/.ssh/google_compute_engine [email protected] -- sudo /bin/bash -c 'cd /tmp/node-e2e-20250103T183438 && set -o pipefail; timeout -k 30s 18000.000000s ./ginkgo -timeout=24h -focus="\[NodeFeature:Eviction\]"  -skip=""""  --no-color -v --timeout=180m ./e2e_node.test -- --system-spec-name= --system-spec-file= --extra-envs= --runtime-config= --v 4 --node-name=test-fedora-coreos-41-20241122-3-0-gcp-x86-64 --report-dir=/tmp/node-e2e-20250103T183438/results --report-prefix=fedora --image-description="fedora-coreos-41-20241122-3-0-gcp-x86-64" --kubelet-flags="--cluster-domain=cluster.local" --dns-domain="cluster.local" --prepull-images=false  --container-runtime-endpoint=unix:///run/containerd/containerd.sock --container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}" 2>&1 | tee -i /tmp/node-e2e-20250103T183438/results/test-fedora-coreos-41-20241122-3-0-gcp-x86-64-ginkgo.log']
  • This results in a process timeout of 18000.000000s
  • test-e2e-node.sh also introduces -timeout=24h regardless of any other timeout you pass
  • And finally there is the timeout we specified, trimmed by remote.go, resulting in --timeout=180m

So setting 300 min results in (300 + 60) / 2 = 180 min being passed to ginkgo.
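The arithmetic can be written out as a small sketch. It only encodes the formula quoted in this comment, (requested + 60) / 2 in minutes; it is not copied from remote.go, it has been checked here against the single 300m to 180m observation only, and both function names are hypothetical.

```python
def ginkgo_timeout_minutes(requested_minutes: float) -> float:
    """Apply the adjustment described in this thread: (requested + 60) / 2.

    Assumption: this mirrors the observed 300m -> 180m behaviour only; the
    real logic lives in test-e2e-node.sh and e2e_node/remote/remote.go.
    """
    return (requested_minutes + 60) / 2


def requested_for(ginkgo_minutes: float) -> float:
    """Invert the formula: the value to request so ginkgo gets ginkgo_minutes."""
    return ginkgo_minutes * 2 - 60


print(ginkgo_timeout_minutes(300))  # value ginkgo receives for --timeout 300m
print(requested_for(300))           # value to request if ginkgo should get 300m
print(300 * 60)                     # 300 minutes in seconds, cf. the 18000.000000s process timeout
```

If the formula holds in general, getting a full 300 minutes in ginkgo would mean requesting 540m, while the outer process timeout matches the 300 minutes that were passed in.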

@bart0sh
Contributor

bart0sh commented Jan 3, 2025

I hope that timeout recalculation has some reason. It's not obvious, but hopefully it exists :)

BTW, increasing the timeout helped the job but did not fix it. One test case still fails.

@kannon92 @elieser1101 Any ideas how to fix it?

@elieser1101
Contributor

Is it possible the test itself is flaky? I can see that the non-kubetest2 job works intermittently, and I also found one run with an error similar to that of the job running with kubetest2.

@bart0sh

@kannon92
Contributor

kannon92 commented Jan 6, 2025

The CRI-O eviction tests have some issues. I wouldn't worry about that.

@bart0sh
Contributor

bart0sh commented Jan 7, 2025

Is it possible the test itself is flaky?

Could be, but I've never managed to run the -kubetest2 tests without failure. The non-kubetest2 tests are almost always green.

eviction crio tests have some issues.

It's probably off-topic here, so feel free to ignore.
I've noticed unexpectedly long timeouts in e2e eviction test cases. Is it considered normal for eviction to start 10 minutes after the issue (disk/pid pressure) started to manifest itself?

$ grep 'pressureTimeout :=' test/e2e_node/eviction_test.go 
        pressureTimeout := 15 * time.Minute
        pressureTimeout := 10 * time.Minute
        pressureTimeout := 10 * time.Minute
        pressureTimeout := 15 * time.Minute
        pressureTimeout := 10 * time.Minute
        pressureTimeout := 10 * time.Minute
        pressureTimeout := 10 * time.Minute
        pressureTimeout := 15 * time.Minute
        pressureTimeout := 10 * time.Minute

@elieser1101
Contributor

elieser1101 commented Jan 16, 2025

Opened PR #34164 to promote the kubetest2 jobs that have been working consistently. The jobs still pending rework are:

  • Evented PLEG, where the non-kubetest2 job seems not to be working
  • Hugepages
  • Eviction
  • Resource manager

Projects
Status: Issues - To do
Development

No branches or pull requests

9 participants