
neonvm: Use crictl to change container CPU, ditch cgroup #738

Merged 2 commits into main from sharnoff/neonvm-runner-crictl on Feb 2, 2024

Conversation

sharnoff (Member) commented Jan 15, 2024

We recently realized[^1] that under cgroups v2, Kubernetes uses cgroup namespaces, which has a few effects:

  1. The output of /proc/self/cgroup shows as if the container were at the root of the hierarchy
  2. It's very difficult for us to determine the actual cgroup that the container corresponds to on the host
  3. We still can't directly create a cgroup in the container's namespace because /sys/fs/cgroup is mounted read-only

So, neonvm-runner currently does not work as expected with cgroups v2; it creates a new cgroup for the VM, at the top of the hierarchy, and doesn't clean it up on exit.

How do we fix this? The aim of this PR is to remove the special cgroup handling entirely, and "just" go through the Container Runtime Interface (CRI) exposed by containerd to modify the existing container we're running in.

This requires access to /run/containerd/containerd.sock, which a malicious user could use to perform privileged operations on the host (or in any other container on the host). Obviously we'd like to prevent that as much as possible, so the CPU handling now runs alongside neonvm-runner as a separate container. neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set CPU shares, the abstraction underlying container resources.requests. The other options weren't looking so great[^2], so if this works, this would be a nice compromise.
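
For a rough sense of the mechanics (a sketch, not necessarily the exact code in this PR): the sidecar that has the containerd socket mounted can shell out to `crictl update` with a CFS period/quota and shares derived from the VM's milli-CPU. The function name and the container-ID lookup here are assumptions for illustration.

```go
package example

import (
	"context"
	"fmt"
	"os/exec"
	"strconv"
)

// updateContainerCPU is a sketch: ask crictl to update the CPU limits of an
// existing container, from a sidecar that has /run/containerd/containerd.sock
// mounted. The container ID is assumed to be discovered elsewhere
// (e.g. via `crictl ps`).
func updateContainerCPU(ctx context.Context, containerID string, milliCPU int64) error {
	const period int64 = 100000       // CFS period, in microseconds
	quota := milliCPU * period / 1000 // hard cap: 250m -> 25000us per 100000us period
	shares := milliCPU * 1024 / 1000  // relative weight, the abstraction behind resources.requests

	cmd := exec.CommandContext(ctx, "crictl",
		"--runtime-endpoint", "unix:///run/containerd/containerd.sock",
		"update",
		"--cpu-period", strconv.FormatInt(period, 10),
		"--cpu-quota", strconv.FormatInt(quota, 10),
		"--cpu-share", strconv.FormatInt(shares, 10),
		containerID,
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("crictl update: %w: %s", err, out)
	}
	return nil
}
```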


Honestly have no clue whether this'll work. Haven't tested it at all (except for some ad-hoc interactions with crictl from ssh-ing into a kind node). Wanted to open this for visibility, in case there's something silly I missed.


Fixes #755.
Fixes #591.

Footnotes

[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: https://github.com/neondatabase/autoscaling/issues/591

Omrigan (Contributor) commented Jan 15, 2024

Looks good. I guess the difference is that neonvm-runner itself will always be subject to the cgroup restrictions, not just the QEMU process. But that shouldn't be a problem, right?

Also, why do we have to set container limits through crictl? Isn't there a proper k8s way of setting a pod's resource limits at the cluster scale?

sharnoff (Member, Author):

> neonvm-runner will always be subject to cgroup restrictions, not just the QEMU process

Yeah, this is different, but I figured it's probably better anyway: it's more thorough resource limiting, or something like that.

> isn't there a proper k8s way of setting a pod's resource limits at the cluster scale?

Unfortunately, k8s <1.27 doesn't allow changing a pod/container's resources after it's created (partly why we're in this whole mess in the first place), and the upgrade process is complex. See also #591, neondatabase/cloud#7922

@sharnoff sharnoff marked this pull request as ready for review January 16, 2024 04:34
@sharnoff sharnoff force-pushed the sharnoff/neonvm-runner-crictl branch from db600d3 to 5723c36 Compare January 16, 2024 04:39
cicdteam (Contributor):

LGTM, but... I still don't understand why we need this whole cgroup dance if we have cpuLimitOvercommitFactor = 4 (IIUC, this means that CPU cgroup limits are not relevant, since QEMU limits the guest itself by CPU cores).

areyou1or0:

Replying to your questions from Slack:

The big risk comes from providing direct access to the containerd API from within the pod, even though it's in a separate, dedicated container.

If an attacker exploits a vulnerability in the CPU handling container, they could gain access to the CRI and potentially escalate their privileges (container breakout, bypassing network controls, etc.).

One way to mitigate this, besides isolating the CPU handling container, is to scan the CPU handling container for vulnerabilities (which we already do) and to monitor pod activity regularly. (I'm skeptical of the current capability of the vulnerability scanner; perhaps GuardDuty may cover this as well.)

For pod-to-pod exploitation: in a cross-container attack, an attacker who exploits a vulnerability in another container in the pod could gain access to the CRI and to the CPU handling container.

The most obvious way to prevent this scenario is to implement least privilege, restrict access to the CRI, and implement strong network security controls.

> Wondering whether it adequately prevents privilege escalation (are there exploits that allow moving laterally within a pod but not further?)

Moving CPU handling to a separate container naturally helps prevent privilege-escalation scenarios within the pod itself to some degree, but gaining unauthorized access to the CRI via another vulnerability could still, in theory, lead to escalated privileges.

sharnoff (Member, Author):

> I still don't understand why we need this whole cgroup dance if we have cpuLimitOvercommitFactor = 4 (IIUC, this means that CPU cgroup limits are not relevant, since QEMU limits the guest itself by CPU cores)

There are two reasons:

  1. QEMU can use more CPU time than the number of CPUs would indicate, often when doing IO (this is actually the original reason for the change - see neonvm/runner: Add overcommit factor for CPU limits #509)
  2. We do want to have some upper bound on the CPU usage, just in case.

This also would resolve #591 (see there for more info).
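
For concreteness, the arithmetic with the overcommit factor looks roughly like this (identifiers are illustrative, not the repo's actual names): the guest is capped at its vCPU count internally, while the container's CFS quota allows up to vCPU * overcommitFactor of host CPU time to cover QEMU's own overhead (e.g. IO threads).

```go
// Illustrative only; not the repo's actual identifiers.
// cfsQuotaForVM returns a CFS quota (microseconds per period) allowing the
// VM's container up to vCPU * overcommitFactor worth of host CPU time.
func cfsQuotaForVM(vCPUMilli, overcommitFactor, periodUsec int64) int64 {
	return vCPUMilli * overcommitFactor * periodUsec / 1000
}

// Example: a 250m (0.25 vCPU) VM with overcommitFactor = 4 and a 100ms period
// gets cfsQuotaForVM(250, 4, 100000) = 100000, i.e. up to one full host CPU.
```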

Omrigan (Contributor) commented Jan 17, 2024

#711 suggests a hypothesis that the number of auxiliary containers might impact compute cold start time. So this might make the situation worse.

sharnoff (Member, Author):

@Omrigan IIUC the impact is from additional Spec.InitContainers, not Spec.Containers, right? (IIUC, the impact is specifically because k8s doesn't create the next InitContainer until the previous one has finished, so sandbox creation isn't pipelined.)

shayanh commented Jan 19, 2024

Update reserved volume names here:

shayanh left a comment:

Looks good in general. Added several small things. Also, I have a question about testing: have you checked that cAdvisor metrics work as expected here?

Omrigan (Contributor) commented Jan 19, 2024

> @Omrigan IIUC the impact is from additional Spec.InitContainers, not Spec.Containers, right? (IIUC, the impact is specifically because k8s doesn't create the next InitContainer until the previous one has finished, so sandbox creation isn't pipelined.)

Not sure yet. At the very least, an additional "main" container will put more pressure on containerd, potentially delaying other parallel requests.

sharnoff (Member, Author):

@shayanh I'm going to resolve the comments you left about the HTTP handling for now. All the code there was directly copied from the existing implementation in runner/main.go, and I'd rather either move or modify it, not both (this also means that it keeps the same API).

Happy to address things in a follow-up.

shayanh commented Jan 19, 2024

@sharnoff makes sense. I'll make a note to open an issue regarding them later.

sharnoff added a commit that referenced this pull request Jan 19, 2024
Extracted from #738, which adds a second container to the runner pods.
Because of that second container, if only one container exits, the pod
will still have .status.phase = Running, so we need to proactively
notice that one of the containers has stopped and propagate that status
to the VM itself.

This also introduces some funky logic around how we handle restarts:
Because the 'Succeeded' and 'Failed' phases no longer imply that QEMU
itself has stopped, we need to explicitly wait until either the pod is
gone or the neonvm-runner container has stopped; otherwise we could end
up with >1 instance of the VM running at a time.
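
A minimal sketch of the kind of check that commit describes, using the standard corev1 types (not the controller's actual code; the function name is made up):

```go
package example

import corev1 "k8s.io/api/core/v1"

// runnerContainerStopped reports whether the named container in a runner pod
// has terminated, even though the pod's phase may still be Running because
// another container (e.g. the CPU sidecar) is alive.
func runnerContainerStopped(pod *corev1.Pod, containerName string) (bool, int32) {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != containerName {
			continue
		}
		if term := cs.State.Terminated; term != nil {
			return true, term.ExitCode
		}
	}
	return false, 0
}
```
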
sharnoff added a commit that referenced this pull request Jan 22, 2024
Extracted from #738. It was kind of weird before, because the annotation
was added to the pod after-the-fact during creation, but then ignored
later on. Makes more sense to fully include it as one of the expected
annotations.
sharnoff added a commit that referenced this pull request Jan 22, 2024
Noticed while working on #738. In short, because the runner API version
was part of labelsForVirtualMachine, any update to the runner version
would be applied to *all* VM pods, not just new ones.

This is (probably) not an issue in prod right now, but could be an issue
for future changes. Right now, #738 would trigger this bug, but remain
unaffected (it doesn't have version-dependent logic *currently*).

This PR fixes the behavior by adding the runner API version as an
explicit argument to labelsForVirtualMachine and ignoring the label in
updatePodMetadataIfNecessary.
Comment on lines 694 to 703
if err := setCgroupLimit(logger, qemuCPUs.use, cgroupPath); err != nil {
	logger.Fatal("Failed to set cgroup limit", zap.Error(err))
}
sharnoff (Member, Author):

PR in its current state hasn't replicated this — so, right now there's a regression where a 0.25 CPU VM starts at 1 CPU and needs to scale down.

Probably need to add milli CPU as an arg to container-mgr or something, and then do a similar "set on startup". (and actually, that'd also remove the need for nil-able cpu returned from the runner pod as well, so can revert 5acf536)

sharnoff (Member, Author):

fixed in 2af7af5


I don't understand why we need the new arguments here. Can't we just set the initial limits when container-mgr is starting?

sharnoff (Member, Author):

In theory, yeah. In practice, it seemed like it wasn't simple to make sure that we didn't get any weird transitory states — we'd have to update the runner before moving from Pending to Running, or e.g. if .status.cpus is nil, update the runner before fetching the info.

The PR in its present form (reverted nillable runner CPU; container-mgr has the new arg) doesn't need any of the meat of the reconcile logic to be updated (with the notable exception of #749). That's my primary motivation (because the existing logic is so complex).

sharnoff (Member, Author) commented Jan 22, 2024

Suggestion from @kelvich: use a flag to enable this behavior, so that we can do a very slow rollout (e.g. 1 week in a region before moving on).

Edit: this has now been implemented, in 5c93572

shayanh commented Jan 23, 2024

I added a few more comments. With the new commits, the changes here are quite large and tricky (particularly dealing with multiple containers in the controller).

sharnoff added a commit that referenced this pull request Jan 23, 2024
Extracted from #738. It was kind of weird before, because the annotation
was added to the pod after-the-fact during creation, but then ignored
later on. Makes more sense to fully include it as one of the expected
annotations.
@sharnoff sharnoff force-pushed the sharnoff/neonvm-runner-crictl branch from 0b7f279 to a9acb4e Compare January 29, 2024 00:03
@sharnoff sharnoff changed the base branch from main to sharnoff/neonvm-ctrlr-runner-container-statuses January 29, 2024 00:06
@sharnoff sharnoff force-pushed the sharnoff/neonvm-runner-crictl branch from a9acb4e to 5c93572 Compare January 29, 2024 02:19
shayanh left a comment:

Looks good to me. There is only a small thing mentioned in this comment:
#738 (comment)

sharnoff (Member, Author):

> only a small thing mentioned in this comment

Ah right, good catch - thanks!

sharnoff (Member, Author) commented Jan 30, 2024

Current status: the behavior is disabled by default, with a neonvm-controller CLI flag to enable it. The plan is to gradually enable that flag, region by region, over the course of 1-2 weeks.

sharnoff (Member, Author):

Follow-up that was brought up in a meeting: We should try out using the client library directly, rather than via the CLI. May substantially reduce latency.
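
For reference, a rough sketch of what the direct route could look like with the CRI gRPC API (k8s.io/cri-api); untested, and not a drop-in for what's in this PR:

```go
package example

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// updateContainerCPUViaCRI calls UpdateContainerResources directly over the
// containerd socket instead of shelling out to crictl.
func updateContainerCPUViaCRI(ctx context.Context, containerID string, milliCPU int64) error {
	conn, err := grpc.Dial(
		"unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		return err
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	_, err = client.UpdateContainerResources(ctx, &runtimeapi.UpdateContainerResourcesRequest{
		ContainerId: containerID,
		Linux: &runtimeapi.LinuxContainerResources{
			CpuPeriod: 100000,                 // CFS period (usecs)
			CpuQuota:  milliCPU * 100,         // e.g. 250m -> 25000us per period
			CpuShares: milliCPU * 1024 / 1000, // corresponds to resources.requests
		},
	})
	return err
}
```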

Base automatically changed from sharnoff/neonvm-ctrlr-runner-container-statuses to main February 1, 2024 19:19
sharnoff added a commit that referenced this pull request Feb 1, 2024
Extracted from #738, which adds a second container to the runner pods.
Because of that second container, if only one container exits, the pod
will still have `.status.phase = Running`, so we need to proactively
notice that one of the containers has stopped and propagate that status
to the VM itself.

This also introduces some funky logic around how we handle restarts:
Because the `Succeeded` and `Failed` phases no longer imply that QEMU
itself has stopped, we need to explicitly wait until either the pod is
gone or the neonvm-runner container has stopped; otherwise we could end
up with >1 instance of the VM running at a time.
@sharnoff sharnoff force-pushed the sharnoff/neonvm-runner-crictl branch from fb54a27 to cafeeb3 Compare February 2, 2024 21:45
NB: This PR is conditionally enabled via the --enable-container-mgr flag
on neonvm-controller. There are no effects without that.

---

We recently realized[^1] that under cgroups v2, kubernetes uses cgroup
namespaces which has a few effects:

1. The output of /proc/self/cgroup shows as if the container were at the
   root of the hierarchy
2. It's very difficult for us to determine the actual cgroup that the
   container corresponds to on the host
3. We still can't directly create a cgroup in the container's namespace
   because /sys/fs/cgroup is mounted read-only

So, neonvm-runner currently *does not* work as expected with cgroups v2;
it creates a new cgroup for the VM, at the top of the hierarchy, and
doesn't clean it up on exit.

How do we fix this? The aim of this PR is to remove the special cgroup
handling entirely, and "just" go through the Container Runtime Interface
(CRI) exposed by containerd to modify the existing container we're
running in.

This requires access to /run/containerd/containerd.sock, which a
malicious user could use to perform privileged operations on the host
(or in any other container on the host).
Obviously we'd like to prevent that as much as possible, so the CPU
handling now runs alongside neonvm-runner as a separate container.
neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set cpu
shares, the abstraction underlying container resources.requests.
The other options weren't looking so great[^2], so if this works, this
would be a nice compromise.

[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: #591
@sharnoff sharnoff force-pushed the sharnoff/neonvm-runner-crictl branch from cafeeb3 to 5a82cd8 Compare February 2, 2024 22:00
@sharnoff sharnoff merged commit d30687b into main Feb 2, 2024
15 checks passed
@sharnoff sharnoff deleted the sharnoff/neonvm-runner-crictl branch February 2, 2024 22:03
shayanh commented Feb 7, 2024

Related: I spent a couple of hours setting up autoscaling on a fresh Ubuntu bare metal machine, and it didn't work at first due to our legacy cgroup handling. Apparently, we don't use the right cgroup version in some cases, which prevents the VMs from starting. It's nice that we are moving away from manually reading and modifying cgroup files.

Omrigan added a commit that referenced this pull request Aug 29, 2024
Omrigan added a commit that referenced this pull request Aug 29, 2024
Omrigan added a commit that referenced this pull request Aug 29, 2024
Omrigan added a commit that referenced this pull request Sep 19, 2024
Omrigan added a commit that referenced this pull request Sep 19, 2024
Omrigan added a commit that referenced this pull request Sep 24, 2024
Omrigan added a commit that referenced this pull request Sep 24, 2024
Omrigan added a commit that referenced this pull request Sep 25, 2024
Omrigan added a commit that referenced this pull request Sep 30, 2024
Omrigan added a commit that referenced this pull request Oct 7, 2024