
neonvm: Use crictl to change container CPU, ditch cgroup #738

Merged 2 commits into main from sharnoff/neonvm-runner-crictl on Feb 2, 2024

Conversation

sharnoff (Member) commented Jan 15, 2024

We recently realized[^1] that under cgroups v2, Kubernetes uses cgroup namespaces, which has a few effects:

  1. The output of /proc/self/cgroup shows as if the container were at the root of the hierarchy
  2. It's very difficult for us to determine the actual cgroup that the container corresponds to on the host
  3. We still can't directly create a cgroup in the container's namespace because /sys/fs/cgroup is mounted read-only

So, neonvm-runner currently does not work as expected with cgroups v2; it creates a new cgroup for the VM, at the top of the hierarchy, and doesn't clean it up on exit.

How do we fix this? The aim of this PR is to remove the special cgroup handling entirely, and "just" go through the Container Runtime Interface (CRI) exposed by containerd to modify the existing container we're running in.

This requires access to /run/containerd/containerd.sock, which a malicious user could use to perform privileged operations on the host (or in any other container on the host). Obviously we'd like to prevent that as much as possible, so the CPU handling now runs alongside neonvm-runner as a separate container. neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set CPU shares, the abstraction underlying container resources.requests. The other options weren't looking so great[^2], so if this works, this would be a nice compromise.
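
For a rough sense of the mechanics (a sketch, not necessarily the exact code in this PR): the sidecar that has the containerd socket mounted can shell out to `crictl update` with a CFS period/quota and shares derived from the VM's milli-CPU. The function name and the container-ID lookup here are assumptions for illustration.

```go
package example

import (
	"context"
	"fmt"
	"os/exec"
	"strconv"
)

// updateContainerCPU is a sketch: ask crictl to update the CPU limits of an
// existing container, from a sidecar that has /run/containerd/containerd.sock
// mounted. The container ID is assumed to be discovered elsewhere
// (e.g. via `crictl ps`).
func updateContainerCPU(ctx context.Context, containerID string, milliCPU int64) error {
	const period int64 = 100000       // CFS period, in microseconds
	quota := milliCPU * period / 1000 // hard cap: 250m -> 25000us per 100000us period
	shares := milliCPU * 1024 / 1000  // relative weight, the abstraction behind resources.requests

	cmd := exec.CommandContext(ctx, "crictl",
		"--runtime-endpoint", "unix:///run/containerd/containerd.sock",
		"update",
		"--cpu-period", strconv.FormatInt(period, 10),
		"--cpu-quota", strconv.FormatInt(quota, 10),
		"--cpu-share", strconv.FormatInt(shares, 10),
		containerID,
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("crictl update: %w: %s", err, out)
	}
	return nil
}
```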


Honestly have no clue whether this'll work. Haven't tested it at all (except for some ad-hoc interactions with crictl from ssh-ing into a kind node). Wanted to open this for visibility, in case there's something silly I missed.


Fixes #755.
Fixes #591.

Footnotes

[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: https://github.com/neondatabase/autoscaling/issues/591

Omrigan (Contributor) commented Jan 15, 2024

Looks good. I guess the difference is that neonvm-runner itself will always be subject to the cgroup restrictions, not just the QEMU process. But that shouldn't be a problem, right?

Also, why do we have to set container limits through crictl? Isn't there a proper k8s way of setting a pod's resource limits at the cluster scale?

sharnoff (Member, Author):

> neonvm-runner will always be subject to cgroup restrictions, not just the QEMU process

Yeah, this is different, but I figured it's probably better anyway: it's more thorough resource limiting, or something like that.

> isn't there a proper k8s way of setting a pod's resource limits at the cluster scale?

Unfortunately, k8s <1.27 doesn't allow changing a pod/container's resources after it's created (partly why we're in this whole mess in the first place), and the upgrade process is complex. See also #591, neondatabase/cloud#7922

@sharnoff sharnoff marked this pull request as ready for review January 16, 2024 04:34
@sharnoff sharnoff force-pushed the sharnoff/neonvm-runner-crictl branch from db600d3 to 5723c36 Compare January 16, 2024 04:39
cicdteam (Contributor):

LGTM, but... I still don't understand why we need this whole cgroup dance if we have cpuLimitOvercommitFactor = 4 (IIUC, this means that CPU cgroup limits are not relevant, since QEMU limits the guest itself by CPU cores).

areyou1or0:

Replying to your questions from Slack:

The big risk comes from providing direct access to the containerd API from within the pod, even though it's in a separate, dedicated container.

If an attacker exploits a vulnerability in the CPU handling container, they could gain access to the CRI and potentially escalate their privileges (container breakout, bypassing network controls, etc.).

One way to mitigate this, besides isolating the CPU handling container, is to scan the CPU handling container for vulnerabilities (which we already do) and to monitor pod activity regularly. (I'm skeptical of the current capability of the vulnerability scanner; perhaps GuardDuty may cover this as well.)

For pod-to-pod exploitation: in a cross-container attack, an attacker who exploits a vulnerability in another container in the pod could gain access to the CRI and to the CPU handling container.

The most obvious way to prevent this scenario is to implement least privilege, restrict access to the CRI, and implement strong network security controls.

> Wondering whether it adequately prevents privilege escalation (are there exploits that allow moving laterally within a pod but not further?)

Moving CPU handling to a separate container naturally helps prevent privilege-escalation scenarios within the pod itself to some degree, but gaining unauthorized access to the CRI via another vulnerability could still, in theory, lead to escalated privileges.

sharnoff (Member, Author):

> I still don't understand why we need this whole cgroup dance if we have cpuLimitOvercommitFactor = 4 (IIUC, this means that CPU cgroup limits are not relevant, since QEMU limits the guest itself by CPU cores)

There are two reasons:

  1. QEMU can use more CPU time than the number of CPUs would indicate, often when doing IO (this is actually the original reason for the change - see neonvm/runner: Add overcommit factor for CPU limits #509)
  2. We do want to have some upper bound on the CPU usage, just in case.

This also would resolve #591 (see there for more info).
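
For concreteness, the arithmetic with the overcommit factor looks roughly like this (identifiers are illustrative, not the repo's actual names): the guest is capped at its vCPU count internally, while the container's CFS quota allows up to vCPU * overcommitFactor of host CPU time to cover QEMU's own overhead (e.g. IO threads).

```go
// Illustrative only; not the repo's actual identifiers.
// cfsQuotaForVM returns a CFS quota (microseconds per period) allowing the
// VM's container up to vCPU * overcommitFactor worth of host CPU time.
func cfsQuotaForVM(vCPUMilli, overcommitFactor, periodUsec int64) int64 {
	return vCPUMilli * overcommitFactor * periodUsec / 1000
}

// Example: a 250m (0.25 vCPU) VM with overcommitFactor = 4 and a 100ms period
// gets cfsQuotaForVM(250, 4, 100000) = 100000, i.e. up to one full host CPU.
```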

Omrigan (Contributor) commented Jan 17, 2024

#711 suggests a hypothesis that the number of auxiliary containers might impact compute cold start time. So this might make the situation worse.

sharnoff (Member, Author):

@Omrigan IIUC the impact is from additional Spec.InitContainers, not Spec.Containers, right? (IIUC, the impact is specifically because k8s doesn't create the next InitContainer until the previous one has finished, so sandbox creation isn't pipelined.)

shayanh commented Jan 19, 2024

Update reserved volume names here:

shayanh left a comment:

Looks good in general. Added several small things. Also, I have a question about testing: have you checked that cAdvisor metrics work as expected here?

Omrigan (Contributor) commented Jan 19, 2024

> @Omrigan IIUC the impact is from additional Spec.InitContainers, not Spec.Containers, right? (IIUC, the impact is specifically because k8s doesn't create the next InitContainer until the previous one has finished, so sandbox creation isn't pipelined.)

Not sure yet. At the very least, an additional "main" container will put more pressure on containerd, potentially delaying other parallel requests.

sharnoff (Member, Author):

@shayanh I'm going to resolve the comments you left about the HTTP handling for now. All the code there was directly copied from the existing implementation in runner/main.go, and I'd rather either move or modify it, not both (this also means that it keeps the same API).

Happy to address things in a follow-up.

shayanh commented Jan 19, 2024

@sharnoff makes sense. I'll make a note to open an issue regarding them later.

sharnoff added a commit that referenced this pull request Jan 19, 2024
Extracted from #738, which adds a second container to the runner pods.
Because of that second container, if only one container exits, the pod
will still have .status.phase = Running, so we need to proactively
notice that one of the containers has stopped and propagate that status
to the VM itself.

This also introduces some funky logic around how we handle restarts:
Because the 'Succeeded' and 'Failed' phases no longer imply that QEMU
itself has stopped, we need to explicitly wait until either the pod is
gone or the neonvm-runner container has stopped; otherwise we could end
up with >1 instance of the VM running at a time.
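
A minimal sketch of the kind of check that commit describes, using the standard corev1 types (not the controller's actual code; the function name is made up):

```go
package example

import corev1 "k8s.io/api/core/v1"

// runnerContainerStopped reports whether the named container in a runner pod
// has terminated, even though the pod's phase may still be Running because
// another container (e.g. the CPU sidecar) is alive.
func runnerContainerStopped(pod *corev1.Pod, containerName string) (bool, int32) {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != containerName {
			continue
		}
		if term := cs.State.Terminated; term != nil {
			return true, term.ExitCode
		}
	}
	return false, 0
}
```
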
sharnoff added a commit that referenced this pull request Jan 22, 2024
Extracted from #738. It was kind of weird before, because the annotation
was added to the pod after-the-fact during creation, but then ignored
later on. Makes more sense to fully include it as one of the expected
annotations.
sharnoff added a commit that referenced this pull request Jan 22, 2024
Noticed while working on #738. In short, because the runner API version
was part of labelsForVirtualMachine, any update to the runner version
would be applied to *all* VM pods, not just new ones.

This is (probably) not an issue in prod right now, but could be an issue
for future changes. Right now, #738 would trigger this bug, but remain
unaffected (it doesn't have version-dependent logic *currently*).

This PR fixes the behavior by adding the runner API version as an
explicit argument to labelsForVirtualMachine and ignoring the label in
updatePodMetadataIfNecessary.
Comment on lines 694 to 703
if err := setCgroupLimit(logger, qemuCPUs.use, cgroupPath); err != nil {
	logger.Fatal("Failed to set cgroup limit", zap.Error(err))
}
sharnoff (Member, Author):

PR in its current state hasn't replicated this — so, right now there's a regression where a 0.25 CPU VM starts at 1 CPU and needs to scale down.

Probably need to add milli CPU as an arg to container-mgr or something, and then do a similar "set on startup". (and actually, that'd also remove the need for nil-able cpu returned from the runner pod as well, so can revert 5acf536)

sharnoff (Member, Author):

fixed in 2af7af5


I don't understand why we need the new arguments here. Can't we just set the initial limits when container-mgr is starting?

sharnoff (Member, Author):

In theory, yeah. In practice, it seemed like it wasn't simple to make sure that we didn't get any weird transitory states — we'd have to update the runner before moving from Pending to Running, or e.g. if .status.cpus is nil, update the runner before fetching the info.

The PR in its present form (reverted nillable runner CPU; container-mgr has the new arg) doesn't need any of the meat of the reconcile logic to be updated (with the notable exception of #749). That's my primary motivation (because the existing logic is so complex).

sharnoff (Member, Author) commented Jan 22, 2024

Suggestion from @kelvich: use a flag to enable this behavior, so that we can do a very slow rollout (e.g. 1 week in a region before moving on).

Edit: this has now been implemented, in 5c93572

shayanh commented Jan 23, 2024

I added a few more comments. With the new commits, the changes here are quite large and tricky (particularly dealing with multiple containers in the controller).

sharnoff added a commit that referenced this pull request Jan 23, 2024
Extracted from #738. It was kind of weird before, because the annotation
was added to the pod after-the-fact during creation, but then ignored
later on. Makes more sense to fully include it as one of the expected
annotations.
@sharnoff sharnoff force-pushed the sharnoff/neonvm-runner-crictl branch from 0b7f279 to a9acb4e Compare January 29, 2024 00:03
@sharnoff sharnoff changed the base branch from main to sharnoff/neonvm-ctrlr-runner-container-statuses January 29, 2024 00:06
@sharnoff sharnoff force-pushed the sharnoff/neonvm-runner-crictl branch from a9acb4e to 5c93572 Compare January 29, 2024 02:19
shayanh left a comment:

Looks good to me. There is only a small thing mentioned in this comment:
#738 (comment)

sharnoff (Member, Author):

> only a small thing mentioned in this comment

Ah right, good catch - thanks!

sharnoff (Member, Author) commented Jan 30, 2024

Current status: the behavior is disabled by default, with a neonvm-controller CLI flag to enable it. The plan is to gradually enable that flag, region by region, over the course of 1-2 weeks.

sharnoff (Member, Author):

Follow-up that was brought up in a meeting: We should try out using the client library directly, rather than via the CLI. May substantially reduce latency.
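
For reference, a rough sketch of what the direct route could look like with the CRI gRPC API (k8s.io/cri-api); untested, and not a drop-in for what's in this PR:

```go
package example

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// updateContainerCPUViaCRI calls UpdateContainerResources directly over the
// containerd socket instead of shelling out to crictl.
func updateContainerCPUViaCRI(ctx context.Context, containerID string, milliCPU int64) error {
	conn, err := grpc.Dial(
		"unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		return err
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	_, err = client.UpdateContainerResources(ctx, &runtimeapi.UpdateContainerResourcesRequest{
		ContainerId: containerID,
		Linux: &runtimeapi.LinuxContainerResources{
			CpuPeriod: 100000,                 // CFS period (usecs)
			CpuQuota:  milliCPU * 100,         // e.g. 250m -> 25000us per period
			CpuShares: milliCPU * 1024 / 1000, // corresponds to resources.requests
		},
	})
	return err
}
```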

Base automatically changed from sharnoff/neonvm-ctrlr-runner-container-statuses to main February 1, 2024 19:19
sharnoff added a commit that referenced this pull request Feb 1, 2024
Extracted from #738, which adds a second container to the runner pods.
Because of that second container, if only one container exits, the pod
will still have `.status.phase = Running`, so we need to proactively
notice that one of the containers has stopped and propagate that status
to the VM itself.

This also introduces some funky logic around how we handle restarts:
Because the `Succeeded` and `Failed` phases no longer imply that QEMU
itself has stopped, we need to explicitly wait until either the pod is
gone or the neonvm-runner container has stopped; otherwise we could end
up with >1 instance of the VM running at a time.
@sharnoff sharnoff force-pushed the sharnoff/neonvm-runner-crictl branch from fb54a27 to cafeeb3 Compare February 2, 2024 21:45
NB: This PR is conditionally enabled via the --enable-container-mgr flag
on neonvm-controller. There are no effects without that.

---

We recently realized[^1] that under cgroups v2, kubernetes uses cgroup
namespaces which has a few effects:

1. The output of /proc/self/cgroup shows as if the container were at the
   root of the hierarchy
2. It's very difficult for us to determine the actual cgroup that the
   container corresponds to on the host
3. We still can't directly create a cgroup in the container's namespace
   because /sys/fs/cgroup is mounted read-only

So, neonvm-runner currently *does not* work as expected with cgroups v2;
it creates a new cgroup for the VM, at the top of the hierarchy, and
doesn't clean it up on exit.

How do we fix this? The aim of this PR is to remove the special cgroup
handling entirely, and "just" go through the Container Runtime Interface
(CRI) exposed by containerd to modify the existing container we're
running in.

This requires access to /run/containerd/containerd.sock, which a
malicious user could use to perform privileged operations on the host
(or in any other container on the host).
Obviously we'd like to prevent that as much as possible, so the CPU
handling now runs alongside neonvm-runner as a separate container.
neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set cpu
shares, the abstraction underlying container resources.requests.
The other options weren't looking so great[^2], so if this works, this
would be a nice compromise.

[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: #591
@sharnoff sharnoff force-pushed the sharnoff/neonvm-runner-crictl branch from cafeeb3 to 5a82cd8 Compare February 2, 2024 22:00
@sharnoff sharnoff merged commit d30687b into main Feb 2, 2024
15 checks passed
@sharnoff sharnoff deleted the sharnoff/neonvm-runner-crictl branch February 2, 2024 22:03
shayanh commented Feb 7, 2024

Related: I spent a couple of hours setting up autoscaling on a fresh Ubuntu bare metal machine, and it didn't work at first due to our legacy cgroup handling. Apparently, we don't use the right cgroup version in some cases, which prevents the VMs from starting. It's nice that we are moving away from manually reading and modifying cgroup files.

Omrigan added a commit that referenced this pull request Aug 29, 2024
Omrigan added a commit that referenced this pull request Aug 29, 2024
Omrigan added a commit that referenced this pull request Aug 29, 2024
Omrigan added a commit that referenced this pull request Sep 19, 2024
Omrigan added a commit that referenced this pull request Sep 19, 2024
Omrigan added a commit that referenced this pull request Sep 24, 2024
Omrigan added a commit that referenced this pull request Sep 24, 2024
Omrigan added a commit that referenced this pull request Sep 25, 2024
Omrigan added a commit that referenced this pull request Sep 30, 2024
Omrigan added a commit that referenced this pull request Oct 7, 2024