neonvm: Use crictl to change container CPU, ditch cgroup #738
Conversation
Looks good. I guess the difference will be that … Also, why do we have to set container limits through …?
Yeah- this is different, but I figured probably better anyways — it's more thorough resource limiting, or something like that.
Unfortunately, k8s <1.27 doesn't allow changing a pod/container's resources after it's created (partly why we're in this whole mess in the first place), and the upgrade process is complex. See also #591, neondatabase/cloud#7922
LGTM, but... I still don't understand why we need this whole cgroup dance if we have …
Replying to your questions from Slack:
In case an attacker exploits a vulnerability in the CPU-handling container, they could gain access to the CRI and potentially escalate their privileges (container breakout, bypassing network controls, etc.). One way to prevent this scenario, besides isolating the CPU-handling container, is to scan the CPU-handling container for vulnerabilities (which we already do) and to monitor pod activity regularly. (I'm skeptical of the current capability of the vulnerability scanner; perhaps GuardDuty may cover this as well.) As for pod-to-pod exploitation: in a cross-container attack where an attacker exploits a vulnerability in another container in the pod, they could gain access to the CRI and to the CPU-handling container. The most obvious ways to prevent this scenario are implementing least privilege, restricting access to the CRI, and implementing strong network security controls.
Moving CPU handling to a separate container naturally helps prevent privilege-escalation scenarios within the pod itself to some degree, but gaining unauthorized access to the CRI via some other vulnerability could, in theory, still lead to escalated privileges.
There are two reasons:
This also would resolve #591 (see there for more info).
#711 suggests a hypothesis that the number of auxiliary containers might impact compute cold start time, so this might make the situation worse.
@Omrigan IIUC the impact is from additional Spec.InitContainers, not Spec.Containers, right? (iiuc, the impact is specifically because k8s doesn't create the next InitContainer until the previous one has finished, so sandbox creation isn't pipelined)
Update reserved volume names here:
Looks good in general. Added several small things. Also I have a question about testing: have you checked that cAdvisor metrics work as expected here?
Not sure yet. At least, the additional "main" container will put more pressure on the …
@shayanh I'm going to resolve the comments you left about the HTTP handling for now. All the code there was directly copied from the existing implementation. Happy to address things in a follow-up.
@sharnoff makes sense. I'll make a note to open an issue regarding them later.
Extracted from #738, which adds a second container to the runner pods. Because of that second container, if only one container exits, the pod will still have .status.phase = Running, so we need to proactively notice that one of the containers has stopped and propagate that status to the VM itself. This also introduces some funky logic around how we handle restarts: because the 'Succeeded' and 'Failed' phases no longer imply that QEMU itself has stopped, we need to explicitly wait until either the pod is gone or the neonvm-runner container has stopped; otherwise we could end up with >1 instance of the VM running at a time.
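As an aside, here's a minimal sketch (not the actual controller code; the helper name is made up) of what "noticing that the runner container has stopped" can look like against the core/v1 pod API:

```go
// Minimal sketch (helper name is hypothetical) of detecting that the
// neonvm-runner container has stopped even though the pod still reports
// .status.phase = Running because the other container is still alive.
package vmcontroller

import corev1 "k8s.io/api/core/v1"

// runnerContainerStopped reports whether the "neonvm-runner" container in the
// runner pod has terminated, regardless of the other containers' states.
func runnerContainerStopped(pod *corev1.Pod) bool {
	// These pod-level phases can only be reached once every container has finished.
	if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
		return true
	}
	for _, status := range pod.Status.ContainerStatuses {
		if status.Name == "neonvm-runner" {
			return status.State.Terminated != nil
		}
	}
	return false
}
```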
Extracted from #738. It was kind of weird before, because the annotation was added to the pod after-the-fact during creation, but then ignored later on. Makes more sense to fully include it as one of the expected annotations.
Noticed while working on #738. In short, because the runner API version was part of labelsForVirtualMachine, any update to the runner version would be applied to *all* VM pods, not just new ones. This is (probably) not an issue in prod right now, but could be an issue for future changes. Right now, #738 would trigger this bug but remain unaffected (it doesn't have version-dependent logic *currently*). This PR fixes the behavior by adding the runner API version as an explicit argument to labelsForVirtualMachine and ignoring the label in updatePodMetadataIfNecessary.
neonvm/runner/main.go
if err := setCgroupLimit(logger, qemuCPUs.use, cgroupPath); err != nil {
	logger.Fatal("Failed to set cgroup limit", zap.Error(err))
}
The PR in its current state hasn't replicated this — so, right now, there's a regression where a 0.25-CPU VM starts at 1 CPU and needs to scale down.
Probably need to add milli CPU as an arg to container-mgr or something, and then do a similar "set on startup". (and actually, that'd also remove the need for nil-able cpu returned from the runner pod as well, so can revert 5acf536)
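For illustration, a rough sketch of that idea, assuming a hypothetical -initial-milli-cpu flag on container-mgr and a one-time crictl update at startup; the flag and helper names are made up rather than the PR's actual interface, and crictl's flag names should be double-checked against the installed version:

```go
// Hypothetical sketch: container-mgr takes the initial milli-CPU value as a
// CLI argument and applies it once at startup, so a 0.25-CPU VM never briefly
// runs with a full CPU. Names here are illustrative.
package main

import (
	"flag"
	"fmt"
	"log"
	"os/exec"
)

// applyCPU converts milli-CPUs into a CFS quota/period pair and shells out to
// crictl to update the runner container.
func applyCPU(containerID string, milliCPU int) error {
	const period = 100000 // CFS period, in microseconds
	quota := milliCPU * period / 1000
	out, err := exec.Command("crictl", "update",
		"--cpu-period", fmt.Sprint(period),
		"--cpu-quota", fmt.Sprint(quota),
		containerID,
	).CombinedOutput()
	if err != nil {
		return fmt.Errorf("crictl update failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	initialMilliCPU := flag.Int("initial-milli-cpu", 250, "initial CPU allotment for the VM, in milli-CPUs")
	containerID := flag.String("container-id", "", "CRI ID of the container to limit")
	flag.Parse()

	// Set the starting limit before serving any scaling requests, mirroring the
	// old "set cgroup limit on startup" behavior in neonvm-runner.
	if err := applyCPU(*containerID, *initialMilliCPU); err != nil {
		log.Fatal(err)
	}
	// ... then start the HTTP server that handles later CPU changes.
}
```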
fixed in 2af7af5
I don't understand why we need the new arguments here. Can't we just set the initial limits once container-mgr is starting?
In theory, yeah. In practice, it seemed like it wasn't simple to make sure that we didn't get any weird transitory states — we'd have to update the runner before moving from Pending to Running, or e.g. if .status.cpus is nil, update the runner before fetching the info.
The PR in its present form (reverted nillable runner CPU; container-mgr has the new arg) doesn't need any of the meat of the reconcile logic to be updated (with the notable exception of #749). That's my primary motivation (because the existing logic is so complex).
I added a few more comments. With the new commits the changes here are quite large and tricky (particularly, dealing with multiple containers in the controller).
Looks good to me. There is only a small thing mentioned in this comment:
#738 (comment)
Ah right, good catch - thanks!
Current status:
Currently, the behavior is disabled by default, with a neonvm-controller CLI flag to enable it. The plan is to gradually enable that flag, region-by-region, over the course of 1-2 weeks.
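For context, a highly simplified sketch of how an opt-in controller flag like this can gate the behavior; only the --enable-container-mgr flag name comes from this PR (see the description below), everything else here is a placeholder:

```go
// Illustrative only: the opt-in flag on neonvm-controller decides whether the
// runner pod spec gets the extra CPU-handling container. Container and image
// names here are placeholders, not the PR's actual values.
package main

import (
	"flag"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	enableContainerMgr := flag.Bool("enable-container-mgr", false,
		"handle CPU scaling via a separate container-mgr container instead of direct cgroup handling")
	flag.Parse()

	containers := []corev1.Container{{Name: "neonvm-runner", Image: "runner:placeholder"}}
	if *enableContainerMgr {
		// Only this second container would mount the containerd socket.
		containers = append(containers, corev1.Container{Name: "container-mgr", Image: "container-mgr:placeholder"})
	}
	fmt.Printf("runner pod spec will contain %d container(s)\n", len(containers))
}
```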
Follow-up that was brought up in a meeting: We should try out using the client library directly, rather than via the CLI. May substantially reduce latency.
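A rough sketch of what going through the client library could look like, assuming the CRI gRPC API from k8s.io/cri-api over the containerd socket; this is a guess at the follow-up, not code from this PR, and container-ID discovery and error handling are simplified:

```go
// Sketch: update a container's CPU limits through the CRI gRPC API directly,
// instead of shelling out to crictl. Assumes access to the containerd socket.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func updateContainerCPU(ctx context.Context, containerID string, milliCPU int64) error {
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	const period = int64(100000) // CFS period, in microseconds
	_, err = client.UpdateContainerResources(ctx, &runtimeapi.UpdateContainerResourcesRequest{
		ContainerId: containerID,
		Linux: &runtimeapi.LinuxContainerResources{
			CpuPeriod: period,
			CpuQuota:  milliCPU * period / 1000,
			CpuShares: milliCPU * 1024 / 1000,
		},
	})
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// The container ID would normally be discovered via ListContainers with
	// pod/container name filters; "<container-id>" is a stand-in here.
	if err := updateContainerCPU(ctx, "<container-id>", 250); err != nil {
		log.Fatal(err)
	}
}
```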
NB: This PR is conditionally enabled via the --enable-container-mgr flag on neonvm-controller. There are no effects without that.

---

We recently realized[^1] that under cgroups v2, kubernetes uses cgroup namespaces, which has a few effects:

1. The output of /proc/self/cgroup shows as if the container were at the root of the hierarchy
2. It's very difficult for us to determine the actual cgroup that the container corresponds to on the host
3. We still can't directly create a cgroup in the container's namespace because /sys/fs/cgroup is mounted read-only

So, neonvm-runner currently *does not* work as expected with cgroups v2; it creates a new cgroup for the VM, at the top of the hierarchy, and doesn't clean it up on exit.

How do we fix this? The aim of this PR is to remove the special cgroup handling entirely, and "just" go through the Container Runtime Interface (CRI) exposed by containerd to modify the existing container we're running in.

This requires access to /run/containerd/containerd.sock, which a malicious user could use to perform privileged operations on the host (or in any other container on the host). Obviously we'd like to prevent that as much as possible, so the CPU handling now runs alongside neonvm-runner as a separate container. neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set cpu shares, the abstraction underlying container resources.requests. The other options weren't looking so great[^2], so if this works, this would be a nice compromise.

[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: #591
Related: I spent a couple of hours setting up autoscaling on a fresh Ubuntu bare-metal machine, and at first it didn't work due to our legacy cgroup stuff. Apparently, we don't use the right cgroup version in some cases, which doesn't allow the VMs to start. It's nice that we are moving away from manually reading and modifying cgroup files.
Revert "neonvm: Use crictl to change container CPU, ditch cgroup (#738)": This reverts commit d30687b. Signed-off-by: Oleg Vasilev <[email protected]>
We recently realized[^1] that under cgroups v2, kubernetes uses cgroup namespaces, which has a few effects:
1. The output of /proc/self/cgroup shows as if the container were at the root of the hierarchy
2. It's very difficult for us to determine the actual cgroup that the container corresponds to on the host
3. We still can't directly create a cgroup in the container's namespace, because /sys/fs/cgroup is mounted read-only
So, neonvm-runner currently does not work as expected with cgroups v2; it creates a new cgroup for the VM, at the top of the hierarchy, and doesn't clean it up on exit.
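As a tiny illustration of effect 1 above (assuming the program runs inside a container under cgroups v2 with cgroup namespaces enabled), reading /proc/self/cgroup from inside the container gives no hint of the host-side cgroup path:

```go
// Illustration only: under a cgroup namespace on cgroups v2, /proc/self/cgroup
// typically prints just "0::/", hiding the host-side path of the container's
// actual cgroup.
package main

import (
	"fmt"
	"os"
)

func main() {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		panic(err)
	}
	fmt.Print(string(data))
}
```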
How do we fix this? The aim of this PR is to remove the special cgroup handling entirely, and "just" go through the Container Runtime Interface (CRI) exposed by containerd to modify the existing container we're running in.
This requires access to /run/containerd/containerd.sock, which a malicious user could use to perform privileged operations on the host (or in any other container on the host). Obviously we'd like to prevent that as much as possible, so the CPU handling now runs alongside neonvm-runner as a separate container. neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set cpu shares, the abstraction underlying container resources.requests. The other options weren't looking so great[^2], so if this works, this would be a nice compromise.
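For a sense of the numbers involved, here's a small back-of-the-envelope sketch of the conventional milli-CPU conversions: requests map to cpu shares (relative weight) and limits map to a CFS quota per period. The 250m/1-CPU values are purely illustrative, not taken from this PR.

```go
// Illustrative conversions: milli-CPU request -> cpu.shares, milli-CPU limit
// -> CFS quota over a fixed period.
package main

import "fmt"

func main() {
	const period = int64(100000) // CFS period, in microseconds

	milliRequest := int64(250) // e.g. resources.requests.cpu: 250m
	milliLimit := int64(1000)  // e.g. resources.limits.cpu: "1"

	shares := milliRequest * 1024 / 1000 // 256: relative weight under CPU contention
	quota := milliLimit * period / 1000  // 100000: hard cap per period

	fmt.Printf("cpu.shares=%d cpu.cfs_quota_us=%d cpu.cfs_period_us=%d\n", shares, quota, period)
}
```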
Honestly have no clue whether this'll work. Haven't tested it at all (except for some ad-hoc interactions with crictl from ssh-ing into a kind node). Wanted to open this for visibility, in case there's something silly I missed.
Fixes #755.
Fixes #591.
Footnotes
[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: https://github.com/neondatabase/autoscaling/issues/591