-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GPU support #3257
Add GPU support #3257
Conversation
This implements kubernetes-sigs#3164
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: samos123 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
pkg/apis/config/v1alpha4/types.go
Outdated
@@ -118,6 +118,10 @@ type Node struct { | |||
// binded to a host Port | |||
ExtraPortMappings []PortMapping `yaml:"extraPortMappings,omitempty" json:"extraPortMappings,omitempty"` | |||
|
|||
// GPUs allows to access GPU devices from the kind node. Setting this to | |||
// "all" will pass all the available GPUs to the kind node. | |||
Gpus string `yaml:"gpus,omitempty" json:"gpus,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use GPUs string
?
All letters in the acronym should have the same case, using the appropriate case for the situation
https://github.com/zecke/Kubernetes/blob/master/docs/devel/api-conventions.md#naming-conventions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, also this field should be validated (with "all" being the only valid value currently), and we should put a note that in the future we'll look at supporting specifying specific devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented more below and on the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/internal/apis/config/types.go
Outdated
@@ -98,6 +98,10 @@ type Node struct { | |||
// binded to a host Port | |||
ExtraPortMappings []PortMapping | |||
|
|||
// GPUs allows to access GPU devices from the kind node. Setting this to | |||
// "all" will pass all the available GPUs to the kind node. | |||
Gpus string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -285,6 +285,20 @@ nodes: | |||
|
|||
**Note**: Kubernetes versions are expressed as x.y.z, where x is the major version, y is the minor version, and z is the patch version, following [Semantic Versioning](https://semver.org/) terminology. For more information, see [Kubernetes Release Versioning.](https://github.com/kubernetes/sig-release/blob/master/release-engineering/versioning.md#kubernetes-release-versioning) | |||
|
|||
### GPU Support | |||
|
|||
Kind nodes can utilize GPUs by setting the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind nodes can utilize GPUs by setting the following: | |
Kind nodes can utilize GPU devices from the host, by setting the following: |
aligning with the API godoc comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -255,6 +255,10 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n | |||
args = append(args, "-e", "KUBECONFIG=/etc/kubernetes/admin.conf") | |||
} | |||
|
|||
if len(node.Gpus) > 0 { | |||
args = append(args, fmt.Sprintf("--gpus=%v", node.Gpus)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not plumb this directly since the values are incompatible across backends
instead we can have a more structured format in the internal type (for now gpus true/false) and we need valdiatino on the external type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added validation to ensure right now only "all" can be passed so it's now safe to to use node.Gpus directly in the docker provisioner. I prefer to keep the code simpler and passthrough instead of adding another layer of validation in the docker provisioning code. Happy to add additional validation in docker/povision.go if you have strong opinion on it.
For podman we can not plumb it through directly. So I will work out a way to convert node.Gpus into a podman compatible format. I don't have podman installed myself.
Only 'all' will be supported for now
One interesting thing I had to do was to run the following after kidn cluster came up:
So I'm thinking there might needs to be a change to the base image to include that symlink. This could also be due to my system being on Archlinux. |
@samos123: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Sorry to be a party pooper, but I do not believe this is the right approach for adding GPU support to kind. The With the upcoming CDI support that will be available in Docker 25, having the Once this is available in docker, both of these will work in a unified way:
|
Edit: I posted a full tutorial on how to configure Kind + GPU support here: https://www.substratus.ai/blog/kind-with-gpus I confirmed the steps provided by @klueska worked except hitting 1 minor issue. The only thing I had to do was run this:
Should that be something that should be included in the base image? Here were the full steps I used to verify:
|
For more context see: kubernetes-sigs/kind#3257 (comment)
@samos123 You shouldn't need to create that symlink. Can you show me the contents of your |
@klueska content of my file: https://gist.github.com/samos123/cc816b91a7a03651c71441e0949c3bb6 Note I don't seem to be the only one hitting this issue: NVIDIA/nvidia-docker#614 (comment) That's the source of the workaround |
Following @samos123 steps I can't fully deploy nvidia-device-plugin-daemonset and nvidia-operator-validator when I have more than one node deployed by using this kind-config:
This is the error I get:
But when I deploy it over a kind with a control-plane and a single worker it works. |
I think thats a limitation of the approach. Maybe @klueska as another cool workaround to solve that? :D |
Only one of your workers has:
|
Yes because I only one one node to has gpu, the others I want to emulate that doesn't have GPU. Still, it fails if I add the hostpath to the another worker:
Error of nvidia-device-plugin-daemonset:
Error on nvidia-operator-validator: |
Nevermind, it was a problem related to my sysctl limit related to this topic: Now it works on 2 workers node! Thanks |
I had to downgrade the nvidia driver from 545->535 for it to work. Based on the compatibility stuff of the gpu-operator: |
Hello, I follow this comment and meet new error
How to solve this problem |
Anyone can help with this issue?
But now got
|
cuda version and driver version do not match. |
Add nvidia-driver host map to kind node. Edited `generate_kind_config.py` to add: ```yaml # part of GPU workaround extraMounts: - hostPath: /dev/null containerPath: /var/run/nvidia-container-devices/all ``` For more details, see: kubernetes-sigs/kind#3257 (comment)
You may want to look at this comment running Kubeflow, gpu-operator and wsl2 qbo with kind mages or directly into kind |
FYI, in addition to #3257 (comment), the workaround NVIDIA/nvidia-container-toolkit#198 (comment) is needed. I installed GPU Operator in kind as follows:
Maybe, NVIDIA/gpu-operator#441 (comment) is needed too. |
WIP, still need to add podman support
This implements GPU support as discussed in #3164