Environment: 2 GPU servers, each host with 2 A100 cards
HAMi scheduler plugin installed via Rancher
nvidia.deviceSplitCount was initially set to 10, later changed to 20
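For context, a minimal sketch of where this setting lives in the upstream HAMi Helm chart; the Rancher chart may expose it under a different key (e.g. nvidia.deviceSplitCount), so verify against the chart actually installed:

```yaml
# Assumed upstream HAMi Helm values layout, not the exact Rancher chart values.
devicePlugin:
  deviceSplitCount: 20   # maximum number of vGPU tasks that may share one physical GPU
```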
First I deployed a Deployment for a simple test, using a plain nginx image:
Only the gpucores parameter was configured: nvidia.com/gpucores: 1 (does 1 mean 1%?)
nvidia.com/gpu: 1 (this parameter was not added in the Deployment's YAML, but after startup, inspecting the pod's YAML shows it is added automatically; see the manifest sketch below)
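A minimal sketch of the test Deployment as described (the name is a placeholder and the manifest is illustrative, not the exact one used):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-gpu-test        # hypothetical name
spec:
  replicas: 60
  selector:
    matchLabels:
      app: nginx-gpu-test
  template:
    metadata:
      labels:
        app: nginx-gpu-test
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            nvidia.com/gpucores: 1   # 1 = 1% of one card's compute
            # nvidia.com/gpu is intentionally omitted here; per the report,
            # HAMi injects nvidia.com/gpu: 1 into the pod spec automatically
```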
When I set spec.replicas = 60 replicas in one go, one pod hit an error during container creation: error run hook (that pod never starts and only recovers after being deleted and recreated), and occasionally a pod reports: binding rejected locked within 5 minutes (that pod usually sits in Pending and recovers after a while). In the end, at most 4 pods start up normally.
Question 1: Is the binding rejected locked within 5 minutes error caused by concurrency too high for the HAMi scheduler plugin to handle? Can this be optimized?
Question 2: What causes the error run hook error? Why does deleting and recreating the pod fix it?
Question 3: Why can at most 4 replica pods be started? In theory, gpu and gpucores should be enough to support 80 pods (4 cards × deviceSplitCount of 20).
Because a pod that does not specify GPU memory uses all of the card's memory by default, at most 4 pods (one per physical GPU) can be allocated.
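A sketch of what the limits could look like with an explicit memory request via HAMi's nvidia.com/gpumem resource, so each slice no longer claims the whole card (the 2000 figure is an arbitrary example):

```yaml
resources:
  limits:
    nvidia.com/gpu: 1        # one vGPU slice
    nvidia.com/gpucores: 1   # 1% of the card's compute
    nvidia.com/gpumem: 2000  # MB of device memory; without this, the slice defaults to the full card
```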
Understood. Do you know what causes questions 1 and 2?