
Deployment with many replicas fails to start them simultaneously, and the replica count does not match expectations #915

Open
497254693 opened this issue Mar 5, 2025 · 2 comments

Comments

@497254693

497254693 commented Mar 5, 2025

Environment:
2 GPU servers, each host with 2 A100 cards

HAMi scheduling plugin installed via Rancher

nvidia.deviceSplitCount was initially set to 10, later changed to 20


First I deployed a Deployment using a plain nginx image as a simple test:

Only the gpucores parameter was configured:
nvidia.com/gpucores: 1 (does 1 mean 1%?)

nvidia.com/gpu: 1 (this parameter was not added in the Deployment's YAML, but after startup, inspecting the pod's YAML shows it gets added automatically)
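For reference, the setup described above would look roughly like the sketch below. This is a reconstruction, not the actual manifest from the issue: the Deployment name and labels are made up, while the image, replica count, and `nvidia.com/gpucores` value are taken from the report. `nvidia.com/gpu` is shown commented out because, per the report, it is injected into the pod automatically rather than declared.

```yaml
# Hypothetical reconstruction of the test Deployment described in this issue.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-gpu-test        # assumed name, not from the issue
spec:
  replicas: 60
  selector:
    matchLabels:
      app: nginx-gpu-test
  template:
    metadata:
      labels:
        app: nginx-gpu-test
    spec:
      containers:
        - name: nginx
          image: nginx
          resources:
            limits:
              nvidia.com/gpucores: 1   # 1 = 1% of one card's compute
              # nvidia.com/gpu: 1 appears in the pod YAML after startup,
              # added automatically (not declared here)
```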

When I set spec.replicas = 60 all at once, one pod failed during container creation with error run hook (that pod never starts; it only works after deleting it so it gets re-created), and occasionally a pod reports binding rejected locked within 5 minutes (that pod is usually Pending and recovers after a while). In the end, at most 4 pods start normally.


Question 1: Is the binding rejected locked within 5 minutes error caused by high concurrency exceeding the HAMi scheduler plugin's capacity? Can this be optimized?

Question 2: What causes error run hook? Why does deleting the pod and letting it be re-created fix it?

Question 3: Why can at most 4 replica pods be started? In theory, gpu and gpucores should be enough to support 80 pods, shouldn't they?

@archlitchi
Collaborator

If no GPU memory is specified, a pod uses the entire card's memory by default, so at most 4 pods (one per card) can be allocated.
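Following that explanation, declaring an explicit device-memory slice should let multiple pods share one card. The sketch below assumes HAMi's device-memory resource name is `nvidia.com/gpumem` (in MiB) and that 2000 MiB is an appropriate slice size; verify both against your HAMi deployment before using it.

```yaml
# Hypothetical fix: request an explicit memory slice so pods can share a GPU.
resources:
  limits:
    nvidia.com/gpu: 1        # one virtual GPU slot
    nvidia.com/gpucores: 1   # 1% of the card's compute, as in the original test
    nvidia.com/gpumem: 2000  # ~2000 MiB of device memory (assumed resource name and unit)
```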

@497254693
Author

> If no GPU memory is specified, a pod uses the entire card's memory by default, so at most 4 pods can be allocated.

Understood. Do you know what causes questions 1 and 2, then?
