api_server deployment occasionally hangs #2691
Comments
Maybe related to #2706
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
Hit the same problem: 50 instances running with TP=2, and a few of them hang, with one GPU at 0% utilization and the other at 100%.
@LiYtao It looks like P2P was disabled (NCCL_P2P_DIRECT_DISABLE=1). I see a similar situation when testing on cards that do not support P2P. Have you resolved this issue on your side?
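Since the comments above suspect the hang is related to P2P being unavailable or disabled, here is a hypothetical diagnostic sketch (not from the original thread) that checks peer-to-peer accessibility between the GPUs listed in the reproduction command, assuming PyTorch is installed in the deployment environment. `nvidia-smi topo -m` reports the same connectivity information from the command line.

```python
# Hypothetical diagnostic sketch: check whether each pair of GPUs used in the
# reproduction command can access each other via P2P. Device indices are an
# assumption taken from CUDA_VISIBLE_DEVICES=0,3,4,7 below.
import torch

devices = [0, 3, 4, 7]
for i in devices:
    for j in devices:
        if i == j:
            continue
        ok = torch.cuda.can_device_access_peer(i, j)
        print(f"GPU {i} -> GPU {j}: P2P {'supported' if ok else 'NOT supported'}")
```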
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now. |
Checklist
Describe the bug
When deploying InternVL2-40B with api_server, it occasionally hangs. Four A100 GPUs are used, and the last line of the debug log is
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
Reproduction
CUDA_VISIBLE_DEVICES=0,3,4,7 NCCL_P2P_DIRECT_DISABLE=1 lmdeploy serve api_server /var/aigc/model/InternVL2-40B --tp 4 --server-port 23333 --log-level DEBUG
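As a hypothetical way to exercise the server started above (not part of the original report), the sketch below sends repeated requests to the OpenAI-compatible endpoint that api_server exposes; the model name, prompt, and timeout are assumptions and should be adjusted to whatever `GET /v1/models` reports on your deployment.

```python
# Hypothetical client sketch: fire a batch of requests at the api_server and
# use a timeout so hung requests surface as errors instead of waiting forever.
import requests

url = "http://127.0.0.1:23333/v1/chat/completions"
payload = {
    "model": "/var/aigc/model/InternVL2-40B",  # assumption; check GET /v1/models
    "messages": [{"role": "user", "content": "Describe a sunset in one sentence."}],
}
for i in range(20):
    resp = requests.post(url, json=payload, timeout=120)
    print(i, resp.status_code)
```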
Environment
Error traceback