Very slow: the GPU is not being used effectively #38

Open
leegang opened this issue Jul 31, 2024 · 10 comments

Comments

leegang commented Jul 31, 2024

2024-07-31 20:15:45.650643419 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 12 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-07-31 20:15:45.652126712 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-07-31 20:15:45.652142112 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.


CUDA 11.8
cuDNN 8.9.2
onnxruntime 1.8
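
These warnings mean that some graph nodes were not assigned to CUDAExecutionProvider, so onnxruntime inserts Memcpy nodes to shuttle tensors between CPU and GPU. A minimal sketch, not specific to this repo, for turning on verbose logging and checking which providers a session actually registered (the model path is a placeholder):

```python
import onnxruntime as ort

# Verbose session logging (0 = VERBOSE) prints the per-node provider
# assignments that the warning above refers to.
so = ort.SessionOptions()
so.log_severity_level = 0

# "model.onnx" is a placeholder for whatever model this node loads.
sess = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# If only CPUExecutionProvider shows up here, the CUDA provider failed to
# load (e.g. onnxruntime-gpu missing or built against a different CUDA).
print(sess.get_providers())
```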

smthemex (Owner) commented Aug 1, 2024

You could try bumping CUDA and the torch package set to newer versions (torch capped at 2.2.1 or 2.2.0), though it may not help much. In my own tests CUDA can run at full load, but the utilization curve isn't smooth and fluctuates a lot. I'll try a few other approaches later.
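
For reference, a generic way to confirm which torch build ends up active after upgrading (this assumes torchvision and torchaudio are installed alongside torch, which may not hold in every environment):

```python
import torch, torchvision, torchaudio

# The three packages should come from the same release line and the same
# CUDA wheel index (e.g. all cu118 or all cu121 builds).
print("torch      :", torch.__version__, "| built for CUDA", torch.version.cuda)
print("torchvision:", torchvision.__version__)
print("torchaudio :", torchaudio.__version__)
print("cuDNN      :", torch.backends.cudnn.version())
print("GPU usable :", torch.cuda.is_available())
```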

leegang (Author) commented Aug 1, 2024

> You could try bumping CUDA and the torch package set to newer versions (torch capped at 2.2.1 or 2.2.0), though it may not help much. In my own tests CUDA can run at full load, but the utilization curve isn't smooth and fluctuates a lot. I'll try a few other approaches later.

torch==2.2.0

@peizhiluo007

It's also very slow for me: one iteration takes more than 10 minutes, and a 5-second audio clip takes over an hour to finish.
GPU utilization also shows 100%.

@smthemex (Owner)

Did you enable lowram mode? And how much VRAM does your card have?

peizhiluo007 commented Oct 28, 2024

> Did you enable lowram mode? And how much VRAM does your card have?

I didn't enable lowram; the card has 12 GB of VRAM.
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX TITAN X Off | 00000000:06:00.0 Off | N/A |
| 49% 82C P2 154W / 250W | 9676MiB / 12288MiB | 100% Default |
| | | N/A |

And I'm using the acc version of the model.

@smthemex (Owner)

On my 4070 12G it only takes a few minutes to run a clip. I looked it up and the GTX TITAN X apparently doesn't support mixed precision; try editing line 22 of utils.py and changing torch.float16 to torch.float32.
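
For anyone who prefers not to hand-edit, the same idea can be expressed as an automatic check. A minimal sketch, assuming line 22 of utils.py simply hard-codes the weight dtype (the variable name below is hypothetical): select float16 only when the GPU's compute capability indicates fast native FP16.

```python
import torch

def pick_weight_dtype() -> torch.dtype:
    """Use float16 only on GPUs with fast FP16 (compute capability >= 7.0,
    i.e. Volta/Turing and newer); Maxwell cards like the GTX TITAN X
    (capability 5.2) fall back to float32."""
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability(0)
        if major >= 7:
            return torch.float16
    return torch.float32

# Hypothetical stand-in for the hard-coded dtype at utils.py line 22.
weight_dtype = pick_weight_dtype()
```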

peizhiluo007 commented Oct 28, 2024

> On my 4070 12G it only takes a few minutes to run a clip. I looked it up and the GTX TITAN X apparently doesn't support mixed precision; try editing line 22 of utils.py and changing torch.float16 to torch.float32.

Thanks a lot.
1. After changing it straight to torch.float32, I got an out-of-memory error.
2. I then tried changing to torch.float32 and enabling lowram at the same time. That is indeed almost twice as fast as before: one iteration takes roughly 3-4 minutes, and the 5-second clip finishes in about 20 minutes.

One more question: could this be related to a problem during installation? When I ran pip install facenet-pytorch, it warned "facenet-pytorch 2.6.0 requires torch<2.3.0,>=2.2.0", and uninstalling torch 2.5.0 would break other nodes. Since I already had torch 2.5.0 installed and didn't want to reinstall it, I added --no-deps and installed facenet-pytorch that way. It does run, but could the slowness be related to this?

@peizhiluo007

In any case, thank you very much.
After switching to torch.float32 and enabling lowram, it really is much faster.

@smthemex (Owner)

CUDA versions should in principle be backward compatible, so this is still a half-precision vs. full-precision issue. The main problem with facenet-pytorch is that by default it force-installs plain torch instead of the CUDA build of torch (and it will even uninstall the one you already have if you don't use --no-deps), which is why many ComfyUI users end up unable to start because CUDA can't be found.
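
A generic way to check whether that happened, i.e. whether a CPU-only torch wheel replaced the CUDA build, is to look at the compiled-in CUDA version:

```python
import torch

# A CPU-only wheel reports no compiled-in CUDA version even when a GPU
# and driver are present on the machine.
if torch.version.cuda is None:
    print("CPU-only torch build detected; reinstall a CUDA-enabled wheel.")
elif not torch.cuda.is_available():
    print("torch was built for CUDA", torch.version.cuda,
          "but no usable GPU/driver was found.")
else:
    print("OK:", torch.__version__, "running on", torch.cuda.get_device_name(0))
```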

@peizhiluo007

> CUDA versions should in principle be backward compatible, so this is still a half-precision vs. full-precision issue. The main problem with facenet-pytorch is that by default it force-installs plain torch instead of the CUDA build of torch (and it will even uninstall the one you already have if you don't use --no-deps), which is why many ComfyUI users end up unable to start because CUDA can't be found.

OK, understood. Thanks for the detailed explanation.
