--set devicePlugin.deviceMemoryScaling=2 has no effect #705
/assign elrondwong
I changed the parameter and restarted the device plugin with kubectl rollout restart daemonset hami-device-plugin -n kube-system, but GPU memory oversubscription still does not take effect. Could you please take a look?
After investigation, there is no issue with the code. You need to configure the global deviceMemoryScaling setting.
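A minimal way to check whether the global setting is actually in place (a sketch: the ConfigMap and Deployment names below assume a default chart install named hami in kube-system and may differ in your environment):

# List HAMi-related ConfigMaps in the install namespace
kubectl -n kube-system get configmap | grep -i hami

# Inspect the one carrying the device configuration and confirm deviceMemoryScaling is set
# (the ConfigMap name here is an assumption; substitute whatever the previous command shows)
kubectl -n kube-system get configmap hami-scheduler-device -o yaml | grep -i devicememoryscaling

# Restart both components after changing the configuration
kubectl -n kube-system rollout restart daemonset hami-device-plugin
kubectl -n kube-system rollout restart deployment hami-scheduler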
I've run into a new problem. My GPU is a 4090 with 24 GB of VRAM and the host has 62 GB of RAM. I started a pod with 60000 MiB of device memory; inside the pod the GPU looks normal, but the workload fails at runtime. Could you please take a look? torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 784.00 MiB. GPU 0 has a total capacity of 58.59 GiB of which 58.22 GiB is free. Process 167892 has 23.04 GiB memory in use. Of the allocated memory 22.58 GiB is allocated by PyTorch, and 17.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
As I understand it, your GPU has 24 GB of memory, so if your program uses more than 24 GB of GPU memory it will OOM. You have 60 GB of host RAM, but when you use the GPU, the program's data is loaded into GPU memory, not host RAM.
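One way to see what the container is actually given versus what the card physically has (a sketch; the pod name and namespace are placeholders): compare the GPU memory total reported inside the pod with the 24 GB of the 4090.

# GPU memory as reported inside the pod (placeholder pod name and namespace)
kubectl -n default exec -it <pod-name> -- nvidia-smi --query-gpu=memory.total,memory.used --format=csv

# Physical GPU memory as reported on the node itself
nvidia-smi --query-gpu=memory.total --format=csv

If the pod reports roughly 60 GB but allocations beyond roughly 24 GB still fail, the scaling is only changing the reported capacity rather than backing the extra memory with host RAM, which matches the behavior described in this thread.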
I don't think that's right. The point of vGPU oversubscription is that the 24 GB can be split into multiple GPUs, and with oversubscription I can already see 60 GB of GPU memory inside the container, so I should be able to use 60 GB; the underlying layer would simply use host RAM as GPU memory.
@su161021 After enabling GPU memory virtualization, my test results match @elrondwong's understanding, so I think the memory oversubscription HAMi implements is effectively no different from not having it at all. My understanding of memory virtualization is that, once it is enabled, host RAM can be used directly when GPU memory runs out.
helm install hami hami-charts/hami --set devicePlugin.deviceMemoryScaling=2 -n kube-system
This is the command I ran, and it did not enlarge the reported GPU memory. I tested with v2.4.1; every other feature works fine, but the memory-scaling feature has no effect.
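A quick sanity check, assuming the release is named hami and installed into kube-system as in the command above, to confirm whether the --set value actually reached the device plugin:

# Values Helm recorded for the release
helm get values hami -n kube-system

# Whether the device-plugin DaemonSet picked the value up (as an argument or mounted config)
kubectl -n kube-system get daemonset hami-device-plugin -o yaml | grep -i -A 2 memory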