What happened:
When the device-plugin pod restarts on a node, all business pods using vGPU on that node start to fail.
What you expected to happen:
When the device-plugin pod restarts on a node, all business pods using vGPU should keep working normally.
How to reproduce it (as minimally and precisely as possible):
Create a pod that uses vGPU and, for example, run nvidia-smi every second inside the pod's container. Then restart the device-plugin pod on that node.
Anything else we need to know?:
The output of nvidia-smi -a on your host
normal output.
Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
containerd
The hami-device-plugin container logs
The hami-scheduler container logs
The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
Any relevant kernel output lines from dmesg
Environment:
HAMi version:
hami:latest
nvidia driver or other AI device driver version:
Docker version from docker version
Docker command, image and tag used
Kernel version from uname -a
Others:
Yes. Restarting the device plugin replaces the libvgpu.so file that is mounted inside the container. Because that .so file is mmapped into the address space of the container's processes, replacing it can crash those processes and cause the errors you are seeing.
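To illustrate the mechanism described above, here is a minimal Python sketch (not HAMi code) that compares two ways of updating a file while it is mapped. The file name `libdemo.so` and the helpers `overwrite_in_place`, `replace_atomically`, and `mapped_view_after` are hypothetical names made up for this example. The thread does not say how the device plugin actually writes libvgpu.so, so treat the in-place variant as an assumption about what may be happening. The sketch assumes Linux and uses a shared read-only mapping for simplicity.

```python
import mmap
import os
import tempfile


def overwrite_in_place(path: str, data: bytes) -> None:
    """Rewrite the bytes of the existing inode; live mappings see the change."""
    with open(path, "r+b") as f:
        f.write(data)


def replace_atomically(path: str, data: bytes) -> None:
    """Write a new file next to the target, then rename it over the target.

    The old inode stays alive for as long as some process still maps it.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        # Clean up the temporary file if the rename never happened.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def mapped_view_after(update, old: bytes, new: bytes) -> bytes:
    """Map a file holding `old`, apply `update`, and return what the mapping shows."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "libdemo.so")  # hypothetical stand-in for a shared library
        with open(path, "wb") as f:
            f.write(old)
        with open(path, "rb") as f:
            view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            update(path, new)
            return view[: len(old)]
        finally:
            view.close()


if __name__ == "__main__":
    print(mapped_view_after(overwrite_in_place, b"AAAA", b"BBBB"))
    print(mapped_view_after(replace_atomically, b"AAAA", b"BBBB"))
```

With the in-place overwrite, the process that already mapped the file sees its contents change underneath it, which for a loaded shared library means its code can change mid-execution. With write-then-rename, the existing mapping keeps pointing at the old inode and only newly started processes pick up the new file. Whether this approach fits HAMi's mount setup is not covered in this thread.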