After the hami device-plugin pod restarts, all business pods using vGPU go into an error state #727

Open
huangjiasingle opened this issue Dec 24, 2024 · 2 comments
Labels
kind/bug Something isn't working

Comments

@huangjiasingle

What happened:
When the device-plugin pod restarts on a node, all business pods using vGPU on that node fail.
What you expected to happen:
When the device-plugin pod restarts on a node, all business pods using vGPU should keep working normally.
How to reproduce it (as minimally and precisely as possible):
Create a pod that uses vGPU and, for example, run nvidia-smi every second inside the pod's container. Then restart the device-plugin pod on that node.
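The reproduction loop can be sketched roughly as follows. This is only a minimal illustration, assuming Python is available in the container; the `probe` function and its parameters are hypothetical names, not part of HAMi. It runs a command once per interval and records every run that exits with a non-zero code.

```python
import subprocess
import sys
import time
from datetime import datetime


def probe(cmd=("nvidia-smi",), interval=1.0, iterations=None):
    """Run `cmd` repeatedly and return (timestamp, exit code) for each failed run.

    `iterations=None` loops until interrupted.
    """
    failures = []
    count = 0
    while iterations is None or count < iterations:
        try:
            code = subprocess.run(list(cmd), capture_output=True).returncode
        except OSError:
            code = -1  # the binary could not be started at all
        if code != 0:
            stamp = datetime.now().isoformat(timespec="seconds")
            failures.append((stamp, code))
            print(f"{stamp} {cmd[0]} failed with exit code {code}", file=sys.stderr)
        count += 1
        time.sleep(interval)
    return failures
```

Calling `probe()` inside the container and then restarting the device-plugin pod should make the first failing timestamp visible on stderr.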
Anything else we need to know?:

  • The output of nvidia-smi -a on your host
    normal output.
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
    containerd
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version:
    hami:latest
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
@huangjiasingle huangjiasingle added the kind/bug Something isn't working label Dec 24, 2024
@archlitchi
Collaborator

Yes, restarting the device-plugin replaces the libvgpu.so mounted inside the container. Since the .so file is mmapped into the address space of the container process, this may crash the process and lead to errors.
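To illustrate why replacing a mapped file can affect a running process, here is a small sketch. It does not reproduce how HAMi actually writes libvgpu.so, which this thread does not describe. It assumes a Linux-like system, uses a hypothetical stand-in file, and uses a shared read-only mapping as a simplification of what the dynamic loader does. It compares overwriting the file in place with staging a new file and renaming it over the old path.

```python
import mmap
import os
import tempfile


def mapped_view_after_update(in_place: bool) -> bytes:
    """Map a file, update it on disk, and return the first bytes the old mapping sees."""
    old, new = b"A" * 4096, b"B" * 4096
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "lib.bin")  # hypothetical stand-in for a mounted .so
        with open(path, "wb") as f:
            f.write(old)

        with open(path, "rb") as f:
            # Read-only mapping of the original file contents.
            view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        try:
            if in_place:
                # Overwrite the same inode: the existing mapping is backed by it.
                with open(path, "r+b") as f:
                    f.write(new)
            else:
                # Write a new inode, then atomically rename it over the path.
                staged = path + ".tmp"
                with open(staged, "wb") as f:
                    f.write(new)
                os.replace(staged, path)
            return view[:4]
        finally:
            view.close()
```

In this sketch the in-place overwrite changes the bytes seen through the existing mapping, while the rename leaves the old mapping pointing at the old contents. Whether a rename-based update would apply to the device-plugin is an open question here, since it depends on how the file is mounted into the container.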

@huangjiasingle
Author

@archlitchi This is very dangerous! For issue https://github.com/Project-HAMi/HAMi/issues/710, restarting the device-plugin is the only fix, but doing so makes all business pods fail. For this reason, HAMi is very hard to use in a production environment.
