[Containerd][Nvidia runtime] Missing CRI plugin option for the nvidia container toolkit support #112

Open
mimnix opened this issue Feb 5, 2025 · 0 comments
Labels
bug Something isn't working

Comments

mimnix commented Feb 5, 2025

Hi guys,
on a GPU-powered Kubernetes node (v1.29.10), with the nvidia runtime set as the default runtime, containers kept crashing in an infinite loop.
I regenerated the patched containerd config.toml with:

nvidia-ctk runtime configure --runtime=containerd
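
Side note: if needed, the same command can also mark nvidia as the default runtime in one go (the flag should be --set-as-default):

nvidia-ctk runtime configure --runtime=containerd --set-as-default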

Comparing the output, I realized there's a drift from the configuration provided by the containerd package of this module. The configuration rendered by nvidia-ctk is:

oom_score = 0
root = "/var/lib/containerd"
state = "/run/containerd"
version = 2

[debug]
  level = "info"

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[metrics]
  address = ""
  grpc_histogram = false

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    max_container_log_line_size = 16384
    sandbox_image = "registry.sighup.io/fury/on-premises/pause:3.9"

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          base_runtime_spec = ""
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          base_runtime_spec = ""
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true

    [plugins."io.containerd.grpc.v1.cri".registry]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]

As you can see, the option SystemdCgroup = true is included under the [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] section. Also, the runc snippet is not removed. I propose adding at least the missing plugin option to the upstream config.toml.j2 Jinja template, as sketched below.
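
For clarity, a rough sketch of how the addition could look in config.toml.j2 (the surrounding template structure is my assumption; the options table itself is taken from the nvidia-ctk output above):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  base_runtime_spec = ""
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
    # the missing option: use the systemd cgroup driver for the nvidia runtime,
    # consistent with the runc runtime options above
    SystemdCgroup = true

With this in place the nvidia runtime uses the same cgroup driver as runc, which is what the kubelet expects when its cgroupDriver is set to systemd.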

Tested on:

  • Fury 1.29 legacy
  • On prem module v1.31.4
  • Nvidia container toolkit v1.14.6
  • Node with Nvidia 1080Ti
mimnix added the bug label on Feb 5, 2025