GPU

Installing NVIDIA GPU Software

GPU Driver install

NVIDIA publishes several sets of instructions for installing the needed software, each slightly different. This page is the best reference.

The "Network Repo Installation" variant of the instructions works well. You can pin a specific version of the software when installing the RPMs (e.g. cuda-12-2 instead of plain cuda, which installs the latest).
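As a small illustration of the versioned package naming described above, the sketch below builds the package name from a CUDA version and prints the install command instead of running it. The helper name `cuda_pkg_name` is hypothetical, and whether a given version is actually published in the repo you configured still needs to be checked.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a CUDA version such as "12.2" into the
# versioned package name mentioned above ("cuda-12-2").
cuda_pkg_name() {
    local version="$1"
    # Replace every dot with a dash and prefix with "cuda-".
    printf 'cuda-%s\n' "${version//./-}"
}

# Print the command for review rather than executing it.
echo "yum install -y $(cuda_pkg_name 12.2)"
```

Run the printed command as root once the package name looks right.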

nvidia-container-runtime

To use GPUs from within a Docker container, nvidia-container-runtime needs to be installed.

You need to be careful to get a version of nvidia-container-runtime that is compatible with the CUDA version you have installed. I don't know of a systematic way of figuring this out. Generally, newer versions of CUDA work with the latest nvidia-container-runtime. For older CUDA versions you will have to try and see (the latest version is a fine place to start).

It is confusing which of the packages nvidia-docker2, nvidia-container-toolkit, and nvidia-container-runtime are needed. This comment clears up the confusion: https://github.com/NVIDIA/nvidia-docker/issues/1268#issuecomment-632692949. We only need nvidia-container-toolkit, since we don't use Kubernetes and have Docker > 19.03.

curl -s -L https://nvidia.github.io/nvidia-container-runtime/centos7/nvidia-container-runtime.repo \
    | install -m 444 /dev/stdin /etc/yum.repos.d/nvidia-container-runtime.repo
yum install -y nvidia-container-runtime
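The Docker version requirement mentioned above can be checked before relying on the `--gpus` flag. This is a minimal sketch: `version_ge` is a hypothetical helper built on GNU `sort -V`, and treating 19.03 itself as sufficient is an assumption.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version A is greater than or equal to B.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="19.03"   # assumption: 19.03 itself is treated as sufficient
if command -v docker >/dev/null 2>&1; then
    # Ask the daemon for its version; stays empty if the daemon is unreachable.
    installed="$(docker version --format '{{.Server.Version}}' 2>/dev/null || true)"
    if [ -n "$installed" ] && version_ge "$installed" "$required"; then
        echo "Docker $installed should support --gpus"
    else
        echo "Docker '${installed:-unknown}' may be too old for --gpus" >&2
    fi
fi
```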

Upgrading

When upgrading from one CUDA version to another, I've had the best luck uninstalling all NVIDIA software, rebooting, and then installing from scratch. In theory, upgrading without uninstalling should work, but I've run into problems (e.g. some packages don't update and end up incompatible).

To uninstall all NVIDIA software, first list what is installed:

yum list installed | grep -E 'nvidia|cuda'

You should be able to uninstall every package in that list, but verify that the packages look correct before removing them.
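One way to turn that list into something reviewable is sketched below. The helper name `filter_gpu_packages` and the file `/tmp/gpu-packages.txt` are hypothetical, and the removal step is left as a comment so nothing is uninstalled until the list has been checked.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read package names on stdin and keep only the
# ones that mention nvidia or cuda, sorted and de-duplicated.
filter_gpu_packages() {
    grep -E 'nvidia|cuda' | sort -u
}

# On an RPM-based host, write the candidate list to a file for review.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa --qf '%{NAME}\n' | filter_gpu_packages > /tmp/gpu-packages.txt || true
    cat /tmp/gpu-packages.txt
fi

# After checking the file by hand, remove the packages (as root) with:
#   xargs -r yum remove -y < /tmp/gpu-packages.txt
```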

Install an older version of CUDA

These are instructions for installing an older version of CUDA, specifically 11.4 (since that is what IndeX requires), but the general pattern should work for any old version.

These instructions use the "rpm local" method of installation. The "rpm network" method is preferable, so it should be attempted first. It is just a matter of following the instructions above but installing a specific version instead of the latest (e.g. yum install cuda-11-4 instead of yum install cuda). If the repos don't have the version you want, or there is some other problem, the instructions below can be used.

https://developer.nvidia.com/cuda-11-4-4-download-archive?target_os=Linux&target_arch=x86_64&Distribution=CentOS&target_version=7&target_type=rpm_local

# need kernel source which is always the latest so do update first
yum update -y
yum install -y kernel-devel
# if new kernel, then
reboot
yum remove $(rpm -qa | grep ^kernel-3 | grep -v $(uname -r))
wget https://developer.download.nvidia.com/compute/cuda/11.4.4/local_installers/cuda-repo-rhel7-11-4-local-11.4.4_470.82.01-1.x86_64.rpm
rpm -i cuda-repo-rhel7-11-4-local-11.4.4_470.82.01-1.x86_64.rpm
yum clean all
yum install -y nvidia-driver-latest-dkms cuda
yum -y install cuda-drivers
reboot
nvidia-smi # verify it says CUDA 11.4
curl -s -L https://nvidia.github.io/libnvidia-container/centos7/libnvidia-container.repo \
    | install -m 444 /dev/stdin /etc/yum.repos.d/nvidia-container-toolkit.repo
yum clean expire-cache
yum install -y nvidia-container-toolkit
systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.4.0-base nvidia-smi
docker image rm nvidia/cuda:11.4.0-base
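The manual `nvidia-smi` check in the steps above can also be scripted. This is a sketch only: `cuda_version_from_smi` is a hypothetical helper, and it assumes the `nvidia-smi` banner contains a field of the form `CUDA Version: 11.4`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull the "CUDA Version" value out of the
# nvidia-smi banner, which is read from stdin.
cuda_version_from_smi() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

expected="11.4"
if command -v nvidia-smi >/dev/null 2>&1; then
    actual="$(nvidia-smi | cuda_version_from_smi)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: driver reports CUDA $actual"
    else
        echo "Mismatch: expected CUDA $expected, driver reports '${actual}'" >&2
    fi
else
    echo "nvidia-smi not found; is the driver installed?" >&2
fi
```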

Verify

docker run -it --gpus=all --net=host --rm tensorflow/tensorflow:latest-gpu python -c 'import tensorflow; print(tensorflow.config.experimental.list_physical_devices("GPU"))'
<snip>
Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
<snip>
