Working Remotely with GPU resources

The NAP lab has several GPU resources connected to it. This page gives you a short introduction to working remotely with these resources.

Resources:

  • Nap01 Server
  • HPC IDUN Cluster (hpc.ntnu.no)


Rules

For the nap01 server, please follow these rules:

  1. Use nvidia-docker to run jobs
  2. When starting a docker container, name the container with {ntnu_username}_...
  3. When creating a docker image, name the image {ntnu_username}/image_name
  4. ALWAYS check nvidia-smi to be certain that nobody is using the GPU you want to use

Connecting to nap01.idi.ntnu.no

You can connect to the server by using ssh:
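For example (ntnu_username is a placeholder for your own NTNU username):

ssh ntnu_username@nap01.idi.ntnu.no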

Nap01 has two NVIDIA V100-32GB GPUs and 2x Intel Xeon Gold 6132 CPUs.

Storage

There are two places to store data on the server:

  • /lhome/ntnu_username: This is a 1TB disk where you should launch your programs from. However, you should NOT store large amounts of data on this disk! This disk is also backed up.
  • /work/ntnu_username: This disk is for storing larger datasets. If you don't have a directory there, contact Frank or Håkon.

Useful Commands

  • df -h: View disk space on the server
  • htop: View CPU/RAM usage on the server
  • nvidia-smi: View VRAM/GPU usage on the V100 cards.

Using docker

For a short introduction to Docker, we recommend reading through the Open AI Lab's tutorial on Docker.

Docker tips & tricks

Docker commands can become very long, with several static settings. To make your life easier, you can create a simple Python script to start a docker container. For example:

{% code title="run_docker" %}

#!/usr/bin/env python3
import sys
import os
import random

gpu_id = sys.argv[1]
python_args = " ".join(sys.argv[2:])

docker_name = str(random.randint(0, 10000))

docker_container = "haakohu_{}".format(docker_name)  # Replace haakohu with your ntnu username
pwd = os.path.dirname(os.path.abspath(__file__))

cmd = [
    "nvidia-docker",
    "run",
    "-u 1123514",  # Set your user ID
    f"--name {docker_container}",  # Set name of docker container
    "--ipc host",  # --ipc=host is recommended by nvidia
    "--rm",  # remove container when exited / killed
    f"-v {pwd}:/workspace",  # mount directories. This mounts the current directory to /workspace in the container
    f"-e CUDA_VISIBLE_DEVICES={gpu_id}",  # Set GPU ID
    "--log-opt max-size=50m",  # Reduce memory usage from logs
    "-it",  # Interactive
    "haakohu/pytorch",  # Docker image
    python_args  # python command
]
command = " ".join(cmd)
print(command)
os.system(command)

{% endcode %}

There are a couple of important settings to change here:

  1. Change the container name: replace the haakohu_ prefix in docker_container with your NTNU username ({NTNU-USERNAME}_...).
  2. Change the -u argument in the cmd list. You can find your user ID by logging onto the server and running id -u ntnu_username, for example id -u haakohu. This prevents the docker container from saving files as root, which can easily mess up the permissions of your project files.
  3. The -v argument mounts folders. In the script, we only mount your current directory to /workspace in the docker container. If you need to mount something else, you can add several -v arguments (see the example after this list).
  4. The docker image (haakohu/pytorch in the script); change this to the image you want to use.
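If you for example also want to mount a dataset directory from the work disk, a hypothetical extra entry in the cmd list could look like this (the paths are placeholders):

f"-v /work/ntnu_username/datasets:/datasets",  # additional mount: dataset directory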

Save this with the filename run_docker and make it executable by running:

chmod +x run_docker

Then, I can start my training script on GPU 0:

./run_docker 0 python -m deep_privacy.train

If you want to start a job without a GPU, you can run:

./run_docker "" python -m deep_privacy.train

For the first example above, the script will execute a docker command similar to:

nvidia-docker run -u 1123514 --name haakohu_5556 --ipc host --rm -v /home/haakohu/DeepPrivacy:/workspace -e CUDA_VISIBLE_DEVICES=0 --log-opt max-size=50m -it haakohu/pytorch python -m deep_privacy.train

Pre-built docker images

NVIDIA GPU Cloud (NGC) has several pre-built docker images for NVIDIA systems:

{% embed url="https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow" %}
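For example, pulling a TensorFlow image from NGC looks roughly like this (the tag is a placeholder; check the catalog for the current tags):

docker pull nvcr.io/nvidia/tensorflow:&lt;tag&gt;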

Mounting a server disk to your local filesystem

Working remotely can be a hassle without mounting the remote filesystem. If you mount a folder to your local computer, you can use your favorite text editor to work on it.

We recommend using sshfs:

{% embed url="https://github.com/libfuse/sshfs" %}
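As a minimal sketch (the local mount point ~/nap01 is just an example), mounting your home directory on nap01 could look like:

mkdir -p ~/nap01
sshfs ntnu_username@nap01.idi.ntnu.no:/lhome/ntnu_username ~/nap01

When you are done, unmount it with fusermount -u ~/nap01 (or umount ~/nap01 on macOS).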

Storage folders on nap01.idi.ntnu.no

For larger datasets, we recommend storing your data on a different disk than the main SSD.

This can be found under /work/ntnu_username. To get a directory for your username, contact Håkon.
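For example, copying a dataset from your local machine to the work disk could look something like this (the dataset path is a placeholder):

rsync -avP ~/datasets/my_dataset ntnu_username@nap01.idi.ntnu.no:/work/ntnu_username/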

Utilizing the full potential of V100 cards

The V100 cards are extremely powerful and require optimized code to realize their full computing potential.

1. nvidia-smi

You can see the utilization of the GPUs by running watch -n 0.5 nvidia-smi. Your code should keep the GPU at 90%+ utilization most of the time.

2. Utilizing tensor cores

The V100 cards have 640 tensor cores, which come with some specific requirements. Most DL libraries will automatically run your operations on the tensor cores if you satisfy the following requirements:

  1. The number of filters in your CNN is divisible by 8
  2. Your batch size is divisible by 8
  3. Your parameters/input data are 16-bit floating point (FP16)

The first two requirements are rather easy to satisfy; however, training a CNN purely in 16-bit floating point is hard. To train your network properly with FP16, you need to use mixed precision training. We therefore recommend the following two resources to get started:

  1. https://devblogs.nvidia.com/video-mixed-precision-techniques-tensor-cores-deep-learning/#part2
  2. https://github.com/NVIDIA/apex - Highly recommended for PyTorch users!

With my code, I got a 220% speed-up without losing any performance.
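As a minimal sketch of what mixed precision training with apex can look like (the toy model, optimizer, and random data below are only placeholders to illustrate the API), the typical changes to a PyTorch training loop are:

{% code title="amp_example.py" %}

import torch
from apex import amp  # NVIDIA apex: https://github.com/NVIDIA/apex

# A toy model and optimizer, just to illustrate the API
model = torch.nn.Linear(64, 64).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Wrap the model and optimizer for mixed precision ("O1" = automatic mixed precision)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for step in range(10):
    inputs = torch.randn(8, 64).cuda()   # batch size divisible by 8
    targets = torch.randn(8, 64).cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss to avoid FP16 underflow before backpropagating
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

{% endcode %}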

3. Profiling your code

If your code is running slowly and you can't find the bottleneck, profiling is your best friend.

You can use tools like nvprof, but there are also profiling tools for the different DL libraries. For PyTorch, there is the module torch.utils.bottleneck: https://pytorch.org/docs/stable/bottleneck.html.
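For example, torch.utils.bottleneck can be run directly on your training script (the script name and arguments here are placeholders):

python -m torch.utils.bottleneck train.py --some-arg value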