Compute Canada
To get access to Calcul Québec or Compute Canada resources, you need an account sponsored by a Principal Investigator (PI). Normally, this would be your research director. Contact them to get their Compute Canada Role Identifier (CCRI). To apply for an account, go to the CCDB website.
To log in to the different resources, use ssh with your Compute Canada username and password. The available resources on each cluster and their login endpoints are summarised in the following table. Only the highest-performing nodes are included in the table.
Name | Login Endpoint | Cores* | Memory* | GPUs* |
---|---|---|---|---|
Cedar | cedar.computecanada.ca | 24 | 250G | 4 x NVIDIA P100 Pascal (16 GB) |
Graham | graham.computecanada.ca | 28 | 178G | 8 x NVIDIA V100 Volta (16 GB) |
Niagara | niagara.computecanada.ca | 40 | 202G | None |
Béluga | beluga.computecanada.ca | 40 | 186G | 4 x NVIDIA V100 SXM2 (16 GB) |
* per node
Most of the software you need is already available as a module. By default, modules are not loaded. You can use the command module load <package name> to load a module and module list to list the currently loaded modules. A list of all available modules is available here.
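For example, a typical session on a login node could look like this (the module versions below are only illustrative and may differ on your cluster):
module load python/3.8 # load a specific Python version
module list # show the currently loaded modules
module spider cuda # show detailed information and available versions for a package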
If you need software that is not available as a module, you have multiple options:
- You could build it yourself. In this case, do not use the resources (CPUs, memory) of the login node; instead, submit a job with the adequate resources.
- You can run Docker images through Singularity. More information here (see also the sketch below).
- Contact the technical support to have your module added.
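As a rough sketch, pulling a Docker image with Singularity and running a command inside it could look like the following (the image is only an example, and the exact Singularity module name may vary by cluster):
module load singularity
singularity pull docker://ubuntu:20.04 # creates ubuntu_20.04.sif in the current directory
singularity exec ubuntu_20.04.sif cat /etc/os-release # run a command inside the container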
If you are using Python, you can install packages in virtual environments with pip. Compute Canada recommends avoiding Anaconda because packages installed with it are not optimized for the specific CPU architecture. Instead, you should use virtualenv and pip.
First, to create a virtual environment with a specific Python version, load the version of your choice with module load:
module load python/<version>
For example: module load python/3.8
Then, use virtualenv --no-download /path/to/env to create a virtual environment.
You can activate the environment using:
source /path/to/env/bin/activate
where /path/to/env is the full path to the environment folder (for example, a folder in your home directory).
You can install a package with:
pip install <package name>
If you want to use PyTorch, for example, run pip install torch torchvision once your virtual environment is activated.
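Putting these steps together, a minimal session could look like the following (the environment path and versions are only examples; --no-index tells pip to use the wheels provided by Compute Canada instead of downloading from PyPI):
module load python/3.8
virtualenv --no-download $HOME/myenv
source $HOME/myenv/bin/activate
pip install --no-index --upgrade pip
pip install --no-index torch torchvision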
If you want to know more about Python on Compute Canada, there is exhaustive information about environment creation and Python version installation at this link: Compute Canada Python
Tasks are run through a scheduler, so you need to request your resources and wait for them to become available. In practice, this means running a bash script with Slurm. In this script, you specify the resources needed, the expected running time and the command to run your task. If your time is exceeded, your job will be killed. Once your script is submitted, it is added to a waiting list, and your job will only start when the resources become available.
Simple example to run a job on Graham:
#!/bin/bash
#SBATCH --account=def-someuser # Sponsor account
#SBATCH --nodes=1 # Number of nodes
#SBATCH --gres=gpu:v100:8 # GPU type and number per node
#SBATCH --cpus-per-task=28 # Number of cores per node
#SBATCH --mem=150G # Memory per node
#SBATCH --time=1-00:00 # Running time (DD-HH:MM)
nvidia-smi
Parameters may change depending on the cluster; you should check the Compute Canada wiki page specific to your cluster before submitting a job.
To submit a job, use the sbatch command followed by the path to your script. Use sq to list the status of your jobs and scancel followed by the job ID to cancel one.
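For example (the script name and job ID below are hypothetical):
sbatch my_job.sh # submit the job script; Slurm prints the job ID
sq # list the status of your queued and running jobs
scancel 123456 # cancel the job with ID 123456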
You can also start an interactive job with: salloc --time=1:0:0 --ntasks=2 --account=def-someuser
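Once the allocation is granted, you get a shell on a compute node and can run commands interactively, for example (my_script.py and the environment path are hypothetical):
module load python/3.8
source $HOME/myenv/bin/activate
python my_script.py
exit # release the allocation when done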
The filesystem is divided into three directories: home, scratch, and project.
They all have a quota on space and number of files.
They are accessible on every node of the same cluster, but they are not shared between the different clusters.
Their quotas differ on each cluster. For more specific information, visit this link.
Disk | Capacity | Speed | Backup | Quota |
---|---|---|---|---|
HOME | Low | Low | Yes | User |
PROJECT | High | Low | Yes | Group |
SCRATCH | High | High | No (automatic deletion every 60 days) | User |
Also, every node has its own local storage, which is more suitable for datasets with a lot of small files. In most cases, it is recommended to keep your dataset in a tar or zip archive and to uncompress it to the local storage before starting your training.
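Inside a job script, the node-local storage is available through $SLURM_TMPDIR. A minimal sketch, assuming the dataset was uploaded to your home folder as dataset.tar:
mkdir -p $SLURM_TMPDIR/dataset
tar -xf $HOME/dataset.tar -C $SLURM_TMPDIR/dataset # extract on the fast local disk
python train.py --dataset $SLURM_TMPDIR/dataset # hypothetical training command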
Disk | Capacity |
---|---|
/home (username) | 50G |
/scratch (username) | 20T |
/project (group username) | 1000G |
- diskusage_report: show available space and number of files
- module load or ml load: load a module
- ml or module list: show loaded modules
- ml spider or module spider: find detailed information about a particular package
- sbatch: start a job
- sq: list jobs status
- scancel: kill a job
These examples run the classification task from pytorch_toolbox:
module load python/3.6
virtualenv --no-download $HOME/pyenv
source $HOME/pyenv/bin/activate
pip install torch torchvision --no-index
cd $SCRATCH
git clone https://github.com/MathGaron/pytorch_toolbox.git
Download and copy the dataset to your home folder and run this script with the sbatch
command:
#!/bin/bash
#SBATCH --account=def-someuser # Sponsor account
#SBATCH --gres=gpu:1 # Number of GPU(s) per node
#SBATCH --cpus-per-task=4 # CPU cores/threads
#SBATCH --mem=8000M # memory per node
#SBATCH --time=0-01:00 # time (DD-HH:MM)
module load python/3.6
source $HOME/pyenv/bin/activate
DATASET=$SLURM_TMPDIR/dataset
mkdir -p ${DATASET}/valid
unzip $HOME/train.zip -d $DATASET
# Move 1000 entries from the train dataset to the valid
ls $DATASET/train/ | shuf -n 1000 | xargs -i mv $DATASET/train/{} $DATASET/valid
export PYTHONPATH="${PYTHONPATH}:${SCRATCH}/pytorch_toolbox"
python train.py --output $SCRATCH/output --dataset $DATASET
Training output will be in $SCRATCH/output, and Slurm will create a log file (slurm-<jobid>.out by default) in the directory from which the job was submitted.
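Once the job has started, you can follow the training progress by reading this log file (the job ID is printed by sbatch when the job is submitted):
tail -f slurm-<jobid>.out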
If you want to execute the same task with different parameters (parameter exploration), job arrays are a simple and effective way to submit multiple jobs at once with varying parameters.
The following example shows a simple parameter exploration with an array of 3 tasks.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --account=def-someuser # Sponsor account
#SBATCH --gres=gpu:1 # Number of GPU(s) per node
#SBATCH --cpus-per-task=8 # CPU cores/threads
#SBATCH --mem=16000M # memory per node
#SBATCH --time=0-19:00 # time (DD-HH:MM)
#SBATCH --array=0-2 # $SLURM_ARRAY_TASK_ID takes values from 0 to 2 (inclusive)
module load python/3.8.2
source $HOME/dev_3.8.2/bin/activate
GROUP_SIZE=(64 32 16)
N_UP=(1 1 2)
N_WCSI=(0 1 1)
python train.py --group_size ${GROUP_SIZE[$SLURM_ARRAY_TASK_ID]} --n_upsample_wcsi_blocks ${N_UP[$SLURM_ARRAY_TASK_ID]} --n_wcsi_blocks ${N_WCSI[$SLURM_ARRAY_TASK_ID]}
The individual tasks in the job array are distinguished by an environment variable called $SLURM_ARRAY_TASK_ID: Slurm sets this variable to a different value for each individual task.
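If you want to sweep over a grid of parameters rather than a flat list, a common pattern is to derive several indices from the single task ID with integer arithmetic. A sketch for a hypothetical 3 x 4 grid (12 tasks, submitted with --array=0-11; the flags passed to train.py are illustrative):
LEARNING_RATES=(0.001 0.01 0.1)
BATCH_SIZES=(16 32 64 128)
LR=${LEARNING_RATES[$((SLURM_ARRAY_TASK_ID % 3))]}
BS=${BATCH_SIZES[$((SLURM_ARRAY_TASK_ID / 3))]}
python train.py --learning_rate $LR --batchsize $BS # hypothetical flags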
This example finds the best learning rate for the previous example. It uses a whole node and runs the training in 8 different processes at a time. Each process has its own GPU.
To execute it, follow the steps from the simple training example and run this script with sbatch:
#!/bin/bash
#SBATCH --account=def-someuser # Sponsor account
#SBATCH --gres=gpu:v100:8 # Number of GPU(s) per node
#SBATCH --cpus-per-task=28 # CPU cores/threads
#SBATCH --exclusive
#SBATCH --mem=150G # memory per node
#SBATCH --time=1-00:00 # time (DD-HH:MM)
module load python/3.6
source $HOME/pyenv/bin/activate
DATASET=$SLURM_TMPDIR/dataset
mkdir -p ${DATASET}/valid
unzip $HOME/train.zip -d $DATASET
ls $DATASET/train/ | shuf -n 1000 | xargs -i mv $DATASET/train/{} $DATASET/valid
export PYTHONPATH="${PYTHONPATH}:${SCRATCH}/pytorch_toolbox"
# Enumerate all the values to try
BATCHSIZE=""
for i in $(seq 0.001 0.001 0.01)
do
    BATCHSIZE="${BATCHSIZE}${i}\n"
done
# This command uses GNU Parallel to distribute the GPUs across eight processes at a time.
printf "$BATCHSIZE" | parallel -j8 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) \
python train.py --output '"${SCRATCH}/output"' --dataset '"${DATASET}"' --batchsize {} &> {#}.out'
For more complex cases, SLURM Managed Cluster could be used. It is a Python library made to parallelize your code on a Slurm cluster.
Since an internet connection is required to use the comet.ml experiment tracking Python module, it is necessary to verify whether it is available on the cluster.
Cluster | Availability | Note |
---|---|---|
Béluga | Yes | Comet can be used after loading the httpproxy module: module load httpproxy |
Cedar | Yes | Internet access available |
Graham | No | Internet access disabled |
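For example, on Béluga a job script that logs to comet.ml would load the proxy module before starting the training (a minimal sketch; the script name and environment path are hypothetical):
module load httpproxy
module load python/3.8
source $HOME/myenv/bin/activate
python train_with_comet.py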