Compute Canada
To get access to Calcul Québec or Compute Canada resources, you need an account sponsored by a Principal Investigator (PI). Normally, this would be your research director. Contact them to get their Compute Canada Role Identifier (CCRI). To apply for an account, go to the CCDB website.
To log in to the different resources, use ssh with your Compute Canada username and password. The available resources on each cluster and their login endpoints are summarised in the following table. Only the highest-performing nodes are included in the table.
Name | Login Endpoint | Cores* | Memory* | GPUs* |
---|---|---|---|---|
Cedar | cedar.computecanada.ca | 24 | 250G | 4 x NVIDIA P100 Pascal (16 GB) |
Graham | graham.computecanada.ca | 28 | 178G | 8 x NVIDIA V100 Volta (16 GB) |
Niagara | niagara.computecanada.ca | 40 | 202G | None |
Béluga | beluga.computecanada.ca | 40 | 186G | 4 x NVIDIA V100 SXM2 (16 GB) |
* per node
Most of the software you need is already available as a module. By default, modules are not loaded. You can use the command module load <package name> to load a module and module list to list the currently loaded modules. A list of all available modules is available here.
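For example, a typical session on a login node could look like this (the module versions below are only illustrative and may differ on your cluster):
module load python/3.8 # load a specific Python version
module list # show the currently loaded modules
module spider cuda # show detailed information and available versions for a package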
If you need software that is not available as a module, you have multiple options:
- You could build it yourself. In this case, do not use the resources (CPUs, memory) of the login node; instead, submit a job with the adequate resources.
- You can run Docker images through Singularity. More information here (see also the sketch below).
- Contact the technical support to have your module added.
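As a rough sketch, pulling a Docker image with Singularity and running a command inside it could look like the following (the image is only an example, and the exact Singularity module name may vary by cluster):
module load singularity
singularity pull docker://ubuntu:20.04 # creates ubuntu_20.04.sif in the current directory
singularity exec ubuntu_20.04.sif cat /etc/os-release # run a command inside the container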
If you are using Python, you can install packages in virtual environments with pip. Compute Canada recommends avoiding Anaconda because packages installed with it are not optimized for the specific CPU architecture. Instead, you should use virtualenv and pip.
First, to create a virtual environment with a specific Python version, load the version of your choice with module load:
module load python/<version>
For example: module load python/3.8
Then, use virtualenv --no-download /path/to/env to create a virtual environment.
You can activate the environment using:
source /path/to/env/bin/activate
where /path/to/env is the full path to the environment folder (for example, a folder in your home directory).
You can install a package with:
pip install <package name>
If you want to use PyTorch, for example, run pip install torch torchvision once your virtual environment is activated.
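Putting these steps together, a minimal session could look like the following (the environment path and versions are only examples; --no-index tells pip to use the wheels provided by Compute Canada instead of downloading from PyPI):
module load python/3.8
virtualenv --no-download $HOME/myenv
source $HOME/myenv/bin/activate
pip install --no-index --upgrade pip
pip install --no-index torch torchvision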
If you want to know more about Python on Compute Canada, there is exhaustive information about environment creation and Python version installation at this link: Compute Canada Python
Tasks are run through a scheduler, so you need to request your resources and wait for them to become available. In practice, this means running a bash script with Slurm. In this script, you specify the resources needed, the expected running time and the command to run your task. If your time is exceeded, your job will be killed. Once your script is submitted, it is added to a waiting list, and your job will only start when the resources become available.
Simple example to run a job on Graham:
#!/bin/bash
#SBATCH --account=def-someuser # Sponsor account
#SBATCH --nodes=1 # Number of nodes
#SBATCH --gres=gpu:v100:8 # GPU type and number per node
#SBATCH --cpus-per-task=28 # Number of cores per node
#SBATCH --mem=150G # Memory per node
#SBATCH --time=1-00:00 # Running time (DD-HH:MM)
nvidia-smi
Parameters may change depending on the cluster; you should check the Compute Canada wiki page specific to your cluster before submitting a job.
To submit a job, use the sbatch command followed by the path to your script. Use sq to list the status of your jobs and scancel followed by the job ID to cancel one.
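For example (the script name and job ID below are hypothetical):
sbatch my_job.sh # submit the job script; Slurm prints the job ID
sq # list the status of your queued and running jobs
scancel 123456 # cancel the job with ID 123456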
You can also start an interactive job with: salloc --time=1:0:0 --ntasks=2 --account=def-someuser
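Once the allocation is granted, you get a shell on a compute node and can run commands interactively, for example (my_script.py and the environment path are hypothetical):
module load python/3.8
source $HOME/myenv/bin/activate
python my_script.py
exit # release the allocation when done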
The filesystem is divided into three directories: home, scratch, and project.
They all have a quota on space and number of files.
They are accessible on every node of the same cluster, but they are not shared between the different clusters.
Their quotas differ on each cluster. For more specific information, visit this link.
Disk | Capacity | Speed | Backup | Quota |
---|---|---|---|---|
HOME | Low | Low | Yes | User |
PROJECT | High | Low | Yes | Group |
SCRATCH | High | High | No (automatic deletion every 60 days) | User |
Also, every node has its own local storage, which is more suitable for datasets with a lot of small files. In most cases, it is recommended to keep your dataset in a tar or zip archive and to uncompress it to the local storage before starting your training.
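Inside a job script, the node-local storage is available through $SLURM_TMPDIR. A minimal sketch, assuming the dataset was uploaded to your home folder as dataset.tar:
mkdir -p $SLURM_TMPDIR/dataset
tar -xf $HOME/dataset.tar -C $SLURM_TMPDIR/dataset # extract on the fast local disk
python train.py --dataset $SLURM_TMPDIR/dataset # hypothetical training command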
Disk | Capacity |
---|---|
/home (username) | 50G |
/scratch (username) | 20T |
/project (group username) | 1000G |
- diskusage_report: show available space and number of files
- module load or ml load: load a module
- ml or module list: show loaded modules
- ml spider or module spider: find detailed information about a particular package
- sbatch: start a job
- sq: list jobs status
- scancel: kill a job
These examples run the classification task from pytorch_toolbox:
module load python/3.6
virtualenv --no-download $HOME/pyenv
source $HOME/pyenv/bin/activate
pip install torch torchvision --no-index
cd $SCRATCH
git clone https://github.com/MathGaron/pytorch_toolbox.git
Download and copy the dataset to your home folder and run this script with the sbatch
command:
#!/bin/bash
#SBATCH --account=def-someuser # Sponsor account
#SBATCH --gres=gpu:1 # Number of GPU(s) per node
#SBATCH --cpus-per-task=4 # CPU cores/threads
#SBATCH --mem=8000M # memory per node
#SBATCH --time=0-01:00 # time (DD-HH:MM)
module load python/3.6
source $HOME/pyenv/bin/activate
DATASET=$SLURM_TMPDIR/dataset
mkdir -p ${DATASET}/valid
unzip $HOME/train.zip -d $DATASET
# Move 1000 entries from the train dataset to the valid
ls $DATASET/train/ | shuf -n 1000 | xargs -i mv $DATASET/train/{} $DATASET/valid
export PYTHONPATH="${PYTHONPATH}:${SCRATCH}/pytorch_toolbox"
python train.py --output $SCRATCH/output --dataset $DATASET
Training output will be in $SCRATCH/output, and Slurm will create a log file (slurm-<jobid>.out by default) in the directory from which the job was submitted.
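Once the job has started, you can follow the training progress by reading this log file (the job ID is printed by sbatch when the job is submitted):
tail -f slurm-<jobid>.out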
If you want to execute the same task with different parameters (parameter exploration), job arrays are a simple and effective way to submit multiple jobs at once with varying parameters.
The following example shows a simple parameter exploration with an array of 3 tasks.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --account=def-someuser # Sponsor account
#SBATCH --gres=gpu:1 # Number of GPU(s) per node
#SBATCH --cpus-per-task=8 # CPU cores/threads
#SBATCH --mem=16000M # memory per node
#SBATCH --time=0-19:00 # time (DD-HH:MM)
#SBATCH --array=0-2 # $SLURM_ARRAY_TASK_ID takes values from 0 to 2 (inclusive)
module load python/3.8.2
source $HOME/dev_3.8.2/bin/activate
GROUP_SIZE=(64 32 16)
N_UP=(1 1 2)
N_WCSI=(0 1 1)
python train.py --group_size ${GROUP_SIZE[$SLURM_ARRAY_TASK_ID]} --n_upsample_wcsi_blocks ${N_UP[$SLURM_ARRAY_TASK_ID]} --n_wcsi_blocks ${N_WCSI[$SLURM_ARRAY_TASK_ID]}
The individual tasks in the job array are distinguished by an environment variable called $SLURM_ARRAY_TASK_ID: Slurm sets this variable to a different value for each individual task.
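If you want to sweep over a grid of parameters rather than a flat list, a common pattern is to derive several indices from the single task ID with integer arithmetic. A sketch for a hypothetical 3 x 4 grid (12 tasks, submitted with --array=0-11; the flags passed to train.py are illustrative):
LEARNING_RATES=(0.001 0.01 0.1)
BATCH_SIZES=(16 32 64 128)
LR=${LEARNING_RATES[$((SLURM_ARRAY_TASK_ID % 3))]}
BS=${BATCH_SIZES[$((SLURM_ARRAY_TASK_ID / 3))]}
python train.py --learning_rate $LR --batchsize $BS # hypothetical flags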
This example finds the best learning rate for the previous example. It uses a whole node and runs the training in 8 different processes at a time. Each process has its own GPU.
To execute it, follow the steps from the simple training example and run this script with sbatch:
#!/bin/bash
#SBATCH --account=def-someuser # Sponsor account
#SBATCH --gres=gpu:v100:8 # Number of GPU(s) per node
#SBATCH --cpus-per-task=28 # CPU cores/threads
#SBATCH --exclusive
#SBATCH --mem=150G # memory per node
#SBATCH --time=1-00:00 # time (DD-HH:MM)
module load python/3.6
source $HOME/pyenv/bin/activate
DATASET=$SLURM_TMPDIR/dataset
mkdir -p ${DATASET}/valid
unzip $HOME/train.zip -d $DATASET
ls $DATASET/train/ | shuf -n 1000 | xargs -i mv $DATASET/train/{} $DATASET/valid
export PYTHONPATH="${PYTHONPATH}:${SCRATCH}/pytorch_toolbox"
# Enumerate all the values to try
BATCHSIZE=""
for i in $(seq 0.001 0.001 0.01)
do
    BATCHSIZE="${BATCHSIZE}${i}\n"
done
# This command uses GNU Parallel to distribute the GPUs across eight processes at a time.
printf "$BATCHSIZE" | parallel -j8 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) \
python train.py --output '"${SCRATCH}/output"' --dataset '"${DATASET}"' --batchsize {} &> {#}.out'
For more complex cases, SLURM Managed Cluster could be used. It is a Python library made to parallelize your code on a Slurm cluster.
Since an internet connection is required to use the comet.ml experiment tracking Python module, it is necessary to verify whether it is available on the cluster.
Cluster | Availability | Note |
---|---|---|
Béluga | Yes | Comet can be used after loading the httpproxy module: module load httpproxy |
Cedar | Yes | Internet access available |
Graham | No | Internet access disabled |
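For example, on Béluga a job script that logs to comet.ml would load the proxy module before starting the training (a minimal sketch; the script name and environment path are hypothetical):
module load httpproxy
module load python/3.8
source $HOME/myenv/bin/activate
python train_with_comet.py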