Model training is typically a time-consuming step in deep learning development, especially in medical imaging applications. Volumetric medical images are usually large multi-dimensional arrays, and the training process can be complex. Even with powerful hardware (e.g. a CPU/GPU with large RAM), it is not easy to leverage it fully and achieve high performance. NVIDIA GPUs are widely used for deep learning training and evaluation, and CUDA parallel computation offers clear acceleration compared with traditional computation methods. To make full use of GPU features, several mechanisms have become popular, such as automatic mixed precision (AMP) and distributed data parallel training. MONAI supports these features, and this folder provides a fast training guide for achieving the best performance, along with rich examples.
The document explains how to profile the training pipeline, how to analyze the dataset and select suitable algorithms, and how to optimize GPU utilization on a single GPU, multiple GPUs, or even multiple nodes.
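As a rough illustration of the profiling step, the sketch below separates the time spent waiting for data from the time spent in the training step, using only the standard library. `profile_loop`, `slow_loader` and `train_step` are hypothetical names made up for this example, not part of MONAI or PyTorch. GPU work is asynchronous, so host-side timers like these only approximate real GPU time; a dedicated profiler is the better tool for an actual pipeline.

```python
import time


def profile_loop(loader, step_fn):
    """Return total seconds spent fetching batches vs. running step_fn."""
    totals = {"data": 0.0, "compute": 0.0, "steps": 0}
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent waiting for the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # time spent in the training step
        t2 = time.perf_counter()
        totals["data"] += t1 - t0
        totals["compute"] += t2 - t1
        totals["steps"] += 1
    return totals


def slow_loader(n, delay=0.01):
    """Hypothetical loader that simulates slow I/O or preprocessing."""
    for i in range(n):
        time.sleep(delay)
        yield i


def train_step(batch):
    """Hypothetical, deliberately cheap training step."""
    return batch * 2


stats = profile_loop(slow_loader(5), train_step)
# A data time much larger than the compute time suggests an input-bound pipeline.
print(stats)
```

When the data time dominates, caching or background loading is usually the first thing to try; when the compute time dominates, techniques such as AMP are more likely to help.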
The examples show how to execute distributed training and evaluation based on 3 different frameworks:
- PyTorch native `DistributedDataParallel` module with `torch.distributed.launch`.
- Horovod APIs with `horovodrun`.
- PyTorch ignite and MONAI workflows.
They can run on several distributed nodes with multiple GPU devices on every node.
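All three approaches rely on the same data-parallel idea: every process works on its own slice of the dataset. The sketch below shows one common way to do that split, a strided partition with wrap-around padding. `shard_indices` is a hypothetical helper written for this example, and shuffling is left out for brevity; in practice the framework's own distributed sampler does this job.

```python
import math


def shard_indices(num_samples, rank, world_size):
    """Return the sample indices that one process (rank) would handle."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    # Pad by wrapping around so that every rank gets the same number of samples.
    indices = [i % num_samples for i in range(total)]
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total:world_size]


for rank in range(4):
    print(rank, shard_indices(10, rank, 4))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6, 0]
# 3 [3, 7, 1]
```

The padding means a few samples are seen twice per epoch when the dataset size is not divisible by the number of processes, which keeps all ranks in step.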
The training speed and memory usage are also compared with and without AMP.
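To give a feel for where the memory saving comes from, and why mixed-precision training usually pairs half precision with loss scaling, here is a standard-library sketch. It is not how MONAI or PyTorch implement AMP; `to_fp16` is a hypothetical helper and the scale factor is an arbitrary power of two chosen for the illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Storage: half precision needs 2 bytes per value, single precision 4.
bytes_fp16 = struct.calcsize("<e")
bytes_fp32 = struct.calcsize("<f")

# Underflow: a tiny gradient is lost when stored directly in half precision.
grad = 1e-8
naive = to_fp16(grad)

# Loss scaling: scale up before the cast, then divide by the scale in higher precision.
scale = 65536.0  # hypothetical scale factor, a power of two
recovered = to_fp16(grad * scale) / scale

print(bytes_fp16, bytes_fp32, naive, recovered)
```

The direct cast flushes the gradient to zero, while the scaled path recovers it to within half-precision rounding error.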
This notebook compares the performance of `Dataset`, `CacheDataset` and `PersistentDataset`. These classes differ in how data is stored (in memory or on disk) and at which point transforms are applied.
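The difference between the three strategies can be sketched in plain Python. The classes below are simplified stand-ins written for this example (`PlainDataset`, `InMemoryCacheDataset` and `DiskCacheDataset` are hypothetical names), not the MONAI implementations. They assume a deterministic transform, since random augmentations cannot be cached and must run on every access.

```python
import os
import pickle
import tempfile


class PlainDataset:
    """Applies the transform on every access; nothing is stored."""

    def __init__(self, data, transform):
        self.data, self.transform = data, transform

    def __getitem__(self, i):
        return self.transform(self.data[i])


class InMemoryCacheDataset(PlainDataset):
    """Applies the transform once up front and keeps the results in RAM."""

    def __init__(self, data, transform):
        super().__init__(data, transform)
        self.cache = [transform(x) for x in data]

    def __getitem__(self, i):
        return self.cache[i]


class DiskCacheDataset(PlainDataset):
    """Applies the transform on first access and stores the result on disk."""

    def __init__(self, data, transform, cache_dir):
        super().__init__(data, transform)
        self.cache_dir = cache_dir

    def __getitem__(self, i):
        path = os.path.join(self.cache_dir, f"{i}.pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        result = self.transform(self.data[i])
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result


calls = {"n": 0}


def expensive_transform(x):
    """Hypothetical deterministic transform that counts its invocations."""
    calls["n"] += 1
    return x * x


def count_calls(dataset_factory, epochs=3):
    """Build a dataset, read every item for several epochs, count transform calls."""
    calls["n"] = 0
    ds = dataset_factory()
    outputs = [[ds[i] for i in range(4)] for _ in range(epochs)]
    return calls["n"], outputs[-1]


data = [1, 2, 3, 4]
plain_calls, plain_out = count_calls(lambda: PlainDataset(data, expensive_transform))
mem_calls, mem_out = count_calls(lambda: InMemoryCacheDataset(data, expensive_transform))
with tempfile.TemporaryDirectory() as tmp:
    disk_calls, disk_out = count_calls(lambda: DiskCacheDataset(data, expensive_transform, tmp))

print(plain_calls, mem_calls, disk_calls)  # 12 4 4
```

Over three epochs the plain dataset runs the transform 12 times, while both caching variants run it only 4 times. The in-memory cache pays that cost at construction and is limited by RAM; the disk cache pays it on first access and can be reused by later runs if the cache directory is kept.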
This tutorial compares the training performance of a pure PyTorch program with that of an optimized MONAI program, running on an NVIDIA GPU with the latest CUDA library.
The optimization methods mainly include `AMP`, `CacheDataset` and `Novograd`.
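For readers unfamiliar with Novograd, the sketch below shows one update step for a single layer, following the update rule as described in the NovoGrad paper: the second moment is a single scalar per layer built from the squared gradient norm, and the normalized gradient feeds a momentum buffer. `novograd_step` is a hypothetical function written for this example, the hyperparameter values are placeholders, and real implementations differ in details such as where epsilon is added.

```python
import math


def novograd_step(w, g, m, v, lr=0.1, beta1=0.95, beta2=0.98, eps=1e-8, weight_decay=0.0):
    """One NovoGrad-style update for a single layer given as flat lists.

    m is the per-weight first moment (None on the first step);
    v is the scalar, per-layer second moment (None on the first step).
    """
    g_norm_sq = sum(x * x for x in g)
    # Second moment: one scalar per layer, built from the squared gradient norm.
    v = g_norm_sq if v is None else beta2 * v + (1.0 - beta2) * g_norm_sq
    denom = math.sqrt(v) + eps
    # Normalize the gradient by the layer-wise moment, then add weight decay.
    update = [gi / denom + weight_decay * wi for gi, wi in zip(g, w)]
    # First moment: momentum over the normalized updates.
    m = update if m is None else [beta1 * mi + ui for mi, ui in zip(m, update)]
    w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w, m, v


w, m, v = novograd_step([1.0, 2.0], [3.0, 4.0], None, None)
print(w, m, v)
```

With a gradient of norm 5, the first step normalizes it to roughly `[0.6, 0.8]`, so the weights move by about `lr` times a unit-length direction regardless of the raw gradient scale.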
This notebook is a quick demo of device support: it runs the Ignite trainer engine on CPU, GPU and multiple GPUs.
Demonstrates the use of the `ThreadBuffer` class, which generates data batches during training in a separate thread.
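The general idea of producing batches in a background thread can be sketched with the standard library alone. The `buffered` generator below is a hypothetical stand-in written for this example and is not the `ThreadBuffer` API; it assumes the source iterable is safe to consume from another thread.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the source iterable


def buffered(source, buffer_size=1):
    """Yield items from source while a background thread prefetches the next ones."""
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in source:
            q.put(item)  # blocks when the buffer is full
        q.put(_SENTINEL)

    # Daemon thread: it will not keep the program alive if the consumer stops early.
    thread = threading.Thread(target=producer, daemon=True)
    thread.start()

    while True:
        item = q.get()
        if item is _SENTINEL:
            break
        yield item
    thread.join()


batches = list(buffered(iter(range(5)), buffer_size=2))
print(batches)  # [0, 1, 2, 3, 4]
```

While the consumer is busy with one batch (for example, a training step on the GPU), the producer thread can already prepare the next one, which hides part of the data-loading latency.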
Illustrates reading NIfTI files and tests the speed of different transforms on different devices.