Adding best practices for large scale deep learning (#2144)
Adding best-practices for large-scale deep learning workloads.
rtanase authored Mar 21, 2023
1 parent 63e3935 commit 2cbb042
Showing 35 changed files with 3,163 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "best-practices/largescale-deep-learning/Training/Bloom-Pretrain/src/Megatron-DeepSpeed"]
path = best-practices/largescale-deep-learning/Training/Bloom-Pretrain/src/Megatron-DeepSpeed
url = https://github.com/bigscience-workshop/Megatron-DeepSpeed.git
581 changes: 581 additions & 0 deletions best-practices/largescale-deep-learning/Data-loading/data-loading.md

Large diffs are not rendered by default.

Binary file not shown.
91 changes: 91 additions & 0 deletions best-practices/largescale-deep-learning/Environment/ACPT.md
@@ -0,0 +1,91 @@
## Optimized Environment for large scale distributed training

To run optimized and significantly faster training and inference for large models on AzureML, we recommend the new Azure Container for PyTorch (ACPT) environment, which includes the best of Microsoft's technologies for training with PyTorch on Azure. In addition to the AzureML packages, this environment includes the latest training optimization technologies: [ONNX / ONNX Runtime / ONNX Runtime Training](https://onnxruntime.ai/), [ORT MoE](https://github.com/pytorch/ort/tree/main/ort_moe), [DeepSpeed](https://www.deepspeed.ai/), [MSCCL](https://github.com/microsoft/msccl), Nebula checkpointing, and others that significantly boost performance.

## Azure Container for PyTorch (ACPT)

### Curated Environment

Multiple ready-to-use curated images with the latest PyTorch, CUDA, and Ubuntu versions are published for the [ACPT curated environment](https://learn.microsoft.com/en-us/azure/machine-learning/resource-curated-environments#azure-container-for-pytorch-acpt-preview). You can find the ACPT curated environments by filtering for “ACPT” in the Studio:

![image](https://user-images.githubusercontent.com/39776179/217119432-0418209c-d8e9-49c6-b47d-3612a517e47b.png)

Once you’ve selected a curated environment that has the packages you need, you can refer to it in your YAML file. For example, if you want to use one of the ACPT curated environments, your command job YAML file might look like the following:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
description: Trains a 175B GPT model
experiment_name: "large_model_experiment"
compute: azureml:cluster-gpu
code: ../../src
environment: azureml:AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu@latest
environment_variables:
  NCCL_DEBUG: 'WARN'
  NCCL_DEBUG_SUBSYS: 'WARN'
  CUDA_DEVICE_ORDER: 'PCI_BUS_ID'
  NCCL_SOCKET_IFNAME: 'eth0'
  NCCL_IB_PCI_RELAXED_ORDERING: '1'
```
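If you save this job spec as, for example, `job.yml` (a hypothetical file name), you can submit it with the Azure ML CLI v2:
```sh
# Submit the command job defined above (assumes the Azure CLI with the ml extension
# and a default workspace/resource group are already configured)
az ml job create --file job.yml
```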

You can also use the SDK to specify the environment name:
```python
from azure.ai.ml import command

job = command(
    # ... other job parameters (code, command, compute, etc.)
    environment="AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu@latest",
)
```
Appending the `@latest` tag to the end of the environment name pulls the latest image. If you want to pin a specific curated environment version, you can specify it using the following syntax:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
...
environment: azureml:AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu:3
...
```

### Custom Environment
If you are looking to extend a curated environment, for example to install Hugging Face transformers or datasets, you can create a new environment from a Docker build context that uses the ACPT curated environment as its base image and adds the extra packages on top of it, as shown below:

![image](https://user-images.githubusercontent.com/39776179/217162558-235fe518-734d-4b89-8940-71dd4744dda1.png)

In the new Docker context, use the ACPT curated environment as the base image and `pip install` transformers, datasets, and any other packages you need.
![image](https://user-images.githubusercontent.com/39776179/217162413-643ef5ce-ebee-4dfe-bc42-c6b7fa60250b.png)
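The screenshots above show the Docker context in the Studio UI. As a rough sketch, the Dockerfile might look like the following; the base image path and tag here are assumptions, so copy the exact image reference from the curated environment's details page in the Studio:
```dockerfile
# Base image: an ACPT curated environment (the path/tag below is illustrative;
# use the exact image reference shown for the curated environment in your workspace)
FROM mcr.microsoft.com/azureml/curated/acpt-pytorch-1.12-py39-cuda11.6-gpu:latest

# Extra packages needed by the training script
RUN pip install transformers datasets
```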

In addition, you can save the Dockerfile to a local path and reference that path from an environment YAML file:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: custom_aml_environment
build:
  path: docker
```
Create the custom environment using:
```sh
az ml environment create -f cloud/environment/environment.yml
```

Finally, reference the new custom environment in your command job YAML:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
description: Trains a 175B GPT model
experiment_name: "large_model_experiment"
compute: azureml:cluster-gpu

code: ../../src
environment: azureml:custom_aml_environment@latest
```
You can find more details at [Custom Environment using SDK and CLI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=cli#create-an-environment).
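As a sketch of the equivalent SDK (v2) route, assuming a local `docker` folder containing the Dockerfile and placeholder workspace details, the custom environment could be created like this:
```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment, BuildContext
from azure.identity import DefaultAzureCredential

# Connect to the workspace (subscription, resource group, and workspace names are placeholders)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Build the environment from the local Docker build context folder
env = Environment(
    name="custom_aml_environment",
    build=BuildContext(path="docker"),
)
ml_client.environments.create_or_update(env)
```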

## Benefits

Benefits of using the ACPT curated environment include:

- Significantly faster training and inference when using the ACPT environment.
- An optimized training framework to set up, develop, and accelerate PyTorch models on large workloads.
- An up-to-date stack with the latest compatible versions of Ubuntu, Python, PyTorch, CUDA/ROCm, etc.
- Ease of use: all components are installed and validated against dozens of Microsoft workloads to reduce setup costs and accelerate time to value.
- The latest training optimization technologies: ONNX / ONNX Runtime / ONNX Runtime Training, ORT MoE, DeepSpeed, MSCCL, and others.
- Integration with Azure ML: track your PyTorch experiments in ML Studio or using the AzureML SDK.
- Use as-is with pre-installed packages, or build on top of the curated environment.
- The image is also available as a DSVM.
- Azure customer support.
67 changes: 67 additions & 0 deletions best-practices/largescale-deep-learning/Environment/README.md
@@ -0,0 +1,67 @@
## Introduction

An environment is typically the first thing to start with when doing deep learning training for several reasons:

* <b>Reproducibility</b>: Setting up a proper environment ensures that the training process is repeatable and can be easily replicated by others, which is crucial for scientific research and collaboration.

* <b>Dependency management</b>: Deep learning requires a lot of dependencies and libraries, such as TensorFlow, PyTorch, or Keras, to name a few. An environment provides a way to manage these dependencies and avoid conflicts with other packages or libraries installed on the system.

* <b>Portability</b>: Environments can be easily exported and imported, making it possible to move the training process to another system or even cloud computing resources.

* <b>Auditing</b>: Environments come with full lineage tracking to be able to associate experiments with a particular environment configuration that was used during training.

Azure Machine Learning environments are an encapsulation of the environment where your machine learning training happens. The environments are managed and versioned entities within your Machine Learning workspace that enable reproducible, auditable, and portable machine learning workflows across a variety of compute targets.

### Types
Generally, environments can broadly be divided into two main categories: curated and user-managed.

Curated environments are provided by Azure Machine Learning and are available in your workspace by default. Intended to be used as is, they contain collections of Python packages and settings to help you get started with various machine learning frameworks. These pre-created environments also allow for faster deployment time. For a full list, see the curated environments article.

With user-managed environments, you're responsible for setting up the environment and installing every package that your training script needs on the compute target. Also be sure to include any dependencies needed for model deployment.

## Building the environment for training
We recommend starting from a curated environment and adding the remaining libraries and dependencies that are specific to your model training on top of it. For PyTorch workloads, we recommend starting from our Azure Container for PyTorch and following the steps outlined [here](./ACPT.md).

## Validation
Before running an actual training job with the environment you just created, it's always recommended to validate it. We've built a sample job that runs standard health checks on a GPU cluster to test the performance and correctness of distributed multi-node GPU training. This helps with troubleshooting performance issues related to the environment and container that you plan to use for long training jobs.

One such validation is running NVIDIA NCCL tests in the environment. The NCCL tests are relevant here because they are a set of tools to measure the performance of NCCL, a library providing topology-aware inter-GPU communication primitives that can be easily integrated into [applications](https://developer.nvidia.com/blog/scaling-deep-learning-training-nccl/). NCCL is widely used in deep learning frameworks, where the AllReduce collective is heavily used for neural network training; efficient scaling of neural network training is possible with the multi-GPU and multi-node communication that NCCL provides.

NVIDIA NCCL tests can help you verify the bandwidth and latency of your NCCL operations, such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive. They can also help you identify bottlenecks or errors in your network or hardware configuration, such as NVLink, PCIe links, or network interfaces. By running NVIDIA NCCL tests before starting a large model training run, you can ensure that your training environment is optimized and ready for efficient and reliable distributed inter-node GPU communication.
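One way to run these checks on AzureML is to submit the GPU health-check job from the `azureml-examples` repository linked below. A minimal sketch, assuming the Azure ML CLI v2 is configured for your workspace; the job file name and compute override are assumptions, so check the repository for the exact names:
```sh
# Clone the examples repository and go to the GPU performance health-check job
git clone https://github.com/Azure/azureml-examples.git
cd azureml-examples/cli/jobs/single-step/gpu_perf

# Submit the job against your own GPU cluster (file name and compute target are illustrative)
az ml job create -f gpu_perf_job.yml --set compute=azureml:cluster-gpu
```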

Please see the example [here](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/single-step/gpu_perf) along with expected baselines for some of the most common GPUs in Azure:

### Standard_ND40rs_v2 (V100), 2 nodes

```
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
db7284be16ae4f7d81abf17cb8e41334000002:3876:3876 [0] NCCL INFO Launch mode Parallel
33554432 8388608 float sum 4393.9 7.64 14.32 4e-07 4384.4 7.65 14.35 4e-07
67108864 16777216 float sum 8349.4 8.04 15.07 4e-07 8351.5 8.04 15.07 4e-07
134217728 33554432 float sum 16064 8.36 15.67 4e-07 16032 8.37 15.70 4e-07
268435456 67108864 float sum 31486 8.53 15.99 4e-07 31472 8.53 15.99 4e-07
536870912 134217728 float sum 62323 8.61 16.15 4e-07 62329 8.61 16.15 4e-07
1073741824 268435456 float sum 124011 8.66 16.23 4e-07 123877 8.67 16.25 4e-07
2147483648 536870912 float sum 247301 8.68 16.28 4e-07 247285 8.68 16.28 4e-07
4294967296 1073741824 float sum 493921 8.70 16.30 4e-07 493850 8.70 16.31 4e-07
8589934592 2147483648 float sum 987274 8.70 16.31 4e-07 986984 8.70 16.32 4e-07
```

### Standard_ND96amsr_A100_v4 (A100), 2 nodes
```
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
47c46425da29465eb4f752ffa03dd537000001:4122:4122 [0] NCCL INFO Launch mode Parallel
33554432 8388608 float sum 593.7 56.52 105.97 5e-07 590.2 56.86 106.60 5e-07
67108864 16777216 float sum 904.7 74.18 139.09 5e-07 918.0 73.11 137.07 5e-07
134217728 33554432 float sum 1629.6 82.36 154.43 5e-07 1654.3 81.13 152.12 5e-07
268435456 67108864 float sum 2996.0 89.60 167.99 5e-07 3056.7 87.82 164.66 5e-07
536870912 134217728 float sum 5631.9 95.33 178.74 5e-07 5639.2 95.20 178.51 5e-07
1073741824 268435456 float sum 11040 97.26 182.36 5e-07 10985 97.74 183.27 5e-07
2147483648 536870912 float sum 21733 98.81 185.27 5e-07 21517 99.81 187.14 5e-07
4294967296 1073741824 float sum 42843 100.25 187.97 5e-07 42745 100.48 188.40 5e-07
8589934592 2147483648 float sum 85710 100.22 187.91 5e-07 85070 100.98 189.33 5e-07
```