diff --git a/docs/getting-started/run-msamp.md b/docs/getting-started/run-msamp.md
index 97a88eaf..59cf5aa6 100644
--- a/docs/getting-started/run-msamp.md
+++ b/docs/getting-started/run-msamp.md
@@ -2,7 +2,7 @@
 id: run-msamp
 ---
 
-# Run examples
+# Run Examples
 
 After installing MS-AMP, you can run several simple examples using MS-AMP. Please note that before running these commands, you need to change work directory to [examples](https://github.com/Azure/MS-AMP/tree/main/examples).
 
diff --git a/docs/introduction.md b/docs/introduction.md
index ac0c6283..d502db63 100644
--- a/docs/introduction.md
+++ b/docs/introduction.md
@@ -26,7 +26,7 @@ MS-AMP has the following benefit comparing with Transformer Engine:
 
 ### Model performance
 
-We evaluated the training loss and validation performance of four typical models, GPT-3, Swin-Transformer, DeiT and RoBERTa, using both MS-AMP and FP16 AMP/BF16. Our observations showed that the models trained with MS-AMP achieved comparable performance to those trained using FP16 AMP/BF16. This demonstrates the effectiveness of the mixed FP8 in MS-AMP.
+We evaluated the training loss and validation performance of four typical models (GPT-3, Swin-Transformer, DeiT and RoBERTa) using both MS-AMP and FP16/BF16 AMP. Our observations show that models trained with MS-AMP achieve performance comparable to those trained with FP16/BF16 AMP, which demonstrates the effectiveness of mixed FP8 in MS-AMP.
 
 Here are the results for GPT-3, Swin-T, DeiT-S and RoBERTa-B.
 
@@ -34,9 +34,9 @@ Here are the results for GPT-3, Swin-T, DeiT-S and RoBERTa-B.
 
 ![image](./assets/performance.png)
 
-### System peroformance
+### System performance
 
-MS-AMP preserves high-precision's accuracy while using only a fraction of the memory footprint on a range of tasks, including GPT-3, DeiT and Swin Transformer. For example, when training GPT-175B on NVIDIA H100 platform, MS-AMP achieves a notable 42% reduction in real memory usage compared with BF16 mixed-precision aproch and reduces training time by 17% compared with Transformer Engine. For small models, MS-AMP with O2 mode can achieve 44% memory saving for Swin-1.0B and 26% memory saving for ViT-1.2B, comparing with FP16 AMP.
+MS-AMP preserves the accuracy of high-precision training while using only a fraction of the memory footprint on a range of tasks, including GPT-3, DeiT and Swin Transformer. For example, when training GPT-175B on the NVIDIA H100 platform, MS-AMP achieves a notable 42% reduction in real memory usage compared with the BF16 mixed-precision approach and reduces training time by 17% compared with Transformer Engine. For small models, MS-AMP with O2 mode achieves 44% memory saving for Swin-1.0B and 26% memory saving for ViT-1.2B compared with FP16 AMP.
 
 Here are the resuls for GPT-3:
 
diff --git a/docs/user-tutorial/usage.md b/docs/user-tutorial/usage.md
index ce9c4f63..e4153a40 100644
--- a/docs/user-tutorial/usage.md
+++ b/docs/user-tutorial/usage.md
@@ -6,7 +6,7 @@ id: usage
 
 ## Basic usage
 
-Enabling MS-AMP is very simple when traning model w/ or w/o data parallelism on a single node, you only need to add one line of code `msamp.initialize(model, optimizer, opt_level)` after defining model and optimizer.
+Enabling MS-AMP is very simple when training a model without any distributed parallel technology: you only need to add one line of code, `msamp.initialize(model, optimizer, opt_level)`, after defining the model and optimizer.
 
 Example:
 
@@ -22,20 +22,24 @@ model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")
 ...
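+# The loop below is a minimal, hypothetical sketch (not part of the original example)
+# of how training typically continues after msamp.initialize, assuming an ordinary
+# PyTorch loop with user-defined `dataloader` and `loss_fn` objects.
+for data, target in dataloader:
+    optimizer.zero_grad()
+    output = model(data)
+    loss = loss_fn(output, target)
+    loss.backward()
+    optimizer.step()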
 ```
 
-## Usage in distributed parallel training
+## Usage in DeepSpeed
 
 MS-AMP supports FP8 for distributed parallel training and has the capability of integrating with advanced distributed traning frameworks. We have integrated MS-AMP with several popular distributed training frameworks such as DeepSpeed, Megatron-DeepSpeed and Megatron-LM to demonstrate this capability.
 
-For enabling MS-AMP when using ZeRO in DeepSpeed, add one line of code `import msamp` and a "msamp" section in DeepSpeed config file:
+To enable MS-AMP in DeepSpeed, add one line of code, `from msamp import deepspeed`, at the beginning of your training script and add a "msamp" section to the DeepSpeed config file:
 
 ```json
 "msamp": {
   "enabled": true,
-  "opt_level": "O3"
+  "opt_level": "O1|O2|O3"
 }
 ```
 
-For applying MS-AMP to Megatron-DeepSpeed and Megatron-LM, you need to do very little code change for applying it. Here is the instruction of applying MS-AMP for running [gpt-3](https://github.com/Azure/MS-AMP-Examples/tree/main/gpt3) in both Megatron-DeepSpeed and Megatron-LM.
+"O3" is designed for FP8 in the ZeRO optimizer, so please make sure ZeRO is enabled when using "O3".
+
+## Usage in Megatron-DeepSpeed and Megatron-LM
+
+Integrating MS-AMP with Megatron-DeepSpeed and Megatron-LM requires a few code changes, and we provide a patch as a reference for the integration. Here are the instructions for integrating MS-AMP with Megatron-DeepSpeed/Megatron-LM and for running [gpt-3](https://github.com/Azure/MS-AMP-Examples/tree/main/gpt3) with MS-AMP.
 
 Runnable, simple examples demonstrating good practices can be found [here](https://azure.github.io//MS-AMP/docs/getting-started/run-msamp). For more comprehensive examples, please go to [MS-AMP-Examples](https://github.com/Azure/MS-AMP-Examples).
 
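+Below is a minimal, hypothetical sketch (for illustration only, with assumed names such as `build_model`, `dataloader` and `ds_config.json`) of how the DeepSpeed pieces above can be wired together; consult the MS-AMP examples for authoritative, runnable scripts.
+
+```python
+from msamp import deepspeed  # used in place of `import deepspeed` so MS-AMP can hook into DeepSpeed
+
+model = build_model()  # assumed user-defined model factory
+
+# ds_config.json is assumed to contain the "msamp" section shown above.
+model_engine, optimizer, _, _ = deepspeed.initialize(
+    model=model,
+    model_parameters=model.parameters(),
+    config="ds_config.json",
+)
+
+for batch in dataloader:  # dataloader is assumed to be defined elsewhere
+    loss = model_engine(batch)   # sketch assumes the model's forward returns the loss
+    model_engine.backward(loss)
+    model_engine.step()
+```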