Improve docs (#109)
**Description**
Improve usage page and fix some typos.

**Major Revision**
- Improve usage page
- fix some typos
tocean authored Oct 30, 2023
1 parent ceadfea commit 3a4ba14
Showing 3 changed files with 13 additions and 9 deletions.
2 changes: 1 addition & 1 deletion docs/getting-started/run-msamp.md
@@ -2,7 +2,7 @@
id: run-msamp
---

# Run examples
# Run Examples

After installing MS-AMP, you can run several simple examples. Please note that before running these commands, you need to change the working directory to [examples](https://github.com/Azure/MS-AMP/tree/main/examples).

6 changes: 3 additions & 3 deletions docs/introduction.md
@@ -26,17 +26,17 @@ MS-AMP has the following benefit comparing with Transformer Engine:

### Model performance

We evaluated the training loss and validation performance of four typical models, GPT-3, Swin-Transformer, DeiT and RoBERTa, using both MS-AMP and FP16 AMP/BF16. Our observations showed that the models trained with MS-AMP achieved comparable performance to those trained using FP16 AMP/BF16. This demonstrates the effectiveness of the mixed FP8 in MS-AMP.
We evaluated the training loss and validation performance of four typical models, GPT-3, Swin-Transformer, DeiT and RoBERTa, using both MS-AMP and FP16/BF16 AMP. Our observations show that models trained with MS-AMP achieve comparable performance to those trained using FP16/BF16 AMP. This demonstrates the effectiveness of mixed FP8 in MS-AMP.

Here are the results for GPT-3, Swin-T, DeiT-S and RoBERTa-B.

![image](./assets/gpt-loss.png)

![image](./assets/performance.png)

### System peroformance
### System performance

MS-AMP preserves high-precision's accuracy while using only a fraction of the memory footprint on a range of tasks, including GPT-3, DeiT and Swin Transformer. For example, when training GPT-175B on NVIDIA H100 platform, MS-AMP achieves a notable 42% reduction in real memory usage compared with BF16 mixed-precision aproch and reduces training time by 17% compared with Transformer Engine. For small models, MS-AMP with O2 mode can achieve 44% memory saving for Swin-1.0B and 26% memory saving for ViT-1.2B, comparing with FP16 AMP.
MS-AMP preserves the accuracy of high-precision training while using only a fraction of the memory footprint on a range of tasks, including GPT-3, DeiT and Swin Transformer. For example, when training GPT-175B on the NVIDIA H100 platform, MS-AMP achieves a notable 42% reduction in real memory usage compared with the BF16 mixed-precision approach and reduces training time by 17% compared with Transformer Engine. For small models, MS-AMP with O2 mode can achieve 44% memory saving for Swin-1.0B and 26% memory saving for ViT-1.2B, compared with FP16 AMP.

Here are the results for GPT-3:

14 changes: 9 additions & 5 deletions docs/user-tutorial/usage.md
@@ -6,7 +6,7 @@ id: usage

## Basic usage

Enabling MS-AMP is very simple when traning model w/ or w/o data parallelism on a single node, you only need to add one line of code `msamp.initialize(model, optimizer, opt_level)` after defining model and optimizer.
Enabling MS-AMP is very simple when training a model without any distributed parallel technologies; you only need to add one line of code `msamp.initialize(model, optimizer, opt_level)` after defining the model and optimizer.

Example:

@@ -22,20 +22,24 @@ model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")
...
```
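
For reference, here is a minimal runnable sketch of this basic usage, assuming a toy linear model, an AdamW optimizer and a CUDA device; only `msamp.initialize` and the `opt_level` argument come from the documentation above, the rest is illustrative:

```python
# Minimal sketch of basic MS-AMP usage (no distributed parallelism).
# The toy model, optimizer and training loop are illustrative assumptions;
# only msamp.initialize and opt_level come from the documentation above.
# Assumes a CUDA GPU, since MS-AMP targets FP8-capable hardware.
import torch
import msamp

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Wrap the model and optimizer so that supported ops use low-precision formats.
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")

for _ in range(10):
    x = torch.randn(32, 128, device="cuda")
    loss = model(x).float().pow(2).mean()  # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```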

## Usage in distributed parallel training
## Usage in DeepSpeed

MS-AMP supports FP8 for distributed parallel training and can integrate with advanced distributed training frameworks. We have integrated MS-AMP with several popular distributed training frameworks such as DeepSpeed, Megatron-DeepSpeed and Megatron-LM to demonstrate this capability.

For enabling MS-AMP when using ZeRO in DeepSpeed, add one line of code `import msamp` and a "msamp" section in DeepSpeed config file:
To enable MS-AMP in DeepSpeed, add one line of code `from msamp import deepspeed` at the beginning of your script and a "msamp" section in the DeepSpeed config file:

```json
"msamp": {
"enabled": true,
"opt_level": "O3"
"opt_level": "O1|O2|O3"
}
```

For applying MS-AMP to Megatron-DeepSpeed and Megatron-LM, you need to do very little code change for applying it. Here is the instruction of applying MS-AMP for running [gpt-3](https://github.com/Azure/MS-AMP-Examples/tree/main/gpt3) in both Megatron-DeepSpeed and Megatron-LM.
"O3" is designed for FP8 in ZeRO optimizer, so please make sure ZeRO is enabled when using "O3".

## Usage in Megatron-DeepSpeed and Megatron-LM

To integrate MS-AMP with Megatron-DeepSpeed and Megatron-LM, you need to make some code changes. We provide a patch as a reference for the integration. Here are the instructions for integrating MS-AMP with Megatron-DeepSpeed/Megatron-LM and running [gpt-3](https://github.com/Azure/MS-AMP-Examples/tree/main/gpt3) with MS-AMP.

Runnable, simple examples demonstrating good practices can be found [here](https://azure.github.io//MS-AMP/docs/getting-started/run-msamp).
For more comprehensive examples, please go to [MS-AMP-Examples](https://github.com/Azure/MS-AMP-Examples).
