add doc example about using low_cpu_fsdp
lchu-ibm committed Aug 3, 2023
1 parent 80a4c36 commit c453b66
Showing 2 changed files with 20 additions and 0 deletions.
10 changes: 10 additions & 0 deletions README.md
@@ -117,6 +117,16 @@ torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --mode

```

### Fine-tuning using FSDP on 70B Model

If you are interested in running full parameter fine-tuning on the 70B model, you can enable the `low_cpu_fsdp` mode as in the following command. This option loads the model on rank 0 only before moving it to the devices to construct FSDP, which can dramatically reduce CPU memory usage when loading large models such as the 70B (on an 8-GPU node, this cuts peak CPU memory from 2+ TB to 280 GB for the 70B model). This has been tested with `BF16` on 16 A100 GPUs with 80GB memory.

```bash
torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
```
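For context, here is a minimal sketch of the rank-0-only loading pattern this flag enables, assuming a Hugging Face Llama checkpoint and PyTorch FSDP; the model path is a hypothetical placeholder, and the exact wiring in `llama_finetuning.py` may differ:

```python
# Sketch (not the repo's exact code): materialize full weights on rank 0 only,
# build an empty meta-device shell on other ranks, then let FSDP broadcast weights.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import LlamaConfig, LlamaForCausalLM

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

model_path = "/path_of_model_folder/70B"  # hypothetical placeholder path

if rank == 0:
    # Only rank 0 holds a full copy of the 70B weights in CPU RAM.
    model = LlamaForCausalLM.from_pretrained(model_path)
else:
    # Other ranks allocate no real storage: parameters live on the meta device.
    with torch.device("meta"):
        model = LlamaForCausalLM(LlamaConfig.from_pretrained(model_path))

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    # Broadcast rank 0's weights so the meta-device ranks receive real values.
    sync_module_states=True,
    # Give non-rank-0 modules real (uninitialized) storage before the broadcast;
    # in practice an auto_wrap_policy for the transformer blocks would also be passed.
    param_init_fn=None if rank == 0 else (
        lambda module: module.to_empty(device=torch.cuda.current_device(), recurse=False)
    ),
)
```

With this pattern, peak CPU memory is roughly one full copy of the weights (on rank 0) instead of one copy per rank, which is where the 2+ TB to 280 GB reduction comes from.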

### Multi GPU Multi Node:

```bash
10 changes: 10 additions & 0 deletions docs/mutli_gpu.md
@@ -55,6 +55,16 @@ torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --mode

```

### Fine-tuning using FSDP on 70B Model

If you are interested in running full parameter fine-tuning on the 70B model, you can enable the `low_cpu_fsdp` mode as in the following command. This option loads the model on rank 0 only before moving it to the devices to construct FSDP, which can dramatically reduce CPU memory usage when loading large models such as the 70B (on an 8-GPU node, this cuts peak CPU memory from 2+ TB to 280 GB for the 70B model). This has been tested with `BF16` on 16 A100 GPUs with 80GB memory.

```bash
torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
```

**Multi GPU multi node**:

Here we use a Slurm script to schedule a job over multiple nodes.
