Update LLM_finetuning.md
HamidShojanazeri authored and mreso committed Sep 13, 2023
1 parent 64e917c commit 8d3ce4c
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/LLM_finetuning.md
@@ -46,7 +46,7 @@ In this scenario depending on the model size, you might need to go beyond one GPU
The way to think about it is that you need enough GPU memory to hold the model parameters, gradients, and optimizer states. Each of these, depending on the training precision, takes up a multiple of your parameter count x bytes per parameter (fp32 = 4 bytes, fp16 = 2 bytes, bf16 = 2 bytes).
For example, the AdamW optimizer keeps two states for each model parameter, and in many cases these are stored in fp32. This means that, depending on how many layers you are training/unfreezing, the required memory can grow beyond what a single GPU provides.
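As a rough back-of-the-envelope sketch (assuming a 7B-parameter model trained in bf16 with fp32 AdamW states; the numbers are illustrative):

```python
# Back-of-the-envelope GPU memory estimate for full fine-tuning with AdamW.
# Assumption: 7B parameters, bf16 weights/gradients, fp32 optimizer states.
params = 7e9

weights   = params * 2      # bf16 weights: 2 bytes each
gradients = params * 2      # bf16 gradients: 2 bytes each
optimizer = params * 2 * 4  # AdamW keeps 2 states per parameter, commonly in fp32 (4 bytes)

total_gb = (weights + gradients + optimizer) / 1e9
print(f"~{total_gb:.0f} GB before activations and temporary buffers")  # ~84 GB
```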

-**FSDP (FUlly Sharded Data Parallel)**
+**FSDP (Fully Sharded Data Parallel)**


PyTorch provides the FSDP package for training models that do not fit on one GPU. FSDP lets you train a much larger model with the same amount of resources. Before FSDP there was DDP (Distributed Data Parallel), where each GPU holds a full replica of the model and only the data is sharded; at the end of the backward pass the gradients are synchronized across GPUs.
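
A minimal sketch of wrapping a model with PyTorch FSDP is shown below; the small `Linear` module stands in for your LLM, the hyperparameters are placeholders, and the script is assumed to be launched with `torchrun` so the distributed environment variables are set.

```python
# Minimal FSDP sketch (assumes launch via torchrun so RANK/WORLD_SIZE are set).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for your LLM
model = FSDP(model)                         # shards parameters, gradients, and optimizer states across ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Training step as usual: each rank processes its own data shard, while FSDP
# gathers and re-shards parameters around the forward and backward passes.
x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()
optimizer.step()
```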
