Initial Lightning DeepSpeed Integration #103
Conversation
Just additionally, we're working on ZeRO 3 right now, and once this is in place I'm more than happy to add a separate PR :)
Thanks for the PR! I'll test it out and see if it works.
I forget why I set that, so no issue changing it here.
Sorry about the late response: IRL has been busy! Testing in the 1-GPU case in Colab, things look good, although unfortunately it goes OOM with your simulated 1.5B model (even w/ gradient checkpointing) despite the DeepSpeed literature asserting that the framework allows that (maybe that support is a ZeRO 3 thing). Checking the multi-GPU case now. If that works I'm good with merging.
However, I'll merge it since this PR seems to be pretty safe for non-DeepSpeed use cases. The GPU issue I mentioned above is likely unrelated (I'll file an issue with pytorch-lightning if I get a better test case). Thanks again!
Woop! No need to thank me, you've done a lot for the community :) Regarding the issue, this is probably because we're using multiprocessing within a notebook... I think this opens up the discussion of a spawn-based DeepSpeed plugin! I can make an issue and cross-reference this now.
Made the issue to track this. @minimaxir, if you have the ability to test outside of a notebook via a terminal, let me know if it works, or I can prioritize getting the spawn version of DeepSpeed together!
Thanks @minimaxir for your hard work, this repo is super awesome! I learnt a lot about how text generation really works back when this repo was released :)
Related to #97.
I've enabled DeepSpeed using default parameters to start with, which does not include CPU offloading since that comes with a speed degradation by default. I've also got a PR set up for PyTorch Lightning to update its README, as the information there was slightly outdated!
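For context, a minimal sketch (not the actual script from this PR) of what enabling DeepSpeed with its defaults looks like through the PyTorch Lightning Trainer; the exact argument name has changed across Lightning versions (`plugins="deepspeed"` in older releases, `strategy="deepspeed"` in newer ones), and the GPU count here is just a placeholder:

```python
# Minimal sketch, assuming a PyTorch Lightning 1.x install: enable DeepSpeed
# with its default settings (states sharded across GPUs, no CPU offloading
# by default) purely through Trainer flags.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=8,               # optimizer/gradient states are sharded across the GPUs
    precision=16,         # DeepSpeed is normally run in FP16
    plugins="deepspeed",  # use strategy="deepspeed" on newer Lightning versions
)
```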
Whilst testing DeepSpeed, I noticed that when training with default parameters the gradient clip value is set to 0, turning off any gradient clipping. Is this by choice? When training with FP16 my loss did not converge, and NaNed without DeepSpeed. To remedy this, I reduced the LR to 1e-4 and set `max_grad_norm` to 1.0, which I think is the default in HF Transformers as well.

Since DeepSpeed really takes effect at larger parameter sizes (the buffers themselves are around 3GB of RAM by default), I tested DeepSpeed using a much larger network size (1.5B parameters) across 8 GPUs, seeing good scaling of parameter sizes. As the number of GPUs increased, I did see the memory usage decrease, which is expected as we shard the states across GPUs. I'll continue to try to push the numbers on our A100 server.
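To make that remedy concrete, here's a hedged sketch of the corresponding Trainer settings; `gradient_clip_val` is Lightning's counterpart to `max_grad_norm`, and where the 1e-4 learning rate ends up depends on how the LightningModule builds its optimizer:

```python
# Minimal sketch of the stabilization described above: clip gradients at 1.0
# (which I believe matches the HF Transformers max_grad_norm default) and
# train in FP16 with a lowered learning rate of 1e-4.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,
    precision=16,
    gradient_clip_val=1.0,  # Lightning-side equivalent of max_grad_norm=1.0
)

# The lowered LR would be applied where the optimizer is created, e.g. in the
# LightningModule:
#   def configure_optimizers(self):
#       return torch.optim.AdamW(self.parameters(), lr=1e-4)
```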
I'll try to do some tests with DeepSpeed at sizes that fit DDP, and I'll continue to test. Let me know if you have any issues/feedback!
Script: