Add `convert_module` to FSDP #20323

tshu-w · 2024-10-06T07:48:20Z

What does this PR do?

Adds `convert_module` for FSDP, mirroring what is already done for DeepSpeed.
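As a rough illustration of what a `convert_module` hook can do (a minimal sketch, not the diff in this PR): the class `SketchPrecision` and the mapping `_TRUE_DTYPES` below are hypothetical stand-ins, and the sketch assumes that only the "true" precision modes should cast the module's weights, while mixed modes leave them untouched.

```python
import torch
from torch import nn

# Hypothetical mapping from Lightning-style precision strings to weight dtypes.
# Mixed-precision modes are deliberately absent: they keep the weights as-is.
_TRUE_DTYPES = {
    "bf16-true": torch.bfloat16,
    "16-true": torch.float16,
    "32-true": torch.float32,
}


class SketchPrecision:
    """Illustrative stand-in for a precision plugin; not Lightning's actual class."""

    def __init__(self, precision: str) -> None:
        self.precision = precision

    def convert_module(self, module: nn.Module) -> nn.Module:
        dtype = _TRUE_DTYPES.get(self.precision)
        if dtype is None:
            # Mixed modes: leave parameters in their current dtype
            return module
        # "True" modes: cast parameters and buffers to the target dtype
        return module.to(dtype=dtype)


plugin = SketchPrecision("bf16-true")
model = plugin.convert_module(nn.Linear(4, 2))
print(model.weight.dtype)
```

Casting before the module is wrapped means the weights are held in the lower-precision dtype from the start, which is where any memory saving would come from.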

Fixes #19721 (comment)

Before submitting

Was this discussed/agreed via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20323.org.readthedocs.build/en/20323/

codecov · 2024-10-06T08:09:21Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89%. Comparing base (5dea36c) to head (088b972).
Report is 11 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #20323    +/-   ##
========================================
+ Coverage      88%      89%    +1%     
========================================
  Files         267      267            
  Lines       23065    23076    +11     
========================================
+ Hits        20277    20585   +308     
+ Misses       2788     2491   -297

lantiga · 2024-10-07T11:37:42Z

Thank you @tshu-w!
Looks good in general. FSDP relies on contexts, but this may not cleanly apply when recomputations are involved.

As a sanity check, can you verify that the issues in #19721 are resolved? (i.e. memory goes back to what PyTorch uses, and no inconsistency errors are produced - these may be good tests to add btw, or at least a scaled-down version thereof).

I'll be happy to run things on my end and dig deeper in parallel.

tshu-w · 2024-10-07T12:49:39Z

I did notice a decrease in VRAM usage (which I will confirm again in the coming week), even when I initialize the LLM in `configure_model` as follows. However, I cannot guarantee that this PR resolves the original issue, because its author manually set the LLM's torch_dtype to torch.bfloat16. Nevertheless, I believe this may solve part of the problem.

def configure_model(self):
    # Return early if the model has already been built
    if self.model is not None:
        return

    # torch_dtype is left unset here (the original issue set it to torch.bfloat16)
    self.model = AutoModelForCausalLM.from_pretrained(self.model_name_or_path)
    # suppress the open-end generation warning
    self.model.generation_config.pad_token_id = (
        self.model.generation_config.pad_token_id
        or self.model.generation_config.eos_token_id
    )

    # Optionally wrap the model with a PEFT adapter
    if self.hparams.peft_config:
        peft_config = get_peft_config(self.hparams.peft_config)
        self.model = get_peft_model(self.model, peft_config)

    # Fall back to a template when the tokenizer does not define one
    if self.tokenizer.chat_template is None:
        self.tokenizer.chat_template = (
            self.chatml_template
            if self.hparams.use_chatml_template
            else self.base_template
        )
        if self.hparams.use_chatml_template:
            # Register the ChatML markers and resize the embeddings to match
            self.tokenizer.add_tokens(
                ["<|im_start|>", "<|im_end|>"], special_tokens=True
            )
            self.model.resize_token_embeddings(len(self.tokenizer))

    # Optionally restore weights stored under the "state_dict" key
    if self.hparams.ckpt_path:
        checkpoint = torch.load(self.hparams.ckpt_path, weights_only=True)
        self.load_state_dict(checkpoint["state_dict"])
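For a sense of scale on the VRAM difference mentioned above, a back-of-the-envelope estimate of weight memory by dtype can help. This is only a sketch: the 7B parameter count is a hypothetical example, and the estimate covers weights alone (no gradients, optimizer state, or activations) before any FSDP sharding across ranks.

```python
# Standard element sizes in bytes for the dtypes discussed in this thread
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def param_memory_gib(num_params: int, dtype: str) -> float:
    """Return the unsharded weight memory in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


num_params = 7_000_000_000  # hypothetical model size, for illustration only
fp32_gib = param_memory_gib(num_params, "float32")
bf16_gib = param_memory_gib(num_params, "bfloat16")
print(f"float32: {fp32_gib:.1f} GiB, bfloat16: {bf16_gib:.1f} GiB")
```

Under these assumptions, keeping the weights in bfloat16 rather than float32 halves the weight footprint, which is consistent in direction with the decrease reported here, though it says nothing about the actual measured numbers.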

lantiga · 2024-11-12T22:31:53Z

hey @tshu-w did you end up digging further?

tshu-w requested review from lantiga, Borda, tchaton and justusschock as code owners October 6, 2024 07:48

github-actions bot added the fabric (lightning.fabric.Fabric) and pl (Generic label for PyTorch Lightning package) labels Oct 6, 2024

tshu-w force-pushed the FSDP branch 2 times, most recently from 7a0c355 to baeb535 October 7, 2024 08:31

tshu-w added 2 commits October 7, 2024 16:32

Add convert_module to FSDP

ab552e1

Update ChangeLog

088b972

tshu-w force-pushed the FSDP branch from baeb535 to 088b972 October 7, 2024 08:32

lantiga added the waiting on author (Waiting on user action, correction, or update) label Nov 12, 2024
