
Retrieving the Trained Model #1094

Open
dheerj188 opened this issue Apr 29, 2024 · 6 comments

Comments

@dheerj188

How can we get back our trained model as a normal nn.Module once we have trained it using the Pipe object and the GPipe scheduler?

@Xynonners

Also interested in this. Did you ever figure it out?

@dheerj188
Author

Not as of now.

@kwen2501
Contributor

kwen2501 commented Jun 10, 2024

Sorry for the late reply.
We have migrated the PiPPy library to torch.distributed.pipelining.
Here is our new documentation: https://pytorch.org/docs/main/distributed.pipelining.html.

In section "Option 2", you can see:

The Pipe object provides a method for retrieving the "model partitions":

stage_mod: nn.Module = pipe.get_stage_module(stage_idx)

The returned object is an nn.Module, so you can save it as you would a regular module, for example:

torch.save(stage_mod, filepath)

or

torch.save(stage_mod.state_dict(), filepath)

(Note the parentheses on state_dict(): passing the bound method itself would pickle the method, not the weights.)

(Reference: https://pytorch.org/tutorials/beginner/saving_loading_models.html)
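A minimal round-trip sketch of the save path above, using a toy nn.Linear as a stand-in for the module returned by pipe.get_stage_module(stage_idx) (and an in-memory buffer instead of a file path):

```python
import io

import torch
import torch.nn as nn

# Toy stand-in for a stage module; in practice this would come from
# pipe.get_stage_module(stage_idx).
stage_mod = nn.Linear(4, 4)

buf = io.BytesIO()
# state_dict() (with parentheses) returns the weights dict; saving
# stage_mod.state_dict without calling it would pickle the bound method.
torch.save(stage_mod.state_dict(), buf)

# Restore into a fresh module with the same architecture.
buf.seek(0)
restored = nn.Linear(4, 4)
restored.load_state_dict(torch.load(buf))

assert torch.equal(restored.weight, stage_mod.weight)
```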

@Xynonners

Xynonners commented Jun 11, 2024

> Sorry for the late reply. We have migrated the PiPPy library to torch.distributed.pipelining [...] (quoting @kwen2501's reply above)

I think the question (at least for me) was whether we could turn the model back into the non-pipelined version for modification and saving?

@kwen2501
Contributor

Hmm, do you mean getting back the full model at the end of training, but before saving the final checkpoint?
It might be hard, I think, because each stage's updated weights now live on different ranks.
So unless we do an all-gather, only part of the weights in the pipe object would be up to date.

That said, imagine we would do a torch.load later, that would be a good time for gluing the model back together, because:
(i) we have the full, original model; and
(ii) PP does not change the FQN of the weights.

It is only a matter of loading from a single checkpoint file vs multiple checkpoint files.
As far as I know, HF already uses multiple checkpoint files for large models.
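The "gluing back together" idea relies on the point above that pipelining does not change the weights' fully-qualified names (FQNs): each stage's state_dict keys are a disjoint subset of the full model's keys, so the per-stage checkpoints can simply be merged. A sketch, using manually sliced state_dicts as stand-ins for per-stage checkpoint files:

```python
import torch
import torch.nn as nn

# The full, original model.
full = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Pretend these are the per-stage checkpoints saved on different ranks.
# Their keys ("0.weight", "2.bias", ...) match the full model's FQNs.
stage0_sd = {k: v for k, v in full.state_dict().items() if k.startswith("0.")}
stage1_sd = {k: v for k, v in full.state_dict().items() if k.startswith("2.")}

# Glue the model back together: merge the dicts and load into a fresh copy.
merged = {**stage0_sd, **stage1_sd}
fresh = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
fresh.load_state_dict(merged)
```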

@dheerj188
Author

OK, here is what I want to do: obtain the gradients of each layer from each rank's stage via the pipe object and send them to the CPUs. After modifying the gradients on the CPU, bring them back to the corresponding ranks of the pipeline and update the model with the modified gradients. Is this possible with PiPPy?
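Independently of PiPPy, the per-stage part of this pattern can be sketched in plain PyTorch: after backward, pull each parameter's gradient to the CPU, modify it, and copy it back in place before the optimizer step. (The nn.Linear here is a hypothetical stand-in for one stage's module on its rank; gradient clamping is just an example modification.)

```python
import torch
import torch.nn as nn

# Stand-in for one stage's module; in the pipeline case this would be
# pipe.get_stage_module(stage_idx), living on that rank's device.
stage_mod = nn.Linear(4, 1)
loss = stage_mod(torch.randn(2, 4)).sum()
loss.backward()

# Move each gradient to the CPU, modify it there, then write it back
# to the parameter's device before calling optimizer.step().
for p in stage_mod.parameters():
    g_cpu = p.grad.detach().cpu()
    g_cpu.clamp_(-0.1, 0.1)  # example CPU-side modification
    p.grad.copy_(g_cpu.to(p.grad.device))
```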
