LLaMA PRO training resume problem #1231
We haven't thoroughly tested layer freezing together with FSDP, so it's possible something complicated is going on here. However, we have seen this error when you try to load a checkpoint that doesn't have optimizer state, so it is possible that loading a checkpoint that only contains optimizer state for some of the parameters does not work properly.
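One way to see what a checkpoint actually contains is to read the sharded checkpoint's `.metadata` and look at which keys carry optimizer state. A minimal diagnostic sketch, assuming a PyTorch distributed (sharded) checkpoint directory such as `ep0-ba4500/`; the path and the `"optimizers"` key filter are assumptions, not taken from this run:

```python
# Diagnostic sketch: list which keys (and in particular which optimizer-state
# keys) are present in a sharded torch.distributed.checkpoint directory.
# The path and the "optimizers" substring filter below are assumptions.
from torch.distributed.checkpoint import FileSystemReader

reader = FileSystemReader("ep0-ba4500")   # directory containing the .metadata file
metadata = reader.read_metadata()

all_keys = sorted(metadata.state_dict_metadata.keys())
optim_keys = [k for k in all_keys if "optimizers" in k]

print(f"{len(all_keys)} total keys, {len(optim_keys)} optimizer-state keys")
for key in optim_keys[:20]:
    print(key)
```

If only a subset of the parameters shows up under the optimizer-state keys, that matches the partially frozen setup described in this issue.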
I'm facing a similar issue with the latest release (0.8.0). When resuming from a monolithic checkpoint with
With a prior version of llm-foundry I didn't have this issue (albeit I was using
Are you experiencing a similar issue? Or do you have any hints?
I also have
@Riccorl, it seems like your problem is separate from @germanjke's. Can you file a new issue with some more information, like:
@germanjke, can you try freezing layers as part of the optimizer, rather than with the Composer layer-freezing algorithm? Freezing via the optimizer is better tested. For example, see the sketch below.
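The snippet originally attached to that comment is not reproduced here; below is a minimal sketch of what optimizer-level freezing can look like, assuming the 8 new blocks sit at every fifth position of a 40-layer model and that parameter names carry the block index as a dotted component (both are assumptions for illustration):

```python
import torch
from torch import nn

# Indices of the 8 newly inserted blocks (0-indexed, every fifth layer of a
# 40-layer model). These indices are an assumption for illustration.
NEW_LAYER_IDXS = {4, 9, 14, 19, 24, 29, 34, 39}

def build_frozen_optimizer(model: nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze all original layers and build an optimizer over the new blocks only."""

    def is_new_block(param_name: str) -> bool:
        # Parameter names like "model.blocks.4.attn.Wqkv.weight" contain the
        # block index as a dotted component; check it against NEW_LAYER_IDXS.
        return any(part.isdigit() and int(part) in NEW_LAYER_IDXS
                   for part in param_name.split("."))

    for name, param in model.named_parameters():
        # Frozen parameters get no gradients, and because they are excluded
        # from the optimizer below, they also get no optimizer state.
        param.requires_grad = is_new_block(name)

    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```

Since the frozen parameters never enter the optimizer, the checkpointed optimizer state covers exactly the trainable parameters, which tends to interact more predictably with checkpoint save/load than an algorithm that flips `requires_grad` after the optimizer has been constructed.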
@Riccorl I have identified this as a PyTorch issue and opened a bug report on their end, plus a PR to fix it.
Hello,
I'm currently training LLaMA PRO. Initially, I expanded the model from 32 layers to 40 layers and proceeded to train only the 8 newly added layers (every fifth layer), so I froze 32 of the 40 layers.
The training is going well, and only the layers I need are being trained.
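For reference, a LLaMA-Pro-style depth expansion of this kind can be sketched as below. This is an illustrative sketch, not the code used in this run; the `self_attn.o_proj` / `mlp.down_proj` attribute names follow the Hugging Face LLaMA layout and are assumptions:

```python
import copy
from torch import nn

def expand_depth(blocks: nn.ModuleList, group_size: int = 4) -> nn.ModuleList:
    """Insert one copied block after every `group_size` original blocks.

    With 32 original layers and group_size=4 this yields 40 layers, where
    every fifth layer (indices 4, 9, ..., 39) is a newly added block.
    """
    expanded = []
    for i, block in enumerate(blocks):
        expanded.append(block)
        if (i + 1) % group_size == 0:
            new_block = copy.deepcopy(block)
            # Zero the output projections so each new block initially acts as
            # an identity mapping and training starts from the base model's behavior.
            nn.init.zeros_(new_block.self_attn.o_proj.weight)
            nn.init.zeros_(new_block.mlp.down_proj.weight)
            expanded.append(new_block)
    return nn.ModuleList(expanded)
```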
But after a hardware failure, I attempted to resume training using `load_path`, and I encountered an error:

My `ep0-ba4500/.metadata` looks like this:

Have you experienced similar issues?