
Adding OLMo #1827

Merged
merged 12 commits into from Nov 13, 2024
Conversation

@aflah02 (Contributor) commented Nov 7, 2024

PR to support OLMo. Heavily borrows from #927

Still WIP

Core code changes done. Working on training recipes

@rasbt (Collaborator) left a comment:


I know this is still WIP, but this looks good to me so far!

@Andrei-Aksionov (Collaborator) commented Nov 7, 2024

Hello @aflah02

Thanks for working on it.
Your PR seems to be at the finish line.
The only part that you need to carry over from that PR is the changes in tokenizer.py here.

@aflah02 (Contributor, Author) commented Nov 7, 2024

> Hello @aflah02
>
> Thanks for working on it. Your PR seems to be at the finish line. The only part that you need to carry over from that PR is the changes in tokenizer.py here.

Makes sense! One follow-up question: post-training, to export the lit checkpoint to an HF-compatible OLMo class (https://huggingface.co/docs/transformers/en/model_doc/olmo#transformers.OlmoForCausalLM), don't we need custom export logic to assign this class to the exported model? The reverse conversion (HF to lit checkpoint) piggybacks on the llama code, but how will the other direction map the model to the right class?

@aflah02 (Contributor, Author) commented Nov 7, 2024

Ah, never mind, I just saw point 5 in this tutorial: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md#a-finetuning-and-conversion-tutorial

Is this the expected loading code post-conversion?

import torch
from transformers import AutoModel

state_dict = torch.load('FINETUNED_AND_CONVERTED_OLMO_MODEL_PATH')
model = AutoModel.from_pretrained("allenai/OLMo-1b-hf", state_dict=state_dict)

If yes, then is there a way to convert it into a format which does not require this approach but rather just loads directly if you point to the custom model name on HF? Or does that also just work?

@Andrei-Aksionov (Collaborator) commented:

We have scripts that convert weights from HF --> LitGPT and LitGPT --> HF.
Since we already have those scripts for Llama-based models, it should work for OLMo too.
(As we found out, OLMo is a special case of Llama models. At least OLMo-hf; the original is quite different.)

Conversion is basically a renaming of layers and, if needed, splitting/combining attention weights. This is needed because LitGPT uses a combined QKV matrix, while some models have separate Q, K, and V matrices.
Oh, and sometimes we need to split MLP weights, because some models combine them.
Overall, it's not a sophisticated process 🙂.
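For intuition, the QKV part is just a concatenation/split along the output dimension. A minimal sketch in plain PyTorch (shapes illustrative, not the actual LitGPT helper; the real code also has to respect head ordering for grouped-query attention):

import torch

# HF-style separate projection weights (q_proj / k_proj / v_proj)
q = torch.randn(4096, 4096)
k = torch.randn(4096, 4096)
v = torch.randn(4096, 4096)

# combine into a single QKV matrix, as LitGPT stores it
qkv = torch.cat((q, k, v), dim=0)

# ...and split back for the LitGPT --> HF direction
q2, k2, v2 = qkv.split(4096, dim=0)
assert torch.equal(q, q2) and torch.equal(k, k2) and torch.equal(v, v2)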

You also need to port the tests from the original PR (forgot to mention it, my bad). They check numerical equivalence after the conversion; if they pass, the conversion works without issues.
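Conceptually, those tests boil down to something like this (only a sketch: convert_hf_to_lit stands in for the real conversion helpers, and the config name assumes the one added in this PR):

import torch
from transformers import AutoModelForCausalLM
from litgpt import GPT, Config

theirs = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B-hf")
ours = GPT(Config.from_name("OLMo-1B-hf"))
ours.load_state_dict(convert_hf_to_lit(theirs.state_dict()))  # hypothetical helper

x = torch.randint(0, ours.config.vocab_size, (1, 8))
torch.testing.assert_close(ours(x), theirs(x).logits)  # numerical equivalence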


> but rather just loads directly if you point to the custom model name on HF

If I understood you correctly, the first step is to convert from Lit to HF format, then use HF tools for uploading weights to their hub.
I'm not that familiar with HF transformers, but I believe they provide an option to upload weights through CLI tools.
After weights are uploaded, you can just provide repo_name/model_name and it should work.
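For example, something along these lines with the huggingface_hub CLI (untested on my side; the repo name is just a placeholder):

huggingface-cli login
huggingface-cli upload your-username/OLMo-1B-litgpt-finetuned path/to/converted_checkpoint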


Bottom line: after weights are converted, they can be used by HF transformers like any other weights on their platform.
Whether you want to just load them into transformers, upload them to the Hub, or quantize them with any tool that expects the HF weight format, it should work.

If it doesn't, please open an issue 😉.

@rasbt (Collaborator) left a comment:


This looks really good and clean!

Review threads (outdated, resolved): README.md, tutorials/download_model_weights.md
@aflah02 (Contributor, Author) commented Nov 8, 2024

@Andrei-Aksionov @rasbt
When I copy over the tests for checkpoint conversion, I should use copy_weights_llama, right? As noted here, the old OLMo conversion code is just a special case of copy_weights_llama. I am already doing that for the model test case.

@aflah02 (Contributor, Author) commented Nov 8, 2024

Also, for now I've added both 7B and 7B Instruct to the tests, but since they're architecturally identical, I guess we can remove one, right?

@Andrei-Aksionov (Collaborator) commented:

> When I copy over the tests for checkpoint conversion, I should use copy_weights_llama, right?

Correct.

> Also, for now I've added both 7B and 7B Instruct to the tests, but since they're architecturally identical, I guess we can remove one, right?

Also correct.
The difference should only be in the training dataset and the training process; the architecture should be the same.


Speaking of training.
I guess the next step is to add training recipes. Is that correct, @rasbt?

@rasbt (Collaborator) commented Nov 8, 2024

> Speaking of training.
> I guess the next step is to add training recipes. Is that correct, @rasbt?

Historically we added the YAML config files, but this is not strictly necessary because the runs can take several hours and would be expensive for contributors. So please feel free to skip this; someone else from the team can generate those.

But I suggest doing a quick test run with LoRA to make sure it at least seems to converge.

The other thing: I would also do a quick sanity check with litgpt generate and litgpt chat, maybe with the prompt "What do Llamas eat?". You can post the responses here in the GitHub PR.
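For example (assuming the checkpoint was fetched with litgpt download; exact invocation from memory, so double-check the flags):

litgpt generate allenai/OLMo-7B-hf --prompt "What do Llamas eat?"
litgpt chat allenai/OLMo-7B-hf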

Thanks for all the work on this PR so far!

@aflah02 (Contributor, Author) commented Nov 8, 2024

> > Speaking of training.
> > I guess the next step is to add training recipes. Is that correct, @rasbt?
>
> Historically we added the YAML config files, but this is not strictly necessary because the runs can take several hours and would be expensive for contributors. So please feel free to skip this; someone else from the team can generate those.
>
> But I suggest doing a quick test run with LoRA to make sure it at least seems to converge.

I'm looking at pretraining OLMo-based models, so I'll be happy to work on the YAML config files as well. I'll try to run some over the weekend/early next week and share the results here.

> The other thing: I would also do a quick sanity check with litgpt generate and litgpt chat, maybe with the prompt "What do Llamas eat?". You can post the responses here in the GitHub PR.

I did run these but pasted them in the old PR instead. You can find them here: #927 (comment)

@aflah02 (Contributor, Author) commented Nov 8, 2024

I am thinking of training a TinyOLMo model, same as TinyLlama but with a smaller dataset. I copied over the config and made minor changes.
Does this work, and are there any recommendations for a smaller dataset? I was thinking of using the MicroLlama dataset.

@aflah02 (Contributor, Author) commented Nov 8, 2024

Also, should I continue this as part of this PR? Or should this be merged first, since the pretraining tests might take longer and I can do a few quick finetuning tests?

@aflah02 (Contributor, Author) commented Nov 8, 2024

Also, there seem to be quite a few other OLMo versions. Should I add them in? See: https://huggingface.co/collections/allenai/olmo-suite-65aeaae8fe5b6b2122b46778

@aflah02 (Contributor, Author) commented Nov 8, 2024

One final thing I just realized: the names for the OLMo models are very slightly different, e.g. OLMo-7B-hf (on HF) vs. OLMo-7b-hf (in the PR). It seems HF is not case-sensitive, as both work, but I guess keeping the same name as HF would be better? Just confirming, as I copied the names from the old PR, which have a different case from those on HF.

@rasbt (Collaborator) commented Nov 8, 2024

> I'm looking at pretraining OLMo-based models, so I'll be happy to work on the YAML config files as well. I'll try to run some over the weekend/early next week and share the results here.

Thanks! I'd say test that the YAML files work, but no need to wait until the run finishes; the numbers won't be comparable to those in the existing table because they are based on different hardware.

> I did run these but pasted them in the old PR instead. You can find them here

They look good!

> Also, should I continue this as part of this PR? Or should this be merged first, since the pretraining tests might take longer and I can do a few quick finetuning tests?

Oh, you don't need to do a complete pretraining run; just run it briefly to make sure there are no bugs.

> Also, there seem to be quite a few other OLMo versions. Should I add them in? See: https://huggingface.co/collections/allenai/olmo-suite-65aeaae8fe5b6b2122b46778

Good suggestion, but this would add too much clutter in my opinion.

A user who is interested in those can already download them via

litgpt download allenai/OLMo-1B-0724-hf \
  --model_name OLMo-1B-hf

> One final thing I just realized: the names for the OLMo models are very slightly different, e.g. OLMo-7B-hf (on HF) vs. OLMo-7b-hf (in the PR

Good catch. I am just seeing that we've been inconsistent with this in the config file. I would just use uppercase B as in the llama models.

Thanks for all your attention to detail here!

(Btw, I will be out of office after today, so please feel free to merge when it is ready, @Andrei-Aksionov.)

@Andrei-Aksionov (Collaborator) commented:

@rasbt I don't think I have permission to merge a PR.
Definitely not to override your change request.

@aflah02 PR looks good.
The remaining part is:

> One final thing I just realized: the names for the OLMo models are very slightly different, e.g. OLMo-7B-hf (on HF) vs. OLMo-7b-hf (in the PR). It seems HF is not case-sensitive, as both work, but I guess keeping the same name as HF would be better? Just confirming, as I copied the names from the old PR, which have a different case from those on HF.


Training recipes can be done in a separate PR (if you want to do it).

@aflah02 (Contributor, Author) commented Nov 12, 2024

Hi @Andrei-Aksionov
I'll make the change and also set up the pretraining run to start tomorrow. In the meantime, I ran a few small finetuning experiments by copying over the config from another model: https://wandb.ai/aflah/finetune-OLMo-7b-hf/overview

@aflah02 (Contributor, Author) commented Nov 13, 2024

Hi @Andrei-Aksionov
I was trying to create a config that mimics the setting from the OLMo paper (see Table 5, page 17: https://aclanthology.org/2024.acl-long.841.pdf), but some args seem to be missing from the litgpt args file (https://github.com/Lightning-AI/litgpt/blob/main/litgpt/args.py).

I then looked at the TinyLlama config for reference and found the initial learning rate under the optimizer header, but I can't find any docs on what else is supported there (scheduler etc.) and how I can specify the scheduler to use.
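For context, the optimizer header follows the class_path/init_args convention; as far as I can tell it looks something like this (values illustrative, not the actual TinyLlama numbers):

optimizer:
  class_path: torch.optim.AdamW
  init_args:
    lr: 4.0e-4
    weight_decay: 0.1
    betas: [0.9, 0.95]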

@aflah02 (Contributor, Author) commented Nov 13, 2024

Also, maybe a separate PR would be best for the pretraining, since it will take some back-and-forth discussion; can you merge this one first?

@Andrei-Aksionov (Collaborator) commented:

Hello @aflah02

I'll merge the PR.
Thanks for pushing it over the line! 🎉


> I then looked at the TinyLlama config for reference and found the initial learning rate under the optimizer header, but I can't find any docs on what else is supported there (scheduler etc.) and how I can specify the scheduler to use.

There is only one LR scheduler: linear warmup followed by cosine annealing.

def get_lr_scheduler(optimizer, warmup_steps: int, max_steps: int):
    # linear warmup followed by cosine annealing
    scheduler1 = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: step / warmup_steps)
    scheduler2 = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=(max_steps - warmup_steps))
    return torch.optim.lr_scheduler.SequentialLR(optimizer, [scheduler1, scheduler2], milestones=[warmup_steps])
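It is used in the standard PyTorch way, roughly like this (numbers hypothetical):

import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
scheduler = get_lr_scheduler(optimizer, warmup_steps=2000, max_steps=100_000)
for step in range(100_000):
    # ... forward/backward and optimizer.step() happen here ...
    scheduler.step()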

I think it's fine if OLMo uses this scheduler.

Andrei-Aksionov merged commit 22528bf into Lightning-AI:main on Nov 13, 2024
9 checks passed
@aflah02 (Contributor, Author) commented Nov 13, 2024

Thanks! I didn't do much tbh since the code was pretty much all there already :)

For the scheduler, what is the recommended route to use a custom one? I want to use the plain linear scheduler for some future OLMo runs, as that is what the paper used.

@Andrei-Aksionov (Collaborator) commented:

> Thanks! I didn't do much tbh since the code was pretty much all there already :)

[image]

🤓


> For the scheduler, what is the recommended route to use a custom one? I want to use the plain linear scheduler for some future OLMo runs, as that is what the paper used.

I think there is no need for now to add the exact same scheduler as in the paper.
The codebase should be feature-rich and up-to-date, but also easy to read and not overly complicated (especially if there is no particular reason for an addition).

Andrei-Aksionov mentioned this pull request on Nov 13, 2024
@aflah02 (Contributor, Author) commented Nov 14, 2024

> [image]
>
> 🤓

Haha thanks!

> For the scheduler, what is the recommended route to use a custom one? I want to use the plain linear scheduler for some future OLMo runs, as that is what the paper used.

> I think there is no need for now to add the exact same scheduler as in the paper. The codebase should be feature-rich and up-to-date, but also easy to read and not overly complicated (especially if there is no particular reason for an addition).

One reason I worked on integrating OLMo is that my colleagues and I are planning to run some pretraining experiments with the OLMo architecture. We want to follow the AI2 paper as closely as possible, including their choice of scheduler, to assess stability when using litgpt. Hence, we would strongly prefer to use the linear scheduler they used over a 2-phase scheduler.

@Andrei-Aksionov (Collaborator) commented:

I see.
Then you need to create something like this, but for schedulers.
It should parse the config and instantiate the corresponding class.
But, in order to preserve the current functionality, it should be able to parse a list of schedulers.
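A rough sketch of the idea (all names hypothetical, not existing litgpt code; the "milestone" key is an assumed convention for when each subsequent scheduler takes over):

import importlib

import torch


def _instantiate(optimizer, cfg):
    # cfg example: {"class_path": "torch.optim.lr_scheduler.LinearLR", "init_args": {...}}
    module, _, name = cfg["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module), name)
    return cls(optimizer, **cfg.get("init_args", {}))


def get_lr_scheduler(optimizer, cfg):
    # a single dict yields one scheduler (e.g. a plain linear schedule);
    # a list reproduces the current warmup-then-cosine chaining via SequentialLR
    if isinstance(cfg, list):
        schedulers = [_instantiate(optimizer, c) for c in cfg]
        milestones = [c["milestone"] for c in cfg[1:]]
        return torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers, milestones=milestones)
    return _instantiate(optimizer, cfg)

With that, a config like {"class_path": "torch.optim.lr_scheduler.LinearLR", "init_args": {"start_factor": 1.0, "end_factor": 0.1, "total_iters": 100_000}} would give you a plain linear schedule.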

@aflah02 (Contributor, Author) commented Nov 15, 2024

> I see. Then you need to create something like this, but for schedulers. It should parse the config and instantiate the corresponding class. But, in order to preserve the current functionality, it should be able to parse a list of schedulers.

Thanks! I'll take a stab at this.
