
Adding OLMo #1827

Merged
merged 12 commits into from Nov 13, 2024
Conversation

@aflah02 (Contributor) commented Nov 7, 2024

PR to support OLMo. Heavily borrows from #927

Still WIP

Core code changes done. Working on training recipes

@rasbt (Collaborator) left a comment:


I know this is still WIP, but this looks good to me so far!

@Andrei-Aksionov (Collaborator) commented Nov 7, 2024

Hello @aflah02

Thanks for working on it.
Your PR seems to be at the finish line.
The only part that you need to carry over from that PR is the changes in tokenizer.py here.

@aflah02 (Contributor, Author) commented Nov 7, 2024

> Hello @aflah02
>
> Thanks for working on it. Your PR seems to be at the finish line. The only part that you need to carry over from that PR is the changes in tokenizer.py here.

Makes sense! One follow-up question: post-training, to export the lit checkpoint to an HF-compatible OLMo class (https://huggingface.co/docs/transformers/en/model_doc/olmo#transformers.OlmoForCausalLM), don't we need custom export logic to assign this class to the exported model? The reverse conversion (HF to lit checkpoint) piggybacks on the llama code, but how will the other direction map the model to the right class?

@aflah02 (Contributor, Author) commented Nov 7, 2024

Ah, never mind, I just saw point 5 in this tutorial: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md#a-finetuning-and-conversion-tutorial

Is this the expected loading code post-conversion?

import torch
from transformers import AutoModel

state_dict = torch.load('FINETUNED_AND_CONVERTED_OLMO_MODEL_PATH')
model = AutoModel.from_pretrained("allenai/OLMo-1b-hf", state_dict=state_dict)

If yes, then is there a way to convert it into a format which does not require this approach but rather just loads directly if you point to the custom model name on HF? Or does that also just work?

@Andrei-Aksionov (Collaborator) commented:

We have scripts that convert weights from HF --> LitGPT and LitGPT --> HF.
Since we already have those scripts for Llama-based models, it should work for OLMo too.
(As we found out, OLMo is a special case of Llama models. At least OLMo-hf; the original is quite different.)

Conversion is basically a renaming of layers and, if needed, splitting/combining attention weights. This is needed because LitGPT uses a combined QKV matrix, while some models have separate Q, K, and V matrices.
Oh, and sometimes we need to split MLP weights, because some models combine them.
Overall, it's not a sophisticated process 🙂.
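For intuition, the QKV part is just a concatenation/split along the output dimension. A minimal sketch in plain PyTorch (shapes illustrative, not the actual LitGPT helper; the real code also has to respect head ordering for grouped-query attention):

import torch

# HF-style separate projection weights (q_proj / k_proj / v_proj)
q = torch.randn(4096, 4096)
k = torch.randn(4096, 4096)
v = torch.randn(4096, 4096)

# combine into a single QKV matrix, as LitGPT stores it
qkv = torch.cat((q, k, v), dim=0)

# ...and split back for the LitGPT --> HF direction
q2, k2, v2 = qkv.split(4096, dim=0)
assert torch.equal(q, q2) and torch.equal(k, k2) and torch.equal(v, v2)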

You also need to port the tests from the original PR (forgot to mention it, my bad). They check numerical equivalence after the conversion; if they pass, the conversion works without issues.
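Conceptually, those tests boil down to something like this (only a sketch: convert_hf_to_lit stands in for the real conversion helpers, and the config name assumes the one added in this PR):

import torch
from transformers import AutoModelForCausalLM
from litgpt import GPT, Config

theirs = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B-hf")
ours = GPT(Config.from_name("OLMo-1B-hf"))
ours.load_state_dict(convert_hf_to_lit(theirs.state_dict()))  # hypothetical helper

x = torch.randint(0, ours.config.vocab_size, (1, 8))
torch.testing.assert_close(ours(x), theirs(x).logits)  # numerical equivalence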


> but rather just loads directly if you point to the custom model name on HF

If I understood you correctly, the first step is to convert from Lit to HF format, then use HF tools for uploading weights to their hub.
I'm not that familiar with HF transformers, but I believe they provide an option to upload weights through CLI tools.
After weights are uploaded, you can just provide repo_name/model_name and it should work.
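For example, something along these lines with the huggingface_hub CLI (untested on my side; the repo name is just a placeholder):

huggingface-cli login
huggingface-cli upload your-username/OLMo-1B-litgpt-finetuned path/to/converted_checkpoint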


Bottom line: after weights are converted, they can be used by HF transformers like any other weights on their platform.
Whether you want to just load them into transformers, upload them to the Hub, or quantize them with any tool that expects the HF weight format, it should work.

If it doesn't, please open an issue 😉.

@rasbt (Collaborator) left a comment:


This looks really good and clean!

Review threads (outdated, resolved): README.md, tutorials/download_model_weights.md
@aflah02 (Contributor, Author) commented Nov 8, 2024

@Andrei-Aksionov @rasbt
When I copy over the tests for checkpoint conversion, I should use copy_weights_llama, right? As noted here, the old OLMo conversion code is just a special case of copy_weights_llama. I am already doing that for the model test case.

@aflah02 (Contributor, Author) commented Nov 8, 2024

Also, for now I've added both 7B and 7B Instruct to the tests, but since they're architecturally identical, I guess we can remove one, right?

@Andrei-Aksionov (Collaborator) commented:

> When I copy over the tests for checkpoint conversion, I should use copy_weights_llama, right?

Correct.

> Also, for now I've added both 7B and 7B Instruct to the tests, but since they're architecturally identical, I guess we can remove one, right?

Also correct.
The difference should only be in the training dataset and the training process; the architecture should be the same.


Speaking of training.
I guess the next step is to add training recipes. Is that correct, @rasbt?

@rasbt (Collaborator) commented Nov 8, 2024

> Speaking of training.
> I guess the next step is to add training recipes. Is that correct, @rasbt?

Historically we added the YAML config files, but this is not strictly necessary because the runs can take several hours and would be expensive for contributors. So please feel free to skip this; someone else from the team can generate those.

But I suggest doing a quick test run with LoRA to make sure it at least seems to converge.

The other thing: I would also do a quick sanity check with litgpt generate and litgpt chat, maybe with the prompt "What do Llamas eat?". You can post the responses here in the GitHub PR.
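For example (assuming the checkpoint was fetched with litgpt download; exact invocation from memory, so double-check the flags):

litgpt generate allenai/OLMo-7B-hf --prompt "What do Llamas eat?"
litgpt chat allenai/OLMo-7B-hf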

Thanks for all the work on this PR so far!

@aflah02 (Contributor, Author) commented Nov 8, 2024

> > Speaking of training.
> > I guess the next step is to add training recipes. Is that correct, @rasbt?
>
> Historically we added the YAML config files, but this is not strictly necessary because the runs can take several hours and would be expensive for contributors. So please feel free to skip this; someone else from the team can generate those.
>
> But I suggest doing a quick test run with LoRA to make sure it at least seems to converge.

I'm looking at pretraining OLMo-based models, so I'll be happy to work on the YAML config files as well. I'll try to run some over the weekend/early next week and share the results here.

> The other thing: I would also do a quick sanity check with litgpt generate and litgpt chat, maybe with the prompt "What do Llamas eat?". You can post the responses here in the GitHub PR.

I did run these but pasted them in the old PR instead. You can find them here: #927 (comment)

@aflah02 (Contributor, Author) commented Nov 8, 2024

I am thinking of training a TinyOLMo model, same as TinyLlama but with a smaller dataset. I copied over the config and made minor changes.
Does this work, and are there any recommendations for a smaller dataset? I was thinking of using the MicroLlama dataset.

@aflah02 (Contributor, Author) commented Nov 8, 2024

Also, should I continue this as part of this PR? Or should this be merged first, since the pretraining tests might take longer and I can do a few quick finetuning tests?

@aflah02 (Contributor, Author) commented Nov 8, 2024

Also, there seem to be quite a few other OLMo versions. Should I add them in? See: https://huggingface.co/collections/allenai/olmo-suite-65aeaae8fe5b6b2122b46778

@aflah02 (Contributor, Author) commented Nov 8, 2024

One final thing I just realized: the names for the OLMo models are very slightly different, e.g. OLMo-7B-hf (on HF) vs. OLMo-7b-hf (in the PR). It seems HF is not case-sensitive, as both work, but I guess keeping the same name as HF would be better? Just confirming, as I copied the names from the old PR, which have a different case from those on HF.

@rasbt (Collaborator) commented Nov 8, 2024

> I'm looking at pretraining OLMo-based models, so I'll be happy to work on the YAML config files as well. I'll try to run some over the weekend/early next week and share the results here.

Thanks! I'd say test that the YAML files work, but no need to wait until the run finishes; the numbers won't be comparable to those in the existing table because they are based on different hardware.

> I did run these but pasted them in the old PR instead. You can find them here

They look good!

> Also, should I continue this as part of this PR? Or should this be merged first, since the pretraining tests might take longer and I can do a few quick finetuning tests?

Oh, you don't need to do a complete pretraining run; just run it briefly to make sure there are no bugs.

> Also, there seem to be quite a few other OLMo versions. Should I add them in? See: https://huggingface.co/collections/allenai/olmo-suite-65aeaae8fe5b6b2122b46778

Good suggestion, but this would add too much clutter in my opinion.

A user who is interested in those can already download them via

litgpt download allenai/OLMo-1B-0724-hf \
  --model_name OLMo-1B-hf

> One final thing I just realized: the names for the OLMo models are very slightly different, e.g. OLMo-7B-hf (on HF) vs. OLMo-7b-hf (in the PR

Good catch. I am just seeing that we've been inconsistent with this in the config file. I would just use uppercase B as in the llama models.

Thanks for all your attention to detail here!

(Btw, I will be out of office after today, so please feel free to merge when it is ready, @Andrei-Aksionov.)

@Andrei-Aksionov (Collaborator) commented:

@rasbt I don't think I have permission to merge a PR.
Definitely not to override your change request.

@aflah02 PR looks good.
The remaining part is:

> One final thing I just realized: the names for the OLMo models are very slightly different, e.g. OLMo-7B-hf (on HF) vs. OLMo-7b-hf (in the PR). It seems HF is not case-sensitive, as both work, but I guess keeping the same name as HF would be better? Just confirming, as I copied the names from the old PR, which have a different case from those on HF.


Training recipes can be done in a separate PR (if you want to do it).

@aflah02 (Contributor, Author) commented Nov 12, 2024

Hi @Andrei-Aksionov
I'll make the change and also set up the pretraining run to start tomorrow. In the meantime, I ran a few small finetuning experiments by copying over the config from another model: https://wandb.ai/aflah/finetune-OLMo-7b-hf/overview

@aflah02 (Contributor, Author) commented Nov 13, 2024

Hi @Andrei-Aksionov
I was trying to create a config that mimics the setting from the OLMo paper (see Table 5, page 17: https://aclanthology.org/2024.acl-long.841.pdf), but some args seem to be missing from the litgpt args file (https://github.com/Lightning-AI/litgpt/blob/main/litgpt/args.py).

I then looked at the TinyLlama config for reference and found the initial learning rate under the optimizer header, but I can't find any docs on what else is supported there (scheduler etc.) and how I can specify the scheduler to use.
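For context, the optimizer header follows the class_path/init_args convention; as far as I can tell it looks something like this (values illustrative, not the actual TinyLlama numbers):

optimizer:
  class_path: torch.optim.AdamW
  init_args:
    lr: 4.0e-4
    weight_decay: 0.1
    betas: [0.9, 0.95]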

@aflah02 (Contributor, Author) commented Nov 13, 2024

Also, maybe a separate PR would be best for the pretraining, since it will take some back-and-forth discussion; can you merge this one first?

@Andrei-Aksionov (Collaborator) commented:

Hello @aflah02

I'll merge the PR.
Thanks for pushing it over the line! 🎉


> I then looked at the TinyLlama config for reference and found the initial learning rate under the optimizer header, but I can't find any docs on what else is supported there (scheduler etc.) and how I can specify the scheduler to use.

There is only one LR scheduler: linear warmup followed by cosine annealing.

def get_lr_scheduler(optimizer, warmup_steps: int, max_steps: int):
    # linear warmup followed by cosine annealing
    scheduler1 = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: step / warmup_steps)
    scheduler2 = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=(max_steps - warmup_steps))
    return torch.optim.lr_scheduler.SequentialLR(optimizer, [scheduler1, scheduler2], milestones=[warmup_steps])
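It is used in the standard PyTorch way, roughly like this (numbers hypothetical):

import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
scheduler = get_lr_scheduler(optimizer, warmup_steps=2000, max_steps=100_000)
for step in range(100_000):
    # ... forward/backward and optimizer.step() happen here ...
    scheduler.step()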

I think it's fine if OLMo uses this scheduler.

Andrei-Aksionov merged commit 22528bf into Lightning-AI:main on Nov 13, 2024
9 checks passed
@aflah02 (Contributor, Author) commented Nov 13, 2024

Thanks! I didn't do much tbh since the code was pretty much all there already :)

For the scheduler, what is the recommended route to use a custom one? I want to use the plain linear scheduler for some future OLMo runs, as that is what the paper used.

@Andrei-Aksionov (Collaborator) commented:

> Thanks! I didn't do much tbh since the code was pretty much all there already :)

[image]

🤓


> For the scheduler, what is the recommended route to use a custom one? I want to use the plain linear scheduler for some future OLMo runs, as that is what the paper used.

I think there is no need for now to add the exact same scheduler as in the paper.
The codebase should be feature-rich and up-to-date, but also easy to read and not overly complicated (especially if there is no particular reason for an addition).

Andrei-Aksionov mentioned this pull request on Nov 13, 2024
@aflah02 (Contributor, Author) commented Nov 14, 2024

> [image]
>
> 🤓

Haha thanks!

> For the scheduler, what is the recommended route to use a custom one? I want to use the plain linear scheduler for some future OLMo runs, as that is what the paper used.

> I think there is no need for now to add the exact same scheduler as in the paper. The codebase should be feature-rich and up-to-date, but also easy to read and not overly complicated (especially if there is no particular reason for an addition).

One reason I worked on integrating OLMo is that my colleagues and I are planning to run some pretraining experiments with the OLMo architecture. We want to follow the AI2 paper as closely as possible, including their choice of scheduler, to assess stability when using litgpt. Hence, we would strongly prefer to use the linear scheduler they used over a 2-phase scheduler.

@Andrei-Aksionov (Collaborator) commented:

I see.
Then you need to create something like this, but for schedulers.
It should parse the config and instantiate the corresponding class.
But, in order to preserve the current functionality, it should be able to parse a list of schedulers.
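A rough sketch of the idea (all names hypothetical, not existing litgpt code; the "milestone" key is an assumed convention for when each subsequent scheduler takes over):

import importlib

import torch


def _instantiate(optimizer, cfg):
    # cfg example: {"class_path": "torch.optim.lr_scheduler.LinearLR", "init_args": {...}}
    module, _, name = cfg["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module), name)
    return cls(optimizer, **cfg.get("init_args", {}))


def get_lr_scheduler(optimizer, cfg):
    # a single dict yields one scheduler (e.g. a plain linear schedule);
    # a list reproduces the current warmup-then-cosine chaining via SequentialLR
    if isinstance(cfg, list):
        schedulers = [_instantiate(optimizer, c) for c in cfg]
        milestones = [c["milestone"] for c in cfg[1:]]
        return torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers, milestones=milestones)
    return _instantiate(optimizer, cfg)

With that, a config like {"class_path": "torch.optim.lr_scheduler.LinearLR", "init_args": {"start_factor": 1.0, "end_factor": 0.1, "total_iters": 100_000}} would give you a plain linear schedule.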

@aflah02 (Contributor, Author) commented Nov 15, 2024

> I see. Then you need to create something like this, but for schedulers. It should parse the config and instantiate the corresponding class. But, in order to preserve the current functionality, it should be able to parse a list of schedulers.

Thanks! I'll take a stab at this.
