Adding OLMo #1827
Conversation
I know this is still WIP, but this looks good to me so far!
Makes sense! One follow-up question: post-training, to export the lit checkpoint to an HF-compatible OLMo class (https://huggingface.co/docs/transformers/en/model_doc/olmo#transformers.OlmoForCausalLM), don't we need custom export logic to assign this class to the exported model? The reverse conversion (HF to lit checkpoint) piggybacks on the llama code, but how will the other direction map the model to the right class?
Ah, never mind, I just saw point 5 in this tutorial: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md#a-finetuning-and-conversion-tutorial. Is this the expected loading code post conversion (see the sketch below)?
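(For context, a minimal sketch of what that loading step could look like, based on my reading of the tutorial; the checkpoint path and model id below are placeholders, not values from this PR:)

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder path to the checkpoint produced by the LitGPT -> HF conversion
state_dict = torch.load("out/converted/model.pth")

# Load the converted weights on top of the matching HF architecture
# (the model id here is a placeholder for the corresponding base model)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B-hf",
    state_dict=state_dict,
)
```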
If yes, is there a way to convert it into a format that does not require this approach but instead loads directly when you point to the custom model name on HF? Or does that also just work?
We have scripts that convert weights from HF --> LitGPT and LitGPT --> HF. Conversion is basically a renaming of layer names and, if needed, splitting/combining attention layer weights. This is needed because LitGPT uses a combined QKV matrix, while some models have separate Q, K, and V matrices. You also need to port the tests from the original PR (I forgot to mention it, my bad). They check numerical equivalence after the conversion; if they pass, the conversion works without issues.
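(To illustrate the splitting/combining part, here is a rough sketch of combining separate Q, K, and V projections into one QKV matrix; the layer names and shapes are made up for illustration and are not the actual conversion-script keys, and the real script may also need to interleave weights per attention head:)

```python
import torch

# Hypothetical per-layer weights in an HF-style checkpoint (names are illustrative only)
hf_weights = {
    "model.layers.0.self_attn.q_proj.weight": torch.randn(4096, 4096),
    "model.layers.0.self_attn.k_proj.weight": torch.randn(4096, 4096),
    "model.layers.0.self_attn.v_proj.weight": torch.randn(4096, 4096),
}

# Combine Q, K, and V into one stacked matrix, as a combined-QKV layout expects
qkv = torch.cat(
    [
        hf_weights["model.layers.0.self_attn.q_proj.weight"],
        hf_weights["model.layers.0.self_attn.k_proj.weight"],
        hf_weights["model.layers.0.self_attn.v_proj.weight"],
    ],
    dim=0,
)

# The reverse direction simply splits the combined matrix back into three parts
q, k, v = qkv.split(4096, dim=0)
```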
If I understood you correctly, the first step is to convert from Lit to HF format, then use HF tools for uploading the weights to their hub. Bottom line: after the weights are converted, they can be used by HF transformers like any other weights on their platform. If they can't, please open an issue 😉.
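(For the uploading step, a minimal sketch using huggingface_hub, assuming the converted weights and config live in a local folder; the folder path and repo id are placeholders:)

```python
from huggingface_hub import HfApi

api = HfApi()

# Create the target repo if it doesn't exist yet (repo id is a placeholder)
api.create_repo("your-username/olmo-litgpt-export", exist_ok=True)

# Upload the folder containing the converted weights and config (placeholder path)
api.upload_folder(
    folder_path="out/converted",
    repo_id="your-username/olmo-litgpt-export",
    repo_type="model",
)
```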
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good and clean!
@Andrei-Aksionov @rasbt
Also, for now I've added both 7B and 7B Instruct to the tests, but since they're architecturally identical, I guess we can remove one, right?
Correct.
Also correct. Speaking of training:
Historically we added the yaml config files, but this is not strictly necessary because it can take several hours to run and would be expensive for contributors, so please feel free to skip it. Someone else from the team can generate those. But I suggest doing a quick test run with LoRA to make sure it seems to converge at least. The other thing is that I would also do a quick sanity check. Thanks for all the work on this PR so far!
I'm looking at pretraining OLMo-based models, so I'll be happy to work on the yaml config files as well. I'll try to run some over the weekend/early next week and share the results here.
I did run these but pasted them in the old PR instead. You can find them here: #927 (comment)
I am thinking of training a TinyOLMo model, the same as TinyLlama but with a smaller dataset. I copied over the config and made minor changes.
Also, should I continue this as part of this PR, or should this be merged first, since the pretraining tests might take longer and I can do a few quick finetuning tests in the meantime?
Also, there seem to be quite a few other OLMo versions too. Should I add them in? See: https://huggingface.co/collections/allenai/olmo-suite-65aeaae8fe5b6b2122b46778
One final thing I just realized: the names of the OLMo models are very slightly different, e.g. OLMo-7B-hf (on HF) vs. OLMo-7b-hf (in the PR). It seems HF is not case sensitive, as both work, but I guess keeping the same name as HF would be better? Just confirming, as I copied the names from the old PR, which use a different case than on HF.
Thanks! I'd say test that the yaml files work, but no need to wait until the run finishes. The reason is that the numbers won't be comparable to those in the existing table because they are based on different hardware.
They look good!
Oh, you don't need to do a complete pretraining run; just run it briefly to make sure there are no bugs.
Good suggestion, but this would add too much clutter in my opinion. A user who is interested in those can already download them via
Good catch. I just noticed that we've been inconsistent with this in the config file. I would just use an uppercase B as in the Llama models. Thanks for all your attention to detail here! (Btw, I will be out of office after today, so in this case please feel free to merge when it is ready, @Andrei-Aksionov.)
@rasbt I don't think I have permission to merge a PR. @aflah02 The PR looks good.
Training recipes can be done in a separate PR (if you want to do them).
Hi @Andrei-Aksionov, I then looked at the tinyllama config for reference and could find the initial learning rate under the optimizer header, but I can't find any docs on what else is supported there, like the scheduler, and how to specify which scheduler to use.
Also, maybe a separate PR would be best for the pretraining, since it will take some back-and-forth discussion, so can you merge this one first?
Hello @aflah02, I'll merge the PR.
There is only one LR scheduler: linear followed by cosine (litgpt/litgpt/finetune/full.py, lines 383 to 387 in 33eab00).
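(For reference, a sketch of what such a linear-warmup-then-cosine schedule looks like in plain PyTorch; this illustrates the idea rather than the exact code at the referenced lines, and the model, learning rate, and step counts are placeholders:)

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in model for the example
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps, max_steps = 100, 1000  # placeholder values

# Linear warmup from ~0 up to the base learning rate
warmup = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)
# Cosine annealing for the remaining steps
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=max_steps - warmup_steps
)
# Chain them: switch from warmup to cosine at `warmup_steps`
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps]
)
```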
I think it's fine if OLMo uses this scheduler.
Thanks! I didn't do much, tbh, since the code was pretty much all there already :) For the scheduler, what is the recommended route to use a custom scheduler? I want to use a plain linear scheduler for some future OLMo runs, as that is what the paper also used.
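(For what it's worth, a plain linear schedule of the kind described can be sketched with the same PyTorch setup as above; this says nothing about how litgpt exposes custom schedulers, and the step counts are placeholders:)

```python
def linear_schedule(step: int) -> float:
    # Linear warmup, then linear decay down to 0 by `max_steps`
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (max_steps - step) / (max_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, linear_schedule)
```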
I see.
Thanks! I'll take a stab at this.
PR to support OLMo. Heavily borrows from #927
~~Still WIP~~ Core code changes done. Working on training recipes.