
Number of tokens OLMo-1B was trained on: 2T or 3T? #697

Closed
jiyeonkimd opened this issue Aug 9, 2024 · 1 comment
Labels
type/question An issue that's a question

Comments


jiyeonkimd commented Aug 9, 2024

❓ The question

Hi team,
First, I would like to express my gratitude for the hard work involved in developing OLMo and making everything publicly available.
I have a question about the 1B model. The paper says it was trained on up to 2 trillion tokens, but the GitHub repository indicates it was trained on up to 3 trillion tokens. Since all model checkpoints up to 3 trillion tokens are publicly available, should we consider the training beyond 2 trillion tokens a second epoch? The paper describes the Dolma dataset as having 2 trillion tokens, but on GitHub there is only a data order file for epoch 1, which is confusing.

Additionally, I would like to know the exact step when the first epoch ends. The train.pt file for the 7B model explicitly indicates when it reaches the second epoch, but this is not specified for the 1B models. Could you clarify this?
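For context, here is roughly how I have been inspecting the checkpoint; this is a minimal sketch that assumes train.pt is a standard torch-serialized dictionary, and the key names checked below are guesses on my part rather than the documented OLMo trainer-state schema:

```python
# Minimal sketch: inspect a trainer-state checkpoint for epoch/step counters.
# Assumes train.pt is a torch-serialized dict; the key names below are guesses.
import torch

state = torch.load("train.pt", map_location="cpu", weights_only=False)

# List everything the trainer recorded.
print(sorted(state.keys()))

# Print any counters that look like epoch/step/token bookkeeping, if present.
for key in ("epoch", "global_step", "global_train_examples_seen", "global_train_tokens_seen"):
    if key in state:
        print(key, "=", state[key])
```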

Thank you in advance!

jiyeonkimd added the type/question label on Aug 9, 2024
2015aroras (Collaborator) commented

Hi,
The 2T in the paper is a typo. Thank you for pointing it out!

I'm not as familiar with the data side of things, but according to https://huggingface.co/datasets/allenai/dolma, we used a 2T-token sample of Dolma for the 7B model and a 3T-token version of Dolma for the 1B model. Thus there would be no second epoch for OLMo 1B.
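As a rough sanity check on the single-epoch point, here is a back-of-the-envelope sketch; the sequence length and global batch size are assumed values (roughly a 4M-token global batch), not figures confirmed in this thread, so substitute the actual run configuration:

```python
# Back-of-the-envelope: how many optimizer steps a ~3T-token run implies,
# and why a ~3T-token Dolma sample means a single pass (one epoch) for OLMo 1B.
# The two values below are assumptions; replace them with the real run config.
sequence_length = 2048            # tokens per training sequence (assumed)
global_batch_size = 2048          # sequences per optimizer step (assumed)

tokens_per_step = sequence_length * global_batch_size   # ~4.2M tokens/step
total_tokens = 3_000_000_000_000                          # ~3T tokens for OLMo 1B

steps = total_tokens / tokens_per_step
print(f"{tokens_per_step:,} tokens/step -> ~{steps:,.0f} steps for ~3T tokens")
```

If the dataset sample itself contains roughly 3T tokens, those steps cover it about once, which matches the "no second epoch" answer above.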
