Question about pretraining. #7

Open

TeddLi opened this issue May 2, 2024 · 9 comments

TeddLi commented May 2, 2024

I attached my training loss below. Our data mixture follows LLM360's paper, except that we use less StarCoder data.
Each training epoch contains 30B tokens of arXiv, 57B of Books, 197.67B of C4, 665.01B of RefinedWeb, 150B of StarCoder, 21.75B of StackExchange, and 23.90B of Wikipedia.
The hyperparameters are the same as LLM360 demonstrated, except that max_seq_len is 4096 instead of 2048 and the tokenizer is the GPT tokenizer.
We are running the experiment with an open-source repo on H100 nodes with a global batch size of 2048.
Currently our model only achieves around 10.5 PPL on the Falcon dataset, which is much worse than LLM360's Amber model (around 8 PPL) and LLaMA-2 (around 8 PPL).
Just wondering what the possible reason could be that our model performs so much worse?
[training loss curve attached as image]
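For reference, those per-source counts translate into the following sampling proportions per epoch; a minimal sketch of the arithmetic (the variable names are illustrative, not taken from our training code):

```python
# Per-epoch token counts (in billions of tokens) from the mixture listed above.
token_counts_b = {
    "arxiv": 30.0,
    "book": 57.0,
    "c4": 197.67,
    "refined_web": 665.01,
    "starcoder": 150.0,
    "stackexchange": 21.75,
    "wikipedia": 23.90,
}

total_b = sum(token_counts_b.values())  # ~1145.33B tokens per epoch
weights = {name: count / total_b for name, count in token_counts_b.items()}

for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name:>13}: {w:.1%}")
# RefinedWeb dominates at ~58%, C4 ~17%, StarCoder ~13%; the rest are a few percent each.
```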

hunterhector (Contributor)

Your data preprocessing steps could be relevant here as well. How do you preprocess and mix your data?

TeddLi (Author) commented May 2, 2024

> Your data preprocessing steps could be relevant here as well. How do you preprocess and mix your data?

I am using the default mixing from https://github.com/jzhang38/TinyLlama.
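Roughly, that mixing amounts to drawing each training sequence from a source in proportion to a weight; a minimal sketch of the idea, assuming a simple weighted draw (illustrative only, TinyLlama's actual pipeline may pack and interleave shards differently):

```python
import random

# Illustrative per-source sampling weights (derived from the token counts above);
# TinyLlama's real preprocessing may differ in detail.
SOURCE_WEIGHTS = {
    "refined_web": 0.581,
    "c4": 0.173,
    "starcoder": 0.131,
    "book": 0.050,
    "arxiv": 0.026,
    "wikipedia": 0.021,
    "stackexchange": 0.019,
}

def next_source(rng: random.Random) -> str:
    """Pick the data source for the next training sequence, proportional to its weight."""
    names = list(SOURCE_WEIGHTS)
    return rng.choices(names, weights=[SOURCE_WEIGHTS[n] for n in names], k=1)[0]

rng = random.Random(0)
print([next_source(rng) for _ in range(8)])  # mostly 'refined_web', as expected
```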

TeddLi (Author) commented May 2, 2024

Just wondering if the spiky loss will affect the final performance?

hunterhector (Contributor)

If the loss curve isn't stable, it might suggest something went wrong during training; yes, it is likely to indicate a suboptimal outcome.
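If the spikes come from occasional bad batches, gradient-norm clipping is the usual first guard; a minimal PyTorch-style sketch of where it sits in a step (the `model`, `batch`, and `optimizer` names are placeholders, and your training repo may already do this):

```python
import torch

def training_step(model, batch, optimizer, max_grad_norm: float = 1.0) -> float:
    """One optimization step with global gradient-norm clipping to damp loss spikes."""
    optimizer.zero_grad(set_to_none=True)
    loss = model(**batch).loss  # assumes a HuggingFace-style forward that returns .loss
    loss.backward()
    # Clip the global gradient norm so a single outlier batch cannot blow up the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```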

TeddLi (Author) commented May 2, 2024

> If the loss curve isn't stable, it might suggest something went wrong during training; yes, it is likely to indicate a suboptimal outcome.

Just wondering how the tokenizer affects the final results; the GPT tokenizer has a much larger vocabulary.
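One concrete reason the tokenizer matters for the comparison: per-token perplexity is not directly comparable across tokenizers. The GPT-2 BPE vocabulary (~50k tokens) and the LLaMA tokenizer (~32k) split the same text into different numbers of tokens, so 10.5 PPL under one tokenizer and 8 PPL under another are not measuring quite the same quantity. A tokenizer-independent comparison normalizes by bytes instead; a rough sketch with made-up numbers:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats over the whole eval set)
    into bits per byte, which is comparable across tokenizers."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Purely illustrative numbers: the same 1 MB of eval text scored by two models
# that use different tokenizers. Per-token PPL differs partly because the token
# counts differ; bits/byte puts both runs on the same scale.
text_bytes = 1_000_000
runs = {
    "gpt tokenizer":   dict(tokens=230_000, ppl=10.5),
    "llama tokenizer": dict(tokens=260_000, ppl=8.0),
}
for name, r in runs.items():
    nll = r["tokens"] * math.log(r["ppl"])  # loss per token = ln(PPL)
    print(f"{name}: {bits_per_byte(nll, text_bytes):.3f} bits/byte")
```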

hunterhector (Contributor) commented May 2, 2024

What are the hyperparameters? Did you use the ones from TinyLlama, such as the learning rate? I am not very familiar with TinyLlama, but it seems they don't have a cooldown phase. Never mind, I misread their documentation.
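For reference on the cooldown point, these recipes generally use linear warmup followed by cosine decay; a minimal sketch of such a schedule (the peak LR, warmup length, and floor below are illustrative placeholders, not LLM360's or TinyLlama's exact settings):

```python
import math

def lr_at_step(step: int, max_steps: int,
               peak_lr: float = 3e-4,     # illustrative placeholder
               warmup_steps: int = 2000,  # illustrative placeholder
               min_lr: float = 3e-5) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: schedule shape over a short run.
print([round(lr_at_step(s, max_steps=10_000), 6) for s in (0, 1000, 2000, 6000, 10_000)])
```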

TeddLi (Author) commented May 2, 2024

Nope, I am using the ones demonstrated in LLM360.

TeddLi (Author) commented May 2, 2024

Also, will changing max_seq_len affect the final results?

hunterhector (Contributor)

> Also, will changing max_seq_len affect the final results?

I don't think so. It shouldn't affect things too much unless there is a bug related to it.
