
Initial Loss increased from 10 (0.3.0 v) to 60 (0.4.0) ! #678

Open
Xuekai-Zhu opened this issue Jul 29, 2024 · 10 comments
Labels
type/bug An issue about a bug

Comments

@Xuekai-Zhu commented Jul 29, 2024

🐛 Describe the bug

There is a significant discrepancy in initial loss values across OLMo versions and depending on whether the step-738020 checkpoint is loaded. This suggests a potential issue with model initialization or checkpoint handling in version 0.4.0. I believe the following results can be reproduced, since this bug has cost me a week.

Task:

  • Training from scratch / fine-tuning on BioMed

Results

  • OLMo v0.4.0: w/ step-738020 ckpt -- initial loss is 71

  • OLMo v0.4.0: w/o step-738020 ckpt -- initial loss is 32

  • OLMo v0.3.0: w/ step-738020 ckpt -- initial loss is 2

  • OLMo v0.3.0: w/o step-738020 ckpt -- initial loss is 11

[W&B chart: initial loss curves, 2024-07-29 22:12]

Versions

Build from source

  • OLMo v0.4.0
  • OLMo v0.3.0
@Xuekai-Zhu Xuekai-Zhu added the type/bug An issue about a bug label Jul 29, 2024
@Xuekai-Zhu Xuekai-Zhu changed the title Intitial Loss incread from 10 (0.3.0 v) to 60 (0.4.0) ! Initial Loss increased from 10 (0.3.0 v) to 60 (0.4.0) ! Jul 29, 2024
@Xuekai-Zhu (Author) commented Jul 29, 2024

This happens not only with the BioMed data; I see the same results with the data you provide.

@AkshitaB (Contributor)

@Xuekai-Zhu Can you say more on what you mean by the "presence or absence" of that checkpoint? And can you share the code you're using for loading?

@Xuekai-Zhu (Author)

I use the following command to run OLMo without modifying the source code.
Therefore, the default checkpoint-loading code of OLMo v0.3.0 and v0.4.0 is used.

torchrun --nproc_per_node=4 --master_port=29216 OLMo/scripts/train.py config/bio/OLMo-1B.yaml \
    --save_overwrite \
    --reset_trainer_state \
    --load_path=https://olmo-checkpoints.org/ai2-llm/olmo-small/g4g72enr/step738020-unsharded/

@Xuekai-Zhu (Author)

> @Xuekai-Zhu Can you say more on what you mean by the "presence or absence" of that checkpoint? And can you share the code you're using for loading?

You can refer directly to the image above.

Using or not using a pretrained checkpoint, as well as the version of OLMo, can result in different initial loss values.

@Xuekai-Zhu (Author)

Loosely speaking, v0.3.0 produces correct loss values, but the loss values in v0.4.0 are incorrect. In v0.4.0, using the pretrained checkpoint results in even higher loss values, which is clearly an error. There seems to be an issue with the loss calculation in the v0.4.0 code.
Because of the significant changes in this version, it's difficult for me to compare the two. Could you please take a look?
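
For anyone trying to narrow this down, one way to compare the two versions is to diff the training code between the release tags. A minimal sketch, assuming the releases are tagged v0.3.0 and v0.4.0 and that the loss is computed in olmo/train.py (both the tag names and the file paths here are assumptions):

# Diff the training/loss code between the two releases (tag names and paths are assumed).
cd OLMo
git fetch --tags
git diff v0.3.0 v0.4.0 -- olmo/train.py scripts/train.py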

@2015aroras (Collaborator)

Since you are building from source, it's possible that you were affected by the bug that was fixed in #680. Could you pull the commit and see if that fixes your issue?
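
A minimal sketch of picking up that fix when building from source (this assumes the repo was cloned and installed in editable mode; adjust the branch and install step to your setup):

# Pull the latest commits containing the fix from #680 and reinstall from source.
cd OLMo
git checkout main
git pull
pip install -e .  # assumes an editable install of the cloned repo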

@2015aroras (Collaborator)

I am seeing your issue locally now, and it is not fixed by #680. I am investigating.

@Xuekai-Zhu (Author)

Thank you very much!
I think this might be a rather urgent bug since it leads to training errors.
Reverting to version 0.3.0 works for me now.

@2015aroras (Collaborator) commented Aug 6, 2024

Upon further investigation, the instances of bad loss we observed outside of #680 were due to a bad setup (a bad container or an incorrect config).

In particular, I ran from a checkpoint while passing --force_save_unsharded --dry_run in order to get the model loaded into code and saved, without any training. Then I ran scripts/compare_model_state.py with the original checkpoint and the new checkpoint and saw that they were different. This suggested that something was corrupting model state before training even started. When doing the above in a healthy container, I saw no difference between the two checkpoints.
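
For reference, a minimal sketch of that check. The config path, the checkpoint locations, and the command-line interface of scripts/compare_model_state.py shown here are assumptions; adjust them to your setup:

# 1. Load the checkpoint and immediately re-save it unsharded, without any training steps.
torchrun --nproc_per_node=4 OLMo/scripts/train.py config/bio/OLMo-1B.yaml \
    --load_path=https://olmo-checkpoints.org/ai2-llm/olmo-small/g4g72enr/step738020-unsharded/ \
    --force_save_unsharded \
    --dry_run

# 2. Compare the original checkpoint with the freshly re-saved one.
#    (The positional-argument interface is an assumption; the re-saved path is a placeholder.)
python OLMo/scripts/compare_model_state.py \
    /path/to/original/step738020-unsharded \
    /path/to/resaved/checkpoint

If the two checkpoints differ before any optimizer step has run, something is corrupting model state at load or save time rather than during training.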

If you find out what's causing the issue for you in 0.4.0, please let us know. We will also update here if we run into the issue again.

@ah-nf commented Sep 24, 2024

Hello, I wanted to add to this since I am seeing significantly lower loss values when running the same training workload on 0.4.0 compared to 0.3.0. These tests were performed in the same environment, so I don't think that's a factor.

I looked at the commit history and saw that the change below was made to the loss calculation after 0.3.0. Do we think this could have anything to do with this behavior?

7146473
