
Initial Loss increased from 10 (0.3.0 v) to 60 (0.4.0) ! #678

Open
Xuekai-Zhu opened this issue Jul 29, 2024 · 10 comments
Labels
type/bug An issue about a bug

Comments

@Xuekai-Zhu commented Jul 29, 2024

🐛 Describe the bug

There is a significant discrepancy in initial loss values across OLMo versions and depending on whether the step-738020 checkpoint is loaded. This suggests a potential issue with model initialization or checkpoint handling in version 0.4.0. I believe the following results can be reproduced, since this bug has cost me a week.

Task:

  • Training from scratch / fine-tuning on BioMed

Results

  • OLMo v0.4.0: w/ step-738020 ckpt -- initial loss is 71

  • OLMo v0.4.0: w/o step-738020 ckpt -- initial loss is 32

  • OLMo v0.3.0: w/ step-738020 ckpt -- initial loss is 2

  • OLMo v0.3.0: w/o step-738020 ckpt -- initial loss is 11

[W&B chart: initial loss curves, 2024-07-29 22:12]

Versions

Build from source

  • OLMo v0.4.0
  • OLMo v0.3.0
@Xuekai-Zhu Xuekai-Zhu added the type/bug An issue about a bug label Jul 29, 2024
@Xuekai-Zhu Xuekai-Zhu changed the title Intitial Loss incread from 10 (0.3.0 v) to 60 (0.4.0) ! Initial Loss increased from 10 (0.3.0 v) to 60 (0.4.0) ! Jul 29, 2024
@Xuekai-Zhu (Author) commented Jul 29, 2024

This happens not only with the BioMed data; I see the same results with the data you provide.

@AkshitaB (Contributor)

@Xuekai-Zhu Can you say more on what you mean by the "presence or absence" of that checkpoint? And can you share the code you're using for loading?

@Xuekai-Zhu (Author)

I use the following command to run OLMo without modifying the source code.
Therefore, the default checkpoint-loading code of OLMo v0.3.0 and v0.4.0 is used.

torchrun --nproc_per_node=4 --master_port=29216 OLMo/scripts/train.py config/bio/OLMo-1B.yaml \
    --save_overwrite \
    --reset_trainer_state \
    --load_path=https://olmo-checkpoints.org/ai2-llm/olmo-small/g4g72enr/step738020-unsharded/

@Xuekai-Zhu (Author)

> @Xuekai-Zhu Can you say more on what you mean by the "presence or absence" of that checkpoint? And can you share the code you're using for loading?

You can refer directly to the image above.

Using or not using a pretrained checkpoint, as well as the version of OLMo, can result in different initial loss values.

@Xuekai-Zhu (Author)

Loosely speaking, v0.3.0 produces correct loss values, but the loss values in v0.4.0 are incorrect. In v0.4.0, using the pretrained checkpoint results in even higher loss values, which is clearly an error. There seems to be an issue with the loss calculation in the v0.4.0 code.
Because of the significant changes in this version, it's difficult for me to compare the two. Could you please take a look?
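
For anyone trying to narrow this down, one way to compare the two versions is to diff the training code between the release tags. A minimal sketch, assuming the releases are tagged v0.3.0 and v0.4.0 and that the loss is computed in olmo/train.py (both the tag names and the file paths here are assumptions):

# Diff the training/loss code between the two releases (tag names and paths are assumed).
cd OLMo
git fetch --tags
git diff v0.3.0 v0.4.0 -- olmo/train.py scripts/train.py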

@2015aroras (Collaborator)

Since you are building from source, it's possible that you were affected by the bug that was fixed in #680. Could you pull the commit and see if that fixes your issue?
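
A minimal sketch of picking up that fix when building from source (this assumes the repo was cloned and installed in editable mode; adjust the branch and install step to your setup):

# Pull the latest commits containing the fix from #680 and reinstall from source.
cd OLMo
git checkout main
git pull
pip install -e .  # assumes an editable install of the cloned repo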

@2015aroras (Collaborator)

I am seeing your issue locally now, and it is not fixed by #680. I am investigating.

@Xuekai-Zhu (Author)

Thank you very much!
I think this might be a rather urgent bug since it leads to training errors.
Reverting to version 0.3.0 works for me now.

@2015aroras (Collaborator) commented Aug 6, 2024

Upon further investigation, the instances of bad loss we observed outside of #680 were due to a bad setup (a bad container or an incorrect config).

In particular, I ran from a checkpoint while passing --force_save_unsharded --dry_run in order to get the model loaded into code and saved, without any training. Then I ran scripts/compare_model_state.py with the original checkpoint and the new checkpoint and saw that they were different. This suggested that something was corrupting model state before training even started. When doing the above in a healthy container, I saw no difference between the two checkpoints.
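
For reference, a minimal sketch of that check. The config path, the checkpoint locations, and the command-line interface of scripts/compare_model_state.py shown here are assumptions; adjust them to your setup:

# 1. Load the checkpoint and immediately re-save it unsharded, without any training steps.
torchrun --nproc_per_node=4 OLMo/scripts/train.py config/bio/OLMo-1B.yaml \
    --load_path=https://olmo-checkpoints.org/ai2-llm/olmo-small/g4g72enr/step738020-unsharded/ \
    --force_save_unsharded \
    --dry_run

# 2. Compare the original checkpoint with the freshly re-saved one.
#    (The positional-argument interface is an assumption; the re-saved path is a placeholder.)
python OLMo/scripts/compare_model_state.py \
    /path/to/original/step738020-unsharded \
    /path/to/resaved/checkpoint

If the two checkpoints differ before any optimizer step has run, something is corrupting model state at load or save time rather than during training.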

If you find out what's causing the issue for you in 0.4.0, please let us know. We will also update here if we run into the issue again.

@ah-nf commented Sep 24, 2024

Hello, I wanted to add to this since I am seeing significantly lower loss values when running the same training workload on 0.4.0 compared to 0.3.0. These tests were performed in the same environment, so I don't think that's a factor.

I looked at the commit history and saw that the change below was made to the loss calculation after 0.3.0. Do we think this could have anything to do with this behavior?

7146473
