[Badcase]: loss unstable #1074
Comments
If by unstable you mean slight variations in losses across different runs, that is normal, because there are sources of randomness other than the pseudo-random number generator, which is what random seeds control. See https://pytorch.org/docs/stable/notes/randomness.html for reference. If by unstable you mean that the loss fluctuates a lot, that is not expected, and there are many things that could cause it.
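As a minimal sketch of the settings described in the linked PyTorch notes (the seed value is a placeholder and the exact flags depend on your PyTorch/CUDA versions):

```python
import random

import numpy as np
import torch

SEED = 42  # hypothetical seed value

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)  # seeds CPU and all visible CUDA devices

# Seeding alone is not enough: nondeterministic kernels must also be disabled.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True)  # raises an error on nondeterministic ops
# On CUDA >= 10.2, the CUBLAS_WORKSPACE_CONFIG environment variable (e.g. ":4096:8")
# must also be set for some GEMM calls to become deterministic.
```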
@jklj077 Thanks for your reply😀.
There are sources of randomness other than the pseudo-random number generator, which is what random seeds control. See https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms for reference. Background on accuracy problems with floating-point numbers can be found at https://en.wikipedia.org/wiki/Floating-point_arithmetic#Accuracy_problems. Since you are using …
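To illustrate the floating-point accuracy problem referenced above (a toy example, not taken from this issue): summation order changes the result, so a kernel that reduces in a different order can produce a slightly different loss even with identical seeds.

```python
a = [1e16, 1.0, -1e16]
b = [1e16, -1e16, 1.0]
print(sum(a))  # 0.0 -- the 1.0 is absorbed by 1e16 before the cancellation
print(sum(b))  # 1.0 -- the large terms cancel first, so the 1.0 survives
```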
Thanks for your reply, I think I found a way to deal with it. In transformers, the Qwen series models use the ‘sdpa’ attention implementation by default. When I switch the Qwen attention mechanism back to ‘eager’, the loss remains stable (without any randomness). If the ‘sdpa’ attention is used, the loss differs across runs. This may only be a temporary solution, because reproducibility is very important to me.
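For reference, a minimal sketch of pinning the attention backend when loading the model (the model id matches the one reported in this issue; everything else is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    attn_implementation="eager",  # avoid the sdpa/flash kernels for reproducibility
)
```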
Model Series
Qwen2.5
What are the models used?
Qwen2.5-0.5B-Instruct
What is the scenario where the problem happened?
Training Qwen2.5-0.5B-Instruct with the transformers library as the language model of a vision-language model.
Is this badcase known and can it be solved using available techniques?
Information about environment
I use Qwen2.5 as the LM of a vision-language model to perform SFT, but I find that under the same environment and command, the loss at a given iteration differs across runs, even though my seed is fixed. Is this normal? If not, how can I troubleshoot this instability?
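One hedged way to narrow this down (a sketch only; the model id is the one reported above, while the prompt, device, and the rest are assumptions): run the same batch through the model twice in the same process and compare the losses. If they already differ, the nondeterminism comes from the forward kernels rather than from seeding or data order.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # the model reported above
device = "cuda"  # assumes a GPU; the nondeterministic kernels live there

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to(device)

batch = tokenizer("hello world", return_tensors="pt").to(device)
labels = batch["input_ids"]

with torch.no_grad():
    loss_a = model(**batch, labels=labels).loss
    loss_b = model(**batch, labels=labels).loss

# A non-zero difference here points at the attention/matmul kernels,
# not at the random seed or the data loader.
print(loss_a.item(), loss_b.item(), (loss_a - loss_b).abs().item())
```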
Description
Steps to reproduce
This happens to Qwen2.5-xB-Instruct-xxx and xxx.
The badcase can be reproduced with the following steps:
The following example input & output can be used:
Expected results
The results are expected to be ...
Attempts to fix
I have tried several ways to fix this, including:
Anything else helpful for investigation
I find that this problem also happens to ...