I use Qwen2.5 as the LM of a vision-language model for SFT, but I find that, under the same environment and command, the loss at the same iteration differs from run to run. My seed is fixed. Is this normal? If not, how can I troubleshoot this instability?
Description
Steps to reproduce
This happens to Qwen2.5-xB-Instruct-xxx and xxx.
The badcase can be reproduced with the following steps:
...
...
The following example input & output can be used:
system: ...
user: ...
...
Expected results
The results are expected to be ...
Attempts to fix
I have tried several ways to fix this, including:
adjusting the sampling parameters, but ...
prompt engineering, but ...
Anything else helpful for investigation
I find that this problem also happens to ...
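As a rough illustration of how the mismatch can be observed (not from the original report; the file names and helpers below are hypothetical), the per-iteration losses of two runs launched with the same command can be dumped and then diffed:

```python
# Hypothetical helpers: dump per-iteration losses from each run, then diff two runs.
import json

def dump_losses(losses, path):
    with open(path, "w") as f:
        json.dump([float(x) for x in losses], f)

def compare_runs(path_a, path_b, atol=1e-6):
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    for step, (la, lb) in enumerate(zip(a, b)):
        if abs(la - lb) > atol:
            print(f"step {step}: {la:.6f} vs {lb:.6f} (diff {abs(la - lb):.2e})")

# compare_runs("run_a_losses.json", "run_b_losses.json")
```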
If by unstable you mean slight variations in the loss across different runs: that is normal, because there are sources of randomness other than the pseudo-random number generators, which are what the random seeds control. See https://pytorch.org/docs/stable/notes/randomness.html for reference.
If by unstable you mean that the loss fluctuates a lot: that is not expected, and there are many things that could cause it.
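For reference, a generic sketch of how the controllable sources can be pinned down in PyTorch, following the linked randomness notes (not specific to Qwen2.5; the helper name and seed value are just for illustration):

```python
import os
import random

import numpy as np
import torch

# Set before any CUDA work so cuBLAS uses a deterministic workspace configuration.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(42)

# Raise an error on ops that only have non-deterministic kernels, and disable
# the cuDNN autotuner, which may pick different algorithms from run to run.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
```

Even with these settings, some kernels have no deterministic variant and will raise, and DataLoader workers carry their own RNG state (`worker_init_fn` / `generator`); the linked note covers both cases.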
@jklj077 Thanks for your reply😀.
In my research field, small models such as BART and T5 are commonly used.
When I plug these language models into my code, the losses across different runs do not change (identical values), so I believe the random seeds in my code are fixed properly.
However, when I switch the LM to Qwen, the loss is identical only in the first iteration. In later iterations, with a small lr (1e-5) the loss matches for some iterations while the rest deviate by roughly 0.01 to 0.1; with a larger lr (3e-4), all but the first few iterations differ, often by more than 0.1.
What could be causing this?
My code trains with DeepSpeed bf16 and does not use the Trainer from the transformers library.
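A minimal way to localize where the divergence starts (sketch only; `model` and `batch` are placeholders, and the output is assumed to expose `.loss` the way Hugging Face transformers models do when labels are provided) is to run the same forward/backward twice on one batch, before the optimizer or DeepSpeed touch anything:

```python
# Sketch: run the same forward/backward twice on one batch and compare bit-for-bit.
import torch

def loss_and_grad(model, batch, seed=0):
    torch.manual_seed(seed)                  # same dropout draws on both calls
    model.zero_grad(set_to_none=True)
    out = model(**batch)
    out.loss.backward()
    grads = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
    return out.loss.item(), grads.abs().sum().item()

# l1, g1 = loss_and_grad(model, batch)
# l2, g2 = loss_and_grad(model, batch)
# Identical numbers point at the optimizer / DeepSpeed / data side;
# different numbers mean the forward-backward itself is non-deterministic.
```

The pattern described above (identical first step, drift that grows with the learning rate) is also what one would expect if tiny numerical differences are being amplified through the weight updates rather than introduced by the data pipeline, which a check like this can help confirm or rule out.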
Model Series
Qwen2.5
What are the models used?
Qwen2.5-0.5B-Instruct
What is the scenario where the problem happened?
Training Qwen2.5-0.5B-Instruct with the transformers library as the LM of a vision-language model.
Is this badcase known and can it be solved using available techniques?
Information about environment