[Badcase]: loss unstable #1074

Open
4 tasks done
Solo4working opened this issue Nov 12, 2024 · 2 comments

@Solo4working

Model Series

Qwen2.5

What are the models used?

Qwen2.5-0.5B-Instruct

What is the scenario where the problem happened?

Training Qwen2.5-0.5B-Instruct with the transformers library as the language model of a vision-language model

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

Package                      Version
---------------------------- ------------
absl-py                      2.0.0
accelerate                   0.29.2
annotated-types              0.6.0
antlr4-python3-runtime       4.9.3
anyio                        4.2.0
appdirs                      1.4.4
argon2-cffi                  23.1.0
argon2-cffi-bindings         21.2.0
arrow                        1.3.0
asttokens                    2.4.1
astunparse                   1.6.3
async-lru                    2.0.4
attrdict                     2.0.1
attrs                        23.1.0
Babel                        2.14.0
beautifulsoup4               4.12.2
bitsandbytes                 0.43.1
bleach                       6.1.0
BLEURT                       0.0.2
blis                         0.7.11
cachetools                   5.3.2
catalogue                    2.0.10
certifi                      2023.11.17
cffi                         1.16.0
charset-normalizer           3.3.2
click                        8.1.7
cloudpathlib                 0.16.0
colorama                     0.4.6
comm                         0.2.0
confection                   0.1.4
ctcdecode                    1.0.3
cycler                       0.12.1
cymem                        2.0.8
de-core-news-sm              3.7.0
debugpy                      1.8.0
decorator                    4.4.2
decord                       0.6.0
deepspeed                    0.13.1
defusedxml                   0.7.1
dill                         0.3.8
docker-pycreds               0.4.0
einops                       0.5.0
et_xmlfile                   2.0.0
exceptiongroup               1.2.0
executing                    2.0.1
fastjsonschema               2.19.0
filelock                     3.13.1
flash-attn                   2.5.7
flatbuffers                  1.12
fonttools                    4.50.0
fqdn                         1.5.1
fsspec                       2023.12.2
ftfy                         6.3.1
gast                         0.4.0
gitdb                        4.0.11
GitPython                    3.1.43
google-auth                  2.25.2
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
grpcio                       1.60.0
h11                          0.14.0
h5py                         3.10.0
hjson                        3.1.0
hpargparse                   0.14.0
hpman                        0.13.0
httpcore                     1.0.5
httpx                        0.27.0
huggingface-hub              0.22.2
hydra-core                   1.3.2
idna                         3.6
imageio                      2.34.0
imageio-ffmpeg               0.5.1
importlib-metadata           7.0.0
ipykernel                    6.27.1
ipython                      8.18.1
ipython-genutils             0.2.0
isoduration                  20.11.0
jedi                         0.19.1
Jinja2                       3.1.2
joblib                       1.3.2
json5                        0.9.14
jsonpointer                  2.4
jsonschema                   4.20.0
jsonschema-specifications    2023.11.2
jupyter_client               8.6.0
jupyter_core                 5.5.1
jupyter-events               0.9.0
jupyter-lsp                  2.2.1
jupyter_server               2.12.1
jupyter_server_terminals     0.5.0
jupyterlab                   4.0.9
jupyterlab_pygments          0.3.0
jupyterlab_server            2.25.2
keras                        2.9.0
Keras-Preprocessing          1.1.2
kiwisolver                   1.4.5
langcodes                    3.3.0
libclang                     18.1.1
lmdb                         1.3.0
loguru                       0.7.0
lxml                         4.9.1
Markdown                     3.5.1
markdown-it-py               3.0.0
MarkupSafe                   2.1.3
matplotlib                   3.5.3
matplotlib-inline            0.1.6
mdurl                        0.1.2
mistune                      3.0.2
moviepy                      1.0.3
mpmath                       1.3.0
murmurhash                   1.0.10
nbclassic                    1.0.0
nbclient                     0.9.0
nbconvert                    7.13.0
nbformat                     5.9.2
nest-asyncio                 1.5.8
networkx                     3.2.1
ninja                        1.11.1.1
nose                         1.3.7
notebook                     6.5.4
notebook_shim                0.2.3
numpy                        1.22.4
nvidia-ml-py                 12.535.133
nvitop                       1.3.2
oauthlib                     3.2.2
omegaconf                    2.3.0
opencv-python                4.6.0.66
openpyxl                     3.1.5
opt-einsum                   3.3.0
ordereddict                  1.1
overrides                    7.4.0
packaging                    23.2
pandas                       1.3.5
pandocfilters                1.5.0
parso                        0.8.3
pathlib                      1.0.1
pathtools                    0.1.2
peft                         0.10.0
pexpect                      4.9.0
Pillow                       10.1.0
pip                          23.3.2
platformdirs                 4.1.0
portalocker                  2.8.2
preshed                      3.0.9
proglog                      0.1.10
prometheus-client            0.19.0
promise                      2.3
prompt-toolkit               3.0.43
protobuf                     3.19.6
psutil                       5.9.8
ptyprocess                   0.7.0
pure-eval                    0.2.2
py-cpuinfo                   9.0.0
pyasn1                       0.5.1
pyasn1-modules               0.3.0
pycparser                    2.21
pydantic                     2.6.4
pydantic_core                2.16.3
Pygments                     2.17.2
PyJWT                        2.8.0
pynvml                       11.5.3
pyparsing                    3.1.2
python-dateutil              2.8.2
python-json-logger           2.0.7
pytz                         2024.1
PyWavelets                   1.6.0
PyYAML                       6.0.1
pyzmq                        25.1.2
referencing                  0.32.0
reFILE                       0.4.1
regex                        2023.12.25
requests                     2.31.0
requests-oauthlib            1.3.1
rfc3339-validator            0.1.4
rfc3986-validator            0.1.1
rich                         13.7.1
rouge                        1.0.1
rpds-py                      0.15.2
rsa                          4.9
sacrebleu                    2.2.0
sacremoses                   0.1.1
safetensors                  0.4.2
scikit-image                 0.19.3
scikit-learn                 1.0.2
scipy                        1.7.3
seaborn                      0.12.1
Send2Trash                   1.8.2
sentencepiece                0.1.97
sentry-sdk                   1.44.0
setproctitle                 1.3.3
setuptools                   58.1.0
shortuuid                    1.0.13
six                          1.16.0
smart-open                   6.0.0
smmap                        5.0.1
sniffio                      1.3.0
soupsieve                    2.5
spacy                        3.7.4
spacy-legacy                 3.0.12
spacy-loggers                1.0.5
spacy-pkuseg                 0.0.33
srsly                        2.4.8
stack-data                   0.6.3
sympy                        1.12
tabulate                     0.9.0
tensorboard                  2.9.1
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorflow                   2.9.1
tensorflow-estimator         2.9.0
tensorflow-io-gcs-filesystem 0.36.0
termcolor                    2.4.0
terminado                    0.18.0
textpruner                   1.1.post2
tf-slim                      1.1.0
thinc                        8.2.3
threadpoolctl                3.4.0
tifffile                     2024.2.12
timm                         0.8.10.dev0
tinycss2                     1.2.1
tokenizers                   0.19.1
tomli                        2.0.1
torch                        2.1.1+cu121
torchaudio                   2.1.1+cu121
torchvision                  0.16.1+cu121
tornado                      6.4
tqdm                         4.65.0
traitlets                    5.14.0
transformers                 4.40.0
triton                       2.1.0
typer                        0.9.4
types-python-dateutil        2.8.19.14
typing_extensions            4.9.0
uri-template                 1.3.0
urllib3                      2.1.0
vidaug                       1.5
wandb                        0.16.5
wasabi                       1.1.2
wcwidth                      0.2.12
weasel                       0.3.4
webcolors                    1.13
webencodings                 0.5.1
websocket-client             1.7.0
Werkzeug                     3.0.1
wheel                        0.43.0
wrapt                        1.16.0
zh-core-web-sm               3.7.0
zhipuai                      2.0.1
zipp                         3.17.0

I use Qwen2.5 as the LM of a vision-language model for SFT. With the same environment and the same command, the loss at a given iteration differs from run to run, even though my seed is fixed. Is this normal? If not, how can I troubleshoot this instability?

Description

Steps to reproduce

This happens to Qwen2.5-xB-Instruct-xxx and xxx.
The badcase can be reproduced with the following steps:

  1. ...
  2. ...

The following example input & output can be used:

system: ...
user: ...
...

Expected results

The results are expected to be ...

Attempts to fix

I have tried several ways to fix this, including:

  1. adjusting the sampling parameters, but ...
  2. prompt engineering, but ...

Anything else helpful for investigation

I find that this problem also happens to ...

@jklj077
Collaborator

jklj077 commented Nov 14, 2024

If by unstable you mean slight variations in losses across different runs, that is normal: random seeds only control the pseudo-random number generator, and there are other sources of randomness. See https://pytorch.org/docs/stable/notes/randomness.html for reference.

If by unstable you mean that the loss fluctuates a lot, that is not expected, and there are many things that can cause it.
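
For the first case (slight run-to-run variation), the settings described in the PyTorch randomness notes linked above can be sketched roughly as follows. This is a minimal sketch, not something posted in this thread: `make_deterministic` is a hypothetical helper name, and the `torch` import is guarded only so the snippet also runs where PyTorch is not installed.

```python
import os
import random

try:
    import torch  # optional here so the sketch also runs without PyTorch
except ImportError:
    torch = None


def make_deterministic(seed: int = 0) -> None:
    """Hypothetical helper: seed the RNGs and request deterministic kernels."""
    # cuBLAS needs this for deterministic matmuls on CUDA >= 10.2; it has to be
    # set before the first CUDA call in the process.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    if torch is None:
        return
    torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    torch.backends.cudnn.benchmark = False      # no run-dependent autotuning
    torch.backends.cudnn.deterministic = True   # deterministic cuDNN kernels
    # Warn (instead of raising) when an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)


make_deterministic(1234)
```

These flags only cover PyTorch's own operators. Custom CUDA kernels from other packages (the environment above lists flash-attn and deepspeed) are outside their scope and may need their own settings, and NumPy, if used in the data pipeline, has to be seeded separately.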

@Solo4working
Author

Solo4working commented Nov 15, 2024

@jklj077 Thanks for your reply 😀.
In my research field, small models such as BART and T5 are commonly used.
When I plug these language models into my code, the losses are identical across runs, so I believe the random seeds in my code are fixed properly.
However, when I switch to Qwen, the loss is identical in the first iteration but not reliably afterwards. With a small lr (1e-5), the loss matches in only some iterations, while the others deviate by about 0.01~0.1. With a larger lr (3e-4), the losses differ in every iteration except the first few, probably by more than 0.1.
What causes this?
My code uses deepspeed's bf16 for training, not the Trainer from the transformers library.
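
One plausible contributor, not confirmed in this thread, is that floating-point addition is not associative, and the effect is much coarser in bf16 than in fp32. If a parallel reduction sums the same values in a different order on each run, the result can change even with identical seeds. The sketch below emulates bf16 rounding with the standard library only; `to_bf16` and `bf16_add` are illustrative names, and real kernels often accumulate in fp32, so this shows the mechanism rather than the exact behavior of any library.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even); NaN/inf not handled."""
    # bfloat16 is the upper 16 bits of the float32 encoding.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the low 16 bits
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_add(a: float, b: float) -> float:
    """Add two values and round the sum to bfloat16."""
    return to_bf16(a + b)


# Same three numbers, two summation orders.
left_to_right = bf16_add(bf16_add(256.0, 1.0), 1.0)  # 257 is not representable
right_to_left = bf16_add(256.0, bf16_add(1.0, 1.0))  # 258 is representable
print(left_to_right, right_to_left)  # 256.0 258.0
```

A difference like this in one gradient becomes a difference in the weights after the optimizer step, scaled by the learning rate, which would be consistent with the first iteration matching and later iterations drifting further apart at a larger lr.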
