
[Train bug] Gradient Explosion in SFT training stage with DeepSpeed ZeRO-2 #109

Closed
Grey4sh opened this issue Sep 27, 2024 · 7 comments

Comments

@Grey4sh
Contributor

Grey4sh commented Sep 27, 2024

Gradient explosion

I fine-tuned with a self-built FIM SFT dataset and saw abnormal loss when training with DeepSpeed ZeRO-2. The same dataset did not cause this problem on CodeQwen1.5, and after switching to ZeRO-3 the training proceeded normally. Is this a problem with the model architecture or an incompatibility with my DeepSpeed version?
BTW, my DeepSpeed version is 0.13.2.

@cyente
Collaborator

cyente commented Sep 27, 2024

Here are our recommended SFT practices, which you can use to check whether your configuration has any errors.

https://github.com/QwenLM/Qwen2.5-Coder/tree/main/sft

Note that we updated the special tokens between CodeQwen1.5 and Qwen2.5-Coder. Please confirm that your training pipeline has no issues related to these special tokens.

{
  "<|fim_prefix|>": 151659, 
  "<|fim_middle|>": 151660, 
  "<|fim_suffix|>": 151661, 
  "<|fim_pad|>": 151662, 
  "<|repo_name|>": 151663, 
  "<|file_sep|>": 151664, 
  "<|im_start|>": 151644, 
  "<|im_end|>": 151645
}
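
One way to run that check is to compare the IDs your tokenizer actually returns against the table above. The following is only a minimal sketch: `find_token_mismatches` is a hypothetical helper name, and the `lookup` argument is assumed to be a callable that maps a token string to its ID, for example the `convert_tokens_to_ids` method of a Hugging Face tokenizer.

```python
from typing import Callable, Dict, Optional, Tuple

# Expected IDs, copied from the table above.
EXPECTED_SPECIAL_TOKENS: Dict[str, int] = {
    "<|fim_prefix|>": 151659,
    "<|fim_middle|>": 151660,
    "<|fim_suffix|>": 151661,
    "<|fim_pad|>": 151662,
    "<|repo_name|>": 151663,
    "<|file_sep|>": 151664,
    "<|im_start|>": 151644,
    "<|im_end|>": 151645,
}


def find_token_mismatches(
    lookup: Callable[[str], Optional[int]],
    expected: Dict[str, int] = EXPECTED_SPECIAL_TOKENS,
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {token: (expected_id, actual_id)} for every token whose ID differs.

    `lookup` is an assumption: any callable that maps a token string to the
    ID used by the tokenizer you are training with.
    """
    mismatches = {}
    for token, expected_id in expected.items():
        actual_id = lookup(token)
        if actual_id != expected_id:
            mismatches[token] = (expected_id, actual_id)
    return mismatches
```

With a loaded Hugging Face tokenizer this could be called as `find_token_mismatches(tokenizer.convert_tokens_to_ids)`. An empty result means the eight tokens above map to the listed IDs; it does not check how the dataset itself uses them.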

@Grey4sh
Contributor Author

Grey4sh commented Sep 27, 2024

Big shout-out to your team. I checked the new special token format, but I still hit the same problem with ZeRO-2. BTW, is there any plan to provide unsupervised-training examples?

@oo0-0-0oo

A smaller learning rate may work.
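
As a rough illustration of that suggestion, the sketch below builds a DeepSpeed ZeRO-2 config dict with a lowered learning rate and gradient clipping enabled. The keys (`zero_optimization`, `gradient_clipping`, `optimizer`) are standard DeepSpeed config fields, but the helper name `build_zero2_config` and every numeric value are assumptions chosen for illustration, not settings taken from this thread or from the official SFT script.

```python
import json
from typing import Any, Dict


def build_zero2_config(lr: float = 1e-5, max_grad_norm: float = 1.0) -> Dict[str, Any]:
    """Build a ZeRO-2 config dict with a small LR and gradient clipping.

    The default values are illustrative assumptions; tune them for your run.
    """
    return {
        # ZeRO stage 2 partitions optimizer states and gradients.
        "zero_optimization": {"stage": 2},
        # Clip the global gradient norm to limit the size of any single update.
        "gradient_clipping": max_grad_norm,
        "optimizer": {
            "type": "AdamW",
            "params": {"lr": lr},
        },
    }


# Serialize to JSON text that could be saved as a DeepSpeed config file.
config_text = json.dumps(build_zero2_config(), indent=2)
```

If the learning rate is set by the training script's own arguments rather than by the DeepSpeed config, lower it there instead so the two settings do not conflict.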

@cyente
Collaborator

cyente commented Sep 29, 2024

Could you try reproducing the problem with the current SFT script? If it still occurs, please share more detailed, reproducible information so we can help further.

@Grey4sh
Contributor Author

Grey4sh commented Sep 29, 2024

Okay, I will provide further information once the script adaptation is completed.

@Grey4sh
Contributor Author

Grey4sh commented Oct 9, 2024

@cyente

I encountered a training error when running the training script from the official repo.

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5225f7a897 in /home/chatgpt/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f51d98d7e33 in /home/chatgpt/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7f52256b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f5226f5aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f5226fec850 in /lib/x86_64-linux-gnu/libc.so.6)

../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [677,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
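
An assertion of this form (`srcIndex < srcSelectDimSize`) is commonly triggered when an index passed to an index-select or embedding lookup falls outside the table, for instance a token ID that is greater than or equal to the embedding size. A minimal sketch for scanning tokenized batches on the CPU before training follows; `find_out_of_range_ids` is a hypothetical helper, and `vocab_size` is assumed to be the number of rows in the model's input embedding (with Hugging Face models this can usually be read from `model.get_input_embeddings().num_embeddings`).

```python
from typing import List, Sequence, Tuple


def find_out_of_range_ids(
    batch: Sequence[Sequence[int]], vocab_size: int
) -> List[Tuple[int, int, int]]:
    """Return (row, column, token_id) for every ID outside [0, vocab_size).

    Intended for input IDs only; label tensors may legitimately contain
    ignore values such as -100 and should not be checked this way.
    """
    bad = []
    for row, ids in enumerate(batch):
        for col, token_id in enumerate(ids):
            if token_id < 0 or token_id >= vocab_size:
                bad.append((row, col, token_id))
    return bad


# Toy example with a 10-entry vocabulary: IDs 10 and -1 are out of range.
toy_batch = [[1, 2, 9], [3, 10, -1]]
offenders = find_out_of_range_ids(toy_batch, vocab_size=10)
```

Running the check on plain Python lists avoids the asynchronous CUDA failure and points directly at the offending sample and position.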

@cyente
Collaborator

cyente commented Oct 12, 2024

Hey, we have fixed some tokenization errors in the SFT script, so you can try again now.

@cyente cyente closed this as completed Oct 16, 2024