
qwen2_vl model training anomaly #5462

Open
will-wiki opened this issue Sep 18, 2024 · 3 comments
Labels
pending This problem is yet to be addressed

Comments


will-wiki commented Sep 18, 2024

Training environment:

torch==2.1.2, cuda==11.8, transformers==4.45.0.dev0, LLaMA-Factory==v0.9.0, GPU: A800

Problem encountered during training: I am using the training script https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/train_lora/qwen2vl_lora_sft.yaml with a modified video-parsing strategy and non-streaming data preprocessing. With well over 100k samples and preprocessing_num_workers: 48, preprocessing finishes in about 90 minutes. When I train only the LLM LoRA, training runs normally. However, when I add the `additional_target: merger` option (config sketched below), preprocessing still works, but the model hangs for a long time without any error right after completing the first training step. I have reproduced this three or four times and it occurs consistently, although training works fine when the dataset is small. While hung, GPU memory is fully occupied and utilization stays at 100% for a long time. Could someone explain why this happens?
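For reference, the config is essentially the stock example plus one extra line. The sketch below is reconstructed from the example file rather than copied verbatim from my run, so treat the model path and hyperparameter values as illustrative:

```yaml
### sketch based on examples/train_lora/qwen2vl_lora_sft.yaml (values illustrative)
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct

stage: sft
do_train: true
finetuning_type: lora
lora_target: all
additional_target: merger   # the extra option that triggers the hang

template: qwen2_vl
preprocessing_num_workers: 48

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
```

The run is launched with `llamafactory-cli train <config>.yaml`.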

I also tried `additional_target: merger` with streaming data processing. In that mode the model trains normally, but mixed-modality data is not supported and training becomes extremely slow. The streaming configuration is as follows:

```yaml
buffer_size: 64
preprocessing_batch_size: 64
streaming: true
accelerator_config:
  dispatch_batches: false
```
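These keys go into the same training YAML as above. One detail worth noting: with streaming: true the dataset has no known length, so (as far as I understand HF Trainer's handling of iterable datasets) max_steps has to be set instead of relying on num_train_epochs. A sketch with illustrative values:

```yaml
# streaming variant (sketch; remaining keys as in the config above)
streaming: true
buffer_size: 64
preprocessing_batch_size: 64
max_steps: 10000             # illustrative; required because epochs cannot be inferred from an iterable dataset
accelerator_config:
  dispatch_batches: false    # each rank iterates its own shard instead of rank 0 dispatching batches
```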
github-actions bot added the pending (This problem is yet to be addressed) label Sep 18, 2024
will-wiki (Author) commented

Hello, could anyone help me figure out this problem?

will-wiki (Author) commented

Hello, can no one reply...? I also asked in the group chat, but got no response.

liangxiaowei00 commented

Same here, I'd also like an answer.
