
qwen2_vl model training anomaly #5462

Open
will-wiki opened this issue Sep 18, 2024 · 3 comments
Labels
pending This problem is yet to be addressed

Comments


will-wiki commented Sep 18, 2024

Training environment:

torch==2.1.2, cuda==11.8, transformers==4.45.0.dev0, LLaMA-Factory==v0.9.0, GPU: A800

Problem encountered during training: I am using the training script https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/train_lora/qwen2vl_lora_sft.yaml with a modified video-parsing strategy and non-streaming data preprocessing. With well over 100k samples and preprocessing_num_workers: 48, preprocessing finishes in about 90 minutes. When I train only the LLM LoRA, training runs normally. However, when I add the `additional_target: merger` option (config sketched below), preprocessing still works, but the model hangs for a long time without any error right after completing the first training step. I have reproduced this three or four times and it occurs consistently, although training works fine when the dataset is small. While hung, GPU memory is fully occupied and utilization stays at 100% for a long time. Could someone explain why this happens?
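For reference, the config is essentially the stock example plus one extra line. The sketch below is reconstructed from the example file rather than copied verbatim from my run, so treat the model path and hyperparameter values as illustrative:

```yaml
### sketch based on examples/train_lora/qwen2vl_lora_sft.yaml (values illustrative)
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct

stage: sft
do_train: true
finetuning_type: lora
lora_target: all
additional_target: merger   # the extra option that triggers the hang

template: qwen2_vl
preprocessing_num_workers: 48

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
```

The run is launched with `llamafactory-cli train <config>.yaml`.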

I also tried `additional_target: merger` with streaming data processing. In that mode the model trains normally, but mixed-modality data is not supported and training becomes extremely slow. The streaming configuration is as follows:

```yaml
buffer_size: 64
preprocessing_batch_size: 64
streaming: true
accelerator_config:
  dispatch_batches: false
```
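These keys go into the same training YAML as above. One detail worth noting: with streaming: true the dataset has no known length, so (as far as I understand HF Trainer's handling of iterable datasets) max_steps has to be set instead of relying on num_train_epochs. A sketch with illustrative values:

```yaml
# streaming variant (sketch; remaining keys as in the config above)
streaming: true
buffer_size: 64
preprocessing_batch_size: 64
max_steps: 10000             # illustrative; required because epochs cannot be inferred from an iterable dataset
accelerator_config:
  dispatch_batches: false    # each rank iterates its own shard instead of rank 0 dispatching batches
```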
github-actions bot added the pending (This problem is yet to be addressed) label Sep 18, 2024
will-wiki (Author) commented

Hello, could anyone help me figure out this problem?

will-wiki (Author) commented

Hello, can no one reply...? I also asked in the group chat, but got no response.

liangxiaowei00 commented

Same here, I'd also like an answer.
