
Training error when full-parameter fine-tuning only the visual.merger of Qwen2-VL-7B-Instruct with all other model parameters frozen #5472

Open
wjx-sudo opened this issue Sep 18, 2024 · 7 comments
Labels
pending This problem is yet to be addressed

Comments

@wjx-sudo

Reminder

  • I have read the README and searched the existing issues.

System Info

### model
model_name_or_path: /Qwen2-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
train_mm_proj_only: true  # train the multimodal projector only
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: mllm_demo,identity
template: qwen2_vl
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen2_vl-7b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
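For context, train_mm_proj_only: true is LLaMA-Factory's switch for training only the multimodal projector while freezing everything else; in Qwen2-VL that projector lives under visual.merger. A minimal sketch for verifying which tensors actually end up trainable, assuming a local model copy at the path above and a transformers version with Qwen2-VL support:

from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained("/Qwen2-VL-7B-Instruct")
# Mirror train_mm_proj_only: freeze everything except the multimodal projector.
for name, param in model.named_parameters():
    param.requires_grad = "visual.merger" in name
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[:3]}")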

Reproduction

File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/accelerate/accelerator.py", line 2143, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
self.engine.backward(loss, **kwargs)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2051, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
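For what it is worth, this RuntimeError comes straight from PyTorch autograd: it is raised when the loss tensor has no grad_fn, i.e. nothing that produced it requires gradients. A minimal standalone repro, independent of any model:

import torch

frozen = torch.randn(4)      # requires_grad defaults to False
loss = (frozen ** 2).sum()   # built entirely from frozen tensors, so loss has no grad_fn
loss.backward()              # RuntimeError: element 0 of tensors does not require grad
                             # and does not have a grad_fn

One plausible trigger in this setup: the dataset mixes mllm_demo with identity, which is text-only, so a batch without images never passes through visual.merger, leaving the loss connected only to frozen parameters.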

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Sep 18, 2024
@nemonameless

I found that finetuning_type: full does not actually train 100% of the parameters either.

@GoGoZeppeli-towa

I found that finetuning_type: full does not actually train 100% of the parameters either.

Perhaps that is because freeze_vision_tower defaults to true?
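If it helps, the default can be checked directly. A sketch assuming llamafactory is installed as a package; the exact dataclass location may differ between versions:

from llamafactory.hparams import FinetuningArguments

# freeze_vision_tower is expected to default to True, which keeps the ViT
# frozen even when finetuning_type is full.
print(FinetuningArguments().freeze_vision_tower)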

@wjx-sudo
Author

I found that finetuning_type: full does not actually train 100% of the parameters either.

Indeed. I tried adding the vision_tower parameters to the trainable set as well, but training hangs partway through; only fine-tuning the LLM part works.

@nemonameless

With freeze_vision_tower set to true, I found that training on my own dataset gives worse results than with it set to false.

@will-wiki

@wjx-sudo Same problem here: when training llm-lora + merger without streaming, it hangs after just one training step. Did you ever solve it?

@piDack

piDack commented Oct 11, 2024

Waiting for some kind soul to post a solution.

@Michael4933

llama-factory's support for training the ViT and the connector indeed does not seem well developed; it looks like it is simply unsupported.
