Training error when full-parameter fine-tuning only the visual.merger of Qwen2-VL-7B-Instruct with all other model parameters frozen #5472
Labels
pending
This problem is yet to be addressed
Comments
I found that finetuning_type: full does not actually train 100% of the parameters either.
Perhaps that is because freeze_vision_tower defaults to true?
Indeed. I tried adding the vision_tower parameters as well, but training hangs partway through; only the LLM part can be fine-tuned.
With freeze_vision_tower set to true, I found the results on my own dataset are lower than with it set to false.
@wjx-sudo Same problem here: when training llm-lora + merger without streaming, it hangs after just one step. Did you manage to solve it?
Waiting here for a kind soul to post a solution.
LLaMA-Factory's training support for the ViT and the connector really does not seem well developed; it looks like it simply is not supported.
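A workaround hinted at in the comments above is to bypass the built-in freezing flags and mark the merger trainable by hand before training starts. The sketch below is only illustrative and assumes the Hugging Face Qwen2-VL layout, where the projector's parameters are named visual.merger.*; the wrapper object LLaMA-Factory hands to the trainer may expose a different prefix.

# Sketch: train only the multimodal projector (merger) of Qwen2-VL, freeze the rest.
# Assumption: Hugging Face naming, i.e. projector parameters live under "visual.merger".
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/Qwen2-VL-7B-Instruct", torch_dtype="auto"
)
for name, param in model.named_parameters():
    # Keep gradients only for the merger; everything else stays frozen.
    param.requires_grad = "visual.merger" in name

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[:3]}")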
Reminder
System Info
### model
model_name_or_path: /Qwen2-VL-7B-Instruct
### method
stage: sft
do_train: true
finetuning_type: full
train_mm_proj_only: true  # train only the multimodal projector
deepspeed: examples/deepspeed/ds_z2_config.json
### dataset
dataset: mllm_demo,identity
template: qwen2_vl
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/qwen2_vl-7b/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
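Before launching a run with this configuration, it may help to confirm that the combination of flags actually leaves at least one parameter trainable, since the traceback below indicates that the backward pass found nothing requiring gradients. assert_has_trainable_params is a hypothetical helper, not part of LLaMA-Factory; one could call it on the model right before trainer.train().

# Sketch: fail fast with a clearer message if the freezing logic left nothing trainable.
def assert_has_trainable_params(model):
    trainable = [name for name, p in model.named_parameters() if p.requires_grad]
    if not trainable:
        raise RuntimeError(
            "No parameter requires grad; check train_mm_proj_only / freeze_vision_tower."
        )
    print(f"Trainable tensors: {len(trainable)}; first few: {trainable[:5]}")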
Reproduction
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/accelerate/accelerator.py", line 2143, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
self.engine.backward(loss, **kwargs)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2051, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/root/anaconda3/envs/llamafactory-w/lib/python3.8/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
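For reference, this RuntimeError is what PyTorch raises whenever backward() is called on a loss that is not connected to any tensor with requires_grad=True, which is consistent with every model parameter ending up frozen. A minimal standalone reproduction:

import torch

x = torch.randn(4, 8)                       # plain activations, no grad tracking
w = torch.randn(8, 1, requires_grad=False)  # a "frozen" parameter
loss = (x @ w).sum()
loss.backward()  # RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn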
Expected behavior
No response
Others
No response