Hello!
Thank you for your work on MLLM.
I ran into a fine-tuning bug that I couldn't fix: when I run the stage2_sft.sh script and train with speech_conv_datasets only, the logger shows that the train loss is always 0 and the eval loss is NaN, as shown in the figure.
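For what it's worth, one hypothesis I checked on the plain PyTorch side (independent of this repo): if preprocessing leaves every label token in a speech_conv sample masked to -100, the cross-entropy loss has no supervised positions left and degenerates, which would line up with the numbers above. A minimal sketch of that effect:

import torch
import torch.nn.functional as F

# Fake logits: batch of 1, 8 positions, small vocab.
logits = torch.randn(1, 8, 1000)

# All label positions masked with the usual ignore index (-100):
# the mean is taken over zero supervised tokens and comes out NaN.
all_masked = torch.full((1, 8), -100)
print(F.cross_entropy(logits.view(-1, 1000), all_masked.view(-1), ignore_index=-100))

# With at least one supervised token, the loss is finite.
labels = all_masked.clone()
labels[0, -1] = 42
print(F.cross_entropy(logits.view(-1, 1000), labels.view(-1), ignore_index=-100))

I'm not sure this is the actual cause here, but it does reproduce a NaN loss from fully masked labels.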
The command in stage2_sft.sh is as follows:
torchrun \
--nproc_per_node 2 \
anygpt/src/train/stage2_sft.py \
--model_name_or_path "${METAROOT}" \
--run_name "mm_sft" \
--cache_dir ${CACHEROOT} \
--report_to "wandb" \
--speech_conv_datasets "$speech_conv_datasets" \
--speech_datasets "$speech_datasets" \
--preprocessing_num_workers 100 \
--bf16 True \
--do_train \
--do_eval \
--output_dir "${OUTROOT}" \
--model_max_length 4096 \
--save_strategy "steps" \
--save_steps 5 \
--evaluation_strategy "steps" \
--eval_steps 5 \
--max_steps 5 \
--concatenating False \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--val_set_size 10 \
--num_train_epochs 3 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--log_level debug \
--logging_steps 1 \
--overwrite_output_dir False \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--use_flash_attn True \
--ddp_timeout 7200 \
--save_total_limit 10
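Before launching the full torchrun job I also tried to sanity-check the preprocessed data directly. This is only a rough sketch and assumes the preprocessing step writes a HuggingFace datasets directory with input_ids/labels columns; the path is a placeholder for my local cache, not a path from the repo:

from datasets import load_from_disk

# Placeholder path to the cached, tokenized speech_conv training split.
ds = load_from_disk("/path/to/cached/speech_conv_train")

for sample in ds.select(range(4)):
    labels = sample["labels"]
    supervised = sum(1 for t in labels if t != -100)
    # If supervised is 0 for every sample, there is nothing for the LM loss to fit.
    print(f"seq_len={len(labels)}, supervised_tokens={supervised}")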
I'm using the following Python environment:
transformers 4.34.1
huggingface-hub 0.24.0
tokenizers 0.14.1
torch 2.1.0
torchaudio 2.1.0
torchvision 0.16.0
flash-attn 2.5.9.post1