Train_loss = 0 and Eval_loss = NaN in stage2_sft #31

Open
@xuxiaoang

Hello!
Thank you for your work on MLLM.
I ran into a fine-tuning bug that I couldn't fix: when I run the stage2_sft.sh script and train with speech_conv_datasets only, the logger reports a train loss of 0 at every step and an eval loss of NaN, as shown in the screenshot below.
[Screenshot 2024-07-20 210750]
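
A possible first check, not confirmed as the cause here: in PyTorch, a mean cross-entropy over a batch in which every label equals the ignore index has no tokens to average over and evaluates to NaN. The sketch below counts supervised label positions per sample. It assumes the tokenized samples expose a `labels` list and use -100 as the ignore index (the common Hugging Face convention); the function names and toy data are hypothetical and not taken from the AnyGPT code.

```python
IGNORE_INDEX = -100  # assumed ignore value, following the usual Hugging Face convention


def supervised_token_counts(label_rows, ignore_index=IGNORE_INDEX):
    """Count, per sample, how many label positions would contribute to the loss."""
    return [sum(1 for tok in row if tok != ignore_index) for row in label_rows]


def fully_masked_samples(label_rows, ignore_index=IGNORE_INDEX):
    """Return the indices of samples whose labels are all ignore_index (or empty)."""
    counts = supervised_token_counts(label_rows, ignore_index)
    return [i for i, n in enumerate(counts) if n == 0]


if __name__ == "__main__":
    # Toy data for illustration only: sample 1 is fully masked, sample 2 is empty.
    toy_labels = [[-100, -100, 5, 6], [-100, -100, -100], []]
    print(supervised_token_counts(toy_labels))
    print(fully_masked_samples(toy_labels))
```

If every sample in the speech conversation set turned out to be fully masked, that would point at the prompt/label construction rather than at the optimizer or precision settings.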

The command in stage2_sft.sh is as follows:

torchrun \
    --nproc_per_node 2 \
    anygpt/src/train/stage2_sft.py \
    --model_name_or_path "${METAROOT}" \
    --run_name "mm_sft" \
    --cache_dir ${CACHEROOT} \
    --report_to "wandb" \
    --speech_conv_datasets "$speech_conv_datasets" \
    --speech_datasets "$speech_datasets" \
    --preprocessing_num_workers 100 \
    --bf16 True \
    --do_train \
    --do_eval \
    --output_dir "${OUTROOT}" \
    --model_max_length 4096 \
    --save_strategy "steps" \
    --save_steps 5 \
    --evaluation_strategy "steps" \
    --eval_steps 5 \
    --max_steps 5 \
    --concatenating False \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --val_set_size 10 \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --log_level debug \
    --logging_steps 1 \
    --overwrite_output_dir False \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --use_flash_attn True \
    --ddp_timeout 7200 \
    --save_total_limit 10

I'm using the following Python environment:

transformers              4.34.1
huggingface-hub           0.24.0
tokenizers                0.14.1
torch                     2.1.0
torchaudio                2.1.0
torchvision               0.16.0
flash-attn                2.5.9.post1
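
For anyone trying to reproduce this, a list like the one above can be regenerated with a short script. This is a minimal sketch using only the standard library; the `PACKAGES` list simply mirrors the names reported in this issue, and the helper name is hypothetical.

```python
from importlib import metadata

# Package names copied from the environment listing in this issue.
PACKAGES = [
    "transformers",
    "huggingface-hub",
    "tokenizers",
    "torch",
    "torchaudio",
    "torchvision",
    "flash-attn",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:<25} {version or 'not installed'}")
```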
