Hello!
Thank you for your work on MLLM.
I ran into a fine-tuning bug that I couldn't fix: when I run the stage2_sft.sh script and train with speech_conv_datasets only, the logger shows that the train loss is always 0 and the eval loss is NaN, as shown in the figure.
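For what it's worth, one hypothesis I checked on the plain PyTorch side (independent of this repo): if preprocessing leaves every label token in a speech_conv sample masked to -100, the cross-entropy loss has no supervised positions left and degenerates, which would line up with the numbers above. A minimal sketch of that effect:

import torch
import torch.nn.functional as F

# Fake logits: batch of 1, 8 positions, small vocab.
logits = torch.randn(1, 8, 1000)

# All label positions masked with the usual ignore index (-100):
# the mean is taken over zero supervised tokens and comes out NaN.
all_masked = torch.full((1, 8), -100)
print(F.cross_entropy(logits.view(-1, 1000), all_masked.view(-1), ignore_index=-100))

# With at least one supervised token, the loss is finite.
labels = all_masked.clone()
labels[0, -1] = 42
print(F.cross_entropy(logits.view(-1, 1000), labels.view(-1), ignore_index=-100))

I'm not sure this is the actual cause here, but it does reproduce a NaN loss from fully masked labels.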
The command in stage2_sft.sh is as follows:
torchrun \
--nproc_per_node 2 \
anygpt/src/train/stage2_sft.py \
--model_name_or_path "${METAROOT}" \
--run_name "mm_sft" \
--cache_dir ${CACHEROOT} \
--report_to "wandb" \
--speech_conv_datasets "$speech_conv_datasets" \
--speech_datasets "$speech_datasets" \
--preprocessing_num_workers 100 \
--bf16 True \
--do_train \
--do_eval \
--output_dir "${OUTROOT}" \
--model_max_length 4096 \
--save_strategy "steps" \
--save_steps 5 \
--evaluation_strategy "steps" \
--eval_steps 5 \
--max_steps 5 \
--concatenating False \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--val_set_size 10 \
--num_train_epochs 3 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--log_level debug \
--logging_steps 1 \
--overwrite_output_dir False \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--use_flash_attn True \
--ddp_timeout 7200 \
--save_total_limit 10
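Before launching the full torchrun job I also tried to sanity-check the preprocessed data directly. This is only a rough sketch and assumes the preprocessing step writes a HuggingFace datasets directory with input_ids/labels columns; the path is a placeholder for my local cache, not a path from the repo:

from datasets import load_from_disk

# Placeholder path to the cached, tokenized speech_conv training split.
ds = load_from_disk("/path/to/cached/speech_conv_train")

for sample in ds.select(range(4)):
    labels = sample["labels"]
    supervised = sum(1 for t in labels if t != -100)
    # If supervised is 0 for every sample, there is nothing for the LM loss to fit.
    print(f"seq_len={len(labels)}, supervised_tokens={supervised}")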
I'm using the following Python environment:
transformers 4.34.1
huggingface-hub 0.24.0
tokenizers 0.14.1
torch 2.1.0
torchaudio 2.1.0
torchvision 0.16.0
flash-attn 2.5.9.post1