When reproducing, I noticed the loss decreased and then increased sharply #170

Open
forest520 opened this issue Jan 6, 2025 · 1 comment

forest520 commented Jan 6, 2025
When reproducing the experiment, I found that the loss first decreased and then showed a clear upward trend again.

I reproduced on 8 x 32GB V100 GPUs; here is my script:
nohup torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:29500 ultravox/ultravox/training/train.py --config_path=ultravox/training/configs/release_config.yaml > train.log 2>&1 &

Here is the TensorBoard log:
[image: TensorBoard loss curve]
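For anyone comparing runs, here is a minimal sketch for pulling the loss curve out of the TensorBoard event files. The log directory path and the scalar tag name ("train/loss") are assumptions and may differ in this repo; list the available tags first.

from tensorboard.backend.event_processing import event_accumulator

# Point this at the run directory containing the events.out.tfevents.* file (hypothetical path).
log_dir = "runs/my_run"

ea = event_accumulator.EventAccumulator(log_dir)
ea.Reload()  # parse the event file

# List the scalar tags actually logged, then pick the loss tag from that list.
print(ea.Tags()["scalars"])

# "train/loss" is an assumed tag name; replace it with whatever the listing above shows.
for event in ea.Scalars("train/loss"):
    print(event.step, event.value)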

Due to memory constraints, I had to modify the batch size and max steps:

text_model: "/data2/ultravox_data/model/Meta-Llama-3.1-8B-Instruct"
audio_model: "openai/whisper-medium"
loss_config:
  loss_function: "KL_Divergence"

train_sets:
  - name: librispeech-clean-continuation
  - name: librispeech-other-continuation
  - name: commonvoice-en-continuation


val_sets:
  - name: covost2-en-de
  - name: covost2-zh-en
  - name: peoplespeech-clean-transcription

batch_size: 3
max_steps: 115200 # x 24 samples/step (8 GPUs x batch size 3) = 2,764,800
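As a sanity check on that comment, the total number of training samples is max_steps times the effective per-step batch size. A quick sketch, assuming plain data parallelism with no gradient accumulation (the 8 GPUs and per-device batch size 3 are taken from the config above):

# Effective batch size with data-parallel training: per-device batch * number of GPUs.
per_device_batch = 3
num_gpus = 8
effective_batch = per_device_batch * num_gpus  # 24 samples per optimizer step

max_steps = 115_200
total_samples = max_steps * effective_batch
print(total_samples)  # 2764800, matching the comment in the config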
Gpwner commented Jan 8, 2025

When it was almost done, it looked like this:
[image: loss curve near the end of training]
Is this expected?
