When reproducing the experiment, I found that after the loss decreased, it showed a clear upward trend again.
I reproduced this on 8 × 32GB V100 GPUs; here is my script:

nohup torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:29500 ultravox/ultravox/training/train.py --config_path=ultravox/training/configs/release_config.yaml > train.log 2>&1 &
Here is the TensorBoard log:
Due to memory constraints, I had to modify the batch size and max steps:
text_model: "/data2/ultravox_data/model/Meta-Llama-3.1-8B-Instruct"
audio_model: "openai/whisper-medium"
loss_config:
  loss_function: "KL_Divergence"
train_sets:
  - name: librispeech-clean-continuation
  - name: librispeech-other-continuation
  - name: commonvoice-en-continuation
val_sets:
  - name: covost2-en-de
  - name: covost2-zh-en
  - name: peoplespeech-clean-transcription
batch_size: 3
max_steps: 115200  # x8x3 = 2,764,800
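As a quick sanity check on the adjusted values (assuming total samples = per-device batch size × number of GPUs × steps; the variable names here are just for illustration):

```python
# Verify that the reduced batch size with increased max_steps still
# covers the same total number of training samples.
batch_size = 3        # per-device batch size (reduced due to V100 memory)
num_gpus = 8          # 8 x 32GB V100
max_steps = 115200    # increased to compensate for the smaller batch

total_samples = batch_size * num_gpus * max_steps
print(total_samples)  # 2764800, matching the 2,764,800 noted in the config
```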
When it was almost done, it looked like this. Is this expected?