Training strategy for Zipformer using fp16? #1461
Unanswered
ZQuang2202 asked this question in Q&A
Replies: 1 comment, 1 reply
-
Are you using a single GPU and max-duration=300? The gradient noise might be large with such a small batch size. You could try a smaller base-lr, like 0.025, and keep lr_batches/lr_epochs unchanged. Usually you don't need to tune the Balancer and Whitener configurations.
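For context, the suggested 0.025 is roughly what the √k batch-size scaling rule described in the question below gives. A minimal sketch, assuming the recipe's default base-lr of 0.045 and the recommended max-duration of 1000; the helper name is mine, not part of icefall:

```python
import math

def scaled_base_lr(default_lr: float, ref_max_duration: float, my_max_duration: float) -> float:
    # Scale the learning rate by 1/sqrt(k) when the effective batch size shrinks by a factor of k.
    k = ref_max_duration / my_max_duration
    return default_lr / math.sqrt(k)

print(scaled_base_lr(0.045, 1000, 300))  # ~0.0246, close to the 0.025 suggested above
```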
-
Hi everyone,
I am a student attempting to reproduce the Zipformer results on LibriSpeech 100h, but hardware limitations prevent me from using the recommended configuration. Because of these constraints, I have reduced the batch size (max_duration) to 300 instead of the recommended 1000. However, I am struggling to find an appropriate configuration for the Eden scheduler.
Following the training strategy of decreasing the learning rate by √k when the batch size decreases by a factor of k, I initially set base_lr to 0.03 and kept the other configurations at their default values, but training diverged. Despite attempts to adjust lr_batches, lr_epochs (3.5-6), and base_lr (0.03-0.045), training still diverges. Notably, the divergence occurs when batch_count is around 700-900, leading to 'parameter domination' issues in the embed_conv and some attention modules. I attach some log information below.
(Screenshots of training logs attached, showing the loss divergence and the 'parameter domination' warnings in embed_conv and attention modules.)
In an effort to address this, I attempted to reduce the gradient scale of the layers experiencing 'parameter domination', but this proved ineffective.
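For reference, here is a minimal sketch of the Eden schedule being tuned here (warmup term omitted), following the formula given in the Zipformer paper; the function name and example values are mine, and icefall's optim.py is the authoritative implementation. It illustrates that with lr_batches and lr_epochs left at their defaults, lowering base_lr only rescales the whole curve without changing the shape of the decay:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Batch-wise and epoch-wise decay factors of the Eden schedule (warmup omitted).
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Example: learning rate near the batch counts where training diverged.
print(eden_lr(0.025, batch=800, epoch=1.0))
```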
I have a few questions:
Thank you.