Training strategy for Zipformer using fp16? #1461
Unanswered
ZQuang2202 asked this question in Q&A
Replies: 1 comment, 1 reply
-
Are you using a single GPU and max-duration=300? The gradient noise might be large with such a small batch size. You could try a smaller base-lr, like 0.025, and keep lr_batches/lr_epochs unchanged. Usually you don't need to tune the Balancer and Whitener configurations.
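For context, the suggested 0.025 is roughly what the √k batch-size scaling rule described in the question below gives. A minimal sketch, assuming the recipe's default base-lr of 0.045 and the recommended max-duration of 1000; the helper name is mine, not part of icefall:

```python
import math

def scaled_base_lr(default_lr: float, ref_max_duration: float, my_max_duration: float) -> float:
    # Scale the learning rate by 1/sqrt(k) when the effective batch size shrinks by a factor of k.
    k = ref_max_duration / my_max_duration
    return default_lr / math.sqrt(k)

print(scaled_base_lr(0.045, 1000, 300))  # ~0.0246, close to the 0.025 suggested above
```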
-
Hi everyone,
I am a student attempting to reproduce the Zipformer results on LibriSpeech 100h, but hardware limitations prevent me from using the recommended configuration. Because of these constraints, I have reduced the batch size (max_duration) to 300 instead of the recommended 1000. However, I am struggling to find an appropriate configuration for the Eden scheduler.
Following the training strategy of decreasing the learning rate by √k when the batch size decreases by a factor of k, I initially set base_lr to 0.03 and kept the other configurations at their default values, but training diverged. Despite attempts to adjust lr_batches, lr_epochs (3.5-6), and base_lr (0.03-0.045), training still diverges. Notably, the divergence occurs when batch_count is around 700-900, leading to 'parameter domination' issues in the embed_conv and some attention modules. I attach some log information below.
(Screenshots of training logs attached, showing the loss divergence and the 'parameter domination' warnings in embed_conv and attention modules.)
In an effort to address this, I attempted to reduce the gradient scale of the layers experiencing 'parameter domination', but this proved ineffective.
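For reference, here is a minimal sketch of the Eden schedule being tuned here (warmup term omitted), following the formula given in the Zipformer paper; the function name and example values are mine, and icefall's optim.py is the authoritative implementation. It illustrates that with lr_batches and lr_epochs left at their defaults, lowering base_lr only rescales the whole curve without changing the shape of the decay:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    # Batch-wise and epoch-wise decay factors of the Eden schedule (warmup omitted).
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Example: learning rate near the batch counts where training diverged.
print(eden_lr(0.025, batch=800, epoch=1.0))
```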
I have a few questions:
Thank you.