Multiples bug fixes and add on_train_epoch_start callback #129

Merged 6 commits on Nov 16, 2023

@Edresson Edresson commented Nov 13, 2023

What does it do?

  1. Fixes the KeyError: 'avg_loss_1' error raised when start_with_eval=True and target_loss is set. The error occurs because the training keep_avg_target.avg_values is an empty dictionary. It is related to [Bug] KeyError: 'avg_loss_1' crash when training model TTS#2862. The same error also occurs during training when we try to save a checkpoint before self.keep_avg_train or self.keep_avg_eval has been updated. To solve it, this PR also makes _pick_target_avg_loss safe: if keep_avg_target.avg_values is empty, it returns None and nothing breaks. This also prevents issues like multiband_melgan Vocoder Fails on Step 10000 With KeyError: 'avg_loss_0' TTS#1608.
  2. Raises an error when a multiple-optimizer setup is used with gradient accumulation and without a custom optimize method. This prevents users from training a model with our implementation, which leaves dangling gradients in a multiple-optimizer setup with gradient accumulation (I have already done this accidentally; it is really bad because training time can be lost).
  3. Adds on_train_epoch_start and on_train_epoch_end callbacks. Currently, the only way to put modules in eval mode during training is via the on_train_step_start callback, which is called on every train_step and is really slow. With the new callback, this can be done once per epoch. It should decrease the step time for XTTS GPT and XTTS decoder training.
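
As a rough illustration of the first item, the sketch below shows one way a target-loss lookup can be made safe against an empty averages dictionary. It is not the Trainer's actual `_pick_target_avg_loss`: the standalone function `pick_target_avg_loss`, its parameters, and the `avg_loss` fallback key are assumptions made for this example. Only the `avg_<target_loss>` key pattern is taken from the error messages quoted above.

```python
from typing import Dict, Optional


def pick_target_avg_loss(
    avg_values: Dict[str, float], target_loss: Optional[str] = None
) -> Optional[float]:
    """Hypothetical stand-in for a safe target-loss lookup (not the Trainer's code)."""
    # No averages recorded yet, e.g. evaluation ran before any train step
    # or a checkpoint is saved before the averages were updated.
    if not avg_values:
        return None
    if target_loss:
        # dict.get returns None for a missing key instead of raising KeyError.
        return avg_values.get(f"avg_{target_loss}")
    # Assumed fallback for this sketch: a plain "avg_loss" entry, if present.
    return avg_values.get("avg_loss")


# An empty dictionary no longer crashes the lookup.
print(pick_target_avg_loss({}, "loss_1"))
# A populated dictionary returns the stored average.
print(pick_target_avg_loss({"avg_loss_1": 0.42}, "loss_1"))
```

A caller that receives None can then skip the best-model comparison for that step instead of failing.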

@Edresson Edresson requested a review from erogol November 13, 2023 17:56
@Edresson Edresson changed the title Fix key error on target loss when start_with_eval=True Multiples bug fixes and add on_train_epoch_start callback Nov 13, 2023
@erogol erogol merged commit 385cced into main Nov 16, 2023
7 checks passed
@erogol erogol deleted the fix_eval branch November 16, 2023 10:03