training is not properly restartable #37

Open
dhdaines opened this issue Jul 27, 2022 · 2 comments

Comments

@dhdaines
Contributor

If you run sphinxtrain train and it fails (as it does, because the Perl scripts have bugs, but also as it might when using spot instances on $CLOUD), there's no easy way to restart training.

Back in the old days we would just sit in our offices at CMU all night running scripts_pl/NN.step/s***ve_confg.pl manually, but I would prefer to be in the forest picking mushrooms these days.

This isn't rocket science. At the very least it could just restart from the step and iteration where it failed, though of course it would be much better to rerun just the parts that failed. We're not even using GPUs, so there's no issue with repeatability when doing that.
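
To make the "restart from the step and iteration" idea concrete, here is a minimal Python sketch. It is not SphinxTrain code: `run_resumable`, the JSON state file, and the `run_job` callback are all hypothetical names invented for illustration, and the sketch assumes each (stage, iteration) job is deterministic and safe to rerun.

```python
import json
from pathlib import Path


def run_resumable(stages, state_path, run_job):
    """Run every (stage, iteration) job, skipping those already recorded as done.

    stages:     list of (stage_name, n_iterations) pairs, in training order.
    state_path: file where finished jobs are recorded (hypothetical format).
    run_job:    callable(stage_name, iteration) that raises on failure.
    """
    state_file = Path(state_path)
    done = set()
    if state_file.exists():
        # JSON stores each pair as a list, so convert back to tuples.
        done = {tuple(item) for item in json.loads(state_file.read_text())}

    for stage, n_iterations in stages:
        for iteration in range(1, n_iterations + 1):
            key = (stage, iteration)
            if key in done:
                continue  # finished in an earlier run
            run_job(stage, iteration)  # an exception here aborts the run
            done.add(key)
            # Write to a temp file, then rename, so a crash cannot
            # leave a half-written state file behind.
            tmp = state_file.with_suffix(".tmp")
            tmp.write_text(json.dumps(sorted(done)))
            tmp.replace(state_file)
```

Calling `run_resumable` again with the same state file after a failure would pick up at the first job that was not recorded, rather than at the start of the stage. Rerunning only the failed part of an iteration would need finer-grained bookkeeping than this sketch has.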

@nshmyrev
Contributor

There is a -f option in sphinxtrain to run from any stage:

https://github.com/cmusphinx/sphinxtrain/blob/master/scripts/sphinxtrain#L164

It is not well documented, though.

@dhdaines
Contributor Author

Yes, I've been using that, though it restarts the stage from the beginning, while it should restart from an iteration (or ideally just rerun the job that failed and carry on).

One might question the utility of improving things like this in SphinxTrain, but it's still a considerably nicer training tool than the monstrosity that is the Kaldi scripts :)
