Description
At a high level, my training setup works like this:
- Run training, keep some fixed epochs (either via the default predefined pattern, or custom), and the N best epochs per train/dev scores.
- Run recog (or translation or whatever inference) on fixed epochs + M best epochs on some other dev set (e.g. Hub500 for Switchboard).
- Select the best epoch from the recog results.
- Run recog for all relevant eval sets on the selected best epoch.
I want the recog on the fixed epochs to run as soon as those epochs are ready. I do that via `Job.update`. The recogs for the other epochs need the final learning-rate file with the scores, so they depend on that file. This also goes through `Job.update`, which dynamically adds those recogs. Note that the number of epochs on which recog is performed is variable, because the two sets of epochs can overlap.
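To make the epoch selection concrete, here is a minimal sketch in plain Python. It is not the actual implementation and does not use Sisyphus at all: the function names `epochs_for_recog` and `select_best_epoch`, the example scores, and the assumption that a lower score is better are all hypothetical and only serve as illustration.

```python
from typing import Dict, Iterable, List


def epochs_for_recog(
    fixed_epochs: Iterable[int], train_scores: Dict[int, float], num_best: int
) -> List[int]:
    """Union of the fixed epochs and the num_best epochs by train/dev score.

    Assumption: a lower score is better. Ties go to the earlier epoch.
    """
    best = sorted(train_scores, key=lambda ep: (train_scores[ep], ep))[:num_best]
    # The union removes overlaps, so the result can have fewer entries
    # than len(fixed_epochs) + num_best.
    return sorted(set(fixed_epochs) | set(best))


def select_best_epoch(recog_scores: Dict[int, float]) -> int:
    """Epoch with the lowest recog score (e.g. WER); ties go to the earlier epoch."""
    return min(recog_scores, key=lambda ep: (recog_scores[ep], ep))


# Hypothetical example: epoch 80 is both a fixed epoch and one of the 2 best,
# so recog runs on 4 epochs instead of 5.
fixed = [20, 40, 80]
scores = {20: 1.9, 40: 1.2, 60: 1.1, 70: 1.15, 80: 1.0}
recog_epochs = epochs_for_recog(fixed, scores, num_best=2)
```

Once recog results for these epochs are available on the other dev set, `select_best_epoch` picks the single epoch that is then used for all eval sets.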
I assume this is a fairly reasonable and common pipeline, and that you are probably doing the same or something similar.
I think it would be good to have some common pipeline or helper code for this, rather than everyone having their own custom solution.
So I want to discuss this here. We can implement something new or use some existing code. For example, I have already implemented exactly this. See my `GetBestRecogTrainExp` job, the `recog_training_exp` function, and related code.