
Multi-Utterance Mini-Batch #7

Open
mschonwe opened this issue Oct 30, 2015 · 2 comments

Comments

@mschonwe

Do you have a suggestion for supporting processing of mini-batches of multiple utterances at a time?

We have refactored our data into feature files of fixed frame length. We can have dataLoader load the .bin features as utterances (rows) × frame features (columns), but it seems we would need to modify ctc_fast.pyx to loop over the utterances and somehow combine the gradients. The loop over the utterances seems easy enough, but we're not sure how to combine the gradients.

Have you already tested multi-utterance mini-batches and decided that they are not appropriate for the task?

@amaas
Owner

amaas commented Oct 31, 2015

Updating from multiple utterances would likely help to smooth out the gradients, but it might not be better in terms of the overall final solution. The easiest way to parallelize computing the CTC loss function is to take a data-parallel, map-reduce-style approach rather than trying to vectorize the code in ctc_fast. You could look at the Parallel Python module or something similar to call a separate instance of ctc_fast for each utterance after forward-propagating all utterances through the RNN.
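
Roughly, the map-reduce shape I mean looks like the sketch below. It uses the standard multiprocessing module rather than Parallel Python, and `ctc_single_utt` is only a placeholder for the real per-utterance ctc_fast call, not the actual interface:

```python
from multiprocessing import Pool

import numpy as np


def ctc_single_utt(args):
    """CTC cost and gradient for one utterance.

    In the real code this would wrap a call into ctc_fast for a single
    utterance; the body here is a stand-in so the map-reduce shape is clear.
    """
    probs, labels = args
    cost = float(-np.log(probs).sum())   # placeholder for the CTC forward pass
    grad = np.zeros_like(probs)          # placeholder for the CTC gradient
    return cost, grad


def batch_ctc(utt_probs, utt_labels, num_workers=4):
    """Map: one worker per utterance. Reduce: sum costs, collect per-utt grads."""
    with Pool(num_workers) as pool:
        results = pool.map(ctc_single_utt, list(zip(utt_probs, utt_labels)))
    costs, grads = zip(*results)
    return sum(costs), list(grads)


if __name__ == "__main__":
    # Toy mini-batch: 8 utterances, each (num_classes=30, num_frames=100).
    utts = [np.random.rand(30, 100) for _ in range(8)]
    labs = [np.random.randint(0, 29, size=20) for _ in range(8)]
    total_cost, grads = batch_ctc(utts, labs)
    print(total_cost, len(grads))
```

Each worker runs in its own process, so the per-utterance CTC computations are not serialized by the GIL; the per-utterance arrays just need to be picklable.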


@mschonwe
Author

mschonwe commented Nov 2, 2015

My motivation is based on a lecture from Adam Coates describing the roofline model, and how to increase system throughput by saturating both computational and bandwidth resources.

I was looking to increase the arithmetic intensity (and therefore, hopefully, throughput) by feeding mini-batches into costAndGrad. The profiling I did indicates that nearly all the processing time is spent in cudamat calls (in costAndGrad), so I don't think the CTC calculation is substantially rate-limiting (runsnake image below: ctc_loss is the oval at the bottom right). I think we'll be OK looping over each utterance and averaging the gradients for each mini-batch; a rough sketch of that is below the image.

[runsnake profiler screenshot: runnet_profile]
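
For concreteness, here is roughly the loop-and-average scheme I have in mind. It is only a sketch: `ctc_cost_and_delta` and `rnn_backprop` are placeholder names standing in for the single-utterance CTC and backprop calls, not the actual costAndGrad/ctc_fast interface:

```python
import numpy as np


def minibatch_cost_and_grad(utt_probs, utt_labels, ctc_cost_and_delta, rnn_backprop):
    """Loop over the utterances in one mini-batch and average their gradients.

    ctc_cost_and_delta(probs, labels) -> (cost, delta) for a single utterance
        (this would wrap the existing single-utterance CTC code).
    rnn_backprop(delta) -> flat parameter-gradient vector for that utterance.
    """
    batch_size = len(utt_probs)
    total_cost = 0.0
    summed_grad = None
    for probs, labels in zip(utt_probs, utt_labels):
        cost, delta = ctc_cost_and_delta(probs, labels)  # per-utterance CTC pass
        grad = rnn_backprop(delta)                       # per-utterance parameter gradient
        total_cost += cost
        summed_grad = grad.copy() if summed_grad is None else summed_grad + grad
    # Average over the mini-batch so the update scale does not depend on
    # how many utterances we pack into each batch.
    return total_cost / batch_size, summed_grad / batch_size
```

Averaging (rather than summing) just keeps the effective learning rate independent of the mini-batch size.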
