esalesky edited this page Mar 16, 2018 · 8 revisions

unanswered:

  • maxi-batches in PyTorch: how do we implement them?
  • why does it break if init_hidden is not assigned to None?
  • can the decoder act on a whole sequence for teacher forcing? (tried it; appears not, but why?)
  • should we normalize the loss by batch_size? (going with yes for now)
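For the maxi-batch question, one common approach (a hypothetical sketch, not from the answers below): read a large buffer of examples, sort the buffer by length so each mini-batch groups similarly long sequences (less padding), then shuffle the order of the resulting mini-batches. Plain-Python sketch; the name `maxi_batches` and its interface are made up.

```python
import random

def maxi_batches(examples, mini_batch_size, maxi_batch_size):
    """Yield mini-batches of similar-length examples.

    Hypothetical sketch: take a large "maxi-batch" buffer, sort it by
    length so each mini-batch pads similarly long sequences together,
    then shuffle the mini-batch order so training doesn't always see
    the shortest sequences first.
    """
    for start in range(0, len(examples), maxi_batch_size):
        buf = sorted(examples[start:start + maxi_batch_size], key=len)
        minis = [buf[i:i + mini_batch_size]
                 for i in range(0, len(buf), mini_batch_size)]
        random.shuffle(minis)  # randomize mini-batch order within the buffer
        yield from minis
```

In PyTorch this would sit between the dataset and the training loop, yielding lists of examples that then get padded and tensorized per mini-batch.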

answered:

  • Should we primarily/exclusively use teacher forcing to train the MT model? (can ask a TA or Graham before/after class)
    • Graham's answer: Generally yes; alternatives, or MRT with BLEU, can be explored as well, but the fancy stuff shouldn't be necessary.
  • Normally PyTorch averages the losses over the batch inside loss_fn. When masking, should we average losses only over the non-zero (non-padding) elements?
    • TA's answer: Yes
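Following the TA's answer on masking, a minimal sketch of averaging only over non-padded positions. Plain Python for clarity; `masked_mean_loss` is a made-up name. On PyTorch tensors the same computation is `(loss * mask).sum() / mask.sum()`, and `nn.NLLLoss(ignore_index=PAD_IDX)` likewise excludes padding targets from the average.

```python
def masked_mean_loss(token_losses, mask):
    """Average per-token losses over non-padded positions only.

    token_losses: per-token loss values, shape (batch x time).
    mask: same shape, 1.0 for real tokens, 0.0 for padding.
    Padding positions contribute nothing to the sum, and the divisor
    counts only real tokens, so padding cannot dilute the loss.
    """
    total = sum(l * m
                for row_l, row_m in zip(token_losses, mask)
                for l, m in zip(row_l, row_m))
    count = sum(m for row in mask for m in row)
    return total / count
```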
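On the teacher-forcing questions above, a toy sketch of the usual training-time loop, where each decoder step is conditioned on the gold previous token rather than the model's own prediction. `step_fn` is a hypothetical stand-in for one decoder step (a real PyTorch decoder would also thread hidden state through the loop and return logits for the loss); feeding back the model's predictions instead is why per-step decoding is needed at inference time.

```python
import random

def decode_with_teacher_forcing(step_fn, gold_targets, bos, forcing_ratio=1.0):
    """Run a decoder one step at a time.

    With probability `forcing_ratio` the next step is conditioned on
    the gold token (1.0 = pure teacher forcing); otherwise on the
    model's own prediction. step_fn(prev_token) -> predicted_token is
    a hypothetical stand-in for one decoder step.
    """
    predictions = []
    prev = bos
    for gold in gold_targets:
        pred = step_fn(prev)
        predictions.append(pred)
        # teacher forcing: condition the next step on the gold token,
        # not on the model's (possibly wrong) prediction
        prev = gold if random.random() < forcing_ratio else pred
    return predictions
```

A `forcing_ratio` strictly between 0 and 1 gives a scheduled-sampling-style mix of the two regimes.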