Andrew Runge edited this page Mar 11, 2018 · 12 revisions

baseline:

  • attn decoder: cuda + minibatch capable
  • cross-entropy loss - Note: didn't need to do this
  • gradient clipping
  • learning rate decay by 0.5 after every 10 epochs
  • beam search (beam size=5)
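
A rough sketch of two of the baseline items above, the learning rate decay and gradient clipping, in plain Python. The function names, the 0-based epoch count, and clipping by global L2 norm are assumptions for illustration; the notes don't say which clipping variant or threshold is used.

```python
import math


def decayed_lr(base_lr, epoch, factor=0.5, every=10):
    """Step decay: halve the rate once per 10 completed epochs.

    `epoch` is assumed to be 0-based, so epochs 0-9 use base_lr,
    epochs 10-19 use base_lr * 0.5, and so on.
    """
    return base_lr * factor ** (epoch // every)


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is <= max_norm.

    Assumption: clipping is by global norm; `max_norm` is a hypothetical
    threshold, not a value from the notes.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Clipping by norm keeps the gradient direction and only shrinks its length, which is the usual reason to prefer it over clamping each value separately.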

not baseline:

  • morph-tag data, bpe it

maybe:

  • initialize decoder with mean encoder hidden state instead of last (I vote we try with and without)
  • linear between embeds and hidden (personally I'd like to try this with and without to compare)
  • maxibatches
  • conditional gru for first decoder layer
  • early stopping? Nematus default is just after 10. I think we're okay to not do this & just do 2 restarts
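
For comparing two of the "maybe" items, here is a minimal sketch in plain Python of mean vs. last encoder state for decoder initialization, plus a patience-style early stopping check. All names are hypothetical. It assumes each sentence's encoder states are a list of vectors padded out to the batch length, and that "after 10" means 10 validation checks without improvement; neither is confirmed by the notes, and the actual Nematus behaviour is not reproduced here.

```python
def mean_encoder_state(states, length):
    """Average the first `length` hidden vectors; later ones are assumed padding."""
    dim = len(states[0])
    return [sum(states[t][d] for t in range(length)) / length for d in range(dim)]


def last_encoder_state(states, length):
    """Hidden vector at the last real (non-padded) position."""
    return list(states[length - 1])


def should_stop(val_losses, patience=10):
    """True once the best validation loss is at least `patience` checks old.

    Assumption: one entry in `val_losses` per validation check, lower is better.
    """
    if not val_losses:
        return False
    best_idx = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_idx >= patience
```

With minibatches, the mean has to skip padded positions, otherwise short sentences get their initial state pulled toward whatever the padding states are.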