esalesky edited this page Mar 10, 2018 · 12 revisions

baseline:

  • attn decoder: cuda + minibatch capable
  • cross-entropy loss
  • gradient clipping
  • learning rate decay by 0.5 after every 10 epochs + early stopping
  • beam search (beam size=5)
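
A rough sketch of the training-schedule items above (LR halving every 10 epochs, gradient clipping, early stopping), in plain Python so it runs without a framework. The names `decayed_lr`, `clip_by_global_norm`, and `EarlyStopper` are hypothetical, and `max_norm` and `patience` are placeholder assumptions since these notes don't fix either value.

```python
import math


def decayed_lr(base_lr, epoch, factor=0.5, every=10):
    """LR for a 0-indexed epoch: multiplied by `factor` after every `every` epochs."""
    return base_lr * factor ** (epoch // every)


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


class EarlyStopper:
    """Stop once dev loss has failed to improve for `patience` epochs in a row."""

    def __init__(self, patience=3):  # patience value is an assumption
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, dev_loss):
        if dev_loss < self.best:
            self.best = dev_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a real training loop these would probably be replaced by the framework's own scheduler and clipping utilities; the point here is only to pin down what "decay by 0.5 after every 10 epochs" means (epochs 0-9 at the base rate, 10-19 at half, and so on).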

not baseline:

  • morph-tag data, BPE it
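
For reference, a minimal sketch of how BPE merges are learned from word frequencies. This only covers the plain BPE step; how the morph tags would be attached to the data is not specified in these notes and is not modelled here. The function name `learn_bpe` and the `</w>` end-of-word marker are assumptions for illustration.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Return the list of learned merges, most frequent adjacent pair first."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges
```

For actual experiments an existing BPE tool would likely be used instead; this is just to make the procedure concrete.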

maybe:

  • initialize decoder with mean encoder hidden state instead of last (I vote we try with and without)
  • linear layer between embeddings and hidden state (personally I'd like to try this with and without to compare)
  • maxibatches
  • conditional GRU for first decoder layer
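
To make the first "maybe" item concrete, here is a small sketch of the two decoder-initialization options (mean of encoder hidden states vs. last state), using plain lists of floats rather than tensors. `init_decoder_state` and its `mode` argument are hypothetical names, and the sketch assumes a single unpadded sequence.

```python
def init_decoder_state(encoder_states, mode="mean"):
    """Pick the decoder's initial hidden state from per-timestep encoder states.

    encoder_states: list of hidden vectors (each a list of floats), one per source token.
    mode: "mean" averages over timesteps, "last" takes the final timestep.
    """
    if not encoder_states:
        raise ValueError("need at least one encoder state")
    if mode == "last":
        return list(encoder_states[-1])
    if mode == "mean":
        n = len(encoder_states)
        # zip(*...) walks the states dimension by dimension.
        return [sum(dim) / n for dim in zip(*encoder_states)]
    raise ValueError("mode must be 'mean' or 'last'")


states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean_init = init_decoder_state(states, mode="mean")  # [3.0, 4.0]
last_init = init_decoder_state(states, mode="last")  # [5.0, 6.0]
```

With minibatches, the mean would need to ignore padded positions (for example by masking and dividing by the true source length), otherwise shorter sentences get a diluted initial state.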