My first step into the transformer legacy =)
This version is assembled from other transformer implementations found in various sources.
Note that it has no masking in the decoder, so you should be aware of that (see the sketch below for what the usual causal mask looks like).
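
For reference, here is a minimal sketch of the causal mask a standard transformer decoder applies so that each position can only attend to earlier positions. This assumes a PyTorch-style implementation; it is not code from this repo, just an illustration of what is missing.

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    # Lower-triangular boolean matrix: position i may attend only to
    # positions <= i. True = allowed, False = masked-out future position.
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

# Example for a 4-token sequence:
print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```

Without this mask, the decoder can "see" future tokens during training, so the model will not learn proper autoregressive generation.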
This implementation includes multi-head self-attention and positional encoding,
which give the network an understanding of the structure of the sequence.
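
As an illustration of the positional-encoding part, here is a minimal sketch of the standard sinusoidal encoding from "Attention Is All You Need". Again, this assumes a PyTorch-style setup and an even `d_model`; it is a sketch, not necessarily how this repo computes it.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # Assumes d_model is even.
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

# The encoding is added to the token embeddings, e.g.:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```

Because self-attention is permutation-invariant on its own, adding these position-dependent values to the embeddings is what lets the network distinguish token order.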