The dominant approach recently (2019)
Task: Machine Translation with parallel corpus => predict each translated word
- We want parallelization but RNNs (e.g. LSTM, GRU) are inherently sequential
- Despite GRUs and LSTMs, RNNs still need an attention mechanism to deal with long-range dependencies - the path length between states grows with sequence length
Can we replace sequential computation (i.e. RNNs) entirely with just self-attention?!
- It is really fast (you can do this very quickly on a GPU)
Attention is cheap: (amount of FLOPs per layer; worked example below)

Mechanism | Complexity |
---|---|
Self-attention | $O(n^2 \cdot d)$ |
RNN (LSTM) | $O(n \cdot d^2)$ |
Convolution | $O(k \cdot n \cdot d^2)$ |

(where $n$ = sequence length, $d$ = representation dimension, $k$ = convolution kernel width)
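A back-of-the-envelope check of the table in plain Python. The values of $n$, $d$, $k$ below are made-up illustrative numbers, chosen to show the typical regime where sequence length is smaller than the model dimension:

```python
# Illustrative per-layer FLOP counts (constants ignored); n, d, k are assumptions.
n = 100    # sequence length
d = 1000   # representation dimension
k = 3      # convolution kernel width

self_attention = n**2 * d      # O(n^2 * d) = 1e7
rnn_lstm       = n * d**2      # O(n * d^2) = 1e8
convolution    = k * n * d**2  # O(k * n * d^2) = 3e8

# When n < d (typical sentence lengths vs. model dimensions),
# self-attention is the cheapest of the three.
print(self_attention, rnn_lstm, convolution)
```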
Can we simulate convolution with multi-head?!
- with more heads
- or heads that are a function of position
Problem: As $d_k$, the dimension of the query/key vectors, gets large, the dot products grow in magnitude and push the softmax into regions with tiny gradients
Solution: Scale by the length of the query/key vectors - divide the scores by $\sqrt{d_k}$ (sketch below):
$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
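A minimal sketch of scaled dot-product attention in PyTorch (function name and the example shapes are mine, for illustration only):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # (..., n_queries, n_keys)
    weights = F.softmax(scores, dim=-1)           # each row sums to 1
    return weights @ V                            # weighted sum of the values

Q = torch.randn(2, 5, 64)   # (batch, query positions, d_k)
K = torch.randn(2, 7, 64)   # (batch, key positions,   d_k)
V = torch.randn(2, 7, 64)   # (batch, key positions,   d_v)
out = scaled_dot_product_attention(Q, K, V)       # -> (2, 5, 64)
```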
Problem with simple self-attention: Only one way for words to interact with one another
Solution: Multi-head attention - first project $Q$, $K$, $V$ into $h$ lower-dimensional spaces via learned $W$ matrices, apply attention in each, then concatenate the outputs and pass them through a final linear layer (sketch below):
$\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$, where $\mathrm{head}_i = \operatorname{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
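A sketch of multi-head attention with the paper's defaults ($d_{model}=512$, $h=8$); the class name and the choice to pack all heads into single linear layers are mine:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """h heads attend in parallel in d_model/h-dimensional subspaces."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)  # packs W_1^Q ... W_h^Q
        self.W_k = nn.Linear(d_model, d_model)  # packs W_1^K ... W_h^K
        self.W_v = nn.Linear(d_model, d_model)  # packs W_1^V ... W_h^V
        self.W_o = nn.Linear(d_model, d_model)  # output projection W^O

    def forward(self, q, k, v):
        B, n_q, _ = q.shape
        n_k = k.size(1)
        # project, split d_model into (h, d_k), move heads next to the batch dim
        Q = self.W_q(q).view(B, n_q, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(B, n_k, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(B, n_k, self.h, self.d_k).transpose(1, 2)
        scores = Q @ K.transpose(-2, -1) / self.d_k**0.5   # (B, h, n_q, n_k)
        heads = scores.softmax(dim=-1) @ V                 # (B, h, n_q, d_k)
        concat = heads.transpose(1, 2).reshape(B, n_q, self.h * self.d_k)
        return self.W_o(concat)

mha = MultiHeadAttention()
x = torch.randn(2, 10, 512)
out = mha(x, x, x)   # self-attention: queries, keys, values all come from x
```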
- non-recurrent sequence-to-sequence encoder-decoder
- a multi-head attention (self-attention) stack
- final cost/error function is standard cross-entropy error on top of a softmax classifier (see the end-to-end sketch after this list)
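PyTorch ships this encoder-decoder stack as `nn.Transformer`; a minimal sketch wiring it up to a softmax + cross-entropy loss (hyperparameters are the paper's defaults, the random tensors stand in for real embedded inputs, and the vocabulary size is made up):

```python
import torch
import torch.nn as nn

# 6 encoder and 6 decoder blocks, d_model=512, 8 heads, FFN width 2048
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048, dropout=0.1)

src = torch.randn(10, 32, 512)   # (source length, batch, d_model): embedded source tokens
tgt = torch.randn(20, 32, 512)   # (target length, batch, d_model): embedded target tokens
out = model(src, tgt)            # (20, 32, 512)

# project to the target vocabulary and train with softmax + cross-entropy
vocab_size = 32000               # hypothetical BPE vocabulary size
generator = nn.Linear(512, vocab_size)
logits = generator(out)          # (20, 32, vocab_size)
loss_fn = nn.CrossEntropyLoss()  # softmax classifier + cross-entropy error
```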
Each block has two "sublayers"
- Multi-head Attention
- 2-layer Feed-forward Neural Net (with ReLU)
Each of these two steps also has:
- Residual (short-circuit) connection and LayerNorm (see the sketch after this list)
  - $\operatorname{LayerNorm}(x + \operatorname{Sublayer}(x))$ - LayerNorm changes its input to have mean 0 and variance 1
    - per layer and per training point
    - adds two more learnable parameters (a gain and a bias)
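A sketch of one sublayer step, i.e. $\operatorname{LayerNorm}(x + \operatorname{Sublayer}(x))$ with dropout applied just before the residual is added; the class name `SublayerConnection` is my own:

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """LayerNorm(x + Sublayer(x)), with dropout applied before the residual add."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # the two extra parameters: gain and bias
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))

# the second sublayer: a 2-layer position-wise feed-forward net with ReLU
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = SublayerConnection()
x = torch.randn(2, 10, 512)
y = block(x, ffn)   # same shape as x: (2, 10, 512)
```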
- Actual word representations are byte-pair encodings
- A positional encoding is also added, so the same word at different positions gets a different overall representation (sinusoidal sketch below)
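A sketch of the paper's sinusoidal positional encoding, added on top of the (byte-pair) token embeddings; the function name and example shapes are mine:

```python
import torch

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: PE[pos, 2i]   = sin(pos / 10000^(2i/d_model)),
                             PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even dimensions
    angles = pos / (10000 ** (i / d_model))                       # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

tokens = torch.randn(10, 512)               # byte-pair token embeddings for 10 positions
x = tokens + positional_encoding(10, 512)   # same token, different position -> different vector
```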
The decoder mimics a language model: it can't look forward (attending to future tokens is not allowed)
- causal self-attention
- causality is imposed by simply masking out the positions the model is not allowed to look at (sketch below)
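A sketch of how such a causal mask is typically implemented: future positions get a score of $-\infty$ before the softmax, so they receive exactly zero attention weight:

```python
import torch

n = 5                                        # number of target positions
scores = torch.randn(n, n)                   # raw self-attention scores

# strictly upper-triangular mask: True where position i would look at j > i
future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))
weights = scores.softmax(dim=-1)             # future positions get 0 weight
```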
Masked decoder self-attention on previously generated outputs:
- Encoder-Decoder Attention where queries come from previous decoder layer
- and keys and values come from the output of the encoder (cross-attention sketch below)
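A sketch of this encoder-decoder (cross-) attention using PyTorch's built-in `nn.MultiheadAttention` (the `batch_first=True` flag needs a reasonably recent PyTorch; the tensors are placeholders):

```python
import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

memory = torch.randn(2, 10, d_model)      # encoder output (batch, source length, d_model)
dec_state = torch.randn(2, 7, d_model)    # output of the previous decoder layer

# queries come from the decoder; keys and values come from the encoder output
out, attn_weights = cross_attn(query=dec_state, key=memory, value=memory)
```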
- Byte-pair encodings
- Checkpoint averaging
- Adam optimizer with a changing learning rate (warmup then decay; see the sketch after this list)
- Dropout during training at every layer, just before adding the residual
- Label smoothing
- Auto-regressive decoding with beam search and length penalties
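A sketch of two of these tricks in PyTorch: the paper's warmup-then-decay learning rate schedule for Adam, and label smoothing via `nn.CrossEntropyLoss` (the `label_smoothing` argument requires PyTorch >= 1.10; the parameter list is a stand-in for a real model):

```python
import torch
import torch.nn as nn

def transformer_lr(step, d_model=512, warmup=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)                       # avoid 0^-0.5 on the first call
    return d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)

params = [nn.Parameter(torch.randn(3))]       # stand-in for model.parameters()
optimizer = torch.optim.Adam(params, lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_lr)
# call scheduler.step() once per training step so `step` counts updates, not epochs

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # epsilon = 0.1 as in the paper
```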
Use of transformers is spreading, but they are hard to optimize: unlike LSTMs they don't usually just work out of the box, and they don't yet play well with other building blocks on many tasks.
- Harvard NLP - The Annotated Transformer - Must read!
- Stanford CS224n Lecture14 Transformers Slides
- jadore801120/attention-is-all-you-need-pytorch
- PyTorch Transformer layers
- Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention - YouTube
- Youtube - Attention Is All You Need - TODO
- The Illustrated Transformer
- bilibili - Transformer Explain I
- bilibili - Transformer BERT pre-train II
- Transformers
- Kyubyong/transformer - A TensorFlow Implementation of the Transformer: Attention Is All You Need
- lena-voita/the-story-of-heads - This is a repository with the code for the ACL 2019 paper "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned"