Transformer: Multi-head Self-attention

The dominant approach as of 2019

Task: Machine Translation with a parallel corpus => predict each translated word

Background

Motivation

  • We want parallelization, but RNNs (e.g. LSTM, GRU) are inherently sequential
  • Despite GRUs and LSTMs, RNNs still need an attention mechanism to deal with long-range dependencies, since the path length between states grows with sequence length

Self-attention

Can we replace sequential computation (i.e. RNNs) entirely with just self-attention?!

  • It is really fast (all positions can be processed in parallel on a GPU)

Attention is cheap (amount of FLOPs per layer):

| Mechanism | Complexity |
| --- | --- |
| Self-attention | $O(\text{length}^2 \cdot \text{dim})$ |
| RNN (LSTM) | $O(\text{length} \cdot \text{dim}^2)$ |
| Convolution | $O(\text{length} \cdot \text{dim}^2 \cdot \text{kernel width})$ |
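
As a rough worked example (my numbers, purely for illustration): for a sentence of length 70 with $\text{dim} = 1000$, self-attention costs about $70^2 \cdot 1000 \approx 5 \times 10^6$ FLOPs per layer, while an LSTM layer costs about $70 \cdot 1000^2 = 7 \times 10^7$, so self-attention is roughly an order of magnitude cheaper whenever length $\ll$ dim.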

Can we simulate convolution with multi-head?!

  • with more heads
  • or with heads that are a function of position

Dot-Product Attention

$$ \textit{Attention}(Q, K, V) = \textit{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$

Problem: As $d_k$ gets large, the variance of $q^Tk$ increases $\rightarrow$ some values inside the softmax get large $\rightarrow$ the softmax gets very peaked $\rightarrow$ hence its gradient gets smaller

Solution: Scale by the square root of the query/key dimension, i.e. multiply the scores by $\frac{1}{\sqrt{d_k}}$
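
A minimal NumPy sketch of this formula (the function name and shapes are my own choices, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # scale to keep variance ~1
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                     # (n_queries, d_v)
```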

Multi-head Attention

Problem with simple self-attention: Only one way for words to interact with one-another

$$ \textit{MultiHead}(Q, K, V) = \textit{concat}(head_1, \dots, head_h)W^O $$

where $head_i = \textit{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
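
Reusing the scaled_dot_product_attention sketch above, a hedged sketch of multi-head attention (the projection matrices are passed in as plain lists; all names are my own):

```python
def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """MultiHead(Q, K, V) = concat(head_1, ..., head_h) W_O.

    W_Q, W_K: h matrices of shape (d_model, d_k); W_V: h matrices (d_model, d_v)
    W_O: (h * d_v, d_model)
    """
    heads = [
        scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)  # one head per projection
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_O               # concat, then project back
```

Each head can capture a different kind of interaction between words because it sees the inputs through its own learned projections.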

Model

  • non-recurrent sequence-to-sequence encoder-decoder
  • a multi-head attention (self-attention) stack
  • final cost/error function is standard cross-entropy error on top of a softmax classifier (a sketch follows this list)
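
A minimal sketch of that training loss (NumPy; function name and shapes are my own choices):

```python
def cross_entropy_loss(logits, target_ids):
    """Softmax cross-entropy over the output vocabulary, averaged over positions.

    logits: (n_positions, vocab_size), target_ids: (n_positions,) integer indices
    """
    logits = logits - logits.max(axis=-1, keepdims=True)              # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()
```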

A Transformer Block

Each block has two "sublayers"

  1. Multi-head Attention
  2. 2-layer Feed-forward Neural Net (with ReLU)

Each of these two sublayers also has:

  • Residual (short-circuit) connection and LayerNorm
  • $\operatorname{LayerNorm}(x + \operatorname{Sublayer}(x))$
    • LayerNorm changes input to have mean 0 and variance 1
      • per layer and per training point
      • adds two more parameters (a learned gain and bias); a sketch of the full block follows this list
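
A minimal sketch of one block under these conventions (NumPy; the self-attention sublayer and feed-forward weights are passed in, and all names are my own):

```python
def layer_norm(x, gain, bias, eps=1e-6):
    """Normalize each position to mean 0 / variance 1, then rescale with gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gain * (x - mean) / (std + eps) + bias

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise 2-layer feed-forward net with ReLU."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, self_attention, ffn, gain1, bias1, gain2, bias2):
    """One block: each sublayer wrapped as LayerNorm(x + Sublayer(x))."""
    x = layer_norm(x + self_attention(x), gain1, bias1)   # sublayer 1: multi-head attention
    x = layer_norm(x + ffn(x), gain2, bias2)              # sublayer 2: feed-forward net
    return x
```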

Encoder

  • Actual word representations are byte-pair encodings
  • Also added is a positional encoding so same words at different locations have different overall representations

$$ PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}}) $$

$$ PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}}) $$
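
A sketch that precomputes this table of encodings (NumPy; assumes $d_{model}$ is even):

```python
def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe                                            # added to the input embeddings
```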

Decoder

The decoder mimics a language model (it can't look forward, since that would reveal the words it is trying to predict):

  • causal self-attention
  • impose causality by simply masking out the positions the decoder is not allowed to look at

The decoder uses two kinds of attention:

  1. Masked decoder self-attention on previously generated outputs (see the sketch below)
  2. Encoder-decoder attention, where queries come from the previous decoder layer and keys and values come from the output of the encoder
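
A sketch of the causal (masked) self-attention for a single sequence, using the same shapes as the attention sketch above:

```python
def causal_self_attention(Q, K, V):
    """Masked self-attention: position t may only attend to positions <= t."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -1e9, scores)              # block attention to future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```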

Conclusion

Tips and tricks of the Transformer

  • Byte-pair encodings
  • Checkpoint averaging
  • Adam optimizer with learning rate changes (warmup then decay; see the sketch after this list)
  • Dropout during training at every layer just before adding residual
  • Label smoothing
  • Auto-regressive decoding with beam search and length penalties
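
The "learning rate changes" refer to the warmup-then-inverse-square-root schedule from the original paper; a sketch, with the paper's stated defaults ($d_{model} = 512$, 4000 warmup steps):

```python
def transformer_learning_rate(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)                                  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```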

Use of transformers is spreading, but they are hard to optimize: unlike LSTMs, they don't usually work out of the box, and they don't yet play well with other building blocks on many tasks.

Resources

Tutorial

Github