Transformer: Multi-head Self-attention

The dominant approach as of 2019

Task: Machine Translation with a parallel corpus => predict each translated word

Background

Motivation

  • We want parallelization, but RNNs (e.g. LSTM, GRU) are inherently sequential
  • Despite GRUs and LSTMs, RNNs still need an attention mechanism to deal with long-range dependencies, since the path length between states grows with sequence length

Self-attention

Can we replace sequential computation (i.e. RNNs) entirely with just self-attention?!

  • It is really fast (all positions can be processed in parallel on a GPU)

Attention is cheap (amount of FLOPs per layer):

| Mechanism | Complexity |
| --- | --- |
| Self-attention | $O(\text{length}^2 \cdot \text{dim})$ |
| RNN (LSTM) | $O(\text{length} \cdot \text{dim}^2)$ |
| Convolution | $O(\text{length} \cdot \text{dim}^2 \cdot \text{kernel width})$ |
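
As a rough worked example (my numbers, purely for illustration): for a sentence of length 70 with $\text{dim} = 1000$, self-attention costs about $70^2 \cdot 1000 \approx 5 \times 10^6$ FLOPs per layer, while an LSTM layer costs about $70 \cdot 1000^2 = 7 \times 10^7$, so self-attention is roughly an order of magnitude cheaper whenever length $\ll$ dim.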

Can we simulate convolution with multi-head?!

  • with more heads
  • or with heads that are a function of position

Dot-Product Attention

$$ \textit{Attention}(Q, K, V) = \textit{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$

Problem: As $d_k$ gets large, the variance of $q^Tk$ increases $\rightarrow$ some values inside the softmax get large $\rightarrow$ the softmax gets very peaked $\rightarrow$ hence its gradient gets smaller

Solution: Scale by the square root of the query/key dimension, i.e. multiply the scores by $\frac{1}{\sqrt{d_k}}$
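
A minimal NumPy sketch of this formula (the function name and shapes are my own choices, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # scale to keep variance ~1
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                     # (n_queries, d_v)
```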

Multi-head Attention

Problem with simple self-attention: Only one way for words to interact with one-another

$$ \textit{MultiHead}(Q, K, V) = \textit{concat}(head_1, \dots, head_h)W^O $$

where $head_i = \textit{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
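
Reusing the scaled_dot_product_attention sketch above, a hedged sketch of multi-head attention (the projection matrices are passed in as plain lists; all names are my own):

```python
def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """MultiHead(Q, K, V) = concat(head_1, ..., head_h) W_O.

    W_Q, W_K: h matrices of shape (d_model, d_k); W_V: h matrices (d_model, d_v)
    W_O: (h * d_v, d_model)
    """
    heads = [
        scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)  # one head per projection
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_O               # concat, then project back
```

Each head can capture a different kind of interaction between words because it sees the inputs through its own learned projections.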

Model

  • non-recurrent sequence-to-sequence encoder-decoder
  • a multi-head attention (self-attention) stack
  • final cost/error function is standard cross-entropy error on top of a softmax classifier (a sketch follows this list)
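
A minimal sketch of that training loss (NumPy; function name and shapes are my own choices):

```python
def cross_entropy_loss(logits, target_ids):
    """Softmax cross-entropy over the output vocabulary, averaged over positions.

    logits: (n_positions, vocab_size), target_ids: (n_positions,) integer indices
    """
    logits = logits - logits.max(axis=-1, keepdims=True)              # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()
```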

A Transformer Block

Each block has two "sublayers"

  1. Multi-head Attention
  2. 2-layer Feed-forward Neural Net (with ReLU)

Each of these two sublayers also has:

  • Residual (short-circuit) connection and LayerNorm
  • $\operatorname{LayerNorm}(x + \operatorname{Sublayer}(x))$
    • LayerNorm changes input to have mean 0 and variance 1
      • per layer and per training point
      • adds two more parameters (a learned gain and bias); a sketch of the full block follows this list
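
A minimal sketch of one block under these conventions (NumPy; the self-attention sublayer and feed-forward weights are passed in, and all names are my own):

```python
def layer_norm(x, gain, bias, eps=1e-6):
    """Normalize each position to mean 0 / variance 1, then rescale with gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gain * (x - mean) / (std + eps) + bias

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise 2-layer feed-forward net with ReLU."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, self_attention, ffn, gain1, bias1, gain2, bias2):
    """One block: each sublayer wrapped as LayerNorm(x + Sublayer(x))."""
    x = layer_norm(x + self_attention(x), gain1, bias1)   # sublayer 1: multi-head attention
    x = layer_norm(x + ffn(x), gain2, bias2)              # sublayer 2: feed-forward net
    return x
```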

Encoder

  • Actual word representations are byte-pair encodings
  • Also added is a positional encoding so same words at different locations have different overall representations

$$ PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}}) $$

$$ PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}}) $$
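
A sketch that precomputes this table of encodings (NumPy; assumes $d_{model}$ is even):

```python
def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe                                            # added to the input embeddings
```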

Decoder

The decoder mimics a language model (it can't look forward, since that would reveal the words it is trying to predict):

  • causal self-attention
  • impose causality by simply masking out the positions the decoder is not allowed to look at

The decoder uses two kinds of attention:

  1. Masked decoder self-attention on previously generated outputs (see the sketch below)
  2. Encoder-decoder attention, where queries come from the previous decoder layer and keys and values come from the output of the encoder
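
A sketch of the causal (masked) self-attention for a single sequence, using the same shapes as the attention sketch above:

```python
def causal_self_attention(Q, K, V):
    """Masked self-attention: position t may only attend to positions <= t."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -1e9, scores)              # block attention to future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```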

Conclusion

Tips and tricks of the Transformer

  • Byte-pair encodings
  • Checkpoint averaging
  • Adam optimizer with learning rate changes (warmup then decay; see the sketch after this list)
  • Dropout during training at every layer just before adding residual
  • Label smoothing
  • Auto-regressive decoding with beam search and length penalties
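
The "learning rate changes" refer to the warmup-then-inverse-square-root schedule from the original paper; a sketch, with the paper's stated defaults ($d_{model} = 512$, 4000 warmup steps):

```python
def transformer_learning_rate(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)                                  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```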

Use of transformers is spreading, but they are hard to optimize: unlike LSTMs, they don't usually work out of the box, and they don't yet play well with other building blocks on many tasks.

Resources

Tutorial

Github