Attention is a general deep learning technique
- Attention is not exclusive to any particular deep learning architecture
- so we can compute attention on its own
- Attention is more like a "mechanism" of weighted summation
- 2014 - Recurrent Models of Visual Attention
- RNN model with attention for image classification
- 2014~2015 - Attention in Neural Machine Translation
- 2015~2016 - Attention-based RNN/CNN in NLP
- 2017 - Self-attention
- Attention significantly improves Neural Machine Translation performance
- it is useful to allow the decoder to focus on certain parts of the source
- Attention solves the bottleneck problem
- attention allows the decoder to look directly at the source, bypassing the bottleneck
- Attention helps with vanishing gradient problem (TODO link to the vanishing gradient, shortcut)
- provides shortcut to faraway states
- Attention provides some interpretability (e.g. by visualizing the attention matrix)
- By inspecting attention distribution, we can see what the decoder was focusing on
- Attention is trivial to parallelize (attention is permutation invariant)
- we have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$
- on timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^h$
- we get the attention scores $e^t = [s_t^T h_1, \dots, s_t^T h_N] \in \mathbb{R}^N$
  - each dot product gives a scalar score
- we take softmax to get the attention distribution $\alpha^t = \operatorname{softmax}(e^t) \in \mathbb{R}^N$
  - this is a probability distribution and sums to 1
- we use $\alpha^t$ to take a weighted sum of the encoder hidden states, getting the attention output $a_t = \sum_{i=1}^N \alpha_i^t h_i \in \mathbb{R}^h$
- finally we concatenate the attention output with the decoder hidden state, $[a_t; s_t] \in \mathbb{R}^{2h}$, and proceed as in the non-attention seq2seq model
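A minimal PyTorch sketch of one decoder step of this procedure (shapes and variable names are illustrative, not from any particular codebase):

```python
import torch
import torch.nn.functional as F

def attention_step(H: torch.Tensor, s_t: torch.Tensor):
    """One decoder step: H is (N, h) encoder states, s_t is (h,)."""
    e = H @ s_t                        # attention scores e^t, shape (N,)
    alpha = F.softmax(e, dim=0)        # attention distribution, sums to 1
    a = alpha @ H                      # attention output a_t, shape (h,)
    return torch.cat([a, s_t]), alpha  # [a_t; s_t], shape (2h,)

H = torch.randn(5, 8)                  # N = 5 source positions, h = 8
s_t = torch.randn(8)
out, alpha = attention_step(H, s_t)
print(out.shape, alpha.sum())          # torch.Size([16]) tensor(1.)
```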
Given a set of vector values, and a vector query, attention is a technique to compute a "weighted sum of the values", dependent on the query.
query attends to the values
- Values - a set of vectors
- Query - a single vector
Intuition:
- The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
- Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query)
We have
- Values $h_1, \dots, h_N \in \mathbb{R}^{d_1}$
- Query $s \in \mathbb{R}^{d_2}$

Attention always involves
- Computing the attention scores $e \in \mathbb{R}^N$ (there are multiple ways to do this)
- Taking softmax to get the attention distribution $\alpha = \operatorname{softmax}(e) \in \mathbb{R}^N$
- Using the attention distribution to take a weighted sum of the values, thus obtaining the attention output $a = \sum_{i=1}^N \alpha_i h_i \in \mathbb{R}^{d_1}$

Ways to compute the attention scores $e$ (sketched in code after this list):
- Basic dot-product attention: $e_i = s^T h_i \in \mathbb{R}$
  - this assumes $d_1 = d_2$
- Multiplicative attention: $e_i = s^T W h_i \in \mathbb{R}$
  - two vectors mediated by a matrix
  - where $W \in \mathbb{R}^{d_2 \times d_1}$ is a weight matrix
  - Space Complexity: $O(d_1 d_2)$ parameters for $W$
- Additive attention: $e_i = v^T \tanh(W_1 h_i + W_2 s) \in \mathbb{R}$
  - kind of a shallow neural network
  - where $W_1 \in \mathbb{R}^{d_3 \times d_1}$, $W_2 \in \mathbb{R}^{d_3 \times d_2}$ are weight matrices and $v \in \mathbb{R}^{d_3}$ is a weight vector
  - $d_3$ (the attention dimensionality) is a hyperparameter
  - Space Complexity: $O(d_3 (d_1 + d_2))$ parameters
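A small PyTorch sketch of the three score functions above, with random weights standing in for learned parameters (all dimensions are illustrative):

```python
import torch

N, d1, d2, d3 = 5, 8, 6, 4
h = torch.randn(N, d1)       # values h_1..h_N, stacked as rows
s = torch.randn(d2)          # query

# Basic dot-product attention: e_i = s^T h_i (needs d1 == d2)
s_same = torch.randn(d1)
e_dot = h @ s_same                                  # (N,)

# Multiplicative attention: e_i = s^T W h_i
W = torch.randn(d2, d1)                             # O(d1*d2) parameters
e_mul = h @ W.T @ s                                 # (N,)

# Additive attention: e_i = v^T tanh(W1 h_i + W2 s)
W1 = torch.randn(d3, d1)                            # O(d3*(d1+d2)) parameters
W2 = torch.randn(d3, d2)
v = torch.randn(d3)
e_add = torch.tanh(h @ W1.T + W2 @ s) @ v           # (N,)
```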
- Original version of bilinear-form attention: $S_{ij} = c_i^T W q_j$
- Reduce the rank and complexity by factoring $W$ into the product of two lower-rank matrices: $S_{ij} = c_i^T U^T V q_j$
- Make the attention distribution symmetric: $S_{ij} = c_i^T W^T D W q_j$ (this still makes sense in linear-algebra terms, with $D$ diagonal)
- Stick the left and right halves through a ReLU: $S_{ij} = \operatorname{ReLU}(c_i^T W^T) D \operatorname{ReLU}(W q_j)$
  - Smaller space
  - Non-linearity
  - Space Complexity: $O(kd)$ for rank-$k$ factors, versus $O(d^2)$ for the full $W$
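A sketch of these bilinear variants, assuming the context vectors $c_i$ and query vectors $q_j$ share dimensionality $d$, with random weights for illustration:

```python
import torch

d, k, Nc, Nq = 8, 3, 4, 5          # k << d for the low-rank versions
c = torch.randn(Nc, d)             # context vectors c_i, stacked as rows
q = torch.randn(Nq, d)             # query vectors q_j, stacked as rows

# Original bilinear form: S_ij = c_i^T W q_j
W_full = torch.randn(d, d)                          # O(d^2) parameters
S0 = c @ W_full @ q.T                               # (Nc, Nq)

# Low-rank factorization: S_ij = c_i^T U^T V q_j
U, V = torch.randn(k, d), torch.randn(k, d)         # O(kd) parameters
S1 = (c @ U.T) @ (V @ q.T)                          # (Nc, Nq)

# Symmetric version with ReLU halves:
# S_ij = ReLU(c_i^T W^T) D ReLU(W q_j)
W = torch.randn(k, d)
D = torch.diag(torch.rand(k))                       # diagonal, k parameters
S2 = torch.relu(c @ W.T) @ D @ torch.relu(W @ q.T)  # (Nc, Nq)
```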
Attention Name | Alignment score function | Citation |
---|---|---|
Content-base | $\operatorname{score}(s_t, h_i) = \operatorname{cosine}[s_t, h_i]$ | Graves2014 |
Additive(*) | $\operatorname{score}(s_t, h_i) = v_a^T \tanh(W_a [s_t; h_i])$ | Bahdanau2015 |
Location-Base | $\alpha_{t,i} = \operatorname{softmax}(W_a s_t)$ | Luong2015 |
General (multiplicative) | $\operatorname{score}(s_t, h_i) = s_t^T W_a h_i$ | Luong2015 |
Dot-Product | $\operatorname{score}(s_t, h_i) = s_t^T h_i$ | Luong2015 |
Scaled Dot-Product(^) | $\operatorname{score}(s_t, h_i) = \frac{s_t^T h_i}{\sqrt{n}}$ | Vaswani2017 |

- (*) Referred to as “concat” in Luong, et al., 2015 and as “additive attention” in Vaswani, et al., 2017.
- (^) It adds a scaling factor $1/\sqrt{n}$, motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, making efficient learning hard.
Broader categories of attention mechanisms:

Attention Name | Definition | Citation |
---|---|---|
Self-Attention(&) | Relating different positions of the same input sequence. Theoretically self-attention can adopt any of the score functions above, just replacing the target sequence with the same input sequence. | Cheng2016 |
Global/Soft | Attending to the entire input state space. | Xu2015 |
Local/Hard | Attending to part of the input state space; i.e. a patch of the input image. | Xu2015; Luong2015 |

- (&) Also referred to as “intra-attention” in Cheng et al., 2016 and some other papers.
intra-attention
Re-represent each word based on its context (its neighbors).
- For each node/vector, create a query vector $Q$, a key vector $K$, and a value vector $V$
- dot product ($Q \cdot K$): computes the similarity between the query and each key
- $\sqrt{d_k}$: a scaling factor to make sure that the dot products don't blow up
- putting it together: $\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
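A minimal single-head scaled dot-product self-attention sketch in PyTorch (random projection matrices stand in for learned ones):

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, d_k: int, d_v: int) -> torch.Tensor:
    """x: (N, d_model) token vectors; returns (N, d_v) re-representations."""
    d_model = x.size(-1)
    Wq = torch.randn(d_model, d_k)     # random projections for illustration;
    Wk = torch.randn(d_model, d_k)     # in a real model these are learned
    Wv = torch.randn(d_model, d_v)
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / d_k ** 0.5      # (N, N): similarity of every pair
    alpha = F.softmax(scores, dim=-1)  # each row is a distribution over keys
    return alpha @ V                   # weighted sum of the value vectors

x = torch.randn(6, 16)                 # N = 6 tokens, d_model = 16
print(self_attention(x, d_k=8, d_v=16).shape)  # torch.Size([6, 16])
```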
Multi-head Self-attention: Transformer
- Attention? Attention!
- Attention and Augmented Recurrent Neural Networks
- Attention Mechanism Explained (Part 2): Self-Attention and the Transformer
- successar/AttentionExplanation
- Attention - PyTorch and Keras: Introduce attention mechanism with example using PyTorch and Keras simultaneously
- Andrew Ng - C5W3L07 Attention Model Intuition
- Andrew Ng - C5W3L08 Attention Model
- YouTube - Attention in Neural Network - TODO
- EvilPsyCHo/Attention-PyTorch - Good attention tutorial
- NLP From Scratch: Translation with a Sequence to Sequence Network and Attention — PyTorch Tutorials 1.2.0 documentation