A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV) (reference).
Like recurrent neural networks (RNNs), transformers are designed to process sequential input data, such as natural language, with applications towards tasks such as translation and text summarization. However, unlike RNNs, transformers process the entire input all at once. The attention mechanism provides context for any position in the input sequence. For example, if the input data is a natural language sentence, the transformer does not have to process one word at a time. This allows for more parallelization than RNNs and therefore reduces training times.
Transformers were introduced in 2017 by a team at Google Brain and are increasingly the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM). The additional training parallelization allows training on larger datasets. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which were trained with large language datasets, such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks.
Token embeddings are essentially a lookup table:
This table is used to map each word (token) to its embedding vector (reference).
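As a minimal sketch (the vocabulary size, embedding size, and token ids below are made up purely for illustration), the lookup amounts to plain row indexing into an embedding matrix:

```python
import numpy as np

# Hypothetical sizes for illustration: a 10-token vocabulary and 8-dimensional embeddings.
vocab_size, d_model = 10, 8

# The "lookup table": one row per token id (randomly initialized here, learned in practice).
embedding_table = np.random.randn(vocab_size, d_model)

# Encoding a toy sequence of token ids is plain row lookup.
token_ids = np.array([3, 1, 7])
token_embeddings = embedding_table[token_ids]   # shape (3, 8)
```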
Positional encoding (reference) describes the location or position of an entity in a sequence so that each position is assigned a unique representation.
The positional encoding is given by sine and cosine functions of varying frequencies:

$$ PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$

where $pos$ is the position in the sequence, $i$ indexes the embedding dimension, and $d_{\text{model}}$ is the embedding size.
In sum, since all inputs are processed at once, the transformer embedding is the addition of the token embedding and the positional encoding (reference). More precisely, the positional encoding vector is added to each token embedding vector, so the summed transformer embedding carries positional information within it.
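Continuing the toy example above, a sketch of the sinusoidal encoding and of the summed transformer embedding (shapes and sizes are again illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(seq_len)[:, np.newaxis]             # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]            # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)    # pos / 10000^(2i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# token_embeddings comes from the lookup-table sketch above, shape (3, 8).
seq_len, d_model = token_embeddings.shape

# The transformer input embedding is the element-wise sum of the two.
x = token_embeddings + positional_encoding(seq_len, d_model)   # (3, 8)
```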
Scaled dot-product attention is an attention mechanism in which the dot products are scaled down by $\sqrt{d_k}$:

$$ \text{Attention}\left(\textbf{Q}, \textbf{K}, \textbf{V}\right) = \text{softmax}\left(\frac{\textbf{Q}\textbf{K}^{T}}{\sqrt{d_k}}\right)\textbf{V} $$

If we assume that the components of the query $\textbf{q}$ and key $\textbf{k}$ are independent random variables with mean $0$ and variance $1$, their dot product has mean $0$ and variance $d_k$. Dividing by $\sqrt{d_k}$ brings the variance back to $1$, which keeps the softmax out of regions with extremely small gradients.
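A minimal NumPy sketch of this formula (the matrices below are random placeholders, only meant to show the shapes):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len_q, seq_len_k)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (seq_len_q, d_v)

# Toy example: 4 positions, d_k = d_v = 8.
Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)   # (4, 8)
```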
Multi-head Attention is a module for attention mechanisms which runs an attention mechanism several times in parallel (reference). The independent attention outputs are then concatenated and linearly transformed into the expected dimension. Intuitively, multiple attention heads allow the model to attend to different parts of the sequence in different ways (e.g. longer-term dependencies versus shorter-term dependencies).
$$ \text{MultiHead}\left(\textbf{Q}, \textbf{K}, \textbf{V}\right) = \left[\text{head}_{1},\dots,\text{head}_{h}\right]\textbf{W}_{0} $$
where
$$ \text{head}_{i} = \text{Attention}\left(\textbf{Q}\textbf{W}_{i}^{Q}, \textbf{K}\textbf{W}_{i}^{K}, \textbf{V}\textbf{W}_{i}^{V}\right) $$
Here $\textbf{W}_{0}$ and the $\textbf{W}_{i}^{Q}$, $\textbf{W}_{i}^{K}$, $\textbf{W}_{i}^{V}$ are all learnable parameter matrices.
Note that scaled dot-product attention is most commonly used in this module, although in principle it can be swapped out for other types of attention mechanism.
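A sketch of multi-head attention built on the `scaled_dot_product_attention` function from the previous snippet; the head count, dimensions, and randomly initialized weight matrices are illustrative placeholders rather than learned values:

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_0):
    """Run one attention pass per head, concatenate the heads, then project with W_0."""
    heads = []
    for W_qi, W_ki, W_vi in zip(W_q, W_k, W_v):
        # Reuses scaled_dot_product_attention defined in the previous snippet.
        heads.append(scaled_dot_product_attention(Q @ W_qi, K @ W_ki, V @ W_vi))
    return np.concatenate(heads, axis=-1) @ W_0   # (seq_len, d_model)

# Toy setup: d_model = 8, h = 2 heads, per-head dimension d_k = d_model / h = 4.
d_model, h, d_k = 8, 2, 4
seq_len = 4
Q = K = V = np.random.randn(seq_len, d_model)

W_q = [np.random.randn(d_model, d_k) for _ in range(h)]
W_k = [np.random.randn(d_model, d_k) for _ in range(h)]
W_v = [np.random.randn(d_model, d_k) for _ in range(h)]
W_0 = np.random.randn(h * d_k, d_model)

out = multi_head_attention(Q, K, V, W_q, W_k, W_v, W_0)   # (4, 8)
```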
Layer Normalization directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases (reference).
We calculate it in the following fashion:
$$ y_i = \gamma \hat{x}_{i} + \beta \equiv \text{LN}_{\gamma, \beta}\left(x_i\right) $$
where
$$ \hat{x}_{i,k} = \frac{x_{i,k}-\mu_i}{\sqrt{\sigma_i^2 + \epsilon}} $$
while $\mu_i$ and $\sigma_i^2$ are the mean and variance of the summed inputs, computed over the $K$ hidden units of the layer:

$$ \mu_{i} = \frac{1}{K}\sum_{k=1}^{K} x_{i,k}, \qquad \sigma_{i}^{2} = \frac{1}{K}\sum_{k=1}^{K}\left(x_{i,k}-\mu_{i}\right)^{2} $$

and $\gamma$, $\beta$ are learnable gain and bias parameters, with $\epsilon$ a small constant for numerical stability.
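A minimal sketch of these equations in NumPy, with an illustrative input shape and $\gamma$, $\beta$ at their usual initial values:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row over its features, then apply the learnable gain and bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy input: 4 positions, 8 features each.
x = np.random.randn(4, 8)
gamma = np.ones(8)    # learnable gain, initialized to 1
beta = np.zeros(8)    # learnable bias, initialized to 0
y = layer_norm(x, gamma, beta)   # per-position mean ~0, variance ~1
```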
Position-Wise Feed-Forward Layer is a type of feed-forward layer consisting of two dense layers that are applied to the last dimension; the same dense layers are used for every position in the sequence, hence the name position-wise (reference).
Simply, it can be represented as follows:

$$ \text{FFN}\left(x\right) = \max\left(0,\, x\textbf{W}_{1} + b_{1}\right)\textbf{W}_{2} + b_{2} $$
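A sketch of this layer, assuming the standard ReLU activation between the two dense layers and randomly initialized placeholder weights with an illustrative inner dimension:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Apply the same two dense layers (with a ReLU in between) to every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Toy sizes: d_model = 8, inner dimension d_ff = 32 (placeholder choice).
d_model, d_ff, seq_len = 8, 32, 4
x = np.random.randn(seq_len, d_model)

W1 = np.random.randn(d_model, d_ff); b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model); b2 = np.zeros(d_model)

out = position_wise_ffn(x, W1, b1, W2, b2)   # (4, 8): same shape as the input
```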