Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
All our models contain the following components: (i) a residual block, (ii) an MLP block, and (iii) a temporal-mixing block. While (i) and (ii) are the same across all models, we consider three temporal-mixing blocks: global Multi-Query Attention (MQA), local (sliding-window) MQA and our proposed recurrent block. As part of the recurrent block we use the Real-Gated Linear Recurrent Unit (RG-LRU) – a novel recurrent layer inspired by the Linear Recurrent Unit (Orvieto et al., 2023b).
The residual block, as shown in Figure 2(a), defines the global structure of our models and is inspired by pre-norm Transformers (Xiong et al., 2020). After embedding the input sequence we pass it through $N$ such blocks, where $N$ denotes the model depth, apply RMSNorm to the final activations, and compute token probabilities with a final linear layer whose weights are shared with the input embedding layer.
Figure 2: a) The main backbone of our model architecture is the residual block, which is stacked $N$ times. b) The gated MLP block. c) The recurrent block, which contains our proposed RG-LRU layer.
The residual block contains two components, applied in order. The first component takes the hidden state, applies an RMSNorm layer followed by the temporal-mixing block, and then merges the output with a skip connection from the block input through addition. The second component similarly applies RMSNorm followed by the MLP block, again adding a skip connection from its input.
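As a concrete illustration, the following is a minimal JAX sketch of this residual block. The function and parameter names are hypothetical, and `temporal_mix` / `mlp_block` stand in for the blocks described below, so this is a sketch of the pre-norm structure rather than the authors' implementation.

```python
import jax.numpy as jnp

def rms_norm(x, scale, eps=1e-6):
    # RMSNorm: rescale each feature vector by its root-mean-square.
    rms = jnp.sqrt(jnp.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * scale

def residual_block(x, params, temporal_mix, mlp_block):
    # First component: norm -> temporal-mixing block -> skip connection.
    x = x + temporal_mix(rms_norm(x, params["norm1_scale"]), params["mix"])
    # Second component: norm -> MLP block -> skip connection.
    x = x + mlp_block(rms_norm(x, params["norm2_scale"]), params["mlp"])
    return x
```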
We use a gated MLP block (Dauphin et al., 2017) (illustrated in Figure 2(b)), which creates two branches from its input of dimension $D$. We apply a linear layer with output dimension $MD$ on each branch, where $M$ denotes the expansion factor. A GeLU non-linearity is applied on one of the branches before the branches are merged by element-wise multiplication, and a final linear layer with output dimension $D$ projects the result back to the model width.
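A minimal JAX sketch of this gated MLP, assuming the $MD$-wide branches described above (parameter names are hypothetical, not the authors' code):

```python
import jax
import jax.numpy as jnp

def gated_mlp(x, w_gate, w_up, w_down):
    # Two parallel linear branches of width M*D from an input of width D.
    gate = jax.nn.gelu(x @ w_gate)   # (T, M*D), GeLU applied on one branch
    up = x @ w_up                    # (T, M*D), plain linear branch
    # Element-wise (GeGLU-style) merge, then project back to width D.
    return (gate * up) @ w_down      # (T, D)
```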
The temporal-mixing block is the component of our model that aggregates hidden layer activations at different temporal locations in the sequence. We consider three temporal-mixing blocks: global MQA (Shazeer, 2019), local MQA (Beltagy et al., 2020) and our proposed recurrent block.
Unless otherwise stated, we use MQA rather than MHA to improve the inference speed of our Transformer baselines (Shazeer, 2019). We use a fixed head dimension of 128 and choose the number of heads so that their combined dimension equals the model dimension $D$, which must therefore be a multiple of 128. We do not use absolute positional embeddings, but we use Rotary Position Embedding (RoPE) (Su et al., 2021) as a relative positional embedding.
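The following is a minimal JAX sketch of causal multi-query attention, in which all query heads share a single key and value head; names are hypothetical, and RoPE and KV-cache handling are omitted for brevity.

```python
import jax
import jax.numpy as jnp

def mqa(x, w_q, w_k, w_v, w_o, num_heads, head_dim):
    # x: (T, D). Queries get `num_heads` heads; keys/values use one shared head.
    T, _ = x.shape
    q = (x @ w_q).reshape(T, num_heads, head_dim)     # per-head queries
    k = x @ w_k                                       # (T, head_dim), shared
    v = x @ w_v                                       # (T, head_dim), shared
    logits = jnp.einsum("thd,sd->hts", q, k) / jnp.sqrt(head_dim)
    causal = jnp.tril(jnp.ones((T, T), dtype=bool))   # causal mask
    logits = jnp.where(causal, logits, -jnp.inf)
    probs = jax.nn.softmax(logits, axis=-1)           # (H, T, T)
    out = jnp.einsum("hts,sd->thd", probs, v).reshape(T, num_heads * head_dim)
    return out @ w_o                                  # project back to (T, D)
```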
One of the key disadvantages of using global attention is that its computational complexity grows quadratically in the sequence length. To address this, several works have started to adopt local attention (Beltagy et al., 2020), also known as sliding window attention, which allows each position to attend only to a fixed number of tokens in the past. This not only reduces the computational FLOPs, which no longer grow quadratically in the sequence length, but also bounds the size of the KV cache to the size of the window. All other details are the same as for global MQA.
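The only change relative to the MQA sketch above is the attention mask, which becomes banded as well as causal. A small illustrative helper (the `window` argument is hypothetical):

```python
import jax.numpy as jnp

def local_attention_mask(seq_len, window):
    # mask[t, s] is True when position t may attend to position s:
    # s must be in the past (s <= t) and within the sliding window.
    t = jnp.arange(seq_len)[:, None]
    s = jnp.arange(seq_len)[None, :]
    return (s <= t) & (t - s < window)

# Example: with window=3, each token sees itself and the two previous tokens,
# so the KV cache never needs to hold more than 3 entries per layer.
print(local_attention_mask(5, 3).astype(int))
```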
Our recurrent block (Figure 2(c)) is similar to the GSS block (Mehta et al., 2022) and the block used by Mamba (Gu and Dao, 2023). We take the input of dimension $D$ and apply two linear layers with output dimension $D_{RNN}$ in parallel, creating two branches. On the first branch we apply a small separable Conv1D layer with a temporal filter dimension of 4, followed by our proposed RG-LRU layer. On the second branch we apply a GeLU non-linearity. We then merge the branches by element-wise multiplication and apply a final linear layer with output dimension $D$.
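A minimal JAX sketch of this recurrent block, with hypothetical parameter names; the RG-LRU itself is passed in as a callable and sketched after the RG-LRU equations below, so this shows only the block structure.

```python
import jax
import jax.numpy as jnp

def depthwise_conv1d(x, filt):
    # Causal depthwise (separable) convolution: each channel is convolved
    # with its own short temporal filter. filt: (4, D_RNN), x: (T, D_RNN).
    width = filt.shape[0]
    x_pad = jnp.pad(x, ((width - 1, 0), (0, 0)))      # left-pad in time
    windows = jnp.stack([x_pad[i:i + x.shape[0]] for i in range(width)], axis=0)
    return jnp.einsum("wtd,wd->td", windows, filt)

def recurrent_block(x, params, rg_lru):
    # Branch 1: linear -> Conv1D -> RG-LRU, all at width D_RNN.
    a = x @ params["w_in_rnn"]
    a = depthwise_conv1d(a, params["conv_filter"])
    a = rg_lru(a, params["rg_lru"])
    # Branch 2: linear -> GeLU.
    b = jax.nn.gelu(x @ params["w_in_gate"])
    # Element-wise merge, then project back to the model dimension D.
    return (a * b) @ params["w_out"]
```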
Our proposed RG-LRU layer has a simple recurrence inspired by the Linear Recurrent Unit (LRU) (Orvieto et al., 2023b), but incorporates a gating mechanism motivated by the literature on non-linear RNNs, in particular LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Chung et al., 2014). The equations describing the layer are as follows:

$$r_t = \sigma(W_a x_t + b_a), \qquad \text{(recurrence gate)}$$
$$i_t = \sigma(W_x x_t + b_x), \qquad \text{(input gate)}$$
$$a_t = a^{c r_t},$$
$$h_t = a_t \odot h_{t-1} + \sqrt{1 - a_t^2} \odot (i_t \odot x_t).$$
The output of the layer is $y_t = h_t$. The non-linearity $\sigma$ is the sigmoid function, and the recurrent weight $a$ is diagonal, so all operations above are element-wise. We parameterize $a = \sigma(\Lambda)$, where $\Lambda$ is a learnable parameter, which guarantees $0 \leq a \leq 1$ and hence that the recurrence is stable; the scalar constant $c$ is set to 8. For numerical stability, we compute $a^{c r_t}$ in log-space.
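The following is a minimal JAX sketch of the RG-LRU defined above, using a plain `jax.lax.scan` over time; parameter names are hypothetical and this is not the optimized implementation, but it follows the equations, including computing $a^{c r_t}$ in log-space via $\log \sigma(\Lambda) = -\mathrm{softplus}(-\Lambda)$.

```python
import jax
import jax.numpy as jnp

def rg_lru(x, params, c=8.0):
    # x: (T, D_RNN). The recurrence is element-wise because a is diagonal.
    r = jax.nn.sigmoid(x @ params["w_a"] + params["b_a"])   # recurrence gate
    i = jax.nn.sigmoid(x @ params["w_x"] + params["b_x"])   # input gate
    # a_t = a^{c r_t} with a = sigmoid(Lambda), computed in log-space.
    log_a = -c * r * jax.nn.softplus(-params["lambda"])
    a = jnp.exp(log_a)

    def step(h_prev, inputs):
        a_t, gated_x_t = inputs
        h_t = a_t * h_prev + jnp.sqrt(1.0 - a_t ** 2) * gated_x_t
        return h_t, h_t                                      # carry, output y_t

    h0 = jnp.zeros(x.shape[-1])
    _, y = jax.lax.scan(step, h0, (a, i * x))
    return y
```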
The input gate $i_t$ is similar to the input gate in LSTMs and can filter (or scale down) the input $x_t$. The recurrence gate $r_t$, in contrast, can approximately interpolate between the standard LRU update and retaining the previous hidden state, which allows the layer to discard uninformative inputs while preserving information from the past.