Relative positional encoding
Standard self-attention:

Let $x = (x_1, \dots, x_n)$ be the input sequence. Then

$$q_i = W_q x_i, \quad k_j = W_k x_j, \quad v_j = W_v x_j,$$

$$e_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}}, \qquad \alpha_{ij} = \operatorname{softmax}_j(e_{ij}), \qquad z_i = \sum_j \alpha_{ij} v_j.$$

We drop any transpose or reshape in the following formulas, and hope they become clear from the context, or from the specified output shape.
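As a reference point, here is a minimal single-head NumPy sketch of these formulas (no masking, no batching; all names are local to this example):

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention over x of shape [time, dim_in]."""
    q = x @ W_q   # [time, d_k]
    k = x @ W_k   # [time, d_k]
    v = x @ W_v   # [time, d_v]
    d_k = q.shape[-1]
    e = q @ k.T / np.sqrt(d_k)                       # energies e_ij
    alpha = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)       # softmax over j
    return alpha @ v                                 # z_i = sum_j alpha_ij v_j
```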
RETURNN's `SelfAttentionLayer` supports `key_shift`, where you can pass the output of `RelativePositionalEncodingLayer`.
This was proposed in Shaw et al, Self-Attention with Relative Position Representations, 2018, and implemented in Tensor2Tensor (see `_relative_attention_inner`), and RETURNN follows that implementation more or less.
The default is a trainable encoding matrix $w$ with clipping limit 16, i.e. the encoding matrix shape is $(2 \cdot 16 + 1) \times d_k$. Ignoring the biases, the energies become

$$e_{ij} = \frac{q_i^\top (k_j + a_{j-i})}{\sqrt{d_k}}, \qquad a_{j-i} = w_{\operatorname{clip}(j-i,\,-16,\,16)},$$

i.e. the relative positional encoding $a_{j-i}$ is added to the key ("key shift") before the dot product with the query.
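A minimal sketch of this wiring as a RETURNN network dict (the dimensions and layer names here are just for illustration; check the layer documentation for the full set of options):

```python
# Hypothetical excerpt of a RETURNN network dict: a self-attention layer
# whose keys are shifted by a trainable relative positional encoding.
network = {
    "rel_pos": {
        "class": "relative_positional_encoding",
        "from": "data",
        "n_out": 64,      # per-head key dimension
        "clipping": 16,   # default clipping limit, matrix shape (2*16+1, 64)
    },
    "self_att": {
        "class": "self_attention",
        "from": "data",
        "num_heads": 8,
        "total_key_dim": 512,
        "n_out": 512,
        "key_shift": "layer:rel_pos",
    },
}
```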
In RETURNN-common, `nn.RelPosSelfAttention` and `nn.relative_positional_encoding` follow the paper Dai et al, Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, 2019, and the original implementation.
The derivation starts by looking at how the absolute positional encoding enters the energies. With absolute encodings $p_i$ added to the inputs, the term $q_i^\top k_j$ expands into four terms:

$$(W_q (x_i + p_i))^\top W_k (x_j + p_j) = \underbrace{x_i^\top W_q^\top W_k x_j}_{(a)} + \underbrace{x_i^\top W_q^\top W_k p_j}_{(b)} + \underbrace{p_i^\top W_q^\top W_k x_j}_{(c)} + \underbrace{p_i^\top W_q^\top W_k p_j}_{(d)}.$$

Transformer-XL replaces the absolute encoding $p_j$ by a relative encoding $R_{i-j}$ with its own key projection $W_{k,R}$, and replaces the query-side position terms $p_i^\top W_q^\top$ in (c) and (d) by trainable vectors $u$ and $v$:

$$e_{ij} \propto \underbrace{x_i^\top W_q^\top W_{k,E} x_j}_{(a)} + \underbrace{x_i^\top W_q^\top W_{k,R} R_{i-j}}_{(b)} + \underbrace{u^\top W_{k,E} x_j}_{(c)} + \underbrace{v^\top W_{k,R} R_{i-j}}_{(d)}.$$
Any shifts of the absolute positions (replacing $i$ by $i + c$ everywhere) leave the energies unchanged, since only the relative distance $i - j$ enters.
Actually, these are `linear_pos` ($W_{k,R}$), `pos_bias_u` ($u$) and `pos_bias_v` ($v$) in our code.
Note that, compared to Shaw et al 2018, we have the additional transformation $W_{k,R}$ (`linear_pos`) applied to the relative positional encoding, as well as the additional bias terms $u$ and $v$.
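A minimal single-head NumPy sketch of these energies, assuming the relative encodings have already been gathered into a `[time, time, dim]` tensor (real implementations use a more efficient shift trick); `linear_pos`, `pos_bias_u` and `pos_bias_v` mirror the names in the code, the rest is made up for illustration:

```python
import numpy as np

def rel_pos_energies(x, rel_enc, W_q, W_k, W_pos, pos_bias_u, pos_bias_v):
    """Attention energies with Transformer-XL style relative positional encoding.

    x:        [time, dim_in] input sequence
    rel_enc:  [time, time, dim_pos] relative encodings R_{i-j}, gathered per (i, j)
    W_pos:    corresponds to linear_pos (W_{k,R}); W_k corresponds to W_{k,E};
              pos_bias_u / pos_bias_v correspond to u and v
    """
    q = x @ W_q                        # [time, d_k]
    k = x @ W_k                        # [time, d_k]
    p = rel_enc @ W_pos                # [time, time, d_k], i.e. W_{k,R} R_{i-j}
    # Terms (a)+(c): content-based, with bias u added to the query.
    ac = (q + pos_bias_u) @ k.T        # [time, time]
    # Terms (b)+(d): position-based, with bias v added to the query.
    bd = np.einsum("id,ijd->ij", q + pos_bias_v, p)
    return (ac + bd) / np.sqrt(q.shape[-1])
```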
Other methods for encoding relative positions:

- T5 method (Raffel et al 2020): a trainable scalar bias per relative-position bucket (and per head) is added directly to the attention energies.
- ALiBi, from Press et al, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, 2021: a fixed, head-specific penalty proportional to the distance $|i - j|$ is subtracted from the energies.
- RoPE, from Su et al, RoFormer: Enhanced Transformer with Rotary Position Embedding, 2021: queries and keys are rotated depending on their absolute positions, such that their dot product depends only on the relative position.
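As an illustration of the last variant, a small NumPy sketch of the rotary embedding (the pairing of dimensions and the base frequency follow the RoFormer paper; the function name is just for this example):

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply rotary position embedding (RoPE) to queries or keys.

    x:         [time, dim] with even dim
    positions: [time] absolute positions, e.g. np.arange(time)
    """
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # [d/2] frequencies
    angles = positions[:, None] * inv_freq[None, :]   # [time, d/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    # Rotate each 2D pair (x1, x2) by the position-dependent angle.
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Applying this to both queries and keys before the dot product makes $q_i^\top k_j$ depend only on the contents and the relative position $i - j$, which is the point of the method.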