Attention with Linear Biases (ALiBi) does not add positional embeddings to the word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. ALiBi is defined as:

$$
\mathrm{softmax}\left(\mathbf{q}_i \mathbf{K}^{\top} + m \cdot \left[-(i-1), \ldots, -2, -1, 0\right]\right)
$$

where $\mathbf{q}_i$ is the $i$-th query, $\mathbf{K}$ contains the keys of the first $i$ positions, and $m$ is a head-specific slope fixed before training.
The figure offers a visualization: ALiBi adds a constant bias (right) to each attention score ($\mathbf{q}_i \cdot \mathbf{k}_j$, left).
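As a small worked example (the five-token causal case is an arbitrary illustration, not taken from the source): for a single head with slope $m$, the bias added to the attention scores is

$$
m \cdot
\begin{bmatrix}
0 & & & & \\
-1 & 0 & & & \\
-2 & -1 & 0 & & \\
-3 & -2 & -1 & 0 & \\
-4 & -3 & -2 & -1 & 0
\end{bmatrix},
$$

where the empty entries above the diagonal correspond to positions removed by the causal mask.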
```python
import torch

# Build the per-position ALiBi bias, aligned to the bottom-right corner so that
# the last query position lines up with the last key/value position.
# num_heads, data_type, seqlen_q, seqlen_kv and slopes_m are assumed to be defined.
alibi_mask = torch.full((seqlen_q, seqlen_kv), float('inf'), dtype=data_type)
for i in range(seqlen_q - 1, -1, -1):
    for j in range(seqlen_kv):
        # Relative offset: 0 on the aligned diagonal, negative for earlier keys,
        # positive for future keys, which keep the inf fill value.
        mask = j - seqlen_kv + 1 + (seqlen_q - 1 - i)
        if mask <= 0:
            alibi_mask[i][j] = mask
alibi_mask = alibi_mask.unsqueeze(0).expand(num_heads, -1, -1)
# alibi_mask shape -> (num_heads, seqlen_q, seqlen_kv)
# slopes_m shape -> (num_heads, 1, 1)
alibi_mask = slopes_m * alibi_mask  # scale each head's bias by its slope
```
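The snippet above uses `slopes_m` without defining it. The following is a minimal sketch, assuming the standard head-specific slopes from the ALiBi paper (a geometric sequence starting at $2^{-8/n}$ for $n$ heads) and a power-of-two head count; the function name `get_alibi_slopes` is illustrative, not from the source:

```python
import torch

def get_alibi_slopes(num_heads: int) -> torch.Tensor:
    # ALiBi paper slopes: for n heads, 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    # Assumes num_heads is a power of two; the paper interleaves a second
    # geometric sequence for other head counts.
    slopes = [2 ** (-8.0 * (k + 1) / num_heads) for k in range(num_heads)]
    return torch.tensor(slopes).view(num_heads, 1, 1)

slopes_m = get_alibi_slopes(num_heads)  # shape (num_heads, 1, 1), as used above
```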
The code above shows the calculation process of the ALiBi mask `alibi_mask`; the parameters involved are described below:
- `num_heads`: Number of heads.
- `data_type`: Data type of the ALiBi mask.
- `seqlen_q`: Length of the query tensor. Shape: scalar.
- `seqlen_kv`: Length of the key/value tensor. Shape: scalar.
- `attn_mask`: Optional custom mask; when its shape allows, `attn_mask` will be broadcast. Note: the last dim of the mask can be larger than `seqlen_kv`.
- `alibi_mask` (output): Output mask of ALiBi. Shape: `(num_heads, seqlen_q, seqlen_kv)`.
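For illustration only, the same mask can also be built without Python loops; the sketch below uses arbitrary small sizes and the slope formula sketched earlier, and is an equivalent rewrite rather than code from the source:

```python
import torch

# Arbitrary illustrative sizes.
num_heads, seqlen_q, seqlen_kv = 4, 3, 5
data_type = torch.float32

# Per-head slopes, shape (num_heads, 1, 1).
slopes_m = torch.tensor(
    [2 ** (-8.0 * (k + 1) / num_heads) for k in range(num_heads)]
).view(num_heads, 1, 1)

# Relative offset of key j with respect to query i, aligned to the
# bottom-right corner: 0 on the effective diagonal, negative to the left.
i = torch.arange(seqlen_q).view(-1, 1)   # (seqlen_q, 1)
j = torch.arange(seqlen_kv).view(1, -1)  # (1, seqlen_kv)
rel = ((j - seqlen_kv + 1) + (seqlen_q - 1 - i)).to(data_type)

# Future positions (positive offsets) keep inf, matching the loop version.
alibi_mask = slopes_m * rel.masked_fill(rel > 0, float('inf')).unsqueeze(0)

print(alibi_mask.shape)  # torch.Size([4, 3, 5])
```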