Attention with Linear Biases (ALiBi) does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. For the query at position i, ALiBi is defined as:

    softmax(q_i K^T + m * [-(i - 1), ..., -2, -1, 0])

where m is a head-specific slope that is fixed before training rather than learned.
The figure offers a visualization.
ALiBi adds a constant bias (right) to each attention score (q_i · k_j, left). num_heads is the number of heads in the transformer model and determines the head-specific slopes.
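Before the slope computation, here is a rough sketch of how the per-head bias enters the attention scores. It is illustrative only, not this OP's implementation; NumPy and the helper name alibi_bias_for_head are assumptions made for the example.

import numpy as np

def alibi_bias_for_head(seq_len, slope):
    # Bias at (query i, key j) is slope * (j - i): 0 at the current position,
    # increasingly negative the further back the key is.
    positions = np.arange(seq_len)
    distances = positions[None, :] - positions[:, None]    # j - i
    return slope * np.where(distances <= 0, distances, 0)  # keep only the causal part

# The bias is simply added to the raw attention scores before the softmax:
# scores = q @ k.T / np.sqrt(head_dim) + alibi_bias_for_head(seq_len, slope)

The head-specific slopes themselves are produced by get_slopes below.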
import math

def get_slopes(num_heads):
    # Head-specific ALiBi slopes: a geometric sequence 2**(-8/n), 2**(-16/n), ...
    result = []
    closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
    for n in range(1, closest_power_of_2 + 1):
        result.append(2 ** (-8 * n / closest_power_of_2))
    if closest_power_of_2 < num_heads:
        # For a head count that is not a power of 2, take every other slope from
        # the sequence that would be used for 2 * closest_power_of_2 heads.
        for n in range(1, 2 * (num_heads - closest_power_of_2) + 1, 2):
            result.append(2 ** (-4 * n / closest_power_of_2))
    return result
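For example, with 8 heads the slopes are exact powers of two; with a head count that is not a power of two, such as 12, four interpolated slopes are appended (rounded values shown in the comments):

get_slopes(8)    # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
get_slopes(12)   # the first 8 values as above, then roughly 0.707, 0.354, 0.177, 0.088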
NOTE: The result of this OP is used by the Attention OP to fuse the ALiBiMask.
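As a hedged illustration of what such an ALiBi mask could contain (the function name build_alibi_mask and the [num_heads, seq_len, seq_len] layout are assumptions for this sketch, not this OP's documented interface; it reuses get_slopes and alibi_bias_for_head from above):

import numpy as np

def build_alibi_mask(num_heads, seq_len):
    # One (seq_len, seq_len) bias matrix per head, each scaled by that head's
    # slope, stacked into a (num_heads, seq_len, seq_len) tensor.
    return np.stack([alibi_bias_for_head(seq_len, m) for m in get_slopes(num_heads)])

Fusing such an additive mask into the attention kernel is equivalent to adding it to the query-key scores before the softmax, as in the definition at the top of this section.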
Parameters:
    num_heads: Number of heads. Default: None.
Shape: