Add `scaled_dot_product_attention` as a function to RF, and use it in our attention code. (Does this also work with `RelPosSelfAttention`?)

In the case of PyTorch, wrap `torch.nn.functional.scaled_dot_product_attention`. That should be much more compute- and memory-efficient than the direct implementation. It uses FlashAttention or potentially a number of other efficient kernels. (Although on older GPUs, probably not. See my question on the PyTorch discussion forum.)
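For context, a minimal sketch of what the PyTorch backend would wrap: a naive reference implementation next to `torch.nn.functional.scaled_dot_product_attention`. The tensor layout and the `direct_attention` helper here are illustrative assumptions, not the actual RF code.

```python
# Sketch only: compares a naive attention implementation with the fused
# torch.nn.functional.scaled_dot_product_attention. Shapes/names are illustrative.
import math
import torch
import torch.nn.functional as F


def direct_attention(q, k, v):
    # Naive implementation: materializes the full [batch, heads, q_time, kv_time]
    # attention-weight matrix, which costs O(T^2) memory.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)


batch, heads, time, dim = 2, 4, 100, 64
q = torch.randn(batch, heads, time, dim)
k = torch.randn(batch, heads, time, dim)
v = torch.randn(batch, heads, time, dim)

out_direct = direct_attention(q, k, v)
# The fused op can dispatch to FlashAttention, memory-efficient attention,
# or an unfused math fallback, depending on dtype, device and GPU architecture.
out_fused = F.scaled_dot_product_attention(q, k, v)

torch.testing.assert_close(out_direct, out_fused, atol=1e-5, rtol=1e-5)
```

On older GPUs (or unsupported dtypes/masks) the op falls back to the unfused math path, so the memory savings are not guaranteed everywhere.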