
Flash attention support. #20152

Merged: 13 commits into keras-team:master on Oct 8, 2024

Conversation

@hazemessamm (Contributor)

I added support for flash attention for PyTorch.

Let me know what you think about the current implementation so I can add support for JAX, and maybe I'll also try TF.

@codecov-commenter commented Aug 22, 2024

Codecov Report

Attention: Patch coverage is 26.31579% with 14 lines in your changes missing coverage. Please review.

Project coverage is 78.85%. Comparing base (5aa5f88) to head (57e6e56).
Report is 2 commits behind head on master.

Files with missing lines              Patch %   Lines
keras/src/backend/torch/nn.py         18.18%    8 Missing and 1 partial ⚠️
keras/src/backend/numpy/nn.py         0.00%     1 Missing and 1 partial ⚠️
keras/src/backend/tensorflow/nn.py    0.00%     1 Missing and 1 partial ⚠️
keras/src/backend/jax/nn.py           66.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #20152      +/-   ##
==========================================
+ Coverage   78.81%   78.85%   +0.04%     
==========================================
  Files         512      513       +1     
  Lines       49063    49250     +187     
  Branches     9035     9080      +45     
==========================================
+ Hits        38668    38837     +169     
- Misses       8530     8543      +13     
- Partials     1865     1870       +5     
Flag               Coverage Δ
keras              78.71% <26.31%> (+0.04%) ⬆️
keras-jax          62.36% <21.05%> (+0.10%) ⬆️
keras-numpy        57.38% <10.52%> (-0.03%) ⬇️
keras-tensorflow   63.62% <10.52%> (+0.06%) ⬆️
keras-torch        62.35% <15.78%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.


@fchollet (Member) left a comment

Thanks for the PR -- the code looks good! Please add a unit test.

For the JAX version, I think we'd want to rely on a Pallas kernel. We can get help from the JAX team.

github-actions (bot)

This PR is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

The github-actions bot added the stale label on Sep 24, 2024.
@hazemessamm (Contributor, Author)

Hey, sorry for not finishing this PR. I have a quick question: where should I add the tests?

@fchollet (Member) commented Oct 2, 2024

> Hey, sorry for not finishing this PR. I have a quick question: where should I add the tests?

In keras/src/ops/nn_test.py. Ops are tested through the op class (e.g. in keras/src/ops/nn.py), rather than in a backend-specific way.
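
A minimal sketch of such a backend-agnostic test (class name, shapes, and the layout assumption here are illustrative, not the exact test that was added; the reference assumes the op follows the (batch, seq_len, num_heads, head_dim) convention of jax.nn.dot_product_attention):

import numpy as np
from keras.src import testing
from keras.src.ops import nn as knn


class DotProductAttentionTest(testing.TestCase):
    def test_dot_product_attention_matches_reference(self):
        # Inputs in (batch, seq_len, num_heads, head_dim) layout.
        query = np.random.random((1, 8, 2, 16)).astype("float32")
        key = np.random.random((1, 8, 2, 16)).astype("float32")
        value = np.random.random((1, 8, 2, 16)).astype("float32")

        outputs = knn.dot_product_attention(query, key, value)

        # NumPy reference: softmax(Q K^T / sqrt(head_dim)) V.
        scale = 1.0 / np.sqrt(query.shape[-1])
        logits = np.einsum("btnh,bsnh->bnts", query, key) * scale
        weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        expected = np.einsum("bnts,bsnh->btnh", weights, value)
        self.assertAllClose(outputs, expected, atol=1e-4)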

@fchollet (Member) commented Oct 5, 2024

@james77777778 do you think flash attention should be a standalone op, or could this be managed at the level of the dot_product_attention op (e.g. as an argument)?

@james77777778 (Contributor)

> @james77777778 do you think flash attention should be a standalone op, or could this be managed at the level of the dot_product_attention op (e.g. as an argument)?

It should be possible to consolidate this into dot_product_attention. That’s how it's implemented in torch, and I've seen a similar approach in jax
(https://github.com/jax-ml/jax/blob/81a31f6adf453b2afc39936e15c15d8ad327bf6e/jax/_src/nn/functions.py#L1037-L1041)

As far as I know, torch uses flash attention automatically when the conditions are met. For jax, we need to pass implementation="cudnn" to use it.
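
For reference, a minimal sketch of the jax path mentioned above, using the jax.nn.dot_product_attention function linked above (it requires a recent jax release, fp16/bf16 inputs, and a supported GPU):

import jax
import jax.numpy as jnp

# Inputs in (batch, seq_len, num_heads, head_dim) layout; the fused cuDNN
# kernel requires fp16/bf16 dtypes and a supported GPU.
query = jnp.ones((1, 128, 4, 64), dtype=jnp.float16)
key = jnp.ones((1, 128, 4, 64), dtype=jnp.float16)
value = jnp.ones((1, 128, 4, 64), dtype=jnp.float16)

# implementation="cudnn" requests the fused (flash) attention kernel;
# implementation="xla" (or None) falls back to the plain XLA computation.
out = jax.nn.dot_product_attention(
    query, key, value, is_causal=True, implementation="cudnn"
)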

@fchollet (Member) commented Oct 6, 2024

Very cool -- @hazemessamm can we do that, e.g. by adding a flash_attention argument to dot_product_attention? That would also make it easy to add support for JAX in addition to PyTorch. For TF I think we can skip support for now.
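
For illustration, the user-facing call could then look roughly like this (the flash_attention argument name is taken from the suggestion above; the exact final signature is whatever the PR ends up merging):

import numpy as np
from keras import ops

query = np.random.random((1, 32, 4, 64)).astype("float16")
key = np.random.random((1, 32, 4, 64)).astype("float16")
value = np.random.random((1, 32, 4, 64)).astype("float16")

# Same op as before; the extra flag opts into the fused flash-attention path
# on backends and hardware that support it.
out = ops.dot_product_attention(query, key, value, flash_attention=True)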

@fchollet (Member) left a comment

Awesome work! Thank you.

@hazemessamm (Contributor, Author)

> Awesome work! Thank you.

Thank you, glad I could help.

@fchollet (Member) commented Oct 6, 2024

The test fails on torch + GPU:

FAILED keras/src/ops/nn_test.py::NNOpsCorrectnessTest::test_dot_product_attention_none_none_(true, false)_true - RuntimeError: No available kernel. Aborting execution.

Do you know if this is an issue with the torch version? What version is required? What torch + GPU setup were you testing on?

@hazemessamm (Contributor, Author) commented Oct 6, 2024

> The test fails on torch + GPU:
>
> FAILED keras/src/ops/nn_test.py::NNOpsCorrectnessTest::test_dot_product_attention_none_none_(true, false)_true - RuntimeError: No available kernel. Aborting execution.
>
> Do you know if this is an issue with the torch version? What version is required? What torch + GPU setup were you testing on?

I think flash attention in PyTorch does not work with any dtype except float16, and only on specific GPUs. I just tested it on an H100 GPU and it worked fine, but it did not work on a T4 GPU on Colab.

I also just found the following functions in PyTorch that we can use to check whether the inputs and the current GPU can use flash attention or not.

import torch

# Dummy inputs in the (batch, num_heads, seq_len, head_dim) layout expected by SDPA.
bsz, num_heads, seqlen, head_dim = 1, 2, 10, 16
query = torch.randn((bsz, num_heads, seqlen, head_dim), dtype=torch.float32, device='cuda:0')

# Build the SDPA parameter struct for a self-attention call and ask PyTorch
# whether the flash-attention kernel can be used for these inputs on this GPU.
params = torch.backends.cuda.SDPAParams(query, query, query, None, 16**-0.5, False)
is_flash_attention_enabled = torch.backends.cuda.can_use_flash_attention(params, False)
print(is_flash_attention_enabled)  # Output: False here; it becomes True with dtype=torch.float16

If you think this is a good idea, I will use this snippet in the flash attention function in the PyTorch backend.

Documentation:
https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.SDPAParams
https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.can_use_flash_attention
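
A rough sketch of how that check could be wired into the torch backend (function layout and argument names here are illustrative, not necessarily what was merged; the SDPAParams argument order follows the snippet above and may differ across torch versions):

import math
import torch


def can_use_flash_attention(query, key, value, mask, is_causal):
    # Ask PyTorch whether the fused flash kernel supports these inputs on the
    # current GPU; treat a missing or changed API as "not supported".
    try:
        params = torch.backends.cuda.SDPAParams(
            query, key, value, mask, 0.0, is_causal
        )
        return torch.backends.cuda.can_use_flash_attention(params, False)
    except (AttributeError, TypeError):
        return False


def dot_product_attention(
    query, key, value, mask=None, is_causal=False, scale=None, flash_attention=False
):
    if scale is None:
        scale = 1.0 / math.sqrt(query.shape[-1])
    if flash_attention and not can_use_flash_attention(
        query, key, value, mask, is_causal
    ):
        raise ValueError(
            "Flash attention is not supported for these inputs on this GPU."
        )
    # scaled_dot_product_attention dispatches to the flash kernel automatically
    # whenever the backend conditions are met.
    return torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=mask, is_causal=is_causal, scale=scale
    )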

@fchollet (Member) commented Oct 7, 2024

> If you think this is a good idea, I will use this snippet in the flash attention function in the PyTorch backend.

That sounds great! Then, we can also skip the PyTorch unit test when this check evaluates to False.

@fchollet (Member) left a comment

Looks good, thank you! Can you also add the test back? You can use pytest.mark.skipif to skip it where it's unsupported, e.g. for PyTorch and TF.
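
For example, a skip along these lines (a sketch only; the exact conditions are whatever the PR ends up using):

import pytest

from keras.src import backend


# Sketch: skip the flash-attention test on backends with no flash path; the
# torch/JAX GPU- and dtype-dependent conditions can be checked inside the test.
@pytest.mark.skipif(
    backend.backend() in ("tensorflow", "numpy"),
    reason="Flash attention is not supported on this backend.",
)
def test_dot_product_attention_with_flash_attention():
    ...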

@hazemessamm (Contributor, Author) commented Oct 7, 2024

I skipped the tests for TensorFlow, NumPy and torch. I also tested JAX on a T4 GPU on Colab and got this error: RuntimeError: Require at least Ampere arch to run. So we will need the JAX + GPU tests to run on an Ampere-or-newer GPU; otherwise we will need to skip the tests for all frameworks. Also, the JAX version currently used in the GitHub CI does not have the dot_product_attention function.
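
A sketch of how those two JAX conditions could be checked before running the test (this assumes the jax GPU device object exposes a compute_capability attribute, as recent CUDA-enabled jaxlib builds do; the helper name is illustrative):

import jax


def jax_supports_flash_attention():
    # Older jax releases (including the one on CI at the time) do not ship
    # jax.nn.dot_product_attention at all.
    if not hasattr(jax.nn, "dot_product_attention"):
        return False
    # The cuDNN fused kernel needs an Ampere (compute capability 8.0) or newer
    # GPU; a T4 (7.5) fails with "Require at least Ampere arch to run".
    try:
        device = jax.devices("gpu")[0]
        return float(device.compute_capability) >= 8.0
    except (RuntimeError, AttributeError, ValueError):
        return False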

@hazemessamm (Contributor, Author) commented Oct 7, 2024

I added some conditions for JAX so the tests are skipped when they are met. What do you think?

@fchollet merged commit 8e67e0e into keras-team:master on Oct 8, 2024. 9 checks passed.