Adding flash attention for sequence parallel #565

dianaml0 · 2022-12-23T21:38:18Z

Patch Description
Creating this PR off of #511, so it can be reviewed by @stephenroller

The last commit (3d709db) removes some changes from the sequence parallel code that enabled testing with a world size of 1. CI is not currently running the test anyway, because CI needs to be updated before the test can run.

The forward and backward tests are passing right now. However, in some cases about 0.2% of the elements fail the tolerance check.

Testing steps
Unit test: gpu_tests/test_sequence_parallel_transformer_layer.py
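The commit history for this PR includes "skip if xformers not available". As a rough illustration of that kind of guard (not the actual test code from this PR), a GPU test can be skipped when the package cannot be imported. The helper `has_module` and the class `TestNeedsXformers` are hypothetical names used only for this sketch.

```python
import importlib.util
import unittest


def has_module(name):
    """Return True if a top-level module with this name can be found (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None


# Assumption: the real test uses its own guard; this only sketches the idea.
@unittest.skipUnless(has_module("xformers"), "xformers is not installed")
class TestNeedsXformers(unittest.TestCase):
    def test_placeholder(self):
        # Only runs when xformers can be found on the import path.
        self.assertTrue(has_module("xformers"))
```

With a guard like this, environments without xFormers report the test as skipped rather than failed.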

dianaml0 · 2022-12-23T22:12:22Z

CircleCI failure not related to this PR

stephenroller · 2023-01-03T15:20:16Z

Can we rebase for checks? Should we be concerned about the last bits of numerical differences?

dianaml0 · 2023-01-03T16:12:29Z

@stephenroller just rebased the PR, so it should be up to date now. The rtol and atol used are the same ones we use in xFormers for testing all flash attention bwds. I can do a small training run to validate, would that be useful?
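For context on what "a fraction of elements fail" means under an rtol/atol check, here is a minimal sketch in plain Python. It uses the same elementwise rule as `torch.isclose` and `numpy.isclose`, `|actual - expected| <= atol + rtol * |expected|`. The function name `mismatch_fraction` and the tolerance values are placeholders for illustration, not the values used in xFormers or in this PR's tests.

```python
def mismatch_fraction(actual, expected, rtol, atol):
    """Fraction of elements with |actual - expected| > atol + rtol * |expected|.

    Hypothetical helper: takes two flat sequences of floats of equal length.
    """
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same length")
    if not actual:
        return 0.0
    failing = sum(
        1 for a, e in zip(actual, expected)
        if abs(a - e) > atol + rtol * abs(e)
    )
    return failing / len(actual)


# Placeholder data and tolerances, chosen only to show the call.
expected = [0.5, -1.25, 2.0, 0.0]
actual = [0.5004, -1.25, 2.1, 0.0]
frac = mismatch_fraction(actual, expected, rtol=1e-3, atol=1e-3)
print(f"{frac:.2%} of elements fall outside the tolerance")
```

Reporting the failing fraction alongside the largest absolute error makes it easier to judge whether a small number of outliers is a real concern.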

dianaml0 · 2023-01-04T18:46:55Z

Looks like everything is passing now after rebasing.

facebook-github-bot · 2024-06-17T22:08:00Z

Hi @dianaml0!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

dianaml0 requested review from suchenzang, stephenroller, ngoyal2707, punitkoura, moyapchen, klshuster, ruanslv, davides, igormolybogFB and Xirider as code owners December 23, 2022 21:38

facebook-github-bot added the cla signed label Dec 23, 2022

dianaml0 force-pushed the flash_seqpar_v2 branch from fe89b3b to 16c91d1 Compare December 23, 2022 21:47

dianaml0 force-pushed the flash_seqpar_v2 branch from 2f05657 to cb3a4ee Compare January 3, 2023 15:55

dianaml0 requested a review from sharannarang as a code owner January 3, 2023 15:55

dianaml0 force-pushed the flash_seqpar_v2 branch from cb3a4ee to 135972e Compare January 4, 2023 17:19

dianaml0 force-pushed the flash_seqpar_v2 branch from 390b808 to 9401934 Compare January 6, 2023 20:32

stephenroller and others added 10 commits January 18, 2023 09:53

[wip] Adding flash attention for sequence parallel

b4539bb

change to faster flash attn

2830c3d

add back standard attention

6dc5006

gate mem efficient attn behind a flag

55af48f

cleanup

25270a6

save flags for backwards

0514d61

linting

d360526

add test

de36a37

move test to gpu tests

9b09c89

fix

1d42bba

dianaml0 and others added 18 commits January 18, 2023 09:53

lint

66d52f8

fix tests

2dbb1ca

add args

02717dd

install xformers for gpu tests

6740918

update shapes

f68a3d8

skip if xformers not available

3b1cb73

use Triton since fastest for zucchini shape, fix reshaping of attn

7db010b

cleaner reshaping

546458f

clean up tests

156c057

Do not install xFormers in circleCI, need updated Cuda

dea3360

clean up tests

832340a

fixing bwd, some tmp changes

d33855c

add testing and logic for multiple heads, fix bug in bwd

c84ad93

clean up tests and add separate tolerances for fwd and bwd

2ab4400

remove changes to code needed for testing with world size of 1

722ede4

lint fixes

523f4e1

formatting

9aa46d2

Clean up comments

d0aa8b6

dianaml0 force-pushed the flash_seqpar_v2 branch from 9401934 to d0aa8b6 Compare January 18, 2023 17:53
