Adding flash attention for sequence parallel #565
Force-pushed from fe89b3b to 16c91d1.
The CircleCI failure is not related to this PR.
Can we rebase for checks? Should we be concerned about the last bits of numerical differences?
Force-pushed from 2f05657 to cb3a4ee.
@stephenroller I just rebased the PR, so it should be up to date now. The rtol and atol used are the same ones we use in xFormers to test all flash attention backward passes. I can do a small training run to validate; would that be useful?
Force-pushed from cb3a4ee to 135972e.
Looks like everything is passing now after rebasing.
Force-pushed from 390b808 to 9401934.
Force-pushed from 9401934 to d0aa8b6.
Hi @dianaml0! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Patch Description
Creating this PR off of #511 so that it can be reviewed by @stephenroller.
The last commit (3d709db) removes some changes to the sequence parallel code that enabled testing with a world size of 1. CI does not currently run the test anyway, because CI needs to be updated before the test can run.
The forward and backward tests are passing right now. However, in some cases, about 0.2% of the elements fail
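A mismatch rate like the one mentioned above can be measured directly when deciding whether the remaining numerical differences matter. The sketch below is not the test code from this PR. It assumes the usual element-wise closeness criterion, `|actual - expected| <= atol + rtol * |expected|`, which is the one `torch.allclose` documents. The function name `mismatch_fraction`, the tolerances, and the sample values are all hypothetical and chosen only for illustration.

```python
def mismatch_fraction(actual, expected, rtol, atol):
    """Return the fraction of elements outside the rtol/atol tolerance.

    An element fails when |actual - expected| > atol + rtol * |expected|
    (assumed criterion, matching the documented torch.allclose formula).
    """
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same length")
    if not actual:
        return 0.0
    failures = sum(
        1
        for a, e in zip(actual, expected)
        if abs(a - e) > atol + rtol * abs(e)
    )
    return failures / len(actual)


# Hypothetical data: 1000 reference values, two of them perturbed.
expected = [1.0] * 1000
actual = list(expected)
actual[0] = 1.5
actual[1] = 0.5

fraction = mismatch_fraction(actual, expected, rtol=1e-3, atol=1e-3)
print(f"{fraction:.2%} of elements exceed the tolerance")
```

Reporting the fraction rather than a single pass/fail result makes it easier to tell a handful of borderline elements apart from a systematic error.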
Testing steps
Unit Test
gpu_tests/test_sequence_parallel_transformer_layer.py