don't save inputs/outputs buffer of FlashAttenFunc to reduce memory usage for inference mode #1383

XiaobingSuper · 2024-12-12T02:21:33Z

In inference mode, we don't need to save the inputs and outputs that are only used by the backward pass in training mode. This PR reduces memory usage during LLM serving (for example vLLM, which uses these flash-attn APIs for better performance).


rocking5566 · 2024-12-17T22:32:08Z

@XiaobingSuper
I found that this PR makes pytest (https://github.com/Dao-AILab/flash-attention/tree/main/tests) fail.

flash-attention/flash_attn/flash_attn_interface.py

Line 455 in 0dfb281

is_grad = torch.is_grad_enabled() and qkv.requires_grad

torch.is_grad_enabled() is always False in the test script, so the tensors are not saved and backward cannot access ctx.saved_tensors.

XiaobingSuper · 2024-12-19T01:38:09Z

Sorry, that was my mistake. Gradient computation is already disabled inside custom autograd.Functions by default (https://discuss.pytorch.org/t/is-torch-no-grad-making-a-difference-in-custom-autograd-functions/186627), so torch.is_grad_enabled() needs to be checked before calling the custom op.
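To illustrate the behavior discussed here, a minimal sketch with a toy op follows. `Square`, `square`, and the class attributes `seen_inside_forward` and `saved_last_call` are hypothetical names used only for this example; this is not the flash-attn code and not necessarily how #1397 implements the fix. It assumes the caller reads the grad mode before `apply` and passes it in as an ordinary argument.

```python
import torch


class Square(torch.autograd.Function):
    """Toy stand-in for a custom op (hypothetical, not FlashAttnFunc)."""

    seen_inside_forward = None  # what torch.is_grad_enabled() returns inside forward
    saved_last_call = None      # whether the last forward saved tensors for backward

    @staticmethod
    def forward(ctx, x, grad_enabled_at_call):
        # Inside forward, autograd has already switched grad mode off,
        # so this does not reflect the caller's grad mode.
        Square.seen_inside_forward = torch.is_grad_enabled()

        # Decide from the flag captured by the caller, not from the call above.
        is_grad = grad_enabled_at_call and x.requires_grad
        if is_grad:
            ctx.save_for_backward(x)  # only keep the buffer when backward can run
        Square.saved_last_call = is_grad
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # One gradient per forward argument; the bool flag gets None.
        return 2 * x * grad_out, None


def square(x):
    # Read the grad mode *before* entering the custom Function.
    return Square.apply(x, torch.is_grad_enabled())
```

With this wrapper, a call made under `torch.no_grad()` saves nothing, while a normal training call still saves `x` so that `backward` can read `ctx.saved_tensors`.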

XiaobingSuper · 2024-12-19T02:18:02Z

@rocking5566 I created PR #1397 to fix this issue. Thanks.

don't save inputs buffer of FlashAttenFunc to reduce memory usage for…

6a8d811

… inference mode

tridao merged commit 0dfb281 into Dao-AILab:main Dec 12, 2024

XiaobingSuper mentioned this pull request Dec 19, 2024

check torch.is_grad_enabled before calling customer flash atten ops #1397

Closed

rocking5566 mentioned this pull request Jan 7, 2025

[ROCm] benchmark_flash_attention.py failing with Memory Access Fault #1381

Open
