use flash attn fuse cross entropy loss to reduce metric memory usage #2987

cli99 · 2024-02-09T05:59:13Z

This PR uses the fused cross entropy loss from flash attention in the metric LanguageCrossEntropy (and in LanguagePerplexity).
The current torch.nn.CrossEntropyLoss call needs 6 * seq_len * vocab_size GPU memory, which can become the memory bottleneck when the sequence length is long (where activation checkpointing is probably used). Using the cross entropy loss from flash attn resolves this problem.
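
To give a feel for the 6 * seq_len * vocab_size figure above, here is a rough back-of-the-envelope sketch. The description does not state a unit or a batch dimension, so `unit_bytes` (default 1, i.e. reading the figure directly as bytes) and the vocabulary size of 50368 are assumptions chosen for illustration only, and `metric_loss_memory` is a hypothetical helper, not part of composer or flash attn.

```python
def metric_loss_memory(seq_len: int, vocab_size: int, factor: int = 6, unit_bytes: int = 1) -> int:
    """Estimate factor * seq_len * vocab_size, scaled by an assumed unit size.

    factor=6 is the multiplier quoted in the PR description.
    unit_bytes is an assumption: the description gives no unit.
    """
    return factor * seq_len * vocab_size * unit_bytes


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    vocab_size = 50368  # illustrative value, not taken from the PR
    for seq_len in (2048, 8192, 32768):
        estimate = metric_loss_memory(seq_len, vocab_size)
        print(f"seq_len={seq_len:>6}: {to_gib(estimate):.2f} GiB")
```

The estimate grows linearly with sequence length, which is consistent with the observation that the metric's loss call only becomes the bottleneck for long sequences.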

Example test model with a long sequence and full activation checkpointing:

with torch loss fn: [image not preserved]

with flash_attn loss fn: [image not preserved]

dakinggg · 2024-02-09T06:03:07Z

@cli99 consider doing it as in https://github.com/mosaicml/llm-foundry/pull/575/files, to avoid introducing flash attn as a composer dependency.

j316chuck · 2024-02-09T06:04:31Z

composer/metrics/nlp.py

+        try:
+            from flash_attn.losses.cross_entropy import CrossEntropyLoss as FusedCrossEntropyLoss
+            self.loss_fn = FusedCrossEntropyLoss(ignore_index=ignore_index, reduction='sum')
+        except ImportError:
+            self.loss_fn = torch.nn.CrossEntropyLoss(ignore_index=ignore_index, reduction='sum')


Should this live in llm-foundry? CC: @dakinggg

i have a very old pr that i never merged

mvpatel2000 · 2024-03-19T20:42:59Z

@cli99 should we close, or is this still WIP?

add flash attn fuse cross entropy loss

a3117a6

j316chuck reviewed Feb 9, 2024

dakinggg mentioned this pull request Apr 26, 2024

Use FA's CrossEntropyLoss for metrics calculation #3214

Closed

7 tasks

mvpatel2000 force-pushed the dev branch from 8a09a3b to 6f8831d Compare July 22, 2024 21:04
