Initial fix for token corruption when batching #665

renxida · 2024-12-09T21:20:23Z

There are 2 problems fixed by 2 code changes in this PR.

Cache over-allocation.

This is a small problem that causes us to over-allocate pages in the KV cache. The change here is sufficient to solve the problem at hand, but further work is needed to get service.py and {Base,Trie}PagedAttentionCache to allocate a precise and consistent amount of cache.

Zero-padding of seq_len and start_position

For unused requests in a batch, seq_len and start_position are usually filled with 0. This injects NaNs that are written to page 0.

Page index 0 serves a special padding role in our batching system. It's used to fill unused pages for shorter requests and to pad unused requests within a batch.
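To make the two padding roles concrete, here is a minimal sketch of how a batch's page table might be padded with page 0. The helper name `pad_page_table` and its parameters are hypothetical and are not taken from the shortfin code; only the use of page 0 as the padding page comes from the description above.

```python
PADDING_PAGE = 0  # page index 0 is reserved for padding, per the description above


def pad_page_table(page_lists, batch_size, max_pages):
    """Pad each request's page list, then pad the batch itself (hypothetical helper)."""
    table = []
    for pages in page_lists:
        # Role 1: shorter requests get their unused page slots filled with page 0.
        table.append(list(pages) + [PADDING_PAGE] * (max_pages - len(pages)))
    while len(table) < batch_size:
        # Role 2: unused requests in the batch point entirely at page 0.
        table.append([PADDING_PAGE] * max_pages)
    return table


table = pad_page_table([[255, 254], [253]], batch_size=4, max_pages=3)
for row in table:
    print(row)
```

Because every unused slot points at the same page, anything written through an unused request lands in page 0, which is why its contents matter.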

Under normal circumstances, NaNs in page 0 wouldn't be problematic since our masking system is designed to ignore values beyond the current token. For example, when generating token 17 with a page list of [255, 254, 0], we should never need to read from the padding page.

The issue stems from our current masking implementation. Instead of directly ignoring values, we mask by adding negative infinity to values before applying an exponential function. While this typically works fine and results in zeroes, it breaks down when encountering NaN values. When this happens, NaN values from page 0 can leak into our calculations, resulting in token corruption.
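The failure mode can be reproduced in a few lines of plain Python. This is an illustrative sketch, not the model's actual attention kernel: `additive_mask_softmax` mimics masking by adding negative infinity before the exponential, and `select_mask_softmax` is a hypothetical alternative that ignores masked values outright.

```python
import math

NEG_INF = float("-inf")


def additive_mask_softmax(scores, visible):
    """Mask by adding -inf before exp, as described above."""
    masked = [s if v else s + NEG_INF for s, v in zip(scores, visible)]
    weights = [math.exp(m) for m in masked]  # exp(-inf) == 0.0, but exp(nan) is nan
    total = sum(weights)
    return [w / total for w in weights]


def select_mask_softmax(scores, visible):
    """Mask by never touching hidden values (hypothetical alternative)."""
    weights = [math.exp(s) if v else 0.0 for s, v in zip(scores, visible)]
    total = sum(weights)
    return [w / total for w in weights]


visible = [True, True, False]          # the last position is padding
clean = [0.5, 1.0, 2.0]
dirty = [0.5, 1.0, float("nan")]       # padding position holds a NaN

print(additive_mask_softmax(clean, visible))  # masked weight is exactly 0.0
print(additive_mask_softmax(dirty, visible))  # every weight becomes nan
print(select_mask_softmax(dirty, visible))    # NaN never enters the sum
```

Since `nan + (-inf)` is still `nan`, the masked position poisons the normalizing sum and every weight in the row becomes NaN, even though the position was supposed to be ignored.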

stbaione

LGTM, once pre-commit passes and leak is figured out in ASan

PRs in the history of this problem: #665, #723

#665 is supposed to fix a NaN cache corruption issue by 1-filling seq_len instead of 0-filling. It was supposed to 1-fill seq_len for both decode and prefill, but I mistakenly 1-filled seq_len for decode only, and also 1-filled the start_position for decode instead of the prefill seq_len.

#723 adds 1-filling for prefill, and this PR removes the mistaken start_positions 1-filling for decode.

After this PR, the shortfin concurrent tests should be working properly.

Up next: a failing trie kv sharing test case.
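The fill convention these PRs converge on can be sketched as follows. The function `fill_batch_args` and its argument names are hypothetical and do not come from service.py; the sketch only assumes what is stated above, namely that unused slots get seq_len filled with 1 and start_positions left zero-filled.

```python
def fill_batch_args(seq_lens, start_positions, batch_size):
    """Pad per-request arguments out to the batch size (hypothetical helper)."""
    unused = batch_size - len(seq_lens)
    # seq_len is 1-filled for unused slots instead of 0-filled.
    padded_seq_lens = list(seq_lens) + [1] * unused
    # start_positions stays zero-filled; 1-filling it was the mistake noted above.
    padded_start_positions = list(start_positions) + [0] * unused
    return padded_seq_lens, padded_start_positions


seq_lens, start_positions = fill_batch_args([17, 5], [16, 4], batch_size=4)
print(seq_lens)
print(start_positions)
```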

renxida marked this pull request as ready for review December 9, 2024 21:20

renxida requested review from stbaione and rsuderman December 9, 2024 23:15

stbaione approved these changes Dec 9, 2024

renxida force-pushed the bslfix branch 2 times, most recently from 6767c34 to dc9544a Compare December 10, 2024 17:14

renxida added 2 commits December 10, 2024 12:50

initial fix

386c368

precommit

176f0aa

renxida enabled auto-merge (squash) December 10, 2024 17:50

renxida force-pushed the bslfix branch from dc9544a to 176f0aa Compare December 10, 2024 17:50

renxida merged commit 4c015d4 into nod-ai:main Dec 10, 2024
20 checks passed

renxida mentioned this pull request Dec 20, 2024

Fix gibberish token problem for prefill also #723

Merged

renxida added a commit that referenced this pull request Dec 20, 2024

Fix gibberish token problem for prefill also (#723)

2776186

missed a line in #665

monorimet pushed a commit that referenced this pull request Jan 8, 2025

Fix gibberish token problem for prefill also (#723)

9613981

missed a line in #665

renxida mentioned this pull request Jan 14, 2025

Fix shortfin cpu llm concurrency test problems by zero-fill start positions instead of one-filling #826

Merged
