[QUESTION] About all-reduce overlap #8
Comments
@ZhongYingMatrix Thank you for your attention. Dual-Streams employs two CUDA streams: one designated for All-Reduce operations and the other for Prefill computation. The Prefill input is segmented into multiple chunks based on tokens. The approach is to compute one chunk first, then run its All-Reduce while the next chunk is computed in parallel, iterating through this cycle until completion. This feature was contributed by my colleague @spetrel.
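For readers who want to see the shape of this pipeline, here is a minimal sketch of the two-stream pattern described above. It is not the project's code: `dual_stream_prefill`, `prefill_chunk_kernel`, and the buffer layout are made-up placeholders, and NCCL's `ncclAllReduce` stands in for whatever collective the project actually uses.

```cpp
// Sketch only: all names are placeholders, not the repository's API.
// Chunk i's All-Reduce is enqueued on a second stream behind an event, so it
// overlaps with chunk i+1's computation on the compute stream.
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

// Placeholder for the real per-chunk prefill computation.
__global__ void prefill_chunk_kernel(float* x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;  // dummy work
}

void dual_stream_prefill(ncclComm_t comm, float* hidden,
                         size_t chunk_elems, int num_chunks) {
    cudaStream_t compute_stream, allreduce_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&allreduce_stream);

    std::vector<cudaEvent_t> done(num_chunks);
    for (int i = 0; i < num_chunks; ++i)
        cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);

    dim3 block(256), grid((unsigned)((chunk_elems + block.x - 1) / block.x));
    for (int i = 0; i < num_chunks; ++i) {
        float* chunk = hidden + (size_t)i * chunk_elems;

        // Compute chunk i on the compute stream ...
        prefill_chunk_kernel<<<grid, block, 0, compute_stream>>>(chunk, chunk_elems);
        cudaEventRecord(done[i], compute_stream);

        // ... and All-Reduce it on the other stream once it is ready. The host
        // loop immediately issues chunk i+1's compute, so this collective
        // overlaps with the next chunk's computation.
        cudaStreamWaitEvent(allreduce_stream, done[i], 0);
        ncclAllReduce(chunk, chunk, chunk_elems, ncclFloat, ncclSum,
                      comm, allreduce_stream);
    }

    cudaStreamSynchronize(compute_stream);
    cudaStreamSynchronize(allreduce_stream);
    for (auto& e : done) cudaEventDestroy(e);
    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(allreduce_stream);
}
```

The key point of the pattern is that the host loop never blocks: if the collective were issued on the same stream as the compute kernels, communication and computation would serialize instead of overlapping.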
@unix1986 @spetrel Hi, sorry to bother you. I have a new question regarding the attention computation: since attention involves all of the tokens, how is this handled across different chunks?
We do "chunks" only for prompt encoding computation, the KV cache is "continuous" as one piece. |
@spetrel Is it chunked before or after the QKV projection? I suppose any query chunk should refer to all of the key/value chunks.
Before, at the input of the block layer. Yes, all query chunks refer to the whole KV cache. CUDA events are used to synchronize the chunk attentions, making sure the previous chunk's attention is done before the next chunk's attention starts.
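A minimal sketch of that event-based ordering, assuming consecutive chunk attentions may be issued on different streams (all kernel and variable names below are placeholders): the previous chunk's attention records an event, and the next chunk's stream waits on that event before launching its own attention.

```cpp
// Sketch only, not the project's code. If every chunk ran on a single stream,
// stream order alone would already serialize the attentions.
#include <cuda_runtime.h>
#include <vector>

// Placeholder for the real per-chunk attention: queries of the current chunk
// attend over K/V for tokens [0, kv_len) in the shared, contiguous cache.
__global__ void chunk_attention_kernel(const float* q, const float* k,
                                       const float* v, float* out, int kv_len) {
    // ... real attention math elided ...
}

void ordered_chunk_attention(const float* q, const float* k, const float* v,
                             float* out, const std::vector<int>& chunk_end,
                             cudaStream_t streams[2]) {
    std::vector<cudaEvent_t> done;
    cudaEvent_t prev = nullptr;

    for (size_t i = 0; i < chunk_end.size(); ++i) {
        cudaStream_t s = streams[i % 2];
        if (prev)                        // chunk i waits for chunk i-1
            cudaStreamWaitEvent(s, prev, 0);

        chunk_attention_kernel<<<1, 128, 0, s>>>(q, k, v, out, chunk_end[i]);

        cudaEvent_t e;
        cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
        cudaEventRecord(e, s);           // mark "chunk i attention done"
        done.push_back(e);
        prev = e;
    }

    cudaStreamSynchronize(streams[0]);
    cudaStreamSynchronize(streams[1]);
    for (auto& e : done) cudaEventDestroy(e);
}
```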
Thanks for your wonderful work! I am very interested in your method of "Encode and all-reduce overlap, we named 'dual streams'". Could you provide a general explanation of this technical approach?