[QUESTION] About all-reduce overlap #8

Closed
ZhongYingMatrix opened this issue Dec 12, 2024 · 5 comments

Comments

ZhongYingMatrix commented Dec 12, 2024

Thanks for your wonderful work! I am very interested in your method of overlapping Encode and all-reduce, named "dual streams". Could you provide a general explanation of this technical approach?

ZhongYingMatrix changed the title from "[QUESTION] About all" to "[QUESTION] About all-reduce overlap" on Dec 12, 2024
unix1986 (Collaborator) commented Dec 12, 2024

@ZhongYingMatrix Thank you for your attention. Dual-Streams employs two CUDA streams, one designated for All-Reduce operations and the other for Prefill computations. The Prefill process is segmented into multiple chunks based on tokens. The approach computes one chunk first, then launches its All-Reduce while the next chunk is computed in parallel, iterating through this cycle until completion. This feature was contributed by my colleague @spetrel.
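
To illustrate, here is a minimal sketch of that pattern. It is not the engine's actual code: it uses PyTorch's stream and torch.distributed APIs for clarity, and `chunks` / `chunk_forward` are placeholder names for the per-chunk prefill inputs and computation.

```python
import torch
import torch.distributed as dist

# Minimal sketch of the dual-stream overlap, assuming torch.distributed has
# already been initialized with the NCCL backend. `chunks` and `chunk_forward`
# are placeholders for the per-chunk prefill inputs and computation.

def prefill_with_overlap(chunks, chunk_forward):
    compute_stream = torch.cuda.current_stream()
    comm_stream = torch.cuda.Stream()       # dedicated stream for All-Reduce

    outputs = []
    for chunk in chunks:
        # 1. Compute this chunk's partial result on the compute stream.
        partial = chunk_forward(chunk)

        # 2. Launch its All-Reduce on the communication stream, which first
        #    waits for the compute kernels that produced `partial`.
        comm_stream.wait_stream(compute_stream)
        with torch.cuda.stream(comm_stream):
            dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        partial.record_stream(comm_stream)   # keep `partial` alive for the comm stream

        outputs.append(partial)
        # 3. The loop then issues the next chunk's compute kernels right away,
        #    so they overlap with the All-Reduce launched above.

    # Wait for all outstanding All-Reduce work before the results are consumed.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return outputs
```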

ZhongYingMatrix (Author) commented

> The Prefill process is segmented into multiple chunks based on tokens. The approach computes one chunk first, then launches its All-Reduce while the next chunk is computed in parallel, iterating through this cycle until completion.

@unix1986 @spetrel Hi, sorry to bother you. I have a new question regarding the attention computation. Since attention involves all of the tokens, how is this handled across the different chunks?

spetrel (Collaborator) commented Dec 16, 2024

We do "chunks" only for the prompt-encoding computation; the KV cache stays "continuous" as one piece.

ZhongYingMatrix (Author) commented

@spetrel Is the input chunked before or after the QKV projection? I suppose each query chunk should refer to all of the key/value chunks.

spetrel (Collaborator) commented Dec 17, 2024

Before, at the input of the block layer. Yes, all query chunks refer to the whole KV cache. CUDA events are used to synchronize between the chunks' attention computations, to make sure the previous chunk's attention is done before the next chunk's attention starts.
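
For illustration, here is a rough sketch of that layout. It is a PyTorch approximation rather than the engine's CUDA implementation; `block`, `kv_cache`, and their methods are hypothetical names used only to show the structure.

```python
import torch

def block_forward_chunked(block, hidden_chunks, kv_cache):
    """Hypothetical sketch: chunking happens at the block-layer input. QKV
    projection runs per chunk, but every query chunk attends to the whole,
    continuous KV cache written so far."""
    compute_stream = torch.cuda.current_stream()
    attn_done = torch.cuda.Event()   # marks completion of a chunk's attention

    outs = []
    for h in hidden_chunks:
        q, k, v = block.qkv_proj(h)  # projection of this chunk only
        kv_cache.append(k, v)        # KV cache stays one continuous buffer

        # Ordering guarantee: the previous chunk's attention must finish before
        # this chunk's attention runs. On a single stream this is implicit; in
        # the dual-stream engine an explicit CUDA event enforces it.
        compute_stream.wait_event(attn_done)

        out = block.attention(q, kv_cache.keys(), kv_cache.values())  # causal, full KV
        attn_done.record(compute_stream)
        outs.append(out)

    return torch.cat(outs, dim=1)
```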

unix1986 pinned this issue on Jan 1, 2025