[QUESTION] About all-reduce overlap #8
Comments
@ZhongYingMatrix Thank you for your attention. Dual-Streams employs two CUDA streams: one designated for All-Reduce operations and the other for Prefill computation. The Prefill input is segmented into multiple chunks based on tokens. The approach is to compute one chunk first, then run its All-Reduce while the next chunk is computed in parallel, iterating through this cycle until completion. This feature was contributed by my colleague @spetrel.
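For readers who want to see the shape of this pipeline, here is a minimal sketch of the two-stream pattern described above. It is not the project's code: `dual_stream_prefill`, `prefill_chunk_kernel`, and the buffer layout are made-up placeholders, and NCCL's `ncclAllReduce` stands in for whatever collective the project actually uses.

```cpp
// Sketch only: all names are placeholders, not the repository's API.
// Chunk i's All-Reduce is enqueued on a second stream behind an event, so it
// overlaps with chunk i+1's computation on the compute stream.
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

// Placeholder for the real per-chunk prefill computation.
__global__ void prefill_chunk_kernel(float* x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;  // dummy work
}

void dual_stream_prefill(ncclComm_t comm, float* hidden,
                         size_t chunk_elems, int num_chunks) {
    cudaStream_t compute_stream, allreduce_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&allreduce_stream);

    std::vector<cudaEvent_t> done(num_chunks);
    for (int i = 0; i < num_chunks; ++i)
        cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);

    dim3 block(256), grid((unsigned)((chunk_elems + block.x - 1) / block.x));
    for (int i = 0; i < num_chunks; ++i) {
        float* chunk = hidden + (size_t)i * chunk_elems;

        // Compute chunk i on the compute stream ...
        prefill_chunk_kernel<<<grid, block, 0, compute_stream>>>(chunk, chunk_elems);
        cudaEventRecord(done[i], compute_stream);

        // ... and All-Reduce it on the other stream once it is ready. The host
        // loop immediately issues chunk i+1's compute, so this collective
        // overlaps with the next chunk's computation.
        cudaStreamWaitEvent(allreduce_stream, done[i], 0);
        ncclAllReduce(chunk, chunk, chunk_elems, ncclFloat, ncclSum,
                      comm, allreduce_stream);
    }

    cudaStreamSynchronize(compute_stream);
    cudaStreamSynchronize(allreduce_stream);
    for (auto& e : done) cudaEventDestroy(e);
    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(allreduce_stream);
}
```

The key point of the pattern is that the host loop never blocks: if the collective were issued on the same stream as the compute kernels, communication and computation would serialize instead of overlapping.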
@unix1986 @spetrel Hi, sorry to bother you. I have a new question regarding the attention computation: since attention involves all of the tokens, how is this handled across different chunks?
We do "chunks" only for prompt encoding computation, the KV cache is "continuous" as one piece. |
@spetrel Is it chunked before or after the QKV projection? I suppose any query chunk should refer to all of the key/value chunks.
Before, at the input of the block layer. Yes, all query chunks refer to the whole KV cache. CUDA events are used to synchronize the chunk attentions, making sure the previous chunk's attention is done before the next chunk's attention starts.
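A minimal sketch of that event-based ordering, assuming consecutive chunk attentions may be issued on different streams (all kernel and variable names below are placeholders): the previous chunk's attention records an event, and the next chunk's stream waits on that event before launching its own attention.

```cpp
// Sketch only, not the project's code. If every chunk ran on a single stream,
// stream order alone would already serialize the attentions.
#include <cuda_runtime.h>
#include <vector>

// Placeholder for the real per-chunk attention: queries of the current chunk
// attend over K/V for tokens [0, kv_len) in the shared, contiguous cache.
__global__ void chunk_attention_kernel(const float* q, const float* k,
                                       const float* v, float* out, int kv_len) {
    // ... real attention math elided ...
}

void ordered_chunk_attention(const float* q, const float* k, const float* v,
                             float* out, const std::vector<int>& chunk_end,
                             cudaStream_t streams[2]) {
    std::vector<cudaEvent_t> done;
    cudaEvent_t prev = nullptr;

    for (size_t i = 0; i < chunk_end.size(); ++i) {
        cudaStream_t s = streams[i % 2];
        if (prev)                        // chunk i waits for chunk i-1
            cudaStreamWaitEvent(s, prev, 0);

        chunk_attention_kernel<<<1, 128, 0, s>>>(q, k, v, out, chunk_end[i]);

        cudaEvent_t e;
        cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
        cudaEventRecord(e, s);           // mark "chunk i attention done"
        done.push_back(e);
        prev = e;
    }

    cudaStreamSynchronize(streams[0]);
    cudaStreamSynchronize(streams[1]);
    for (auto& e : done) cudaEventDestroy(e);
}
```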
Thanks for your wonderful work! I am very interested in your method of "Encode and all-reduce overlap, we named 'dual streams'". Could you provide a general explanation of this technical approach?