Question about supporting other settings of cross-layer KV cache sharing #12
Comments
Hi! Thank you for the question. Our paper only gave a brief introduction to YOCO. If you want more detailed information about YOCO, please refer to the original paper: https://arxiv.org/abs/2405.05254.
You are right. The YOCO paper mentions that the global KV cache is calculated based on the output of the L/2-th layer, which can be seen as the KV cache of the L/2+1-th layer. So the target layer is the 11th instead of the 10th layer in our framework.
Thank you for your fast reply. I understand that the 11th layer should be the target layer. But the starting index in your code is 0, which means the layer indices run from 0 to 21 (for a maximum of 22 layers). So shouldn't the target layer index be 10 (which is the 11th layer)? For a quick verification, YOCO should contain L/2 (which equals 11 here) layers that compute KVs, but in
there are 12 layers that compute KVs (from index 0 to 11). I don't know whether I misunderstand your code or something else. Thank you.
Thank you for your reply.
Sorry for the misunderstanding. Since there are L=22 layers in total, the L/2+1-th layer would be the 12th layer, whose index is 11.
In fact, YOCO contains L/2+1 (which equals 12 here) layers that compute KVs, including L/2 sliding window attention layers and one global KV layer.
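For later readers, here is a minimal sketch of the counting above in plain Python, not the project's actual configuration format (the names `kv_source` and `target_layer` are made up for illustration): with L = 22, layers 0-10 compute their own sliding-window KV, layer 11 computes the global KV cache, and layers 12-21 reuse it, so 12 layers in total compute KVs.

```python
# Hypothetical YOCO-style KV sharing map for L = 22 layers (illustration only,
# not the repository's configuration schema).
L = 22                  # total number of layers
target_layer = L // 2   # index 11, i.e. the (L/2 + 1)-th layer

# kv_source[i] tells layer i which layer's KV cache it reads:
# layers 0..11 compute KV themselves; layers 12..21 reuse layer 11's global KV.
kv_source = {i: (i if i <= target_layer else target_layer) for i in range(L)}

layers_computing_kv = sorted(set(kv_source.values()))
assert len(layers_computing_kv) == L // 2 + 1   # 12 layers compute KVs
assert layers_computing_kv[-1] == 11            # the global-KV (target) layer

print(kv_source)
```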
Hi, I have solved all the configuration problems with your kind help, thanks.
I am trying to use other KV cache-sharing strategies with your project (e.g. YOCO). I noticed you provide a pre-defined configuration, which is
I am a little confused about this configuration. YOCO sets the first L/2 layers (L refers to the total number of layers) as the self-decoder, which generates the KV caches, and the last layer of the self-decoder generates a global KV cache. The remaining L/2 layers are cross-decoders that use the global KV cache to compute attention scores. Thus, I think the configuration for YOCO should be
That is, layer index 10 would be the target layer. I don't know whether there is some misunderstanding on my part, or whether this relates to a reason mentioned in your paper that I haven't noticed; please let me know.