Question about supporting other settings of cross-layer KV cache sharing #12

ChenHong30 opened this issue Nov 19, 2024 · 3 comments
@ChenHong30

Hi, I have solved all the configuration problems with your kind help, thanks.

I am trying to use other KV cache-sharing strategies with your project (e.g., YOCO). I noticed that you provide a pre-defined configuration:

```python
config.sliding_window = 1024   # the window size for the sliding window attention
config.layer_types    = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11" # YOCO config, 's' is for sliding window
```

I am a little confused by this configuration. YOCO sets the first L/2 layers (L refers to the total number of layers) as the self-decoder, which generates KV, and the last layer of the self-decoder produces the global KV cache. The remaining L/2 layers are cross-decoders that use the global KV cache to compute attention scores. Thus, I think the configuration for YOCO should be

```python
config.sliding_window = 1024   # the window size for the sliding window attention
config.layer_types    = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_10_10_10_10_10_10_10_10_10_10_10" # YOCO config, 's' is for sliding window
```

That is, layer index 10 would be the target layer. I am not sure whether I have misunderstood something, or whether this relates to a reason mentioned in your paper that I missed. Please let me know.
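
To make my reading concrete, here is a small sketch of how I interpret the layer_types string. I am assuming (this may not match your actual parser) that each underscore-separated token names the layer whose KV cache that layer uses, and that a trailing 's' marks sliding-window attention:

```python
# Sketch only: my assumed semantics of layer_types, not the project's
# actual parser. Each token names the KV-source layer; 's' = sliding window.

def parse_layer_types(layer_types: str):
    """Return (kv_source_index, uses_sliding_window) for each layer."""
    layers = []
    for token in layer_types.split("_"):
        sliding = token.endswith("s")
        layers.append((int(token.rstrip("s")), sliding))
    return layers

repo_cfg = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11"
my_cfg   = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_10_10_10_10_10_10_10_10_10_10_10"

for name, cfg in [("repo", repo_cfg), ("mine", my_cfg)]:
    parsed = parse_layer_types(cfg)
    # A layer "computes its own KV" when its token points at its own index.
    kv_layers = [i for i, (src, _) in enumerate(parsed) if src == i]
    print(f"{name}: KV-computing layers = {kv_layers}")

# repo -> indices 0..11 compute KV (12 layers, target index 11)
# mine -> indices 0..10 compute KV (11 layers, target index 10)
```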

@pigdogbaby

Hi! Thank you for the question.

Our paper only gives a brief introduction to YOCO. For more details, please refer to the original YOCO paper: https://arxiv.org/abs/2405.05254.

> YOCO sets the first L/2 layers (L refers to the total number of layers) as the self-decoder, which generates KV, and the last layer of the self-decoder produces the global KV cache. The remaining L/2 layers are cross-decoders that use the global KV cache to compute attention scores.

You are right. The YOCO paper mentions that the global KV cache is calculated from the output of the (L/2)-th layer, which can be seen as the KV cache of the (L/2+1)-th layer. So the target layer is the 11th instead of the 10th layer in our framework.

@ChenHong30

Thank you for your fast reply. I understand that the 11th layer should be the target layer. But the indexing in your code starts at 0, which means the layer indices run from 0 to 21 (for 22 layers in total). Thus, shouldn't the target layer index be 10 (which is the 11th layer)?

As a quick check, YOCO should contain L/2 (equal to 11 here) layers that compute KVs, but in

```python
config.sliding_window = 1024   # the window size for the sliding window attention
config.layer_types    = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11" # YOCO config, 's' is for sliding window
```

there are 12 layers that compute KVs (indices 0 to 11). I am not sure whether I have misunderstood your code or something else. Thank you.

@pigdogbaby

Thank you for your reply.

> The YOCO paper mentions that the global KV cache is calculated from the output of the (L/2)-th layer, which can be seen as the KV cache of the (L/2+1)-th layer. So the target layer is the 11th instead of the 10th layer in our framework.

Sorry for the confusing wording. Since there are L = 22 layers in total, the (L/2+1)-th layer is the 12th layer, whose index is 11.

> YOCO should contain L/2 (equal to 11 here) layers that compute KVs.

In fact, YOCO contains L/2 + 1 (equal to 12 here) layers that compute KVs: L/2 sliding-window attention layers plus one global KV layer.
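
For concreteness, here is a quick count over the config string. This is just a sketch of the counting logic under the reading that each underscore-separated token names the layer whose KV cache that layer uses ('s' marking sliding-window attention); it is not the framework's actual parser:

```python
# Sketch: count KV-computing layers in the YOCO config string.
# Assumed semantics: each token names the KV-source layer; 's' = sliding window.
layer_types = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11"
tokens = layer_types.split("_")

L = len(tokens)
sliding = [i for i, t in enumerate(tokens) if t.endswith("s")]
computes_kv = [i for i, t in enumerate(tokens) if int(t.rstrip("s")) == i]

print(L)                 # 22 layers in total
print(len(sliding))      # 11 -> L/2 sliding-window layers (indices 0..10)
print(len(computes_kv))  # 12 -> L/2 + 1 KV-computing layers (indices 0..11)
print(computes_kv[-1])   # 11 -> index of the global-KV (target) layer, i.e. the 12th layer
```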
