Question about supporting other settings of cross-layer KV cache sharing #12

ChenHong30 opened this issue Nov 19, 2024 · 3 comments
@ChenHong30

Hi, I have solved all the configuration problems with your kind help, thanks.

I am trying to use other KV cache-sharing strategies with your project (e.g., YOCO). I noticed that you provide a pre-defined configuration:

```python
config.sliding_window = 1024   # the window size for the sliding window attention
config.layer_types    = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11" # YOCO config, 's' is for sliding window
```

I am a little confused by this configuration. YOCO sets the first L/2 layers (L refers to the total number of layers) as the self-decoder, which generates KV, and the last layer of the self-decoder produces the global KV cache. The remaining L/2 layers are cross-decoders that use the global KV cache to compute attention scores. Thus, I think the configuration for YOCO should be

```python
config.sliding_window = 1024   # the window size for the sliding window attention
config.layer_types    = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_10_10_10_10_10_10_10_10_10_10_10" # YOCO config, 's' is for sliding window
```

That is, layer index 10 would be the target layer. I am not sure whether I have misunderstood something, or whether this relates to a reason mentioned in your paper that I missed. Please let me know.
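
To make my reading concrete, here is a small sketch of how I interpret the layer_types string. I am assuming (this may not match your actual parser) that each underscore-separated token names the layer whose KV cache that layer uses, and that a trailing 's' marks sliding-window attention:

```python
# Sketch only: my assumed semantics of layer_types, not the project's
# actual parser. Each token names the KV-source layer; 's' = sliding window.

def parse_layer_types(layer_types: str):
    """Return (kv_source_index, uses_sliding_window) for each layer."""
    layers = []
    for token in layer_types.split("_"):
        sliding = token.endswith("s")
        layers.append((int(token.rstrip("s")), sliding))
    return layers

repo_cfg = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11"
my_cfg   = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_10_10_10_10_10_10_10_10_10_10_10"

for name, cfg in [("repo", repo_cfg), ("mine", my_cfg)]:
    parsed = parse_layer_types(cfg)
    # A layer "computes its own KV" when its token points at its own index.
    kv_layers = [i for i, (src, _) in enumerate(parsed) if src == i]
    print(f"{name}: KV-computing layers = {kv_layers}")

# repo -> indices 0..11 compute KV (12 layers, target index 11)
# mine -> indices 0..10 compute KV (11 layers, target index 10)
```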

@pigdogbaby

Hi! Thank you for the question.

Our paper only gives a brief introduction to YOCO. For more details, please refer to the original YOCO paper: https://arxiv.org/abs/2405.05254.

> YOCO sets the first L/2 layers (L refers to the total number of layers) as the self-decoder, which generates KV, and the last layer of the self-decoder produces the global KV cache. The remaining L/2 layers are cross-decoders that use the global KV cache to compute attention scores.

You are right. The YOCO paper mentions that the global KV cache is calculated from the output of the (L/2)-th layer, which can be seen as the KV cache of the (L/2+1)-th layer. So the target layer is the 11th instead of the 10th layer in our framework.

@ChenHong30

Thank you for your fast reply. I understand that the 11th layer should be the target layer. But the indexing in your code starts at 0, which means the layer indices run from 0 to 21 (for 22 layers in total). Thus, shouldn't the target layer index be 10 (which is the 11th layer)?

As a quick check, YOCO should contain L/2 (equal to 11 here) layers that compute KVs, but in

```python
config.sliding_window = 1024   # the window size for the sliding window attention
config.layer_types    = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11" # YOCO config, 's' is for sliding window
```

there are 12 layers that compute KVs (indices 0 to 11). I am not sure whether I have misunderstood your code or something else. Thank you.

@pigdogbaby

Thank you for your reply.

> The YOCO paper mentions that the global KV cache is calculated from the output of the (L/2)-th layer, which can be seen as the KV cache of the (L/2+1)-th layer. So the target layer is the 11th instead of the 10th layer in our framework.

Sorry for the confusing wording. Since there are L = 22 layers in total, the (L/2+1)-th layer is the 12th layer, whose index is 11.

> YOCO should contain L/2 (equal to 11 here) layers that compute KVs.

In fact, YOCO contains L/2 + 1 (equal to 12 here) layers that compute KVs: L/2 sliding-window attention layers plus one global KV layer.
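
For concreteness, here is a quick count over the config string. This is just a sketch of the counting logic under the reading that each underscore-separated token names the layer whose KV cache that layer uses ('s' marking sliding-window attention); it is not the framework's actual parser:

```python
# Sketch: count KV-computing layers in the YOCO config string.
# Assumed semantics: each token names the KV-source layer; 's' = sliding window.
layer_types = "0s_1s_2s_3s_4s_5s_6s_7s_8s_9s_10s_11_11_11_11_11_11_11_11_11_11_11"
tokens = layer_types.split("_")

L = len(tokens)
sliding = [i for i, t in enumerate(tokens) if t.endswith("s")]
computes_kv = [i for i, t in enumerate(tokens) if int(t.rstrip("s")) == i]

print(L)                 # 22 layers in total
print(len(sliding))      # 11 -> L/2 sliding-window layers (indices 0..10)
print(len(computes_kv))  # 12 -> L/2 + 1 KV-computing layers (indices 0..11)
print(computes_kv[-1])   # 11 -> index of the global-KV (target) layer, i.e. the 12th layer
```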
