The original definition of `KeyValueCache` is referred to here. The difference between `dynamic_batching.KeyValueCache` and `KeyValueCache` is that the sequence length, the value of `start_pos`, and the length of the key and value caches may differ across batches. `dynamic_batching.KeyValueCache` uses `seqstarts` and `kvstarts` to record the beginning position of each batch's sequence, and `cachestarts` provides the ability to map batches to different locations in the cache.
In the description below:

- $L$ is the number of attention layers (`num_layer`)
- $B$ is the batch size
- $MaxT$ is the maximum number of tokens `cache` could hold (i.e., it could be over 10,000,000 in some cases)
- $MaxP$ is the maximum number of pages per sequence in Paged Attention mode
- $H$ is `num_heads` of the transformer
- $Dh$ is `dims_per_head` or `head_dim` of the transformer
NOTE: `cache` and `scale` are used as in-out tensors, so it is recommended to treat them as model inputs and let the user set their shapes (mainly because $MaxT$ needs to be configured separately).
`num_layer`: Number of attention layers.
`layer_idx`: Attention layer index for `cache` and `scale`.
`quant_bit`: Quantize bit for cache compression. For example, 8 means int8 compression; 0 means disabled.
`quant_group`: Quantize scale shared group size, i.e., the number of elements that share one quantization scale.
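For intuition, here is a minimal sketch of what int8 compression with group-shared scales could look like; `quantize_int8` and `dequantize_int8` are hypothetical helpers for illustration, not part of this operator's API:

```python
import torch

def quantize_int8(x: torch.Tensor, quant_group: int = 8):
    # Split the last axis (Dh) into groups of `quant_group` elements;
    # each group shares one scale, so the scale has shape (..., Dh/quant_group).
    g = x.reshape(*x.shape[:-1], x.shape[-1] // quant_group, quant_group)
    scale = g.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.round(g / scale.clamp(min=1e-8)).to(torch.int8)
    return q.reshape(x.shape), scale.squeeze(-1)

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor, quant_group: int = 8):
    # Inverse of the sketch above: rescale each group back to float.
    g = q.reshape(*q.shape[:-1], q.shape[-1] // quant_group, quant_group)
    return (g.float() * scale.unsqueeze(-1)).reshape(q.shape)
```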
`num_repeat`: For Grouped-Query Attention. Repeats key and value `num_repeat` times on axis `num_heads` to construct an input compatible with non-grouped MultiHeadAttention.
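A one-line sketch of that repeat, assuming key/value tensors laid out as (tokens, $H$, $Dh$):

```python
import torch

def repeat_kv(kv: torch.Tensor, num_repeat: int) -> torch.Tensor:
    # kv: (tokens, H, Dh) -> (tokens, H * num_repeat, Dh), so each grouped
    # KV head is shared by `num_repeat` query heads of a plain MultiHeadAttention.
    return torch.repeat_interleave(kv, num_repeat, dim=1)
```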
`cache_mode`: Defines the cache indexing mode. Default is zero.

- When `cache_mode` is `0`, `cache` is indexed in offset mode. The shape of `cachestarts` is $(B)$. For each batch $b$, `cachestarts[b]` maps the batch to its beginning index in the $MaxT$ dimension of `cache` and `scale`. Note that `cachestarts[b+1]-cachestarts[b]` cannot be used to calculate the cache length of batch $b$.
- When `cache_mode` is `1`, `cache` is indexed in page table mode, which is called Paged Attention. The shape of `cachestarts` is $(B, MaxP)$. For each batch $b$, `cachestarts[b, :]` contains the beginning index of each page in the $MaxT$ dimension of `cache` and `scale`. A sketch of the lookup logic follows the example below.

Example for `batch = 2, page_size = 256`:

$$cachestarts=[[0,256,\cdots],[1024,2048,\cdots]]$$
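Here is a sketch of how a token's row in the $MaxT$ dimension could be resolved under both modes; `cache_row` is a hypothetical helper, not part of the operator:

```python
def cache_row(cachestarts, b, pos, cache_mode, page_size=256):
    # Returns the row in the MaxT dimension of `cache`/`scale` that holds
    # token `pos` of batch `b`.
    if cache_mode == 0:
        # Offset mode: each batch owns one contiguous region.
        return cachestarts[b] + pos
    # Page table mode (Paged Attention): find the page, then offset within it.
    page_start = cachestarts[b][pos // page_size]
    return page_start + pos % page_size

# With the cachestarts example above: token 300 of batch 1 lives on that
# batch's second page, i.e. row 2048 + 44 = 2092.
```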
`cache_layout`: Defines the data layout of `cache` and `scale`. Default is zero.

Meaning of the numbers:

- `0`: $cache(MaxT,L,2,H,Dh)$ and $scale(MaxT,L,2,H,Dh/quant\_group)$
- `1`: $cache(L,MaxT,2,H,Dh)$ and $scale(L,MaxT,2,H,Dh/quant\_group)$
- `2`: $cache(L,2,MaxT,H,Dh)$ and $scale(L,2,MaxT,H,Dh/quant\_group)$
- `3`: $cache(L,2,H,MaxT,Dh)$ and $scale(L,2,H,MaxT,Dh/quant\_group)$
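Following the NOTE above, the user would allocate `cache` and `scale` as model inputs; a sketch for `cache_layout = 0` with int8 quantization, using made-up example sizes:

```python
import torch

L, MaxT, H, Dh, quant_group = 32, 8192, 8, 128, 8

# cache_layout = 0: (MaxT, L, 2, H, Dh), where axis 2 holds key (0) / value (1).
cache = torch.empty(MaxT, L, 2, H, Dh, dtype=torch.int8)
scale = torch.empty(MaxT, L, 2, H, Dh // quant_group, dtype=torch.float16)
```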
`page_size`: Page size in Paged Attention (used when `cache_mode` is `1`).
Shape of `current_key` and `current_value`: $(seqstarts[B], H, Dh)$.

Shape of `seqstarts`: $(B+1)$.
`seqstarts[:B]` contains the position of the first token in `current_key` and `current_value` for each batch, and `seqstarts[B]` contains the total length of `current_key` and `current_value`. Note that `seqstarts[b+1]-seqstarts[b]` gives the sequence length of batch $b$.
Shape of `kvstarts`: $(B+1)$.
`kvstarts[:B]` contains the position of the first token in `key` and `value` for each batch, and `kvstarts[B]` contains the total length of `key` and `value`. Note that `kvstarts[b+1]-kvstarts[b]` gives the key and value length of batch $b$.
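For example, both arrays are exclusive prefix sums of the per-batch lengths, which also yields the two maximum lengths described below:

```python
import torch

seqlens = torch.tensor([5, 3, 8])    # new tokens per batch this step
kvlens = torch.tensor([12, 3, 20])   # cached + new tokens per batch

seqstarts = torch.zeros(len(seqlens) + 1, dtype=torch.int64)
seqstarts[1:] = torch.cumsum(seqlens, dim=0)   # [0, 5, 8, 16]
kvstarts = torch.zeros(len(kvlens) + 1, dtype=torch.int64)
kvstarts[1:] = torch.cumsum(kvlens, dim=0)     # [0, 12, 15, 35]

max_seqlen = (seqstarts[1:] - seqstarts[:-1]).max()  # 8
max_kvlen = (kvstarts[1:] - kvstarts[:-1]).max()     # 20
```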
Shape of `cachestarts`: $(B)$ when `cache_mode` is `0`, or $(B, MaxP)$ when `cache_mode` is `1`.
Indexes cache positions in `cache` and `scale`. Behavior is determined by `cache_mode`.
Shape of `start_pos`: $(B)$.
Sequence position at which `current_key` and `current_value` begin to be stored for each batch.
Shape of `max_seqlen`: $(1)$.
Maximum sequence length of `current_key` and `current_value`, equal to `max(seqstarts[1:]-seqstarts[:B])`. Used for parallel computing.
Shape of `max_kvlen`: $(1)$.

Maximum sequence length of `key` and `value`, equal to `max(kvstarts[1:]-kvstarts[:B])`. Used for parallel computing.
Shape of `cache`: determined by `cache_layout`.

Contains the key and value caches of the attention layer. When `cache_layout` is `0`, subspace $(:,layer\_idx,0)$ contains the key cache and subspace $(:,layer\_idx,1)$ contains the value cache. Data in this tensor will be modified.
Shape of `scale`: determined by `cache_layout`.

Contains the key and value cache quantization scales of the attention layer. When `cache_layout` is `0`, subspace $(:,layer\_idx,0)$ contains the key scales and subspace $(:,layer\_idx,1)$ contains the value scales. Must appear when `quant_bit` is not zero. Data in this tensor will be modified.
Shape of `key`: $(kvstarts[B], H \cdot num\_repeat, Dh)$.

Packed current key and all past keys. If `quant_bit` is not `0`, it should be decompressed.
Shape of `value`: $(kvstarts[B], H \cdot num\_repeat, Dh)$.

Packed current value and all past values. If `quant_bit` is not `0`, it should be decompressed.
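Putting the pieces together, here is a simplified sketch of the operator's semantics for a single layer, assuming `cache_mode = 0`, `cache_layout = 0`, `num_repeat = 1`, and `quant_bit = 0`; this is illustrative only, not the actual implementation:

```python
import torch

def key_value_cache_ref(current_key, current_value, seqstarts, kvstarts,
                        cachestarts, start_pos, cache, layer_idx=0):
    # current_key/current_value: (seqstarts[B], H, Dh)
    # cache: (MaxT, L, 2, H, Dh); offset mode, no quantization.
    # Assumes kvstarts[b+1]-kvstarts[b] == start_pos[b] + seqlen of batch b.
    B = len(start_pos)
    H, Dh = current_key.shape[1], current_key.shape[2]
    key = torch.empty(int(kvstarts[B]), H, Dh, dtype=current_key.dtype)
    value = torch.empty_like(key)
    for b in range(B):
        s0, s1 = int(seqstarts[b]), int(seqstarts[b + 1])
        # Store this step's tokens starting at start_pos[b] inside batch b's region.
        rows = int(cachestarts[b]) + int(start_pos[b]) + torch.arange(s1 - s0)
        cache[rows, layer_idx, 0] = current_key[s0:s1]
        cache[rows, layer_idx, 1] = current_value[s0:s1]
        # Gather the packed past + current keys/values for batch b.
        k0, k1 = int(kvstarts[b]), int(kvstarts[b + 1])
        all_rows = int(cachestarts[b]) + torch.arange(k1 - k0)
        key[k0:k1] = cache[all_rows, layer_idx, 0]
        value[k0:k1] = cache[all_rows, layer_idx, 1]
    return key, value
```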