-
Notifications
You must be signed in to change notification settings - Fork 203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Token rotation #987
Token rotation #987
Conversation
613f8bf
to
9d4f3bb
Compare
5b61636
to
0d60110
Compare
0d60110
to
c6ef2c0
Compare
69d905b
to
83f7045
Compare
src/cpp/src/cache_eviction.cpp
Outdated
block_rotation_data.cosines.push_back( | ||
m_rope_cos_lut[current_rotation_delta_in_blocks * m_block_size]); | ||
block_rotation_data.sines.push_back( | ||
m_rope_sin_lut[current_rotation_delta_in_blocks * m_block_size]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it better to just pass m_rope_cos_lut
and m_rope_sin_lut
to PagedAttention op alongside with required indices? Having this packing on the host side doesn't provide any flexibility in how you can adjust RoPE -- code here essentially implements Gather. Potential RoPE variants that will be implemented in the future require changes in how we prepare sin/cos or how we are applying the coefficients in PA operation itself. But doing coefficient packing here gives potential data duplication along multiple LLM layers and we will need this packing each time when re-rotation applied, so it looks better to locate this code right in PA, gather values locally for each KV cache block without repacking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be done, I suppose, with a minor question of whether in this case there should really be separate sin
and cos
inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not "potential data duplication" -- chunks of cos/sin coefficients are explicitly duplicated here for all tokens in each block. When I mentioned "potential" I meant duplication of rotation coefficients between different attention layers in the model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, we are working with double
data type here, so data duplication gives 2x memory consumption impact. It will not contribute to memory peak as we are calling this function per layer, but still using double
here is not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the base data type to float.
rotation_multipliers_tensor_data[position_offset + embedding_pair_idx] = | ||
rotation_multipliers_cos[tok_idx][embedding_pair_idx]; | ||
rotation_multipliers_tensor_data[position_offset + embedding_pair_idx + head_size / 2] = | ||
rotation_multipliers_sin[tok_idx][embedding_pair_idx]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are repacking sin/cos coefficients again after initial copying/duplication from lut tables. Repacking is required because we expect this particular layout in PA operation -- nothing else force us to have this repacking here. But is it a really convenient layout if we need repacking here? In the code of cache eviction on CPU side, for example here https://github.com/openvinotoolkit/openvino/pull/27088/files#diff-9ec1d0710e07c208e40402e8342b86075f2bdc06a900f1a8cc3ec28385952753R17-R18 we are using sin and cos by two unrelated pointers, so packing sin and cos close to each other for each token doesn't look as requirement. If we take sin/cos in their original layout and separately for sin and cos in PA operation, repacking here wouldn't be needed and it will still be friendly for CPU implementation. Correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By putting the potential repacking functionality onto the GenAI lib level I thought to circumvent the potential future issue of different layouts required by devices for efficiency. Already, it seems, the GPU requires different key cache layout from the CPU case, see:
openvino.genai/src/cpp/src/device_config.hpp
Line 118 in e2fa0d0
if (m_device.find("GPU") != std::string::npos) { |
Right now the difference is only in the order of the last two dims (block_size, embedding_size). It could be made non-existent, though, if we decide to forego the per-token rotation coefficient passing and only pass per-block coefficients, in which case the LUT shape should be the same both for CPU and GPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Together with per-token duplication it requires a lot more storage and memory bandwidth to prepare and use these coefficients in comparison to just indices. For example, if we have num_kv_heads == 1 then amount of memory spent for the coefficients is the same as rotated kv cache blocks themselves, and if KV cache is compressed, the size of coefficients if they are still stored as f32 will be at least 2x bigger. In this part of the code we are just repacking that big amount of data. So we are loading/storing this amount 4 times: first when duplicating them in another place in the code, then in this place we spend another 2 times when loading and storing them in a different layout, and then in PA when applying the coefficients. In the first and the second passes we are really dealing with double
type that doubles the bandwidth requirement, BTW.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the LUT slower with suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slower?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
smaller*
83f7045
to
533585a
Compare
if (current_rotation_delta_in_blocks != 0) { | ||
BlockRotationData block_rotation_data; | ||
block_rotation_data.logical_block_idx = logical_block_idx - current_rotation_delta_in_blocks; | ||
block_rotation_data.rotation_delta = current_rotation_delta_in_blocks * m_block_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the number of different rotation_delta
s is bounded by the maximum number of evicted blocks. And each rotation_delta is multiple of m_block_size. That means that we don't need to pre-calculate and store other values in sin/cos LUT's and gives a significantly lower bound for their size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented the idea, adjusting for reality (recent_size
also counts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to revert this back to maximum sequence length because in the latest iteration, in effort to align with the Python POC, I only evict once the prompt is pre-filled, which means that the occupancy of cache by single sequence is still bound only by max sequence length.
} | ||
|
||
// TODO (vshampor): LUT size equal to max cache size in tokens | ||
// is overkill - find a way to pass the max sequence length defined by pipeline instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even max sequence length is overkill. Mentioned in another comment: the number of required different rows in LUT can be divided by the block size. It makes the tables times smaller and provides a chipper extending way for the tables on-the-fly when the maximum relative distance in re-rotation exceeds the currently allocated value. Or I just have a wrong impression about how it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AlexKoff88 suggests to keep the possibility of per-token rotation for the future, hence the duplication and LUT granularity right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having an index per token still give you the possibility to control it with a token granularity -- and this could be kept in PA semantics. But you can organize LUT without unused gaps already now and divide all indices by block size -- it is a part of GenAI and doesn't affect how OV side is implemented. Are there any other dependencies or future work activities that can utilize per-token granularity without changing the code in this PR? Can we externally change the algorithm for cache eviction having the same GenAI released package?
Concerning per-block or per-token granularity, you can utilize well-defined broadcast semantics for PA inputs that defines rotation deltas. For example, [num_rot_blocks, block_size] tensor shape will have per-token granularity, and [num_rot_blocks, 1] will assume broadcast of the same rotation index to all tokens in blocks providing per-block granularity. This is a usual thing for element-wise operations in OV. In this way you give economical representation for per-block granularity and in the same time provide necessary flexibility to switch to per-token granularity if it is required in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a dimension to the deltas tensor, with broadcast semantics hand-coded.
src/cpp/src/cache_eviction.cpp
Outdated
m_rope_sin_lut.resize(max_position_angle_multiplier); | ||
m_rope_cos_lut.resize(max_position_angle_multiplier); | ||
|
||
for (size_t i = 0; i < max_position_angle_multiplier; i++) { | ||
m_rope_sin_lut[i].reserve(num_freqs); | ||
m_rope_cos_lut[i].reserve(num_freqs); | ||
for (size_t j = 0; j < num_freqs; j++) { | ||
double exponent = -static_cast<double>(2 * j) / kv_head_size; | ||
double base_angle = std::pow(rope_theta, exponent); | ||
m_rope_sin_lut[i].push_back( | ||
-std::sin(i * base_angle)); // minus since we will be rotating by an inverse angle | ||
m_rope_cos_lut[i].push_back(std::cos(i * base_angle)); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on other comments about maximum number of different re-rotation values: amount of data can be reduced by block_size
time and limited by the maximum number of evicted blocks.
And saving double
values doesn't still make sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the data type to float, want to keep per-token granularity even if it means duplication of values right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can external user utilize the per-token granularity without changing the code of the cache eviction?
533585a
to
f073a70
Compare
f073a70
to
2173d07
Compare
2173d07
to
ea64dc5
Compare
baec186
to
7094461
Compare
7094461
to
bea9823
Compare
@@ -23,7 +23,7 @@ namespace ov::genai { | |||
class CacheEvictionConfig { | |||
public: | |||
CacheEvictionConfig() {}; | |||
CacheEvictionConfig(size_t start_size, size_t recent_size, size_t max_cache_size, AggregationMode aggregation_mode_) : aggregation_mode(aggregation_mode_), m_start_size(start_size), m_recent_size(recent_size), m_max_cache_size(max_cache_size) { | |||
CacheEvictionConfig(size_t start_size, size_t recent_size, size_t max_cache_size, AggregationMode aggregation_mode_, bool apply_rotation_ = false) : aggregation_mode(aggregation_mode_), apply_rotation(apply_rotation_), m_start_size(start_size), m_recent_size(recent_size), m_max_cache_size(max_cache_size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we enable rotation automatically based on model topology without mandatory user-level control? I, as a user, have no idea when I should enable it, and will leave it as-is (disabled) or enable it every time. What is the recommendation for the user? Is true
a better default value because majority of modern architectures uses RoPE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't achieve expected accuracy results in currently available test cases right now with rotation enabled. Will probably set the default to true
once we do get accuracy, but the user's model may have a differing RoPE scheme from what we currently support. I'll add some words about that to the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it is OK to merge it to master to continue development there, but not in the currently open release branch.
// `rotate_half` function in HF transformers. It can be shown that this form still preserves the relative | ||
// positioning property from the RoFormer article. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how I should interpret this part. Does it mean that LUT computed here suits both interleaved and not-interleaved RoPE? Or only for one of them, but the difference can be ignored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LUT here suits only the non-interleaved case. It can be shown...
part is trivia for the curious and relates to the general desirable property of RoPE-like positional encodings.
1e0bfc9
to
510c29e
Compare
Ticket: 153791
To be merged after:
openvinotoolkit/openvino#27088