Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Token rotation #987

Merged
merged 13 commits into from
Jan 24, 2025
Merged

Token rotation #987

merged 13 commits into from
Jan 24, 2025

Conversation

vshampor
Copy link
Contributor

@vshampor vshampor commented Oct 16, 2024

Ticket: 153791

To be merged after:
openvinotoolkit/openvino#27088

@github-actions github-actions bot added category: continuous batching Continuous batching category: sampling Sampling / Decoding algorithms labels Oct 17, 2024
@ilya-lavrenov ilya-lavrenov added do_not_merge do_not_review and removed category: sampling Sampling / Decoding algorithms labels Oct 18, 2024
@github-actions github-actions bot added the category: sampling Sampling / Decoding algorithms label Oct 26, 2024
@github-actions github-actions bot added the category: cmake / build Cmake scripts label Nov 4, 2024
@vshampor vshampor force-pushed the token_rotation branch 3 times, most recently from 5b61636 to 0d60110 Compare November 12, 2024 16:42
@vshampor vshampor marked this pull request as ready for review November 12, 2024 16:42
@github-actions github-actions bot added no-match-files and removed category: sampling Sampling / Decoding algorithms labels Nov 20, 2024
@AlexKoff88 AlexKoff88 requested a review from l-bat November 27, 2024 11:09
@github-actions github-actions bot added the category: whisper Whisper pipeline label Dec 2, 2024
Comment on lines 337 to 340
block_rotation_data.cosines.push_back(
m_rope_cos_lut[current_rotation_delta_in_blocks * m_block_size]);
block_rotation_data.sines.push_back(
m_rope_sin_lut[current_rotation_delta_in_blocks * m_block_size]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it better to just pass m_rope_cos_lut and m_rope_sin_lut to PagedAttention op alongside with required indices? Having this packing on the host side doesn't provide any flexibility in how you can adjust RoPE -- code here essentially implements Gather. Potential RoPE variants that will be implemented in the future require changes in how we prepare sin/cos or how we are applying the coefficients in PA operation itself. But doing coefficient packing here gives potential data duplication along multiple LLM layers and we will need this packing each time when re-rotation applied, so it looks better to locate this code right in PA, gather values locally for each KV cache block without repacking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be done, I suppose, with a minor question of whether in this case there should really be separate sin and cos inputs.

Copy link
Collaborator

@slyalin slyalin Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not "potential data duplication" -- chunks of cos/sin coefficients are explicitly duplicated here for all tokens in each block. When I mentioned "potential" I meant duplication of rotation coefficients between different attention layers in the model.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, we are working with double data type here, so data duplication gives 2x memory consumption impact. It will not contribute to memory peak as we are calling this function per layer, but still using double here is not necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the base data type to float.

Comment on lines 480 to 483
rotation_multipliers_tensor_data[position_offset + embedding_pair_idx] =
rotation_multipliers_cos[tok_idx][embedding_pair_idx];
rotation_multipliers_tensor_data[position_offset + embedding_pair_idx + head_size / 2] =
rotation_multipliers_sin[tok_idx][embedding_pair_idx];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are repacking sin/cos coefficients again after initial copying/duplication from lut tables. Repacking is required because we expect this particular layout in PA operation -- nothing else force us to have this repacking here. But is it a really convenient layout if we need repacking here? In the code of cache eviction on CPU side, for example here https://github.com/openvinotoolkit/openvino/pull/27088/files#diff-9ec1d0710e07c208e40402e8342b86075f2bdc06a900f1a8cc3ec28385952753R17-R18 we are using sin and cos by two unrelated pointers, so packing sin and cos close to each other for each token doesn't look as requirement. If we take sin/cos in their original layout and separately for sin and cos in PA operation, repacking here wouldn't be needed and it will still be friendly for CPU implementation. Correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By putting the potential repacking functionality onto the GenAI lib level I thought to circumvent the potential future issue of different layouts required by devices for efficiency. Already, it seems, the GPU requires different key cache layout from the CPU case, see:

if (m_device.find("GPU") != std::string::npos) {

Right now the difference is only in the order of the last two dims (block_size, embedding_size). It could be made non-existent, though, if we decide to forego the per-token rotation coefficient passing and only pass per-block coefficients, in which case the LUT shape should be the same both for CPU and GPU.

Copy link
Collaborator

@slyalin slyalin Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Together with per-token duplication it requires a lot more storage and memory bandwidth to prepare and use these coefficients in comparison to just indices. For example, if we have num_kv_heads == 1 then amount of memory spent for the coefficients is the same as rotated kv cache blocks themselves, and if KV cache is compressed, the size of coefficients if they are still stored as f32 will be at least 2x bigger. In this part of the code we are just repacking that big amount of data. So we are loading/storing this amount 4 times: first when duplicating them in another place in the code, then in this place we spend another 2 times when loading and storing them in a different layout, and then in PA when applying the coefficients. In the first and the second passes we are really dealing with double type that doubles the bandwidth requirement, BTW.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the LUT slower with suggestions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slower?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smaller*

if (current_rotation_delta_in_blocks != 0) {
BlockRotationData block_rotation_data;
block_rotation_data.logical_block_idx = logical_block_idx - current_rotation_delta_in_blocks;
block_rotation_data.rotation_delta = current_rotation_delta_in_blocks * m_block_size;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the number of different rotation_deltas is bounded by the maximum number of evicted blocks. And each rotation_delta is multiple of m_block_size. That means that we don't need to pre-calculate and store other values in sin/cos LUT's and gives a significantly lower bound for their size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented the idea, adjusting for reality (recent_size also counts)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to revert this back to maximum sequence length because in the latest iteration, in effort to align with the Python POC, I only evict once the prompt is pre-filled, which means that the occupancy of cache by single sequence is still bound only by max sequence length.

}

// TODO (vshampor): LUT size equal to max cache size in tokens
// is overkill - find a way to pass the max sequence length defined by pipeline instead
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even max sequence length is overkill. Mentioned in another comment: the number of required different rows in LUT can be divided by the block size. It makes the tables times smaller and provides a chipper extending way for the tables on-the-fly when the maximum relative distance in re-rotation exceeds the currently allocated value. Or I just have a wrong impression about how it works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexKoff88 suggests to keep the possibility of per-token rotation for the future, hence the duplication and LUT granularity right now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an index per token still give you the possibility to control it with a token granularity -- and this could be kept in PA semantics. But you can organize LUT without unused gaps already now and divide all indices by block size -- it is a part of GenAI and doesn't affect how OV side is implemented. Are there any other dependencies or future work activities that can utilize per-token granularity without changing the code in this PR? Can we externally change the algorithm for cache eviction having the same GenAI released package?

Concerning per-block or per-token granularity, you can utilize well-defined broadcast semantics for PA inputs that defines rotation deltas. For example, [num_rot_blocks, block_size] tensor shape will have per-token granularity, and [num_rot_blocks, 1] will assume broadcast of the same rotation index to all tokens in blocks providing per-block granularity. This is a usual thing for element-wise operations in OV. In this way you give economical representation for per-block granularity and in the same time provide necessary flexibility to switch to per-token granularity if it is required in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a dimension to the deltas tensor, with broadcast semantics hand-coded.

Comment on lines 288 to 300
m_rope_sin_lut.resize(max_position_angle_multiplier);
m_rope_cos_lut.resize(max_position_angle_multiplier);

for (size_t i = 0; i < max_position_angle_multiplier; i++) {
m_rope_sin_lut[i].reserve(num_freqs);
m_rope_cos_lut[i].reserve(num_freqs);
for (size_t j = 0; j < num_freqs; j++) {
double exponent = -static_cast<double>(2 * j) / kv_head_size;
double base_angle = std::pow(rope_theta, exponent);
m_rope_sin_lut[i].push_back(
-std::sin(i * base_angle)); // minus since we will be rotating by an inverse angle
m_rope_cos_lut[i].push_back(std::cos(i * base_angle));
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on other comments about maximum number of different re-rotation values: amount of data can be reduced by block_size time and limited by the maximum number of evicted blocks.

And saving double values doesn't still make sense to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the data type to float, want to keep per-token granularity even if it means duplication of values right now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can external user utilize the per-token granularity without changing the code of the cache eviction?

@github-actions github-actions bot added category: Python API Python API for GenAI category: GenAI C++ API Changes in GenAI C++ public headers labels Dec 20, 2024
@vshampor vshampor force-pushed the token_rotation branch 2 times, most recently from baec186 to 7094461 Compare January 21, 2025 12:09
@github-actions github-actions bot removed the category: tokenizers Tokenizer class or submodule update label Jan 21, 2025
@vshampor vshampor enabled auto-merge January 21, 2025 13:36
@@ -23,7 +23,7 @@ namespace ov::genai {
class CacheEvictionConfig {
public:
CacheEvictionConfig() {};
CacheEvictionConfig(size_t start_size, size_t recent_size, size_t max_cache_size, AggregationMode aggregation_mode_) : aggregation_mode(aggregation_mode_), m_start_size(start_size), m_recent_size(recent_size), m_max_cache_size(max_cache_size) {
CacheEvictionConfig(size_t start_size, size_t recent_size, size_t max_cache_size, AggregationMode aggregation_mode_, bool apply_rotation_ = false) : aggregation_mode(aggregation_mode_), apply_rotation(apply_rotation_), m_start_size(start_size), m_recent_size(recent_size), m_max_cache_size(max_cache_size) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we enable rotation automatically based on model topology without mandatory user-level control? I, as a user, have no idea when I should enable it, and will leave it as-is (disabled) or enable it every time. What is the recommendation for the user? Is true a better default value because majority of modern architectures uses RoPE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't achieve expected accuracy results in currently available test cases right now with rotation enabled. Will probably set the default to true once we do get accuracy, but the user's model may have a differing RoPE scheme from what we currently support. I'll add some words about that to the docstring.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it is OK to merge it to master to continue development there, but not in the currently open release branch.

Comment on lines +283 to +284
// `rotate_half` function in HF transformers. It can be shown that this form still preserves the relative
// positioning property from the RoFormer article.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how I should interpret this part. Does it mean that LUT computed here suits both interleaved and not-interleaved RoPE? Or only for one of them, but the difference can be ignored?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LUT here suits only the non-interleaved case. It can be shown... part is trivia for the curious and relates to the general desirable property of RoPE-like positional encodings.

@MaximProshin MaximProshin removed this from the 2025.0 milestone Jan 23, 2025
@vshampor vshampor requested a review from slyalin January 23, 2025 09:05
@vshampor vshampor disabled auto-merge January 23, 2025 10:23
@vshampor vshampor added this pull request to the merge queue Jan 24, 2025
Merged via the queue into openvinotoolkit:master with commit 9caf53c Jan 24, 2025
68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: cmake / build Cmake scripts category: continuous batching Continuous batching category: GenAI C++ API Changes in GenAI C++ public headers category: Python API Python API for GenAI category: whisper Whisper pipeline no-match-files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants