
Add slice before matmul transformation for CB scenario #1261

Open
olpipi wants to merge 9 commits into base: master from the slice_matmul_cb branch
Conversation

@olpipi (Collaborator) commented Nov 27, 2024

CVS-154930
CVS-155533

Resolved (outdated) review threads: src/cpp/src/model_runner.hpp (x3), src/cpp/src/sampler.cpp
@ilya-lavrenov added this to the 2025.0 milestone Nov 27, 2024
@olpipi marked this pull request as ready for review November 28, 2024 18:27
Resolved review threads: src/cpp/src/utils/paged_attention_transformations.cpp (x2), src/cpp/src/utils.cpp, src/cpp/src/sampler.cpp (x3), src/cpp/src/model_runner.hpp (x2)
@github-actions bot added the category: LLM (LLM pipeline, stateful, static) label Dec 6, 2024
@mlukasze requested a review from iefode December 9, 2024 05:50
@olpipi force-pushed the slice_matmul_cb branch 2 times, most recently from 5fc1067 to c82d6d1, December 12, 2024 11:58
@@ -37,6 +37,8 @@ ContinuousBatchingPipeline::SpeculativeDecodingImpl::SpeculativeDecodingImpl(con

utils::apply_paged_attention_transformations(main_model, main_model_desc.scheduler_config.use_cache_eviction);
utils::apply_paged_attention_transformations(draft_model, main_model_desc.scheduler_config.use_cache_eviction);
utils::apply_gather_before_matmul_transformation(main_model);
utils::apply_gather_before_matmul_transformation(draft_model);
Contributor:
Please apply the same change to the prompt lookup decoding pipeline.
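
A hedged sketch of the requested follow-up, mirroring the speculative decoding diff above; the model and scheduler_config names stand in for whatever the prompt lookup decoding pipeline actually uses:

    // In the prompt lookup decoding pipeline constructor (names assumed):
    utils::apply_paged_attention_transformations(model, scheduler_config.use_cache_eviction);
    // Same change as in SpeculativeDecodingImpl: gather the tokens to be sampled
    // before the last matmul so only their logits are computed.
    utils::apply_gather_before_matmul_transformation(model);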

// context_len corresponds to first token within subgroup of scheduled tokens
size_t group_context_len = group_position_id;
// Next variables are only for sliced matmul case
size_t actual_seq_len = 0;
Contributor:
It should be per sequence, not per sequence group; you increment this value inside the loop over num_running_sequences, so you end up with seq_len * num_running_sequences instead of seq_len.
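
A minimal illustration of the issue, using a hypothetical loop shape rather than the actual ModelRunner code:

    // Buggy: the counter lives outside the per-sequence loop, so it accumulates
    // once per running sequence and ends up as seq_len * num_running_sequences.
    size_t actual_seq_len = 0;
    for (size_t seq_idx = 0; seq_idx < num_running_sequences; ++seq_idx) {
        actual_seq_len += num_scheduled_tokens;
    }

    // Intended: the value is per sequence, so compute (or reset) it inside the
    // loop and use it for that sequence only.
    for (size_t seq_idx = 0; seq_idx < num_running_sequences; ++seq_idx) {
        size_t per_sequence_seq_len = num_scheduled_tokens;
        // ... use per_sequence_seq_len for this sequence only ...
    }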

@@ -756,8 +756,7 @@ SamplerOutput Sampler::sample(std::vector<SequenceGroup::Ptr> & sequence_groups,
continue;

size_t num_running_sequences = sequence_group->num_running_seqs();
size_t actual_seq_len = sequence_group->get_num_scheduled_tokens(); // points to a token which needs to be sampled
size_t padded_amount_of_processed_tokens = std::max(actual_seq_len, batch_seq_len);
size_t actual_seq_len = sequence_group->get_seq_len_to_sample();
Contributor:
Suggested change
size_t actual_seq_len = sequence_group->get_seq_len_to_sample();
size_t output_seq_len = sequence_group->get_output_seq_len();

IMO it's a better name, since this function defines the number of tokens in the last matmul.

@@ -153,6 +173,7 @@ class ModelRunner {
subsequence_begins_data += 1;
block_indices_begins_data += 1;
}
sequence_group->set_seq_len_to_sample(matmul_gathering_is_required ? std::min(actual_seq_len, num_scheduled_tokens) : num_scheduled_tokens);
Contributor:
Suggested change
sequence_group->set_seq_len_to_sample(matmul_gathering_is_required ? std::min(actual_seq_len, num_scheduled_tokens) : num_scheduled_tokens);
sequence_group->set_seq_len_to_sample(seq_len);

Why not keep it as simple as suggested?


position_ids_data[token_id] = position_id;

if (matmul_gathering_is_required && sampling_is_required) {
if (echo_output ||
Contributor:
It looks like echo is broken here, since sampling_is_required can be false when the prompt is processed over several iterations (e.g. with dynamic split-fuse and a batch size of 256, a 1024-token prompt requires 4 iterations).
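
A standalone sketch of that scenario with hypothetical values (not the real scheduler state): only the chunk containing the last prompt token has sampling_is_required set, so gating the gather on it drops the logits of the earlier prompt chunks that echo mode still needs.

    size_t prompt_len = 1024;
    size_t tokens_per_iteration = 256;  // assumed scheduler budget per step
    bool echo_output = true;
    for (size_t chunk = 0; chunk < prompt_len / tokens_per_iteration; ++chunk) {
        size_t last_token_in_chunk = (chunk + 1) * tokens_per_iteration - 1;
        // true only for the final chunk (chunk 3)
        bool sampling_is_required = last_token_in_chunk >= prompt_len - 1;
        // Gating the gather on sampling_is_required means chunks 0..2 are never
        // gathered, so their logits are missing from the echo output.
        bool gather_this_chunk = sampling_is_required && echo_output;
        (void)gather_this_chunk;
    }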


if (matmul_gathering_is_required && sampling_is_required) {
if (echo_output ||
group_position_id + token_id >= prompt_len - 1 &&
Contributor:
Why do we need this condition? It says the token is not a prompt token, which is already guaranteed by sampling_is_required.

if (matmul_gathering_is_required && sampling_is_required) {
if (echo_output ||
group_position_id + token_id >= prompt_len - 1 &&
group_position_id + token_id >= num_scheduled_tokens - tokens_to_sample_per_sequence) {
Contributor:
What does this condition do?

Let's consider an example:
group_position_id (the same as the content length / KV cache size) is 16
we have scheduled 1 token (because of KV cache limitations)
the number of candidates for speculative decoding is 3, so tokens_to_sample_per_sequence is 4
so the condition becomes:

16 + 0 >= 1 - 4

With size_t operands, 1 - 4 wraps around to a huge unsigned value, so the comparison does not express any meaningful bound here.
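
A self-contained sketch of the arithmetic in this example, assuming the operands are size_t as elsewhere in the diff (the variable names mirror the review comment, not the actual code path):

    #include <cstddef>
    #include <iostream>

    int main() {
        size_t group_position_id = 16, token_id = 0;   // content length 16
        size_t num_scheduled_tokens = 1;               // limited by KV cache
        size_t tokens_to_sample_per_sequence = 4;      // 3 candidates + 1
        // 1 - 4 wraps around to SIZE_MAX - 2 in unsigned arithmetic,
        // so the comparison is false rather than the intended "16 >= -3".
        bool cond = group_position_id + token_id >=
                    num_scheduled_tokens - tokens_to_sample_per_sequence;
        std::cout << std::boolalpha << cond << '\n';   // prints "false"
    }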

Labels
category: continuous batching (Continuous batching), category: LLM (LLM pipeline, stateful, static), category: sampling (Sampling / Decoding algorithms), category: speculative decoding (Speculative decoding), no-match-files

3 participants