dl/translation: scheduling policy for translation lag #25016

bharathv · 2025-02-04T03:58:17Z

A scheduling policy that strives to meet the target lag deadline for the translators while remaining fair, so that translators
with a small lag do not starve translators with a large lag. The policy uses a share-based heuristic to guarantee a degree of
fairness proportional to the allotted shares.
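
As a rough illustration of share-proportional fairness (a minimal sketch, not the code in this PR), the snippet below picks a group so that each group wins in proportion to its shares. The group labels, the share values used in the usage note, and the helpers `pick_group` and `tally` are all hypothetical; the actual heuristic in the policy may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical group labels, for illustration only.
enum class group { expired, about_to_expire, other };

// Picks a group so that, over many tickets, each group is selected in
// proportion to its shares. `ticket` can be a counter or a random draw.
inline group pick_group(const std::map<group, long>& shares, std::uint64_t ticket) {
    long total = 0;
    for (const auto& kv : shares) {
        total += kv.second;
    }
    assert(total > 0);
    long slot = static_cast<long>(ticket % static_cast<std::uint64_t>(total));
    for (const auto& kv : shares) {
        if (slot < kv.second) {
            return kv.first;
        }
        slot -= kv.second;
    }
    return shares.rbegin()->first; // not reached when total > 0
}

// Counts how often each group wins over `rounds` consecutive tickets.
inline std::map<group, long> tally(const std::map<group, long>& shares, std::uint64_t rounds) {
    std::map<group, long> wins;
    for (std::uint64_t t = 0; t < rounds; ++t) {
        ++wins[pick_group(shares, t)];
    }
    return wins;
}
```

With assumed shares of 50, 30 and 20, one hundred consecutive tickets give each group exactly its share of wins, so a group with few shares is still served regularly rather than starved.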

Backports Required

Release Notes

none

bharathv · 2025-02-04T04:00:11Z

/dt

vbotbuildovich · 2025-02-04T07:53:54Z

CI test results

test results on build#61544

test_id	test_kind	job_url	test_status	passed
datalake_translation_tests_rpunit.datalake_translation_tests_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61544#0194cf20-a1e4-4fd8-81b4-ccb6a395b925	FAIL	0/2
gtest_raft_rpunit.gtest_raft_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61544#0194cf20-a1e3-428a-904f-186259ad53a6	FLAKY	1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61544#0194cf6a-13a6-4d87-90ae-41c9013bdbca	FLAKY	1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP	ducktape	https://buildkite.com/redpanda/redpanda/builds/61544#0194cf6a-13a7-4762-86ef-a76b7f3f3c2b	FLAKY	1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC	ducktape	https://buildkite.com/redpanda/redpanda/builds/61544#0194cf6a-13a8-43a0-986d-cb5936b671a3	FLAKY	1/5
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=2.cloud_storage_type=CloudStorageType.ABS	ducktape	https://buildkite.com/redpanda/redpanda/builds/61544#0194cf6a-13a8-43a0-986d-cb5936b671a3	FLAKY	1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic	ducktape	https://buildkite.com/redpanda/redpanda/builds/61544#0194cf6a-13a6-4d87-90ae-41c9013bdbca	FLAKY	1/3

test results on build#61582

test_id	test_kind	job_url	test_status	passed
rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=compact.delete	ducktape	https://buildkite.com/redpanda/redpanda/builds/61582#0194d316-ddba-491b-b0c5-ff442bbb1e0d	FLAKY	1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery	ducktape	https://buildkite.com/redpanda/redpanda/builds/61582#0194d316-ddbb-492a-b82a-15a0d4e32bd5	FLAKY	1/3
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61582#0194d316-ddb9-4fd2-bb9a-98bf9a732778	FLAKY	1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC	ducktape	https://buildkite.com/redpanda/redpanda/builds/61582#0194d316-ddba-491b-b0c5-ff442bbb1e0d	FLAKY	1/4

test results on build#61625

test_id	test_kind	job_url	test_status	passed
partition_balancer_planner_test_rpunit.partition_balancer_planner_test_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61625#0194d721-cfd3-4a22-8caa-8b52686ebca7	FLAKY	1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP	ducktape	https://buildkite.com/redpanda/redpanda/builds/61625#0194d76a-f03c-413e-b11f-09a455fb72ce	FLAKY	1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC	ducktape	https://buildkite.com/redpanda/redpanda/builds/61625#0194d76a-f039-4c34-9212-f015ed475786	FLAKY	1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS	ducktape	https://buildkite.com/redpanda/redpanda/builds/61625#0194d76a-f03b-4e32-8cb8-7a9afab37153	FLAKY	1/2

test results on build#61680

test_id	test_kind	job_url	test_status	passed
gtest_raft_rpunit.gtest_raft_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/61680#0194dc29-bb4d-43e3-9191-9d3992db5a72	FLAKY	1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/61680#0194dc72-df59-4fa3-ba73-4aa9b652306e	FLAKY	1/2
rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move_x_core.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/61680#0194dc6d-b1dc-452c-abaa-9275421335d3	FLAKY	1/2

bharathv · 2025-02-04T16:36:31Z

/dt

rockwotj

LGTM, a couple of nits is all.

rockwotj · 2025-02-04T20:58:33Z

src/v/datalake/translation/scheduling_policies.h

+    on_resource_exhaustion(executor&, const reservations_tracker&) override;
+
+private:
+    // Minium expected time slice allotment share of the total target lag.


Suggested change

-    // Minium expected time slice allotment share of the total target lag.
+    // Minimum expected time slice allotment share of the total target lag.

rockwotj · 2025-02-04T21:00:36Z

src/v/datalake/translation/scheduling_policies.h

+        // Check {@link minimum_allotment_coeff}
+        unfulfilled_quota,
+        // Cannot be classified into one of the following groups
+        random,


Is "other" a better name than "random"?

yep, renamed.

rockwotj · 2025-02-04T21:07:32Z

src/v/datalake/translation/scheduling_policies.h

+    static constexpr long default_about_to_expire_group_shares = 30;
+    static constexpr long default_expired_group_shares = 50;
+
+    absl::flat_hash_map<translator_group, long> _group_to_shares = {


Should this be static?

rockwotj · 2025-02-04T21:16:51Z

src/v/datalake/translation/scheduling_policies.cc

+
+    executor.as.check();
+
+    while (!prioritized.empty()) {


if the above loop needs to be reactor friendly, doesn't this one too?

I don't think this loop is expensive in most cases, because we start the first available translator and break out of the loop. Added a yield just in case.

oleiman

lgtm. few tiny nits and questions

src/v/datalake/translation/scheduling_policies.cc

src/v/datalake/translation/tests/fair_scheduling_policy_tests.cc

bharathv

Thanks for the quick reviews.

bharathv · 2025-02-05T00:50:58Z

src/v/datalake/translation/scheduling_policies.h

+        // Check {@link minimum_allotment_coeff}
+        unfulfilled_quota,
+        // Cannot be classified into one of the following groups
+        random,


yep, renamed.

src/v/datalake/translation/scheduling_policies.cc

bharathv · 2025-02-05T01:07:57Z

src/v/datalake/translation/scheduling_policies.cc

+
+    executor.as.check();
+
+    while (!prioritized.empty()) {


I don't think this loop is expensive in most cases, because we start the first available translator and break out of the loop. Added a yield just in case.

oleiman

lgtm

andrwng · 2025-02-05T18:49:24Z

src/v/datalake/translation/scheduling_policies.cc

+                candidates.push_back(
+                  {.id = id,
+                   .group = translator_group::other,
+                   .weight = random_generators::get_int<long>()});


Not sure, but I wonder if we need to bound the random number, given we're going to be comparing against something relatively bounded (a time duration)

given we're going to be comparing against something relatively bounded (a time duration)

Not sure what you mean by this exactly. This weight is used for ordering within the group, so we are only comparing these random numbers with each other. (I'm OK with clamping it to a bound, but I want to make sure we are on the same page.)

Ah oops, thanks for clarifying. I hadn't realized that we don't compare across groups.
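
To make the point of this thread concrete, here is a minimal sketch (not the PR's implementation) of ordering candidates within each group only. The `candidate` fields and `translator_group::other` mirror the diff above; the `expired` enumerator, the `int` id type, the `order_within_groups` helper, and the choice that a larger weight sorts first are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// `other` appears in the diff; `expired` is an assumed second group.
enum class translator_group { expired, other };

struct candidate {
    int id; // assumed id type
    translator_group group;
    long weight;
};

// Buckets candidates by group, then sorts each bucket by descending weight.
// Weights from different groups are never compared with each other, so an
// unbounded random weight in one group cannot outrank a duration-based
// weight in another.
inline std::map<translator_group, std::vector<candidate>>
order_within_groups(const std::vector<candidate>& all) {
    std::map<translator_group, std::vector<candidate>> buckets;
    for (const auto& c : all) {
        buckets[c.group].push_back(c);
    }
    for (auto& kv : buckets) {
        std::sort(
          kv.second.begin(),
          kv.second.end(),
          [](const candidate& a, const candidate& b) {
              return a.weight > b.weight;
          });
    }
    return buckets;
}
```

In this sketch a candidate in `other` with a weight in the millions still stays in its own bucket, which is why the magnitude of the random number does not matter.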

andrwng · 2025-02-05T19:09:32Z

src/v/datalake/translation/scheduling_policies.cc

+    while (mem_tracker.memory_exhausted() && !executor.as.abort_requested()) {
+        co_await ss::sleep_abortable(polling_interval, executor.as);
+    }


Are we guaranteed to make progress within the policy here? I'm wondering whether something like this could end up happening:

1. we have 10/10 blocks reserved
2. t_1 attempts to reserve a block, which notifies the scheduler about memory exhaustion and waits
3. since the scheduler and policy run in a single loop, we eventually get to this line after asynchronously stopping t_big, which has one block
4. t_1 immediately takes the reservation since it's been waiting on the semaphore
5. this loop doesn't exit because memory is still exhausted now that t_1 has taken the reservation, until the translators naturally finish
6. ?? maybe other translators try to reserve, but the scheduler is stuck in this loop, so no translator can make progress?

Right, I think this case is possible if all the translators are really high throughput. As for progress, the wait time here is bounded (meaning we are not stuck in this loop forever) because each translator naturally releases memory when its time slice finishes. I have an idea to improve the behavior here; let me push it in the next rev.

Hmm I think I'm missing something. What bounds the wait time? It doesn't look like this while loop ever exits if we get to step 6 (it'd be a live lock), given neither the default_reservation_tracker::reserve_memory() nor this while loop have a timeout?

The loop exits because mem_tracker.memory_exhausted() returns false eventually when the inflight translations exceed their time quota and release their resources (see mock_translator impl).

Could we get into a scenario where all translators are waiting on the reservation semaphore with no deadline, while this loop is also waiting for memory to be freed with no deadline?

I don't think that's possible, because there is a deadline: the time_slice allotted to every translator. The translator has to release its resources after that time slice elapses no matter what, which keeps the wait here bounded.

Ahh I think you're right. I missed that the abort source used by the reservation tracker is the translator abort source, not the executor's. Sorry for the noise!
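
The bounded-wait argument in this thread can be sketched as a small discrete-time model. This is only an illustration under simplifying assumptions: time advances in whole polling ticks, every running translator releases all of its blocks once its time slice ends, and no new reservations are granted while the policy waits. The `running_translator` type and `ticks_until_memory_available` helper are hypothetical and do not exist in the PR.

```cpp
#include <cassert>
#include <vector>

// A running translator that holds memory blocks until its time slice ends.
struct running_translator {
    int blocks_held;
    int slice_ends_at; // tick at which it must release everything
};

// Returns the number of polling ticks the policy waits before memory is no
// longer exhausted. The loop terminates because every slice end is finite
// and total_blocks > 0: the wait is at most the latest slice end.
inline int ticks_until_memory_available(
  std::vector<running_translator> running, int total_blocks) {
    assert(total_blocks > 0);
    int now = 0;
    for (;;) {
        int used = 0;
        for (auto& t : running) {
            if (t.slice_ends_at <= now) {
                t.blocks_held = 0; // forced release at the end of the slice
            }
            used += t.blocks_held;
        }
        if (used < total_blocks) {
            return now;
        }
        ++now; // stands in for one sleep of the polling interval
    }
}
```

With all 10 of 10 blocks held by translators whose slices end at ticks 5, 3 and 8, the model stops waiting at tick 3, when the first slice expires, and it can never wait past tick 8.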

src/v/datalake/translation/scheduling_policies.cc

bharathv

have a clarifying question on one of the comments, will push the next rev once that is addressed.

bharathv · 2025-02-06T02:59:08Z

src/v/datalake/translation/scheduling_policies.cc

+                candidates.push_back(
+                  {.id = id,
+                   .group = translator_group::other,
+                   .weight = random_generators::get_int<long>()});


given we're going to be comparing against something relatively bounded (a time duration)

Not sure what you mean by this exactly. This weight is used for ordering within the group, so we are only comparing these random numbers with each other. (I'm OK with clamping it to a bound, but I want to make sure we are on the same page.)

src/v/datalake/translation/scheduling_policies.cc

bharathv · 2025-02-06T03:29:30Z

src/v/datalake/translation/scheduling_policies.cc

+    while (mem_tracker.memory_exhausted() && !executor.as.abort_requested()) {
+        co_await ss::sleep_abortable(polling_interval, executor.as);
+    }


Right, I think this case is possible if all the translators are really high throughput. As for progress, the wait time here is bounded (meaning we are not stuck in this loop forever) because each translator naturally releases memory when its time slice finishes. I have an idea to improve the behavior here; let me push it in the next rev.

github-actions bot added area/build area/redpanda labels Feb 4, 2025

bharathv force-pushed the ifff33 branch 2 times, most recently from 54d13e8 to 1ccd58c Compare February 4, 2025 16:36

bharathv marked this pull request as ready for review February 4, 2025 20:45

bharathv requested review from ztlpn, rockwotj, andrwng, mmaslankaprv and oleiman February 4, 2025 20:46

rockwotj previously approved these changes Feb 4, 2025

oleiman previously approved these changes Feb 4, 2025

bharathv commented Feb 5, 2025

bharathv dismissed stale reviews from oleiman and rockwotj via c9b896b February 5, 2025 17:19

bharathv force-pushed the ifff33 branch from 1ccd58c to c9b896b Compare February 5, 2025 17:19

bharathv requested review from rockwotj and oleiman February 5, 2025 17:37

oleiman previously approved these changes Feb 5, 2025

andrwng reviewed Feb 5, 2025

bharathv commented Feb 6, 2025

bharathv dismissed oleiman’s stale review via dc3adb7 February 6, 2025 16:44

bharathv force-pushed the ifff33 branch from c9b896b to dc3adb7 Compare February 6, 2025 16:44

bharathv added 3 commits February 6, 2025 08:46

dl/translation: add a scheduling policy for translation lag semantics

c98262f

dl/translation/tests: add tests for fair scheduling policy

6e34aef

dl/translation: switch default scheduling policy

e65521e

bharathv force-pushed the ifff33 branch from dc3adb7 to e65521e Compare February 6, 2025 16:46

bharathv requested review from andrwng and oleiman February 6, 2025 16:46

andrwng approved these changes Feb 6, 2025

oleiman approved these changes Feb 6, 2025

bharathv enabled auto-merge February 6, 2025 18:16

bharathv merged commit 317787e into redpanda-data:dev Feb 6, 2025
17 checks passed

bharathv deleted the ifff33 branch February 6, 2025 20:25
