dl/translation: scheduling policy for translation lag #25016
Conversation
LGTM, a couple of nits is all.
on_resource_exhaustion(executor&, const reservations_tracker&) override;

private:
    // Minium expected time slice allotment share of the total target lag.
Suggested change:
- // Minium expected time slice allotment share of the total target lag.
+ // Minimum expected time slice allotment share of the total target lag.
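For context on what a coefficient like this expresses, here is a minimal sketch of how it could be applied; the function name, parameters, and the clamping arithmetic are assumptions for illustration, not the code in this PR.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical illustration only: a minimum allotment coefficient guarantees
// each translator at least a small fixed fraction of the target lag, no matter
// how small its proportional share is.
std::chrono::milliseconds time_slice(
  double shares_fraction,
  std::chrono::milliseconds target_lag,
  double minimum_allotment_coeff) {
    double fraction = std::max(shares_fraction, minimum_allotment_coeff);
    return std::chrono::duration_cast<std::chrono::milliseconds>(
      target_lag * fraction);
}
```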
// Check {@link minimum_allotment_coeff}
unfulfilled_quota,
// Cannot be classified into one of the following groups
random,
is `other` a better name than `random`?
yep, renamed.
static constexpr long default_about_to_expire_group_shares = 30;
static constexpr long default_expired_group_shares = 50;

absl::flat_hash_map<translator_group, long> _group_to_shares = {
Should this be `static`?
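If the share table is identical for every policy instance and never mutated, one possible shape is a function-local static; the enumerator names below are guesses based on the constant names, so treat this purely as a sketch.

```cpp
#include <absl/container/flat_hash_map.h>

// Hypothetical stand-ins for the real enum and constants, just to keep the
// sketch self-contained.
enum class translator_group { unfulfilled_quota, about_to_expire, expired, other };
static constexpr long default_about_to_expire_group_shares = 30;
static constexpr long default_expired_group_shares = 50;

// A function-local static keeps a single shared table without the
// initialization-order concerns of a non-trivial class-scope static member.
const absl::flat_hash_map<translator_group, long>& default_group_shares() {
    static const absl::flat_hash_map<translator_group, long> shares = {
      {translator_group::about_to_expire, default_about_to_expire_group_shares},
      {translator_group::expired, default_expired_group_shares},
    };
    return shares;
}
```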
executor.as.check();

while (!prioritized.empty()) {
if the above loop needs to be reactor friendly, doesn't this one too?
I don't think this loop is expensive in most cases because we start the first available translator and break out of the loop; added a yield just in case.
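A minimal sketch of what the loop with the added yield might look like, assuming a Seastar coroutine context; the candidate type and the `try_start` helper are hypothetical stand-ins, not the actual scheduler code.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <queue>

namespace ss = seastar;

// Hypothetical candidate type and start helper, only to keep the sketch
// self-contained.
struct candidate_t { int id; };
ss::future<bool> try_start(candidate_t) { co_return true; }

// Start the first available translator and break; yield to the reactor
// between iterations so the scan stays preemptible even if nothing can start.
ss::future<> dispatch_one(std::queue<candidate_t> prioritized) {
    while (!prioritized.empty()) {
        auto candidate = prioritized.front();
        prioritized.pop();
        if (co_await try_start(candidate)) {
            break;
        }
        co_await ss::coroutine::maybe_yield();
    }
}
```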
lgtm. few tiny nits and questions
Thanks for the quick reviews.
lgtm
candidates.push_back(
  {.id = id,
   .group = translator_group::other,
   .weight = random_generators::get_int<long>()});
Not sure, but I wonder if we need to bound the random number, given we're going to be comparing against something relatively bounded (a time duration)
> given we're going to be comparing against something relatively bounded (a time duration)

Not sure what you mean by this exactly; this weight is used for ordering within the group, so we are just comparing amongst these random numbers. (I'm ok with clamping it with a bound, but just want to make sure we are on the same page.)
Ah, oops, thanks for clarifying. I had missed that we don't compare across groups.
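A small sketch of the within-group ordering being described; the candidate struct mirrors the snippet above, but the comparator and surrounding code are assumptions for illustration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-ins for the types in the snippet above.
enum class translator_group { unfulfilled_quota, other };

struct candidate {
    int id;
    translator_group group;
    long weight; // random for the `other` group
};

// The random weight only orders members of the same group relative to each
// other; it is never compared against a quantity with units (e.g. a duration),
// so an unbounded random value works, though clamping would be harmless.
void order_within_group(std::vector<candidate>& members) {
    std::sort(
      members.begin(), members.end(),
      [](const candidate& a, const candidate& b) { return a.weight < b.weight; });
}
```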
while (mem_tracker.memory_exhausted() && !executor.as.abort_requested()) {
    co_await ss::sleep_abortable(polling_interval, executor.as);
}
Are we guaranteed to make progress within the policy here? I'm wondering whether something like this ends up happening:
1. we have 10/10 blocks reserved
2. t_1 attempts to reserve a block, which notifies the scheduler about memory exhaustion and waits
3. since the scheduler and policy run in a single loop, we eventually get to this line after asynchronously stopping t_big, which has one block
4. t_1 immediately takes the reservation since it's been waiting on the semaphore
5. this loop doesn't exit because memory is still exhausted now that t_1 has taken the reservation, until the translators naturally finish
6. ?? maybe other translators try to reserve, but the scheduler is stuck in this loop, so no translator can make progress?
Right, I think this case is possible if all the translators are really high throughput. As for progress, the wait time is bounded here (meaning we are not stuck in this loop forever) because each translator will naturally release memory as its time slice finishes. I have an idea to improve the behavior here; let me push it in the next rev.
Hmm, I think I'm missing something. What bounds the wait time? It doesn't look like this while loop ever exits if we get to step 6 (it'd be a live lock), given that neither `default_reservation_tracker::reserve_memory()` nor this while loop has a timeout?
The loop exits because `mem_tracker.memory_exhausted()` eventually returns false, once the in-flight translations exceed their time quota and release their resources (see the mock_translator impl).
Could we get into a scenario where all translators are waiting on the reservation semaphore with no deadline, while this loop is also waiting for memory to be freed with no deadline?
I don't think that's possible, because there is a deadline: the time slice allotted to every translator. The translator has to release its resources after that time slice elapses no matter what, which keeps the wait here bounded.
Ahh I think you're right. I missed that the abort source used by the reservation tracker is the translator abort source, not the executor's. Sorry for the noise!
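A rough sketch of why the wait stays bounded under this reading; the translator interface, the function name, and the release point are all hypothetical, chosen only to illustrate the time-slice deadline argument.

```cpp
#include <chrono>
#include <seastar/core/coroutine.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/sleep.hh>

namespace ss = seastar;

// Hypothetical translator interface, only to make the sketch self-contained.
struct translator {
    ss::future<> reserve_memory() { co_return; } // blocks while memory is exhausted
    void release_memory() {}
    bool has_more_work() const { return true; }
    ss::future<> translate_some() { return ss::sleep(std::chrono::milliseconds(10)); }
};

// Every translator runs against a deadline derived from its time slice and
// releases its reservation once that deadline passes, whether or not it
// finished, so the scheduler's polling loop cannot wait forever.
ss::future<> run_with_time_slice(translator& t, ss::lowres_clock::duration slice) {
    co_await t.reserve_memory();
    auto deadline = ss::lowres_clock::now() + slice;
    while (t.has_more_work() && ss::lowres_clock::now() < deadline) {
        co_await t.translate_some();
    }
    t.release_memory(); // frees the reservation, so memory_exhausted() can flip to false
}
```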
have a clarifying question on one of the comments, will push the next rev once that is addressed.
A scheduling policy that strives to meet the target lag deadline for the translators while still being fair, so that translators with a small lag do not starve translators with a large lag. The policy uses a heuristic with shares to guarantee a degree of fairness proportional to the allotted shares.
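As a rough illustration of the shares heuristic (the function, the budget parameter, and the arithmetic are assumptions, not this PR's code), shares could translate into proportional allotments like this:

```cpp
#include <chrono>

using namespace std::chrono_literals;

// A group's shares buy it a proportional slice of the scheduling budget, so no
// single group can monopolize the scheduler while others go unserved.
std::chrono::milliseconds group_allotment(
  long group_shares, long total_shares, std::chrono::milliseconds budget) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
      budget * (static_cast<double>(group_shares) / total_shares));
}

// e.g. with the defaults quoted in the diff (expired = 50, about_to_expire = 30):
//   group_allotment(50, 80, 1000ms) == 625ms
//   group_allotment(30, 80, 1000ms) == 375ms
```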
Backports Required
Release Notes