
storage: book keep dirty_ratio in disk_log_impl #24649

Open · wants to merge 5 commits into dev from dirty_ratio_compaction
Conversation

@WillemKauf (Contributor) commented Dec 23, 2024

The dirty ratio of a log is defined as the ratio between the number of bytes in "dirty" segments and the total number of bytes in closed segments.

Dirty segments are closed segments which have not yet been cleanly compacted, i.e., duplicates for keys in such a segment could still be found in the prefix of the log up to that segment.

Add book-keeping to disk_log_impl to cache both _dirty_segment_bytes and _closed_segment_bytes, which allows the dirty ratio to be calculated, and add observability for it in storage::probe.

In the future, this could be used in combination with a compaction configuration a la min.cleanable.dirty.ratio to schedule compaction.
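
For reference, the ratio itself is just a quotient of those two cached byte counts; a minimal sketch (the counter names come from this description, the free function itself is illustrative):

```cpp
#include <cstdint>

// Sketch only: the dirty ratio as defined above, i.e. bytes in dirty
// segments over total bytes in closed segments. Names are illustrative.
double dirty_ratio(uint64_t dirty_segment_bytes, uint64_t closed_segment_bytes) {
    if (closed_segment_bytes == 0) {
        return 0.0; // no closed segments yet, so nothing is dirty
    }
    return static_cast<double>(dirty_segment_bytes)
           / static_cast<double>(closed_segment_bytes);
}
```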

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Improvements

  • Adds the observable metrics dirty_segment_bytes and closed_segment_bytes to the storage layer.

@vbotbuildovich (Collaborator) commented Dec 24, 2024

CI test results

test results on build#60092

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60092#0193f576-772e-4d29-8e01-3035f1e9661c | FLAKY | 1/2 |

test results on build#60241

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rptest.transactions.tx_atomic_produce_consume_test.TxAtomicProduceConsumeTest.test_basic_tx_consumer_transform_produce.with_failures=True | ducktape | https://buildkite.com/redpanda/redpanda/builds/60241#019429a2-9e8f-413c-87cf-33a7c04d8757 | FAIL | 0/1 |

test results on build#60591

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| rm_stm_tests_rpunit.rm_stm_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/60591#01945072-b1aa-48fd-bf2e-574626a0b62c | FLAKY | 1/2 |
| rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli | ducktape | https://buildkite.com/redpanda/redpanda/builds/60591#019450bf-4a4b-4d89-97f6-141851e1303f | FLAKY | 4/6 |

@vbotbuildovich (Collaborator)

Retry command for Build#60241

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/transactions/tx_atomic_produce_consume_test.py::TxAtomicProduceConsumeTest.test_basic_tx_consumer_transform_produce@{"with_failures":true}

@dotnwat (Member) previously approved these changes Jan 4, 2025 and left a comment

This is awesome.

  1. Do we already have a metric that measures the actual size reduction achieved by compaction for a partition? What I'm thinking about is future scheduling: if the partition with the highest dirty ratio has all unique keys, then compaction won't help at all, and it might be more useful to compact some other partition with lots of duplicates first. Using the compaction ratio achieved in the last round might be a good enough proxy (see the hypothetical sketch after this list).

  2. The other thing I was going to ask: there are many places where compacted bytes and dirty bytes are updated. Is this error-prone compared to running a for-loop that computes the metrics on demand in the probe handler? Is that loop over the segments slow enough that we should avoid it, as this PR does?
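
To make the scheduling idea in point 1 concrete, here is a hypothetical ranking sketch (illustrative only, not part of this PR; all names are made up):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical scheduling sketch, not part of this PR. A partition with a
// high dirty ratio but all-unique keys (last_compaction_ratio near 1.0)
// would rank below one with many duplicate keys.
struct partition_compaction_stats {
    double dirty_ratio;           // dirty bytes / closed bytes
    double last_compaction_ratio; // bytes_after / bytes_before from the last round
};

// Estimated fraction of closed bytes we would expect to reclaim.
double compaction_benefit(const partition_compaction_stats& s) {
    return s.dirty_ratio * (1.0 - s.last_compaction_ratio);
}

// Order partitions so the biggest expected payoff is compacted first.
void sort_by_benefit(std::vector<partition_compaction_stats>& parts) {
    std::sort(parts.begin(), parts.end(), [](const auto& a, const auto& b) {
        return compaction_benefit(a) > compaction_benefit(b);
    });
}
```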

@WillemKauf (Contributor Author)

do we already have metric that measures the actual size reduction achieved by compaction for a partition

We do already bookkeep the compaction ratio in the log. You raise a good point that the dirty ratio on its own may not be the best heuristic for scheduling (given that key cardinality isn't considered).

My question would then be: do we want parity with Kafka here, or do we want to try something new with the compaction ratio as a scheduling heuristic?

the other thing i was going to ask is: there are many places where compacted bytes and dirty bytes are updated. is this error prone compared to running a for-loop to compute the metrics on demand in the probe handler

Is it error-prone? I don't think so; as far as bytes removed/added go, I think I've covered all the code locations where this can happen in disk_log_impl (segment removal, bytes removed by compaction, segment rolls). I do share your concern that if a byte-affecting location were missed, these metrics would begin to drift from their true values.

So I would agree with you that an on-demand for-loop over the segments would be the most verifiably accurate and up-to-date way of reporting these metrics.
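
For comparison, the on-demand approach would look roughly like the sketch below; `segment` here is a simplified stand-in for the real storage types, so this shows the shape of the loop rather than the PR's code:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for storage segments; the real types are richer.
struct segment {
    uint64_t size_bytes;
    bool closed;            // no longer the active (appending) segment
    bool cleanly_compacted; // already de-duplicated against the log prefix
};

struct dirty_ratio_snapshot {
    uint64_t dirty_bytes = 0;
    uint64_t closed_bytes = 0;
};

// Recompute both counters on demand (e.g. from the probe handler) instead of
// maintaining them incrementally at every byte-affecting call site.
dirty_ratio_snapshot compute_snapshot(const std::vector<segment>& segments) {
    dirty_ratio_snapshot snap;
    for (const auto& s : segments) {
        if (!s.closed) {
            continue;
        }
        snap.closed_bytes += s.size_bytes;
        if (!s.cleanly_compacted) {
            snap.dirty_bytes += s.size_bytes;
        }
    }
    return snap;
}
```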

@dotnwat (Member) commented Jan 5, 2025

My question would then be: do we want parity with Kafka here, or do we want to try something new with the compaction ratio as a scheduling heuristic?

I think the dirty ratio still makes sense to track, and like you mention, we can figure out scheduling later. My instinct is that the dirty ratio is probably good enough, but if the scheduling goal is simply to process the biggest-bang-for-the-buck partition first, then taking the estimated actual compaction ratio into account (in combination with the dirty ratio) would seem to make sense. Maybe it doesn't make a big difference on real-world data.

I do share your concern that if a byte-affecting location was missed that these metrics would begin to drift from their true values.

Yeah. It's probably not something we need to be too concerned about, since there really aren't many cases where we need to update the probe. We should be able to catch new cases in review.

Comment on lines 3892 to 3942
template<typename tag>
void disk_log_impl::update_dirty_segment_bytes(uint64_t bytes) {
apply_tagged_operation<tag>(_dirty_segment_bytes, bytes);
_probe->set_dirty_segment_bytes(_dirty_segment_bytes);
}

template<typename tag>
void disk_log_impl::update_closed_segment_bytes(uint64_t bytes) {
apply_tagged_operation<tag>(_closed_segment_bytes, bytes);
_probe->set_closed_segment_bytes(_closed_segment_bytes);
}
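
(`apply_tagged_operation` and its tags are not part of the quoted lines; a plausible reconstruction, inferred from the `add_tag{}`/`subtract_tag{}` names mentioned later in this thread, is sketched below. It is illustrative, not the PR's actual code.)

```cpp
#include <algorithm>
#include <cstdint>
#include <type_traits>

// Illustrative reconstruction of the tag-dispatch helper; clamping on
// subtraction is a guess, mirroring the boilerplate suggested below.
struct add_tag {};
struct subtract_tag {};

template<typename tag>
void apply_tagged_operation(uint64_t& counter, uint64_t bytes) {
    if constexpr (std::is_same_v<tag, add_tag>) {
        counter += bytes;
    } else {
        static_assert(std::is_same_v<tag, subtract_tag>);
        counter -= std::min(bytes, counter);
    }
}
```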
Contributor

This seems overly complicated for something that could be:

void disk_log_impl::add_dirty_segment_bytes(uint64_t bytes) {
    _dirty_segment_bytes += bytes;
    _probe->set_dirty_segment_bytes(_dirty_segment_bytes);
}
void disk_log_impl::add_closed_segment_bytes(uint64_t bytes) {
    _closed_segment_bytes += bytes;
    _probe->set_closed_segment_bytes(_closed_segment_bytes);
}
void disk_log_impl::subtract_dirty_segment_bytes(uint64_t bytes) {
    _dirty_segment_bytes -= std::min(bytes, _dirty_segment_bytes);
    _probe->set_dirty_segment_bytes(_dirty_segment_bytes);
}
void disk_log_impl::subtract_closed_segment_bytes(uint64_t bytes) {
    _closed_segment_bytes -= std::min(bytes, _closed_segment_bytes);
    _probe->set_closed_segment_bytes(_closed_segment_bytes);
}

Contributor Author

Yeah, it could be split into these 4 functions, but this is the very boilerplate I was seeking to avoid in the current solution.

I could see an argument for the simple boilerplate over the complexity of the template solution, though I'm of the opinion it is not that much of a complexity add.

If you much prefer the boilerplate, I'd be happy to change it.

Contributor

My preference is boilerplate over templates when it's really simple to do so. Maybe I'd feel differently if there were more than two operations (add/subtract).

Contributor Author

Removed the template solution in favor of the boilerplate functions.

src/v/storage/disk_log_impl.cc (outdated review thread, resolved)
tests/rptest/tests/log_compaction_test.py (outdated review thread, resolved)
@WillemKauf (Contributor Author)

Force push to:

  • Fix accidentally moved @skip_debug_mode decorator in log_compaction_test.py

Adds the book-keeping variables `_dirty/closed_segment_bytes` to
`disk_log_impl`, as well as some getter/setter functions.

These functions will be used throughout `disk_log_impl` where required
(segment rolling, compaction, segment eviction) to track the bytes
contained in dirty and closed segments.

Uses the added functions `update_dirty/closed_segment_bytes()`
in the required locations within `disk_log_impl` in order
to bookkeep the dirty ratio.

Bytes can be either removed or added by rolling new segments,
compaction, and retention enforcement.
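
As a rough illustration of the update sites those commit messages describe (hypothetical and condensed; not the PR's actual code):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical placement sketch: a stripped-down bookkeeper showing at which
// events the dirty/closed byte counters are touched.
struct dirty_ratio_bookkeeper {
    uint64_t dirty_segment_bytes = 0;
    uint64_t closed_segment_bytes = 0;

    // Segment roll: the previously active segment becomes a closed, dirty one.
    void on_segment_rolled(uint64_t segment_bytes) {
        closed_segment_bytes += segment_bytes;
        dirty_segment_bytes += segment_bytes;
    }

    // Compaction: closed bytes shrink, and the cleaned range is no longer dirty.
    void on_compaction(uint64_t bytes_removed, uint64_t bytes_cleaned) {
        closed_segment_bytes -= std::min(bytes_removed, closed_segment_bytes);
        dirty_segment_bytes -= std::min(bytes_cleaned, dirty_segment_bytes);
    }

    // Retention/eviction: evicted segments leave both totals.
    void on_segments_evicted(uint64_t closed_bytes, uint64_t dirty_bytes) {
        closed_segment_bytes -= std::min(closed_bytes, closed_segment_bytes);
        dirty_segment_bytes -= std::min(dirty_bytes, dirty_segment_bytes);
    }
};
```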
@WillemKauf WillemKauf force-pushed the dirty_ratio_compaction branch from 4cdfca5 to e62dee6 Compare January 10, 2025 17:41
@WillemKauf (Contributor Author)

Force push to:

  • Remove template solution for subtract_tag{}, add_tag{} and add boilerplate functions add/subtract_dirty/closed_segment_bytes()

@WillemKauf WillemKauf requested a review from andrwng January 10, 2025 19:48