Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datalake: remove offset translation from translation_stm (and add compaction_test.py) #24610

Open
wants to merge 5 commits into
base: dev
Choose a base branch
from

Conversation

WillemKauf
Copy link
Contributor

Previously, the datalake::translation::translation_stm would return its max collectible as the following:

model::offset translation_stm::max_collectible_offset() {
if (!_raft->log_config().iceberg_enabled()) {
return model::offset::max();
}
// if offset is not initialized, do not attempt translation.
if (_highest_translated_offset == kafka::offset{}) {
return model::offset{};
}
return _raft->log()->to_log_offset(
kafka::offset_cast(_highest_translated_offset));
}

This offset translation leads to an overly restrictive condition for the max collectible offset, due to the fact that it is translation batch unaware.

Here, the utility function get_translated_log_offset() is added, which returns the "equivalent" translated log offset for a given kafka offset, taking into account translation batches (which don't need to be translated, and thus shouldn't restrict the max collectible offset).

Use of this function is plumbed through the partition_translator and the translation_stm, and we now bookkeep the _highest_translated_log_offset in the translation_stm to avoid any offset translation within it.

Additionally, a new test for compaction with an Iceberg enabled topic is added to datalake/compaction_test.py, with some enhancements to the datalake_verifier service to make it compaction aware.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Improvements

  • Fixes an overly restrictive condition for retention in Iceberg-enabled topics.

@@ -24,22 +24,24 @@

class DatalakeVerifier():
"""
Verifier that does the verification of the data in the redpanda Iceberg table.
The verifier consumes offsets from specified topic and verifies it the data
Verifier that does the verification of the data in the redpanda Iceberg table.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace removal

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 19, 2024

Retry command for Build#59935

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_ChunkedRead == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[2,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True, SpilloverManifestUploaded == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True, SpilloverManifestUploaded == True)"}}
tests/rptest/tests/write_caching_fi_test.py::WriteCachingFailureInjectionTest.test_crash_all
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[2,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, SpilloverManifestUploaded == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True)"}}
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all@{"use_transactions":false}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_topic_lifecycle@{"cloud_storage_type":1,"filesystem_catalog_mode":false}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_Timequery == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"path"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True)"}}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_topic_lifecycle@{"cloud_storage_type":1,"filesystem_catalog_mode":true}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_Timequery == True, SpilloverManifestUploaded == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"path"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True, SpilloverManifestUploaded == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, AdjacentSegmentMergerReupload == True)"}}

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 19, 2024

CI test results

test results on build#59935
test_id test_kind job_url test_status passed
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.filesystem_catalog_mode=False ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c804-449e-983b-f6040b48eed2 FAIL 0/6
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.filesystem_catalog_mode=False ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f3-40db-9e3c-b72686edcfd1 FAIL 0/6
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.filesystem_catalog_mode=True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c806-4851-8e40-029c7bdf36d7 FAIL 0/6
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.filesystem_catalog_mode=True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f4-48f5-b848-e36477ea95e1 FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c803-4c2f-9cd0-86f9a8dd5064 FLAKY 5/6
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.ABS.2.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c807-4275-8982-8723111a2347 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.ABS.2.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f4-48f5-b848-e36477ea95e1 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.ABS.2.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c803-4c2f-9cd0-86f9a8dd5064 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.ABS.2.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f6-49d5-ae5a-d5eb3497ad6e FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.path.test_case=.TS_Read==True.TS_TxRangeMaterialized==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c804-449e-983b-f6040b48eed2 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.path.test_case=.TS_Read==True.TS_TxRangeMaterialized==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f2-467e-a057-0e4a790311ae FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.path.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c806-4851-8e40-029c7bdf36d7 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.path.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f3-40db-9e3c-b72686edcfd1 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.AdjacentSegmentMergerReupload==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f6-49d5-ae5a-d5eb3497ad6e FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.SpilloverManifestUploaded==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c807-4275-8982-8723111a2347 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.SpilloverManifestUploaded==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f4-48f5-b848-e36477ea95e1 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_ChunkedRead==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c803-4c2f-9cd0-86f9a8dd5064 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_Timequery==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c804-449e-983b-f6040b48eed2 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_Timequery==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f2-467e-a057-0e4a790311ae FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_Timequery==True.SpilloverManifestUploaded==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c806-4851-8e40-029c7bdf36d7 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_Timequery==True.SpilloverManifestUploaded==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f3-40db-9e3c-b72686edcfd1 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c807-4275-8982-8723111a2347 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f4-48f5-b848-e36477ea95e1 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c803-4c2f-9cd0-86f9a8dd5064 FAIL 0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f6-49d5-ae5a-d5eb3497ad6e FAIL 0/1
rptest.tests.write_caching_fi_e2e_test.WriteCachingFailureInjectionE2ETest.test_crash_all.use_transactions=False ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c804-449e-983b-f6040b48eed2 FAIL 0/1
rptest.tests.write_caching_fi_e2e_test.WriteCachingFailureInjectionE2ETest.test_crash_all.use_transactions=False ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f4-48f5-b848-e36477ea95e1 FAIL 0/1
rptest.tests.write_caching_fi_test.WriteCachingFailureInjectionTest.test_crash_all ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c803-4c2f-9cd0-86f9a8dd5064 FAIL 0/1
rptest.tests.write_caching_fi_test.WriteCachingFailureInjectionTest.test_crash_all ducktape https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f3-40db-9e3c-b72686edcfd1 FAIL 0/1
test results on build#60023
test_id test_kind job_url test_status passed
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/60023#0193e62a-cfd5-4342-9d5c-bd82d737eba0 FLAKY 2/6
test results on build#60079
test_id test_kind job_url test_status passed
rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_concurrent_truncations.cloud_storage_enabled=True.truncate_point=start_offset ducktape https://buildkite.com/redpanda/redpanda/builds/60079#0193f494-6494-4f55-ac0e-69cf1956752e FLAKY 5/6

@WillemKauf
Copy link
Contributor Author

WillemKauf commented Dec 19, 2024

Lot of KgoVerifierProducer failures, panic: Out of order offset 0 (vs 0 20000).

Not sure if this is another KgoVerifierProducer issue or if something else has been broken.

The only related change I can see in KgoVerifier was this, in which pw.validOffsets.Insert() is now called under a lock in new function OnAcked (but CI must have ran for this change many times before seeing these failures, so I am uncertain)

EDIT: Probably just because of the oneshot() changes I made. Reverted.

@WillemKauf WillemKauf force-pushed the datalake_translator_offset_fix branch from 6c113d7 to 0e1a24c Compare December 20, 2024 14:32
@WillemKauf
Copy link
Contributor Author

Force push to:

  • Revert changes to kgo_verifier_service::oneshot().

@WillemKauf WillemKauf force-pushed the datalake_translator_offset_fix branch from 0e1a24c to 2fe6c55 Compare December 20, 2024 15:05
@vbotbuildovich
Copy link
Collaborator

Retry command for Build#60016

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/datalake/compaction_test.py::CompactionTest.test_compaction@{"cloud_storage_type":1}

@WillemKauf WillemKauf force-pushed the datalake_translator_offset_fix branch from 2fe6c55 to b01095c Compare December 20, 2024 21:09
@WillemKauf
Copy link
Contributor Author

Force push to:

  • Change compaction wait condition in compaction_test.py. Translation seems to slow the compaction process down quite a bit.

b.add_batch(std::move(batch)).get();
};

make_and_add_record(model::record_batch_type::raft_data);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would make sense to add a test case where we start from configuration batch as this is what it looks like in real life, configuration is always the first batch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the callout here, I have added two new test cases which begin with configuration batches to translated_log_offset_test.

And most importantly, add the function `get_translated_log_offset()`.
This function will be used to compute the appropriate highest translated
log offset for a given translated kafka offset while taking into account
translator batches.

This is an important function for removing offset translation from within
the `translation_stm`, and to be less pessimistic about the
`max_collectible_offset` returned by the `stm` in the future.
Make use of the added utility function `get_translated_log_offset()`
throughout `partition_translator.cc` and `state_machine.cc` in order
to update the `translation_stm::_highest_translated_log_offset` as
translation occurs.

`translation_stm::max_collectible_offset()` now returns the
`_highest_translated_log_offset` instead of performing offset translation
for the `highest_translated_offset`, which is currently more restrictive
than it should be for housekeeping (due to it being translator batch unaware).
By handling gaps in offsets and recording seen keys, we can validate the
correctness of a compacted log that has been translated (fully) into an
iceberg table.
Adds a new `test_compaction` test, which uses the `KgoVerifierSeqConsumer`
to validate a fully compacted log, along with the `datalake_verifier`
service to validate the Iceberg table.

Also moves the contents of `compaction_gaps_test.py` into
`compaction_test.py`.
@WillemKauf WillemKauf force-pushed the datalake_translator_offset_fix branch from b01095c to 4bd7693 Compare December 23, 2024 16:19
@WillemKauf
Copy link
Contributor Author

Force push to:

  • Add two new tests to translated_log_offset_test.cc
  • Remove early return in get_translated_log_offset() to correct behavior for the edge case of kafka::offset{}
  • Add comment to get_translated_log_offset() declaration about its use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants