
[Tests passing] [2.0] Add initial eq-to-pos delete job #356

Merged: 5 commits merged into 2.0 from equality-to-position-job-session on Jan 27, 2025

Conversation

Zyiqin-Miranda (Member)

Opening this for overall high-level feedback.

@Zyiqin-Miranda force-pushed the equality-to-position-job-session branch from 1c30d06 to 3d5149d on November 6, 2024 20:03
@Zyiqin-Miranda (Member Author)

The first version of the converter, with a test verifying correctness, is working here.
For easier review, an overview of the converter as it currently stands:

  1. Fetch all equality delete files, data files, and previous position delete files in one loop, keyed by partition value.
  2. Each bucket's files carry a file sequence number (similar to the storage layer's stream_position), and we fetch ONLY the data files relevant to the equality delete files. "Relevant" follows the Iceberg spec: an equality delete file must be applied to a data file when all of the following are true:
     - The data file's data sequence number is strictly less than the delete's data sequence number.
     - The data file's partition (both spec id and partition values) is equal to the delete file's partition, or the delete file's partition spec is unpartitioned.
  3. The convert remote function uses the Daft native reader to read only the hash values of the merge key (primary key) columns, appends file_path and row_index, and uses zero-copy PyArrow is_in and filter functions to find the positions to delete (a sketch of this step follows the list).
  4. Upload the new position delete files to S3 and commit an overwrite snapshot (replace is not supported yet).
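
A minimal PyArrow sketch of the is_in/filter step in item 3 (toy values; column names like "primarykey", "file_path", and "pos" are illustrative, not the converter's actual schema):

import pyarrow as pa
import pyarrow.compute as pc

# Hashed primary keys read from one data file, with file path and row index attached.
data_file_table = pa.table({
    "primarykey": ["a", "b", "c", "d"],
    "file_path": ["s3://bucket/data-1.parquet"] * 4,
    "pos": [0, 1, 2, 3],
})

# Hashed primary keys read from the applicable equality delete files.
equality_delete_table = pa.table({"primarykey": ["b", "d"]})

# Membership test: mark rows whose key appears in the equality deletes.
matches = pc.is_in(
    data_file_table["primarykey"],
    value_set=equality_delete_table["primarykey"].combine_chunks(),
)

# Keep only (file_path, pos) for matched rows; this becomes the position delete content.
positional_delete_table = data_file_table.filter(matches).select(["file_path", "pos"])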

@Zyiqin-Miranda marked this pull request as ready for review on January 2, 2025 18:51
@raghumdani (Collaborator) left a comment:

Thanks for putting this full implementation together. Great work so far. A couple of things I think would be useful here:

  1. Modularize all the invocations of the catalog client so that we can write unit tests for them independently, and swap them out when internal catalog support is available (an illustrative sketch follows this list).
  2. I would not compare hashes here. Although the probability is low, collisions can theoretically occur and we cannot detect or recover from them.
  3. We can add e2e functional tests; I see only one sanity test so far.
  4. Would it be simpler to have a separate package for this implementation and use deltacat as a dependency of that package, since there is only a one-way dependency? I fear we may create highly coupled functions over time, making maintenance of deltacat (with DeltaCAT 2.0) harder.
  5. Move all the short-term hacks, like the overrides, into a _private module to emphasize the danger of taking any dependency on those functions.
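
For item 1, an illustrative shape of that seam (the interface name and method signatures below are hypothetical, not an existing deltacat API):

from typing import Any, Dict, List, Protocol


class ConverterCatalogClient(Protocol):
    # All catalog interactions go through one interface so unit tests can stub
    # it and an internal catalog implementation can replace it later.
    def load_table(self, table_identifier: str) -> Any:
        ...

    def commit_position_deletes(self, table: Any, files: List[Dict[str, Any]]) -> None:
        ...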

deltacat/compute/compactor/utils/system_columns.py (review thread resolved; outdated)
deltacat/compute/converter/steps/convert.py (review thread resolved; outdated)
deltacat/compute/converter/steps/convert.py (review thread resolved; outdated)
data_file_table["primarykey"],
equality_delete_table["primarykey"],
)
positional_delete_table = data_file_table.filter(equality_deletes)
Collaborator:

This looks like a table we get after filtering all the rows matching equality delete values.

Member Author:

Are you suggesting a naming change here?

Collaborator:

Yes, the name doesn't reflect what it really is.

)

from deltacat.utils.daft import _get_s3_io_config
# TODO: Use Daft SHA1 hash instead to minimize probability of data corruption
Collaborator:

As long as we use sha1, we are at the mercy of probability. Although chances are low, it can happen and cause correctness issues. I don't think we should be using any kind of hashing here to check for equality.

Member:

I think using SHA-1 is still the right choice here to keep memory requirements predictable. Although a probability of collision exists, some probability of data corruption always exists that may be outside of your control (beyond perhaps eventually consistent detection and rectification mechanisms).

In this case, the chance of a SHA-1 collision is likely lower than the probability of introducing corrupt results anyway due to, say, writing back results from memory corrupted by a hardware failure (e.g., given the non-zero frequency of these types of errors observed while running compaction and similar jobs at scale internally at Amazon over the past year).

Member:

If you want a middle-ground approach, perhaps you could choose a record count at which the probability of collision is high enough to necessitate switching (e.g., perhaps in the septillions of records, assuming we ever manage datasets that get there).

@Zyiqin-Miranda (Member Author), Jan 21, 2025:

+1 for continuing to use SHA1 as the hashing approach here.
Regarding @raghumdani's concern about using SHA1 for primary key lookups: by the birthday problem, we only reach a 50% collision probability at roughly 1.4 × 10^24 records (on the order of a septillion records in ONE bucket; see the quick bound check after this list), which is not a reasonable expectation for our current data volume. So I don't think that will be a legitimate concern for us in the near future.
The pros of hashing primary keys are:

  1. Simplified memory estimation logic.
  2. Avoids OOM.
  3. Efficient cluster usage; no need to account for variable-length string primary keys.
  4. Avoids errors in resource estimation, since we don't need a file sampling job; any error in file sampling can cause the actual job, which has an SLA expectation, to fail.
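
As a rough check on the collision figure above, the standard birthday-bound approximation for a full 160-bit SHA-1 digest gives:

$$ n_{0.5} \approx \sqrt{2 N \ln 2} = \sqrt{2 \cdot 2^{160} \cdot \ln 2} \approx 1.18 \cdot 2^{80} \approx 1.4 \times 10^{24} $$

i.e., on the order of a septillion hashed keys in a single bucket before the probability of any collision reaches 50%, consistent with the "septillions of records" figure mentioned above.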

Collaborator:

To my mind, correctness is the deal-breaker here, however low the probability is. We can figure out how to keep memory requirements stable as a separate problem, for which there is already a working implementation in deltacat. As a side note, we have never seen (nor will we see) corrupted results written back due to hardware failures, since we have checksums in Parquet and S3 performs an integrity check on multipart uploads by default. However, we have had issues corrupting RCFs due to code bugs we introduced. In the current compactor this is already a risk: from the get-go we have hashing disabled for the majority of tables, and we recently introduced the SHA1_HASHING_FOR_MEMORY_OPTIMIZATION_DISABLED environment variable to disable it for all tables.

Collaborator:

Can we implement the option to toggle this instead of creating a TODO? I believe it's not too much effort (just don't call the hash() method on line 185). This PR already has a lot of tech debt, and we want to avoid creating more.
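
A minimal sketch of such a toggle, assuming the SHA1_HASHING_FOR_MEMORY_OPTIMIZATION_DISABLED environment variable mentioned earlier in this thread is reused for the converter (the flag name, its truthy values, and the call-site wiring are assumptions, not the merged code):

import os


def sha1_hashing_enabled() -> bool:
    # Assumed flag semantics, mirroring the compactor env var discussed above.
    disabled = os.environ.get("SHA1_HASHING_FOR_MEMORY_OPTIMIZATION_DISABLED", "")
    return disabled.strip().upper() not in ("TRUE", "1")


# At the call site (illustrative): only call hash() on the primary-key column
# when hashing is enabled; otherwise compare raw key values directly.
# pk_values = hash_fn(pk_values) if sha1_hashing_enabled() else pk_values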

Member Author:

It's not just about not calling hash(), right? It also changes how you estimate memory resources depending on which method you're using, since you'll get variable-length string primary keys if you get rid of the hash.

@raghumdani (Collaborator), Jan 23, 2025:

Correct, and that part is buggy even in the current state, as Daft would end up reading the entire pk column since you are not using the streaming reader. You already have a TODO for it.

Member Author:

My understanding of this conversation is that the entire pk column will not be kept in memory:

Miranda Zhu (Nov 15th, 2024 at 9:40 AM):
I wonder if there is a way to apply the UDF while downloading the columns? Specifically, we'd like to only keep the columns with the UDF applied in memory and discard the original columns; that could be really useful to us too.

Jay Chia:
The new execution engine should apply these in a pipelined fashion, so yes, it would happen automatically if you did something like:
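
(The snippet that followed in the quoted conversation was not captured above. A hypothetical sketch of the pipelined, hash-only projection being described, assuming Daft's read_parquet/udf/select APIs and an illustrative "primarykey" column and S3 path:)

import hashlib

import daft
from daft import col


@daft.udf(return_dtype=daft.DataType.binary())
def sha1_hash(keys: daft.Series):
    # Hash each primary-key value; only the hashed column gets materialized.
    return [hashlib.sha1(str(k).encode("utf-8")).digest() for k in keys.to_pylist()]


# Project only the hashed key column; with pipelined execution, the original
# variable-length string column is not kept resident in memory.
df = daft.read_parquet("s3://bucket/path/data-file.parquet").select(
    sha1_hash(col("primarykey")).alias("primarykey_hash")
)
hashed_keys = df.to_arrow()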

Collaborator:

Nice, you can add a TODO.

deltacat/compute/converter/utils/s3u.py (review thread resolved)
@Zyiqin-Miranda force-pushed the equality-to-position-job-session branch 9 times, most recently from 0550d6b to 7024222, on January 19, 2025 06:28
@Zyiqin-Miranda changed the title from "[WIP] Add eq-to-pos delete job session draft" to "[Tests passing] Add eq-to-pos delete job session draft" on Jan 19, 2025
@Zyiqin-Miranda changed the base branch from main to 2.0 on January 19, 2025 06:53
@Zyiqin-Miranda force-pushed the equality-to-position-job-session branch from 7024222 to b6bda23 on January 19, 2025 07:10
@Zyiqin-Miranda changed the title from "[Tests passing] Add eq-to-pos delete job session draft" to "[Tests passing] Add initial eq-to-pos delete job" on Jan 19, 2025
@Zyiqin-Miranda changed the title from "[Tests passing] Add initial eq-to-pos delete job" to "[Tests passing] [2.0] Add initial eq-to-pos delete job" on Jan 20, 2025
@Zyiqin-Miranda force-pushed the equality-to-position-job-session branch from b6bda23 to 79801d1 on January 21, 2025 23:59
@raghumdani (Collaborator) left a comment:

Overall, this looks better than the previous version. Please address a few more comments and ensure the GitHub checks are passing.

table_metadata=iceberg_table.metadata,
files_dict_list=to_be_added_files_dict_list,
)
commit_overwrite_snapshot(
Collaborator:

I may have missed this in the first review. We need to gracefully handle any error from commit conflicts, which would be resolved by a new job run.
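
For reference, one possible shape of that graceful handling, assuming PyIceberg's CommitFailedException is what surfaces on a conflicting commit and that the commit helper keeps roughly the call shape shown in the diff above (both are assumptions, not the merged implementation):

import time

from pyiceberg.exceptions import CommitFailedException


def commit_with_conflict_retry(commit_fn, iceberg_table, files_dict_list, max_attempts=3):
    # Retry a conflicting snapshot commit a few times before surfacing the
    # error for a new job run to resolve.
    for attempt in range(1, max_attempts + 1):
        try:
            commit_fn(
                table_metadata=iceberg_table.metadata,
                files_dict_list=files_dict_list,
            )
            return
        except CommitFailedException:
            if attempt == max_attempts:
                raise
            # Pick up the latest table metadata before retrying (assumed
            # refresh mechanism; reloading from the catalog also works).
            iceberg_table = iceberg_table.refresh()
            time.sleep(2 ** attempt)

# Usage (illustrative): commit_with_conflict_retry(commit_overwrite_snapshot, iceberg_table, to_be_added_files_dict_list)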

deltacat/compute/converter/model/convert_session_params.py (review thread resolved; outdated)
Comment on lines 34 to 35
def catalog(self):
return self["catalog"]
Collaborator:

Can we enrich the typing here?
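
For illustration, the kind of typing enrichment being requested might look like the sketch below; ConvertSessionParams and CatalogProperties are placeholder names (the params class appears to be dict-backed given the self["catalog"] access above, and the real catalog type is whatever the session actually stores):

from typing import Any, Dict, Optional


class ConvertSessionParams(Dict[str, Any]):
    # "CatalogProperties" stands in for the concrete catalog client type.
    @property
    def catalog(self) -> Optional["CatalogProperties"]:
        return self.get("catalog")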

@@ -0,0 +1,90 @@
from typing import Optional, Dict
from deltacat.exceptions import RetryableError
Collaborator:

FYI, I have assumed you've done the due diligence to ensure the estimation is accurate; I'm not going into greater depth on this business logic.

import s3fs


def get_credential():
Collaborator:

I have seen this method duplicated in multiple places.
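
One way to address the duplication is a single shared helper, sketched here under the assumption that the duplicated bodies resolve credentials via boto3's default provider chain (the module location and exact wiring are illustrative):

import boto3
import s3fs


def get_credential():
    # Resolve AWS credentials once via the default provider chain.
    return boto3.Session().get_credentials()


def get_s3_file_system():
    credential = get_credential()
    return s3fs.S3FileSystem(
        key=credential.access_key,
        secret=credential.secret_key,
        token=credential.token,
    )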

deltacat/compute/converter/utils/s3u.py (review thread resolved)
@raghumdani (Collaborator) left a comment:

Conditional approval, provided all the TODOs are taken as fast follow-ups.

@Zyiqin-Miranda (Member Author) commented Jan 25, 2025:

Converter features still to be implemented, tracked here for future PR references.

P0. Multiple identifier columns: column concatenation plus the relevant memory estimation change.
P0. Verify that the position deletes written out can be read by Spark, probably done in a unit test setup using the 2.0 Docker image.
P0. Switch to constructing equality delete tables using Spark, probably done in a unit test using the 2.0 Docker image.
P0. Any model changes we might need for the new 2.0 storage model, e.g., converting only certain partitions, reading a "delta", etc.
P0. Daft SHA1 hash support.

P1. Currently assuming one node can fit one hash bucket; adjust the number of data files downloaded in parallel in the convert function.
P1. Investigate PyIceberg REPLACE snapshot committing. The self-implemented REPLACE snapshot commit is currently not working as expected; "correct" here means being able to read the REPLACE snapshot using Spark. For now, we reuse the OVERWRITE snapshot committing strategy from PyIceberg.
P1. Investigate REPLACE snapshot committing using a starting sequence number to avoid conflicts. Not entirely sure we need this, given that a workaround may be feasible through the internal catalog implementation, so deprioritized to P1.
P1. Support merging/compacting small position delete files.
P1. Spark read performance for position deletes. Position deletes can be correctly matched to their corresponding data files by setting lower_bounds == upper_bounds == file_path, even with multiple data files; Spark does not scan the whole partition's position deletes into memory when merging on read.

@Zyiqin-Miranda force-pushed the equality-to-position-job-session branch from cf14f41 to 013e2f4 on January 27, 2025 04:37
@Zyiqin-Miranda merged commit 9d15ee5 into 2.0 on Jan 27, 2025
3 checks passed
@Zyiqin-Miranda (Member Author):

Merged this PR as all checks have passed.

@Zyiqin-Miranda (Member Author):

Tracked in issue.
