Add Support for Dynamic Overwrite #931


Merged 30 commits into apache:main on Dec 19, 2024

Conversation

Contributor

@jqin61 jqin61 commented Jul 15, 2024

Added support for dynamic overwrite, leveraging delete and fast-append (the counterpart in Iceberg Spark).

Several follow-ups:

  • support the current spec with transformed fields. This should be easy, but due to the number of transforms it will take some time; they will be added bit by bit in follow-up PRs.
  • consider whether to raise a UserWarning when no delete is executed, because from the perspective of dynamic-overwrite users, it should not matter whether the operation is a pure append or a partition replacement.

Closes #1287
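
In short, the new API looks like this (a minimal usage sketch; the catalog and table names here are placeholders, not from this PR):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
tbl = catalog.load_table("default.cities")  # partitioned, e.g. identity on "city"

# Partitions present in df are deleted first, then df is appended.
df = pa.Table.from_pylist([{"city": "Paris", "lat": 48.864716, "long": 2.349014}])
tbl.dynamic_partition_overwrite(df)
```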

@sungwy sungwy requested review from Fokko and sungwy July 15, 2024 19:35
Collaborator

@sungwy sungwy left a comment

Hi @jqin61, this PR is looking great. I left a few nit suggestions and a few pointers to incorporate new features like merge_append.

@@ -502,6 +503,71 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT)
for data_file in data_files:
    append_files.append_data_file(data_file)

def _build_partition_predicate(self, spec_id: int, delete_partitions: List[Record]) -> BooleanExpression:
Collaborator

nit: I found the delete_partitions argument a bit confusing here, because this function just translates a set of partition record values into its corresponding predicate. Could we rename it to something more generic to indicate that? We should also remove spec_id, which isn't used in this function.

Suggested change
def _build_partition_predicate(self, spec_id: int, delete_partitions: List[Record]) -> BooleanExpression:
def _build_partition_predicate(self, partition_records: List[Record]) -> BooleanExpression:
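
For intuition, here is a sketch of what such a translation might look like (illustrative only, not the PR's code; it assumes identity partitions, at least one partition field, and that `Record` supports positional indexing; a real implementation would also need to handle null partition values with `IsNull`):

```python
from functools import reduce
from typing import List

from pyiceberg.expressions import AlwaysFalse, And, BooleanExpression, EqualTo, Or
from pyiceberg.typedef import Record


def build_partition_predicate(field_names: List[str], partition_records: List[Record]) -> BooleanExpression:
    # OR together one match per partition record; each match ANDs one EqualTo per field.
    expr: BooleanExpression = AlwaysFalse()
    for record in partition_records:
        match = reduce(And, [EqualTo(name, record[pos]) for pos, name in enumerate(field_names)])
        expr = Or(expr, match)
    return expr
```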

Contributor Author

good catch, thanks!

Contributor

@Fokko Fokko left a comment

Hey @jqin61 Good seeing you here again! 🙌 I'll do a more in-depth review tomorrow morning. Could you also document this in the docs under mkdocs/? Otherwise folks won't be able to find this awesome feature 👍

@@ -502,6 +503,73 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT)
for data_file in data_files:
    append_files.append_data_file(data_file)

def _build_partition_predicate(self, partition_records: List[Record]) -> BooleanExpression:
Contributor

Can you add two tests where:

  • Start with an unpartitioned table, write some data, and then evolve it to a partitioned table.
  • Start with a monthly-partitioned table, insert data for a few days, convert the partition to a daily partition, and dynamically overwrite a single day.

Contributor Author

Thanks for the thorough review! If the table is initially unpartitioned and then evolved into a partitioned table, I think the expected behavior is that dynamic overwrite will also delete (potentially through an overwrite) the data in the unpartitioned files?

Contributor

Fokko commented Jul 17, 2024

Left some more comments @jqin61, thanks for working on this 👍

@sungwy sungwy mentioned this pull request Aug 1, 2024
Contributor

Fokko commented Aug 7, 2024

@jqin61 Sorry for the slow review, I was doing some other stuff as well. Can you fix the merge conflicts? I think this looks good to go 👍

Contributor Author

jqin61 commented Aug 7, 2024

> @jqin61 Sorry for the slow review, I was doing some other stuff as well. Can you fix the merge conflicts? I think this looks good to go 👍

Thank you, Fokko! Sorry for the delay; I have been extremely busy recently. I will get some time next weekend to address the comments, add tests, and fix the documentation. I will also move the transform support out of the scope of this PR due to its complexity; I will send you details about it soon.

@jqin61 jqin61 requested a review from sungwy September 18, 2024 00:18
Collaborator

@sungwy sungwy left a comment

Hi @jqin61 - this looks good to me. I've added some nit suggestions to the documentation.

Thank you again for working on this amazing feature!

@@ -353,6 +353,127 @@ lat: [[52.371807,37.773972,53.11254],[53.21917]]
long: [[4.896029,-122.431297,6.0989],[6.56667]]
```

### Partial overwrites

You can use overwrite with an overwrite filter `tbl.overwrite(df,overwrite_filter)` to delete partial table data which matches the filter before appending new data.
Collaborator

Suggested change
You can use overwrite with an overwrite filter `tbl.overwrite(df,overwrite_filter)` to delete partial table data which matches the filter before appending new data.
When using the `overwrite` API, you can use an `overwrite_filter` to delete data that matches the filter before appending new data into the table.

tbl.overwrite(df, overwrite_filter=EqualTo('city', "Paris"))
```
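
For reference, a self-contained sketch of the partial-overwrite flow (the catalog and table names are placeholders, not from this PR):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog("default")
tbl = catalog.load_table("default.cities")

# New data for Paris; rows matching the filter are deleted before this is appended.
df = pa.Table.from_pylist([{"city": "Paris", "lat": 48.864716, "long": 2.349014}])
tbl.overwrite(df, overwrite_filter=EqualTo("city", "Paris"))
```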

This results in such data if data is printed by `tbl.scan().to_arrow()`:
Collaborator

Suggested change
This results in such data if data is printed by `tbl.scan().to_arrow()`:
This produces the following result with `tbl.scan().to_arrow()`:

long: [[74.006],[4.896029,6.0989,2.349014]]
```

If the PyIceberg table is partitioned, you can use `tbl.dynamic_partition_overwrite(df)` to replace the partitions with new ones provided in the dataframe. The partitions to be replaced are detected automatically.
Collaborator

Suggested change
If the PyIceberg table is partitioned, you can use `tbl.dynamic_partition_overwrite(df)` to replace the partitions with new ones provided in the dataframe. The partitions to be replaced are detected automatically.
If the PyIceberg table is partitioned, you can use `tbl.dynamic_partition_overwrite(df)` to replace the existing partitions with new ones provided in the dataframe. The partitions to be replaced are detected automatically from the provided arrow table.

```

If the PyIceberg table is partitioned, you can use `tbl.dynamic_partition_overwrite(df)` to replace the partitions with new ones provided in the dataframe. The partitions to be replaced are detected automatically.
To try out it, you could firstly create a same PyIceberg table with partition specified on `"city"` field:
Collaborator

Suggested change
To try out it, you could firstly create a same PyIceberg table with partition specified on `"city"` field:
For example, with an Iceberg table with a partition specified on the `"city"` field:

)
```
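
The table creation being referenced might look like the following sketch (field IDs are illustrative and must match the schema; `catalog` is assumed to be loaded via `load_catalog`):

```python
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import IdentityTransform
from pyiceberg.types import DoubleType, NestedField, StringType

schema = Schema(
    NestedField(field_id=1, name="city", field_type=StringType(), required=False),
    NestedField(field_id=2, name="lat", field_type=DoubleType(), required=False),
    NestedField(field_id=3, name="long", field_type=DoubleType(), required=False),
)

# Identity-partition the table on "city" so dynamic overwrite can target it.
spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=IdentityTransform(), name="city")
)

tbl = catalog.create_table("default.cities", schema=schema, partition_spec=spec)
```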

And then suppose the data for the partition of `"paris"` is wrong:
Collaborator

Suggested change
And then suppose the data for the partition of `"paris"` is wrong:
And we want to overwrite the data for the partition of `"Paris"`:

tbl.append(df)
```

Then you could use dynamic overwrite on this partition:
Collaborator

Suggested change
Then you could use dynamic overwrite on this partition:
Then we can call `dynamic_partition_overwrite` with this arrow table:

tbl.dynamic_partition_overwrite(df_corrected)
```
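
Here, `df_corrected` would be an arrow table containing only the corrected partition's rows, along these lines (values are illustrative):

```python
import pyarrow as pa

# Only the "Paris" partition appears in the dataframe, so only that partition is replaced.
df_corrected = pa.Table.from_pylist([{"city": "Paris", "lat": 48.864716, "long": 2.349014}])
```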

This results in such data if data is printed by `tbl.scan().to_arrow()`:
Collaborator

Suggested change
This results in such data if data is printed by `tbl.scan().to_arrow()`:
This produces the following result with `tbl.scan().to_arrow()`:

Comment on lines 487 to 489
The function detects partition values in the provided arrow table that using the current table
partition spec, and deletes existing partitions matching these values. Finally, the
data in the table is appended to the table.
Collaborator

Suggested change
The function detects partition values in the provided arrow table that using the current table
partition spec, and deletes existing partitions matching these values. Finally, the
data in the table is appended to the table.
The function detects partition values in the provided arrow table using the current
partition spec, and deletes existing partitions matching these values. Finally, the
data in the table is appended to the table.

Contributor Author

@jqin61 jqin61 left a comment

@sungwy Thank you for the detailed wording enhancement guidance. I updated the docs. Please re-review when you get a chance.

Collaborator

sungwy commented Sep 20, 2024

Thank you for making this contribution @jqin61 ! I'll leave this PR open for another review, especially given that it introduces a new table commit API

Collaborator

sungwy commented Sep 24, 2024

Hi @Fokko - this PR looks good from my end.

Would you have some time to take a look? Since this is a new API (which comes with another level of caution), I'd love to get your review before we merge in @jqin61 's awesome work

@Fokko Fokko self-requested a review November 4, 2024 18:20
Contributor

Fokko commented Nov 4, 2024

@jqin61 @sungwy Sorry for leaving this hanging, I'll do a review first thing tomorrow 👍

@Fokko Fokko mentioned this pull request Nov 4, 2024
lat: double
long: double
----
city: [["New York"],["Amsterdam","Drachten","Paris"]]
Contributor

I don't think this example is correct. Paris should have been overwritten, right? It looks like we lost San Fran'.

@@ -456,6 +461,89 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT)
for data_file in data_files:
    append_files.append_data_file(data_file)

def _build_partition_predicate(self, partition_records: Set[Record]) -> BooleanExpression:
Contributor

Can we add a little doc here describing what the function does?
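
For instance, a short docstring along these lines (wording is illustrative, not the PR's final text):

```python
def _build_partition_predicate(self, partition_records: Set[Record]) -> BooleanExpression:
    """Build a filter predicate matching rows whose partition values are in the given set.

    Args:
        partition_records: A set of partition Records, one per partition to match.

    Returns:
        A predicate that matches every row belonging to any of the given partitions.
    """
```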

tbl = session_catalog.create_table(
    identifier=identifier,
    schema=TABLE_SCHEMA,
    properties={"format-version": "2"},
Contributor

Looks like we're testing the same thing twice :)

Suggested change
properties={"format-version": "2"},
properties={"format-version": str(format_version)},

# expecting 3 files:
rows = spark.sql(f"select partition from {identifier}.files").collect()
assert len(rows) == 3

Contributor

@Fokko Fokko Nov 5, 2024

I think this is also a good test to have:

@pytest.mark.integration
@pytest.mark.parametrize(
    "format_version",
    [1, 2],
)
def test_dynamic_partition_overwrite_rename_column(
    spark: SparkSession, session_catalog: Catalog, format_version: int
) -> None:
    arrow_table = pa.Table.from_pydict(
        {
            "place": ["Amsterdam", "Drachten"],
            "inhabitants": [921402, 44940],
        },
    )

    identifier = f"default.partitioned_{format_version}_dynamic_partition_overwrite_rename_column"
    try:
        session_catalog.drop_table(identifier)
    except Exception:
        pass

    tbl = session_catalog.create_table(
        identifier=identifier,
        schema=arrow_table.schema,
        properties={"format-version": str(format_version)},
    )


    with tbl.transaction() as tx:
        with tx.update_spec() as spec:
            spec.add_identity("place")

    tbl.append(arrow_table)

    with tbl.transaction() as tx:
        with tx.update_schema() as schema:
            schema.rename_column("place", "city")

    arrow_table = pa.Table.from_pydict(
        {
            "city": ["Drachten"],
            "inhabitants": [44941],  # A new baby was born!
        },
    )

    tbl.dynamic_partition_overwrite(arrow_table)
    result = tbl.scan().to_arrow()

    assert result['city'].to_pylist() == ['Drachten', 'Amsterdam']
    assert result['inhabitants'].to_pylist() == [44941, 921402]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pytest.mark.integration
@pytest.mark.parametrize(
    "format_version",
    [1, 2],
)
@pytest.mark.filterwarnings("ignore")
def test_dynamic_partition_overwrite_evolve_partition(
    spark: SparkSession, session_catalog: Catalog, format_version: int
) -> None:
    arrow_table = pa.Table.from_pydict(
        {
            "place": ["Amsterdam", "Drachten"],
            "inhabitants": [921402, 44940],
        },
    )

    identifier = f"default.partitioned_{format_version}_test_dynamic_partition_overwrite_evolve_partition"
    try:
        session_catalog.drop_table(identifier)
    except Exception:
        pass

    tbl = session_catalog.create_table(
        identifier=identifier,
        schema=arrow_table.schema,
        properties={"format-version": str(format_version)},
    )


    with tbl.transaction() as tx:
        with tx.update_spec() as spec:
            spec.add_identity("place")

    tbl.append(arrow_table)

    with tbl.transaction() as tx:
        with tx.update_schema() as schema:
            schema.add_column("country", StringType())
        with tx.update_spec() as spec:
            spec.add_identity("country")

    arrow_table = pa.Table.from_pydict(
        {
            "place": ["Groningen"],
            "country": ["Netherlands"],
            "inhabitants": [238147],
        },
    )

    tbl.dynamic_partition_overwrite(arrow_table)
    result = tbl.scan().to_arrow()

    assert result['place'].to_pylist() == ['Groningen', 'Amsterdam', 'Drachten']
    assert result['inhabitants'].to_pylist() == [238147, 921402, 44940]

Comment on lines 534 to 540
manifest_merge_enabled = property_as_bool(
    self.table_metadata.properties,
    TableProperties.MANIFEST_MERGE_ENABLED,
    TableProperties.MANIFEST_MERGE_ENABLED_DEFAULT,
)
update_snapshot = self.update_snapshot(snapshot_properties=snapshot_properties)
append_method = update_snapshot.merge_append if manifest_merge_enabled else update_snapshot.fast_append
Contributor

This logic is duplicated below as well, maybe move it into a function?
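
A sketch of the extracted helper (the method name and return value are illustrative; it assumes the same class context, where `property_as_bool` and `TableProperties` are already imported):

```python
def _append_snapshot_producer(self, snapshot_properties: Dict[str, str]):
    """Pick merge_append or fast_append based on the table's manifest-merge property."""
    manifest_merge_enabled = property_as_bool(
        self.table_metadata.properties,
        TableProperties.MANIFEST_MERGE_ENABLED,
        TableProperties.MANIFEST_MERGE_ENABLED_DEFAULT,
    )
    update_snapshot = self.update_snapshot(snapshot_properties=snapshot_properties)
    return update_snapshot.merge_append if manifest_merge_enabled else update_snapshot.fast_append
```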

Contributor

@Fokko Fokko left a comment

Some small stuff, but apart from that it looks good to me 👍 Thanks for working on this, and sorry for the long wait

Contributor

Fokko commented Nov 19, 2024

@jqin61 Do you have time to follow up on the last few comments? Would be great to get this in 👍

Contributor Author

jqin61 commented Dec 10, 2024

@Fokko @sungwy Thank you for the review and the suggestions! I fixed the latest comments; let's rerun CI and merge if it looks good to you.

Contributor Author

jqin61 commented Dec 11, 2024

Thanks for fixing the CI. Shall we rerun and merge, @Fokko? Thank you!

Collaborator

@sungwy sungwy left a comment

Thanks @jqin61 again for contributing this feature!

@sungwy sungwy merged commit 952d7c0 into apache:main Dec 19, 2024
8 checks passed
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 24, 2024