feat: validate snapshot write compatibility #1772


Open · wants to merge 7 commits into base: main

Conversation

@kaushiksrini (Contributor) commented Mar 6, 2025

Description

  • This PR validates snapshot write compatibility, checking that no conflicting concurrent operations have been committed since the base snapshot.
  • Added a Snapshot util file that implements `ancestors_between` and `ancestors_of`, ported from `SnapshotUtil.java`.
  • Commit conflict resolution and retries, as outlined in the spec, will be completed in a subsequent PR.

Solves #1678
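For reference, a minimal sketch of what these two helpers could look like (the `Snapshot` class below is a hypothetical stand-in for illustration; the real implementation works with pyiceberg's `Snapshot` and table-metadata types, so signatures may differ):

```python
from typing import Dict, Iterator, Optional


class Snapshot:
    """Hypothetical stand-in for pyiceberg's Snapshot model."""

    def __init__(self, snapshot_id: int, parent_snapshot_id: Optional[int] = None) -> None:
        self.snapshot_id = snapshot_id
        self.parent_snapshot_id = parent_snapshot_id


def ancestors_of(snapshot: Optional[Snapshot], lookup: Dict[int, Snapshot]) -> Iterator[Snapshot]:
    """Yield a snapshot and all of its ancestors by following parent links."""
    while snapshot is not None:
        yield snapshot
        if snapshot.parent_snapshot_id is None:
            break
        snapshot = lookup.get(snapshot.parent_snapshot_id)


def ancestors_between(
    latest: Optional[Snapshot], oldest_id: Optional[int], lookup: Dict[int, Snapshot]
) -> Iterator[Snapshot]:
    """Yield ancestors of `latest` down to, but not including, `oldest_id`."""
    for snap in ancestors_of(latest, lookup):
        if snap.snapshot_id == oldest_id:
            break
        yield snap
```

During commit validation, `ancestors_between` gives the snapshots that were committed between the expected base snapshot and the current branch head, which is exactly the set that needs to be checked for conflicts.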

@kaushiksrini kaushiksrini marked this pull request as draft March 7, 2025 03:53
@Fokko (Contributor) commented Mar 20, 2025

@kaushiksrini can you check the CI? It looks like mypy has some issues:

pyiceberg/table/update/snapshot.py:303: error: Item "None" of "Snapshot | None" has no attribute "snapshot_id"  [union-attr]
pyiceberg/table/update/snapshot.py:306: error: Item "None" of "Summary | None" has no attribute "operation"  [union-attr]
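For what it's worth, these `union-attr` errors usually mean an `Optional` value is dereferenced without narrowing; a common fix pattern looks like this (an illustrative sketch, not the exact code at those lines):

```python
from typing import Optional


class Snapshot:
    """Hypothetical stand-in for pyiceberg's Snapshot model."""

    def __init__(self, snapshot_id: int) -> None:
        self.snapshot_id = snapshot_id


def require_snapshot_id(snapshot: Optional[Snapshot]) -> int:
    # Narrow `Snapshot | None` to `Snapshot` before attribute access,
    # which satisfies mypy's union-attr check.
    if snapshot is None:
        raise ValueError("Expected a snapshot, but none was found")
    return snapshot.snapshot_id
```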

@Fokko (Contributor) commented Mar 20, 2025

Let's add some tests as well:

@pytest.mark.integration
@pytest.mark.parametrize("format_version", [1, 2])
def test_conflict_delete_delete(
    spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table, format_version: int
) -> None:
    identifier = "default.test_conflict"
    tbl1 = _create_table(session_catalog, identifier, {"format-version": str(format_version)}, [arrow_table_with_null])
    tbl2 = session_catalog.load_table(identifier)

    tbl1.delete("string == 'z'")

    with pytest.raises(CommitFailedException, match="(branch main has changed: expected id ).*"):
        # tbl2 isn't aware of the commit by tbl1
        tbl2.delete("string == 'z'")


@pytest.mark.integration
@pytest.mark.parametrize("format_version", [1, 2])
def test_conflict_delete_append(
    spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table, format_version: int
) -> None:
    identifier = "default.test_conflict"
    tbl1 = _create_table(session_catalog, identifier, {"format-version": str(format_version)}, [arrow_table_with_null])
    tbl2 = session_catalog.load_table(identifier)

    # This is allowed
    tbl1.delete("string == 'z'")
    tbl2.append(arrow_table_with_null)


@pytest.mark.integration
@pytest.mark.parametrize("format_version", [1, 2])
def test_conflict_append_delete(
    spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table, format_version: int
) -> None:
    identifier = "default.test_conflict"
    tbl1 = _create_table(session_catalog, identifier, {"format-version": str(format_version)}, [arrow_table_with_null])
    tbl2 = session_catalog.load_table(identifier)

    tbl1.append(arrow_table_with_null)

    with pytest.raises(CommitFailedException, match="(branch main has changed: expected id ).*"):
        # tbl2 isn't aware of the commit by tbl1
        tbl2.delete("string == 'z'")


@pytest.mark.integration
@pytest.mark.parametrize("format_version", [1, 2])
def test_conflict_append_append(
    spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table, format_version: int
) -> None:
    identifier = "default.test_conflict"
    tbl1 = _create_table(session_catalog, identifier, {"format-version": str(format_version)}, [arrow_table_with_null])
    tbl2 = session_catalog.load_table(identifier)

    tbl1.append(arrow_table_with_null)
    tbl2.append(arrow_table_with_null)

kaushiksrini and others added 4 commits March 22, 2025 22:23

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Co-authored-by: Fokko Driesprong <[email protected]>
@kaushiksrini kaushiksrini marked this pull request as ready for review March 27, 2025 18:10
Fokko added a commit to Fokko/iceberg-python that referenced this pull request Apr 9, 2025
Today, we have a copy of the `TableMetadata` on the `Table` and
the `Transaction`. This PR changes that logic to re-use the one
on the table, and add the changes to the one on the `Transaction`.

This also allows us to stack changes, for example, to first change
a schema, and then write data with the new schema right away.

Also a prerequisite for apache#1772
@sungwy (Collaborator) left a comment

Hi @kaushiksrini thanks for working on this PR! This is great progress.

Unfortunately, I think the documentation on commit retries is lacking and would benefit from an update. I've left some comments and links that will hopefully bring the feature closer to the Java implementation.

tbl2 = session_catalog.load_table(identifier)

tbl1.append(arrow_table_with_null)
tbl2.append(arrow_table_with_null)

Could we introduce an assertion here to verify that the content of the table is as we'd expect (three copies of arrow_table_with_null)?
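A sketch of that check with stand-in rows (hypothetical data; the real test would read the Iceberg table back, e.g. through a scan on `tbl2`):

```python
# Stand-in rows for the arrow_table_with_null fixture (illustrative only).
fixture = [{"string": "a"}, {"string": None}, {"string": "z"}]

# The table is created with one copy of the fixture and then appended to
# twice, so it should end up holding three full copies.
table_rows = fixture * 3  # create + two appends

assert len(table_rows) == 3 * len(fixture)
```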


# This is allowed
tbl1.delete("string == 'z'")
tbl2.append(arrow_table_with_null)

We should verify the content of the table here
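Sketching the expected content with stand-in rows (hypothetical data; the real test would scan the table):

```python
# Stand-in rows for the arrow_table_with_null fixture (illustrative only).
fixture = [{"string": "a"}, {"string": None}, {"string": "z"}]

# tbl1.delete("string == 'z'") removes the matching row from the initial
# copy, after which tbl2.append adds a fresh full copy of the fixture.
after_delete = [row for row in fixture if row["string"] != "z"]
expected = after_delete + fixture

assert len(expected) == 2 * len(fixture) - 1  # exactly one 'z' row removed
```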

# Define allowed operations for each type of operation
allowed_operations = {
    Operation.APPEND: {Operation.APPEND, Operation.REPLACE, Operation.OVERWRITE, Operation.DELETE},
    Operation.REPLACE: {Operation.APPEND},

Suggested change
-    Operation.REPLACE: {Operation.APPEND},
+    Operation.REPLACE: set(),

I think the spec may need a re-review: it's inaccurate to say that we only need to verify that the files we are trying to delete are still available when executing a REPLACE or DELETE operation.

In Spark, we also validate whether there have been conflicting appends when the SERIALIZABLE isolation level is used:

https://github.com/apache/iceberg/blob/9fc49e187069c7ec2493ac0abf20f73175b3df89/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L368-L374

I think it would be helpful to introduce all three isolation levels (NONE, SERIALIZABLE, and SNAPSHOT) and verify whether conflicting appends or deletes have been introduced in the underlying partitions, to align with the Spark implementation.
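To illustrate, a hedged sketch of what such a check could look like (the isolation-level names follow the Java/Spark implementation; the enums and the function shape here are hypothetical, not pyiceberg's actual API):

```python
from enum import Enum
from typing import Iterable


class Operation(Enum):
    APPEND = "append"
    REPLACE = "replace"
    OVERWRITE = "overwrite"
    DELETE = "delete"


class IsolationLevel(Enum):
    NONE = "none"
    SNAPSHOT = "snapshot"
    SERIALIZABLE = "serializable"


def validate_concurrent_operations(committed_ops: Iterable[Operation], isolation: IsolationLevel) -> None:
    """Raise if operations committed since the base snapshot conflict.

    Under SNAPSHOT isolation, operations that may have removed files we
    depend on conflict; under SERIALIZABLE, concurrent appends do as well.
    """
    if isolation is IsolationLevel.NONE:
        return
    conflicting = {Operation.REPLACE, Operation.OVERWRITE, Operation.DELETE}
    if isolation is IsolationLevel.SERIALIZABLE:
        conflicting.add(Operation.APPEND)
    for op in committed_ops:
        if op in conflicting:
            raise ValueError(f"Conflicting operation committed concurrently: {op.value}")
```

The committed operations would come from walking the snapshots between the expected base and the current branch head, which is what `ancestors_between` provides.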

@Fokko (Contributor) commented Apr 18, 2025

Thanks @sungwy for jumping in here, and creating the issues 🙌

Indeed, depending on whether we do snapshot or serializable isolation, we should allow for new data (or not). Would you be willing to split out the different levels in a separate PR? It would be nice to get this in so we can start working independently on the subtasks that you created.

I think this one was mostly blocked on #1903

@sungwy (Collaborator) commented Apr 18, 2025

I've created some subtasks on #819 that will help us implement the required validation functions that we can invoke to check that no conflicting commits have been made between two snapshots. @kaushiksrini would you be interested in helping out with some of those implementations?

Fokko added a commit that referenced this pull request Apr 18, 2025

# Rationale for this change

Today, we have a copy of the `TableMetadata` on the `Table` and the
`Transaction`. This PR changes that logic to re-use the one on the
table, and add the changes to the one on the `Transaction`.

This also allows us to stack changes, for example, to first change a
schema, and then write data with the new schema right away.

Also a prerequisite for
#1772

# Are these changes tested?

Includes a new test :)

# Are there any user-facing changes?

@kaushiksrini (Contributor, Author) commented

Hey @sungwy, thanks for the review! I'll address the feedback soon. I'll also take a look at the subtasks and would like to work on them!
