Fallback for upsert when arrow cannot compare source rows with target rows #1878


Open · koenvo wants to merge 10 commits into main

Conversation

@koenvo (Contributor) commented Apr 2, 2025

Rationale for this change

Upsert operations in PyIceberg rely on Arrow joins between source and target rows. However, Arrow Acero cannot compare certain complex types — like struct, list, and map — unless they’re part of the join key. When such types exist in non-join columns, the upsert fails with an error like:

ArrowInvalid: Data type struct<...> is not supported in join non-key field venue_geo

This PR introduces a fallback mechanism: if Arrow fails to join due to unsupported types, we fall back to comparing only the key columns. Non-key complex fields are ignored in the join condition, but still retained in the final upserted data.
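
In outline, the fallback joins the source rows against a key-only index of the target that carries a boolean marker column, so the join never has to compare complex non-key fields. A condensed sketch based on the snippets discussed in the review below (not the exact implementation):

import pyarrow as pa

MARKER_COLUMN_NAME = "__from_target"

def flag_existing_rows(source: pa.Table, target: pa.Table, join_cols: list[str]) -> pa.Table:
    # Key-only index of the target plus a marker column; complex non-key
    # fields never enter the join, so Acero can always compare the inputs.
    target_index = target.select(join_cols).append_column(
        MARKER_COLUMN_NAME, pa.array([True] * len(target), pa.bool_())
    )
    # Left-outer join: the marker is null for new rows (inserts) and true
    # for rows that already exist in the target (candidate updates).
    return source.join(target_index, keys=join_cols, join_type="left outer")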


Before

# Fails if venue_geo is a non-key struct field
txn.upsert(df, join_cols=["match_id"])

❌ ArrowInvalid: Data type struct<...> is not supported in join non-key field venue_geo


After

# Falls back to key-based join and proceeds
txn.upsert(df, join_cols=["match_id"])

✅ Successfully inserts or updates the record, skipping complex-field comparison during the join


✅ Are these changes tested?

Yes:

  • A test was added to reproduce the failure scenario with complex non-key fields (a sketch of what such a test might look like follows below).
  • The new behavior is verified by asserting that the upsert completes successfully using the fallback logic.
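
A sketch of what such a test might look like (the fixture, schema, and result fields are illustrative assumptions, not the PR's actual test code):

import pyarrow as pa

def test_upsert_with_non_key_struct(iceberg_table):  # hypothetical fixture with a matching Iceberg schema
    df = pa.table({
        "match_id": pa.array([1, 2], pa.int64()),
        "venue_geo": pa.array(
            [{"lat": 52.3, "lon": 4.9}, {"lat": 48.8, "lon": 2.3}],
            pa.struct([("lat", pa.float64()), ("lon", pa.float64())]),
        ),
    })
    # Before this change this raised ArrowInvalid, because venue_geo is a
    # non-key struct column; with the fallback the upsert completes.
    result = iceberg_table.upsert(df, join_cols=["match_id"])
    assert result.rows_inserted + result.rows_updated == 2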

ℹ️ Note
This change does not affect users who do not include complex types in their schemas. For those who do, it improves resilience while preserving data correctness.


Are there any user-facing changes?

Yes — upserts involving complex non-key columns (like struct, list, or map) no longer fail. They now succeed by skipping unsupported comparisons during the join phase.

@koenvo marked this pull request as ready for review on April 3, 2025
@Fokko (Contributor) commented:

Thanks for working on this, @koenvo! It looks like a lot of folks are waiting for this.

Could you run a poor man's benchmark, similar to what I did in #1685 (comment), just to see how the two methods compare in terms of performance?


MARKER_COLUMN_NAME = "__from_target"

assert MARKER_COLUMN_NAME not in join_cols_set
A reviewer (Contributor) commented on the snippet above:

We try to avoid assert outside of the tests. Could you raise a ValueError instead?
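
A minimal sketch of the suggested change, reusing the names from the snippet above:

MARKER_COLUMN_NAME = "__from_target"

if MARKER_COLUMN_NAME in join_cols_set:
    # Raise instead of assert so the check also runs under python -O.
    raise ValueError(f"Column name {MARKER_COLUMN_NAME!r} is reserved for the upsert marker and cannot be a join column")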

@koenvo (Contributor, Author) commented Apr 8, 2025

Poor man's benchmark

This compares the performance of the original vs fallback upsert logic.
The "With skips" case simulates situations where non-matching rows can be skipped during comparison.

| Condition     | Original (s) | Fallback (s) | Diff (ms) | Diff (%) |
|---------------|--------------|--------------|-----------|----------|
| Without skips | 0.727        | 0.724        | -2.73     | -0.38%   |
| With skips    | 0.681        | 0.732        | +51.24    | +7.53%   |

No significant performance regression observed. Fallback behaves as expected.
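
The benchmark script itself is not included in the thread; a minimal sketch of a timing harness of this shape, where upsert_original and upsert_fallback are hypothetical stand-ins for the two code paths:

import time

def best_of(fn, *args, repeats=5):
    # Best-of-N wall-clock timing to reduce noise from a single run.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# original_s = best_of(upsert_original, table, df)
# fallback_s = best_of(upsert_fallback, table, df)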

@Fokko (Contributor) commented Apr 17, 2025

First of all, sorry for the late reply, I was busy with the Iceberg summit :)

@koenvo The reason I was asking for a benchmark is to see if we can replace the existing logic with your logic that also works with {map,list,struct} types. I'm a simple man and I like simple things.


# Step 2: Prepare target index with join keys and a marker
target_index = target_table.select(join_cols_set).append_column(
    MARKER_COLUMN_NAME, pa.array([True] * len(target_table), pa.bool_())
)
A reviewer (Contributor) commented on the snippet above:

I think we can optimize this allocation by avoiding creating a Python array, but that can be done in a separate PR.
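
One way that optimization might look (an assumption, not necessarily the change that landed): build the marker column with pyarrow's repeat helper instead of materializing a Python list:

target_index = target_table.select(join_cols_set).append_column(
    MARKER_COLUMN_NAME, pa.repeat(pa.scalar(True, pa.bool_()), len(target_table))
)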

@koenvo (Author) replied:

Fixed :-)
