Inconsistent PyArrow Schema Field Metadata on `project_table`: Parquet Field ID #788

sungwy · 2024-06-02T15:53:37Z

Apache Iceberg version

None

Please describe the bug 🐞

While refactoring project_table(#786) I ran into some issues with the tests because the existing behavior for the project_table function isn’t consistent in terms of whether or not it returns the Parquet Field ID in its pyarrow schema field metadata.

There are cases where the parquet field ID is attached to the field metadata, and cases where they aren’t: https://github.com/apache/iceberg-python/blob/main/tests/io/test_pyarrow.py#L1062-L1080

I think this is because we use schema_to_pyarrow as a fallback schema which attaches the parquet field ID attribute onto the field metadata: https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1133

I think we should correct this behavior so that it is consistent for all table scans.

Do we want to attach the parquet file ID attribute on all pyarrow schema returned by project_table?
Or should we remove parquet file ID attached on the field metadata of the pyarrow schema? The idea here is that we would have two modes of creating schema_to_pyarrow , with or without parquet Field ID (write, versus read use cases)

I think not having unintended metadata for a specific use case will be cleaner for the users. Parquet Field ID was added to schema_to_pyarrow so that we could persist the field ID into the parquet files on write. But we do not want them when we are reading the Table. Hence, I am leaning towards the second option.

Looking for some thoughts and direction on this issue so we can complete the refactoring to support Iterator[RecordBatch] output scans! @Fokko @HonahX

The text was updated successfully, but these errors were encountered:

Fokko · 2024-06-02T18:52:32Z

Thanks @syun64 for raising this, and it indeed looks inconsistent.

There has been a lot of confusion around this in the past. The Field-IDs are internal to Iceberg and should only be used when:

Reading: Looking up the field in the requested schema
Writing: Aligning the fields with the table schema

Do we want to attach the parquet file ID attribute on all pyarrow schema returned by project_table?

After the project table, it is not relevant anymore, so it is best to remove them.

Or should we remove parquet file ID attached on the field metadata of the pyarrow schema? The idea here is that we would have two modes of creating schema_to_pyarrow , with or without parquet Field ID (write, versus read use cases)

Yes, this makes sense to me. It would be good to have the option to omit field IDs.

sungwy · 2024-06-03T15:39:17Z

Thank you for the input @Fokko - sounds good 👍

I've put up #789 to fix this issue

sungwy mentioned this issue Jun 2, 2024

Support Table.to_arrow_batch_reader to return RecordBatchReader instead of a fully materialized Arrow Table #786

Merged

sungwy mentioned this issue Jun 3, 2024

include_field_ids flag in schema_to_pyarrow #789

Merged

Fokko closed this as completed in #789 Jun 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent PyArrow Schema Field Metadata on `project_table`: Parquet Field ID #788

Inconsistent PyArrow Schema Field Metadata on `project_table`: Parquet Field ID #788

sungwy commented Jun 2, 2024

Fokko commented Jun 2, 2024

sungwy commented Jun 3, 2024

Inconsistent PyArrow Schema Field Metadata on project_table: Parquet Field ID #788

Inconsistent PyArrow Schema Field Metadata on project_table: Parquet Field ID #788

Comments

sungwy commented Jun 2, 2024

Apache Iceberg version

Please describe the bug 🐞

Fokko commented Jun 2, 2024

sungwy commented Jun 3, 2024

Inconsistent PyArrow Schema Field Metadata on `project_table`: Parquet Field ID #788

Inconsistent PyArrow Schema Field Metadata on `project_table`: Parquet Field ID #788