fix: Support Schema Evolution in iceberg #1723


Closed

huaxingao wants to merge 3 commits into main from allowSchemaEvolution

Conversation

huaxingao
Contributor

Which issue does this PR close?

We originally had CometConf.COMET_SCHEMA_EVOLUTION_ENABLED, which the scan rule sets to true when the scan is an Iceberg table scan. However, this doesn't work for the following case:

    sql("CREATE TABLE %s (id Int) USING iceberg", table1);
    sql("INSERT INTO %s VALUES (1), (2), (3), (4)", table1);
    sql("alter table %s alter column id type bigint", table1);
    sql("SELECT * FROM %s", table1);

In this example, when executing SELECT * FROM table, Iceberg creates a Comet ColumnReader and invokes TypeUtil.checkParquetType. This throws an exception because the scan rule hasn't been applied yet, but the column type has already changed to bigint.

        org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [id], physicalType: INT32, logicalType: bigint
            at app//org.apache.comet.parquet.TypeUtil.checkParquetType(TypeUtil.java:222)
            at app//org.apache.comet.parquet.AbstractColumnReader.<init>(AbstractColumnReader.java:93)
            at app//org.apache.comet.parquet.ColumnReader.<init>(ColumnReader.java:104)
            at app//org.apache.comet.parquet.Utils.getColumnReader(Utils.java:50)

Instead of enabling schema evolution in the scan rule, I will update Utils.getColumnReader to accept a boolean supportsSchemaEvolution parameter and pass true from the Iceberg side.
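A minimal sketch of the proposed shape (this is not the real Comet code; only the names getColumnReader, checkParquetType, and supportsSchemaEvolution come from the PR, while all signatures and type handling here are illustrative): the caller decides whether schema evolution is allowed and passes it down, instead of the reader consulting a global config that the scan rule may not have set yet.

```java
public class SchemaEvolutionSketch {

  // Stand-in for SchemaColumnConvertNotSupportedException.
  static class SchemaMismatchException extends RuntimeException {
    SchemaMismatchException(String msg) { super(msg); }
  }

  // Leaf check: an INT32 file column read as bigint is only accepted when
  // schema evolution (type widening) is allowed.
  static void checkParquetType(String column, String physicalType, String logicalType,
                               boolean supportsSchemaEvolution) {
    boolean exact = ("INT32".equals(physicalType) && "int".equals(logicalType))
        || ("INT64".equals(physicalType) && "bigint".equals(logicalType));
    boolean widening = "INT32".equals(physicalType) && "bigint".equals(logicalType);
    if (exact || (supportsSchemaEvolution && widening)) {
      return;
    }
    throw new SchemaMismatchException(
        "column: [" + column + "], physicalType: " + physicalType
            + ", logicalType: " + logicalType);
  }

  // The Iceberg side passes true here; plain Comet Parquet scans keep
  // their previous behavior by passing false.
  static void getColumnReader(String column, String physicalType, String logicalType,
                              boolean supportsSchemaEvolution) {
    checkParquetType(column, physicalType, logicalType, supportsSchemaEvolution);
  }

  public static void main(String[] args) {
    boolean threw = false;
    try {
      getColumnReader("id", "INT32", "bigint", false); // old path: throws
    } catch (SchemaMismatchException e) {
      threw = true;
    }
    System.out.println("throws without evolution: " + threw);
    getColumnReader("id", "INT32", "bigint", true); // Iceberg path: accepted
    System.out.println("accepted with evolution");
  }
}
```

This mirrors the failure mode in the stack trace above: with the flag false, the INT32/bigint mismatch throws; with the flag true, the int-to-bigint widening is accepted.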

Closes #.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

I currently test the new patch in Iceberg.

  @Test
  public void test() {
    String table1 = tableName("test");

    sql("CREATE TABLE %s (id Int) USING iceberg", table1);
    sql("INSERT INTO %s VALUES (1), (2), (3), (4)", table1);
    sql("alter table %s alter column id type bigint", table1);
    List<Object[]> results = sql("SELECT * FROM %s", table1);

    sql("DROP TABLE IF EXISTS %s", table1);
  }

Without the fix, I got

        org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [id], physicalType: INT32, logicalType: bigint
            at app//org.apache.comet.parquet.TypeUtil.checkParquetType(TypeUtil.java:222)
            at app//org.apache.comet.parquet.AbstractColumnReader.<init>(AbstractColumnReader.java:93)
            at app//org.apache.comet.parquet.ColumnReader.<init>(ColumnReader.java:104)
            at app//org.apache.comet.parquet.Utils.getColumnReader(Utils.java:50)
            at app//org.apache.iceberg.spark.data.vectorized.CometColumnReader.reset(CometColumnReader.java:103)

With the fix, the problem goes away.

@@ -1217,34 +1217,6 @@ abstract class ParquetReadSuite extends CometTestBase {
}
}

test("schema evolution") {
Member

Can we update the test rather than remove it?

Contributor Author

It seems to me this only tests CometConf.COMET_SCHEMA_EVOLUTION_ENABLED. Since I removed the config, I think this test is not needed any more.
I currently run my test on the Iceberg side. I am thinking maybe we can add an iceberg-integration module to hold the Iceberg-related tests, or depend on the Iceberg CI (#1715)?

@andygrove
Member

Thanks @huaxingao. The implementation changes LGTM, but I would like to understand how this will be tested.

@@ -28,33 +28,33 @@

public class Utils {

/** This method is called from Apache Iceberg. */
Contributor

(nit) Shall we keep this comment? I think it is useful if it's still valid.

Contributor Author

We will use the same method for both Comet and Iceberg. The comment is not needed any more.

@@ -33,7 +33,7 @@ import org.apache.spark.sql.execution.datasources.v2.parquet.ParquetScan
import org.apache.spark.sql.internal.SQLConf
Contributor

(nit) This import can be removed.

Contributor Author

Removed. Thanks

@huaxingao huaxingao changed the title Support Schema Evolution in iceberg fix: Support Schema Evolution in iceberg May 8, 2025
@huaxingao huaxingao force-pushed the allowSchemaEvolution branch from 62e547d to 6fc8b85 Compare May 8, 2025 23:35
@codecov-commenter

codecov-commenter commented May 9, 2025

Codecov Report

Attention: Patch coverage is 60.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 58.70%. Comparing base (f09f8af) to head (74a1058).
Report is 186 commits behind head on main.

Files with missing lines Patch % Lines
...org/apache/comet/parquet/AbstractColumnReader.java 80.00% 0 Missing and 1 partial ⚠️
...in/java/org/apache/comet/parquet/ColumnReader.java 0.00% 1 Missing ⚠️
...ava/org/apache/comet/parquet/LazyColumnReader.java 0.00% 1 Missing ⚠️
...c/main/java/org/apache/comet/parquet/TypeUtil.java 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1723      +/-   ##
============================================
+ Coverage     56.12%   58.70%   +2.58%     
- Complexity      976     1139     +163     
============================================
  Files           119      129      +10     
  Lines         11743    12707     +964     
  Branches       2251     2377     +126     
============================================
+ Hits           6591     7460     +869     
- Misses         4012     4058      +46     
- Partials       1140     1189      +49     


Contributor

@hsiang-c hsiang-c left a comment

LGTM

@parthchandra
Contributor

The config CometConf.COMET_SCHEMA_EVOLUTION_ENABLED is valid for Parquet files as well, so removing it is not correct, in my opinion.
Also, ScanRule is called at planning time while getColumnReader is called during execution. Why is the config not set correctly?

@parthchandra
Contributor

One set of Spark SQL test failures in native_iceberg_compat is exactly because this schema evolution/type promotion check is not being performed correctly (or not being performed at all, in fact). I'm currently trying to address these failures by writing a checkParquetType for complex types, which will end up calling the current checkParquetType when it finds a PrimitiveType value. I was counting on this config to provide compatible results.
What types of schema evolution does Iceberg require (i.e. support) for complex types?
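The recursive check described in the comment above could look roughly like the following sketch. The type model here is invented for illustration only; Comet's real code would walk Parquet's GroupType/PrimitiveType tree, and the leaf rules shown are a simplified subset.

```java
import java.util.List;

public class ComplexTypeCheckSketch {

  interface FileType {}
  // Leaf: what the file stores vs. what the table schema expects.
  record Primitive(String physical, String logical) implements FileType {}
  // Complex type: a struct/list/map node with child types.
  record Group(List<FileType> children) implements FileType {}

  // Leaf check: exact matches, plus the int -> bigint widening, are accepted.
  static boolean checkPrimitive(Primitive p) {
    if (p.physical().equals("INT32") && p.logical().equals("int")) return true;
    if (p.physical().equals("INT64") && p.logical().equals("bigint")) return true;
    return p.physical().equals("INT32") && p.logical().equals("bigint"); // widening
  }

  // Recursive check: descend through group nodes and delegate to the
  // primitive check when a leaf is reached.
  static boolean checkType(FileType t) {
    if (t instanceof Primitive p) {
      return checkPrimitive(p);
    }
    for (FileType child : ((Group) t).children()) {
      if (!checkType(child)) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    // struct<a: int, b: struct<c: bigint, where the file stores INT32>>
    FileType schema = new Group(List.of(
        new Primitive("INT32", "int"),
        new Group(List.of(new Primitive("INT32", "bigint")))));
    System.out.println(checkType(schema));
  }
}
```

The nested INT32-to-bigint promotion inside the inner struct is accepted because the recursion applies the same leaf rules at every depth.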

@huaxingao huaxingao closed this May 13, 2025
@huaxingao huaxingao reopened this May 13, 2025
@huaxingao huaxingao closed this May 13, 2025