
Flink: Add DynamicRecord / DynamicRecordInternal / DynamicRecordInternalSerializer #12996


Merged: 1 commit merged into apache:main from dynamic-sink-contrib-breakdown on May 14, 2025

Conversation

@mxm (Contributor) commented May 7, 2025:

This adds the user-facing type DynamicRecord, along with its internal representation DynamicRecordInternal and its type information and serializer.

Broken out of #12424.

The original PR is based on Flink 1.20. This version is based on Flink 2.0.
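
For orientation, here is a rough sketch of the user-facing shape of DynamicRecord, pieced together from the diff excerpts quoted below (field names and types come from the review context; the merged class may differ in detail):

import java.util.List;
import javax.annotation.Nullable;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.DistributionMode;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.TableIdentifier;

// Sketch only; fields mirror the review excerpts in this conversation.
public class DynamicRecord {
  private TableIdentifier tableIdentifier; // target Iceberg table
  private String branch;                   // target branch
  private Schema schema;                   // user-provided schema
  private PartitionSpec partitionSpec;     // user-provided partition spec
  private RowData rowData;                 // the Flink row to write
  private DistributionMode distributionMode;
  private int writeParallelism;
  private boolean upsertMode;
  @Nullable private List<String> equalityFields; // the only optional field
  // Accessors follow the Iceberg convention discussed below,
  // e.g. tableIdentifier() and setTableIdentifier(TableIdentifier).
}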

@mxm force-pushed the dynamic-sink-contrib-breakdown branch from 9081c13 to 0e50889 on May 7, 2025, 14:32
private PartitionSpec spec;
private int writerKey;
private RowData rowData;
private boolean upsertMode;
Contributor:

Should we rename this to isUpsert or if it denotes an actual mode use an enum instead?

Author (@mxm):

Can do, but it's consistent with the coding style; we often omit these verbs from the getters in Iceberg.

@gyfora (Contributor), May 8, 2025:

In that case, upsert or useUpsertMode would probably be a better name.

private String tableName;
private String branch;
private Schema schema;
private PartitionSpec spec;
Contributor:

Should we rename this to partitionSpec in case some other kind of spec appears in the future?

Author (@mxm):

I was also leaning towards this name in the beginning, but it's Iceberg convention to use this name across the code base. We can rename though if this is a concern.

Comment on lines +85 to +91
// Check that the schema id can be resolved. Not strictly necessary for serialization.
Tuple3<RowDataSerializer, Schema, PartitionSpec> serializer =
serializerCache.serializerWithSchemaAndSpec(
toSerialize.tableName(),
toSerialize.schema().schemaId(),
toSerialize.spec().specId());
Contributor:

If it's not strictly necessary, why do we do it? What happens if this fails, and why would it fail?

Author (@mxm):

This is basically a sanity check, verifying that looking up the serializer by id on the remote side will work. The remote side won't have the schema available, because it is not written in this branch. If there are any issues, we will know about them on the sender side, as opposed to the receiving side.

I've added a JavaDoc which should clarify things.
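
The added JavaDoc is not part of this excerpt; going by the explanation above, its gist would be roughly the following (a paraphrase, not the committed text):

/**
 * Sanity check on the sender side: resolve the serializer via table name, schema id,
 * and spec id before writing. The receiver only sees the ids (the schema itself is not
 * serialized on this path), so a failed lookup surfaces the problem at the sender
 * rather than at the receiver.
 */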

private String branch;
private Schema schema;
private RowData rowData;
private PartitionSpec spec;
Contributor:

Should this be called partitionSpec, in case other specs are added in the future?

private PartitionSpec spec;
private DistributionMode mode;
private int writeParallelism;
private boolean upsertMode;
Contributor:

A boolean doesn't really describe a mode; should this be an enum, or maybe isUpsert?

Author (@mxm), May 8, 2025:

I think it does. If enabled, upsert mode will be used.

Author (@mxm):

See also #12996 (comment)

private Schema schema;
private RowData rowData;
private PartitionSpec spec;
private DistributionMode mode;
Contributor:

Should this be distribution or distributionMode? (It already clashes with upsertMode a little.)

Author (@mxm):

Yes, it makes sense to rename to distributionMode.

@mxm (Author) commented May 8, 2025:

Thanks for the review @gyfora! I think it makes sense to rename the API-facing fields / getters / setters to avoid confusion for users.

@gyfora (Contributor) commented May 8, 2025:

> Thanks for the review @gyfora! I think it makes sense to rename the API-facing fields / getters / setters to avoid confusion for users.

I am not yet aware of all the conventions here. @pvary, maybe you could chime in on the naming, and then I will learn once and for all :D

private DistributionMode mode;
private int writeParallelism;
private boolean upsertMode;
@Nullable private List<String> equalityFields;
Contributor:

Only this field is nullable?
Shall we use the annotation consistently?

Author (@mxm):

Correct, only this field is currently nullable / optional. We could add some defaults. I was thinking of adding a builder; what do you think?

Contributor:

A builder makes sense to me, as we have many parameters
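
For illustration, construction via such a builder might look roughly like this (a hypothetical sketch; DynamicRecord.builder() and the method names are assumptions based on the fields above, not the merged API):

// Hypothetical builder usage; names mirror the fields in the diff excerpts.
DynamicRecord record =
    DynamicRecord.builder()
        .tableIdentifier(TableIdentifier.of("db", "events")) // example identifier
        .branch("main")
        .schema(schema)
        .partitionSpec(partitionSpec)
        .rowData(rowData)
        .distributionMode(DistributionMode.HASH)
        .writeParallelism(4)
        .upsertMode(false)
        // equalityFields is the only optional field and could default to null
        .build();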

@pvary (Contributor) commented May 9, 2025:

>> Thanks for the review @gyfora! I think it makes sense to rename the API-facing fields / getters / setters to avoid confusion for users.
>
> I am not yet aware of all the conventions here. @pvary, maybe you could chime in on the naming, and then I will learn once and for all :D

It might be strange for new developers, but we always omit get, set, and is from method names.
Here is the guide: https://iceberg.apache.org/contribute/#iceberg-code-contribution-guidelines
The only exceptions are overrides for external APIs.
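
Concretely, the getter style looks like this (a minimal example; the names mirror fields from this PR's diff excerpts):

// Iceberg style: accessors drop the get/is prefix.
public String tableName() {     // rather than getTableName()
  return tableName;
}

public boolean upsertMode() {   // rather than isUpsertMode()
  return upsertMode;
}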

@gyfora (Contributor) commented May 9, 2025:

>>> Thanks for the review @gyfora! I think it makes sense to rename the API-facing fields / getters / setters to avoid confusion for users.
>>
>> I am not yet aware of all the conventions here. @pvary, maybe you could chime in on the naming, and then I will learn once and for all :D
>
> It might be strange for new developers, but we always omit get, set, and is from method names. Here is the guide: https://iceberg.apache.org/contribute/#iceberg-code-contribution-guidelines The only exceptions are overrides for external APIs.

In general I get the idea, but my particular concern was related to upgradeMode: the convention clearly doesn't work well with a name like this, as it's immediately confusing when you have other xxMode fields that are enums, etc.

@pvary (Contributor) commented May 9, 2025:

>>>> Thanks for the review @gyfora! I think it makes sense to rename the API-facing fields / getters / setters to avoid confusion for users.
>>>
>>> I am not yet aware of all the conventions here. @pvary, maybe you could chime in on the naming, and then I will learn once and for all :D
>>
>> It might be strange for new developers, but we always omit get, set, and is from method names. Here is the guide: https://iceberg.apache.org/contribute/#iceberg-code-contribution-guidelines The only exceptions are overrides for external APIs.
>
> In general I get the idea, but my particular concern was related to upgradeMode: the convention clearly doesn't work well with a name like this, as it's immediately confusing when you have other xxMode fields that are enums, etc.

I assume this is upsertMode?
While I understand your concern, the IcebergSink contains upsertMode, and this convention is used throughout the Flink code, so I would stick to it.

@mxm (Author) commented May 9, 2025:

I've pushed an update to address the comments.

On the name discussion: I think this is all just convention; every community has its own style. I don't think either way inherently makes more sense. upsertMode makes perfect sense to me; isUpsert not so much, because not every record produces an upsert; but isUpsertMode makes just as much sense, even though the boolean type sits right next to the name.

The most important reason is consistency. All existing Flink Iceberg sinks use that name. I don't see a strong case to deviate from it.

I did rename mode to distributionMode and spec to partitionSpec.

Flink: Add DynamicRecord / DynamicRecordInternal / DynamicRecordInternalSerializer

This adds the user-facing type DynamicRecord, along with its internal
representation DynamicRecordInternal and its type information and serializer.

Broken out of github.com/apache/iceberg/pull/12424.
@mxm force-pushed the dynamic-sink-contrib-breakdown branch from 665aa07 to ec7d036 on May 12, 2025, 11:11
@mxm (Author) commented May 12, 2025:

(rebased and squashed commits)

return tableIdentifier;
}

public void setTableIdentifier(TableIdentifier tableIdentifier) {
Contributor:

Do we need these setters, if we have a builder?

Author (@mxm):

We wouldn't. I'm not sure, though, that we should remove these methods, as they allow DynamicRecord instances to be reused. If we only offer a builder, that won't be possible anymore.

return tableName;
}

public void setTableName(String tableName) {
Contributor:

Do we need these setters?

Author (@mxm):

We currently use these setters here to allow for Flink's object reuse mode:
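(The linked snippet is not reproduced in this excerpt. As a rough sketch of the pattern: under object reuse, Flink calls the deserialize(reuse, source) variant of TypeSerializer, which can repopulate an existing instance through these setters instead of allocating a new record. The field handling below is illustrative, not the PR's actual wire format.)

import java.io.IOException;
import org.apache.flink.core.memory.DataInputView;

// Illustrative sketch: repopulate the reused instance via its setters.
@Override
public DynamicRecordInternal deserialize(DynamicRecordInternal reuse, DataInputView source)
    throws IOException {
  reuse.setTableName(source.readUTF());
  reuse.setBranch(source.readUTF());
  // ... remaining fields (schema id, spec id, row data, etc.) read similarly ...
  return reuse;
}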

mxm added a commit to mxm/iceberg that referenced this pull request May 12, 2025
This adds the classes around schema / spec comparison and evolution. A breakdown
of the classes follows:

 # CompareSchemasVisitor

Compares the user-provided schema against the current table schema.

 # EvolveSchemaVisitor

Computes the changes required to the table schema to be compatible with the
user-provided schema.

 # PartitionSpecEvolution

Code for checking compatibility with the user-provided PartitionSpec and
computing a set of changes to rewrite the PartitionSpec.

 # TableDataCache

Cache which holds all relevant metadata of a table like its name, branch,
schema, partition spec. Also holds a cache of past comparison results for a
given table's schema and the user-provided input schema.

 # Table Updater

Core logic to compare and create/update a table given a user-provided input
schema.

Broken out of apache#12424, depends on apache#12996.
mxm added a commit to mxm/iceberg that referenced this pull request May 12, 2025
mxm added a commit to mxm/iceberg that referenced this pull request May 12, 2025
@pvary mentioned this pull request May 14, 2025
@pvary merged commit 268661a into apache:main May 14, 2025 (20 checks passed)
@pvary (Contributor) commented May 14, 2025:

Merged to main.
Thanks @mxm for the PR and @gyfora for the review!

@mxm deleted the dynamic-sink-contrib-breakdown branch May 15, 2025 08:45
@mxm (Author) commented May 15, 2025:

Thanks @pvary @gyfora for reviewing! Thanks @pvary for the merge!

mxm added a commit to mxm/iceberg that referenced this pull request May 15, 2025
mxm added a commit to mxm/iceberg that referenced this pull request May 15, 2025
mxm added a commit to mxm/iceberg that referenced this pull request May 16, 2025