Spark: Fix row lineage inheritance for distributed planning #13061

Open

amogh-jahagirdar wants to merge 1 commit into main from row-lineage-distributed-planning-fix

Conversation

amogh-jahagirdar (Contributor)

For Spark distributed planning we use ManifestFileBean, a serializable implementation of ManifestFile that encodes only the manifest fields required during distributed planning. The bean was missing firstRowId, so null values were propagated for the inherited firstRowId. This change fixes the issue by adding the firstRowId field to the bean so that it is set, and therefore inherited, correctly during Spark distributed planning.

I discovered this while going through the DML row lineage tests: we weren't exercising a distributed planning case, and after enabling one I debugged the failure. I added another test parameter set for distributed planning.
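
To make the failure mode concrete, here is a small, hypothetical sketch of the inheritance behavior the fix restores. It is not the actual Iceberg planner code; the helper below is purely illustrative of how a null manifest firstRowId propagates.

```java
// Hypothetical illustration only, not Iceberg's internal planner code:
// during planning, a data file entry without an explicit first-row ID
// inherits one derived from the manifest's firstRowId. If the bean used for
// distributed planning always reports null (the bug), there is nothing to
// inherit and the null simply propagates.
public class FirstRowIdInheritanceSketch {

  static Long inheritedFirstRowId(Long manifestFirstRowId, long rowsBeforeEntry) {
    if (manifestFirstRowId == null) {
      return null; // old ManifestFileBean: field missing, so null is inherited
    }
    // With the field populated, the entry's first-row ID is the manifest's
    // firstRowId offset by the rows scanned before this entry.
    return manifestFirstRowId + rowsBeforeEntry;
  }

  public static void main(String[] args) {
    System.out.println(inheritedFirstRowId(null, 100L)); // null (the bug)
    System.out.println(inheritedFirstRowId(500L, 100L)); // 600 (after the fix)
  }
}
```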

github-actions bot added the spark label on May 14, 2025
Comment on lines +199 to +210
{
"testhadoop",
SparkCatalog.class.getName(),
ImmutableMap.of("type", "hadoop"),
FileFormat.PARQUET,
false,
WRITE_DISTRIBUTION_MODE_HASH,
true,
null,
DISTRIBUTED,
3
},
amogh-jahagirdar (Contributor, Author), May 14, 2025

I'll see how much more time this adds, but at least in the interim I think it's worth having, since it's what caught the issue.

Once the vectorized reader change is in, we can probably remove the Parquet + local test above, since the vectorized reader already covers local planning. Then we'd still have coverage of both local and distributed planning without the multiple Parquet + local cases we have right now.

nastra (Contributor), May 15, 2025

Technically you could also override parameters() in TestRowLevelOperationsWithLineage so that the test matrix is only increased for those tests and not across all tests.
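
A rough sketch of that suggestion, assuming the suite uses a static parameters() factory with the parameter layout shown in the diff above. The base class name SparkRowLevelOperationsTestBase, the @Parameters annotation details, and the LOCAL constant are assumptions for illustration, not copied from the actual code, and the usual imports/static imports of the surrounding test file are omitted.

```java
// Rough sketch only: scope the extra DISTRIBUTED planning row to the lineage
// tests by overriding the static parameters() factory in the subclass, so the
// shared base matrix used by the other Spark row-level operation tests stays
// unchanged. SparkRowLevelOperationsTestBase and LOCAL are assumed names.
public class TestRowLevelOperationsWithLineage extends SparkRowLevelOperationsTestBase {

  @Parameters(name = "catalogName = {0}, planningMode = {8}")
  public static Object[][] parameters() {
    return new Object[][] {
      // existing local planning case
      {
        "testhadoop", SparkCatalog.class.getName(), ImmutableMap.of("type", "hadoop"),
        FileFormat.PARQUET, false, WRITE_DISTRIBUTION_MODE_HASH, true, null, LOCAL, 3
      },
      // extra distributed planning case, exercised only by the lineage tests
      {
        "testhadoop", SparkCatalog.class.getName(), ImmutableMap.of("type", "hadoop"),
        FileFormat.PARQUET, false, WRITE_DISTRIBUTION_MODE_HASH, true, null, DISTRIBUTED, 3
      }
    };
  }
}
```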

@@ -36,6 +36,7 @@ public class ManifestFileBean implements ManifestFile, Serializable {
private Long addedSnapshotId = null;
private Integer content = null;
private Long sequenceNumber = null;
private Long firstRowId = null;
amogh-jahagirdar (Contributor, Author)

OK, this may not be quite right; I forgot about delete manifests...

nastra (Contributor)

Why would this not work for delete manifests?

amogh-jahagirdar (Contributor, Author)

It should (it would just be null for delete manifests). I confused myself while debugging an issue with failing tests, but those failures are unrelated to whether it's a delete manifest. Check out my comment below on why I removed the getter in my latest update.

amogh-jahagirdar force-pushed the row-lineage-distributed-planning-fix branch from 561a43e to ebbb9a1 on May 15, 2025 01:15
@@ -46,6 +47,7 @@ public static ManifestFileBean fromManifest(ManifestFile manifest) {
bean.setAddedSnapshotId(manifest.snapshotId());
bean.setContent(manifest.content().id());
bean.setSequenceNumber(manifest.sequenceNumber());
bean.setFirstRowId(manifest.firstRowId());
nastra (Contributor)

Do we need a proper getter for it?

amogh-jahagirdar (Contributor, Author), May 15, 2025

On my latest push I removed the getter, because the Spark actions that read the paths in a manifest file as a DataFrame (e.g. remove orphan files / expire snapshots) also use ManifestFileBean, and reading the manifest DataFrame failed to find firstRowId (the existence of the getter means every record read by these actions must have this field, and if it doesn't, the read fails during analysis).

The getter isn't needed for the distributed planning case, since ManifestFileBean implements ManifestFile and the firstRowId() API gets used. It's also not required for the Spark actions, which just need the minimal file info.

But it's a bit odd that this one particular field won't have a getter; let me think about whether there's a cleaner way. Having two different manifest file bean structures, where one is even more minimal, seems a bit messy just for this case. At the very least, if we go with this approach, I should add an inline comment explaining why there's no getter for the field.
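
A minimal, standalone sketch of the constraint being described here. This is not the real ManifestFileBean; the demo class and field names are illustrative. It only shows that Spark's bean encoder derives its schema from getters, so a setter-only field stays out of the schema that the manifest-reading actions must satisfy.

```java
import java.io.Serializable;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Illustrative demo only, not the real ManifestFileBean: Spark's bean encoder
// builds its schema from getter methods, so a field with a setter (and a
// non-getter accessor) but no getFirstRowId() never becomes a required column
// for DataFrames read through the bean.
public class BeanEncoderSchemaDemo {

  public static class MinimalManifestBean implements Serializable {
    private String path;
    private Long firstRowId;

    public String getPath() {
      return path;
    }

    public void setPath(String path) {
      this.path = path;
    }

    // Setter only: planner-side code can still populate the value...
    public void setFirstRowId(Long firstRowId) {
      this.firstRowId = firstRowId;
    }

    // ...and expose it through a non-bean accessor, without adding a column
    // to the encoder schema below.
    public Long firstRowId() {
      return firstRowId;
    }
  }

  public static void main(String[] args) {
    Encoder<MinimalManifestBean> encoder = Encoders.bean(MinimalManifestBean.class);
    // Prints a schema containing only "path"; firstRowId is absent because
    // schema inference follows getters.
    System.out.println(encoder.schema().treeString());
  }
}
```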
