[SPARK-49723][SQL] Add Variant metrics to the JSON File Scan node #48172

harshmotw-db · 2024-09-19T21:50:47Z

What changes were proposed in this pull request?

This pull request adds the following metrics to JSON file scan nodes. They describe the variants constructed as part of the scan:

variant top-level - total count
variant top-level - total byte size
variant top-level - total number of paths
variant top-level - total number of scalar values
variant top-level - max depth
variant nested - total count
variant nested - total byte size
variant nested - total number of paths
variant nested - total number of scalar values
variant nested - max depth
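
As a rough illustration of what the path, scalar, and depth metrics above could measure, here is a minimal Spark-free sketch over a toy JSON model. Everything in it is hypothetical: `Json`, `Stats`, `Totals`, and `VariantStatsSketch` are not classes from this PR, and the definitions of "path" (any object field or array element reachable from the root) and "depth" (a scalar is 0, each object or array adds 1) are assumptions, since the PR description does not spell them out. Byte size is left out because it depends on the variant binary encoding.

```scala
// Hypothetical, Spark-free model of a parsed JSON value (not the Variant encoding).
sealed trait Json
final case class JObj(fields: Map[String, Json]) extends Json
final case class JArr(items: List[Json]) extends Json
final case class JScalar(value: Any) extends Json

// Statistics for a single value.
final case class Stats(numPaths: Long, numScalars: Long, maxDepth: Int)

// Running totals across many values: counts are summed, depth keeps the maximum.
final case class Totals(count: Long = 0, paths: Long = 0, scalars: Long = 0, maxDepth: Int = 0) {
  def add(s: Stats): Totals =
    Totals(count + 1, paths + s.numPaths, scalars + s.numScalars, math.max(maxDepth, s.maxDepth))
}

object VariantStatsSketch {
  // Assumed definitions (not taken from the PR):
  //  - a "path" is any object field or array element reachable from the root
  //  - a scalar has depth 0; each enclosing object or array adds one level
  def stats(j: Json): Stats = j match {
    case JScalar(_)   => Stats(0, 1, 0)
    case JObj(fields) => combine(fields.values.toList)
    case JArr(items)  => combine(items)
  }

  // Each child contributes one path of its own plus whatever it contains.
  private def combine(children: List[Json]): Stats = {
    val cs = children.map(stats)
    Stats(
      numPaths = children.size + cs.map(_.numPaths).sum,
      numScalars = cs.map(_.numScalars).sum,
      maxDepth = 1 + cs.map(_.maxDepth).foldLeft(0)(_ max _)
    )
  }
}
```

Under these assumed definitions, the document `{"a": 1, "b": [2, {"c": 3}]}` yields `Stats(5, 3, 3)`: five paths (`a`, `b`, `b[0]`, `b[1]`, `b[1].c`), three scalars, and a maximum depth of three.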

Top-level and nested variant metrics are reported separately because they can have different usage patterns. A variant is considered top-level when it comes from a singleVariantColumn scan, or when a column in a user-provided schema has variant as its own type (not nested in a struct, array, or map). Variants nested inside other data types are considered nested variants.
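
The top-level versus nested distinction can be sketched as a walk over the scan schema. The type model below is a hypothetical simplification, not Spark's `DataType` hierarchy, and `VariantPlacement` is an illustrative name that does not appear in the PR.

```scala
// Hypothetical, simplified schema model (not Spark's DataType classes).
sealed trait DType
case object VariantT extends DType
case object StringT extends DType
final case class StructT(fields: List[(String, DType)]) extends DType
final case class ArrayT(element: DType) extends DType
final case class MapT(key: DType, value: DType) extends DType

object VariantPlacement {
  // True if a variant appears anywhere inside the given type.
  def containsVariant(t: DType): Boolean = t match {
    case VariantT    => true
    case StructT(fs) => fs.exists { case (_, ft) => containsVariant(ft) }
    case ArrayT(e)   => containsVariant(e)
    case MapT(k, v)  => containsVariant(k) || containsVariant(v)
    case _           => false
  }

  // A column whose own type is variant counts as "top-level";
  // a column that merely contains a variant counts as "nested".
  // Columns without any variant are omitted from the result.
  def classify(schema: StructT): Map[String, String] =
    schema.fields.collect {
      case (name, VariantT)                => name -> "top-level"
      case (name, t) if containsVariant(t) => name -> "nested"
    }.toMap
}
```

For a schema with columns `v: variant`, `s: struct<x: variant>`, and `n: string`, this sketch labels `v` as top-level, `s` as nested, and leaves `n` out.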

Why are the changes needed?

This change allows users to collect metrics on variant usage to better monitor their data/workloads.

Does this PR introduce any user-facing change?

Users will now be able to see variant metrics in JSON scan nodes, which were not available earlier.

How was this patch tested?

Comprehensive unit tests in VariantEndToEndSuite.scala

Was this patch authored or co-authored using generative AI tooling?

Yes, got some help related to Scala syntax.
Generated by: ChatGPT 4o, GitHub Copilot.

common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java

sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala

cloud-fan · 2024-09-20T09:13:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala

+    val readFile: (PartitionedFile) => Iterator[InternalRow] = {
+      val hadoopConf = relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options)
+      relation.fileFormat match {
+        case f: JsonFileFormat =>


We should probably make it more general and allow FileFormat implementations to report additional metrics.

harshmotw-db · 2024-09-23 (edited)

I tried this initially, but it required changing every definition of this method in the child classes of FileFormat, which would have made the PR needlessly bigger. I think there were also some issues after I overrode every definition, but I don't fully remember them.

Let me know if you think this suggestion is important. In my opinion, people who wish to add metrics in the future can just follow the same pattern I used here.
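
For illustration only, one possible shape of the more general approach suggested above is a hook with a default implementation, so that formats without extra metrics need no changes. This is not what the PR does, and whether it would avoid the issues mentioned in the reply is untested. `FormatLike`, `extraMetricNames`, `ParquetLike`, `JsonLike`, and `ScanSketch` are hypothetical stand-ins, not Spark's `FileFormat` or metric APIs.

```scala
// Hypothetical stand-in for a file format abstraction (not Spark's FileFormat).
trait FormatLike {
  def name: String
  // Default: no additional metrics, so existing implementations compile unchanged.
  def extraMetricNames: Seq[String] = Seq.empty
}

// A format that does not opt in; it inherits the empty default.
object ParquetLike extends FormatLike {
  val name = "parquet"
}

// A format that opts in by overriding the hook.
object JsonLike extends FormatLike {
  val name = "json"
  override def extraMetricNames: Seq[String] =
    Seq("variant top-level - total count", "variant nested - total count")
}

object ScanSketch {
  // Placeholder for whatever metrics the scan node already reports.
  val baseMetricNames: Seq[String] = Seq("base metric")

  // The scan node asks the format for extras instead of pattern matching on its type.
  def metricNames(format: FormatLike): Seq[String] =
    baseMetricNames ++ format.extraMetricNames
}
```

The trade-off is the one raised in the thread: the hook keeps the scan node free of format-specific cases, at the cost of widening the shared interface.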

gene-db

@harshmotw-db Thanks for this feature! I left a few questions.

common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala

gene-db

@harshmotw-db Thanks! I left a few questions.

sql/core/src/main/scala/org/apache/spark/sql/execution/metric/VariantConstructionMetrics.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala

gene-db

@harshmotw-db Thanks! I left a few followup comments.

sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala

gene-db

@harshmotw-db Thanks! I left a few more minor comments.

gene-db · 2024-10-08T15:40:31Z

sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala

@@ -628,18 +645,39 @@ case class FileSourceScanExec(
    }
  }

+  val topLevelVariantMetrics: VariantMetrics = new VariantMetrics()


We should add some comments for these member variables.

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/metric/VariantConstructionMetrics.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala

gene-db

@harshmotw-db Thanks for these valuable metrics! I just had 1 minor comment.

LGTM

common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java

harshmotw-db · 2024-10-14T20:55:17Z

@cloud-fan Can you go over this PR again whenever you have time? Thanks!

github-actions · 2025-01-24T00:23:52Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Added variant metrics to JSON Scans

85f01af

github-actions bot added the SQL label Sep 19, 2024

minor change

a902d7f

HyukjinKwon reviewed Sep 19, 2024

common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java (outdated)

fixed indent

f628569

cloud-fan reviewed Sep 20, 2024

sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala (outdated)

cloud-fan reviewed Sep 20, 2024

harshmotw-db requested a review from HyukjinKwon September 20, 2024 17:03

gene-db reviewed Sep 23, 2024

harshmotw-db added 3 commits September 23, 2024 14:17

Merge branch 'master' of https://github.com/harshmotw-db/spark into v…

8a12cce

…ariant_metrics

made it so that only json scans with variant in schema can produce va…

90a23d0

…riant metrics

Addressed Wenchen's and Gene's comments

6ef3e70

harshmotw-db requested review from cloud-fan and gene-db September 23, 2024 21:50

addressed Gene's major comment

60c18d8

gene-db reviewed Sep 24, 2024

Gene's recommendations

8ac576d

harshmotw-db requested a review from gene-db September 24, 2024 19:44

harshmotw-db mentioned this pull request Sep 25, 2024

[SPARK-49723][FOLLOW-UP][SQL] Add Variant metrics to the parse_json expression in the project node #48241

Closed

gene-db reviewed Sep 25, 2024

addressed comments raised by Gene

5c518ef

harshmotw-db requested a review from gene-db October 7, 2024 21:08

gene-db reviewed Oct 8, 2024

harshmotw-db added 2 commits October 10, 2024 13:48

Merge branch 'master' of https://github.com/harshmotw-db/spark into v…

0b01953

…ariant_metrics

Addressed comments made by Gene

6c3a662

harshmotw-db requested a review from gene-db October 10, 2024 21:33

gene-db approved these changes Oct 10, 2024

common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java (outdated)

Update VariantBuilder.java

e209634

Merge branch 'master' into harshmotw-db/variant_metrics

48e8ebb

github-actions bot added the Stale label Jan 24, 2025

github-actions bot closed this Jan 25, 2025
