[WIP][SPARK-52011][SQL] Reduce HDFS NameNode RPC on vectorized Parquet reader #50765

pan3793 · 2025-04-30T09:09:47Z

What changes were proposed in this pull request?

On a busy Hadoop cluster, GetFileInfo and GetBlockLocations account for most of the RPCs sent to the HDFS NameNode. After investigating the Spark vectorized Parquet reader, I think 3 of the 4 RPCs it issues per file can be eliminated.

Currently, the vectorized Parquet reader issues 4 NameNode RPCs when reading each file (or split):

Read the footer - one GetFileInfo and one GetBlockLocations
Read the data (row groups) - one GetFileInfo and one GetBlockLocations

The key idea of this PR is:

The driver already knows the FileStatus of each Parquet file from the planning phase. We can pass the FileStatus from the driver to the executors via PartitionedFile so that tasks don't need to ask the NameNode again. This saves two GetFileInfo RPCs.
Reuse the SeekableInputStream when reading the footer and the row groups. This saves one GetBlockLocations RPC.
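The RPC accounting above can be sketched as a toy model. This is not Spark or Parquet code: FakeNameNode, baseline and proposed are hypothetical names introduced only to count calls under the assumptions stated in the description (two GetFileInfo and two GetBlockLocations per file today, one GetBlockLocations after the change).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the per-file RPC accounting; all names here are hypothetical. */
class NameNodeRpcModel {

    /** Stand-in for a NameNode that does nothing except count calls per RPC type. */
    static final class FakeNameNode {
        final Map<String, Integer> calls = new HashMap<>();

        void getFileInfo(String path) {
            calls.merge("GetFileInfo", 1, Integer::sum);
        }

        void getBlockLocations(String path) {
            calls.merge("GetBlockLocations", 1, Integer::sum);
        }

        int total() {
            return calls.values().stream().mapToInt(Integer::intValue).sum();
        }
    }

    /** Current behavior as described: footer and row-group reads each stat and open the file. */
    static int baseline(FakeNameNode nn, String path) {
        nn.getFileInfo(path);        // footer: stat the file
        nn.getBlockLocations(path);  // footer: open a stream
        nn.getFileInfo(path);        // row groups: stat the file again
        nn.getBlockLocations(path);  // row groups: open a second stream
        return nn.total();
    }

    /** Proposed behavior: the file status travels with the task and one stream is reused. */
    static int proposed(FakeNameNode nn, String path) {
        nn.getBlockLocations(path);  // single open, shared by footer and row-group reads
        return nn.total();
    }

    public static void main(String[] args) {
        System.out.println("baseline RPCs: " + baseline(new FakeNameNode(), "/some/file.parquet"));
        System.out.println("proposed RPCs: " + proposed(new FakeNameNode(), "/some/file.parquet"));
    }
}
```

Under these assumptions the model counts 4 RPCs for the baseline and 1 for the proposed path, which is where the "3 of 4" figure comes from.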

TODO: The PR currently requires some changes on the Parquet side first.

Why are the changes needed?

Reduce unnecessary NameNode RPCs to improve performance and stability on large Hadoop clusters.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested on a small Hadoop cluster. The test runs TPC-H Q4 against sf3000 Parquet tables.

HDFS NameNode metrics (master vs. this PR)

HDFS NameNode audit logs:

Take the file part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet as an example; it is expected to be divided into 3 splits:

$ hadoop fs -ls /warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet
-rwxr-xr-x   3 hadoop supergroup  283071739 2025-04-04 01:37 /warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet
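As a rough check on the expected split count, the arithmetic can be sketched as below. The 128 MiB split size is an assumption (it is the default of spark.sql.files.maxPartitionBytes; the value actually used in this test is not stated), and SplitCount is a hypothetical helper, not a Spark API.

```java
/** Ceiling division of a file size by an assumed split size; hypothetical helper. */
class SplitCount {

    static long splits(long fileSizeBytes, long splitSizeBytes) {
        // Round up so that a trailing partial chunk still counts as a split.
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long fileSize = 283071739L;            // size reported by `hadoop fs -ls` above
        long splitSize = 128L * 1024 * 1024;   // assumed split size: 128 MiB
        System.out.println(splits(fileSize, splitSize)); // prints 3
    }
}
```

With these numbers the file spans two full 128 MiB chunks plus a remainder of about 14 MB, which matches the three task IDs seen in the audit logs.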

Before

$ cat hdfs-audit.log | grep part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet | grep application_1743671377509_0064
2025-04-30 16:34:23,533 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.85	cmd=open	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0064_JId_1_SId_1_0_TId_3389_0
2025-04-30 16:34:23,546 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.85	cmd=getfileinfo	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0064_JId_1_SId_1_0_TId_3389_0
2025-04-30 16:34:23,547 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.85	cmd=open	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0064_JId_1_SId_1_0_TId_3389_0
2025-04-30 16:34:23,560 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.86	cmd=open	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0064_JId_1_SId_1_0_TId_3392_0
2025-04-30 16:34:23,585 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.86	cmd=getfileinfo	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0064_JId_1_SId_1_0_TId_3392_0
2025-04-30 16:34:23,586 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.86	cmd=open	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0064_JId_1_SId_1_0_TId_3392_0
2025-04-30 16:35:01,762 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.85	cmd=open	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0064_JId_1_SId_1_0_TId_5328_0
2025-04-30 16:35:01,769 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.85	cmd=getfileinfo	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0064_JId_1_SId_1_0_TId_5328_0
2025-04-30 16:35:01,770 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.85	cmd=open	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0064_JId_1_SId_1_0_TId_5328_0

After

$ cat hdfs-audit.log | grep part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet | grep application_1743671377509_0065
2025-04-30 16:39:29,684 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.86	cmd=open	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0065_JId_0_SId_0_0_TId_3387_0
2025-04-30 16:39:29,702 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.85	cmd=open	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0065_JId_0_SId_0_0_TId_3388_0
2025-04-30 16:40:08,547 INFO FSNamesystem.audit: allowed=true	ugi=hadoop (auth:SIMPLE)	ip=/10.45.133.86	cmd=open	src=/warehouse/tpch_3t_hive_parquet.db/lineitem/part-01027-419c80f3-8921-4ed3-b31a-0fe72b9c6732-c000.zstd.parquet	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_application_1743671377509_0065_JId_0_SId_0_0_TId_5342_0
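The two log excerpts can be compared by tallying entries per cmd value. The sketch below is a minimal illustration, assuming the audit log fields are tab-separated key=value pairs as in the excerpts; AuditTally and its line helper are hypothetical, and the sample lines use a shortened src path rather than the real one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts HDFS audit log entries per cmd value; hypothetical helper. */
class AuditTally {

    /** Each audit entry carries tab-separated key=value fields; pick out cmd=... */
    static Map<String, Integer> countByCmd(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String field : line.split("\t")) {
                if (field.startsWith("cmd=")) {
                    counts.merge(field.substring("cmd=".length()), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Builds a sample entry shaped like the excerpts above, with a shortened src path. */
    static String line(String cmd) {
        return "2025-04-30 16:34:23,533 INFO FSNamesystem.audit: allowed=true"
            + "\tugi=hadoop (auth:SIMPLE)\tip=/10.45.133.85\tcmd=" + cmd
            + "\tsrc=/shortened/path.parquet\tdst=null\tperm=null\tproto=rpc";
    }

    public static void main(String[] args) {
        // Before: each of the 3 tasks logs open, getfileinfo, open.
        List<String> before = new ArrayList<>();
        for (int task = 0; task < 3; task++) {
            before.add(line("open"));
            before.add(line("getfileinfo"));
            before.add(line("open"));
        }
        // After: each of the 3 tasks logs a single open.
        List<String> after = new ArrayList<>();
        for (int task = 0; task < 3; task++) {
            after.add(line("open"));
        }
        System.out.println(countByCmd(before)); // {getfileinfo=3, open=6}
        System.out.println(countByCmd(after));  // {open=3}
    }
}
```

For the excerpts above this gives 9 entries before (6 open, 3 getfileinfo) and 3 entries after (3 open), that is, one open per split.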

The patch has also been deployed to a production cluster, where 95% of workloads are Spark jobs.

Before

After

Was this patch authored or co-authored using generative AI tooling?

No.

pan3793 · 2025-04-30T09:49:57Z

...java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java

    ParquetFileReader fileReader;
-    if (fileFooter.isDefined()) {
-      fileReader = new ParquetFileReader(configuration, file, fileFooter.get());


This constructor internally calls HadoopInputFile.fromPath(file, configuration), which issues an unnecessary GetFileInfo RPC:

    public static HadoopInputFile fromPath(Path path, Configuration conf) throws IOException {
      FileSystem fs = path.getFileSystem(conf);
      return new HadoopInputFile(fs, fs.getFileStatus(path), conf);
    }

pan3793 · 2025-05-06T02:13:42Z

cc @sunchao @wangyum @wgtmac I marked this PR as a draft because it requires changes on the Parquet side first. It would be great if you could take a look at the idea in the meantime. Thank you in advance.

wangyum · 2025-05-06T03:13:43Z

also cc @turboFei

pan3793 · 2025-05-06T10:08:47Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala

+        val footerFilter = ParquetFooterReader.footerFilter(
+          sharedConf, file, ParquetFooterReader.WITH_ROW_GROUPS)
+        val footer = ParquetFooterReader.readFooter(
+          hadoopInputFile, fileInputStream, footerFilter)


TODO: close the reader but keep the fileInputStream open, waiting for apache/parquet-java#3208

github-actions bot added the SQL label Apr 30, 2025

pan3793 mentioned this pull request Apr 30, 2025

GH-3141: Add constructor to ParquetFileReader to allow passing in parquet footer apache/parquet-java#3165

Open

pan3793 marked this pull request as draft May 6, 2025 02:13

pan3793 changed the title ~~[WIP] Reduce HDFS NameNode RPC on vectorized Parquet reader~~ [WIP][SPARK-52011][SQL] Reduce HDFS NameNode RPC on vectorized Parquet reader May 6, 2025

Reduce HDFS NameNode RPC on vectorized Parquet reader

9b1df55

pan3793 force-pushed the nn-rpc branch from 7caef51 to 9b1df55 Compare July 3, 2025 15:42
