[FEA] Tool output file support for writing to HDFS file system #1348

Open
Tracked by #1304
wjxiz1992 opened this issue Sep 18, 2024 · 0 comments
Labels: feature request (New feature or request), user_tools (Scope the wrapper module running CSP, QualX, and reports (python))

Comments

Collaborator

wjxiz1992 commented Sep 18, 2024

Is your feature request related to a problem? Please describe.
This is from a customer request.

Background:
The customer is trying to qualify all of their current Spark jobs. There are so many jobs that they had to write a custom UDF to distribute the qualification work: #1347

In the PR above, the general idea is a two-step process inside the UDF:

  1. Use the Scala qualification code to produce the qualification output.
  2. Call the qualx command line to run predictions on the output files produced in step 1.

Step 1 has proved unworkable: some files that step 2 depends on are not produced by the Scala code alone and also require the Python-side code. It also does not support HDFS as the output filesystem:

spark_rapids_tools.exceptions.InvalidProtocolPrefixError: "hdfs:///..../..../...." is not a valid path since it does not start with "file://"
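
The message above suggests the path is rejected because of its URI scheme. As a rough illustration only, a UDF could detect non-local paths up front with the standard library before calling the tool. `path_scheme` and `is_local_path` are hypothetical helper names and are not part of spark_rapids_tools; treating a path with no scheme as local is also an assumption.

```python
from urllib.parse import urlparse


def path_scheme(path: str) -> str:
    """Return the URI scheme of a path, defaulting to "file" when none is given."""
    # urlparse("hdfs:///a/b").scheme is "hdfs"; a bare "/a/b" has an empty scheme.
    return urlparse(path).scheme or "file"


def is_local_path(path: str) -> bool:
    """True when the path has no scheme or an explicit file:// scheme."""
    return path_scheme(path) == "file"
```

With a check like this, the UDF could fail early or fall back to local staging whenever `is_local_path` returns False.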

Step 2 only accepts the local file system, which is inconvenient for distributed work in their production environment.

This issue therefore asks for full-chain support for the HDFS file system. Once the feature is supported, the command lines could look like this:

spark_rapids qualification \
--platform onprem \
--eventlogs hdfs:///<PATH_TO_EVENTLOG_FILE> \
--output_folder hdfs:///<PATH_TO_OUTPUT_PATH>


spark_rapids prediction \
--platform onprem \
--qual_output hdfs:///<PATH_TO_OUTPUT_PATH_PRODUCED_IN_QUALIFICATION> \
--output_folder hdfs:///<PATH_TO_OUTPUT_PREDICTIONS>
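
Until HDFS paths are accepted directly, the prediction step could be approximated by staging files through local disk. The sketch below only builds the command lines and does not run them. The `spark_rapids prediction` flags are copied from the command in this issue, `hdfs dfs -get` and `hdfs dfs -put` are the standard Hadoop shell commands, and `staged_prediction_commands` plus the local directory layout are hypothetical. Passing plain local paths to `--qual_output` and `--output_folder` is an assumption.

```python
from typing import List


def staged_prediction_commands(
    hdfs_qual_output: str, hdfs_predictions: str, local_dir: str
) -> List[List[str]]:
    """Build the three commands needed to run prediction via local staging."""
    # Hypothetical local layout under a scratch directory on the executor.
    local_qual = f"{local_dir}/qual_output"
    local_pred = f"{local_dir}/predictions"
    return [
        # 1. Copy the qualification output from HDFS to local disk.
        ["hdfs", "dfs", "-get", hdfs_qual_output, local_qual],
        # 2. Run prediction against the local copy (flags as used in this issue).
        ["spark_rapids", "prediction", "--platform", "onprem",
         "--qual_output", local_qual, "--output_folder", local_pred],
        # 3. Copy the predictions back to HDFS.
        ["hdfs", "dfs", "-put", local_pred, hdfs_predictions],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order. This adds two copies per job, which is the overhead that native HDFS support would remove.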

This should contribute to #1304

cc @Heatao @winningsix

amahussein added the user_tools (Scope the wrapper module running CSP, QualX, and reports (python)) label Sep 18, 2024
tgravescs changed the title from "[FEA] Full chain support for HDFS file system" to "[FEA] Tool output file support for writing to HDFS file system" Sep 18, 2024