[BUG] ERROR rapids.tools.qualification: Failed to execute the prediction model #1393

Open
estebanmodmed opened this issue Oct 24, 2024 · 1 comment
Labels
question: Further information is requested
user_tools: Scope the wrapper module running CSP, QualX, and reports (python)

Comments

estebanmodmed commented Oct 24, 2024

Describe the bug
I'm running the qualification tool on an event log generated by a Databricks Workflow Job.

I'm getting the following errors when using the qualification tool:

    Processing...⣟
    2024-10-24 10:48:43,531 ERROR rapids.tools.qualification: Failed to execute the prediction model. Using default speed up of 1.0 for all apps. Reason - KeyError:'startTime'
    ERROR: Could not find elements [('rd-fleet.8xlarge',)]
    2024-10-24 10:48:43,542 ERROR rapids.tools.cluster_inference: Error while inferring cluster: Instance type rd-fleet.8xlarge is not found in catalog.
    Processing...⡿
    2024-10-24 10:48:43,609 ERROR rapids.tools.AdditionalHeuristics: Cannot apply heuristics for qualification. Reason - FileNotFoundError:[Errno 2] No such file or directory: '/Users/username/repos/nvidia-rapids/qual_20241024134808_8B440b4b/rapids_4_spark_qualification_output/raw_metrics/app-20241022192347-0000/stage_level_aggregated_task_metrics.csv'

After the error is thrown, the tool generates the report but indicates there are no compatible apps.
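
For triage, it can help to separate the console output above into one record per logger, since the progress spinner is mixed in with the log text. The following is a minimal sketch, not part of the tool: the record layout (`<timestamp> <LEVEL> <logger>: <message>`) is inferred from the pasted output only, and `split_records` is a hypothetical helper name.

```python
import re

# Assumed record prefix, inferred from the pasted output:
# "<YYYY-MM-DD HH:MM:SS,mmm> <LEVEL> <logger>: "
RECORD = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<logger>[\w.]+): "
)


def split_records(console_text):
    """Split console output into (timestamp, level, logger, message) tuples.

    Text that follows a record without its own timestamp (for example the
    bare "ERROR: Could not find elements ..." line) stays attached to the
    preceding record's message.
    """
    matches = list(RECORD.finditer(console_text))
    records = []
    for i, m in enumerate(matches):
        # A message runs until the next timestamped prefix or the end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(console_text)
        message = console_text[m.end():end]
        # Drop a trailing progress spinner ("Processing..." plus one glyph).
        message = re.sub(r"Processing\.\.\.\S?\s*$", "", message).strip()
        records.append((m["ts"], m["level"], m["logger"], message))
    return records
```

Applied to the output in this report, the sketch yields three records, one each for `rapids.tools.qualification`, `rapids.tools.cluster_inference`, and `rapids.tools.AdditionalHeuristics`, which makes it easier to see that they are three separate errors.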

Steps/Code to reproduce bug

  • Execute qualification tool with the following parameters:

spark_rapids qualification --platform databricks-aws --eventlogs logs/cluster_id/eventlog/cluster_id_10_69_238_61/some_id/eventlog

Expected behavior
No errors and a recommendation about the cluster shape I should use to improve performance.

Environment details (please complete the following information)

  • Environment location: It was executed in my local environment, using logs generated by Databricks AWS, which I previously downloaded to my machine.

Additional context
No additional context.

estebanmodmed added the "? - Needs Triage" and "bug" (Something isn't working) labels Oct 24, 2024
estebanmodmed changed the title from "[BUG]" to "[BUG] ERROR rapids.tools.qualification: Failed to execute the prediction model" Oct 24, 2024
Collaborator

parthosa commented Oct 25, 2024

Hi @estebanmodmed,

  1. It seems the path you provided may be incomplete, which causes the tool to read only part of the event logs. Databricks stores event logs in a rolling manner:

    ls -l logs/<cluster-id>/eventlog/<cluster-id>_<some-id>/<some-id>
    eventlog
    eventlog-2024-02-20--04-50.gz
    eventlog-2024-02-20--05-00.gz
    eventlog-2024-02-20--05-10.gz
    eventlog-2024-02-20--05-20.gz
    

    To fix this, I would recommend using the parent directory instead of pointing directly to a specific eventlog file.

    Recommended command:

    spark_rapids qualification --platform databricks-aws --eventlogs logs/cluster_id/eventlog/cluster_id_10_69_238_61/some_id
    
  2. The application seems to have run on Databricks Fleet instances ('rd-fleet.8xlarge'). Currently, we don't support fleet instances, but we will update our catalog to include them. However, this is mostly a log message and is not related to the tool’s failure.

With the recommended command and path, you should be able to run the tool and get a speedup estimate and a recommendation for the cluster shape.
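
The path fix from item 1 can be checked before running the tool. The sketch below is illustrative only and is not part of `spark_rapids`: `resolve_eventlog_dir` and `rolled_files` are hypothetical helper names, and the layout (a plain `eventlog` file next to `eventlog-<timestamp>.gz` parts) is assumed from the directory listing above.

```python
from pathlib import Path
from typing import List


def resolve_eventlog_dir(path_str: str) -> Path:
    """Return the path to hand to --eventlogs.

    Assumption: a Databricks rolling log directory holds a plain `eventlog`
    file plus `eventlog-<timestamp>.gz` siblings.
    """
    path = Path(path_str)
    # If the caller pointed at the bare `eventlog` file, step up to its
    # parent so the rolled parts in the same directory are included.
    if path.is_file() and path.name == "eventlog":
        return path.parent
    return path


def rolled_files(directory: Path) -> List[Path]:
    """List the rolled `eventlog-*.gz` parts in a directory, sorted by name."""
    return sorted(directory.glob("eventlog-*.gz"))
```

If `rolled_files` returns a non-empty list for the resolved directory, the original single-file path would have skipped those parts, which matches the partial-read behaviour described above.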

amahussein added the "user_tools" and "question" labels and removed the "? - Needs Triage" and "bug" labels Oct 25, 2024