
[BUG] Cannot auto-tune Dataproc on GKE using spark_rapids profiling --cluster #1433

Open
wiwa opened this issue Nov 23, 2024 · 3 comments


wiwa commented Nov 23, 2024

Describe the bug

Following the official examples here, I cannot profile event logs from a Dataproc on GKE cluster due to the following error:

2024-11-23 13:17:23,175 DEBUG rapids.tools.profiling: Processing Rapids plugin Arguments {}
2024-11-23 13:17:23,175 INFO rapids.tools.profiling: Loading GPU cluster properties from file dp.yml
ERROR: Could not find elements [('config', 'masterConfig', 'instanceNames')]
ERROR: Could not find elements [('config', 'masterConfig')]
2024-11-23 13:17:23,177 ERROR rapids.tools.profiling: Failed in processing arguments
2024-11-23 13:17:23,177 ERROR root: Profiling. Raised an error in phase [Process-Arguments]
Traceback (most recent call last):
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 116, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 172, in _process_arguments
    raise ex
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 167, in _process_arguments
    self._process_custom_args()
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/profiling.py", line 59, in _process_custom_args
    self._process_offline_cluster_args()
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/profiling.py", line 66, in _process_offline_cluster_args
    if self._process_gpu_cluster_args(offline_cluster_opts):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/profiling.py", line 73, in _process_gpu_cluster_args
    gpu_cluster_obj = self._create_migration_cluster('GPU', gpu_cluster_arg)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 732, in _create_migration_cluster
    cluster_obj = self.ctxt.platform.load_cluster_by_prop_file(cluster_conf_path)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/cloud_api/sp_types.py", line 856, in load_cluster_by_prop_file
    return self.load_cluster_by_prop(prop_container)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/cloud_api/sp_types.py", line 849, in load_cluster_by_prop
    return self._construct_cluster_from_props(cluster=cluster,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/cloud_api/dataproc.py", line 95, in _construct_cluster_from_props
    return DataprocCluster(self, is_inferred=is_inferred).set_connection(cluster_id=cluster, props=props)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/cloud_api/sp_types.py", line 1043, in set_connection
    self._init_nodes()
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/cloud_api/dataproc.py", line 503, in _init_nodes
    'name': master_nodes_from_conf[0],
            ~~~~~~~~~~~~~~~~~~~~~~^^^
TypeError: 'NoneType' object is not subscriptable

Processing Completed!

Steps/Code to reproduce bug

  1. Create a GPU-accelerated Dataproc on GKE cluster.
  2. Run a PySpark job.
  3. Attempt to profile it using: spark_rapids profiling --cluster <cluster> -p dataproc -v --eventlogs gs://<logs>

This problem persists whether I pass the cluster name or the YAML from gcloud dataproc clusters describe.
I was able to run the tool successfully by omitting the --cluster portion. However, that run gave no cluster recommendations, despite the profiling output containing a large amount of cluster config info.
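For illustration, a minimal sketch of why the lookup fails. The standard-Dataproc layout matches the error messages above; the GKE-style layout is my assumption based on the Dataproc v1 API, where GKE-backed (virtual) clusters are described by virtualClusterConfig rather than config.masterConfig:

    # Sketch only: a standard Dataproc describe output carries
    # config.masterConfig.instanceNames; a Dataproc-on-GKE describe output
    # (assumed shape) does not, so the nested lookup returns None and
    # indexing it raises the TypeError seen in the traceback.
    standard_props = {'config': {'masterConfig': {'instanceNames': ['my-cluster-m']}}}
    gke_props = {'virtualClusterConfig': {'kubernetesClusterConfig': {}}}

    def master_instance_names(props):
        return props.get('config', {}).get('masterConfig', {}).get('instanceNames')

    print(master_instance_names(standard_props))  # ['my-cluster-m']
    print(master_instance_names(gke_props))       # None -> None[0] raises TypeError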

Expected behavior
I expected the tool to work with Dataproc on GKE.

Environment details (please complete the following information)

  • Environment location: Cloud (GCP). Spark runs on Dataproc on GKE; the tool is run locally.

Additional context
I tried a couple of versions of the tool (built from source): v24.10.0, v24.10.1, and v24.08.2. I also tried a plain pip install.
I also tried manually defaulting the master_nodes_from_conf[0] variable, but that only uncovered a string of further failures, all revolving around the cluster config not being detected.
I believe the issue stems partly from the fact that, with Dataproc on GKE, the nodes are expected to be ephemeral.
As a quick "remedy", we can document the lack of "Dataproc on GKE" support for the auto-tuning part of the tool.
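Regarding the manual-defaulting experiment above: a defensive guard along these lines (a sketch only, with illustrative names, not the actual spark-rapids-tools code) would at least turn the TypeError into an actionable error:

    # Sketch of a guard for the lookup that currently fails in _init_nodes
    # (illustrative only; not the actual spark-rapids-tools implementation).
    def require_master_instance(props: dict) -> str:
        master_conf = props.get('config', {}).get('masterConfig')
        if master_conf is None:
            raise ValueError(
                'Cluster properties lack config.masterConfig; this looks like a '
                'Dataproc-on-GKE (virtual) cluster, which profiling --cluster '
                'does not currently support.')
        names = master_conf.get('instanceNames')
        if not names:
            raise ValueError('config.masterConfig.instanceNames is missing or empty.')
        return names[0]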

Possibly relevant issues:

wiwa added the "? - Needs Triage" and "bug" labels on Nov 23, 2024
wiwa changed the title from "[BUG]" to "[BUG] Cannot auto-tune Dataproc on GKE using spark_rapids profiling --cluster" on Nov 23, 2024

wiwa commented Nov 23, 2024

Perhaps we could also allow the user to manually specify the required cluster config? That would also help with the "on-prem" use case.
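For example, something along these lines would be enough to drive the auto-tuner without cloud-resolvable node names (a purely hypothetical shape, not an existing spark_rapids schema):

    # Hypothetical user-supplied cluster shape (illustrative only): just the
    # resources the auto-tuner needs, with no resolvable instance names.
    manual_cluster = {
        'driver': {'numCores': 8, 'memoryMb': 32768},
        'executors': {
            'numWorkers': 4,
            'numCores': 16,
            'memoryMb': 65536,
            'gpu': {'name': 'nvidia-tesla-t4', 'count': 1},
        },
    }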

amahussein added the "user_tools" label on Nov 25, 2024

cindyyuanjiang commented Dec 5, 2024

Hi @wiwa, thanks for raising this issue. I was able to reproduce this in my local tools setup.

We have dropped Dataproc-GKE support for spark_rapids profiling. The supported platforms are:

    -p, --platform=PLATFORM
        Type: Optional[str]
        Default: None
        defines one of the following "onprem", "emr", "dataproc", "databricks-aws", and "databricks-azure".

I saw your command specified -p dataproc. This works with a Dataproc cluster (a cluster name or a properties file), but not with a Dataproc-GKE cluster, because the cluster properties files differ between Dataproc and Dataproc-GKE. For the tools, we treat Dataproc and Dataproc-GKE as two separate platforms.

Do you need Dataproc-GKE support for the spark_rapids profiling tool?

cindyyuanjiang commented

> Perhaps we could also allow the user to manually specify the required cluster config? That would also help with the "on-prem" use case.

Could you elaborate on the "required cluster config"? We may already have support for this.
