
[BUG] Cannot auto-tune Dataproc on GKE using spark_rapids profiling --cluster #1433

Open
wiwa opened this issue Nov 23, 2024 · 3 comments


wiwa commented Nov 23, 2024

Describe the bug

Following the official examples here, I cannot profile event logs from a Dataproc on GKE cluster due to the following error:

2024-11-23 13:17:23,175 DEBUG rapids.tools.profiling: Processing Rapids plugin Arguments {}
2024-11-23 13:17:23,175 INFO rapids.tools.profiling: Loading GPU cluster properties from file dp.yml
ERROR: Could not find elements [('config', 'masterConfig', 'instanceNames')]
ERROR: Could not find elements [('config', 'masterConfig')]
2024-11-23 13:17:23,177 ERROR rapids.tools.profiling: Failed in processing arguments
2024-11-23 13:17:23,177 ERROR root: Profiling. Raised an error in phase [Process-Arguments]
Traceback (most recent call last):
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 116, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 172, in _process_arguments
    raise ex
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 167, in _process_arguments
    self._process_custom_args()
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/profiling.py", line 59, in _process_custom_args
    self._process_offline_cluster_args()
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/profiling.py", line 66, in _process_offline_cluster_args
    if self._process_gpu_cluster_args(offline_cluster_opts):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/profiling.py", line 73, in _process_gpu_cluster_args
    gpu_cluster_obj = self._create_migration_cluster('GPU', gpu_cluster_arg)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 732, in _create_migration_cluster
    cluster_obj = self.ctxt.platform.load_cluster_by_prop_file(cluster_conf_path)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/cloud_api/sp_types.py", line 856, in load_cluster_by_prop_file
    return self.load_cluster_by_prop(prop_container)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/cloud_api/sp_types.py", line 849, in load_cluster_by_prop
    return self._construct_cluster_from_props(cluster=cluster,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/cloud_api/dataproc.py", line 95, in _construct_cluster_from_props
    return DataprocCluster(self, is_inferred=is_inferred).set_connection(cluster_id=cluster, props=props)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/cloud_api/sp_types.py", line 1043, in set_connection
    self._init_nodes()
  File "./spark-rapids-tools/user_tools/.venv/lib/python3.11/site-packages/spark_rapids_pytools/cloud_api/dataproc.py", line 503, in _init_nodes
    'name': master_nodes_from_conf[0],
            ~~~~~~~~~~~~~~~~~~~~~~^^^
TypeError: 'NoneType' object is not subscriptable

Processing Completed!

Steps/Code to reproduce bug

  1. Create a GPU-accelerated Dataproc on GKE cluster.
  2. Run a PySpark job.
  3. Attempt to profile it using: spark_rapids profiling --cluster <cluster> -p dataproc -v --eventlogs gs://<logs>

This problem persists whether I pass the cluster name or the YAML from gcloud dataproc clusters describe.
I was able to run the tool successfully by omitting the --cluster portion. However, that run gave no cluster recommendations, despite the profiling output containing a large amount of cluster config info.
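For illustration, a minimal sketch of why the lookup fails. The standard-Dataproc layout matches the error messages above; the GKE-style layout is my assumption based on the Dataproc v1 API, where GKE-backed (virtual) clusters are described by virtualClusterConfig rather than config.masterConfig:

    # Sketch only: a standard Dataproc describe output carries
    # config.masterConfig.instanceNames; a Dataproc-on-GKE describe output
    # (assumed shape) does not, so the nested lookup returns None and
    # indexing it raises the TypeError seen in the traceback.
    standard_props = {'config': {'masterConfig': {'instanceNames': ['my-cluster-m']}}}
    gke_props = {'virtualClusterConfig': {'kubernetesClusterConfig': {}}}

    def master_instance_names(props):
        return props.get('config', {}).get('masterConfig', {}).get('instanceNames')

    print(master_instance_names(standard_props))  # ['my-cluster-m']
    print(master_instance_names(gke_props))       # None -> None[0] raises TypeError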

Expected behavior
I expected the tool to work with Dataproc on GKE.

Environment details (please complete the following information)

  • Environment location: Cloud (GCP). Spark runs on Dataproc on GKE; the tool is run locally.

Additional context
I tried a couple of versions of the tool (built from source): v24.10.0, v24.10.1, and v24.08.2. I also tried a plain pip install.
I also tried manually defaulting the master_nodes_from_conf[0] variable, but that only uncovered a string of further failures, all revolving around the cluster config not being detected.
I believe the issue stems partly from the fact that, with Dataproc on GKE, the nodes are expected to be ephemeral.
As a quick "remedy", we can document the lack of "Dataproc on GKE" support for the auto-tuning part of the tool.
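Regarding the manual-defaulting experiment above: a defensive guard along these lines (a sketch only, with illustrative names, not the actual spark-rapids-tools code) would at least turn the TypeError into an actionable error:

    # Sketch of a guard for the lookup that currently fails in _init_nodes
    # (illustrative only; not the actual spark-rapids-tools implementation).
    def require_master_instance(props: dict) -> str:
        master_conf = props.get('config', {}).get('masterConfig')
        if master_conf is None:
            raise ValueError(
                'Cluster properties lack config.masterConfig; this looks like a '
                'Dataproc-on-GKE (virtual) cluster, which profiling --cluster '
                'does not currently support.')
        names = master_conf.get('instanceNames')
        if not names:
            raise ValueError('config.masterConfig.instanceNames is missing or empty.')
        return names[0]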

Possibly relevant issues:

wiwa added the "? - Needs Triage" and "bug" labels on Nov 23, 2024
wiwa changed the title from "[BUG]" to "[BUG] Cannot auto-tune Dataproc on GKE using spark_rapids profiling --cluster" on Nov 23, 2024

wiwa commented Nov 23, 2024

Perhaps we could also allow the user to manually specify the required cluster config? That would also help with the "on-prem" use case.
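For example, something along these lines would be enough to drive the auto-tuner without cloud-resolvable node names (a purely hypothetical shape, not an existing spark_rapids schema):

    # Hypothetical user-supplied cluster shape (illustrative only): just the
    # resources the auto-tuner needs, with no resolvable instance names.
    manual_cluster = {
        'driver': {'numCores': 8, 'memoryMb': 32768},
        'executors': {
            'numWorkers': 4,
            'numCores': 16,
            'memoryMb': 65536,
            'gpu': {'name': 'nvidia-tesla-t4', 'count': 1},
        },
    }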

amahussein added the "user_tools" label on Nov 25, 2024

cindyyuanjiang commented Dec 5, 2024

Hi @wiwa, thanks for raising this issue. I was able to reproduce this in my local tools setup.

We have dropped Dataproc-GKE support for spark_rapids profiling. The supported platforms are:

    -p, --platform=PLATFORM
        Type: Optional[str]
        Default: None
        defines one of the following "onprem", "emr", "dataproc", "databricks-aws", and "databricks-azure".

I saw your command specified -p dataproc. This works with a Dataproc cluster (a cluster name or a properties file), but not with a Dataproc-GKE cluster, because the cluster properties files differ between Dataproc and Dataproc-GKE. For the tools, we treat Dataproc and Dataproc-GKE as two separate platforms.

Do you need Dataproc-GKE support for the spark_rapids profiling tool?

cindyyuanjiang commented

> Perhaps we could also allow the user to manually specify the required cluster config? That would also help with the "on-prem" use case.

Could you elaborate on the "required cluster config"? We may already have support for this.
