SparkHiveDataset is incompatible with Databricks Connect V2 #467

Open
alamastor opened this issue Dec 8, 2023 · 2 comments
Labels
help wanted Contribution task, outside help would be appreciated!

Comments

@alamastor
Contributor

Description

SparkHiveDataset.exists raises when called using a Databricks Connect V2 SparkSession.

Using kedro-plugins at commit f59e930, i.e. an unreleased version downstream of #352 (which adds support for DB Connect V2).

This occurs because DB Connect V2 doesn't support accessing _jsparkSession on the SparkSession, but _jsparkSession is used in SparkHiveDataset.exists.

The obvious solution is to replace _get_spark()._jsparkSession.catalog().tableExists(self._database, self._table) with _get_spark().catalog.tableExists(self._database, self._table), although there may be a reason _jsparkSession was used that I'm not aware of.
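For illustration, a minimal sketch of what the change inside SparkHiveDataset._exists could look like (assuming the existing _get_spark() helper and the self._database / self._table attributes referenced above; Catalog.tableExists requires pyspark >= 3.3):

```python
def _exists(self) -> bool:
    # Use the Python Catalog API instead of the JVM-backed _jsparkSession,
    # which Spark Connect (Databricks Connect V2) does not expose.
    return _get_spark().catalog.tableExists(self._table, dbName=self._database)
```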

I'm happy to raise a PR with this change.

Context

Use SparkHiveDataset with Databricks Connect V2.

Steps to Reproduce

  1. Install kedro-plugins from master / a commit downstream of feat(datasets): Add support for databricks-connect>=13.0 #352
  2. Set up Databricks Connect per https://docs.databricks.com/en/dev-tools/databricks-connect/python/install.html
  3. Use a SparkHiveDataset (see the sketch below)
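A minimal reproduction sketch (the database and table names are placeholders; assumes steps 1–2 are done, so the SparkSession resolved by the dataset is a Databricks Connect V2 / Spark Connect session):

```python
from kedro_datasets.spark import SparkHiveDataset

# Placeholder database/table names; any table name works for the existence check.
dataset = SparkHiveDataset(database="default", table="my_table", write_mode="overwrite")

# exists() delegates to _exists(), which currently touches _jsparkSession and
# therefore raises [JVM_ATTRIBUTE_NOT_SUPPORTED] under Databricks Connect V2.
dataset.exists()
```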

Expected Result

The dataset does not raise when _exists is called (this works with Databricks Connect V1).

Actual Result

[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsparkSession` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.
@astrojuanlu astrojuanlu added the Community Issue/PR opened by the open-source community label Dec 8, 2023
@sbrugman
Contributor

I've also encountered this. catalog.tableExists was only introduced in Spark 3.3, so making this change will break some backwards compatibility (the current constraint is pyspark>=2.2). However, the datasets themselves require Python 3.9, which already makes the effective lower bound pyspark>3. I'm in favour of upgrading.
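If keeping support for older pyspark versions were preferred over bumping the lower bound, one possible version-tolerant helper could look roughly like this (a sketch, not the existing implementation):

```python
from pyspark.sql import SparkSession

def table_exists(spark: SparkSession, database: str, table: str) -> bool:
    # pyspark >= 3.3: Catalog.tableExists is available and works on Spark Connect.
    if hasattr(spark.catalog, "tableExists"):
        return spark.catalog.tableExists(table, dbName=database)
    # Older pyspark: fall back to listing the tables in the database.
    return any(t.name == table for t in spark.catalog.listTables(database))
```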

@merelcht
Member

merelcht commented Jul 8, 2024

We welcome PR contributions to fix this!

@merelcht merelcht added the help wanted Contribution task, outside help would be appreciated! label Jul 8, 2024
@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Nov 1, 2024