SparkHiveDataset is incompatible with Databricks Connect V2 #467

Open
alamastor opened this issue Dec 8, 2023 · 2 comments
Labels
help wanted Contribution task, outside help would be appreciated!

Comments

@alamastor
Contributor

Description

SparkHiveDataset.exists raises when called using a Databricks Connect V2 SparkSession.

Using kedro-plugins at commit f59e930, i.e. an unreleased version downstream of #352 (which adds support for DB Connect V2).

This occurs because DB Connect V2 doesn't support accessing _jsparkSession on the SparkSession, but _jsparkSession is used in SparkHiveDataset.exists.

The obvious solution is to replace _get_spark()._jsparkSession.catalog().tableExists(self._database, self._table) with _get_spark().catalog.tableExists(self._database, self._table), although there may be a reason _jsparkSession was used that I'm not aware of.
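For illustration, a minimal sketch of what the change inside SparkHiveDataset._exists could look like (assuming the existing _get_spark() helper and the self._database / self._table attributes referenced above; Catalog.tableExists requires pyspark >= 3.3):

```python
def _exists(self) -> bool:
    # Use the Python Catalog API instead of the JVM-backed _jsparkSession,
    # which Spark Connect (Databricks Connect V2) does not expose.
    return _get_spark().catalog.tableExists(self._table, dbName=self._database)
```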

I'm happy to raise a PR with this change.

Context

Use SparkHiveDataset with Databricks Connect V2.

Steps to Reproduce

  1. Install kedro-plugins from master / a commit downstream of feat(datasets): Add support for databricks-connect>=13.0 #352
  2. Set up Databricks Connect per https://docs.databricks.com/en/dev-tools/databricks-connect/python/install.html
  3. Use a SparkHiveDataset (see the sketch below)
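A minimal reproduction sketch (the database and table names are placeholders; assumes steps 1–2 are done, so the SparkSession resolved by the dataset is a Databricks Connect V2 / Spark Connect session):

```python
from kedro_datasets.spark import SparkHiveDataset

# Placeholder database/table names; any table name works for the existence check.
dataset = SparkHiveDataset(database="default", table="my_table", write_mode="overwrite")

# exists() delegates to _exists(), which currently touches _jsparkSession and
# therefore raises [JVM_ATTRIBUTE_NOT_SUPPORTED] under Databricks Connect V2.
dataset.exists()
```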

Expected Result

The dataset does not raise when _exists is called (this works with Databricks Connect V1).

Actual Result

[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsparkSession` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.
@astrojuanlu astrojuanlu added the Community Issue/PR opened by the open-source community label Dec 8, 2023
@sbrugman
Contributor

I've also encountered this. catalog.tableExists was only introduced in Spark 3.3, so making this change will break some backwards compatibility (the current constraint is pyspark>=2.2). However, the datasets themselves require Python 3.9, which already makes the effective lower bound pyspark>3. I'm in favour of upgrading.
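If keeping support for older pyspark versions were preferred over bumping the lower bound, one possible version-tolerant helper could look roughly like this (a sketch, not the existing implementation):

```python
from pyspark.sql import SparkSession

def table_exists(spark: SparkSession, database: str, table: str) -> bool:
    # pyspark >= 3.3: Catalog.tableExists is available and works on Spark Connect.
    if hasattr(spark.catalog, "tableExists"):
        return spark.catalog.tableExists(table, dbName=database)
    # Older pyspark: fall back to listing the tables in the database.
    return any(t.name == table for t in spark.catalog.listTables(database))
```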

@merelcht
Member

merelcht commented Jul 8, 2024

We welcome PR contributions to fix this!

@merelcht merelcht added the help wanted Contribution task, outside help would be appreciated! label Jul 8, 2024
@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Nov 1, 2024