SparkHiveDataset.exists raises when called using a Databricks Connect V2 SparkSession.
Using kedro-plugins commit f59e930, i.e. an unreleased version, downstream of #352 (which adds support for DB Connect V2).
This occurs because DB Connect V2 doesn't support accessing _jsparkSession on the SparkSession, but SparkHiveDataset.exists relies on it.
The obvious solution is to replace _get_spark()._jsparkSession.catalog().tableExists(self._database, self._table) with _get_spark().catalog.tableExists(self._database, self._table); however, there may be a reason _jsparkSession was used that I'm not aware of.
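A minimal sketch of that replacement, assuming a session whose public `catalog` exposes `tableExists` (PySpark 3.3 or later). `table_exists` is a hypothetical helper name, not part of kedro-datasets. One detail worth checking before swapping the calls: PySpark's public `Catalog.tableExists` takes `tableName` first and `dbName` second, the reverse of the JVM call, so the sketch passes keyword arguments.

```python
def table_exists(spark, database: str, table: str) -> bool:
    """Check whether `database.table` exists using only the public catalog API.

    `spark` is expected to be a SparkSession (classic or Spark Connect).
    Nothing here touches `spark._jsparkSession`, which Spark Connect rejects.
    """
    # Keyword arguments guard against the argument-order difference between
    # the JVM catalog (dbName, tableName) and PySpark (tableName, dbName).
    return bool(spark.catalog.tableExists(tableName=table, dbName=database))
```

Inside the dataset this would presumably be called as `table_exists(_get_spark(), self._database, self._table)`; whether that matches the existing behaviour in every edge case is not verified here.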
Expected Result
The dataset doesn't raise when calling _exists (this works with Databricks Connect V1).
Actual Result
[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsparkSession` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.
I've also encountered this. catalog.tableExists was only introduced in Spark 3.3, so making this change will break some backwards compatibility (the current constraint is pyspark>=2.2). The datasets themselves require Python 3.9, which makes the effective lower bound pyspark>3 already. I'm in favour of upgrading.
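If keeping older PySpark versions working matters, one possible compromise is to prefer `catalog.tableExists` when it is available and otherwise fall back to `catalog.listTables`, which has been part of the public API since Spark 2.0. The sketch below is an illustration only: `table_exists_compat` is a hypothetical name, and the case-insensitive name comparison is an assumption about how Hive table names should be matched.

```python
def table_exists_compat(spark, database: str, table: str) -> bool:
    """Table-existence check that avoids `_jsparkSession` on any PySpark version."""
    catalog = spark.catalog
    if hasattr(catalog, "tableExists"):
        # PySpark >= 3.3: direct public call (tableName first, dbName second).
        return bool(catalog.tableExists(tableName=table, dbName=database))
    # Older PySpark: scan the database's tables instead.
    # Note: listTables raises if the database itself does not exist.
    wanted = table.lower()  # assumption: compare names case-insensitively
    return any(t.name.lower() == wanted for t in catalog.listTables(database))
```

The fallback is slower on databases with many tables, which is one argument for simply raising the pyspark lower bound instead.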
Description
SparkHiveDataset.exists raises when called using a Databricks Connect V2 SparkSession.
Using kedro-plugins commit f59e930, i.e. an unreleased version, downstream of #352 (which adds support for DB Connect V2).
This occurs because DB Connect V2 doesn't support accessing _jsparkSession on the SparkSession, but it's used in SparkHiveDataset.exists.
The obvious solution is to replace _get_spark()._jsparkSession.catalog().tableExists(self._database, self._table) with _get_spark().catalog.tableExists(self._database, self._table); however, there may be a reason _jsparkSession was used that I'm not aware of.
I'm happy to raise a PR with this change.
Context
Use SparkHiveDataset with Databricks Connect V2.
Steps to Reproduce
kedro-plugins from master / a commit downstream of "feat(datasets): Add support for databricks-connect>=13.0" (#352)
SparkHiveDataset
Expected Result
The dataset doesn't raise when calling _exists (this works with Databricks Connect V1).
Actual Result
[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsparkSession` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.