Add support for Spark Connect (SQL models) #899

vakarisbk · 2023-10-03T14:40:44Z

partially resolves #814
docs dbt-labs/docs.getdbt.com/#

Problem

dbt-spark has limited options for integrating with open-source Spark. Currently, the only way to run dbt against open-source Spark in production is through a Thrift connection. However, a Thrift connection isn't suitable for all use cases; for instance, it doesn't support Thrift over HTTP. In addition, the PyHive project, which the dbt thrift method relies on, is unsupported (at least according to its GitHub page).

Solution

Introduce support for Spark Connect as a connection method (for SQL models only).

Checklist

I have read the contributing guide and understand what's expected of me
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

How to test locally?

Follow the instructions in the Spark documentation to download a Spark distribution: https://spark.apache.org/docs/latest/spark-connect-overview.html
Start the Spark Connect server with the Hive metastore enabled: ./start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0 --conf spark.sql.catalogImplementation=hive
Add the Spark Connect configuration to your profiles.yml:

spark_connect:
  outputs:
    dev:
      host: localhost
      method: connect
      port: 15002
      schema: default
      type: spark
  target: dev
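
For context, here is a minimal sketch of how the `host` and `port` fields in the profile above could map onto a Spark Connect session in PySpark. The helper names `build_connect_url` and `open_session` are hypothetical and are not taken from this PR; the `sc://host:port` URL scheme and `SparkSession.builder.remote` come from PySpark 3.4+.

```python
def build_connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect URL (sc://host:port) from profile fields.

    Hypothetical helper for illustration; not part of dbt-spark.
    """
    return f"sc://{host}:{port}"


def open_session(host: str, port: int = 15002):
    """Open a Spark Connect session.

    Assumes pyspark with the connect extras is installed and that a
    Spark Connect server is listening on host:port.
    """
    # Imported lazily so build_connect_url works without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(build_connect_url(host, port)).getOrCreate()


if __name__ == "__main__":
    # Mirrors the `dev` output above: host localhost, port 15002.
    print(build_connect_url("localhost", 15002))
```

With the server from the previous step running, `open_session("localhost").sql("select 1").show()` would be a quick way to confirm the endpoint is reachable before pointing dbt at it.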

Known issues: #901

cla-bot · 2023-10-03T14:40:47Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Vakaris.
This is most likely caused by a git client misconfiguration; please make sure to:

Check if your git client is configured with an email to sign commits: git config --list | grep email
If not, set it up using: git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot · 2023-10-03T17:13:35Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Vakaris.
This is most likely caused by a git client misconfiguration; please make sure to:

Check if your git client is configured with an email to sign commits: git config --list | grep email
If not, set it up using: git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

ben-schreiber · 2024-02-06T14:08:58Z

setup.py

@@ -59,7 +59,16 @@ def _get_dbt_core_version():
    "thrift>=0.11.0,<0.17.0",
 ]
 session_extras = ["pyspark>=3.0.0,<4.0.0"]
-all_extras = odbc_extras + pyhive_extras + session_extras
+connect_extras = [
+    "pyspark==3.5.0",


Can we support pyspark>=3.4.0,<4, or at least pyspark>=3.5.0,<4?

Added pyspark>=3.5.0,<4.
The Connect module in 3.4.0 has an issue where temporary views are not shared between queries: if one dbt query creates a temp view, another query cannot see it. I can't find the corresponding Spark issue number right now.
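
To illustrate the range agreed on in this thread, `pyspark>=3.5.0,<4`, here is a small sketch of the check. `satisfies_connect_range` is a hypothetical helper that only handles plain numeric versions such as `3.5.0`; actual resolution is done by pip from the specifier in setup.py.

```python
def parse_version(version: str) -> tuple:
    """Parse a plain numeric version ('3.5' or '3.5.0') into a 3-tuple of ints."""
    parts = [int(part) for part in version.split(".")]
    # Pad short versions such as '3.5' to (3, 5, 0).
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def satisfies_connect_range(version: str) -> bool:
    """Return True if version falls within pyspark>=3.5.0,<4."""
    return (3, 5, 0) <= parse_version(version) < (4, 0, 0)
```

Under this check, 3.4.x releases (where the temp view issue was seen) are rejected, while any 3.5+ release below 4 is accepted.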

vakarisbk · 2024-02-17T18:41:32Z

Seeing as there is some recent activity on issue #814, and knowing that at least a couple of people are actively using this fork, I've updated it. I'm looking forward to any insights on the implementation, as well as on the likelihood of this PR getting merged.

vakarisbk force-pushed the main branch from f251090 to e79671e Compare October 3, 2023 17:40

cla-bot bot added the cla:yes label Oct 3, 2023

vakarisbk force-pushed the main branch from e79671e to 38bfada Compare October 3, 2023 19:05

vakarisbk mentioned this pull request Oct 4, 2023

[ADAP-931] [Bug] Values in seeds that should convert to null aren't working for session connection method #901

Open

2 tasks

vakarisbk changed the title ~~[WIP] Add support for Spark Connect (SQL models)~~ Add support for Spark Connect (SQL models) Oct 4, 2023

vakarisbk marked this pull request as ready for review October 4, 2023 16:23

vakarisbk requested a review from a team as a code owner October 4, 2023 16:23

vakarisbk requested a review from VersusFacit October 4, 2023 16:23

vakarisbk mentioned this pull request Oct 4, 2023

[ADAP-658] [Feature] Spark Connect as connection method #814

Open

3 tasks

ben-schreiber reviewed Feb 6, 2024

Add spark-connect connection method

4e7f5d5

vakarisbk force-pushed the main branch from 726e359 to 4e7f5d5 Compare February 17, 2024 18:19

fix requirements and githubci

091132d
