
[Bug] spark_catalog requires a single-part namespace in dbt python incremental model #1300

Open
carlos-veris opened this issue Jul 23, 2024 · 0 comments
Labels
bug Something isn't working python_models

Comments

carlos-veris commented Jul 23, 2024
carlos-veris commented Jul 23, 2024

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When running a dbt Python model with an incremental strategy and using the dbt.this property to reference the current model's relation, the code breaks.

Here's the faulty code:

# Process new rows only
if dbt.is_incremental:
    # only new rows compared to max in current table
    max_from_this = f"select max(created_at) from {dbt.this}"
    df = df.filter(df.created_at >= session.sql(max_from_this).collect()[0][0])

Here's the error output:

df = df.filter(df.created_at >= session.sql(max_from_this).collect()[0][0])
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1034, in sql
File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 196, in deco
pyspark.sql.utils.AnalysisException: spark_catalog requires a single-part namespace, but got [x, y]

Here x is the GCP project name and y is the dataset name.
The project uses the dbt-bigquery adapter (v1.7.2) and submits the Python model through Dataproc.
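The error is consistent with str(dbt.this) rendering a three-part BigQuery identifier (project.dataset.table), while Spark's built-in spark_catalog resolves at most a two-part database.table name. A minimal, dbt-free sketch of the mismatch (all names below are hypothetical):

```python
# str(dbt.this) for a BigQuery model renders roughly like this
# (project, dataset, and table names here are hypothetical):
relation = "`my-project`.`my_dataset`.`my_table`"

# Split into namespace + table the way a catalog would:
parts = [p.strip("`") for p in relation.split(".")]
namespace, table = parts[:-1], parts[-1]

# spark_catalog accepts only a single-part namespace, so the
# two-part namespace ['my-project', 'my_dataset'] is what surfaces
# in the error as "... but got [x, y]".
print(namespace)  # ['my-project', 'my_dataset']
print(table)      # my_table
```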

Expected Behavior

Referencing dbt.this inside an incremental Python model should work, so that the incremental filter can query the existing table.

Steps To Reproduce

  1. Create a Python model using the dbt-bigquery adapter.
  2. Inside the model() function, set the materialized property of dbt.config to "incremental":

def model(dbt, session):
    dbt.config(
        materialized="incremental",
        dataproc_region=<DATAPROC_REGION>,
        submission_method=<SUBMISSION_METHOD>
    )

  3. Use the dbt.this property in a query against the existing table:

max_from_this = f"select max(created_at) from {dbt.this}"
df = df.filter(df.created_at >= session.sql(max_from_this).collect()[0][0])

  4. Run the model with dbt run.
  5. Check the logs of the Dataproc batch in the Google Cloud Console.
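One possible workaround (a sketch, not verified against this exact setup) is to avoid session.sql over the three-part name and instead load the existing table through the Spark BigQuery connector, which accepts a project.dataset.table string directly. The helper below only normalizes the rendered relation; the commented-out session.read call assumes the bigquery data source that Dataproc serverless ships with:

```python
def bq_table_id(rendered_relation: str) -> str:
    """Strip backtick quoting from a rendered dbt relation so the
    result can be passed to the Spark BigQuery connector's 'table'
    option as project.dataset.table."""
    return ".".join(p.strip("`") for p in rendered_relation.split("."))

# Hypothetical usage inside the model body (requires the bigquery
# data source to be available on the Dataproc cluster/batch):
# if dbt.is_incremental:
#     existing = session.read.format("bigquery") \
#         .option("table", bq_table_id(str(dbt.this))).load()
#     max_created = existing.agg({"created_at": "max"}).collect()[0][0]
#     df = df.filter(df.created_at >= max_created)
```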

Relevant log output

Using the default container image
Waiting for container log creation
PYSPARK_PYTHON=/opt/dataproc/conda/bin/python
JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64
SPARK_EXTRA_CLASSPATH=
:: loading settings :: file = /etc/spark/conf/ivysettings.xml
/usr/lib/spark/python/lib/pyspark.zip/pyspark/pandas/__init__.py:49: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
Traceback (most recent call last):
  File "/var/dataproc/tmp/srvls-batch-0c5d7153-2f67-4614-86b2-1ed2f1264837/<PYTHON-MODEL.py>", line 264, in <module>
    df = model(dbt, spark)
  File "/var/dataproc/tmp/srvls-batch-0c5d7153-2f67-4614-86b2-1ed2f1264837/<PYTHON-MODEL.py>", line 165, in model
    df = df.filter(df.created_at >= session.sql(max_from_this).collect()[0][0])
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1034, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 196, in deco
pyspark.sql.utils.AnalysisException: spark_catalog requires a single-part namespace, but got [x, y]

Environment

dbt-core: 1.7.2
dbt-bigquery: 1.7.2

Additional Context

References:

@carlos-veris carlos-veris added bug Something isn't working triage labels Jul 23, 2024
@amychen1776 amychen1776 added the python Pull requests that update Python code label Jul 24, 2024
@amychen1776 amychen1776 added python_models and removed python Pull requests that update Python code triage labels Aug 28, 2024