[Bug] dbt source freshness expected a timestamp but received a string #631

Open
guilhermenoronha opened this issue Apr 25, 2024 · 2 comments
Labels
bug Something isn't working

@guilhermenoronha

Is this a new bug in dbt-athena?

  • I believe this is a new bug in dbt-athena
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I'm trying to run a freshness check on some of my tables using a timestamp column. However, I'm getting an error saying the check found a str column instead of a timestamp one. I believe this is a bug, because Athena shows my column as a timestamp, as in the following image.

[image: Athena table schema showing the column typed as timestamp]

The error I'm getting is also in the following image:

[image: error output from dbt source freshness]

My dbt-athena package is at version 1.7.2. The query generated by the freshness check is below, along with the results it produced:

select
      max(loaded_at) as max_loaded_at,
      cast(now() as timestamp) as snapshotted_at
    from "awsdatacatalog"."bronze_dev"."callbacks"

[image: results returned by the generated query]

Expected Behavior

The dbt source freshness command should run successfully on my source tables, or at most report an error related to the configured SLA, not throw an error about casting types.

Steps To Reproduce

  1. Have a table on Athena with a timestamp column (in my case the column was generated by a PySpark script executed on AWS Glue)
  2. Configure dbt freshness on the same table:

    version: 2
    sources:
      - name: "bronze_dev"
        tables:
          - name: my_table
            freshness:
              error_after:
                count: 2
                period: day
            loaded_at_field: loaded_at

  3. Run the command dbt source freshness
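
As a possible workaround sketch (not verified for this setup): since loaded_at_field is interpolated verbatim into the generated max(...) expression, the cast could be pushed into the source config itself. The expression below is an assumption, not a confirmed fix for this casting error.

    loaded_at_field: "cast(loaded_at as timestamp)"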

Environment

- OS: Windows
- Python: ^3.11
- dbt: 1.7.9
- dbt-athena-community: 1.7.2

Additional Context

No response

@guilhermenoronha guilhermenoronha added the bug Something isn't working label Apr 25, 2024
@nicor88
Contributor

nicor88 commented May 11, 2024

Leaving this for anyone picking this up.
As a first attempt, we could override the collect_freshness macro, e.g. here.

Alternatively, and a bit more complicated, on the Python side we could override this method in the adapter:

    def calculate_freshness_from_metadata(
        self,
        source: BaseRelation,
        macro_resolver: Optional[MacroResolverProtocol] = None,
    ) -> Tuple[Optional[AdapterResponse], FreshnessResponse]:
        ...
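
For the macro route, here is a minimal sketch of what a project-level override could look like, mirroring dbt-core's default__collect_freshness and adding an explicit cast of the loaded-at column. The cast is an assumption about the fix, and the file path and dispatch setup are illustrative, not confirmed:

    {# macros/collect_freshness.sql -- hypothetical override; dispatch must be configured so the athena__ prefix is picked up #}
    {% macro athena__collect_freshness(source, loaded_at_field, filter) %}
      {% call statement('collect_freshness', fetch_result=True, auto_begin=False) -%}
        select
          {# cast defensively in case the column comes back as a varchar #}
          max(cast({{ loaded_at_field }} as timestamp)) as max_loaded_at,
          {{ current_timestamp() }} as snapshotted_at
        from {{ source }}
        {% if filter %}
        where {{ filter }}
        {% endif %}
      {% endcall %}
      {{ return(load_result('collect_freshness')) }}
    {% endmacro %}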

@sanromeo
Contributor

From my side, I can add that max(<ts_column>) as max_loaded_at doesn't work well for the dbt source freshness query. For large tables it ends up scanning the full data without a partition predicate; it seems that max/min functions on Iceberg tables don't use the Iceberg metadata. On the other hand, querying the metadata scans almost nothing (a few KB).

I propose overriding the collect_freshness macro and, depending on the table_type and partition column, using different queries for the dbt source freshness command:
For non-partitioned Iceberg tables and all Hive tables, use the default query:

select
      max(<ts_column>) as max_loaded_at,
      {{ current_timestamp() }} as snapshotted_at
    from <source_table>

For partitioned Iceberg tables use the following query:

select
      max(data.<partitioned_by_ts_column>.max) as max_loaded_at
from "<source_schema>"."<source_table>$partitions"
