
Create a new materialization insert_by_period #190

Closed
nicor88 opened this issue Mar 29, 2023 · 14 comments
Labels
enhancement New feature or request

Comments

@nicor88
Contributor

nicor88 commented Mar 29, 2023

Inspired by https://gist.github.com/jessedobbelaere/6fdb593f9e2cc732e9f142c56c9bac87,
create a new materialisation: insert_by_period.

Requirements

  • Support table_type=hive
  • Support for table_type=iceberg
    • to make the implementation idempotent, add a delete from table where period={period_to_process} step to avoid duplicates in case an error occurs (rough sketch below)
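A rough sketch of what one period iteration could generate, purely for illustration (table, column, and period values are placeholders; the delete step applies to Iceberg tables only, since Athena Hive tables do not support row-level DELETE):

    -- illustrative: process a single period, e.g. period_to_process = DATE '2023-03-01'
    -- delete first so a retried run does not create duplicates (idempotency)
    DELETE FROM analytics.target_table
    WHERE period = DATE '2023-03-01';

    -- then insert only the rows belonging to that period
    INSERT INTO analytics.target_table
    SELECT *
    FROM analytics.source_table
    WHERE period = DATE '2023-03-01';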
@nicor88 nicor88 added the enhancement New feature or request label Mar 29, 2023
@nicor88 nicor88 added this to the v1.5.0 milestone Apr 20, 2023
@nicor88 nicor88 modified the milestones: v1.5.0, v1.5.1 May 2, 2023
@Jrmyy Jrmyy modified the milestones: v1.5.1, v1.5.2 Jun 14, 2023
@nicor88 nicor88 removed this from the v1.5.2 milestone Sep 11, 2023
@svdimchenko
Contributor

svdimchenko commented Oct 31, 2023

@nicor88 WDYT, is this materialisation still needed? I believe this commit resolves the processing of huge source datasets.

@nicor88
Contributor Author

nicor88 commented Oct 31, 2023

@svdimchenko yes, I believe it could still be needed.
In the current adapter setup we have a first step where the whole table is saved as a non-partitioned dataset. If the amount of data is huge, the user might want to add filters to the dataset to break it down; that's where insert_by_period comes into play.

@nicor88
Contributor Author

nicor88 commented Oct 31, 2023

@svdimchenko for more context on my last message, see #471

@mycaule

mycaule commented Nov 16, 2023

You should also support unique_key; I don't think the gist code has any way to deal with that.

I am thinking of cases where you need to deal with rows that get updated and the developer wants to use a config like:
config(materialized='sp_insert_by_period', timestamp_field = 'updated_at', ...)
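For illustration, a fuller version of that hypothetical config might look like the following (sp_insert_by_period, timestamp_field, and period are names proposed in this thread, not an existing dbt-athena API; the source is made up):

    {{
      config(
        materialized='sp_insert_by_period',  -- hypothetical materialization name
        timestamp_field='updated_at',        -- hypothetical: column used to slice periods
        period='day',                        -- hypothetical: size of each processed batch
        unique_key='id'                      -- the requested deduplication key
      )
    }}

    select id, created_at, updated_at, payload
    from {{ source('rds', 'orders') }}       -- illustrative source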

@nicor88
Contributor Author

nicor88 commented Nov 16, 2023

@mycaule why should unique_key be included in insert_by_period? insert_by_period by definition acts on time, and it delete-inserts batches of periods. If you need to upsert by unique_key, maybe consider using an incremental materialization with the merge strategy and, of course, unique_key.
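A minimal sketch of that alternative, assuming an Iceberg table (in dbt-athena the merge strategy requires table_type='iceberg'); the key name is illustrative:

    {{
      config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='id',
        table_type='iceberg'
      )
    }}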

@mycaule

mycaule commented Nov 16, 2023

Using incremental fails for large tables, as you discussed in #471,
and the snippet from Redshift provided in the discussion was the only quick way to do it in my case.

And force_batch (config(materialized='incremental', unique_key='id', force_batch=True)) doesn't show progress, nor can we resume if the dbt run fails.

@nicor88
Contributor Author

nicor88 commented Nov 16, 2023

Which snippet for Redshift? Do you mean this #471 (comment)? If so, I don't see any reference to unique_key. Could you please add a reference to the snippet you are talking about?

Also:

config(materialized='incremental', unique_key='id', force_batch=True)

that is incomplete; what is your default incremental_strategy? I assume merge, since you use unique_key.

Also, a few challenges:

  • adding unique_key to insert_by_period is, in my opinion, really not ideal - the update of those unique_key rows will happen naturally when you delete/insert by period

In order to understand what you want to achieve, I need more context on what you plan to build.

Also, force_batch doesn't help you if you are not using partitioned datasets, and if your initial query is big it will still fail, because it will try to process the whole dataset - that's where insert_by_period comes in handy.

@mycaule

mycaule commented Nov 16, 2023

Thank you for the details, I didn't have this in mind.

This snippet talks about Redshift and adapting it to Athena: https://gist.github.com/jessedobbelaere/6fdb593f9e2cc732e9f142c56c9bac87

For the moment I use the default incremental_strategy, which should be append, and the Hive format because of legacy. For the merge strategy to work I would have to move to Iceberg.

I am trying to materialize a very large table through a federated query to RDS, and I want the Athena table to be refreshed daily in my data lake. Existing rows in RDS can change over time; I have id (primary key), created_at and updated_at columns at my disposal.

@nicor88
Contributor Author

nicor88 commented Nov 16, 2023

Yes, insert_by_period is adapted from another adapter (Redshift in this case), so it seems like you were referring to the gist, not directly to #471.

Existing rows in RDS can change over time, I have id (primary key), created_at and updated_at columns at disposal.

Do you expect updated_at to change? What you generally do with an incremental strategy is pick only the records that have changed (which means updated_at MUST change) for this to work, using the is_incremental macro.

Now the challenge is that:

  • on the first run of the model, due to the big dataset, you would like to have a behavior like insert_by_period;
  • for incremental runs (on smaller datasets, where you apply a filter, e.g. WHERE updated_at >= (SELECT max(updated_at) FROM {{ this }})) you want a "normal" incremental materialization with merge - see the sketch below.
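As a rough sketch of that second point (table and column names are illustrative), the filter would typically sit behind is_incremental():

    -- model body; the config would be the incremental + merge setup discussed above
    select id, created_at, updated_at, payload
    from {{ source('rds', 'orders') }}        -- illustrative source
    {% if is_incremental() %}
      -- skipped on the first (full) run, applied on incremental runs
      where updated_at >= (select max(updated_at) from {{ this }})
    {% endif %}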

@mycaule

mycaule commented Nov 17, 2023

materialized='incremental', table_type='iceberg', unique_key='id', incremental_strategy='merge'
When I used Iceberg + the merge strategy, I didn't notice any duration improvement on the second iteration, while the dbt logs said only a few rows were updated.

materialized='incremental', incremental_strategy='insert_overwrite'
Let's imagine I stay with Hive; I understand that to be able to use insert_overwrite, the partitioned_by field is mandatory.

I also had different finite status columns describing the lifecycle of the rows: status_1, status_2, status_3, and partitioned_by = ['status_1', 'status_2', 'status_3'] was faster using Hive than Iceberg.

To deal with tables without lifecycle columns, I also tried to create artificial partitions using YEAR(created_at) AS year, MONTH(created_at) AS month, DAY(created_at) AS day and then partitioned_by = ['year', 'month', 'day'].

Do you think I can do partitioned_by=['bucket(id, 100)'] to both stay within the 100-partition limit and only rewrite on average $\mathcal{O}(n/100)$ of my data? Or is it only supported by Iceberg, as shown in this example from the docs?
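For what it's worth, a hedged sketch of the bucketed approach (assuming Iceberg, and assuming Athena follows the Iceberg transform syntax with the bucket count first, i.e. bucket(100, id) rather than bucket(id, 100); check the dbt-athena docs before relying on this):

    {{
      config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='id',
        table_type='iceberg',
        -- assumption: Iceberg-style transform with the bucket count first
        partitioned_by=['bucket(100, id)']
      )
    }}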

@jessedobbelaere
Contributor

💡 Last night, dbt made this announcement about a new incremental strategy for time-series event data. It introduces incremental_strategy='microbatch' to enable micro-batching, replacing the custom insert-by-period macros in the wild. It also seems to simplify things a lot: less config, and it includes sampling for dev runs. Let's focus on integrating with this feature when it comes out?
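Based on that announcement, a minimal sketch of what such a model could look like in dbt Core 1.9+ (column, model, and value names are illustrative; Athena support for this strategy is exactly what would need to be integrated here):

    {{
      config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='updated_at',   -- column dbt uses to slice the batches
        batch_size='day',          -- one batch per day
        begin='2023-01-01'         -- start of the initial backfill
      )
    }}

    select id, updated_at, payload
    from {{ ref('stg_orders') }}   -- the upstream model needs its own event_time config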

@nicor88
Contributor Author

nicor88 commented Sep 6, 2024

@jessedobbelaere we should indeed consider supporting microbatch in dbt-athena.
We should consider creating another issue to track the microbatch incremental_strategy, and eventually closing this one.

@jessedobbelaere
Contributor

Exactly 👍 I've created #715 as a placeholder

@amychen1776

Hi folks - our team is already looking at supporting our new Microbatch strategy on this adapter, so for now we are going to go ahead and close out this issue, since that work will address it :)

@amychen1776 amychen1776 closed this as not planned Oct 3, 2024