Add microbatch strategy #924

Merged
merged 13 commits into from
Nov 7, 2024

Conversation

QMalcolm
Contributor

@QMalcolm QMalcolm commented Oct 2, 2024

resolves #923

Problem

dbt-redshift needs a microbatch implementation that:

  • efficiently inserts new batches of data knowing that compiled_code will be filtered down by event_time
  • does not require a unique_key configuration on the model.

Solution

We added a custom implementation of delete+insert specifically for microbatch.
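
As a rough illustration of the delete+insert idea for a single batch, here is a minimal Python sketch that builds the two SQL statements. It is not the macro added in this PR: the function name, its parameters, and the table names in the usage note are hypothetical, and the half-open `[start, end)` window (in particular the `<` on the end bound) is an assumption. No `unique_key` is involved; the batch is identified purely by its `event_time` range.

```python
from datetime import datetime


def microbatch_delete_insert_sql(target, source, event_time, start, end, columns):
    """Build the statements for one batch: delete the batch window, then insert it.

    Hypothetical helper for illustration only; not part of dbt-redshift.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    # Assumption: the batch is the half-open window [start, end) on event_time.
    predicates = [
        f"{event_time} >= TIMESTAMP '{start.strftime(fmt)}'",
        f"{event_time} < TIMESTAMP '{end.strftime(fmt)}'",
    ]
    cols = ", ".join(columns)
    # Remove whatever the target already holds for this batch window...
    delete_sql = f"delete from {target} where " + " and ".join(predicates)
    # ...then load the batch's rows, which are already filtered by event_time.
    insert_sql = f"insert into {target} ({cols}) select {cols} from {source}"
    return [delete_sql, insert_sql]
```

Calling it with `target="analytics.microbatch"`, `source="microbatch__dbt_tmp"`, and a one-day window yields one `delete` and one `insert` statement that could be run in a single transaction.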

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

This work is almost entirely a duplicate of the work done by
MichelleArk in dbt-labs/dbt-snowflake#1179.
I don't really expect this to work on the first try, but it might. I expect
to need to make some edits, but who knows, maybe I'll get lucky.
@cla-bot cla-bot bot added the cla:yes label Oct 2, 2024
@QMalcolm QMalcolm added the Skip Changelog Skips GHA to check for changelog file label Oct 2, 2024
@dbt-labs dbt-labs deleted a comment from github-actions bot Oct 2, 2024
@QMalcolm QMalcolm removed the Skip Changelog Skips GHA to check for changelog file label Oct 3, 2024
@QMalcolm QMalcolm marked this pull request as ready for review November 6, 2024 19:10
@QMalcolm QMalcolm requested a review from a team as a code owner November 6, 2024 19:10
@mikealfare
Contributor

Turning it off and on again for CI

@mikealfare mikealfare closed this Nov 6, 2024
@mikealfare mikealfare reopened this Nov 6, 2024
{% endif %}

{#-- Add additional incremental_predicates to filter for batch --#}
{% do predicates.append(target ~ "." ~ model.config.event_time ~ " >= TIMESTAMP '" ~ model.config.__dbt_internal_microbatch_event_time_start ~ "'") %}

I think target is not strictly necessary here on this line or the one below: the delete from {{ target }} has no using clause, so it's unambiguous that model.config.event_time is a column on target.

{% do predicates.append(pred) %}
{% endfor %}

{% if not model.config.get("__dbt_internal_microbatch_event_time_start") or not model.config.__dbt_internal_microbatch_event_time_end -%}

nit: this doesn't use model.config.get for the end time. Let's be consistent across the two accesses.

Contributor Author

@QMalcolm QMalcolm Nov 7, 2024


Good catch!

I was curious how this happened and did a little digging. In f7899e0 we removed the old separate if statements that added the predicates and added this new combined if statement. The old if statements came from the original snowflake implementation. I think we did it that way originally because, in our initial implementation of batches in core, the start time was not guaranteed. So the get was probably there to handle the case where the start time was None.
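
To make the consistency point concrete, here is a small Python analogue of the check, using a plain dict in place of `model.config`. The two key names come from the snippet under review; the helper name and the error message are hypothetical, since the body of the `if` is not shown in this thread.

```python
START_KEY = "__dbt_internal_microbatch_event_time_start"
END_KEY = "__dbt_internal_microbatch_event_time_end"


def batch_bounds(config):
    """Return (start, end) for the batch, reading both keys the same way."""
    # Use .get for both keys so a missing key and a None value are handled alike.
    start = config.get(START_KEY)
    end = config.get(END_KEY)
    if not start or not end:
        # Hypothetical error; the real macro's handling is not shown here.
        raise ValueError("microbatch requires both a batch start and end time")
    return start, end
```

Reading both values through the same accessor means neither a missing start nor a missing end can slip past the guard by failing in a different way.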

@MichelleArk
Contributor

MichelleArk commented Nov 7, 2024

Did some additional 🎩-ing locally in addition to the functional coverage we had for fun!

Initial build with 3 input rows --full-refresh

❯ dbt run --select +microbatch --full-refresh
19:05:08  Running with dbt=1.9.0-b3
19:05:08  [WARNING]: Deprecated functionality

User config should be moved from the 'config' key in profiles.yml to the 'flags' key in dbt_project.yml.
19:05:09  Registered adapter: redshift=1.9.0-b1
19:05:09  [WARNING]: Time spines without YAML configuration are in the process of
deprecation. Please add YAML configuration for your 'metricflow_time_spine'
model. See documentation on MetricFlow time spines:
https://docs.getdbt.com/docs/build/metricflow-time-spine and behavior change
documentation:
https://docs.getdbt.com/reference/global-configs/behavior-changes.
19:05:09  Found 23 models, 6 seeds, 20 data tests, 6 sources, 2 exposures, 13 metrics, 799 macros, 6 semantic models, 5 unit tests
19:05:09  
19:05:09  Concurrency: 1 threads (target='dev')
19:05:09  
19:05:11  1 of 2 START sql view model michelle_ark.microbatch_input ...................... [RUN]
19:05:13  1 of 2 OK created sql view model michelle_ark.microbatch_input ................. [SUCCESS in 1.62s]
19:05:13  2 of 2 START sql microbatch model michelle_ark.microbatch ...................... [RUN]
19:05:13  1 of 12 START batch 2024-10-27 of michelle_ark.microbatch ...................... [RUN]
19:05:15  1 of 12 OK created batch 2024-10-27 of michelle_ark.microbatch ................. [SUCCESS in 1.70s]
19:05:15  2 of 12 START batch 2024-10-28 of michelle_ark.microbatch ...................... [RUN]
19:05:17  2 of 12 OK created batch 2024-10-28 of michelle_ark.microbatch ................. [SUCCESS in 1.82s]
19:05:17  3 of 12 START batch 2024-10-29 of michelle_ark.microbatch ...................... [RUN]
19:05:18  3 of 12 OK created batch 2024-10-29 of michelle_ark.microbatch ................. [SUCCESS in 1.93s]
19:05:18  4 of 12 START batch 2024-10-30 of michelle_ark.microbatch ...................... [RUN]
19:05:20  4 of 12 OK created batch 2024-10-30 of michelle_ark.microbatch ................. [SUCCESS in 1.84s]
19:05:20  5 of 12 START batch 2024-10-31 of michelle_ark.microbatch ...................... [RUN]
19:05:22  5 of 12 OK created batch 2024-10-31 of michelle_ark.microbatch ................. [SUCCESS in 1.84s]
19:05:22  6 of 12 START batch 2024-11-01 of michelle_ark.microbatch ...................... [RUN]
19:05:24  6 of 12 OK created batch 2024-11-01 of michelle_ark.microbatch ................. [SUCCESS in 1.50s]
19:05:24  7 of 12 START batch 2024-11-02 of michelle_ark.microbatch ...................... [RUN]
19:05:25  7 of 12 OK created batch 2024-11-02 of michelle_ark.microbatch ................. [SUCCESS in 1.67s]
19:05:25  8 of 12 START batch 2024-11-03 of michelle_ark.microbatch ...................... [RUN]
19:05:27  8 of 12 OK created batch 2024-11-03 of michelle_ark.microbatch ................. [SUCCESS in 1.58s]
19:05:27  9 of 12 START batch 2024-11-04 of michelle_ark.microbatch ...................... [RUN]
19:05:29  9 of 12 OK created batch 2024-11-04 of michelle_ark.microbatch ................. [SUCCESS in 1.62s]
19:05:29  10 of 12 START batch 2024-11-05 of michelle_ark.microbatch ..................... [RUN]
19:05:30  10 of 12 OK created batch 2024-11-05 of michelle_ark.microbatch ................ [SUCCESS in 1.60s]
19:05:30  11 of 12 START batch 2024-11-06 of michelle_ark.microbatch ..................... [RUN]
19:05:32  11 of 12 OK created batch 2024-11-06 of michelle_ark.microbatch ................ [SUCCESS in 1.63s]
19:05:32  12 of 12 START batch 2024-11-07 of michelle_ark.microbatch ..................... [RUN]
19:05:33  12 of 12 OK created batch 2024-11-07 of michelle_ark.microbatch ................ [SUCCESS in 1.64s]
19:05:33  2 of 2 OK created sql microbatch model michelle_ark.microbatch ................. [SUCCESS in 20.43s]
19:05:34  
19:05:34  Finished running 1 incremental model, 1 view model in 0 hours 0 minutes and 24.60 seconds (24.60s).
19:05:34  
19:05:34  Completed successfully
19:05:34  
19:05:34  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
❯ dbt show --inline "select * from {{ ref('microbatch') }}"
19:05:43  Running with dbt=1.9.0-b3
19:05:43  [WARNING]: Deprecated functionality

User config should be moved from the 'config' key in profiles.yml to the 'flags' key in dbt_project.yml.
19:05:43  Registered adapter: redshift=1.9.0-b1
19:05:44  Found 23 models, 6 seeds, 20 data tests, 1 sql operation, 6 sources, 2 exposures, 13 metrics, 799 macros, 6 semantic models, 5 unit tests
19:05:44  
19:05:44  Concurrency: 1 threads (target='dev')
19:05:44  
19:05:46  Previewing inline node:
| x | event_time |
| - | ---------- |
| 3 | 2024-10-30 |
| 1 | 2024-10-28 |
| 2 | 2024-10-29 |

Adding a couple input rows that are out of bounds + incremental run

❯ dbt run --select +microbatch
19:05:55  Running with dbt=1.9.0-b3
User config should be moved from the 'config' key in profiles.yml to the 'flags' key in dbt_project.yml.
19:05:55  Registered adapter: redshift=1.9.0-b1
19:05:56  Found 23 models, 6 seeds, 20 data tests, 6 sources, 2 exposures, 13 metrics, 799 macros, 6 semantic models, 5 unit tests
19:05:56  
19:05:56  Concurrency: 1 threads (target='dev')
19:05:56  
19:06:00  1 of 2 OK created sql view model michelle_ark.microbatch_input ................. [SUCCESS in 1.88s]
19:06:00  2 of 2 START sql microbatch model michelle_ark.microbatch ...................... [RUN]
19:06:00  1 of 2 START batch 2024-11-06 of michelle_ark.microbatch ....................... [RUN]
19:06:03  1 of 2 OK created batch 2024-11-06 of michelle_ark.microbatch .................. [SUCCESS in 2.94s]
19:06:03  2 of 2 START batch 2024-11-07 of michelle_ark.microbatch ....................... [RUN]
19:06:06  2 of 2 OK created batch 2024-11-07 of michelle_ark.microbatch .................. [SUCCESS in 2.54s]
19:06:06  2 of 2 OK created sql microbatch model michelle_ark.microbatch ................. [SUCCESS in 5.51s]
19:06:06  Finished running 1 incremental model, 1 view model in 0 hours 0 minutes and 10.26 seconds (10.26s).
19:06:06  
19:06:06  Completed successfully
19:06:06  
19:06:06  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

❯ dbt show --inline "select * from {{ ref('microbatch') }}"
19:06:10  Running with dbt=1.9.0-b3
User config should be moved from the 'config' key in profiles.yml to the 'flags' key in dbt_project.yml.
19:06:10  Registered adapter: redshift=1.9.0-b1
19:06:11  Found 23 models, 6 seeds, 20 data tests, 1 sql operation, 6 sources, 2 exposures, 13 metrics, 799 macros, 6 semantic models, 5 unit tests
19:06:11  
19:06:11  Concurrency: 1 threads (target='dev')
19:06:11  
19:06:14  Previewing inline node:
| x | event_time |
| - | ---------- |
| 2 | 2024-10-29 |
| 3 | 2024-10-30 |
| 1 | 2024-10-28 |

Adding a row of new + current data, incremental run

❯ dbt run --select +microbatch
19:06:30  Running with dbt=1.9.0-b3
19:06:30  [WARNING]: Deprecated functionality

User config should be moved from the 'config' key in profiles.yml to the 'flags' key in dbt_project.yml.
19:06:30  Registered adapter: redshift=1.9.0-b1
19:06:31  [WARNING]: Time spines without YAML configuration are in the process of
deprecation. Please add YAML configuration for your 'metricflow_time_spine'
model. See documentation on MetricFlow time spines:
https://docs.getdbt.com/docs/build/metricflow-time-spine and behavior change
documentation:
https://docs.getdbt.com/reference/global-configs/behavior-changes.
19:06:31  Found 23 models, 6 seeds, 20 data tests, 6 sources, 2 exposures, 13 metrics, 799 macros, 6 semantic models, 5 unit tests
19:06:31  
19:06:31  Concurrency: 1 threads (target='dev')
19:06:31  
19:06:33  1 of 2 START sql view model michelle_ark.microbatch_input ...................... [RUN]
19:06:35  1 of 2 OK created sql view model michelle_ark.microbatch_input ................. [SUCCESS in 1.78s]
19:06:35  2 of 2 START sql microbatch model michelle_ark.microbatch ...................... [RUN]
19:06:35  1 of 2 START batch 2024-11-06 of michelle_ark.microbatch ....................... [RUN]
19:06:37  1 of 2 OK created batch 2024-11-06 of michelle_ark.microbatch .................. [SUCCESS in 2.18s]
19:06:37  2 of 2 START batch 2024-11-07 of michelle_ark.microbatch ....................... [RUN]
19:06:39  2 of 2 OK created batch 2024-11-07 of michelle_ark.microbatch .................. [SUCCESS in 1.79s]
19:06:39  2 of 2 OK created sql microbatch model michelle_ark.microbatch ................. [SUCCESS in 3.99s]
19:06:40  
19:06:40  Finished running 1 incremental model, 1 view model in 0 hours 0 minutes and 8.48 seconds (8.48s).
19:06:40  
19:06:40  Completed successfully
19:06:40  
19:06:40  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

❯ dbt show --inline "select * from {{ ref('microbatch') }}"
19:06:42  Running with dbt=1.9.0-b3
19:06:42  [WARNING]: Deprecated functionality

User config should be moved from the 'config' key in profiles.yml to the 'flags' key in dbt_project.yml.
19:06:43  Registered adapter: redshift=1.9.0-b1
19:06:43  Found 23 models, 6 seeds, 20 data tests, 1 sql operation, 6 sources, 2 exposures, 13 metrics, 799 macros, 6 semantic models, 5 unit tests
19:06:43  
19:06:43  Concurrency: 1 threads (target='dev')
19:06:43  
19:06:45  Previewing inline node:
| x | event_time |
| - | ---------- |
| 1 | 2024-10-28 |
| 2 | 2024-10-29 |
| 3 | 2024-10-30 |
| 6 | 2024-11-07 |
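
The incremental runs above only touch the 2024-11-06 and 2024-11-07 batches, which is why out-of-bounds rows never reach the target while the 2024-11-07 row does. A minimal Python sketch of that behaviour follows, assuming daily batches and a lookback of one day (inferred from the two batches in the logs, not from the model's config, which is not shown). The function names and the out-of-bounds sample row are hypothetical.

```python
from datetime import date, timedelta


def daily_batches(today, lookback):
    """Return the batch dates an incremental run would reprocess.

    Assumption: one batch per day, covering `lookback` prior days plus today.
    """
    first = today - timedelta(days=lookback)
    return [first + timedelta(days=i) for i in range(lookback + 1)]


def rows_picked_up(rows, batches):
    """Keep only rows whose event_time date falls in one of the batches."""
    wanted = set(batches)
    return [row for row in rows if row["event_time"] in wanted]


batches = daily_batches(date(2024, 11, 7), lookback=1)
rows = [
    {"x": 4, "event_time": date(2024, 9, 1)},   # hypothetical out-of-bounds row
    {"x": 6, "event_time": date(2024, 11, 7)},  # row in the current batch
]
picked = rows_picked_up(rows, batches)
```

Here `batches` holds 2024-11-06 and 2024-11-07, and `picked` contains only the row with `x = 6`.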

Contributor

@MichelleArk MichelleArk left a comment


Manual testing done after I applied the couple nits locally. Looks great!

…in microbatch materialization

The `target.` portion of `target.<column_name>` is unnecessary for the predicates in the
microbatch materialization macro because the delete statement already scopes them
to `target` via the clause `delete from {{ target }}`. Said another way,
there is no `using` clause in the delete statement, so it is unambiguous which table is being
deleted from.
@QMalcolm QMalcolm enabled auto-merge (squash) November 7, 2024 22:33
@QMalcolm QMalcolm merged commit fccbe2d into main Nov 7, 2024
37 checks passed
@QMalcolm QMalcolm deleted the qmalcolm--microbatch-strategy branch November 7, 2024 22:49

Successfully merging this pull request may close these issues.

[dbt-redshift] Microbatch strategy
4 participants