Microbatch - Adapter Maintainers Guide #371
MichelleArk started this conversation in General
What is Microbatch?
As part of `dbt-core==1.9.0` and `dbt-adapters==1.10.3`, we have introduced support for a new built-in `incremental_strategy` called `microbatch`. This new incremental strategy materializes large, event-oriented datasets in an opinionated and ergonomic way using time ranges. From the microbatch beta documentation:

> Our "happy path" use case is "I need to process new (partitions of) data (each hour, day, etc.), and efficiently upsert them into an existing table."

Additional context is also available in the GH Discussion and GH Epic.
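For context on what adapter users will write, a minimal microbatch model might look like the following sketch (the model, source, and column names are hypothetical; `event_time`, `batch_size`, and `begin` are the user-facing configs the strategy relies on):

```sql
-- models/sessions_daily.sql (hypothetical example model)
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='occurred_at',   -- column used to bound each batch
        batch_size='day',           -- one batch per day
        begin='2024-01-01',         -- earliest data to backfill
    )
}}

select * from {{ ref('raw_events') }}
```

At execution time, dbt splits the model's history into day-sized batches and runs the adapter's microbatch strategy once per batch.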
Support Considerations
This is opt-in functionality, which means you can choose not to support it in your adapter. For an overview of the minimal work required to support the new microbatch incremental strategy, refer to the `dbt-redshift` implementation here: https://github.com/dbt-labs/dbt-redshift/pull/924/files. The rest of this section dives deeper into the details of support considerations, using `dbt-snowflake` and `dbt-bigquery` as examples.

Adapter Requirements
To support the `microbatch` incremental strategy, each adapter is responsible for:

- Extending the `BaseAdapter.valid_incremental_strategies` method to include `"microbatch"` in its result. (example)
- Implementing a `<my-adapter>__get_incremental_microbatch_sql` macro. (example)

In practice, this may look most similar to an existing implementation of either the `insert_overwrite` (preferred, if available) or `delete+insert` strategy.

For each batch materialized by dbt, new properties are available in the jinja context via the new `model.batch` attribute. These are only available when running microbatch models; when not in a microbatch model context, `model.batch` will be `None` and access to sub-attributes is unsafe.

- `model.batch.event_time_start`: datetime
- `model.batch.event_time_end`: datetime
- `model.batch.id`: string representation of `model.batch.event_time_start`, with no spaces, `-`, or `_` characters.

`model.batch.event_time_start` and `model.batch.event_time_end` represent the time bounds of the running batch, and should be used to filter any `delete` or `merge` statements in the strategy implementation. This is necessary for both efficiency and correctness. `model.batch.id` may be helpful for logging purposes, and is baked into the default `make_temp_relation` macro, acting as an additional suffix to `dbt_tmp` tables so that each batch gets an isolated temp table (implementation here).

It may also be necessary to override the base implementation of the new `BaseRelation._render_event_time_filtered` method. This method accepts an `EventTimeFilter` from `dbt-core`, and generates the appropriate SQL to wrap a `ref` statement with `where` filters using the `event_time_start`, `event_time_end`, and `event_time`.

Examples:

- `valid_incremental_strategies` implementation: https://github.com/dbt-labs/dbt-snowflake/blob/0b24a5a2d311ffec5f996ca28532076a637aa6b3/dbt/adapters/snowflake/impl.py#L426-L427
- `snowflake__get_incremental_microbatch_sql` macro implementation
- `BaseRelation._render_event_time_filtered`: https://github.com/dbt-labs/dbt-bigquery/pull/1422/files

Opting into concurrency support
It is possible to opt into concurrency support for your adapter's microbatch strategy. The benefits of opting in are primarily realized during `--full-refresh` runs, where running batches concurrently (with respect to the global `--threads` setting) leads to significantly reduced build times for users.

Determine whether your `<my-adapter>__get_incremental_microbatch_sql` macro is safe to run concurrently, and set the `MicrobatchConcurrency` capability to `True`. By default, `MicrobatchConcurrency` is set to `False`, which directs dbt to execute each batch in serial. A common concurrency consideration is ensuring that each batch writes to its own temp table: use the `model.batch.id` jinja global to do so, or use the global `make_temp_relation` macro, which will do this automatically if the default `suffix` is provided.

Beyond ensuring correctness of the strategy across threads, it is recommended to benchmark the performance implications of supporting `MicrobatchConcurrency` by simulating a large input to a microbatch model and running a `--full-refresh` "backfill" of the microbatch model with 1 (serial execution), 4, and 8 threads. The overall runtime may go down significantly, but the total time executed against the warehouse may increase, because the platform will need to manage merging into the main dataset safely (e.g. via locking).

Users are able to opt in and out of concurrency support at the `model.config` level via the `batch_concurrency: bool` configuration, even if your adapter supports `MicrobatchConcurrency`. This means the end user can determine whether any tradeoffs associated with concurrent microbatch invocations are acceptable for their use cases.

For reference benchmarking, please refer to: dbt-labs/dbt-snowflake#1259 (comment)
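As a rough, self-contained sketch of the scheduling behavior described above (this is illustrative Python, not dbt-core's actual scheduler; the function and parameter names are hypothetical), serial versus concurrent batch execution looks like:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batches(batches, process_batch, microbatch_concurrency, threads=4):
    """Run each batch in serial unless the adapter declares the
    MicrobatchConcurrency capability (illustrative stand-in only)."""
    if not microbatch_concurrency:
        # Default behavior: one batch at a time, in order.
        return [process_batch(batch) for batch in batches]
    # Opt-in behavior: batches fan out across the global --threads setting.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(process_batch, batches))
```

In a real adapter the capability is declared on the adapter class rather than passed per call, and, as noted above, each batch must write to an isolated temp table for concurrent execution to be safe.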
Testing
A new base test, `BaseMicrobatch`, has been implemented for concrete adapters to inherit and test against. It is possible to override `microbatch_model_sql`, `input_model_sql`, and `insert_two_rows_sql` via fixtures: https://github.com/dbt-labs/dbt-adapters/blob/main/dbt-tests-adapter/dbt/tests/adapter/incremental/test_incremental_microbatch.py
Example override in dbt-snowflake: https://github.com/dbt-labs/dbt-snowflake/pull/1179/files#diff-54c15a3b4b6e274116439d3d4ac9416141d9bd39f9e6ffedc0724ab304cb81eb
Behaviour Flag: require_batched_execution_for_custom_microbatch_strategy
Lastly, `dbt-core` has introduced a new global behavior flag, `require_batched_execution_for_custom_microbatch_strategy`. This behavior flag is configurable under `flags` in `dbt_project.yml`, and defaults to `False`. It is intended to protect users who have created a custom incremental strategy called 'microbatch', since we are now effectively claiming that name as a builtin / reserved strategy.

By default, projects with a custom incremental strategy called 'microbatch' will not run through the new microbatch execution framework, whereby dbt-core computes individual batches and resolves `ref` and `source` calls with an `EventTimeFilter`. If the user turns this flag on, they are effectively opting into using their custom 'microbatch' strategy in combination with dbt-core's new execution framework.

Documentation on this flag is available here: https://docs.getdbt.com/reference/global-configs/behavior-changes#custom-microbatch-strategy
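For a user who wants to keep their custom 'microbatch' strategy while opting into the new batched execution framework, the flag is set in `dbt_project.yml` (a minimal fragment, per the flag name and location described above):

```yaml
# dbt_project.yml
flags:
  require_batched_execution_for_custom_microbatch_strategy: true
```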