Commit 6ed90c3: Merge branch 'current' into add-batch-jinja
mirnawong1 authored Dec 5, 2024 · 2 parents c61cfff + 0a3c0f3
Showing 1 changed file (website/docs/docs/build/incremental-microbatch.md) with 53 additions and 12 deletions.
Each "batch" corresponds to a single bounded time period (by default, a single day).

This is a powerful abstraction that makes it possible for dbt to run batches [separately](#backfills), concurrently, and [retry](#retry) them independently.

## Example

A `sessions` model aggregates and enriches data that comes from two other models:
- `page_views` is a large, time-series table. It contains many rows, new records almost always arrive after existing ones, and existing records rarely update. It uses the `page_view_start` column as its `event_time`.
- `customers` is a relatively small dimensional table. Customer attributes update often, and not in a time-based manner, so it has no reliable `event_time` column. An example parent configuration follows this list.
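For instance, `page_views` only needs `event_time` configured so that microbatch knows how to filter it; the parent does not itself have to be incremental. A minimal sketch (the materialization and source names are illustrative assumptions):

<File name="models/page_views.sql">

```sql
{# Configuring event_time lets microbatch consumers filter this model per batch #}
{{ config(
    materialized='table',
    event_time='page_view_start'
) }}

select * from {{ source('web_analytics', 'raw_page_views') }}
```

</File>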
It does not matter whether the table already contains data for that day. Given the same input data, the resulting table is the same no matter how many times a batch is reprocessed.

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_filters.png" title="Each batch of sessions filters page_views to the matching time-bound batch, but doesn't filter sessions, performing a full scan for each batch."/>

## Relevant configs

Several configurations are relevant to microbatch models, and some are required:

| Config | Description | Default | Type | Required |
|----------|---------------|---------|------|---------|
| [`event_time`](/reference/resource-configs/event-time) | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | N/A | Column | Required |
| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/>

### Required configs for specific adapters
Some adapters require additional configurations for the microbatch strategy. This is because each adapter implements the microbatch strategy differently.

The following table lists the required configurations for the specific adapters, in addition to the standard microbatch configs:

| Adapter | `unique_key` config | `partition_by` config |
|----------|------------------|--------------------|
| [`dbt-postgres`](/reference/resource-configs/postgres-configs#incremental-materialization-strategies) | ✅ Required | N/A |
| [`dbt-spark`](/reference/resource-configs/spark-configs#incremental-models) | N/A | ✅ Required |
| [`dbt-bigquery`](/reference/resource-configs/bigquery-configs#merge-behavior-incremental-models) | N/A | ✅ Required |

For example, if you're using `dbt-postgres`, configure `unique_key` as follows:

<File name="models/sessions.sql">

```sql
{# unique_key is required for the microbatch strategy on dbt-postgres #}
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    unique_key='sales_id',
    event_time='transaction_date',
    begin='2023-01-01',
    batch_size='day'
) }}

select
sales_id,
transaction_date,
customer_id,
product_id,
total_amount
from {{ source('sales', 'transactions') }}
```

</File>

In this example, `unique_key` is required because the `dbt-postgres` microbatch strategy uses `merge`, which needs a `unique_key` to identify which rows in the data warehouse need to be merged. Without a `unique_key`, dbt can't match rows between the incoming batch and the existing table.
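On `dbt-bigquery`, you'd supply `partition_by` instead of `unique_key` (`dbt-spark` likewise requires `partition_by`, with its own syntax). A sketch reusing the same hypothetical `sales` source, using BigQuery's `partition_by` dictionary syntax:

<File name="models/sessions.sql">

```sql
{# partition_by is required for the microbatch strategy on dbt-bigquery #}
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    partition_by={
        'field': 'transaction_date',
        'data_type': 'date',
        'granularity': 'day'
    },
    event_time='transaction_date',
    begin='2023-01-01',
    batch_size='day'
) }}

select
    sales_id,
    transaction_date,
    customer_id,
    product_id,
    total_amount
from {{ source('sales', 'transactions') }}
```

</File>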

### Full refresh

As a best practice, we recommend configuring `full_refresh: False` on microbatch models so that they ignore invocations with the `--full-refresh` flag. If you need to reprocess historical data, do so with a targeted backfill that specifies explicit start and end dates.
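A sketch of that guard in a model's config block (the other settings are carried over from the earlier example):

<File name="models/sessions.sql">

```sql
{# full_refresh=false makes this model ignore the --full-refresh flag #}
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    full_refresh=false,
    event_time='transaction_date',
    begin='2023-01-01',
    batch_size='day'
) }}
```

</File>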

## Usage

**You must write your model query to process (read and return) exactly one "batch" of data**. This is a simplifying assumption and a powerful one:
- You don’t need to think about `is_incremental` filtering
During standard incremental runs, dbt will process batches according to the current timestamp and the configured `lookback`, with one query per batch.

**Note:** If there’s an upstream model that configures `event_time`, but you *don’t* want the reference to it to be filtered, you can specify `ref('upstream_model').render()` to opt-out of auto-filtering. This isn't generally recommended — most models that configure `event_time` are fairly large, and if the reference is not filtered, each batch will perform a full scan of this input table.
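For illustration, a sketch assuming a hypothetical upstream `page_views` model that configures `event_time`:

```sql
-- ref() is auto-filtered to this batch's time window
select
    session_id,
    count(*) as page_view_count
from {{ ref('page_views') }}
group by 1

-- To opt out of the auto-filtering (full scan on every batch, generally not
-- recommended), use ref('page_views').render() instead.
```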

## Backfills

Whether to fix erroneous source data or retroactively apply a change in business logic, you may need to reprocess a large amount of historical data.

Backfilling a microbatch model is as simple as selecting it to run or build and specifying a "start" and "end" for `event_time`:

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_backfill.png" title="Configure a lookback to reprocess additional batches during standard incremental runs"/>

## Retry

If one or more of your batches fail, you can use `dbt retry` to reprocess _only_ the failed batches.

![Partial retry](https://github.com/user-attachments/assets/f94c4797-dcc7-4875-9623-639f70c97b8f)
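For example, if one batch of an illustrative backfill errors out, a follow-up `dbt retry` reruns just that batch:

```bash
dbt run --select sessions --event-time-start "2024-09-01" --event-time-end "2024-09-04"
# Suppose the 2024-09-02 batch fails. Fix the underlying issue, then:
dbt retry   # reprocesses only the failed 2024-09-02 batch
```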

## Timezones

For now, dbt assumes that all values supplied are in UTC:

- `event_time`
- `begin`
- `--event-time-start` and `--event-time-end`
