Skip to content

Releases: dbt-labs/dbt-core

dbt version 0.4.1

22 Aug 04:49
Compare
Choose a tag to compare

dbt v0.4.1 provides improvements to incremental models, performance improvements, and ssh support for db connections.

0. tl;dr

  • slightly modified dbt command structure
  • unique_key setting for incremental models
  • connect to your db over ssh
  • no more model-defaults
  • multithreaded schema tests

If you encounter an SSL/cryptography error while upgrading to this version of dbt, check that your version of pip is up-to-date

pip install -U pip
pip install -U dbt

1. new dbt command structure #109

# To run models
dbt run # same as before

# to dry-run models 
dbt run --dry # previously dbt test

# to run schema tests
dbt test # previously dbt test --validate

2. Incremental model improvements #101

Previously, dbt calculated "new" incremental records to insert by querying for rows which matched some sql_where condition defined in the model configuration. This works really well for atomic datasets like a clickstream event log -- once inserted, these records will never change. Other datasets, like a sessions table comprised of many pageviews for many users, can change over time. Consider the following scenario:

User 1 Session 1 Event 1 @ 12:00
User 1 Session 1 Event 2 @ 12:01
-- dbt run --
User 1 Session 1 Event 3 @ 12:02

In this scenario, there are two possible outcomes depending on the sql_where chosen: 1) Event 3 does not get included in the Session 1 record for User 1 (bad), or 2) Session 1 is duplicated in the sessions table (bad). Both of these outcomes are inadequate!

With this release, you can now add a unique_key expression to an incremental model config. Records matching the unique_key will be deleted from the incremental table, then inserted as usual. This makes it possible to maintain data accuracy without recalculating the entire table on every run.

The unique_key can be any expression which uniquely defines the row, eg:

sessions:
  materialized: incremental
  sql_where: "session_end_tstamp > (select max(session_end_tstamp) from {{this}})"
  unique_key: user_id || session_index

3. Run schema validations concurrently #100

The threads run-target config now applies to schema validations too. Try it with dbt test

4. Connect to database over ssh #93

Add an ssh-host parameter to a run-target to connect to a database over ssh. The ssh-host parameter should be the name of a Host in your ~/.ssh/config file more info

warehouse:
  outputs:
    dev:
      type: redshift
      host: my-redshift.amazonaws.com
      port: 5439
      user: my-user
      pass: my-pass
      dbname: my-db
      schema: dbt_dbanin
      threads: 8
      ssh-host: ssh-host-name  # <------ Add this line 
  run-target: dev

Remove the model-defaults config #111

The model-defaults config doesn't make sense in a dbt world with dependencies. To apply default configs to your package, add the configs immediately under the package definition:

models:
    My_Package:
        enabled: true
        materialized: table
        snowplow:
            ...

dbt version 0.4.0

16 Aug 07:11
Compare
Choose a tag to compare

dbt v0.4.0

dbt v0.4.0 provides new ways to materialize models in your database.

0. tl;dr

  • new types of materializations: incremental and ephemeral
  • if upgrading, change materialized: true|false to materialized: table|view|incremental|ephemeral
  • optionally specify model configs within the SQL file

1. Feature: {{this}} template variable #81

The {{this}} template variable expands to the name of the model being compiled. For example:

-- my_model.sql
select 'the fully qualified name of this model is {{ this }}'
-- compiles to
select 'the fully qualified name of this model is "the_schema"."my_model"'

2. Feature: materialized: incremental #90

After initially creating a table, incremental models will insert new records into the table on subsequent runs. This drastically speeds up execution time for large, append-only datasets.

Each execution of dbt run will:

  • create the model table if it doesn't exist
  • insert new records into the table

New records are identified by a sql_where model configuration option. In practice, this looks like:

sessions:
    materialized: incremental
    sql_where: "session_start_time > (select max(session_start_time) from {{this}})"

There are a couple of new things here. Previously, materialized could either be set to true or false. Now, the valid options include view, table, incremental, and ephemeral (more on this last one below). Also note that incremental models generally require use of the {{this}} template variable to identify new records.

The sql_where field is supplied as a where condition on a subquery containing the model definition. This resultset is then inserted into the target model. This looks something like:

insert into schema.model (
    select * from (
        -- compiled model definition
    ) where {{sql_where}}
)

3. Feature: materialized: ephemeral #78

Ephemeral models are injected as CTEs (with statements) into any model that references them. Ephemeral models are part of the dependency graph and generally function like any other model, except ephemeral models are not compiled to their own files or directly created in the database. This is useful for intermediary models which are shared by other downstream models, but shouldn't be queried directly from outside of dbt.

To make a model ephemeral:

employees:
    materialized: ephemeral

Suppose you wanted to exclude employees from your users table, but you don't want to clutter your analytics schema with an employees table.

-- employees.sql
select * from public.employees where is_deleted = false

-- users.sql
select *
from {{ref('users')}}
where email not in (select email from {{ref('employees')}})

The compiled SQL would look something like:

with __dbt__CTE__employees as (
  select * from public.employees where is_deleted = false
)
select *
from users
where email not in (select email from __dbt__CTE__employees)

Ephemeral models play nice with other ephemeral models, incremental models, and regular table/view models. Feel free to mix and match different materialization options to optimize for performance and simplicity.

4. Feature: In-model configs #88

Configurations can now be specified directly inside of models. These in-model configs work exactly the same as configs inside of the dbt_project.yml file.

An in-model-config looks like this:

-- users.sql

-- python function syntax
{{ config(materialized="incremental", sql_where="id > (select max(id) from {{this}})") }}
-- OR json syntax
{{ 
    config({"materialized:" "incremental", "sql_where" : "id > (select max(id) from {{this}})"})
}}

select * from public.users

The config resolution order is:

  1. dbt_project.yml model-defaults
  2. in-model config
  3. dbt_project.yml models config

5. Fix: dbt seed null values #102

Previously, dbt seed would insert empty CSV cells as "None", whereas they should have been NULL. Not anymore!

dbt version 0.3.0

04 Aug 04:26
Compare
Choose a tag to compare

dbt v0.3.0

Version 0.3.0 comes with the following updates:

1. Parallel model creation #83

dbt will analyze the model dependency graph and can create models in parallel if possible. In practice, this can significantly speed up the amount of time it takes to complete dbt run. The number of threads dbt uses must be between 1 and 8. To configure the number of threads dbt uses, add the threads key to your dbt target in ~/.dbt/profiles.yml, eg:

user:
  outputs:
    my-redshift:
      type: redshift
      threads: 4         # execute up to 4 models concurrently
      host: localhost
      ...
  run-target: my-redshift

For a complete example, check out a sample profiles.yml file

2. Fail only within a single dependency chain #63

If a model cannot be created, it won't crash the entire dbt run process. The errant model will fail and all of its descendants will be "skipped". Other models which do not depend on the failing model (or its descendants) will still be created.

3. Logging #64, #65

dbt will log output from the dbt run and dbt test commands to a configurable logging directory. By default, this directory is called logs/. The log filename is dbt.log and it is rotated on a daily basic. Logs are kept for 7 days.

To change the name of the logging directory, add the following line to your dbt_project.yml file:

log-path: "my-logging-directory" # will write logs to my-logging-directory/dbt.log

4. Minimize time models are unavailable in the database #68

Previously, dbt would create models by:

  1. dropping the existing model
  2. creating the new model

This resulted in a significant amount of time in which the model was inaccessible to the outside world. Now, dbt creates models by:

  1. creating a temporary model {model-name}__dbt_tmp
  2. dropping the existing model
  3. renaming the tmp model name to the actual model name

5. Arbitrarily deep nesting #50

Previously, all models had to be located in a directory matching models/{model group}/{model_name}.sql. Now, these models can be nested arbitrarily deeply within a given dbt project. For instance, models/snowplow/sessions/transformed/transformed_sessions.sql is a totally valid model location with this release.

To configure these deeply-nested models, just nest the config options within the dbt_project.yml file. The only caveat is that you need to specify the dbt project name as the first key under the models object, ie:

models:
  'Your Project Name':
    snowplow:
      sessions:
        transformed:
          transformed_sessions:
            enabled: true

More information is available on the issue and in the sample dbt_project.yml file

6. don't try to create a schema if it already exists #66

dbt run would execute create schema if not exists {schema}. This would fail if the dbt user didn't have sufficient permissions to create the schema, even if the schema already existed! Now, dbt checks for the schema existence and only attempts to create the schema if it doesn't already exist.

7. Semantic Versioning

The previous release of dbt was v0.2.3.0 which isn't a semantic version. This and all future dbt releases will conform to semantic version in the format {major}.{minor}.{patch}.

dbt version 0.2.3.0

15 Jul 17:12
Compare
Choose a tag to compare

dbt v0.2.3.0

Version 0.2.3.0 of dbt comes with the following updates:

1. Fix: Flip referential integrity arguments (breaking)

Referential integrity validations in a schema.yml file were previously defined relative to the parent table:

account:
  constraints:
    relationships:
      - {from: id, to: people, field: account_id}

Now, these validations are specified relative to the child table

people:
  constraints:
    relationships:
      - {from: account_id, to: accounts, field: id}

For more information, run dbt test -h

2. Feature: seed tables from a CSV

Previously, auxiliary data needed to be shoehorned into a view comprised of union statements, eg.

select 22 as "type", 'Chat Transcript' as type_name, 'chatted via olark' as event_name union all
select 21, 'Custom Redirect', 'clicked a custom redirect' union all
select 6, 'Email', 'email sent' union all
...

That's not a scalable solution. Now you can load CSV files into your data warehouse:

  1. Add a CSV file (with a header) to the data/ directory
  2. Run dbt seed to create a table from the CSV file!
  3. The table name with be the filename (sans .csv) and it will be placed in your run-target's schema

Subsequent calls to dbt seed will truncate the seeded tables (if they exist) and re-insert the data. If the table schema changes, you can run dbt seed --drop-existing to drop the table and recreate it.

For more information, run dbt seed -h

3. Feature: compile analytical queries

Versioning your SQL models with dbt is a great practice, but did you know that you can also version your analyses? Any SQL files in the analysis/ dir will be compiled (ie. table names will be interpolated) and placed in the target/build-analysis/ directory. These analytical queries will not be run against your data warehouse with dbt run -- you should copy/paste them into the data analysis tool of your choice.

4. Feature: accepted values validation

In your schema.yml file, you can now add accepted-values validations:

accounts:
  constraints:
    accepted-values:
      - {field: type, values: ['paid', 'free']}

This test will determine how many records in the accounts model have a type other than paid or free.

5. Feature: switch profiles and targets on the command line

Switch between profiles with --profile [profile-name] and switch between run-targets with --target [target-name].

Targets should be something like "prod" or "dev" and profiles should be something like "my-org" or "my-side-project"

side-project:
  outputs:
    prod:
      type: redshift
      host: localhost
      port: 5439
      user: Drew
      pass:
      dbname: data_generator
      schema: ac_drew
    dev:
      type: redshift
      host: localhost
      port: 5439
      user: Drew
      pass:
      dbname: data_generator
      schema: ac_drew_dev
  run-target: dev

To compile models using the dev environment of my side-project profile:
$ dbt compile --profile side-project --target dev
or for prod:
$ dbt compile --profile side-project --target prod

You can also add a "profile' config to the dbt_config.yml file to fix a dbt project to a specific profile:

...
test-paths: ["test"]
data-paths: ["data"]

# Fix this project to the "side-project" profile
# You can still use --target to switch between environments!
profile: "side-project"

model-defaults:
....