Hello! This is the companion repo to the 2021 Coalesce Talk - Building a Mature dbt Project from Scratch
With the explosion in popularity of dbt, and the coinciding explosion in features and capabilities in the tool, it's natural for many of us to find ourselves unsure of where to start. Many people come across dbt through a recommendation of a particularly powerful feature that dbt can support, like complex macros or intricate incremental model logic, but it's both intimidating and unwise to dive directly into the deep end. As with any tool, it's best to walk before you run, and to learn how these features complement and build on each other so you can be confident you've developed a strong, sustainable, and scalable dbt project.
The goal of this repository is to show a single dbt project at different lifecycle stages, offering an opinionated view of when to introduce certain dbt features into your project. Each stage has a particular theme/purpose, and the listed feature sets connect to that learning goal. This is intended to be both a jumping-off point for new dbt users starting a project from scratch, and a rubric that existing dbt users can measure their own use of dbt features against to find opportunities for growth.
In each stage listed below (and in the accompanying talk), you'll see:
- A theme/purpose for the life stage
- Features relevant to the stage (with links to the relevant dbt docs)
- A picture of the DAG of the example project in that stage
- Links to channels on the dbt Community Slack that may be of interest!
A few caveats before diving in:
- There are real-life use cases where some features get introduced into projects out of the order described here, and that is perfectly reasonable. There are often very justifiable reasons to introduce more advanced dbt features earlier in the development cycle.
- There is no sense of timescale in this presentation! Some teams may mature their project in weeks rather than months, depending on a wide range of factors. It's more important to think about how features build upon themselves (and each other) than how quickly they do so.
- This presentation assumes familiarity and comfort with git and version control, and that all of the projects are already managed in a repository.
Each project is built on a mock data set of patients, doctors, claims, and other billing data. It was generated via the Mockaroo API. Huge hat-tip to @krevitt for building a sweet G-sheet x Mockaroo integration! In the 0-raw-data project, you can find the sample dataset this was built from, so you can load it into your warehouse and run each project to get a feel for how the functionality works!
Congratulations! It's (sorta!) a DAG!!
This project represents the bare minimum needed to have dbt do anything of use. It's only technically a dbt project, and it is going to need a lot of hand-holding to do anything useful and to stay alive.
dbt seed
dbt run
#advice-dbt-for-beginners
This project is just starting to play with its blocks, and see how the world fits together. It can now handle multiple models, and it's able to see the difference between raw and transformed data.
- Models
- adds {{ ref() }} functionality! Modularize your model!
- Sources
- uses {{ source() }} functionality, builds a layer of abstraction between source data and your transformations
- dbt Macros
- Start to understand some of the key built-in macros that make dbt work.
- Docs
- single model documentation for critical models
- Tests
- last-mile testing for final reporting objects
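As a rough illustration of how sources, docs, and tests fit together at this stage, here is a minimal sketch of a properties file. The file name, the source name `billing`, the schema `raw`, and the model `claims_summary` are all hypothetical and not taken from the example project; adjust them to wherever you loaded the sample data.

```yaml
# models/schema.yml (hypothetical file and object names)
version: 2

sources:
  - name: billing              # hypothetical source name
    schema: raw                # assumed schema holding the loaded raw data
    tables:
      - name: claims
      - name: patients

models:
  - name: claims_summary       # hypothetical final reporting model
    description: One row per claim, enriched with patient details.
    columns:
      - name: claim_id
        description: Primary key for a claim.
        tests:                 # last-mile tests on the reporting object
          - unique
          - not_null
```

With a file like this in place, a model could select from {{ source('billing', 'claims') }} instead of a hard-coded table name, and downstream models could build on it with {{ ref('claims_summary') }}.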
dbt seed
dbt run
dbt test
dbt docs generate
dbt docs serve
#advice-dbt-for-beginners
#advice-data-testing
Now we're starting to set our project free into the world. Time to set some ground rules! You wouldn't send your project to school without a list of allergies, so it's time to let people know how they should be interacting with your project.
- Project Standards and Documentation
- not technically a dbt feature per se, but critical to scaling!
- README
- Style Guide
- Contribution Guide
- PR Template
- Testing
- Standard minimum testing requirements
- Docs
- Model-level descriptions for all models
- Deployed and shared widely
- Materializations
- table
- Deployment (after all of the above!)
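One way the table materialization might be introduced is through folder-level configuration in dbt_project.yml. This is only a sketch: the project name `my_dbt_project` and the `staging` and `marts` folders are assumptions for illustration, not the layout of the example project.

```yaml
# dbt_project.yml (fragment; project and folder names are hypothetical)
models:
  my_dbt_project:            # must match the `name:` of your own project
    staging:
      +materialized: view    # view is also dbt's default materialization
    marts:
      +materialized: table   # persist final reporting models as tables
```

Configuring materializations per folder keeps individual model files free of config boilerplate, which tends to matter more as the number of models grows.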
dbt compile
dbt seed
dbt run
dbt test
dbt build
dbt docs generate
dbt docs serve
#advice-dbt-for-beginners
#advice-data-testing
#advice-data-modeling
Look at your beautiful project, all grown up, about to go to prom. At this stage, your project is learning things fast, and is looking for ways to work smarter, not harder (so it can spend more time at 7-Eleven with its friends).
- Sources
- Packages
- Materializations
- Documentation
- column-level docs for key metrics/critical columns
- Macros
- In-model SQL simplification
- Custom Deployments (specific jobs)
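Packages are installed by listing them in a packages.yml file and running dbt deps. The sketch below pulls in dbt_utils as an example; the version range is an assumption and should be replaced with one that matches your dbt version.

```yaml
# packages.yml (the version range is an assumption; pin what fits your dbt version)
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
```

After dbt deps, the package's macros are available inside models under the package namespace, for example dbt_utils.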
dbt deps
dbt compile
dbt seed
dbt run
dbt test
dbt build
dbt docs generate
dbt docs serve
#advice-dbt-for-beginners
#advice-data-testing
#advice-data-modeling
#advice-dbt-for-power-users
- Relevant tool-specific channels (e.g. #tools-looker, #tools-meltano)
By the time your project reaches adulthood, the basics of dbt should be humming along just fine, and that should buy it time to think back on its life, look inward, and figure out how it fits into the world. How has your project grown and changed? How does it relate to the world around it?
- Macros
- Operations for object management
- Selectors/Tags
- Custom Schema/Database Behavior
- Custom Generic Tests
- Hooks & Operations
- Exposures
- For dbt Cloud users: unlocks status tiles
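To make the exposures item more concrete, here is a minimal sketch of an exposure definition. The dashboard name, URL, owner details, and the `fct_claims` model are all hypothetical and only illustrate the shape of the config.

```yaml
# models/exposures.yml (hypothetical file; all names and the URL are illustrative)
version: 2

exposures:
  - name: claims_dashboard
    type: dashboard
    maturity: high
    url: https://bi.example.com/dashboards/claims   # placeholder URL
    description: Billing dashboard built on the final claims model.
    depends_on:
      - ref('fct_claims')        # hypothetical model name
    owner:
      name: Analytics Team
      email: analytics@example.com
```

Once defined, an exposure appears in the DAG and can be used in node selection, for example dbt run -s +exposure:claims_dashboard to build everything the dashboard depends on.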
dbt deps
dbt compile
dbt source freshness
dbt seed
dbt run
dbt test
dbt build
dbt run-operation
dbt docs generate
dbt docs serve
#advice-dbt-for-beginners
#advice-data-testing
#advice-data-modeling
#advice-dbt-for-power-users
- Relevant tool-specific channels (e.g. #tools-looker, #tools-meltano, #db-snowflake)
#towards-analytics-engineering
#metadata
- Introspective Analyses on dbt-produced artifacts
- if Cloud: Metadata API
- if Core: dbt-artifacts package
- Project Health Metrics
- Test Coverage
- Model Runtimes
Some features are not included in this project, not because they are unimportant, but because they are generally only used as needed, when the specifics of your data/project call for them.
- Snapshots
- Seeds (although the raw data project has a good example!)
- Variables/Environment Variables
- Analyses