How do we structure our dbt projects in 2022 and beyond? Discussing the creation of a new Guide. #1284

gwenwindflower · 2022-03-29T19:25:26Z

gwenwindflower
Mar 29, 2022

What are we doing?

The How we structure our dbt projects post is one of the most popular and relied upon works of analytics engineering knowledge created to date. It's been a long time (in data years) since it was published though, and we have a new, improved system for sharing knowledge on the Developer Hub.

We decided it was time to update this classic as our first Guide on the new platform! While we still have many opinions shaped by our consulting, teaching, and solutions work with companies of all sizes, this time we wanted to make sure we talked in-depth with you all, to fold your voices into our recommendations. While this Guide will still represent dbt Labs Best Practices, it's important to us that these are informed and improved by the Community. Particularly, we want to hear about any important areas you felt the original didn't cover, or areas where you strongly disagreed!

We also have some specific questions we're discussing internally about changes we've made to naming and other principles that we'll aim to share with you all soon, so if you're interested, we'll branch another discussion off of this one within the next couple weeks.

Some questions to consider

Consider all of the following as potential prompts for thinking about this core set of questions: what aspects of the original How we structure our dbt projects post influenced you the most? What stuck with you? What was missing? What did you invent for yourself? Where did you diverge from dbt Labs' best practices over time? Where have you always disagreed? We'd love to understand what we can do to improve the coverage and structure of the new guide as we update the content and platform it lives on.

Some areas that might spark ideas on the above: How do you manage files and folder structure? How do you split up YAML files within that? What sort of naming conventions do you rely on? How do these intersect with your modeling approach? How do files and folder conventions intersect with tagging, YAML selectors, and selector syntax overall -- in both development and jobs? Do you use snapshots (an area not covered by the original guide) and if so how and where? What macros do you always override or packages do you reach for all the time? What do you do with the analysis folder?

What we're not covering in this guide

The dbt Labs Style Guide will be in a separate guide, so you can save ~~war~~ discussion over commas for a different thread. 😄
CI/CD, version control, and our recommendations around PR and Issue templates will be in a separate guide around project development best practices.
Everything downstream from marts: metrics, exposures, ML and reporting use cases - these will also be in their own guide. The guide being discussed here will focus on the foundational structure that can be more broadly applied across a range of projects and use cases.

Thank you

We really appreciate you all taking the time to share your thoughts on the next generation of this guide with us!

krevitt · 2022-03-31T19:47:55Z

krevitt
Mar 31, 2022

Hmm should we include our use of reverse ETL internally @ dbt Labs (and our use of an export folder)? Feel like, as a net-new tooling option since this post was written, people might be curious how we're structuring our project to integrate w/ that layer.

1 reply

gwenwindflower Mar 31, 2022
Author

yesss but i think that would fit with the post-marts guide more than this one.

epapineau · 2022-03-31T20:21:19Z

epapineau
Mar 31, 2022

Very excited to see this article getting highlighted for a refresh. A 2022-take would be highly beneficial. I have personally used the existing article at several points in my dbt journey:

when first learning dbt
anytime I started a new dbt project as a consultant
teaching the importance of file naming conventions to clients

Considering the prompts, it felt important to highlight that this isn't a one-and-done document, this is a document that you revisit anytime you find yourself weighing different solutions to a problem.

What influenced me / stayed with me

1:1 source to staging models with data standardization in staging and the stg_ prefix. An update should include a thoughtful delineation on this pattern (this is a question that comes up with clients a lot)
One .yml file per directory. We need more emphasis on there being multiple ways that this can be implemented, but the most important value being consistency in application for project scalability.

What was missing / I invented

The connection between a well structured project and an easy to operationalize project. Making an explicit connection between dbt’s node selection syntax, ideal job structure (at least when using dbt Cloud), and how to design the project with those in mind. Somewhat related, using tags to fill in the gaps for this system and adding exposures for critical downstream-of-dbt use cases.
Any context on what base models are and why I may or may not want to implement them. Using Stripe as an example dataset sets us up with a good example of when base models can be helpful. It also demonstrates how you might differ from our suggestions due to your project's specific needs.
How to adapt! We can't make a document for every business use case and we want this document to persist into the future. We should extrapolate on some concepts to teach the principles that justify them, empowering dbt users to make their own decisions when the guide does not cover their scenario (like the reverse ETL comment above).

Overall, the reader should be able to step away from this document with a mental framework for working in dbt and an understanding that...

0 replies

KiraFuruichi · 2022-03-31T20:36:15Z

KiraFuruichi
Mar 31, 2022

Re-commenting since I wasn't done with my thoughts!

Things I love about this article:

Explaining the differences between and what happens in src/stages/marts -- this has always been such a good resource to separate these out.
The visualization of tree structure is easy to follow (and replicate potentially)
The humanness of the language (we mess up, we learn things and we grow from it sort of vibes)
The base transformation (select, rename, select * structure) has been a go-to for me for years now
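
For anyone who hasn't seen it, the select / rename / select * structure in the last item could be sketched roughly like this. It runs through Python's sqlite3 only so the snippet is self-contained. The table raw_payments and its columns are made up for illustration, and in a real dbt model the first CTE would select from a source() reference rather than a hard-coded table.

```python
import sqlite3

# Hypothetical raw table standing in for a dbt source; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("create table raw_payments (id integer, orderid integer, amount integer)")
conn.execute("insert into raw_payments values (1, 10, 1250)")

# Staging pattern: pull the source as-is, rename in one place, then select *.
STAGING_SQL = """
with source as (
    select * from raw_payments
),
renamed as (
    select
        id as payment_id,
        orderid as order_id,
        amount as amount_cents
    from source
)
select * from renamed
"""

cursor = conn.execute(STAGING_SQL)
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
print(columns, rows)
```

Keeping all renames in a single CTE means anything downstream only ever sees the cleaned column names.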

This is a classic for a reason! But here's some things I was thinking about as I reread this for the 123921389 time:

I think providing strong motivation for the list below could be beneficial for ppl who are new to data modeling and dbt. I think this can also help ppl down the line when they may have to deviate from this structure -- why would you deviate? when do you decide to deviate?

I think the differentiation between fct_ and dim_ models could be stronger. Talking more about the downstream BI use cases for these could probably be its own post (or book lol).
"Accessories to data" is interesting diction -- from my awareness, we don't use that language anywhere else on site -- and I think it is a good analogy, but potentially unclear especially if it's not used universally throughout dbt docs
Linking to definitions or other content (ex. this article mentions ephemeral materializations -- what if I don't know what that is?)

Actual AE things I've come to like:

This article definitely supports this already, but +1 what Elize says as well about the one yml per directory
I literally always implement this
I've personally stayed away from the "mart" terminology (hot take!) and have preferred using "xf" instead. I've found that "xf" was digestible or intuitive (not saying more than mart since it wasn't implemented) for data folks who were just getting started with dbt. Also, for end business users who are perusing through dbt Docs, this was potentially more intuitive for them.

3 replies

gwenwindflower Mar 31, 2022
Author

hell yea this is amazing feedback! i'm seeing already a thread emerging of more 'why' mixed in with the 'what' -- to enable people to better know when it's appropriate to deviate and for what reasons, which makes a lot of sense 🙏🏻

joellabes Apr 5, 2022
Maintainer

@KiraFuruichi what does xf stand for?

KiraFuruichi Apr 5, 2022

@joellabes xf stands for transformed! sorry should have clarified that :)

lbenezra-FA · 2022-04-01T16:50:11Z

lbenezra-FA
Apr 1, 2022
Collaborator

agree so much with the above!
I really value the original post, and use it frequently to discuss structure with clients.
The overall structure outlined is still relevant in most projects today, with a few modifications. However, as Elize pointed out, it doesn't do a lot of explaining, and that's one of the things I find myself doing when I point people to this document.

One of the most frequent and most important concepts is the one-to-one staging layer -- this concept I view as more of a rule than a suggestion. But why do we build it? Why is it so important? Why would it be a rule and not a guideline?! That needs to be explained here, otherwise it seems like extra and unnecessary work just for the sake of structure and organization. Would love to help craft this convincing piece of writing.

Other concepts I think are more rule-like:

prefixes always to indicate the layer of the model
avoiding circular references when possible -- what joins to what?
lean towards waterfall design -- design by the DAG
emphasis on one yaml file per directory
testing at each layer, especially when you've joined and/or changed grain!
keeping the staging layer to cleaning and renaming, no joining.

The other pieces I'd like to see as options. Here are several if-then examples:

If you are using a BI tool that isn't designed for robust self-service, or perhaps your dbt analytics engineering team also is your BI team, then you may want to consider adding a reporting layer, models prefixed with rpt_. These live downstream of the Marts layer, and perform joins, aggregations, etc on the marts in order to build a report-ready model. Here's where you may see a ballooning of nodes from your marts, that's expected because this layer is building very specific tables.
If you are joining staging tables to create accessible, but not quite marts-level tables, your intermediate models can live in an intermediate folder outside of marts. These can be materialized, but avoid exposing them to your BI layer.
If your marts subdirectories are consumed pretty separately, and the intermediate joins are specific to those marts, then it may make more sense to have intermediate layers exist within your marts subdirectories.
If you want to avoid building a model before your source data is tested, throw tests in source yamls; if you are filtering out or de-duping in your staging layer, it may make more sense to test at the staging layer, rather than your source. (is there a case for testing both?)

I'd love to see mentions of the accessories -- even though I know we'll cover them in separate posts. but more like, hey we know your data doesn't end in dbt! it may end up in these places. look out for some best practice guides there.

0 replies

epapineau · 2022-04-06T00:31:24Z

epapineau
Apr 6, 2022

Why would it be a rule and not a guideline?!

^ 💯

I've been thinking a lot about the delineation you made between rules and options and I really appreciate it as a framework for this revamp.

0 replies

patkearns10 · 2022-04-06T02:11:14Z

patkearns10
Apr 6, 2022
Collaborator

Agree with all of the above!

I have lots of thoughts on all of this, but the things that stick out the most to me are:

model names are important everywhere.
- Be verbose and use prefixes in a format like <type/dag_stage>_<source/topic>__<additional_context>
folder structure is important for developer sanity and model selection syntax / job execution strategies.
custom database/schema declarations are important for stakeholders / those who query the data.
model names are more important than folder structure and schema declarations
yml file names don't matter, use _<dir_name>.yml to ensure it is pinned to the top of a subfolder/directory
same with markdown files, _<dir_name>.md to ensure it is pinned to the top of a subfolder/directory
I prefer intermediate to be its own directory at the same level as marts and staging. But really I would prefer if they were organized in the order that data flows, so I am partial to:

└── models
    ├── 1_staging
    ├── 2_intermediate
    ├── 3_marts          # or whatever we decide to call this
    └── 4_reports

diligence in the right areas pays dividends.
we should more clearly articulate where we should allow bending of the rules vs not.
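
A rough sketch of how the naming points above could be checked mechanically. The regex is one assumed reading of <type/dag_stage>_<source/topic>__<additional_context>, and the helper parse_model_name and the example file names are hypothetical. Note that Python sorts by character code, which may not match how every IDE or file browser orders its tree.

```python
import re

# Assumed reading of the pattern: a lowercase prefix, one underscore, a topic,
# a double underscore, then additional context. This is a sketch, not a dbt rule.
MODEL_NAME = re.compile(
    r"^(?P<stage>[a-z]+)_"
    r"(?P<topic>[a-z0-9]+(?:_[a-z0-9]+)*)__"
    r"(?P<context>[a-z0-9]+(?:_[a-z0-9]+)*)$"
)

def parse_model_name(name):
    """Return the parts of a model name, or None if it does not fit the pattern."""
    match = MODEL_NAME.match(name)
    return match.groupdict() if match else None

parsed = parse_model_name("stg_stripe__payments")
rejected = parse_model_name("stg_stripe_payments")  # single underscore only

# A leading underscore sorts ahead of lowercase letters by character code,
# which is what pins _<dir_name>.yml and _<dir_name>.md to the top.
files = sorted(["stg_stripe__payments.sql", "_stripe.yml", "_stripe.md"])

# Numeric prefixes keep the folders in the order that data flows.
folders = sorted(["3_marts", "1_staging", "4_reports", "2_intermediate"])

print(parsed, rejected, files, folders)
```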

7 replies

eoghanosweeney Apr 6, 2022

Oh that would be excellent if it was something else, I really want this! Thank you for diving in here Pat! I was getting this error:

Which I then reverted after a chat with support, who said the Cloud IDE didn't like leading underscores. The PR I put up only changed file names to have the _ and the file in question looked like this before I reverted:

patkearns10 Apr 6, 2022
Collaborator

🤔 the compiler thinks you are using a space instead of an underscore? or it is looking for the previous file that it knew existed?

patkearns10 Apr 6, 2022
Collaborator

I would try a net new yml file that starts with an underscore and set up a fake configuration that way and see if it works for you. It might have something to do with partial parsing: you changed the name of a file to have an underscore, and the compiler is looking for the old file because it doesn't recognize they are different (this is a guess).

eoghanosweeney Apr 27, 2022

Hey Pat! Sorry I never reported back. I have done what you said and created new .yml files using the underscore and deleted the old versions, and that seems to have tricked the compiler. The IDE is looking squeaky clean! Thank you again.

patkearns10 Apr 27, 2022
Collaborator

Glad to hear it works @eoghanosweeney !

gwenwindflower · 2022-06-02T03:49:51Z

gwenwindflower
Jun 2, 2022
Author

it lives! https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview

0 replies

joellabes · 2022-07-13T02:15:09Z

joellabes
Jul 13, 2022
Maintainer

I am late to this party, but am finally reading through! It is very good. I have a handful of nitpicks and comments, mostly around SQL style as opposed to naming structure specifically.

The final CTE seems to be gone (example). I love the final CTE. Why?
Joins aren’t using using, even when the column names align. It's no longer mentioned one way or the other in the Fishtown guide, but I really like them when available. (example)
When transforming units, I strongly prefer always providing the unit, i.e. instead of amount / 100.0 as amount I would prefer amount / 100.0 as amount_dollars. This is helpful for discoverability downstream and prevents false assumptions by consumers who think that they are working with the untransformed amount column, but also avoids chaos caused by lateral column aliasing.
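
To make the unit-naming point concrete, here is a minimal sketch, assuming the raw amount column is stored in cents. The table stg_payments and its values are hypothetical, and sqlite3 is used only so the query can be run as-is.

```python
import sqlite3

# Hypothetical staging table; amount is assumed to be stored in cents.
conn = sqlite3.connect(":memory:")
conn.execute("create table stg_payments (payment_id integer, amount integer)")
conn.execute("insert into stg_payments values (1, 1250)")

# The alias names the unit, so the converted value cannot be mistaken for
# (or shadow) the untransformed amount column.
row = conn.execute(
    "select payment_id, amount / 100.0 as amount_dollars from stg_payments"
).fetchone()
print(row)
```

Because the alias differs from the source column, a later reference to amount in the same query still unambiguously means the raw value.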

One other thought: I think it would be useful when providing an antipattern example to make that clear in the image itself, for ease of skimming

2 replies

patkearns10 Jul 13, 2022
Collaborator

I do not believe we should ever use using because it is implicit rather than explicit, and it can have unexpected results when there are nulls involved: https://getdbt.slack.com/archives/CBSQTAPLG/p1626439665129500

patkearns10 Jul 13, 2022
Collaborator

+1 to all the other points though!
