dbt should know more semantic information #6644

callum-mcdata · 2022-12-05T14:29:16Z

callum-mcdata
Dec 5, 2022

Learnings from the past year

Before we get into what this issue is proposing to add to dbt-core, we want to make sure that the community understands what this functionality is building towards. Over the coming years, we envision dbt expanding beyond its current scope to provide users with the best experience in creating their data knowledge graph, comprising three layers:

The Physical Layer: the underlying object, location where data is stored
The Logical Layer: the method of applying dbt transformations to create datasets
The Semantic Layer: the mapping of business noun/verbs onto the logical/physical layer

dbt will tie all of these into a cohesive experience but recognizing them as distinct components provides an experience that is easy to adopt but flexible for many use cases.

Today, dbt has tightly coupled logical and physical layers and a new semantic layer that has initially focused on metrics. But in the past year, we’ve learned that in order to build the broader vision, we have to lay the groundwork of a fully featured semantic layer. Our goals are:

Create a loosely coupled abstraction on top of the logical layer where users can define and consume semantic logic (definitions, relationships, etc.) in their dbt projects.
Maximize efficiency and minimize the error of dbt asset definition for knowledge builders.
Maximize discoverability of dbt assets for data consumers.
Expand the range of tools that can interact with the Semantic Layer and the use cases it can serve.

What problems are we solving?

Here’s what we aim to accomplish in the near- to medium-term with this proposed scope:

Provide users with the foundation for a broader and more flexible semantic layer.
Decrease the effort and difficulty of defining metrics, namely dimension definition.
Ensure that finding the right semantic constructs is possible at scale.

Introducing the entity

To both solve these problems and lay the foundation for a semantic future, we are proposing a new node type called an entity. entities are top-level nodes within dbt-core and represent the declared interface to a specific model, containing additional metadata (semantic information) that can’t live within models. Each entity will be associated with a distinct business noun/verb and allow for dbt users to create a single universal semantic model across their entire project.

To quote our lovely @jtcohen6 , entities are for everyone. We envision a world where there might be teams of different humans managing the logical layer and the semantic layer given their interest and expertise.

What are our building blocks?

In order to solve these problems, we need to figure out what our building blocks are and whether we need to add anything new:

model: A data transformation that provides the business-conformed representation of the dataset, or more specifically a discrete unit of transformation. This is the building block component of the logical layer.
metric: An aggregation of data (defined on top of an entity) that represents a measurable indicator for the business.
- With the introduction of entities, metrics need to change so that they can be built on top of entities as opposed to models. But this is good news! Not only does it allow metrics to inherit a lot of the defined information (making metrics more DRY), but it is also a forcing function to make metrics more flexible.
entity: A new abstraction loosely coupled with a model that allows users to map business concepts onto the underlying logical model.
- This is the new “building block” that we’re proposing! We believe this will unlock new functionality and give us the best framework on which to continue building the Semantic Layer.
- In other words, the entity construct allows you to define all of your business concepts as first-class representations in your dbt project so all this information can be consumed downstream in your analytics tools of choice.

Fitting in with our story

The ever-present theme of dbt Labs’ story is taking the best practices of software engineering and converting them to the data world. In the case of entities, we’re taking the best practice of API design and contracts between consumers and producers. Software engineering teams don’t expose the underlying table to their consumers – they bundle it in a format that they know matches the consuming behavior. So too should dbt users employ those principles to build their semantic layers.

The entity spec

Property	Description	Example	Required
name	The name of the entity	orders	Yes
model	The name of the model that the entity is dependent on	fact_orders	Yes
description	The description of the entity	Lorem Ipsum	No
dimensions	The list of dimensions and their properties associated with the entity.	See below	No
dimensions.include	Either * to denote all columns or a list of columns that will be inherited	* or [column_a, column_b]	No
dimensions.exclude	If * is set at include_columns, this is a list of columns to be excluded from that list	[column_a]	No
dimensions.name	The name of the dimension	order_location	No
dimensions.column_name	The name of the column in the model if not 1:1. Serves as mapping	location	No
dimensions.data_type	The data type of the dimension	string	No
dimensions.description	Description of the dimension	The location of the order	No
dimensions.default_timestamp	Setting datetime dimension as default for metrics	false	No
dimensions.time_grains	Acceptable time grains for the datetime dimension	[day, week, month]	No
dimensions.primary_key	Whether this dimension is part of the primary key	false	No

What would an example look like?

# models/semantic_layer/product/schema.yml

entity:
  - name: orders
    model: ref('fact_orders')
    dimensions:
       - include: *
       - exclude: [column_c]

    ## An example where we don't want it to inherit everything the model
  - name: organization
    model: ref('dim_organization') 
    dimensions:
      - name: organization_id
         type: primary_key

      - name: some_dimension_name
         column_name: some_column_name_that_doesnt_match

Functional requirements

Entities will participate in the dbt DAG as a distinct node type
Entity nodes should be accessible in the dbt Core compilation context via:
- the graph.entities variable
Entity nodes should be emitted into the manifest.json artifact
Entity should work with partial parsing
Metric nodes should be supported in node selection and should be selectable with the entity: selector
- When listing nodes, existing graph operators (+, &, etc) should be supported
Entities should be surfaced in the dbt Docs website

Similar to metrics, dbt Core itself will not evaluate or materialize entities. These are virtualized abstractions that are exposed to downstream tools/packages for use for the purpose of discovery/understanding and dynamic dataset generation. Properties like data type are also useful for Semantic Layer integrations.

Just as dbt_metrics exists to interact with metrics, we’ll provide a method of interacting with entities that will evolve in this way until it is stable and bundled with dbt-core. The exact format will come in a future issue.

The updated metric spec

With the addition of entities:

Property	Description	Example	Required	Changed
name	The name of the metric	new_customers	Yes	No
label	The human readable name of the metric	New Customers	Yes	No
entity	The name of the entity that the metric is dependent on	customers	Yes	Yes
calculation_method	The method of calculation (aggregation or derived) that is applied to the expression	count_distinct	Yes	No
expression	The expression to aggregate/calculate over	user_id	Yes	No
description	The description of the metric	Lorem Ipsum	No	No
dimensions.include	Either * to denote all columns or a list of columns that will be inherited	* or [column_a, column_b]	No	Yes
dimensions.exclude	If * is set at include_columns, this is a list of columns to be excluded from that list	[column_a]	No	Yes
timestamp	The time-based component of the metric	signup_date	No	No
time_grains	One or more "grains" at which the metric can be evaluated.	[day, week, month, quarter, year]	No	No
window	A dictionary for aggregating over a window of time.	{count: 14, period: day}	No	No
filters	A list of filters to apply before calculating the metric	See below	No	No
config	Optional configurations for calculating this metric	{treat_null_values_as_zero: true}	No	No
meta	Arbitrary key/value store	{team: Finance}	No	No

What would an example look like?

metrics:
 - name: new_customers
   label: New Customers
   entity: customers
   calculation_method: count_distinct
   expression: user_id 

   dimensions:
      include: *
      exclude: 
        - column_a
        - column_b

What’s changed:

Metrics are now built on top of a entity instead of a model
The timestamp property is now optional.
The time_grains property is now optional. If a timestamp is provided that does not have time_grains associated with it, we will now provide defaults of day, week, month, year
The dimensions property has been split into two properties:
- include: This property is either set to * which inherits all of the dimensions from the entity or a list of columns that limits the input
- exclude: If include is configured as * then this property can be used to exclude the listed dimensions from the dimension list

How does this impact what you’ve currently built with metrics?

Similar to how we handled changes to the metric spec in the release of dbt-core 1.3.0, we will support the old behavior for a full minor version with backwards compatibility. After that, we will fully deprecate the old properties.

This means that your metric definitions will remain viable for a full minor version upgrade - so if this is launched as part of 1.5.0 then you don’t need to migrate until 1.6.0. That being said, we hope you migrate over earlier for the advantages of using entities 🙂.

Let’s talk about joins

All right, it’s time to address the complex elephant in the room: joins. Supporting joins has been one of the top feature requests that we’ve heard since adding metrics to dbt and we understand why. Joins will allow you to fit metrics into your overall data model (be it Kimball, Inmon, etc.) and expand the ease with which your teams can adopt metrics.

But they’re not part of this issue. And that’s for a good reason: joins are hard.

We’re committed to adding joins in the future but are very aware that supporting this functionality effectively means building two-thirds of a query planner ourselves. And it’s a query planner that needs the underlying information that we will add loosely coupled entities.

This is only further complicated by our goal of providing a universal semantic layer across all entities defined in your project, as opposed to an explore-based semantic layer where relationships may need to be defined multiple times. In this world, our query construction process has to be able to traverse the semantic graph to determine whether a query is not only viable but also if it makes semantic sense. To quote the original metrics issue:

It is extremely common to see folks perform syntactically correct but semantically meaningless calculations over data. This looks like averaging an average or adding two distinct counts together. You get a number back... but it's not a useful or meaningful result.

With all this said, we are committed to adding joins in the future but are taking our time to ensure what we launch is right for analytics engineers, the data consumers, and our integration partners who will build on top of it.

Describe alternatives you've considered

Including this semantic information in the model config

We explored a number of different designs during the ideation process and one of the main alternatives was storing this type of semantic information inside of the model configuration. Ultimately we determined it wasn’t the path forward for a number of reasons:

Storing semantic information in the model config creates a tightly coupled experience between the logical and semantic layers, which would make it difficult to enable our vision of different groups of humans simultaneously contributing to different layers
The goal of the model config is to serve as the implementation detail, whereas the goal of the entity is to be a declared interface. Coupling them together reduces the flexibility of the declared interface
Coupled entities cannot move/migrate to a new underlying model without significant effort, whereas a loosely coupled interface can be pointed at a new model easily.

Are you interested in contributing to this feature?

Absolutely.

Footnotes

A kind soul asked what this new node type would mean for exposures. We imagine exposures serving a very similar, if somewhat more important, role that they do today in that they would represent the consuming experiences sitting on top of entities and metrics!

aaronsteers · 2022-12-05T21:30:50Z

aaronsteers
Dec 5, 2022

Resurfacing this related proposal from last year 😄:

[Feature] dbt should know about `Attributes` (foundation for Metrics/Dimensions) #4621

0 replies

callum-mcdata · 2022-12-05T22:14:24Z

callum-mcdata
Dec 5, 2022
Author

@aaronsteers we had a response to #4621 in earlier versions of this issue but I think I must have gotten it lost along the way! Let me know if you feel that below is a fair characterization - the problems you raise are very valid but we weren't fully aligned on the implementation existing at that fine a grain.

We’d be remiss to not mention a fantastic issue opened(#4621) by AJ Steers of Meltano fame! In this issue/discussion, he proposes the addition of a new node type called Attributes . He describes it as:

a foundation for deeper metadata understandings within dbt and to unify the documentation effort for existing dbt projects

We thought about this issue for a while and considered this as a paradigm for expanding semantic layer capability but in the end decided that it felt like it was trying to tackle it two problems at one - both column consistency/documentation across a project and semantic information/structure in a project. The former is very important problem for dbt to consider but not something we consider to be in scope for the semantic layer. Additionally we believe entities are the best building block for semantic information because they are similar in grain/scope to models, whereas attributes map more closely to columns.

Column-level consistency and documentation are problems that we’ll want to tackle but most likely as parts of a separate initiative.

0 replies

jaypeedevlin · 2022-12-05T22:20:00Z

jaypeedevlin
Dec 5, 2022

After reading the section on joins, I'm not clear on whether the (eventual) vision is that entities will eventually contain information on joins, or that joins will sit between multiple entities?

0 replies

yu-iskw · 2022-12-06T00:27:52Z

yu-iskw
Dec 6, 2022

This is so cool. I want to define data models for exploratory data analysis as code. dbt metrics almost meets my expectations. But, as for me, one of the missing pieces is a measure to interactively seek metrics on a BI tool. Of course, We can technically deal with a table created by dbt without dbt semantic layer. But, I would like to define them semantically.

Maturity of metrics defined as dbt metrics may be relatively high in an organization. All stakeholders including business divisions agree with the metrics they should track. We may be able to create dashboards on a BI tool based on dbt semantic layer so that business divisions can track.

Meanwhile, in my opinion, we need another step for data analysts or data scientists to figure out candidates of metrics. After that, they can start a discussion about metrics which all stakeholders should track. I would like to define and export something to a BI tool for such exploratory data analysis.

If that makes sense, we may be able to put some parts of calculation_method to the entity layer so that we can define consistent measures like revenue. By doing so, multiple data analysts can use the same definitions even on exploratory data analysis. And ideally I would like to enable to export the definitions through dbt server.

0 replies

PedramNavid · 2022-12-06T04:12:28Z

PedramNavid
Dec 6, 2022

Hello! I've had my tea, finished my digestive 🍪 and have collected my thoughts.

To create a new layer between models and metrics is to add extra complexity. I am not opposed to it, and can understand the desire to add a decoupled layer, but I don't think I've seen a strong enough reason to justify it. That doesn't mean there isn't one, just that it wasn't immediately clear what it is. I believe the core argument is for better collaboration, but wondering if there are others?
Given 1, I think it's important to acknowledge the increase in maintainability and boilerplate that this will cause for AEs. Today an AE is responsible for sources, specs, schemas, docs, tests, exposures, and sql. We are now asking them to also be responsible for metrics.

Apologies for this long-winded example but: consider the experience of writing a new model: experimenting with SQL outside of dbt until the query looks good, migrating that query to dbt as a sql file and updating the references. Opening up a separate file to create the docs, spec, and specify other model attributes, often jumping back and forth between the two. Deciding between where to document a model when there are several models that reference the same data.

Now, open up a third file and create metrics, perhaps with a browser window open to validate the schema of the metrics file. There is a lot of complexity here!

To add Yet Another Yaml (YaYaML ™) to the mix would increase the burden on AEs for maintainability.

What would it look like to update the amount column to incremental_amount as an example?

Update the source, update the staging table, find all usages of that column in all downstream tables and update, find all references in all documentation and tests, and all metrics, and all metric documentation. This change could add another level to these changes. (I understand that you can use include: * but that doesn't preclude column name changes either.

I would love to see some care given to improve this experience somehow, if another level of abstraction is deemed necessary. I wish I had an easy solution that I could just dream up, but alas.

The name entity doesn't give me a strong indication of which models deserve entities and which don't, so would be great to have some opinionated advice here for users.

0 replies

olivierdupuis · 2022-12-06T14:53:41Z

olivierdupuis
Dec 6, 2022

First off, it’s great to see this new layer being built in public. That the community is invited to chip in.

I hear some of the "concerns" shared above about what would be the added value of defining entities outside their source models. I hear as well the opportunity for greater participation from non-AE folks in defining the semantic layer and not getting their hands dirty in those source models.

An interesting bit that caught my attention in the Slack discussion is "And that doesn't even get into some of the crazy ideas @abhi has around all SaaS businesses having the same entities/metrics and being able to map across all businesses", where common entities+metrics could be uniformed.

So I’m seeing that there might be a case for semantic packages. Something like marketing analytics, product analytics, which usually use the same set of entities and metrics could be packaged. Now it would only just be a matter of setting model source hooks to make those work. So there could be additional value here where communities are responsible for semantic packages and everyone benefits.

I guess my question is what you see as the future that will be unlocked by that building block? Am I off the mark? Or is there something else that you are envisioning that might not be immediately obvious when just focusing on those entities?

0 replies

cafzal · 2022-12-06T17:24:20Z

cafzal
Dec 6, 2022
Collaborator

@olivierdupuis Thanks for your comment. You're on 🎯 about entities as semantic building blocks.

The main advantage of structuring this design around entities is its ability to detach itself from the logical layer (models) and re-attach itself to any new implementation. This unlocks interesting possibilities by allowing data teams to map sources to entities and gradually standardizing/automating metric and even entity definition (in line with Abhi’s vision). You can see this trend toward (semi-)automation with the dbt metric packages that Fivetran and Houseware released.

We recognize @PedramNavid 's point about entities adding more complexity for analytics engineers to keep track of. There's a definite tradeoff to adding a new layer to the dbt project. But we've weighed the options: it doesn't make sense to overload models, and while treating metrics as first-class objects helps with lineage, modularity, etc., defining them is too inefficient right now (speaking of having to keep track of columns...). We'll keep iterating on the developer experience to reduce the overhead of managing different assets.

Introducing entities will help unlock metric definition at scale, lower the barrier of entry for data consumers to get involved (both in entity definition and analysis), and start paving this path to standardized packages.

0 replies

MichelleArk · 2022-12-06T23:37:02Z

MichelleArk
Dec 6, 2022
Maintainer

Zooming in the proposed entity spec's dimensions.data_type field: If the configuration of datatype on an entity deviated from the corresponding data_type configured on an underlying model column (if provided), would this result in data type coercions within entity query compilation?

If not, validation at parse-time would be necessary to ensure consistency if data_type is configured both in the model column information and a downstream entity dimension. This begs the question - should there be more than one way to encode the same metadata? Perhaps there is actually is a benefit in this case:

We envision a world where there might be teams of different humans managing the logical layer and the semantic layer given their interest and expertise.

Or perhaps this metadata really should just live in the logical layer (on the model), and propagate it's way up to a metric through an entity. Either way - we'd want to be precise in documentation about how this attribute is set and used by consumers.

0 replies

callum-mcdata · 2022-12-07T14:51:06Z

callum-mcdata
Dec 7, 2022
Author

Love the engagement we're seeing here! Let me see if I can do my best to address concerns in the following areas:

Fuzzy Added Value

These are fair concerns! Lets try to address the added value first.

what would be the added value of defining entities outside their models @olivierdupuis
adding a not immediately distinct layer on top of final models @jaypeedevlin (slack)

Concretely, the real value you'll get today from defining an entity on top of a model is the increased flexibility to define metrics. Metrics built on top of entities can inherit defined properties, such as the time_grains associated with a dimension, the default timestamp associated with an entity, or the list of dimensions for that entity. Adding this kind of inheritance behavior into models (ie representing semantic concepts within the model config) begins to really blur that line between defined interface and implementation detail, which overloads the model config to @cafzal 's above comment.

This workstream/proposal is really about creating the building blocks that will enable the functionality of Tomorrow™️, such as joins.

Models/Entities Are 1:1?

I already think of terminal (model) nodes as "entities" that I'm surfacing for end users. With this in mind, I feel like (based on the current proposal) I would have to double handle all my terminal models to define entities @jaypeedevlin

If there's a 1:1 relationship between a model and an entity, then aren't they always going to be tightly coupled? That's one of the reasons I was curious about entities containing joins (or at least information to facilitate joins), since that would demand a looser coupling. @jaypeedevlin

This is a great callout! When we say loosely coupled, we're referring to the fact that the relationships can be swapped/detached without impacting either of the nodes in question. IE, this loose coupling allows users to detach any semantic model and move it over to any new/edited implementation.

I already think of terminal (model) nodes as "entities" that I'm surfacing for end users. With this in mind, I feel like (based on the current proposal) I would have to double handle all my terminal models to define entities @jaypeedevlin

This is a totally reasonable hesitancy and part of the feedback we're hoping to get from commenters such as yourself. We feel reasonably confident that an entity as its own first-class representation inside of dbt is powerful because of the workflows it could enable outside of the AE workflow (more integration partners, easier interfaces for business users to add to the project, etc).

Obviously this comes with a degree of additional complexity that we believe is worth the tradeoff. What we'd love to get feedback on is ways to improve this developer experience - properties, behaviors, etc etc. What are some potential ways that this concept can fit more easily into your workflow?

Moving Along To Concrete Properties - Specifics Around Datatype!

If the configuration of datatype on an entity deviated from the corresponding data_type configured on an underlying model column (if provided), would this result in data type coercions within entity query compilation?

Our vision here was that datatype would be an entirely metadata property that could be provided to BI tools but we're very open to admitting we're wrong on this one! The problem we're attempting to resolve is that column/dimensions datatypes are not introspected from the db as part of the manifest so in order to surface them up to our integration partners we have to rely on the users running some command that generates the catalog such as dbt docs generate. The hope here is that we provide multiple places for the user to configure data types (model, entity, etc) so that the integration partners in question aren't as reliant on a user generating the catalog.

Or perhaps this metadata really should just live in the logical layer (on the model), and propagate it's way up to a metric through an entity. Either way - we'd want to be precise in documentation about how this attribute is set and used by consumers.

I am willing to be convinced that this is really a property of the implementation detail (ie model config) and not the declared interface, even if it somewhat diverges from some of the API design principles that we're trying to learn from. Especially with the work being done around constraints inside core, this feels like a reasonable thing to push down to the logical layer.

0 replies

nipunj15 · 2022-12-20T05:25:19Z

nipunj15
Dec 20, 2022

I've been waiting for far too long for this issue to drop. Loving the discourse here! Here are some thoughts/questions -

Metrics get supercharged?
Metrics being able to inherit relevant properties from the underlying Entities definitely makes it easier to create them. I'm curious to know however if this makes real time metric creation possible. There is a whole class of exploratory use cases in Product Analytics and other areas where this adds a lot of value. Here is an example that I posted earlier on dbt-slack that is cumbersome to solve in the current form of dbt Metrics.

Entities are for everyone
It's clear to me that similar to Metrics, Entities will make it easier for business users to consume business related data. Given that creating the Semantic Layer assets (Entities and Metrics) requires input from data teams as well as the business teams - is yaml the right language to author Entities and Metric? Or do we see Entities creation/update as a low frequency activity which resides in yaml while the derivatives of Entities(like Metrics) can be created/updated in a more business friendly non-yaml manner?

0 replies

jtcohen6 · 2023-01-18T11:05:30Z

jtcohen6
Jan 18, 2023
Maintainer

Now that we've opened #6626 to track the technical implementation, I'm going to convert this more-conceptual issue to a discussion. Members of the community should continue to feel encouraged to respond to threads above, or weigh in with new thoughts.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dbt should know more semantic information #6644

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 11 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

dbt should know more semantic information #6644

callum-mcdata Dec 5, 2022

Learnings from the past year

What problems are we solving?

Introducing the entity

What are our building blocks?

Fitting in with our story

The entity spec

The updated metric spec

How does this impact what you’ve currently built with metrics?

Let’s talk about joins

Describe alternatives you've considered

Including this semantic information in the model config

Are you interested in contributing to this feature?

Footnotes

Replies: 11 comments

aaronsteers Dec 5, 2022

callum-mcdata Dec 5, 2022 Author

jaypeedevlin Dec 5, 2022

yu-iskw Dec 6, 2022

PedramNavid Dec 6, 2022

olivierdupuis Dec 6, 2022

cafzal Dec 6, 2022 Collaborator

MichelleArk Dec 6, 2022 Maintainer

callum-mcdata Dec 7, 2022 Author

Fuzzy Added Value

Models/Entities Are 1:1?

Moving Along To Concrete Properties - Specifics Around Datatype!

nipunj15 Dec 20, 2022

jtcohen6 Jan 18, 2023 Maintainer

callum-mcdata
Dec 5, 2022

aaronsteers
Dec 5, 2022

callum-mcdata
Dec 5, 2022
Author

jaypeedevlin
Dec 5, 2022

yu-iskw
Dec 6, 2022

PedramNavid
Dec 6, 2022

olivierdupuis
Dec 6, 2022

cafzal
Dec 6, 2022
Collaborator

MichelleArk
Dec 6, 2022
Maintainer

callum-mcdata
Dec 7, 2022
Author

nipunj15
Dec 20, 2022

jtcohen6
Jan 18, 2023
Maintainer