dbt should know more semantic information #6644
Replies: 11 comments
-
Resurfacing this related proposal from last year 😄:
-
@aaronsteers we had a response to #4621 in earlier versions of this issue, but I think I must have lost it along the way! Let me know if you feel the below is a fair characterization - the problems you raise are very valid, but we weren't fully aligned on the implementation existing at that fine a grain. We'd be remiss not to mention a fantastic issue (#4621) opened by AJ Steers of Meltano fame! In this issue/discussion, he proposes the addition of a new node type called
We thought about this issue for a while and considered it as a paradigm for expanding semantic layer capability, but in the end decided that it was trying to tackle two problems at once - both column consistency/documentation across a project and semantic information/structure in a project. The former is a very important problem for dbt to consider, but not something we consider in scope for the semantic layer. Column-level consistency and documentation are problems that we'll want to tackle, but most likely as part of a separate initiative.
-
After reading the section on joins, I'm not clear on whether the eventual vision is that entities will contain information on joins, or that joins will sit between multiple entities?
-
This is so cool. I want to define data models for exploratory data analysis as code. dbt metrics almost meets my expectations, but one of the missing pieces, for me, is a way to interactively explore metrics in a BI tool. Of course, we can technically work with a table created by dbt without the dbt semantic layer, but I would like to define them semantically. The maturity of metrics defined as dbt metrics may be relatively high in an organization: all stakeholders, including business divisions, agree on the metrics they should track. We may be able to create dashboards in a BI tool based on the dbt semantic layer so that business divisions can track them. Meanwhile, in my opinion, we need another step for data analysts or data scientists to figure out candidate metrics. After that, they can start a discussion about which metrics all stakeholders should track. I would like to define and export something to a BI tool for such exploratory data analysis. If that makes sense, we may be able to put some parts of
-
Hello! I've had my tea, finished my digestive 🍪 and have collected my thoughts.
Apologies for this long-winded example, but consider the experience of writing a new model: experimenting with SQL outside of dbt until the query looks good, migrating that query to dbt as a SQL file, and updating the references. Opening up a separate file to create the docs, spec, and other model attributes, often jumping back and forth between the two. Deciding where to document a model when there are several models that reference the same data. Now, open up a third file and create metrics, perhaps with a browser window open to validate the schema of the metrics file. There is a lot of complexity here! To add Yet Another Yaml (YaYaML ™) to the mix would increase the maintainability burden on AEs.

What would it look like to update the …? Update the source, update the staging table, find all usages of that column in all downstream tables and update them, find all references in all documentation and tests, and all metrics, and all metric documentation. This change could add another level to these changes. (I understand that you can use ….)

I would love to see some care given to improving this experience somehow, if another level of abstraction is deemed necessary. I wish I had an easy solution that I could just dream up, but alas.
-
First off, it's great to see this new layer being built in public, and that the community is invited to chip in. I hear some of the concerns shared above about the added value of defining entities outside their source models. I also hear the opportunity for greater participation from non-AE folks in defining the semantic layer without getting their hands dirty in those source models. An interesting bit that caught my attention in the Slack discussion is "And that doesn't even get into some of the crazy ideas @abhi has around all SaaS businesses having the same entities/metrics and being able to map across all businesses", where common entities+metrics could be standardized. So I'm seeing that there might be a case for semantic packages. Domains like marketing analytics or product analytics, which usually use the same set of entities and metrics, could be packaged. It would then just be a matter of setting model source hooks to make those work. So there could be additional value here where communities are responsible for semantic packages and everyone benefits. I guess my question is: what do you see as the future that will be unlocked by that building block? Am I off the mark? Or is there something else you are envisioning that might not be immediately obvious when just focusing on those entities?
-
@olivierdupuis Thanks for your comment. You're on 🎯 about entities as semantic building blocks. The main advantage of structuring this design around entities is the ability to detach them from the logical layer (models) and re-attach them to any new implementation. This unlocks interesting possibilities by allowing data teams to map sources to entities and gradually standardize/automate metric and even entity definition (in line with Abhi's vision). You can see this trend toward (semi-)automation with the dbt metric packages that Fivetran and Houseware released. We recognize @PedramNavid 's point about entities adding more complexity for analytics engineers to keep track of. There's a definite tradeoff to adding a new layer to the dbt project. But we've weighed the options: it doesn't make sense to overload models, and while treating metrics as first-class objects helps with lineage, modularity, etc., defining them is too inefficient right now (speaking of having to keep track of columns...). We'll keep iterating on the developer experience to reduce the overhead of managing different assets. Introducing entities will help unlock metric definition at scale, lower the barrier of entry for data consumers to get involved (both in entity definition and analysis), and start paving the path to standardized packages.
-
Zooming in on the proposed entity spec's … If not, validation at parse-time would be necessary to ensure consistency if …

Or perhaps this metadata really should just live in the logical layer (on the model), and propagate its way up to a metric through an entity. Either way, we'd want to be precise in documentation about how this attribute is set and used by consumers.
-
Love the engagement we're seeing here! Let me see if I can do my best to address concerns in the following areas:

Fuzzy Added Value

These are fair concerns! Let's try to address the added value first.
Concretely, the real value you'll get today from defining an entity on top of a model is the increased flexibility to define metrics. Metrics built on top of entities can inherit defined properties, such as the … This workstream/proposal is really about creating the building blocks that will enable the functionality of Tomorrow™️, such as joins.

Models/Entities Are 1:1?
This is a great callout! When we say loosely coupled, we're referring to the fact that the relationships can be swapped/detached without impacting either of the nodes in question. IE, this loose coupling allows users to detach any semantic model and move it over to any new/edited implementation.
This is a totally reasonable hesitancy and part of the feedback we're hoping to get from commenters such as yourself. We feel reasonably confident that an entity as its own first-class representation inside of dbt is powerful because of the workflows it could enable outside of the AE workflow (more integration partners, easier interfaces for business users to add to the project, etc). Obviously this comes with a degree of additional complexity that we believe is worth the tradeoff. What we'd love to get feedback on is ways to improve this developer experience - properties, behaviors, etc. What are some potential ways that this concept could fit more easily into your workflow?

Moving Along To Concrete Properties - Specifics Around Datatype!
Our vision here was that datatype would be purely a metadata property that could be provided to BI tools, but we're very open to admitting we're wrong on this one! The problem we're attempting to resolve is that column/dimension datatypes are not introspected from the db as part of the
I am willing to be convinced that this is really a property of the implementation detail (i.e. model config) and not the declared interface, even if it somewhat diverges from some of the API design principles that we're trying to learn from. Especially with the work being done around constraints inside core, this feels like a reasonable thing to push down to the logical layer.
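To make the two placements being debated here concrete, here is a hedged sketch of where a data type declaration could live. All names are illustrative, and neither option is a confirmed spec - the `data_type` property name and the contract-style model syntax are assumptions for the sake of comparison:

```yaml
# Option A - data type as entity-level metadata, handed to BI tools
# rather than introspected from the warehouse (assumed property name)
entities:
  - name: customers
    dimensions:
      - name: customer_tier
        data_type: string

# Option B - pushed down to the logical layer: declared on the model
# itself, alongside the constraints work mentioned above
models:
  - name: dim_customers
    columns:
      - name: customer_tier
        data_type: string
```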
-
I've been waiting far too long for this issue to drop. Loving the discourse here! Here are some thoughts/questions: Metrics get supercharged? Entities are for everyone
-
Now that we've opened #6626 to track the technical implementation, I'm going to convert this more-conceptual issue to a discussion. Members of the community should continue to feel encouraged to respond to threads above, or weigh in with new thoughts.
-
Learnings from the past year
Before we get into what this issue is proposing to add to dbt-core, we want to make sure that the community understands what this functionality is building towards. Over the coming years, we envision dbt expanding beyond its current scope to provide users with the best experience in creating their data knowledge graph, comprising three layers:
dbt will tie all of these into a cohesive experience, but recognizing them as distinct components provides an experience that is easy to adopt yet flexible for many use cases.
Today, dbt has tightly coupled logical and physical layers and a new semantic layer that has initially focused on metrics. But in the past year, we’ve learned that in order to build the broader vision, we have to lay the groundwork of a fully featured semantic layer. Our goals are:
What problems are we solving?
Here’s what we aim to accomplish in the near- to medium-term with this proposed scope:
Introducing the entity
To both solve these problems and lay the foundation for a semantic future, we are proposing a new node type called an `entity`. `entities` are top-level nodes within dbt-core that represent the declared interface to a specific model, containing additional metadata (semantic information) that can't live within models. Each entity will be associated with a distinct business noun/verb and allow dbt users to create a single universal semantic model across their entire project.

To quote our lovely @jtcohen6, entities are for everyone. We envision a world where there might be teams of different humans managing the logical layer and the semantic layer, given their interest and expertise.
What are our building blocks?
In order to solve these problems, we need to figure out what our building blocks are and whether we need to add anything new:
- `model`: A data transformation that provides the business-conformed representation of the dataset - more specifically, a discrete unit of transformation. This is the building-block component of the logical layer.
- `metric`: An aggregation of data (defined on top of an entity) that represents a measurable indicator for the business. With `entities`, metrics need to change so that they can be built on top of entities as opposed to models. But this is good news! Not only does it allow metrics to inherit a lot of the defined information (making metrics more DRY), but it is also a forcing function to make metrics more flexible.
- `entity`: A new abstraction loosely coupled with a model that allows users to map business concepts onto the underlying logical model.

Fitting in with our story
The ever-present theme of dbt Labs’ story is taking the best practices of software engineering and converting them to the data world. In the case of entities, we’re taking the best practice of API design and contracts between consumers and producers. Software engineering teams don’t expose the underlying table to their consumers – they bundle it in a format that they know matches the consuming behavior. So too should dbt users employ those principles to build their semantic layers.
The entity spec
What would an example look like?
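The original example did not survive in this thread, so here is a minimal sketch of what an entity definition might look like under this proposal. All names and property keys (`customers`, the `model` reference syntax, the `dimensions` entries, `data_type`) are illustrative assumptions, not a confirmed spec:

```yaml
entities:
  - name: customers
    # hypothetical: the logical model this entity is loosely coupled to
    model: ref('dim_customers')
    description: "A person or organization with at least one order."
    # dimensions carry semantic metadata that can't live on the model,
    # e.g. the data type property discussed elsewhere in this thread
    dimensions:
      - name: customer_tier
        data_type: string
      - name: signed_up_at
        data_type: timestamp
```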
Functional requirements
- Entities should be accessible in the `graph.entities` variable
- Entities should be included in the `manifest.json` artifact
- Entities should be selectable with the `entity:` selector method
- Graph operators (`+`, `&`, etc) should be supported

Similar to metrics, dbt Core itself will not evaluate or materialize entities. These are virtualized abstractions exposed to downstream tools/packages for the purpose of discovery/understanding and dynamic dataset generation. Properties like data type are also useful for Semantic Layer integrations.
Just as `dbt_metrics` exists to interact with metrics, we'll provide a method of interacting with entities that will evolve until it is stable and bundled with dbt-core. The exact format will come in a future issue.
The updated metric spec
With the addition of entities:
What would an example look like?
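No example survived extraction here either, so here is a hedged sketch of a metric defined on an entity, illustrating the changes described under "What's changed": the `entity` key in place of `model`, an optional `timestamp`, inherited `time_grains` defaults, and the `include`/`exclude` split of `dimensions`. The specific names and the entity reference syntax are assumptions:

```yaml
metrics:
  - name: total_revenue
    label: Total Revenue
    # changed: defined on an entity rather than a model
    entity: entity('orders')   # hypothetical reference syntax
    calculation_method: sum
    expression: order_total
    # timestamp is now optional; if provided without time_grains,
    # defaults of day, week, month, year would apply
    timestamp: ordered_at
    dimensions:
      include: "*"             # inherit all entity dimensions
      exclude:
        - internal_test_flag   # illustrative column name
```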
What’s changed:
- Metrics are now defined on top of an `entity` instead of a `model`.
- The `timestamp` property is now optional.
- The `time_grains` property is now optional. If a `timestamp` is provided that does not have `time_grains` associated with it, we will now provide defaults of `day, week, month, year`.
- The `dimensions` property has been split into two properties:
  - `include`: This property is either set to `*`, which inherits all of the dimensions from the entity, or a list of columns that limits the input.
  - `exclude`: If `include` is configured as `*`, then this property can be used to exclude the listed dimensions from the dimension list.

How does this impact what you've currently built with metrics?
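The `include`/`exclude` resolution described above can be sketched as a small helper. This is not dbt code - just an illustration of the proposed semantics, assuming order-preserving behavior:

```python
def resolve_dimensions(entity_dimensions, include="*", exclude=None):
    """Resolve a metric's effective dimension list from its entity.

    include: "*" to inherit every entity dimension, or an explicit list.
    exclude: only honored when include == "*", per the proposed spec.
    """
    exclude = exclude or []
    if include == "*":
        # inherit all entity dimensions, minus any explicitly excluded
        return [d for d in entity_dimensions if d not in exclude]
    # an explicit include list limits the input directly
    return list(include)

# usage
dims = ["customer_tier", "region", "signed_up_at"]
resolve_dimensions(dims)                      # inherits all three
resolve_dimensions(dims, exclude=["region"])  # drops region
resolve_dimensions(dims, include=["region"])  # limits to region
```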
Similar to how we handled changes to the metric spec in the release of dbt-core 1.3.0, we will support the old behavior for a full minor version with backwards compatibility. After that, we will fully deprecate the old properties.
This means that your metric definitions will remain viable for a full minor version upgrade - so if this is launched as part of 1.5.0 then you don’t need to migrate until 1.6.0. That being said, we hope you migrate over earlier for the advantages of using entities 🙂.
Let’s talk about joins
All right, it’s time to address the complex elephant in the room: joins. Supporting joins has been one of the top feature requests that we’ve heard since adding metrics to dbt and we understand why. Joins will allow you to fit metrics into your overall data model (be it Kimball, Inmon, etc.) and expand the ease with which your teams can adopt metrics.
But they’re not part of this issue. And that’s for a good reason: joins are hard.
We’re committed to adding joins in the future but are very aware that supporting this functionality effectively means building two-thirds of a query planner ourselves. And it’s a query planner that needs the underlying information that we will add with loosely coupled entities.
This is only further complicated by our goal of providing a universal semantic layer across all entities defined in your project, as opposed to an explore-based semantic layer where relationships may need to be defined multiple times. In this world, our query construction process has to be able to traverse the semantic graph to determine whether a query is not only viable but also if it makes semantic sense. To quote the original metrics issue:
With all this said, we are committed to adding joins in the future but are taking our time to ensure what we launch is right for analytics engineers, the data consumers, and our integration partners who will build on top of it.
Describe alternatives you've considered
Including this semantic information in the model config
We explored a number of different designs during the ideation process and one of the main alternatives was storing this type of semantic information inside of the model configuration. Ultimately we determined it wasn’t the path forward for a number of reasons:
Are you interested in contributing to this feature?
Absolutely.
Footnotes
`exposures`. We imagine `exposures` serving a very similar, if somewhat more important, role to the one they play today, in that they would represent the consuming experiences sitting on top of entities and metrics!