[Feature] dbt should know about metrics #4071
@drewbanin what is the default null behavior for […]? For our platform, we'd probably want all (or a majority) of the columns, so I'm wondering if users would have to manually list them all out. Honestly, I can see the perspective of requiring a definition for every column, and some sort of […]
Hey @drewbanin! Incoming novel for ya ;) Additional Metric Properties:
re: dimensionality defined at the metric level -- if we're including dimensions as part of the metric definition, then the metric property should be `dimension_sets`:
But this raises the question -- should dimensions be defined at the metric level??
Insight/Answer/Widget Layer:
- revenue_by_plan
- revenue_by_state
- revenue_by_plan_state
Report Layer: Consumes the insights/answers/widget layer to include the various metrics/insights. Essentially defines which insights/metrics are part of a report.
Why not both? I think there's a lot of value to be found here on the pre-caching front, particularly in producing reports that don't require/allow users to query with specific date ranges. For example, I created a macro that populates a table with metrics defined at their various date granularities, such that I have a table like:
Here's an actual example using a barebones metric layer I defined:
Should they be allowed? I think so! Should this be the best practice? Absolutely not. Here's why I think it might be a good idea to allow this:
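The pre-caching approach above (one table holding a metric at several date granularities) can be sketched roughly as follows. This is a hypothetical Python stand-in for the macro described, not the author's actual code; the metric, table, and column names are placeholders.

```python
# Generate SQL that unions one aggregate per date granularity into a single
# metrics table, so downstream reports can read pre-computed day/week/month cuts.
def granularity_union(metric_sql, metric_name, table, time_col,
                      grains=("day", "week", "month")):
    parts = [
        (f"select '{g}' as grain, date_trunc('{g}', {time_col}) as period, "
         f"{metric_sql} as {metric_name} from {table} group by 1, 2")
        for g in grains
    ]
    return "\nunion all\n".join(parts)

print(granularity_union("sum(amount)", "revenue", "fct_orders", "ordered_at"))
```

A dbt macro version would loop over the same grain list in Jinja and materialize the result as a table model.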
This is awesome! And completely agree that it'll be a huge boon to the community and tooling to move toward a standard. It's something we've been really excited about at Lightdash. I think the proposal looks like a great direction. The metric definitions look familiar to the community too: adding to your list of comparisons for the implementation:
**Joins**

Agreed these can get confusing. Fortunately, they're also becoming less important: we're seeing teams move to big wide tables (where all models are pre-joined for a given analysis), so this seems like a reasonable omission.

**Time-dimension requirement (aka metrics vs measures)**

I was interested to see that a time dimension is a requirement; this excludes a lot of useful aggregates, for example:
Does this relate to your dimension of metrics vs measures:
I found this quote a bit confusing 🤔. Not sure if I understood correctly -- is the difference that a metric always has a time dimension and evaluates to a float (e.g., revenue over time, user count over time), hence giving a more rigorous "type"? Whereas a measure is any aggregation giving any type (date, string, timestamp, number)?

**Metrics on models**

There is definitely some benefit earned here, because for analysis tables there tends to be a subset of columns destined to be dimensions and some destined to become metrics in the semantic layer. The ability to just go through all columns and tag them with a dimension / metric type can make it really easy to bootstrap metrics from a dbt model. Just as @sisu-callum mentioned, there are productivity gains to be had. While joins aren't supported, metrics only map to a single model, so I think there's a strong argument to keep them on models. However, if dbt needs to retain the flexibility to define metrics across models, it would make more sense to define them on exposures / a new resource.

**SQL implementation**

One noticeable omission here is the implementation of how these metric definitions compile to SQL. I think it would be even more powerful if the metric definitions were paired with an expected SQL output for each target warehouse. Is that in scope for the project? It would be troublesome, for example, if I have my centralised metric definitions but downstream tools interpret those metrics differently. I imagine something like a reference implementation of the compiled SQL statements coming out of metrics (e.g. a dbt macro / package). Or perhaps the idea here (as with dbt models) is that it doesn't take an opinion on how a metric definition becomes SQL, and people are free to essentially build their own metrics layers using a common set of metric definitions. Last one...
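The "reference implementation" idea raised above could be sketched as a tiny compiler from a metric definition to SQL. Everything here is hypothetical: the dict keys mirror the proposed YAML spec (`name`, `model`, `type`, `sql`, `timestamp`, `dimensions`), but this is not dbt's implementation.

```python
# Compile one metric definition into a grouped SELECT over its model.
def compile_metric(metric, grain="day"):
    agg = {"count": "count", "sum": "sum", "average": "avg"}[metric["type"]]
    dims = metric.get("dimensions", [])
    time_expr = f"date_trunc('{grain}', {metric['timestamp']})"
    select_cols = [f"{time_expr} as period"] + dims
    group_cols = [str(i + 1) for i in range(len(select_cols))]
    return (
        f"select {', '.join(select_cols)}, "
        f"{agg}({metric['sql']}) as {metric['name']} "
        f"from {metric['model']} "
        f"group by {', '.join(group_cols)}"
    )

revenue = {
    "name": "revenue",
    "model": "fct_orders",
    "type": "sum",
    "sql": "amount",
    "timestamp": "ordered_at",
    "dimensions": ["plan", "country"],
}
print(compile_metric(revenue))
```

If vendors shared even a minimal reference compiler like this, two tools consuming the same metric definition would at least agree on the shape of the generated SQL.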
I think these properties:
seem more related to visualisation, and I think they would be strong candidates for the `meta` config. Thanks @drewbanin for including it!
Yeah, absolutely agree with you. And other things like pop_increase: [good/bad] could (should) be stored in the `meta` config too.
I would echo @owlas's points about the Time-Dimension requirement. It feels overly explicit for no clear benefit.
or
Many moons ago at Meltano we implemented a SQL:1999-compliant algorithm for properly handling JOINs and generating SQL. We call these "Hyper Dimensional Aggregates". The original issue documenting it is https://gitlab.com/meltano/meltano/-/issues/286, and the SQL:1999 compliance part is in https://gitlab.com/meltano/meltano/-/issues/344. This code is all MIT licensed and could most likely be ported (famous last words) over to dbt if that makes sense. Perhaps @drewbanin you're thinking of having a similar adapter model for SQL generation that dbt has for everything else (a la https://docs.getdbt.com/docs/available-adapters#dbt-labs-supported). Maybe this discussion is too much for the specification issue, but I wanted to call it out and put that bug in people's brains 🙂
UPDATE: I've logged a separate issue for adding "Metrics on Columns" vs "Metrics on Attributes".

First of all, I completely agree with the value here and the general direction, and I agree with many of the great points in the above thread. I do, however, think we are missing an important abstraction layer between "tables/columns" and "metrics". A column can only have a […]. So, I'd propose something along the lines of adding both attributes and metrics together, or with attributes arriving slightly before:
Other notes:
Related conversation here in the Lightdash repo, as we recently discussed swapping the terms.

**Example with Column->Attribute->Metric mappings**

Expanding the above with a specific strawman example:

```yaml
attributes:
  - name: customer_id
    label: Customer ID
    attribute_type: dimension_key
    column_names:
      - customer_id
    models:
      # as much or as little constraint to get ontologically the right column mappings
      - marts.*
      - "!marts.the_weird_one"
  - name: customer_count
    label: Customer Count
    description: A pre-aggregated count of customers.
    attribute_type: entity_count
    column_names:
      - customer_id
    models:
      # everywhere we see this column in the `marts` folder, map it to this attribute
      - marts.*
  - name: country
    label: Country Name
    attribute_type: dimension_key
    models:
      # everywhere we see this column in the `marts` folder, map it to this attribute
      - marts.*
    column_names:
      - country
      - country_name
  - name: sales_revenue
    label: Sales Revenue
    attribute_type: amount
    column_names:
      - sales_revenue
    models:
      - marts.*

metrics:
  - name: Number of Customers
    calculations:
      - expression: count(distinct {{ attribute('customer_id') }})
        default: true
        models:
          - dim_customers
      - expression: sum({{ attribute('customer_count') }})
        models:
          - fact_customers
          - fact_customers_by_region
  - name: Number of Countries
    calculations:
      # `attribute('country')` resolves to either `country` or `country_name`, depending on the model
      - expression: count(distinct {{ attribute('country') }})
        default: true
        models:
          - dim_customers
          - dim_countries
  - name: Sales Revenue
    calculations:
      - expression: sum({{ attribute('sales_revenue') }})
        models:
          # Anywhere the attribute is available
          - "*"
  - name: Average Customer Revenue
    calculations:
      - expression: "{{ metric('Sales Revenue') }} / {{ metric('Number of Customers') }}"
        # Anywhere both metrics are available
        models:
          - "*"
  - name: Predicted Year-End Revenue
    # This calc is built in the BI layer and is not able to be defined in dbt.
    # Instead of `expression`, we provide `depends_on` as a list of attributes.
    depends_on:
      - customer_favorite_color
      - customer_join_date
      - customer_ytd_sales
    models:
      - marts.predictions.fancy_ai_table
  - name: Super Fancy AI Prediction
    # This calc is built in the BI layer and is not able to be defined in dbt.
    # Instead of `expression`, we provide `depends_on` as a list of attributes.
    depends_on:
      - customer_favorite_color
      - customer_join_date
    models:
      - marts.predictions.fancy_ai_table
```
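The key behavior in the strawman above is that `attribute('country')` resolves to a different physical column per model. A minimal sketch of that resolution, with invented attribute and model data mirroring the YAML example (this is hypothetical, not a dbt feature):

```python
# Attribute definitions: each attribute lists candidate column names.
ATTRIBUTES = {
    "country": {"column_names": ["country", "country_name"]},
    "customer_id": {"column_names": ["customer_id"]},
}

# Columns actually present on each model (normally read from the warehouse).
MODEL_COLUMNS = {
    "dim_customers": ["customer_id", "country"],
    "dim_countries": ["country_name", "region"],
}

def resolve_attribute(attr_name, model):
    """Pick the first candidate column that exists on the target model."""
    for col in ATTRIBUTES[attr_name]["column_names"]:
        if col in MODEL_COLUMNS[model]:
            return col
    raise KeyError(f"attribute {attr_name!r} not available on {model!r}")

print(resolve_attribute("country", "dim_customers"))   # country
print(resolve_attribute("country", "dim_countries"))   # country_name
```

The same lookup would let a metric's `models: ["*"]` entry expand to exactly the models where all of its attributes resolve.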
Agreed @owlas, format / precision are a good fit for meta, as they seem like presentation attributes rather than core to the metric. That said, I'd love to see […]
Not sure I fully grasped this either. I think there are certainly instances (and some you've mentioned) where a metric (or measure?) may not necessarily evaluate to a float but may still be quite useful (e.g., the mode of a categorical variable).
I would love for dbt to support metrics, mostly so that we could include metric definitions in Fivetran dbt packages. I take issue with your definition of metrics, though: metrics can exist without time series. For example, if we have a database of species, “countif(kingdom = ‘animalia’)” is a metric which counts the number of animal species. This isn’t just a technical issue: there are lots of business metrics which are time-series in principle but which, in practice, we choose to calculate at just the present time because it’s simpler.
Great to see that dbt will support metric definitions! The proposal looks great and we're specifically interested in exposing these metrics to the downstream tools including SQL clients, BI tools, and data science tools. There are a couple of limitations to the proposed features:
Supportive, this would mean Snowplow dbt packages can have the metrics built-in. As package authors who already have multiple data stores/SQL flavors to support 💸, it would be great to see market standardization at the metrics definition layer 💯!
Some things that popped into my mind while reading this --
🎩 Written without my dbt Labs employee hat on, not that it changes anything 🎩
This seems to imply outsourcing the SQL generation that actually calculates the metric to each tool. Does that mean that (best case) the engineers at […]? Or does the quote above imply a reference implementation that vendors deviate from at their own risk?
I think this goes (only slightly) too far. How about this instead?:
For simple things like a "Number of Customers" or "Sales Revenue" metric, unless we intentionally sabotage the metric definition to be overly vague and non-prescriptive, there are only artificial barriers preventing dbt (or a dbt macro/extension/fork) from providing basic calculation results -- and, more valuably, metric-based unit tests. Rather than say "dbt Core should not" evaluate metrics, I think it's preferable to just stick with "dbt Core may or may not be able to evaluate any given metric". In my suggestions above, I give examples of metrics of three types:
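The "metric-based unit test" idea above could look something like this: evaluate a simple metric definition against fixture rows and assert an expected value. The `evaluate_metric` function and fixture data are hypothetical; dbt does not ship an evaluator.

```python
# Fixture rows standing in for a small slice of a model's output.
FIXTURE_ROWS = [
    {"customer_id": 1, "amount": 100},
    {"customer_id": 2, "amount": 50},
    {"customer_id": 1, "amount": 25},
]

def evaluate_metric(metric, rows):
    """Evaluate the handful of metric types simple enough to compute in-process."""
    if metric["type"] == "sum":
        return sum(r[metric["sql"]] for r in rows)
    if metric["type"] == "count_distinct":
        return len({r[metric["sql"]] for r in rows})
    # Anything else (e.g. a BI-layer calc) is declared but not evaluable here.
    raise ValueError(f"cannot evaluate metric type {metric['type']!r}")

assert evaluate_metric({"type": "sum", "sql": "amount"}, FIXTURE_ROWS) == 175
assert evaluate_metric({"type": "count_distinct", "sql": "customer_id"}, FIXTURE_ROWS) == 2
```

This matches the "may or may not be able to evaluate" framing: simple aggregations are testable locally, while opaque or tool-specific metrics raise rather than guess.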
Regarding tests for metrics, it could still be valuable to have expectations defined in dbt (even if not evaluated in dbt) and used as a reference for the BI tools to guarantee correctness.
@jpmont exactly. This is how we have always imagined a metric standard, the combination of:
Alternatively I like @aaronsteers point, which I took to be, if the metrics definitions are simple enough such that the SQL mapping is trivial, then we could reduce errors.
I believe that synchronizing metrics into downstream data tools is not that easy without a middleware layer. All the BI tools that support metrics out of the box have their own way to define metrics, and often there is no one-to-one mapping to SQL expressions:
Our approach was to build a middleware layer in between the BI tools and the dbt adapter (data warehouse) because we couldn't find a way to support all the measure types by compiling SQL to the native metric language of each BI tool. The BI tools see both metrics & dimensions as database columns, and dbt resources (seeds, sources, and models) as database tables, with Metriql's MQL feature. We use the Trino interface, so all the BI tools that support Trino / Prestodb also support Metriql out of the box. Since the databases don't implement metrics, we also maintain an additional integration (example for Superset) for mapping metrics, formatting, etc. in a native way. That way, the tools can just run SQL queries similar to this one:
and the engine compiles it to the following SQL query in an ad-hoc way:
If dbt standardizes the metric definitions, the BI vendors and metric stores can just integrate dbt via its Metadata API and generate the SQL in the serving layer. That way, dbt can handle documentation, testing, etc. for the metrics, and the data tools can build different interfaces for querying the metrics and just have their own SQL compiler for metric queries.
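The serving-layer rewrite described above (BI tools select metrics as if they were ordinary columns, and a middleware expands them into aggregate SQL) might look roughly like this. All table, column, and metric names are invented for the example; this is not Metriql's implementation.

```python
# Metric "columns" and the aggregate expressions they expand to.
METRIC_EXPRS = {"total_revenue": "sum(amount)", "order_count": "count(*)"}

def rewrite(table, columns):
    """Rewrite a flat column list into dimensions plus aggregated metrics."""
    dims = [c for c in columns if c not in METRIC_EXPRS]
    metrics = [f"{METRIC_EXPRS[c]} as {c}" for c in columns if c in METRIC_EXPRS]
    sql = f"select {', '.join(dims + metrics)} from {table}"
    if dims:
        sql += " group by " + ", ".join(str(i + 1) for i in range(len(dims)))
    return sql

print(rewrite("fct_orders", ["country", "total_revenue"]))
```

The BI tool only ever sees `country` and `total_revenue` as columns; the middleware owns the aggregation semantics.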
Hi all! Following @aaronsteers's comments concerning attributes: Julian Hyde (now working at Looker) did a great job with Mondrian, the open source ROLAP engine, trying to bind the relational and analytical worlds, and ended up with cube specifications which are imho worth reading. Anyway, it would be very cool to automatically generate a Mondrian schema on top of the future dbt metrics and dimensions.
@fabrice-etanchaud that's a great idea! I contacted Julian a while ago for a call for help, but it looks like it might be a conflict of interest since he's working for Looker. The idea was to parse MDX queries with Mondrian's Olap4j and compile them to SQL for the underlying data warehouse without any caching / storage mechanism, unlike Mondrian. We can generate the dimensional tables as dbt incremental / table models instead of using an additional caching mechanism, and reference these roll-up tables in the compiled SQL query. We have pending work, but MDX seems to be much more complex than we imagined. I would love to collaborate if there's any interest.
Hi @buremba, glad it makes sense to you! You are right, MDX has (for me) a huge learning curve, and I usually rely on tools like Saiku to build queries. Did you have a look at Dremio? You can leverage its reflections (a kind of materialized lattice, in Calcite's world) to seamlessly accelerate requests without setting up Mondrian's aggregate tables. Dremio can even pre-compute metrics based on the dimensions' distinct values.
@fabrice-etanchaud I didn't know about Dremio's reflections, but it looks like they store the data in external storage rather than in the same data warehouse, which is the main point of dbt. The idea is to create dbt models and utilize them rather than moving data from your data warehouse to an OLAP engine. I believe that it's more dbt-native and aligns with the modern data stack, since we will be pushing all the work into the data warehouse. dbt and the metrics layers can potentially replace OLAP solutions with less overhead; I wrote an opinionated blog post about this topic recently.
Metrics are always generated with respect to a particular time period and dimension set, so in order for dbt to evaluate arbitrary metrics, it would need a server component that can listen for requests, compile the request to SQL, and return the SQL. Right now, dbt Core itself does not have a long-lived server component that can respond to requests like these. I like the idea of allowing dbt Core to materialize a static parameterization of a metric as a model, e.g., specify "Sales Revenue by day, by territory". If someone is interested in running unit tests on a metric, they probably are trying to do so over a well-understood parameterization of that metric, and won't need dynamic cuts.
Hi @buremba, Dremio can be used as a data warehouse (it stores its data in Parquet files or even in Iceberg format). As it connects to your enterprise sources, there is little if any need for pipelines. You are right, if you already have a DWH, Dremio would add an extra layer. I just had a look at Metriql! In fact it seems you are using Trino's federation as a proxy to rewrite queries, just like Dremio does. Great! Best
A couple quick updates on recent threads:
While we're not trying to build a ROLAP query or caching engine, we are trying to establish a ROLAP framing -- just by nature of ROLAP, or "Relational OLAP", being the (somewhat dated now) name for what happens when you calculate analytical metrics directly over tabular sources.
Love the direction this is going. Two thoughts on extensibility:

**Metric extensibility**

You should be able to define a metric in terms of another. This helps keep definitions DRY. A […] There is a significant benefit to this: the sales metrics will use the same root, so you won't end up in a situation where […] You could imagine something like:

```yaml
metrics:
  - name: total_sales_germany
    type: sum
    parent_metric: total_sales
    dimensions:
      - product
      - plan
    filters:
      - field: country
        value: Germany
```

The filter field has to be a dimension on the parent metric, and the dimensions must be present in the parent metric. You could also imagine defining ratios from two existing metrics. You'd be able to visualize the DAG of metrics. +1 for including an […]

**Metric type extensibility**

The different metric types will need different settings. For example, a […] You could imagine something like:

```yaml
metrics:
  - name: unique_customers
    type: count_distinct
    settings:
      - distinct_field: customer_id
      - approximate: true
    dimensions:
      - segment
```

A strongly-typed settings object for each metric type reduces the error surface area significantly.
I am just loving this thread! Thank you all so much for contributing your thoughts and expertise in here -- I am wondering if I should have started with a public google doc instead of an issue 😂. This is what I'm hearing and what I'm thinking:

**Timestamps should be optional** ✅

**Behavior for dimensions** ✅ ❓

I am picturing the […] here, @sisu-callum. I am open to the idea of syntactic sugar for including all columns from a model as dimensions, but my preference would be to leave this out of the first cut of this implementation. For one, the set of all columns in the underlying model is going to include the column that's being aggregated over (e.g. […]). @carlineng correctly notes that dbt doesn't know all of the columns in a model a priori, but we can fetch the list of columns defined for a model in a schema.yml file. This feels like a reasonable tradeoff, as we really would want to make sure that the dimensions which power metrics are well-documented and well-tested. Either way, I'd like to leave that consideration out of scope for this ticket, and possibly pursue it in the future if it remains desirable!

@aaronsteers thanks for opening a new issue to discuss attributes -- we'll follow up with you in that ticket! I really like the concept that you're describing and would be excited to discuss it further.

**Meta properties** ✅

I think we're all aligned here - information around presentation/formatting, units, 📈 / 📉, and more can live in arbitrary configs of the `meta` property.

**Metrics defined by other metrics**

I love the idea of defining metrics in terms of other metrics. This would be important and helpful for a metric like "LTV to CAC", or more broadly: ratios & deltas. I don't have a good answer for how this could work in the near term, but I think it's a pretty essential piece of the metrics puzzle. I am less swayed by @acossta's suggestion of "derived" metrics like […]

Just exceptionally with you on this part though :)

**Metrics vs macros** ✅ ❓

Seeing/hearing some questions/confusion around this line:

I think @miike calls this out well above -- I do not think that the mode of a categorical variable, or a listagg, or similar, are metrics per se. I think they're valuable, but I don't believe they should be represented in this specification, at least not initially. I'd definitely be interested in finding ways to support these kinds of aggregations in the future, but I think that numeric-based aggregations are going to help keep things simple for our first cut of the implementation.

**Filters** ✅

Ok, I'm with you all - filters should have an […]. On that note, I am excited to see us all develop best practices for what goes into a model definition vs a metric definition. @jthandy - can I sign you up for v2 of this blog post 😂

**Wrap up**

I think we're in good shape on all of the topics raised in this thread. If anything feels unresolved to anyone, please feel very encouraged to follow up below! For my part... next steps here involve putting up a PR :)
@drewbanin - it could be defined as a model, and I agree with you on the thick-model, thin-metrics mantra. There might be other solutions, but the underlying goal is to be able to show the DAG of metrics (just metrics) to a broader audience (business users) who don't necessarily operate at the model level but do at the metric level and care about the metric lineage.
This thread is awesome and sparks joy ✨ To echo @georgewfraser's point above
Ever since I started seeing the infamous "Metrics Layer" pop up in the community, I have been wondering how the Fivetran dbt packages (and dbt packages in general) can support metric definitions, and this feature can make that a reality! I have already been thinking about other specific package use cases for metrics as well, and just created a Feature Request detailing some of my thoughts around how metrics can become more versatile within dbt packages. TLDR on the FR: I am wondering how metric definitions could be added or overridden in dbt packages by users. Super excited to stay up to date on this thread and feature request!
From @mistercrunch:
I agree with this approach because most downstream tools don't talk in terms of metrics; they connect to databases, so the user flow is usually […]. Our approach is to let users define cubes as dimension / metric pairs as part of the model definitions and hide them from the downstream tools. We automatically create dbt models from each cube and make use of them by parsing the queries and rewriting them to point to the roll-up tables under the hood. The dbt resources (models/seeds/sources) are exposed to BI tools as the semantic layer.
Thanks @mistercrunch for bringing such clarity! Some questions:
Absolutely agree here, and I took from @drewbanin's first comment that this was the intention of the original design. Although the metrics are outside of […]
I would challenge the idea of having virtual columns calculated in dbt. This would be different for calculated metrics.
From @owlas:
I believe that there are some good use cases for virtual columns / dimensions because of their dynamic nature.
I'm not convinced, as you would then have to systematically duplicate models for the sake of these "virtual columns" (which are pretty common imho), potentially doubling the size of your DAGs and creating more indirection. Maybe dbt compiles to that form physically in the database (with an extra view on top), but hopefully we don't have to create a new file and carry the list of fields forward (or whatever DRY version of that) for that sake.
On the topic of Joins: A first implementation of join logic could be made easier in the following way:
As in: you should NOT want to implement logic to resolve chasm / fan traps! (https://biguru.wordpress.com/2008/05/01/its-all-in-the-universe-handling-chasm-and-fan-traps/)
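For anyone unfamiliar with the fan trap being warned about: joining a one-to-many relationship before aggregating inflates any measure stored on the "one" side. A pure-Python stand-in for the SQL behavior, with invented order/line-item data:

```python
orders = [{"order_id": 1, "order_total": 100}]                  # one order...
items = [{"order_id": 1}, {"order_id": 1}, {"order_id": 1}]     # ...three line items

# Naive join then sum: each line-item row repeats the order_total.
joined = [{**o, **i} for o in orders for i in items
          if o["order_id"] == i["order_id"]]
naive_total = sum(row["order_total"] for row in joined)
print(naive_total)    # 300 -- triple counted

# Correct: aggregate each fact at its own grain before (or instead of) joining.
correct_total = sum(o["order_total"] for o in orders)
print(correct_total)  # 100
```

This is why restricting early join support (or requiring pre-joined wide tables) sidesteps a whole class of silently wrong metrics.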
Nick here from transform.co. I want to share some learnings that we're building on at Transform from working on metrics tools at Airbnb. We would be eager consumers of these definitions and support the development of a more universal spec. A snippet of our current spec can be seen here. We'll do a better job of pushing out the full spec and the opinions we're building on soon. At a high level, I'm excited about this direction for dbt. Centralizing this information and creating strong ties with the data transformation pipelines is an exciting direction. The opportunity to connect metrics to their data lineage through the entire data modeling process and then downstream is compelling and will benefit end-users. This also presents an exciting opportunity to solve one of the more significant challenges with metrics frameworks: building out a dependency graph. Some good ideas in this thread will push this in the right direction, but some gaps are fairly similar to the gaps that Airbnb ran into with early iterations. I'm skeptical of the idea that starting simply and then evolving this spec will work. Metric specs, in particular, are complicated because of the number of end applications that consume from them and the flexibility required to serve them all. There have been multiple backward-incompatible evolutions of metrics frameworks across Airbnb and other companies, and the ones I participated in were all […].

**Philosophical Level**

Before diving in too deep, I'm sensing a philosophical gap between some of the writing in this thread and how larger companies who have built this internally have used this type of technology. Specifically, Airbnb's framework (and the older version) evolved to be thought of as a "denormalization engine", and I get the sense that there are advocates here for a different direction (no virtualization, a push to rely on manual modeling for complexities, dimensions and joins as a second step in spec design, etc.). I fear that the impact will be limited if end-users are expected to define these metrics on top of denormalized tables. That's not to say that these ideas are right or wrong, but they do deviate from a path that has been tested. If context on the evolution of Airbnb's spec is helpful, I'm happy to share more. Here is a good explanation of why you should pass flexibility to frameworks like this. The virtualization of logic is seen as a plus for some of the reasons @mistercrunch mentioned above. If the abstractions are chosen thoughtfully, you can end up with DRYer code and a more flexible framework that can handle more of the boilerplate logic associated with metric construction.

**Joins / Identifiers**

Joins are by far the most complicated component of this concept. The core difference between the way we have thought about this vs. others in this space and the posts I've seen here is that we build the joins for the end-user, rather than the framework built on top of this spec needing to be robust enough to handle fan-out and chasm joins. I don't think symmetric aggregates in SQL construction are necessary. We get around this by using "identifiers", which are the join keys and their types (primary, foreign, unique), as joinable entities rather than defining join paths one by one. The framework determines the join that the requested data set requires. This is a much more flexible and DRY abstraction that enables the framework to manage the complexity around joins. This pushes the optimization to the framework, with all of the above benefits around declarative languages with flexible planning engines. It's probably not worth digging in too deep here since this appears to be a future discussion point, but this is one of the more significant differences in how we've approached building Transform relative to frameworks like LookML or most other semantic layers.

**Dimensions**

First, since this is a loaded and ambiguous term, I define a dimension as an attribute of an entity/identifier (i.e., a user's country, a product's type, etc.), or that entity itself. I have pretty strong opinions on this one: a metrics spec should treat a dimension as a first-class object. Treating a dimension as an attribute of the metric will become quite repetitive as companies adopt this at any form of scale. This goes back to the philosophical idea above, but if you are required to pull dimensions into every metric, business consumers will become frustrated, and the analytics engineer will become a blocker to the outcomes this layer aims to enable.

**Hierarchies**

Folks have shared some interesting ideas here. This is an important concept and shouldn't be ignored. The expression of time granularities is a dimension hierarchy. There are many more types of dimension hierarchies, and these should be expressible by the user. Every business has unique dimension hierarchies (e.g., Sales Region → Territory → Market), and this needs to be an abstraction of its own that looks similar to a nestable case statement to get really DRY. One way to think about a dimensional hierarchy is that it is a discretely aggregatable object, whereas a measure is a continuously aggregatable object.

**Time granularity**

Except for a point-in-time metric (e.g., 2021 revenue, or 2021 revenue by country), metrics only make sense in the context of time. We've had a lot of debate about representing time and whether time itself should be a component of the metric. Some metrics require time in their definition (i.e., a cumulative metric), and some may make sense in the context of a single timestamp (transactions by last updated vs. transactions by created at).

**Measures and Metrics**

The differentiation you identified in your first post between measures and metrics is close to how we defined these two objects, although we use both. Put simply, metrics are functions of measures. The measure & metric abstraction allows us to maintain a flatter dependency structure and increase the reuse of objects. We define measures as aggregatable expressions (columns or SQL statements that fit in a select) plus an aggregation type. What is defined in the original spec as a metric would be a measure and a […]. Reusing metrics or deriving metrics from other metrics is another idea that seems appealing, but it can become quite a mess, both technically and organizationally. There are a few reasons, but the most challenging part is generating nested metrics and joining in dimensions at different aggregation levels.

**Summary**

The goal of a spec, in my mind, should be to create a sufficient semantic understanding of the warehouse that tools can build the logic necessary to serve metrics to various applications. There are many applications for metrics, and that is a tall order, and ultimately the challenge of defining a […]. I would love to help with this, but also don't want to grandstand on these beliefs if there is a desire to go in a different direction. That was very high level and possibly too short to fully articulate these points. There was so much pain and frustration in learning how to make these configs organized and simple, and I would love to help shortcut that.
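The "metrics are functions of measures" framing above can be sketched with two small data structures: a measure is an aggregatable expression plus an aggregation type, and a metric composes one or more measures. Class and field names here are illustrative, not Transform's actual spec.

```python
from dataclasses import dataclass

@dataclass
class Measure:
    name: str
    expr: str   # column or SQL expression valid inside a SELECT
    agg: str    # sum, count, count_distinct, ...

@dataclass
class Metric:
    name: str
    measures: list       # the measures this metric is a function of
    definition: str      # how the measures combine, e.g. a ratio

bookings = Measure("bookings_value", "booking_amount", "sum")
guests = Measure("guest_count", "guest_id", "count_distinct")

# A ratio metric defined as a function of two reusable measures.
value_per_guest = Metric(
    "value_per_guest",
    [bookings, guests],
    "bookings_value / guest_count",
)
print(value_per_guest.definition)
```

Keeping measures as the reusable layer means many metrics (the ratio above, a cumulative sum, a period-over-period delta) can share the same underlying aggregations without redefining them.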
We've got a first stab at teaching dbt about metrics (#4235), which we're going to be including in v1.0.0-rc1. This is an intentionally limited schema, just enough for this to be an experimental feature; we're reserving the right to add more over time. Major-version one isn't usually the time for brand-new experimental features. We want to keep innovating, even as we lock in the foundation, and it's because of all our other efforts -- toward stability, performance, and maturity -- that we feel able to take this swing. Needless to say, this conversation is far from over. I'm so appreciative of all your time and thoughtfulness in the threads above, and I'm excited to see our thinking on this topic continue to develop over the next weeks and months. I know there are already some exciting follow-on issues, to which I owe responses :)
* first cut at supporting metrics definitions * teach dbt about metrics * wip * support partial parsing for metrics * working on tests * Fix some tests * Add partial parsing metrics test * Fix some more tests * Update CHANGELOG.md * Fix partial parsing yaml file to correct model syntax Co-authored-by: Drew Banin <[email protected]>
Concerning the implementation of "cubing/metrics", however they end up being specified in dbt, it could be interesting to see how the Unified Star Schema paradigm, based on SQL UNIONing, could help. |
One note on the naming of this functionality, @jtcohen6: I would propose calling this feature the (BI) semantic layer or information semantic layer instead of the metrics layer. |
Just putting this out there: we cannot avoid joins forever. To weigh in a little more concretely, I strongly favor the semantics defined by @owlas, in which dimensions/measures are defined on the model-column and relationships are defined in exposures, albeit ideally as a first-class construct. While the INTP in me would like to see an explicit syntax for asserting foreign keys in model-column properties so that lattices can be generated, the reality of most businesses is this:
and so that pretty strongly suggests that using exposures to explicitly define intentional explores is a pragmatic solution. |
Integrations with dbt Metrics incoming! I think @owlas was the first to mention Cube in this thread (thanks!), and I'm happy to share that Cube is now able to read metrics definitions from dbt and provide caching, access control, and a variety of APIs for downstream data consumers like BI tools, notebooks, custom apps, embedded analytics, etc. Here are some details on this integration (https://cube.dev/blog/dbt-metrics-meet-cube/), and there's also a topic on dbt's forum where we can discuss this further (https://discourse.getdbt.com/t/combining-dbt-metrics-with-api-caching-and-access-control/4033). |
Very happy with the integration released by Cube JS!! To get back to the core of the dbt metrics feature: take just this simple example (where I appreciate the way Cube JS implemented support for joins), when you want to use dbt metrics as the source of truth. metrics:
Dimensions:
You would want to calculate your metrics and break them down by the specified dimensions that are not part of the current model. dbt Metrics should not force users to denormalize the star schema into one big table in order to use the feature, which is unfortunately the case today. I think it would be very helpful to look at the Cube JS implementation and take the lessons learned there to provide the same support for joins. |
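A sketch of what this comment seems to be asking for: a metric on a fact model that exposes dimensions living on a joined dimension model, Cube-style. This syntax is purely hypothetical (not implemented in dbt), and all model and column names are illustrative:

```yaml
# Hypothetical syntax sketch (not part of dbt's actual metrics schema):
# a metric whose dimensions come from a joined model, rather than
# requiring denormalization into one big table.
metrics:
  - name: total_revenue
    model: ref('fct_orders')
    type: sum
    sql: amount
    timestamp: ordered_at
    time_grains: [day, week, month]
    dimensions:
      - customer_segment   # lives on dim_customers, not fct_orders
    joins:
      - model: ref('dim_customers')
        sql_on: fct_orders.customer_id = dim_customers.customer_id
```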
Is there an existing feature request for this?
Describe the Feature
dbt should know about metrics. A metric is a timeseries aggregation over a table that supports zero or more dimensions. Some examples of metrics include:
dbt should support metric definitions as a new node type. Like exposures, metrics participate in the dbt DAG and can be expressed in yaml files. By defining metrics in dbt projects, analytics engineers can encode crucial business logic in tested, version controlled code. Further, these metrics definitions can be exposed to downstream tooling to drive consistency and precision in metric reporting.
The ecosystem
There is some prior art for defining metrics in dbt projects. In particular, see
While these two implementations differ (measures vs. metrics; more on that below), there exists a need in the community for a first-class way to define these metrics in dbt code. It is really neat to see that some folks have already made these definitions possible with dbt, but it would be better if metrics were treated as well-defined nodes with field validation and helpful utilities inside of dbt Core.
Specification
A metric is a timeseries aggregation over a table that supports zero or more dimensions. These metrics can be encoded in schema.yml files. In the example below, a `new_customers` metric is defined as a count of customer records created in a given time grain. Given this information, a downstream process (or a dbt macro!) can generate a SQL SELECT statement that correctly calculates this metric with a specified time grain and set of dimensions. Here is a breakdown of supported fields:
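The example definition referenced here did not survive extraction; the following is a hedged reconstruction based on the fields discussed in the rest of this issue (model, type, sql, timestamp, time_grains, dimensions, filters), with illustrative model and column names:

```yaml
# Hedged reconstruction of the elided example: a count of customers
# created, queryable by day/week/month and sliced by plan and country.
version: 2

metrics:
  - name: new_customers
    label: New Customers
    model: ref('dim_customers')
    description: "The number of customers created"
    type: count
    sql: customer_id
    timestamp: created_at
    time_grains: [day, week, month]
    dimensions:
      - plan
      - country
    filters:
      - field: is_deleted
        operator: 'is'
        value: 'false'
```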
Model reference
A reference to a dbt model. This model may be any "materialized" model, or a reference to an ephemeral model. Direct table references are not allowed, and alternate node types (seeds, snapshots) are not supported.
Metric types
The following metric types should be supported:
In the future, alternative metric types (ratios, deltas, etc) should be supported in this model.
Filters
Filters should be defined as a list of dictionaries that define predicates for the metric. Filters are ANDed together. If more complex filtering is required, users can (and should) push that logic down into the underlying model.
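As a sketch of the "list of dictionaries, ANDed together" behavior (the field/operator/value key names are an assumption about the dictionary shape):

```yaml
# Two filter dicts; a downstream compiler would AND them together,
# producing something like:
#   WHERE is_deleted = false AND plan != 'free'
filters:
  - field: is_deleted
    operator: '='
    value: 'false'
  - field: plan
    operator: '!='
    value: "'free'"
```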
Functional requirements
Metrics should be:

- accessible in the dbt compilation context via a `graph.metrics` variable
- addressable via something like `metrics.find_by_name('...')` (exact mechanism TBD)
- included in the `manifest.json` artifact
- selectable with a `metric:` selector method; graph selection operators (`+`, `&`, etc) should be supported

dbt Core should not, itself, evaluate or calculate metrics. Instead, dbt Core should expose the definition of metrics to downstream tools or packages for evaluation and analysis. In this way, it is critical that dbt Core provides hooks into metrics that can be leveraged both in macro code and by processes that consume dbt Core `manifest.json` files.

Describe alternatives you've considered
Don't implement metrics as distinct node types and keep encoding them in `meta` properties:

Metrics vs. Measures
Metrics are strongly-typed objects. It is extremely common to see folks perform syntactically correct but semantically meaningless calculations over data. This looks like averaging an average, or adding two distinct counts together. You get a number back... but it's not a useful or meaningful result.
To that end, I think we should start with metrics instead of measures. The difference here (and maybe a strawperson of my own creation - tell me if you think so) is that measures are untyped aggregations, whereas metrics are rigorously defined summaries over well-defined datasets. The creation of metrics does not preclude us from teaching dbt about more generic types of aggregations in the future, but I'd prefer to start with a narrow set of functionality and expand over time. It is easy to remove constraints, but it is hard to add them 🙂
Include support for joins

Joins (perhaps leveraging relationships declared via the `relationships` test) could be supported in the future, but this would be a meaningful expansion of scope for our first cut of this feature.

Where metrics are defined
Should metrics be a property of a model? While that could be functional today, I think this would make it hard to extend metrics to work with joins (see above). Instead, declaring metrics as independent nodes that participate in the DAG is a more future-proof idea, and we'd probably do well to avoid the "patching" flow required to get schema tests (properties of models today) translated into their own independent nodes in the DAG.
Inheriting configuration from models
Should metrics be namespaced under a model? This would make it possible to define some "shared" properties for all of the metrics derived from a model (eg. valid dimensions, the time field, supported time grain). This would be good for ergonomics, but not a huge value-add IMO. I'd like to keep this simple for the initial implementation and then make decisions like this with some more information from the community around example use-cases.
Example:
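The example yaml was lost in extraction; a hedged sketch of what model-namespaced metrics with shared, inherited properties might look like (this syntax is hypothetical, as the issue explicitly defers this design):

```yaml
# Hypothetical syntax: metrics nested under a model, inheriting shared
# timestamp, time grain, and dimension configuration from the model level.
models:
  - name: dim_customers
    metrics:
      timestamp: created_at
      time_grains: [day, week, month]
      dimensions: [plan, country]
      definitions:
        - name: new_customers
          type: count
          sql: customer_id
        - name: average_ltv
          type: average
          sql: lifetime_value
```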
SQL calculations in dimensions
Should dimensions be allowed to provide arbitrary SQL expressions? I don't think so — that SQL is best encoded in model code, and it would be confusing and dissonant to break up dimension definitions across SQL and yaml files.
Example:
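The example did not survive extraction; the rejected alternative presumably looked something like the sketch below, with an arbitrary SQL expression inline in the dimension definition (names are illustrative):

```yaml
# Hypothetical (rejected) syntax: a SQL expression inside a dimension
# definition. The issue argues this logic belongs in the model's SQL,
# not split across SQL and yaml files.
metrics:
  - name: new_customers
    model: ref('dim_customers')
    type: count
    sql: customer_id
    timestamp: created_at
    time_grains: [day, week, month]
    dimensions:
      - name: signup_channel
        sql: "coalesce(utm_source, 'organic')"
```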
Who will this benefit?
Analytics engineers
- As with models, AEs will be able to define metric logic under version control. By colocating model and metric code, new metrics or changes to existing metrics can be made in a tested, versioned, documented, code reviewed environment
- Further, dbt Core's built in lineage can surface information about how changes to an upstream model may impact a downstream metric
BI/Analytics tooling (and therein, data consumers)
- Organizations use metrics to understand performance and make decisions. To that end, the correctness and precision of these metrics is really paramount! By defining metrics rigorously under version control, and then exposing their definitions globally, dbt Core can help ensure consistency in reporting.
The data ecosystem
- There are so many tools, both existing and yet to be created, that can benefit from an open source mechanism for defining a semantic model on top of the data warehouse. I believe that this business logic is just too valuable and strategically important for end-users to be locked up in proprietary tooling. To that end, this feature, and future types of semantic logic like this, should be addressable in an open source way
Are you interested in contributing this feature?
I sure am :)
Anything else?
From around the internet: