
(#4884) Add support for ratio metrics #5027

Merged (32 commits) on Jul 6, 2022
Conversation

drewbanin (Contributor) commented on Apr 11, 2022

resolves #4884

Description

This PR adds support for ratio metrics in dbt. A ratio metric is defined in terms of an expression, e.g.:

  - name: arpc
    label: Average Revenue per Customer
    type: expression
    sql: "{{ metric('revenue') }} / {{ metric('customers') }}"

This PR also adds a metric() function that (1) resolves a specified metric and (2) creates an edge in the DAG between a metric node and another node.

Deliverables:

  • Create a metric context method which returns a MetricReference object (similar to the RelationProxy)
    • This object should make it possible to access attributes of the node
    • This object should make it possible to find parent metrics for an expression metric
    • This object should make it possible to access the underlying model node for a concrete (i.e. non-expression) metric
  • The metric() function creates DAG edges between the referencing and referenced nodes
    • The referenced node is added to the depends_on.nodes list of the child node
    • The referenced metric is added to a (new) metrics list (akin to the refs list)
  • Support metric definitions with an expression type. These are expression metrics, defined as an expression over multiple other metrics
    • The expression metric type is mutually exclusive with the model config for metrics. A metric must be derived either from a model or from one or more other metrics
  • Ensure that the metric() function can be called from Snapshots, Models, Tests, Macros, and other Metrics

Example of metrics depending on other metrics in the DAG viz (note: this requires a PR to dbt-docs):

[Screenshot: DAG visualization showing metric-to-metric dependencies]

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have added information about my change to be included in the CHANGELOG.

cla-bot added the cla:yes label on Apr 11, 2022
github-actions commented:

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

drewbanin (Author) left a comment:

Dropped some notes inline in the PR. Excited to discuss!

# Skip if the metric is disabled!
return metric

# TODO: Figure this part out...
drewbanin (Author): Do we allow other yaml-only resources (like sources) to be disabled? If so, we should implement this part.

Reviewer (Contributor):
Sources can be disabled, exposures cannot (but we should support this). In the past, it's only been possible to disable a source via the sources: block in dbt_project.yml.

In v1.1, we'll be adding the ability to disable a source inline with its definition (#5008):

sources:
  - name: my_source
    config:
      enabled: False

self.package_name = package_name


class ResolvedMetricReference(MetricReference):
drewbanin (Author):
This was my take on a class analogous to the Relation object. I figured that this kind of object could be returned from the metric() function. It would help us avoid the need to find the metric in the graph manually like we do here
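For illustration, a minimal sketch of what such a reference class might look like. The node and manifest attributes and the __getattr__ passthrough are assumptions for this sketch, not the actual dbt-core implementation:

# Illustrative sketch only; attribute names beyond those quoted in this thread are assumptions.
class MetricReference:
    def __init__(self, metric_name, package_name=None):
        self.metric_name = metric_name
        self.package_name = package_name

    def __str__(self):
        return f"{self.metric_name}"


class ResolvedMetricReference(MetricReference):
    """Wraps a parsed metric node, much like a Relation wraps a database object."""

    def __init__(self, node, manifest=None):
        super().__init__(node.name, node.package_name)
        self.node = node          # the parsed metric node
        self.manifest = manifest  # lets callers walk to parent metrics or the underlying model

    def __getattr__(self, name):
        # Fall through to the underlying metric node (label, sql, type, ...)
        return getattr(self.node, name)

    def __str__(self):
        return self.node.name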

gshank (Contributor):
This looks reasonable to me.

Reviewer (Contributor): Let's remove .metric_value, based on the conversation here.

        return f'{namespace}.{self.metric_name}'

    def __str__(self):
        return self.node.name
drewbanin (Author): Should this return the metric name? Or the SQL expression? Or something else? Not 100% sure yet...

gshank (Contributor): When would the ResolvedMetricReference get stringified?

drewbanin (Author) commented on Apr 12, 2022:

I think this would happen when you interpolate the results of the metric() function, so:

select
  {{ metric('revenue') }}
from {{ metric('revenue').model }}

would evaluate to

select
  revenue
from analytics.fct_revenue

So - returning the name of the metric is kind of useless, but I am unsure if we want to return anything more sophisticated than that? I guess we could return self.node.sql instead (which feels reasonable), but this doesn't extend well to the ratio metric type which does not have a .sql value

So, uh, short answer: you probably wouldn't stringify it in practice?

Reviewer (Contributor): Maybe we shouldn't even have a __str__ method then? With the idea that it's better to fail noisily than fail silently.

    created_at: float = field(default_factory=lambda: time.time())

    def resolve_metric_references(self, context):
        # TODO: Obviously don't do this...
drewbanin (Author): This is where I implemented "patching" for the metric node. The big idea is that we have code like:

  - name: arpc
    label: Average Revenue per Customer

    type: ratio
    ratio_terms:
      numerator: metric('revenue')
      denominator: metric('customers')

When we parse the arpc metric, we don't necessarily know anything about revenue or customers yet. I think this is sort of akin to how exposures work, though I wasn't able to find the way that we "patch" exposures when I was looking through the code. Wondering if this is different because metrics can depend on other metrics, whereas exposures can only depend on models/sources/snapshots, so we can't bypass this problem with parsing order.

Feedback/guidance very welcomed here!

gshank (Contributor):
Yes, exposures don't refer to themselves, so exposures refs, sources, etc get resolved in the various parser.manifest 'process_XXX' methods. I wouldn't call what the metrics need a "patch" step, because when we did that we saved a bunch of patch dictionaries (with config, etc) until the end of parsing and applied them then. So far metrics just need a resolution phase, as you've implemented.
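To make that resolution phase concrete, here is a rough, hypothetical sketch. The attribute names (metrics, depends_on.nodes, unique_id) follow fields discussed in this PR; everything else is assumed and is not the actual dbt-core code:

# Hypothetical sketch of a post-parse resolution pass; not the actual dbt-core implementation.
def process_metrics(manifest):
    """Resolve metric-to-metric references captured at parse time into DAG edges,
    analogous to how exposure refs and sources are resolved after parsing."""
    metrics_by_name = {m.name: m for m in manifest.metrics.values()}
    for metric in manifest.metrics.values():
        # `metric.metrics` holds the metric names captured via metric() calls
        for ref in metric.metrics:
            parent_name = ref[-1]  # last element is the metric name
            parent = metrics_by_name.get(parent_name)
            if parent is None:
                raise RuntimeError(
                    f"Metric '{metric.name}' references unknown metric '{parent_name}'"
                )
            # Record the edge so the DAG knows this metric depends on its parent
            if parent.unique_id not in metric.depends_on.nodes:
                metric.depends_on.nodes.append(parent.unique_id)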

core/dbt/main.py (outdated)

@@ -142,6 +142,8 @@ def main(args=None):
        exit_code = e.code

    except BaseException as e:
        import traceback
drewbanin (Author): Just here for debugging; will remove before PRing this.

@@ -66,6 +66,9 @@ def should_render_keypath(self, keypath: Keypath) -> bool:
            return False
        elif self._is_norender_key(keypath[0:]):
            return False
        elif self.key == "metrics":
drewbanin (Author): I actually don't think this part is necessary... will remove. This came from an initial idea of implementing ratios via the sql config, e.g.:

sql: "{{metric('revenue') }} / {{ metric('customers') }}"

I decided against doing it that way (instead, the numerator and denominator are specified separately) in order to make metric queries that leverage these ratio metrics more tractable to implement.

@@ -1042,8 +1046,42 @@ def parse_metric(self, unparsed: UnparsedMetric) -> ParsedMetric:
            self.schema_parser.manifest,
            package_name,
        )
        model_ref = "{{ " + unparsed.model + " }}"
        get_rendered(model_ref, ctx, parsed, capture_macros=True)
        if unparsed.type == 'ratio':
drewbanin (Author): I don't love the idea of teaching the metric parser about metric subtypes. For now, the only special metric is the "ratio" type, though I suppose there could be other special types in the future. Happy to discuss if anyone feels strongly about us doing this part differently.

Reviewer (Contributor):
Heard. If we want the validation to happen in Python code (and it should), I think we'll need to teach some of it to dbt-core (or a dbt-core dependency). I don't mind ratio being a special type for now.

Is there a more general solution that looks like "super"/"sub" metric types? Like super/subclasses for Python objects. Or maybe we don't want to build the semantic layer as OOP :)
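Purely as a thought experiment on that super/sub-type idea, type-specific validation could hang off subclasses. These classes are hypothetical and not part of dbt-core:

# Thought-experiment sketch; these classes are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseMetric:
    name: str
    model: Optional[str] = None

    def validate(self):
        # Concrete metrics must be derived from a model
        if self.model is None:
            raise ValueError(f"Metric '{self.name}' requires a 'model' property")


@dataclass
class RatioMetric(BaseMetric):
    numerator: Optional[str] = None
    denominator: Optional[str] = None

    def validate(self):
        # Ratio metrics are defined over other metrics, never over a model
        if self.model is not None:
            raise ValueError(f"Ratio metric '{self.name}' cannot set 'model'")
        if not (self.numerator and self.denominator):
            raise ValueError(f"Ratio metric '{self.name}' needs a numerator and a denominator")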

gshank (Contributor):
It feels wrong to me to have this code be in the parser. I'd be more comfortable if we could move some of it into the UnparsedMetric or ParsedMetric or a subclass based on type... Having to render it does complicate things though. Is it actually necessary to render the numerator and denominator at parse time? Can they actually contain macros? Because if they can't contain macros, I would think the get_rendered is a no-op.

drewbanin (Author):
@gshank I think we do need to render the numerator and denominator at parse-time because we need to capture the metric() call in order to build the edges between the two metrics. In this way, metric() is kind of like ref(), but for metrics :)

I'd be OK with moving this logic into the ParsedMetric (or similar)... is it OK if we call get_rendered in there? I don't want to explode the number of places where we render Jinja code, but I definitely hear what you're saying.
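For illustration, the parse-time capture described here might mirror the unparsed.model handling shown in the diff above. A rough sketch only: the helper name capture_ratio_term_refs and the import path are assumptions, and the committed code differs.

# Rough sketch mirroring the existing `unparsed.model` handling; not the committed code.
from dbt.clients.jinja import get_rendered  # import path assumed, as used by the parser at the time


def capture_ratio_term_refs(unparsed, ctx, parsed):
    """Render the numerator and denominator so any metric() calls execute
    against the parse context and register edges on the parsed node."""
    for term in (unparsed.ratio_terms.numerator, unparsed.ratio_terms.denominator):
        term_ref = "{{ " + term + " }}"
        # capture_macros=True records the metric() call, just as rendering
        # unparsed.model records the model reference
        get_rendered(term_ref, ctx, parsed, capture_macros=True)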

Reviewer (Contributor):
If it's only the 'metrics' call that we need to capture, this feels like something that we might want to do with static parsing eventually. Wouldn't need to do that in the first pass though.

I think we're already calling get_rendered in ParsedMetric in "resolve_metric_references". And it does seem out of place there too. Hmm... I'm not exactly sure what the best way of factoring this is. Separation of concerns would be nice, but the concerns seem kind of inherently tangled.

drewbanin changed the title from "Add support for ratio metrics" to "(#4884) Add support for ratio metrics" on Apr 11, 2022
gshank (Contributor) left a comment:
This looks like a good solid start :)

We use pre-commit to run black, flake8, and mypy, and there are a number of mypy complaints about this. Sometimes, in order to make mypy happy, you have to move things around a bit, so you might want to pip install pre-commit and run 'pre-commit run --all-files'.

For partial parsing, will we need to re-parse anything with a changed metric reference? I think we'll have to handle metric references in a handful of places.


        self.metrics[metric.unique_id] = metric
        source_file.metrics.append(metric.unique_id)
        if not metric.config.enabled:
            self.add_disabled_nofile(metric)
Reviewer (Contributor):
Not sure if this piece is actually working for a metric with

metrics:
  - name: some_metric
    config:
      enabled: False

Reviewer (Contributor): I'll open a separate issue for supporting config.enabled on metrics (and exposures!). In the meantime, I think we should probably just remove the enabled-related code from this PR.

Comment on lines 9 to 11
def __str__(self):
    # TODO : Qualify the metric name as a CTE or field?
    return f"{self.metric_name}.metric_value"
Reviewer (Contributor):
Discussed with @callum-mcdata live: let's just make this return f"{self.metric_name}" instead, since otherwise we're just replacing this with "" in Jinja here

Comment on lines +467 to +469
# TODO: Expressions _cannot_ have `model` properties
if data.get("model") is None and data.get("type") != "expression":
    raise ValidationError("Non-expression metrics require a 'model' property")
Reviewer (Contributor):
@callum-mcdata To add validation for the opposite case (this TODO)
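For reference, the opposite-case check being requested could mirror the existing one. A sketch only, not the committed code:

# Sketch of the follow-up validation described above; not the committed code.
if data.get("model") is not None and data.get("type") == "expression":
    raise ValidationError("Expression metrics cannot have a 'model' property")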

resource_type: NodeType = NodeType.Metric
meta: Dict[str, Any] = field(default_factory=dict)
tags: List[str] = field(default_factory=list)
sources: List[List[str]] = field(default_factory=list)
depends_on: DependsOn = field(default_factory=DependsOn)
refs: List[List[str]] = field(default_factory=list)
metrics: List[List[str]] = field(default_factory=list)
config: SourceConfig = field(default_factory=SourceConfig)
Reviewer (Contributor):
I think we should remove this (and related code) for now, and split off support for disabling metrics into a separate unit of work

emmyoop requested a review from gshank on July 5, 2022
emmyoop marked this pull request as ready for review on July 5, 2022
emmyoop requested reviews from a team (as code owners) and from stu-k on July 5, 2022
leahwicz mentioned this pull request on Jul 5, 2022 (6 tasks)
leahwicz and others added 3 commits July 5, 2022 15:47
* Bumping manifest version to v6

* Adding manifest file for tests

* Reverting unneeded changes

* Updating v6

* Updating test to add metrics field

* Adding changelog
gshank (Contributor) left a comment:
This looks really good!

I think there might be a minor issue related to partial parsing. We now have metric references in models. If a model and metric have been parsed and the metric is deleted or renamed, I'm pretty sure we aren't scheduling the model for re-parsing, so the error would show up at execution time rather than parse time. I don't consider this a blocker -- we can fix it later.

* Update test_previous_version_state for v6. Cleanup

* Regenerate, rm breakpoint

* Code checks

* Add assertion that will fail when we bump manifest version

* update tests to automatically tests all previous versions

Co-authored-by: Emily Rockman <[email protected]>
emmyoop merged commit 064d890 into main on Jul 6, 2022
emmyoop deleted the feature/metric-improvements branch on July 6, 2022
agoblet pushed a commit to BigDataRepublic/dbt-core that referenced this pull request Sep 16, 2022
* wip

* More support for ratio metrics

* Formatting and linting

* Fix unit tests

* Support disabling metrics

* mypy

* address all TODOs

* make pypy happy

* wip

* checkpoint

* refactor, remove ratio_terms

* flake8 and unit tests

* remove debugger

* quickfix for filters

* Experiment with functional testing for 'expression' metrics

* reformatting slightly

* make file and mypy fix

* remove config from metrics - wip

* add metrics back to context

* adding test changes

* fixing test metrics

* revert name audit

* pre-commit fixes

* add changelog

* Bumping manifest version to v6 (dbt-labs#5430)

* Bumping manifest version to v6

* Adding manifest file for tests

* Reverting unneeded changes

* Updating v6

* Updating test to add metrics field

* Adding changelog

* add v5 to backwards compatibility

* Clean up test_previous_version_state, update for v6 (dbt-labs#5440)

* Update test_previous_version_state for v6. Cleanup

* Regenerate, rm breakpoint

* Code checks

* Add assertion that will fail when we bump manifest version

* update tests to automatically tests all previous versions

Co-authored-by: Emily Rockman <[email protected]>

Co-authored-by: Jeremy Cohen <[email protected]>
Co-authored-by: Callum McCann <[email protected]>
Co-authored-by: Emily Rockman <[email protected]>
Co-authored-by: leahwicz <[email protected]>
Successfully merging this pull request may close these issues:

  • [CT-376] add ability to calculate metrics based off of other metrics

6 participants