Add SpiceDB community integration #2538

Open · wants to merge 61 commits into master

Conversation

@tstirrat15 commented Nov 15, 2024

What does this PR do?

This adds an OpenMetrics-based agent check for SpiceDB, an open-source authorization database maintained by Authzed: https://authzed.com/spicedb

Motivation

Datadog doesn't easily support scraping OpenMetrics endpoints without the use of a third-party agent. This provides first-party support and will make life easier for Authzed customers.

Review checklist

  • PR has a meaningful title or PR has the no-changelog label attached
  • Feature or bugfix has tests
  • Git history is clean
  • If PR impacts documentation, docs team has been notified or an issue has been opened on the documentation repo
  • If this PR includes a log pipeline, please add a description describing the remappers and processors.

Additional Notes

I used the CockroachDB integration from integrations-core as my template. I think I ported the right parts of it over in the right way, but this is my first time developing an integration and would appreciate any feedback on obvious tripping points.
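
For context, the check follows roughly this shape — a simplified sketch based on the CockroachDB pattern from integrations-core; the metric names below are illustrative, not the actual SpiceDB mapping:

```python
# Simplified sketch of an OpenMetricsBaseCheckV2 subclass following the
# CockroachDB pattern; the metric map here is illustrative only.
from datadog_checks.base import OpenMetricsBaseCheckV2

METRIC_MAP = {
    # raw OpenMetrics name -> Datadog name (the __NAMESPACE__ prefix is added automatically)
    'spicedb_grpc_server_handled': 'grpc.server.handled',
    'spicedb_dispatch_duration_seconds': 'dispatch.duration',
}


class SpiceDBCheck(OpenMetricsBaseCheckV2):
    __NAMESPACE__ = 'spicedb'
    DEFAULT_METRIC_LIMIT = 0

    def get_default_config(self):
        # Merged into each instance's configuration unless the user overrides it
        return {'metrics': [METRIC_MAP]}
```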

@tstirrat15 (Author)

I'm looking for prior art around getting OpenMetrics-style metrics into Datadog using the agent. There don't seem to be good examples in integrations-extras - there are a few integrations that appear to treat _bucket metrics as gauge metrics, which doesn't seem correct.

There's https://github.com/DataDog/integrations-core/blob/master/cockroachdb/metadata.csv which indicates that the buckets are exposed as counts for Cockroachdb, but that's only going to be helpful if they're somehow implicitly converted to distribution metrics on the Datadog side. Is that how it works?

My coworker and I were struggling today with the OTel exporter - the custom metrics that were coming across for histograms were showing up as straight counts, even though the OTel Datadog exporter is supposed to treat histogram metrics as distribution metrics by default.

Is there something I'm missing here? I'm also happy to be pointed at another resource if there's documentation somewhere.

@urseberry (Contributor)

> I've updated the README and dashboard JSON. Can I get some help with the questions I've asked?

Thank you for the updates, @tstirrat15. I will review them. Unfortunately, I don't know the answers to your questions, as I am a technical writer and not an engineer. An engineer will review your PR also and answer your questions.

@tstirrat15 (Author)

Done

@tstirrat15 (Author)

@urseberry are the labels correct for further review?

@steveny91 (Contributor) commented Dec 12, 2024

> The validation is complaining that distribution isn't a valid metric type. The integration assets reference links to metric types which shows distribution is a metric type. What are the valid metric types and where do I find them?

@tstirrat15 For OpenMetrics, all metrics are collected as gauges or counts:

| Prometheus/OpenMetrics type | Datadog metric type |
| --- | --- |
| Gauge | Gauge |
| Counter | Count |
| Histogram _bucket, _count, _sum | Count |
| Summary _count, _sum | Count |
| Summary _quantile | Gauge |
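
For reference, this mapping reflects how a single OpenMetrics histogram is exposed as several plain series. A hypothetical latency histogram looks roughly like this in the exposition format (bucket values are cumulative counts):

```
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.1"} 240
request_duration_seconds_bucket{le="0.5"} 310
request_duration_seconds_bucket{le="+Inf"} 325
request_duration_seconds_sum 41.7
request_duration_seconds_count 325
```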

@steveny91 (Contributor)

> There's https://github.com/DataDog/integrations-core/blob/master/cockroachdb/metadata.csv which indicates that the buckets are exposed as counts for Cockroachdb, but that's only going to be helpful if they're somehow implicitly converted to distribution metrics on the Datadog side. Is that how it works?

I guess it would be nice to get a better grasp on the need for the distribution metrics? The aggregation is calculated in our backend, and I don't have the knowledge off the top of my head to explain how that's done (it's owned by another team), other than that on the agent integration side we do indeed have the ability to submit metrics to the distribution intake.

But the reason we collect them as counters is due to the way OpenMetrics exposes them: as counts of observations/events within a certain bucket. For example, a value of 10 for upper_bound:1 could mean that there were 10 requests under 1ms in response time. In my experience, most customers benefit from just doing _sum/_count to get, for example, the average latency across the whole histogram and tracking that over time.
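
(As a concrete illustration - the metric name here is hypothetical - that average over the whole histogram would be a graph query along the lines of:)

```
sum:spicedb.request.duration.sum{*} / sum:spicedb.request.duration.count{*}
```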

I do have to admit that we currently don't have the best visualization capability either for displaying a histogram in the form of a bell-shaped curve.

@steveny91 (Contributor)

> I also can't get the integration working locally so that I can develop a dashboard against it - I opened an issue at #2553 and haven't received a response.

I'll take a look at this today if not tomorrow.

@tstirrat15 (Author)

> I guess it would be nice to get a better grasp on the need for the distribution metrics?

Our desire is to be able to export a dashboard that works similarly to the one we developed internally in Grafana, where OpenMetrics histograms for latency metrics are displayed as p50/p95/p99 timeseries. From my understanding, that sort of display is really only possible with distributions in Datadog, and aggregating those into a single rate/max/sum isn't going to work for our customers.
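
(For example - the metric name is hypothetical - once a histogram is ingested as a distribution, my understanding is that the percentile timeseries we want would be queries along these lines:)

```
p50:spicedb.dispatch.duration{*}
p95:spicedb.dispatch.duration{*}
p99:spicedb.dispatch.duration{*}
```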

I'm able to do it using the openmetrics integration directly, as in this repo: https://github.com/authzed/examples/tree/main/observability/simple-datadog

In this example, is the agent submitting counts to the API and the API is doing the distribution aggregation, or is it that the agent itself is aggregating the histogram data and then submitting it as distribution data?

My sense is that the agent is doing something to disaggregate, transform, and reaggregate a histogram metric into a distribution metric. That's what I want in this integration; it works in the OpenMetrics integration, and this integration is a special case of an OpenMetrics integration.

Is there something I'm missing?

@steveny91 (Contributor)

@tstirrat15

The link you provided does, I think, indeed go through the agent:
https://github.com/authzed/examples/blob/main/observability/simple-datadog/conf.d/openmetrics.d/conf.yaml

Your understanding is correct: we do submit some form of a sketch to our backend for it to aggregate. But I'm not really sure what calculation is used or how it's aggregated, so I can't confidently go into detail on that aspect.

There's also a UI portion after metric ingestion that needs to be done according to this guide:
https://docs.datadoghq.com/metrics/distributions/

So all in all:

  • The histogram_buckets_as_distributions: true setting does indeed submit sketches to our backend for recalculation and aggregation (see the sketch of a config after this list)
  • Once histograms are ingested as distributions, you would indeed be able to calculate the different percentiles via the UI
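
A rough sketch of what that setting looks like in an instance config (the endpoint and file path below are assumptions, not necessarily SpiceDB's defaults):

```yaml
# conf.d/spicedb.d/conf.yaml -- sketch; the endpoint is an assumption
instances:
  - openmetrics_endpoint: http://localhost:9090/metrics
    histogram_buckets_as_distributions: true
```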

@tstirrat15 (Author)

@steveny91 the part I'm getting at is that in the linked configuration for an Openmetrics check, a metric type of histogram works just fine. The integration I'm working on extends that Openmetrics check. Why can't I submit histogram metrics through my integration?

Like suppose I submit all of these metrics piecewise as counters and gauges. How can a user then turn those into distributions? It's not going to be sufficient for them to take a counter and display it as a rate, because I've lost the distribution information, and that's not useful for latency metrics.

@steveny91 (Contributor)

> The part I'm getting at is that in the linked configuration for an Openmetrics check, a metric type of histogram works just fine. The integration I'm working on extends that Openmetrics check. Why can't I submit histogram metrics through my integration?

I apologize if I gave the impression that you can't. You most definitely can if you extend the OpenMetrics base class. All I wanted to point out was that since it uses the OpenMetrics check, it goes through the agent, which is what you're doing, and that is fine. After that you'd have to do some configuration in the UI at app.datadoghq.com to calculate the percentiles and different aggregations.

When you say "a metric type of histogram works just fine", can I conclude that the value and submission type are as you expect, and that you can derive the necessary information using that exact linked config?

I'm just trying to narrow down the problem here, if it's more along the lines of:

  • The metric values are not accurate after they are submitted as distributions

or

  • I'm having trouble submitting these as distributions in the current PR.

We had a few back-and-forths and things might have gotten muddled. I want to see exactly where we stand so I can better assist here. I can definitely help with the integration work and get things aligned and passing CI so this can be merged. But if it's more of a discussion about metric values and behavior, then I'd have to find someone who better understands, and can explain, the nuances of distributions and the calculations that happen in the backend.

Thanks and I do apologize for maybe derailing this a bit.

@tstirrat15 (Author) commented Dec 18, 2024

@steveny91 it's the latter - the validator complains that distributions and histograms aren't valid metric types in metadata.csv, and the unit tests don't pick up the histogram metrics even though the integration tests handle them just fine.

Is this an issue with the harness/validation or an issue with what I'm trying to do?

Labels: editorial review (Waiting on a more in-depth review from a docs team editor)