
Handle duplicated metrics gracefully #79

Open

pnyheim opened this issue Feb 28, 2019 · 30 comments

@pnyheim
Contributor

pnyheim commented Feb 28, 2019

Summary:

The exporter remembers metrics by Graphite name. It is possible to get into situations where multiple samples (graphite metrics) result in the same Prometheus metric.

At the moment, this makes the exporter catastrophically fail (return error 500 on /metrics). Instead, it should discard the older sample and use the new sample's value.

Original issue description:


We tried using v0.5.0 in our dev env today, upgrading from 0.4.2.
After the upgrade, metrics seem to be dropping out regularly, as can clearly be seen in this Grafana dashboard:
[Grafana dashboard screenshot]

So something is definitely wrong here.
There are no changes to the environment or config between 0.4.2 and 0.5.0, and we cannot see anything abnormal in the graphite_exporter logfiles.

This looks like a critical bug to us.

It should be noted that we have seen similar behavior before where we had UDP packet drops, but this was resolved by tuning the UDP receive buffer as explained here: https://nav.uninett.no/doc/4.3/howto/tuning_graphite.html

@matthiasr matthiasr added the bug label Feb 28, 2019
@matthiasr
Contributor

Damn!

When this happens, are there scrape errors?
What does the exporter's /metrics endpoint look like when it's working and when not?
Does it log anything?
What does it log with --log-level=debug?

@pnyheim
Contributor Author

pnyheim commented Feb 28, 2019

Damn!

Indeed.

When this happens, are there scrape errors?

How can I tell if there are scrape errors? I've only looked in the graphite_exporter log file, but did not see anything out of the ordinary. I'll try to look in the Prometheus log files tomorrow.

What does the exporter's /metrics endpoint look like when it's working and when not?

Looks normal AFAICT. Hard to tell if metrics are missing from /metrics, though. Will try to look a bit more into it tomorrow as well.

Does it log anything?

Nothing out of the ordinary

What does it log with --log-level=debug?

Did not try with the debug flag. Can do that tomorrow as well.

@matthiasr
Contributor

How can I tell if there are scrape errors?

On the Prometheus server, the up metric for this target at this time would go to 0. There is additional information on the Targets page of the Prometheus server – when there is an error it's shown next to the target there.

@raypettersen

[screenshot]

@raypettersen

Also noticed this:

2019/02/28 17:31:22 http: multiple response.WriteHeader calls
2019/02/28 17:31:35 http: multiple response.WriteHeader calls

I've never seen these messages before and I don't know what they mean.

@brian-brazil
Contributor

To be clear, those are coming from the exporter, not Prometheus?

@raypettersen

The log entries are from the exporter (if I understood the question correctly). They came after about 3 hours of runtime.

@brian-brazil
Contributor

That's probably a client_golang bug then, @beorn7

https://github.com/prometheus/client_golang/blob/2daed26f633ce23fab5efc02fb3f2e803b3526c3/prometheus/promhttp/http.go#L86

I can see this happening if you have a gzipped response, and then there's a problem encoding a metric family.
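For illustration, here is a minimal, standalone sketch (an assumption on my part, not the exporter's or client_golang's actual code) of how Go's net/http server ends up logging that message: once part of the body has been written, the status code is committed, so a later attempt to switch to an error status can only be logged.

package main

import (
	"fmt"
	"log"
	"net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// Part of the response (say, the first metric families) has already
	// been written, which implicitly commits a 200 status.
	fmt.Fprintln(w, "# HELP some_metric A metric that encoded fine.")

	// Encoding a later metric family fails; trying to report HTTP 500 now
	// is too late. net/http only logs a warning at this point -- older Go
	// versions say "http: multiple response.WriteHeader calls", newer ones
	// "http: superfluous response.WriteHeader call".
	http.Error(w, "An error has occurred while serving metrics", http.StatusInternalServerError)
}

func main() {
	http.HandleFunc("/metrics", handler)
	log.Fatal(http.ListenAndServe(":9108", nil))
}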

@raypettersen

If there is enough "evidence", then I suggest pulling the release for now.
Let me know if there is anything else I can do to provide more details.

@brian-brazil
Contributor

The question is what's causing the underlying error. Could you add some extra logging around that bit of code?

@pnyheim
Contributor Author

pnyheim commented Feb 28, 2019

I was able to poll the /metrics endpoint regularly for a while - and every once in a while it would return this:

An error has occurred while serving metrics:

2 error(s) occurred:
* collected metric "hikari_Wait" { label:<name:"attribute" value:"mean_rate" > label:<name:"datasource" value:"dsname" > label:<name:"foo" value:"bar" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "hikari_ConnectionCreation" { label:<name:"attribute" value:"p98" > label:<name:"datasource" value:"dsname2" > label:<name:"foo" value:"bar" > gauge:<value:25 > } was collected before with the same name and label values

I've seen different metric names and labels reported, and a different number of errors, from time to time.

Ideally, this should not be written to the /metrics response, but logged as an error instead.

@brian-brazil
Contributor

The error there is expected and correct from an exposition standpoint. The question is what's causing it inside the graphite exporter.

@pnyheim
Contributor Author

pnyheim commented Feb 28, 2019

Still, some error lines in the logfile when this happens would have been nice.

BTW, I enabled debug-level logging, but could not find anything useful in the log.

@matthiasr
Contributor

matthiasr commented Feb 28, 2019 via email

@matthiasr
Contributor

matthiasr commented Feb 28, 2019 via email

@fffonion
Contributor

fffonion commented Mar 4, 2019

Sorry, this looks like my carelessness when bumping client_golang. FYI, a similar error happened when running the e2e test, hence this fix 3d09334. Indeed this is a behaviour change in client_golang. I feel it could be addressed by not creating a new metric every time in

graphite_exporter/main.go

Lines 192 to 201 in 3d09334

for _, sample := range samples {
	if ageLimit.After(sample.Timestamp) {
		continue
	}
	ch <- prometheus.MustNewConstMetric(
		prometheus.NewDesc(sample.Name, sample.Help, []string{}, sample.Labels),
		sample.Type,
		sample.Value,
	)
}
Instead, we can cache the metric instance for the same originalName (like in statsd_exporter).

However, I do have a question (I'm not very familiar with the graphite protocol): if two metrics with the same name and labels arrive between two Prometheus scrapes, the current behaviour is that the exporter only exports the later value (since they are gauges). Is this expected?
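For reference, a rough sketch of that cached-metric idea (the graphiteSample fields and the cachingCollector type are illustrative assumptions, not the exporter's actual code; for the duplicates reported in this issue, the cache key would also need to reflect the mapped Prometheus name and labels, not just the raw graphite name):

package main

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// graphiteSample is an illustrative stand-in for the exporter's sample type.
type graphiteSample struct {
	OriginalName string
	Name         string
	Help         string
	Labels       prometheus.Labels
	Value        float64
}

// cachingCollector keeps one live Gauge per cache key and updates it in
// place, so Collect never emits two metrics with identical name and labels
// for the same key.
type cachingCollector struct {
	mu     sync.Mutex
	gauges map[string]prometheus.Gauge
}

func newCachingCollector() *cachingCollector {
	return &cachingCollector{gauges: map[string]prometheus.Gauge{}}
}

func (c *cachingCollector) process(s graphiteSample) {
	c.mu.Lock()
	defer c.mu.Unlock()
	g, ok := c.gauges[s.OriginalName] // the key choice is the crucial part
	if !ok {
		g = prometheus.NewGauge(prometheus.GaugeOpts{
			Name:        s.Name,
			Help:        s.Help,
			ConstLabels: s.Labels,
		})
		c.gauges[s.OriginalName] = g
	}
	// A newer sample for the same key simply overwrites the value.
	g.Set(s.Value)
}

func (c *cachingCollector) Describe(ch chan<- *prometheus.Desc) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, g := range c.gauges {
		ch <- g.Desc()
	}
}

func (c *cachingCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, g := range c.gauges {
		ch <- g
	}
}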

@matthiasr
Contributor

matthiasr commented Mar 4, 2019 via email

@beorn7
Member

beorn7 commented Mar 5, 2019

That's probably a client_golang bug then […] I can see this happening if you have a gzipped response, and then there's a problem encoding a metric family.

Sorry for the late follow-up. I think @brian-brazil is right here. If the encoding fails somewhere in the middle, we cannot send an HTTP error anymore. I'll think about a way to detect this and handle it better.
See prometheus/client_golang#542

@rtkgjacobs

rtkgjacobs commented Mar 8, 2019

Also ran into this; reverted back to 0.4.2 to unblock for now. Observed that the graphite-exporter returns HTTP/500s:
[screenshot]
The following errors were observed by directly curling the exporter:

HTTP/500 response, with the following payload returned:

An error has occurred while serving metrics:

15 error(s) occurred:
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:1 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:1 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:4.3613450171217345e+07 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:0 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "os_network_dispatch_total" { label:<name:"action" value:"dispatch" > label:<name:"host" value:"graphite-i-45993598_ops_infra_prod" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:4.3613450171217345e+07 > } was collected before with the same name and label values
* collected metric "os_network_send_total" { label:<name:"action" value:"send" > label:<name:"host" value:"graphite-i-45993598_ops_infra_prod" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:0.1 > } was collected before with the same name and label values

Let me know if you need any other logging or other details.

@rtkgjacobs

Further update from my own testing / findings to date:

  • Using 0.5.0 results in no data returned once these snags are hit; reverting to 0.4.2 allows partial data flow, as confirmed above.
  • It appears that in my case, with our graphite data feed, we have values outside the default 5-minute expiration/duplicate window. By raising that on my side to --graphite.sample-expiry=20m, I'm able to have all the data pass into Prometheus without any dropping for, say, 40 minutes. This has also been pinching me on 0.4.2 as we've been ramping up more prod data feeds.

I assume this is because we're avoiding the client exporter error was collected before with the same name and label values from prometheus.MustNewConstMetric() - meaning a new duplicate metric is being created, the Prometheus client library is angry about it, and it drops large sets of our data in 0.4.2, and all data regardless in 0.5.0.

By extending the window of time, I can avoid it with our data's behavior. Certainly, if logging debug output from the Prometheus library is possible, it would be handy in the future, as I have no real way of knowing what is tripping up the exporter's client library.

I think @fffonion is onto something when suggesting this above:

Indeed this is a behaviour change in client_golang. I feel it could be addressed by not creating a new metric every time in

graphite_exporter/main.go

Lines 192 to 201 in 3d09334

for _, sample := range samples {
	if ageLimit.After(sample.Timestamp) {
		continue
	}
	ch <- prometheus.MustNewConstMetric(
		prometheus.NewDesc(sample.Name, sample.Help, []string{}, sample.Labels),
		sample.Type,
		sample.Value,
	)
}

Instead, we can cache the metric instance for the same originalName (like in statsd_exporter).
However, I do have a question (I'm not very familiar with the graphite protocol): if two metrics with the same name and labels arrive between two Prometheus scrapes, the current behaviour is that the exporter only exports the later value (since they are gauges). Is this expected?

It seems I can get a working pipeline, but only for 2x the duration I set with graphite.sample-expiry. The metrics data feed we're slowly replacing has several field sets of data that are delivered out of order at times. We're fine with more recent data winning, but it seems the ordering is a pinch for us with the graphite-exporter. Anyone have suggestions? It seems the suggestion above would address this for us in part.

@matthiasr
Contributor

I don't think keeping persistent metric instances around is the best way. It fixes the issue incidentally because it forces us to dedup, but we can also dedup and then generate a constmetric.

At least as a direct fix for the issue at hand, what if we add straight-up deduplication to this loop? Basically, remember which (sample.Name,sample.Labels) combinations we have already seen and continue if we encounter one a second time?

That would also allow logging the duplicated samples, without relying on being able to instrument the client library.

As an alternative approach: just above that loop, we're copying samples out of the graphiteCollector. What if, instead of a slice, we copied them into a map, with (sample.Name, sample.Labels) as the key in some form? That way we would get the deduplication "for free". I'm not sure if either approach would guarantee "last metric wins", though.
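A minimal sketch of that straight-up deduplication in the collect loop (the types and the sampleKey/collectDeduped helpers are illustrative assumptions, not the exporter's actual code): keep only the newest sample per (name, labels) combination, then emit a single ConstMetric for each key.

package main

import (
	"fmt"
	"sort"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// graphiteSample is an illustrative stand-in for the exporter's sample type.
type graphiteSample struct {
	Name      string
	Help      string
	Labels    prometheus.Labels
	Type      prometheus.ValueType
	Value     float64
	Timestamp time.Time
}

// sampleKey builds a stable identity from the metric name plus sorted labels.
func sampleKey(s graphiteSample) string {
	names := make([]string, 0, len(s.Labels))
	for name := range s.Labels {
		names = append(names, name)
	}
	sort.Strings(names)
	key := s.Name
	for _, name := range names {
		key += fmt.Sprintf(";%s=%s", name, s.Labels[name])
	}
	return key
}

// collectDeduped keeps only the newest non-expired sample per key and emits
// one ConstMetric per key, so duplicates can no longer break the scrape.
func collectDeduped(samples []graphiteSample, ageLimit time.Time, ch chan<- prometheus.Metric) {
	newest := map[string]graphiteSample{}
	for _, sample := range samples {
		if ageLimit.After(sample.Timestamp) {
			continue
		}
		key := sampleKey(sample)
		if prev, ok := newest[key]; ok && prev.Timestamp.After(sample.Timestamp) {
			// An older duplicate: discard it (this would also be the place
			// to log the displaced sample at debug level).
			continue
		}
		newest[key] = sample
	}
	for _, sample := range newest {
		ch <- prometheus.MustNewConstMetric(
			prometheus.NewDesc(sample.Name, sample.Help, nil, sample.Labels),
			sample.Type,
			sample.Value,
		)
	}
}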

@rtkgjacobs

As you perform the deduplication loop, could you compare the age of each entry against what is already staged for the Prometheus client under the same name? Design-wise, the goal is to ensure that when more than one duplicate is found, the newer entry always 'bubbles up' and is the one exported; after all records are iterated over, only the newest entry per name wins. This could be behind a flag like --graphite.dedupe if it's experimental or only for specific use cases (it would also be nice to have DEBUG-level logging show when duplicates were found/displaced).

For us right now, our plan for production today is, sadly, to set a very long retention time like --graphite.sample-expiry=240m and then restart the service every 60 minutes or so to garbage-collect older metrics, since we have thousands of systems reporting legacy graphite data that auto-scale up and down. We can't immediately pivot to poll-based metrics scraping and are building out a migration story from legacy graphite push to Prometheus scraping over time. We will be moving more and more into modern platforms and approaches like Kubernetes and Prometheus, of course, and this project is part of that push.

Hope this helps. Happy to test anything out on my side, let me know thx.

@matthiasr
Contributor

I don't understand why a longer expiry helps?

@rtkgjacobs

In our case we're receiving data partly from field devices, which have sporadic connectivity. Think of it as an IoT-type use case, in part. So we're getting a mixture of 'new' and sometimes older graphite data, all mixed together. The majority of our data is 'fresh', but it appears to be the sporadic elements that pose the problem for us. We really only care about the 'latest' data, but this is a pinch point that contrasts how graphite vs. Prometheus handles this particular situation.

We find that once that mixed set of older/newer data hits, most of it is dropped by graphite-exporter (presumably the Prometheus client library being angry about duplicates?). I can't get any logging from the exporter layer to confirm or understand my pinch surface area, which seems to be a limitation of how the Prometheus client library is being used at present.

From looking at the source, and increasing the expiration time via graphite.sample-expiry, we are able to receive all our data and have it scraped by Prometheus, at the cost of not rolling off old / not-seen elements after x time, hence the 'worst case' garbage collection via a service restart. Yes, ugly I know :)

From a high-level strategy standpoint, we're at a point now where Graphite isn't capable of scaling, and we're migrating to Prometheus. This is a two-step approach: first replacing Graphite's role as-is by using graphite-exporter to get that data into Prometheus, and then the slower refactoring and implementation of data collection fixtures (push vs. poll etc.), with the legacy graphite approach perhaps entirely being displaced in time.

Is there anything else I can gather or do to help out?

@rtkgjacobs

rtkgjacobs commented Mar 12, 2019

So, just to close out: we did much more investigation on our end and found that several origin data points' graphite data had been 'duplicated' in the past for a prior team's convenience. Those two feeds were being sent into our single graphite-exporter, and when faced with those duplicate points (or with differing arrival delays), graphite-exporter was 'doing its job'. The issue above still stands: any duplicates cause an HTTP/503 in version 0.5.0, so we'll use 0.4.2 until it's fixed. Of course, not having logging output from the Prometheus client library made finding the root cause hard, so a +1 on exposing that in the logs if it's ever possible.

I think the concept of a --graphite.dedupe-old flag makes sense as a safety valve for the legacy use case where someone wants to firehose legacy data in and not spend the immense effort to ensure it's 100% clean; this has been our biggest pinch point getting this to prod. It's also forcing a better design pattern where we scrape from major sources rather than trying to use a single-point ingress (which is not very Prometheus-proper, I'll agree) and likely run a single graphite-exporter far too hot.

Thanks again for your consideration and efforts to date!

@matthiasr
Contributor

matthiasr commented Mar 12, 2019 via email

@matthiasr matthiasr changed the title from "Metrics dropping out when upgrading to 0.5.0" to "Handle duplicated metrics gracefully" Apr 8, 2019
@mkoppanen

I am seeing this issue with Apache Storm and the built-in graphite reporter (https://storm.apache.org/releases/current/javadocs/org/apache/storm/metrics2/reporters/GraphiteStormReporter.html). With quick debugging, it seems that parallelism in Apache Storm causes duplicate metrics to be sent to the graphite exporter (each has the same key):

* collected metric "storm_xyz_response_time_per_product_stddev" { label:<name:"component_name" value:"xyz_updater" > label:<name:"job" value:"storm-supervisor-001" > label:<name:"topology_name" value:"product-ingestion" > gauge:<value:0 > } was collected before with the same name and label values

I would be happy with either being able to ignore duplicates or to group them with an operator. At the moment the behaviour is quite strict and causes graphite-exporter to go into a failure state when duplicates are found. (Tested with the master and latest Docker tags.)

@matthiasr
Contributor

If anyone wants to pick this up, please do.

As for implementation, one way out (that I would prefer long-term) could be to extract the relevant parts of the statsd exporter, as we already do for the mapper itself.

Logically, the graphite exporter does pretty much the same as the statsd exporter, except it only accepts gauges and the line format is slightly different. They diverged a long time ago, and in the end we keep fixing the same issues independently in both. Strategically, I would like to factor out all the reusable parts (listener, mapper [done], collector, …) into separate packages within the statsd exporter repo, and then only plug them together with a different line parser here.

@matthiasr
Contributor

There are more thoughts on what I have in mind for the split here prometheus/statsd_exporter#234 (comment)

@matthiasr
Contributor

x-ref: #133 made the collector unchecked, so it should fail in fewer cases now. #134 tracks re-using more statsd exporter components so we don't have to solve all these issues over and over again.
