Expose metrics about the connected node network on the /metrics endpoint #381
Comments
@jsdw this could potentially have very high/arbitrary cardinality and, if the network name were a label, it could be used to DDoS any prometheus server a publicly exposed instance is connected to. Builds by people compiling their own commits and/or running their own networks will add a lot of label complexity, unfortunately. This would indeed be very interesting information, but it would be better exposed as some sort of JSON so it could be scraped into a database, rather than using prometheus for this. @gabreal any thoughts?
Maybe we can cap the cardinality in the code to, say, 500 chains max. @lovelaced Is that an acceptable number?
See https://www.robustperception.io/cardinality-is-key (a highly reliable source on prometheus best practices). My worry is that the multiplier will be huge: each chain having who knows how many versions of its own binaries would be a cardinality explosion.
Ok, that's good info. FWIW we distinguish chains based on genesis hash, so a mere label change is not enough for a DDoS; they'd have to spin up actual new chains. Perhaps we could have an allowlist then, containing the chains we (Parity) care about the most?
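A minimal sketch of what such an allowlist could look like in the Rust backend, using the `prometheus` crate; the metric name, the `allowlisted_chains` helper, and the shortened genesis-hash placeholders are all made up for illustration and are not taken from the telemetry code:

```rust
use prometheus::{IntGaugeVec, Opts, Registry};
use std::collections::HashMap;

/// Map the genesis hashes of chains we care about to a stable label value.
/// The hashes here are shortened placeholders, not the real ones.
fn allowlisted_chains() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("0x91b1...", "polkadot"),
        ("0xb0a8...", "kusama"),
    ])
}

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();
    let node_count = IntGaugeVec::new(
        Opts::new("telemetry_chain_node_count", "Connected nodes per allowlisted chain"),
        &["chain"],
    )?;
    registry.register(Box::new(node_count.clone()))?;

    let allowlist = allowlisted_chains();
    // Pretend these (genesis_hash, node_count) pairs came from telemetry state.
    for (genesis_hash, count) in [("0x91b1...", 812_i64), ("0xdeadbeef", 3)] {
        // Chains outside the allowlist collapse into a single "other" bucket,
        // so the number of `chain` label values is bounded by allowlist size + 1.
        let label = allowlist.get(genesis_hash).copied().unwrap_or("other");
        node_count.with_label_values(&[label]).add(count);
    }
    Ok(())
}
```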
Yeah, that sounds better to me. Keep in mind cardinality is also multiplicative (see previous link): every unique combination of label values is its own series, so it would still be DDoSable if someone spun up a ton of nodes with custom binary identifiers on Polkadot, for instance.
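To put purely illustrative numbers on that multiplication: a single metric labelled by chain and node version, with 500 allowed chains and 40 distinct version strings, already allows up to 500 × 40 = 20,000 series, and adding a free-form binary identifier label with 25 values pushes the ceiling to 500,000 series for one metric alone.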
dq: Does Prometheus have any measures in place to stop ingesting data when there's too much of it? Dunno, put a server on "chill" and simply stop collecting metrics from it (or decrease the collection frequency)?
If there is, I'm not aware of it, and that's a pretty hacky/bad way to deal with it anyway. We should be actively keeping tabs on the number of series our prometheus server ingests (and we do). This number should stay roughly constant, accounting for the number of servers (as in, there should be no labels with arbitrary growth, really).
For the metrics it is important to consider the dimensions that the endpoint will provide. While exporting the version as a label in polkadot (the label is only on one metric
Interesting. It's essentially a rewrite of our substrate-analytics.
Currently, the /metrics endpoint exposes prometheus-compatible metrics, mainly focused on evaluating the runtime performance of the telemetry deployment. @chevdor mentioned that it would also be a great place to expose network related metrics, such as which versions of substrate are currently in use (and by how many nodes). I could certainly see this being quite handy to track!
Other thoughts on specific network related metrics that would be valuable to expose are welcome!
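As a closing illustration, here is a minimal, hypothetical Rust sketch of the kind of metric the issue proposes, again using the `prometheus` crate; the metric name, the `normalise_version` helper, and the sample version strings are assumptions for illustration, not anything in the telemetry repo. Normalising the free-form version string (and/or combining it with an allowlist as discussed above) is one way to keep the label set bounded, per the cardinality concerns in the comments:

```rust
// Hypothetical sketch (not the telemetry codebase): count connected nodes per
// substrate version, normalising the version string so the `version` label
// cannot grow without bound.
use prometheus::{Encoder, IntGaugeVec, Opts, Registry, TextEncoder};

/// Keep only the "major.minor.patch" prefix, dropping commit hashes and other
/// build metadata that would otherwise explode label cardinality.
fn normalise_version(raw: &str) -> String {
    raw.split('-').next().unwrap_or("unknown").to_string()
}

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();
    let nodes_by_version = IntGaugeVec::new(
        Opts::new("telemetry_nodes_by_substrate_version", "Connected nodes per substrate version"),
        &["version"],
    )?;
    registry.register(Box::new(nodes_by_version.clone()))?;

    // Pretend these version strings were reported by connected nodes.
    for raw in ["0.9.29-94078b44fb6", "0.9.29-94078b44fb6", "0.9.28-0ee32dc"] {
        nodes_by_version
            .with_label_values(&[&normalise_version(raw)])
            .inc();
    }

    // Render the registry the same way a /metrics handler would.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8_lossy(&buf));
    Ok(())
}
```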