Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose metrics about the connected node network on the /metrics endpoint #381

Open
jsdw opened this issue Aug 25, 2021 · 9 comments
Open
Labels
enhancement New feature or request

Comments

@jsdw
Copy link
Collaborator

jsdw commented Aug 25, 2021

Currently, the /metrics endpoint exposes prometheus-compatible metrics mainly focused on evaluating the runtime performance of the telemetry deployment.

@chevdor mentioned that it would also be a great place to expose network related metrics, such as which versions of substrate are currently in use (and by how many nodes). I could certainly see this being quite handy to track!

Other thoughts on specific network related metrics that would be valuable to expose are welcome!

@jsdw jsdw added the enhancement New feature or request label Aug 25, 2021
@lovelaced
Copy link
Collaborator

@jsdw this could potentially have very high/arbitrary cardinality and, if the network name was a label, it could be used to DDoS any prometheus server a publicly exposed instance is connected to. Builds by people who are compiling their own commits and/or building their own networks will add lots of label complexity unfortunately. This would indeed be very interesting information but it would best be exposed as some sort of JSON so it could be scraped into a database or something rather than use prometheus for this. @gabreal any thoughts?

@dvdplm
Copy link
Contributor

dvdplm commented Aug 30, 2021

Maybe we can control the cardinality in the code to, say 500 chains max. @lovelaced Is that an acceptable number?

@lovelaced
Copy link
Collaborator

From https://www.robustperception.io/cardinality-is-key (a highly reliable source on prometheus best practices):

As a general rule of thumb I'd avoid having any metric whose cardinality on a /metrics could potentially go over 10 due to growth in label values. The way I think of it is that if it has already grown to be 10 today, it might be 15 in a year's time. It can be additionally okay to have no more than a literal handful within a given Prometheus that go to 100.

my worry is that the multiplier will be huge - each chain having who knows how many versions of their own binaries would be a cardinality explosion.

@dvdplm
Copy link
Contributor

dvdplm commented Aug 30, 2021

Ok, that's good info. FWIW we distinguish chains based on genesis hash, so a mere label change is not enough for a DDoS, they'd have to spin up actual new chains.

Perhaps we could have an allowlist then, that includes the chains we (Parity) cares about the most?

@lovelaced
Copy link
Collaborator

Yeah, that sounds better to me. Keep in mind cardinality is also multiplicative (see previous link), so every unique k=v combination is an increase of 1 - so it would still be DDoSable if someone spun up a ton of nodes with custom binary identifiers on Polkadot, for instance.

@dvdplm
Copy link
Contributor

dvdplm commented Aug 30, 2021

dq: Does Prometheus have any measures in place to stop ingesting data when it's too much, dunno, put a server on "chill" and simply stop collecting metrics from it when it's too much (or decrease collection frequency)?

@lovelaced
Copy link
Collaborator

If there is, I'm not aware of it- and that's a pretty hacky/bad way to deal with it anyway. We should be actively keeping tabs on the amount of series our prometheus server ingests (and we do). This number should be constant accounting for the number of servers (as in, there should be no labels which have arbitrary growth, really).

@gabreal
Copy link
Contributor

gabreal commented Aug 30, 2021

for the metrics it is important to consider the dimensions that endpoint will provide. while exporting the version as a label in polkadot (label is only on one metric polkadot_build_info) is alright for a limited amount of servers we control, software like the matrix server expose unbounded metrics which can easily go beyond 2Mb. it would be alright imo to limit the label set by the software like with the above mentioned allowlist of networks and the build info reduced to tagged releases. it is interesting information. w3f made an exporter for substrate telemetry data.

@dvdplm
Copy link
Contributor

dvdplm commented Aug 31, 2021

w3f made an exporter for substrate telemetry data.

Interesting. It's essentially a rewrite of our substrate-analytics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants