Expose metrics about the connected node network on the /metrics endpoint #381
Comments
@jsdw this could potentially have very high/arbitrary cardinality and, if the network name were a label, it could be used to DDoS any prometheus server a publicly exposed instance is connected to. Builds by people compiling their own commits and/or running their own networks will add a lot of label complexity, unfortunately. This would indeed be very interesting information, but it would be better exposed as some sort of JSON so it could be scraped into a database, rather than using prometheus for this. @gabreal any thoughts?
Maybe we can cap the cardinality in the code to, say, 500 chains max. @lovelaced Is that an acceptable number?
See https://www.robustperception.io/cardinality-is-key (a highly reliable source on prometheus best practices). My worry is that the multiplier will be huge: each chain having who knows how many versions of its own binaries would be a cardinality explosion.
Ok, that's good info. FWIW we distinguish chains based on genesis hash, so a mere label change is not enough for a DDoS; they'd have to spin up actual new chains. Perhaps we could have an allowlist then, containing the chains we (Parity) care about the most?
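A minimal sketch of what such an allowlist could look like in the Rust backend, using the `prometheus` crate; the metric name, the `allowlisted_chains` helper, and the shortened genesis-hash placeholders are all made up for illustration and are not taken from the telemetry code:

```rust
use prometheus::{IntGaugeVec, Opts, Registry};
use std::collections::HashMap;

/// Map the genesis hashes of chains we care about to a stable label value.
/// The hashes here are shortened placeholders, not the real ones.
fn allowlisted_chains() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("0x91b1...", "polkadot"),
        ("0xb0a8...", "kusama"),
    ])
}

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();
    let node_count = IntGaugeVec::new(
        Opts::new("telemetry_chain_node_count", "Connected nodes per allowlisted chain"),
        &["chain"],
    )?;
    registry.register(Box::new(node_count.clone()))?;

    let allowlist = allowlisted_chains();
    // Pretend these (genesis_hash, node_count) pairs came from telemetry state.
    for (genesis_hash, count) in [("0x91b1...", 812_i64), ("0xdeadbeef", 3)] {
        // Chains outside the allowlist collapse into a single "other" bucket,
        // so the number of `chain` label values is bounded by allowlist size + 1.
        let label = allowlist.get(genesis_hash).copied().unwrap_or("other");
        node_count.with_label_values(&[label]).add(count);
    }
    Ok(())
}
```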
Yeah, that sounds better to me. Keep in mind cardinality is also multiplicative (see previous link): every unique combination of label values is its own series, so it would still be DDoSable if someone spun up a ton of nodes with custom binary identifiers on Polkadot, for instance.
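To put purely illustrative numbers on that multiplication: a single metric labelled by chain and node version, with 500 allowed chains and 40 distinct version strings, already allows up to 500 × 40 = 20,000 series, and adding a free-form binary identifier label with 25 values pushes the ceiling to 500,000 series for one metric alone.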
dq: Does Prometheus have any measures in place to stop ingesting data when there's too much of it? Dunno, put a server on "chill" and simply stop collecting metrics from it (or decrease the collection frequency)?
If there is, I'm not aware of it, and that's a pretty hacky/bad way to deal with it anyway. We should be actively keeping tabs on the number of series our prometheus server ingests (and we do). This number should stay roughly constant, accounting for the number of servers (as in, there should be no labels with arbitrary growth, really).
For the metrics it is important to consider the dimensions that the endpoint will provide. While exporting the version as a label in polkadot (the label is only on one metric
Interesting. It's essentially a rewrite of our substrate-analytics.
Currently, the /metrics endpoint exposes prometheus-compatible metrics, mainly focused on evaluating the runtime performance of the telemetry deployment. @chevdor mentioned that it would also be a great place to expose network related metrics, such as which versions of substrate are currently in use (and by how many nodes). I could certainly see this being quite handy to track!
Other thoughts on specific network related metrics that would be valuable to expose are welcome!
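As a closing illustration, here is a minimal, hypothetical Rust sketch of the kind of metric the issue proposes, again using the `prometheus` crate; the metric name, the `normalise_version` helper, and the sample version strings are assumptions for illustration, not anything in the telemetry repo. Normalising the free-form version string (and/or combining it with an allowlist as discussed above) is one way to keep the label set bounded, per the cardinality concerns in the comments:

```rust
// Hypothetical sketch (not the telemetry codebase): count connected nodes per
// substrate version, normalising the version string so the `version` label
// cannot grow without bound.
use prometheus::{Encoder, IntGaugeVec, Opts, Registry, TextEncoder};

/// Keep only the "major.minor.patch" prefix, dropping commit hashes and other
/// build metadata that would otherwise explode label cardinality.
fn normalise_version(raw: &str) -> String {
    raw.split('-').next().unwrap_or("unknown").to_string()
}

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();
    let nodes_by_version = IntGaugeVec::new(
        Opts::new("telemetry_nodes_by_substrate_version", "Connected nodes per substrate version"),
        &["version"],
    )?;
    registry.register(Box::new(nodes_by_version.clone()))?;

    // Pretend these version strings were reported by connected nodes.
    for raw in ["0.9.29-94078b44fb6", "0.9.29-94078b44fb6", "0.9.28-0ee32dc"] {
        nodes_by_version
            .with_label_values(&[&normalise_version(raw)])
            .inc();
    }

    // Render the registry the same way a /metrics handler would.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8_lossy(&buf));
    Ok(())
}
```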