Interface directly with polkadot node to collect gossip messages #91
Comments
Instinctively I like the "traffic mirroring" approach the best; to me it sounds like the only option that allows full access to all the data, and while it has a higher up-front cost to implement, we know by now how the jaeger/tracing/loki approach requires a lot of maintenance and is pretty brittle. Can you elaborate a bit on what the security issues are here (beyond "moar code, moar bugs")? The gossip traffic is not secret, so what would the risks be if there was a bug of some sort and a malicious party got access to the gossip stream?
Traffic mirroring makes sense to me. Maybe some kind of more compact repr of the messages that are sent, if we're concerned about resource usage. This is something we could maybe even encourage community validators to run alongside their nodes and build an opt-in telemetry platform around it.
I'm still thinking about mirroring this information into a sort of queue service (Kafka or Redis streams) so that any interested consumer can thereafter use this information.
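As a rough illustration of that queue idea, here is a minimal sketch of pushing mirrored gossip messages into a Redis stream so any interested consumer can read them later. The stream name, field layout and `GossipRecord` type are made up for this example and are not part of any existing plan:

```rust
/// Illustrative record for one mirrored gossip message (not a real Polkadot type).
struct GossipRecord {
    protocol: String, // e.g. "statement-distribution"
    peer: String,     // peer id the message was exchanged with
    payload: Vec<u8>, // raw (e.g. SCALE-encoded) message bytes
}

/// Push one record onto a Redis stream named "gossip" using the `redis` crate.
fn publish(record: &GossipRecord) -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    // XADD gossip * protocol <...> peer <...> payload <...>
    let _id: String = redis::cmd("XADD")
        .arg("gossip")
        .arg("*")
        .arg("protocol")
        .arg(&record.protocol)
        .arg("peer")
        .arg(&record.peer)
        .arg("payload")
        .arg(record.payload.as_slice())
        .query(&mut con)?;
    Ok(())
}
```

Consumers could then read the stream independently of the node (XREAD/XREADGROUP), which is the decoupling the comment above is after.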
@vstakhov This does not exclude that possibility. We could export this information to the outside world so folks can implement exactly that.
@dvdplm Yeah, you are right, the gossip traffic is no secret, so no encryption is likely required; maybe compression makes sense. What I am trying to point at is that the security posture of this feature depends on what knobs we expose to the outside world, one of those being the ability to set up a filter to not receive all messages. It's definitely in the realm of more code and more bugs, but the impact is that someone could take control of the node or crash it, resulting in slashing, or it could allow an attacker to control many vulnerable nodes at once, enabling more sophisticated attacks without having any DOT at stake.
@sandreim I think that we could use some more details on the implementation plan for the traffic mirroring option to move the conversation forward. What would the interfaces look like? How are filters defined/configured and where do they plug in? And the collected traffic, what format are we thinking? My point is: this sounds great but to really have a productive conversation I think we need to drill down and think it through. :)
Of course, I would need to prototype a bit in order to produce such an implementation plan. I created the issue to discuss the high level idea, not the actual implementation details 😄 |
I think this is at least worth putting in the public roadmap, with the actual details following up later.
I think hooking into the network bridge easily carries too high of an overhead. There are lots of messages, and networking is usually a major contributor to load. If we hook into nodes already, I would suggest first thinking about what we are actually interested in and then hooking into the subsystem providing that information in the least noisy form, instead of hooking into the noisiest of them all. Any kind of monitoring usually broke at some point because of load, hence if we implement something from scratch, I would aim to first figure out the concrete questions we have/might have and then, based on them, integrate observability in the least noisy way that still answers those questions. We do have events on chain - we could also have node side events, with each subsystem reporting on particular events. E.g. for each candidate:
That being said, monitoring and crunching all the network traffic might still be useful and give additional insights, but to me that would make the most sense on a standalone node that just participates in gossip (once we open it up to non-validator nodes).
Yes, I am expecting some additional overhead from filtering which messages need to be forwarded. Hooking into a specific subsystem might be a better option for having this enabled by validators in prod on an opt-in basis, as @rphmeier suggested. The way I currently see things, this still has the advantage of being simpler to implement and more versatile:
For starters we could target forwarding only what the node sends out (not what it receives) - specific messages of
I am not aiming to crunch all the traffic, but to be dynamic and selective about what gets streamed over from the node - which IMO makes the most sense for debugging. If we are thinking about network observability, I believe we should do it against the real world messages rather than something which is related to the inner workings of the subsystems. How feasible is it to have non-validators participate in gossip? This makes a lot of sense to me.
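To make the "dynamic and selective" part more concrete, here is a minimal sketch of what such a filter could look like. The names (`MirrorFilter`, `Direction`, `matches`) and the protocol strings are hypothetical, not existing Polkadot APIs:

```rust
/// Which direction of traffic to mirror.
#[derive(Clone, Copy, PartialEq)]
enum Direction {
    Sent,
    Received,
}

/// Hypothetical filter describing what gets streamed off the node.
struct MirrorFilter {
    /// Forward only messages flowing in this direction; `None` means both.
    direction: Option<Direction>,
    /// Forward only messages from these protocols (e.g. "validation",
    /// "collation"); an empty list means "all protocols".
    protocols: Vec<String>,
}

impl MirrorFilter {
    /// Decide whether a single message should be forwarded.
    fn matches(&self, direction: Direction, protocol: &str) -> bool {
        let direction_ok = self.direction.map_or(true, |d| d == direction);
        let protocol_ok =
            self.protocols.is_empty() || self.protocols.iter().any(|p| p.as_str() == protocol);
        direction_ok && protocol_ok
    }
}
```

A filter like this could be reconfigured at runtime so that, for example, only outgoing validation-protocol messages ever leave the node.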
I've done some experiments:
While they look very promising I tend to agree with you @eskimor - we shouldn't be mirroring the network bridge traffic, but rather have some separate node side events. Aside from the noise issues you pointed out (it's rather hard and cpu intensive to filter/compact the info) we also need to provide more context which is not available in the messages themselves, but in the node subsystems. I'll move towards a node event approach with a dedicated subsystem to handle and expose them to the outside world. |
Been playing around with node events: https://github.com/paritytech/polkadot/compare/sandreim/node_events_try1. The current plan is to create 1 new subsystem to gather the events from all other subsystems and 1 crate which defines the events to allow other applications to work with them. Some open questions:
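For illustration, a rough sketch of what such an events crate could expose, based on the per-candidate lifecycle discussed above; the variant names, fields and the `CandidateHash` stand-in are guesses, not the actual design from the linked branch:

```rust
use std::time::SystemTime;

/// Stand-in for the real candidate hash type used by the node.
type CandidateHash = [u8; 32];

/// Per-candidate lifecycle events that the events crate could define and
/// that the gathering subsystem would collect from the other subsystems.
enum NodeEvent {
    /// The candidate was seconded / backed by this validator.
    CandidateBacked { candidate: CandidateHash, at: SystemTime },
    /// The candidate was included on chain.
    CandidateIncluded { candidate: CandidateHash, at: SystemTime },
    /// An approval vote was issued for the candidate.
    ApprovalIssued { candidate: CandidateHash, at: SystemTime },
    /// A dispute concerning the candidate was observed.
    DisputeObserved { candidate: CandidateHash, at: SystemTime },
}
```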
Current status
Right now Introspector works with on-chain data via RPC calls to nodes, so its ability to trace parachain blocks and detect issues is limited to what block authors put in blocks, which is just the tip of the iceberg when it comes to what actually goes on in the network.

Vision
We must be able to collect and analyse gossip data to provide a better view of the network activity. This should allow us to answer questions like "Are parts of the network under attack?" and provide more granular root causing of issues across the parachain block pipeline.
For this to work we must focus largely (but not exclusively) on gossip messages exchanged during parachain consensus, spanning from collation generation to backing, approval voting and dispute resolution.
Technical challenge
The main one is that the gossip messages we are interested in are only sent between the validators in the active set. Even with an embedded light client, we could still not receive them as the active set currently works as a "cartel".
To overcome this we have the following options (unless you know better ones):
- `network bridge traffic mirroring` in polkadot/cumulus, such that third party applications like `polkadot-introspector` can connect to a node via dedicated connections in order to receive all or a subset of the gossip messages that are sent or received.

While the first option was tackled to some extent by Jaeger spans and Grafana/Loki, it did not provide a structured and concise view of what goes on and is very expensive.
Proposed solution
We should implement the `network bridge traffic mirroring` feature in Polkadot with built-in security, filtering and ease of use in mind. `Polkadot-introspector` will connect to a specific port, subscribe to different network protocol message streams and receive the messages exchanged through the `network-bridge-subsystem` in real time. While this sounds dangerous in practice and might be a resource hog, we will implement this as a debug mechanism first and later open it up for burn-in nodes on Kusama.
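As a sketch of what the consumer side could look like under these assumptions, the snippet below shows an introspector-like client connecting to such a port and reading mirrored messages. The `SUBSCRIBE` command, the length-prefixed framing and the fixed local port are all invented for this example:

```rust
use std::io::{BufReader, Read, Write};
use std::net::TcpStream;

/// Connect to the (hypothetical) mirroring port and stream one protocol's messages.
fn subscribe(port: u16, protocol: &str) -> std::io::Result<()> {
    let mut stream = TcpStream::connect(("127.0.0.1", port))?;
    // Ask the node to start mirroring one protocol's gossip messages.
    stream.write_all(format!("SUBSCRIBE {}\n", protocol).as_bytes())?;

    let mut reader = BufReader::new(stream);
    loop {
        // Each mirrored message: 4-byte little-endian length, then the payload.
        let mut len = [0u8; 4];
        reader.read_exact(&mut len)?;
        let mut payload = vec![0u8; u32::from_le_bytes(len) as usize];
        reader.read_exact(&mut payload)?;
        // Hand the raw bytes off to decoding / analysis.
        println!("received {} bytes on {}", payload.len(), protocol);
    }
}
```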
Security considerations
The risks of having an additional attack surface which remote actors can access is something that we need to mitigate with built-in security features like:
- `localhost`
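As a purely illustrative sketch of the `localhost` point, the mirroring listener could refuse to bind to anything other than a loopback address, keeping the endpoint unreachable for remote actors; the function name is made up for this example:

```rust
use std::net::{SocketAddr, TcpListener};

/// Bind the (hypothetical) gossip mirroring endpoint, rejecting non-loopback addresses.
fn bind_mirror_endpoint(addr: SocketAddr) -> std::io::Result<TcpListener> {
    // Enforce the localhost-only policy instead of trusting the caller's config.
    if !addr.ip().is_loopback() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::InvalidInput,
            "gossip mirroring endpoint must bind to localhost",
        ));
    }
    TcpListener::bind(addr)
}
```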
As usual, discussion is open on all of the above points.