-
Thank you for opening this!
In general, we spawn a single task per connection, so anything you do in your connection handler runs on that task. I know that @mxinden is connected to > 10,000 nodes on his exporter node, 1,500 of which are Polkadot nodes, and it seems to be doing fine: https://kademlia-exporter.max-inden.de/d/Pfr0Fj6Mk/rust-libp2p?orgId=1&refresh=30s&var-data_source=Prometheus&var-instance=kademlia-exporter-polkadot:8080

I am not a PromQL expert, but the following query is supposed to be the CPU utilization, i.e. `100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`: https://kademlia-exporter.max-inden.de/explore?orgId=1&left=%7B%22datasource%22:%22Prometheus%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22100%20-%20(avg%20by(instance)%20(irate(node_cpu_seconds_total%7Bmode%3D%5C%22idle%5C%22%7D%5B5m%5D))%20*%20100)%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

Overall, it really depends on what you are doing in your connection handler.
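To illustrate the task-per-connection model outside of libp2p specifics, here is a minimal sketch assuming a plain tokio TCP listener (not the actual rust-libp2p transport stack): each accepted connection is handled on its own spawned task, so heavy per-connection work only loads that one task.

```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0").await?;
    loop {
        let (mut socket, peer) = listener.accept().await?;
        // One task per connection: everything the "handler" does
        // (decryption, muxing, protocol logic) stays on this task.
        tokio::spawn(async move {
            let mut buf = [0u8; 4096];
            loop {
                match socket.read(&mut buf).await {
                    Ok(0) | Err(_) => break, // connection closed or errored
                    Ok(n) => println!("{peer}: read {n} bytes"),
                }
            }
        });
    }
}
```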
-
Gathered some more flamegraphs, this time from a full validator node. I attached the files as .txt files, since GitHub otherwise captures the SVG as a regular image and doesn't let you navigate through it. Remove the .txt extension after downloading, before opening. Here are two flamegraphs:
Some interesting things I noticed:
Keen to see what everybody else is noticing on these graphs.
-
Another avenue that I think is worth exploring in the future is socket buffering. At the moment, according to the flamegraphs, there isn't a lot of CPU time spent in syscalls, which I believe is because other consumers currently dominate the CPU; once we optimise those, I think we'll see a much higher percentage for socket syscalls. In our current setup (yamux + noise + TCP) there is, in theory, some buffering involved. However, it turns out not to be effective, resulting in quite a large number of socket system calls with small messages.
I don't know whether this is a conscious decision, but it does make sense not to buffer messages in order to keep latency low when communicating with nodes to which we don't transmit a lot of data.
We could optimise this by optimistically trying to read a larger number of bytes from the underlying socket. This would help in cases where a lot of data is frequently received from a node. Another thing I noticed is that we always perform at least two syscalls on both RX and TX for a single packet: one for the length and another one for the payload. This could also easily be optimised to halve the socket syscalls (see the sketch below). I wouldn't start this work until we've minimised the CPU usage coming from other places and made sure that syscalls are a real problem.
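A minimal blocking sketch of the idea, assuming Noise-style framing with a 2-byte big-endian length prefix (this is not the actual rust-libp2p code path, which is async): wrapping the socket in a buffered reader means the length prefix and the frame body are usually served from a single underlying `read(2)` instead of two, and a burst of small frames can be drained from the in-memory buffer with no extra syscalls at all.

```rust
use std::io::{self, BufReader, Read};
use std::net::TcpStream;

/// Read one length-prefixed frame (2-byte big-endian length) from a
/// buffered reader. `read_exact` pulls from the in-memory buffer first,
/// so the prefix and the body typically cost one socket read in total.
fn read_frame<R: Read>(reader: &mut BufReader<R>) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 2];
    reader.read_exact(&mut len_buf)?;
    let len = u16::from_be_bytes(len_buf) as usize;

    let mut frame = vec![0u8; len];
    reader.read_exact(&mut frame)?;
    Ok(frame)
}

fn main() -> io::Result<()> {
    // Hypothetical address, just for illustration.
    let stream = TcpStream::connect("127.0.0.1:30333")?;
    // 64 KiB covers the maximum frame size a 2-byte length can describe,
    // so the buffer can hold several small frames at once.
    let mut reader = BufReader::with_capacity(64 * 1024, stream);
    let frame = read_frame(&mut reader)?;
    println!("received {} bytes", frame.len());
    Ok(())
}
```

On the TX side, the analogous fix would be to coalesce the length prefix and the payload into a single write (e.g. via a `BufWriter` or a vectored write) rather than issuing two separate write syscalls.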
-
Meta: @alindima we do a biweekly open maintainers call. See #4276 for the previous one. It would be great to have you in the next one (15th of August).
-
We're trying to scale Polkadot to support 400 validators and 80 parachains and are running into trouble with libp2p performance. Out of all the tasks in Polkadot, the libp2p task uses a lot of CPU time, and this is a blocker for scaling to these higher validator/parachain numbers.
Switching between the task executor Substrate uses and whatever implementation libp2p provides doesn't seem to yield a meaningful difference.
Here's an image of polling durations; orange/pinkish is block import and yellow is `libp2p-node`, which is defined in the builder and used to create the `Swarm` object in the network service. The execution time of each task is measured with `Histogram::start_timer()` in `spawn_inner()`.
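As a rough illustration of that measurement (a sketch, not Substrate's actual `spawn_inner()` code, and `timed` is a hypothetical helper), a task can be wrapped so that every `poll` is timed with a Prometheus histogram via `Histogram::start_timer()`, which records the elapsed time when the timer is dropped at the end of the poll:

```rust
use prometheus::{Histogram, HistogramOpts};
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Wraps a future and records the duration of every `poll` call in a
/// Prometheus histogram, so CPU time can be attributed to individual
/// long-running tasks such as `libp2p-node`.
struct TimedTask<F> {
    inner: Pin<Box<F>>,
    poll_duration: Histogram,
}

impl<F: Future> Future for TimedTask<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // `TimedTask` is `Unpin` because the inner future is boxed.
        let this = self.get_mut();
        // The timer observes the elapsed duration into the histogram
        // when it is dropped, i.e. at the end of this poll.
        let _timer = this.poll_duration.start_timer();
        this.inner.as_mut().poll(cx)
    }
}

/// Hypothetical helper: wrap `fut` so its poll durations are recorded
/// under `<name>_poll_duration_seconds`.
fn timed<F: Future>(name: &str, fut: F) -> TimedTask<F> {
    let opts = HistogramOpts::new(
        format!("{name}_poll_duration_seconds"),
        "Duration of a single poll of the task, in seconds",
    );
    TimedTask {
        inner: Box::pin(fut),
        poll_duration: Histogram::with_opts(opts).expect("valid histogram options"),
    }
}
```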
Do you have any tips on how we might go about debugging this?