-
Thank you for opening this!
In general, we spawn a single task per connection, so anything you do in your connection handler runs on that task. I know that @mxinden is connected to > 10,000 nodes on his exporter node, 1,500 of which are Polkadot nodes, and it seems to be doing fine: https://kademlia-exporter.max-inden.de/d/Pfr0Fj6Mk/rust-libp2p?orgId=1&refresh=30s&var-data_source=Prometheus&var-instance=kademlia-exporter-polkadot:8080

I am not a PromQL expert, but the following query is supposed to be the CPU utilization, i.e. `100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`: https://kademlia-exporter.max-inden.de/explore?orgId=1&left=%7B%22datasource%22:%22Prometheus%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22100%20-%20(avg%20by(instance)%20(irate(node_cpu_seconds_total%7Bmode%3D%5C%22idle%5C%22%7D%5B5m%5D))%20*%20100)%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

Overall, it really depends on what you are doing in your connection handler.
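To illustrate the task-per-connection model outside of libp2p specifics, here is a minimal sketch assuming a plain tokio TCP listener (not the actual rust-libp2p transport stack): each accepted connection is handled on its own spawned task, so heavy per-connection work only loads that one task.

```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0").await?;
    loop {
        let (mut socket, peer) = listener.accept().await?;
        // One task per connection: everything the "handler" does
        // (decryption, muxing, protocol logic) stays on this task.
        tokio::spawn(async move {
            let mut buf = [0u8; 4096];
            loop {
                match socket.read(&mut buf).await {
                    Ok(0) | Err(_) => break, // connection closed or errored
                    Ok(n) => println!("{peer}: read {n} bytes"),
                }
            }
        });
    }
}
```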
-
Gathered some more flamegraphs, this time from a full validator node. I attached the files as .txt files, since GitHub otherwise captures the SVG as a regular image and doesn't let you navigate through it. Remove the .txt extension after downloading, before opening. Here are two flamegraphs:
Some interesting things I noticed:
Keen to see what everybody else is noticing on these graphs.
-
Another avenue that I think is worth exploring in the future is socket buffering. At the moment, according to the flamegraphs, there isn't a lot of CPU time spent in syscalls, which I believe is because other consumers currently dominate the CPU; once we optimise those, I think we'll see a much higher percentage for socket syscalls. In our current setup (yamux + noise + TCP) there is, in theory, some buffering involved. However, it turns out not to be effective, resulting in quite a large number of socket system calls with small messages.
I don't know whether this is a conscious decision, but it does make sense not to buffer messages in order to keep latency low when communicating with nodes to which we don't transmit a lot of data.
We could optimise this by optimistically trying to read a larger number of bytes from the underlying socket. This would help in cases where a lot of data is frequently received from a node. Another thing I noticed is that we always perform at least two syscalls on both RX and TX for a single packet: one for the length and another one for the payload. This could also easily be optimised to halve the socket syscalls (see the sketch below). I wouldn't start this work until we've minimised the CPU usage coming from other places and made sure that syscalls are a real problem.
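A minimal blocking sketch of the idea, assuming Noise-style framing with a 2-byte big-endian length prefix (this is not the actual rust-libp2p code path, which is async): wrapping the socket in a buffered reader means the length prefix and the frame body are usually served from a single underlying `read(2)` instead of two, and a burst of small frames can be drained from the in-memory buffer with no extra syscalls at all.

```rust
use std::io::{self, BufReader, Read};
use std::net::TcpStream;

/// Read one length-prefixed frame (2-byte big-endian length) from a
/// buffered reader. `read_exact` pulls from the in-memory buffer first,
/// so the prefix and the body typically cost one socket read in total.
fn read_frame<R: Read>(reader: &mut BufReader<R>) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 2];
    reader.read_exact(&mut len_buf)?;
    let len = u16::from_be_bytes(len_buf) as usize;

    let mut frame = vec![0u8; len];
    reader.read_exact(&mut frame)?;
    Ok(frame)
}

fn main() -> io::Result<()> {
    // Hypothetical address, just for illustration.
    let stream = TcpStream::connect("127.0.0.1:30333")?;
    // 64 KiB covers the maximum frame size a 2-byte length can describe,
    // so the buffer can hold several small frames at once.
    let mut reader = BufReader::with_capacity(64 * 1024, stream);
    let frame = read_frame(&mut reader)?;
    println!("received {} bytes", frame.len());
    Ok(())
}
```

On the TX side, the analogous fix would be to coalesce the length prefix and the payload into a single write (e.g. via a `BufWriter` or a vectored write) rather than issuing two separate write syscalls.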
-
Meta: @alindima we do a biweekly open maintainers call. See #4276 for the previous one. It would be great to have you in the next one (15th of August).
-
We're trying to scale Polkadot to support 400 validators and 80 parachains and are running into trouble with libp2p performance. Out of all the tasks in Polkadot, the libp2p task uses a lot of CPU time, and this is a blocker for scaling to these higher validator/parachain numbers.
Switching between the task executor Substrate uses and whatever implementation libp2p provides doesn't seem to yield a meaningful difference.
Here's an image of polling durations; orange/pinkish is block import and yellow is `libp2p-node`, which is defined in the builder and used to create the `Swarm` object in the network service. The execution time of each task is measured with `Histogram::start_timer()` in `spawn_inner()`.
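As a rough illustration of that measurement (a sketch, not Substrate's actual `spawn_inner()` code, and `timed` is a hypothetical helper), a task can be wrapped so that every `poll` is timed with a Prometheus histogram via `Histogram::start_timer()`, which records the elapsed time when the timer is dropped at the end of the poll:

```rust
use prometheus::{Histogram, HistogramOpts};
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Wraps a future and records the duration of every `poll` call in a
/// Prometheus histogram, so CPU time can be attributed to individual
/// long-running tasks such as `libp2p-node`.
struct TimedTask<F> {
    inner: Pin<Box<F>>,
    poll_duration: Histogram,
}

impl<F: Future> Future for TimedTask<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // `TimedTask` is `Unpin` because the inner future is boxed.
        let this = self.get_mut();
        // The timer observes the elapsed duration into the histogram
        // when it is dropped, i.e. at the end of this poll.
        let _timer = this.poll_duration.start_timer();
        this.inner.as_mut().poll(cx)
    }
}

/// Hypothetical helper: wrap `fut` so its poll durations are recorded
/// under `<name>_poll_duration_seconds`.
fn timed<F: Future>(name: &str, fut: F) -> TimedTask<F> {
    let opts = HistogramOpts::new(
        format!("{name}_poll_duration_seconds"),
        "Duration of a single poll of the task, in seconds",
    );
    TimedTask {
        inner: Box::pin(fut),
        poll_duration: Histogram::with_opts(opts).expect("valid histogram options"),
    }
}
```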
Do you have any tips on how we might go about debugging this?