litep2p authority discovery is very slow 5m vs 1h #7077
Comments
The slow discovery of authority records is caused by litep2p providing found records only after the Kademlia query finishes execution. In contrast, libp2p provides each record as soon as it receives it from the network. Adopting a similar approach for litep2p (providing records as soon as we discover them) leads to significant improvements, outperforming libp2p.
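To make the difference concrete, here is a minimal sketch of the two delivery strategies. It is not litep2p or libp2p code: the function names and the timing model (a list of arrival times plus one query end time) are assumptions made purely for illustration.

```rust
use std::time::Duration;

/// Batched delivery (hypothetical model): every record becomes visible to the
/// caller only once the whole query has finished.
pub fn batched_visibility(arrivals: &[Duration], query_end: Duration) -> Vec<Duration> {
    arrivals.iter().map(|_| query_end).collect()
}

/// Streaming delivery (hypothetical model): each record becomes visible at the
/// moment it arrives from the network.
pub fn streaming_visibility(arrivals: &[Duration]) -> Vec<Duration> {
    arrivals.to_vec()
}

/// Mean delay between a record arriving and the caller being able to use it.
pub fn mean_added_delay(arrivals: &[Duration], visible: &[Duration]) -> Duration {
    if arrivals.is_empty() {
        return Duration::ZERO;
    }
    let total: Duration = arrivals
        .iter()
        .zip(visible)
        .map(|(arrived, seen)| seen.saturating_sub(*arrived))
        .sum();
    total / arrivals.len() as u32
}
```

With records arriving at 10 s, 20 s and 30 s in a query that ends at 40 s, the batched model adds a mean delay of 20 s per record, while the streaming model adds none. Since a query can take from tens of seconds to a few minutes, that per-query delay accumulates across many authority lookups.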
I'll have another look tomorrow and investigate CPU consumption, which might increase with the number of propagated messages.
I appreciate that these issues are generally for bugs and the like. However, thus far my experience has been good on all thirty of my validators, minus a few hiccups at the beginning. Given that we may now have more peers using litep2p, how do the numbers look? Has the time to reach 95% of peers decreased?
Thanks @paradox-tt for your help! We are seeing good improvements with paritytech/litep2p#315, which will be reflected in the next point release of polkadot. This should reduce the time to discover ~95% of records to around 2-3 minutes, which outperforms libp2p (the current backend). We have temporarily paused the litep2p rollout until we debug #7076 (comment). Thanks again for your help 🙏
This PR provides the partial results of the `GetRecord` kademlia query.

- A new `GetRecordPartialResult` event is introduced for kademlia
- `GetRecordSuccess` is modified to include only the query ID
- Kademlia `GetRecord` implementation no longer stores network records internally and forwards valid (unexpired) ones back to the user

The change is needed to speedup authority discovery for substrate based chains.

More context can be found at: paritytech/polkadot-sdk#7077 (comment)

### Next Steps

- [x] Adjust testing to API breaking change
- [x] Investigate CPU impact (as suggested by @dmitry-markin this should be unnoticeable 🙏 )

---------

Signed-off-by: Alexandru Vasile <[email protected]>
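A rough sketch of how a consumer might handle the two events described above. The event names come from the PR description. Everything else is an assumption for illustration and is not the litep2p API: the `Record` struct, the `QueryId` alias, the fields of `GetRecordPartialResult`, and the `Discovery` type.

```rust
use std::collections::HashMap;

/// Hypothetical query identifier.
pub type QueryId = u64;

/// Hypothetical, simplified record type.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub key: Vec<u8>,
    pub value: Vec<u8>,
}

/// Simplified stand-in for the Kademlia events named in the PR description.
#[derive(Debug, Clone, PartialEq)]
pub enum KademliaEvent {
    /// Emitted for each valid record as soon as it is received.
    GetRecordPartialResult { query_id: QueryId, record: Record },
    /// Emitted once when the query finishes; carries only the query ID.
    GetRecordSuccess { query_id: QueryId },
}

/// Hypothetical consumer that uses records immediately instead of waiting
/// for the query to complete.
#[derive(Default)]
pub struct Discovery {
    /// Records usable so far, keyed by record key.
    pub known: HashMap<Vec<u8>, Vec<u8>>,
    /// Queries that have completed.
    pub finished: Vec<QueryId>,
}

impl Discovery {
    pub fn on_event(&mut self, event: KademliaEvent) {
        match event {
            // Act on the record right away; the query may still be running.
            KademliaEvent::GetRecordPartialResult { record, .. } => {
                self.known.insert(record.key, record.value);
            }
            // Completion no longer delivers records, only a signal.
            KademliaEvent::GetRecordSuccess { query_id } => {
                self.finished.push(query_id);
            }
        }
    }
}
```

The design point is that the completion event becomes pure bookkeeping: the caller already holds every record by the time it arrives, so nothing needs to be buffered inside the Kademlia implementation.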
This PR provides the partial results of the `GetRecord` kademlia query.

This significantly improves authority record discovery, from ~37 minutes to ~2-3 minutes. In contrast, libp2p discovers authority records in ~10 minutes.

The authority discovery was slow because litep2p provided the records only after the Kademlia query was completed. A normal Kademlia query completes in around 40 seconds to a few minutes. In this PR, partial records are provided as soon as they are discovered from the network.

### Testing Done

Started a node in Kusama with `--validator` and litep2p backend. The node discovered 996/1000 authority records in ~1 minute 45 seconds.

![Screenshot 2025-01-09 at 12 26 08](https://github.com/user-attachments/assets/b618bf7c-2bba-43a0-a021-4047e854c075)

### Before & After

In this image, libp2p is on the left, litep2p without this PR is in the middle, and litep2p with this PR is on the right.

![Screenshot 2025-01-07 at 17 57 56](https://github.com/user-attachments/assets/a8d467f7-8dc7-461c-bcff-163b94d01ae8)

Closes: #7077

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Our Kusama nodes that run with litep2p discover their peers very slowly, which leaves them sparsely connected. With libp2p we connect to 95% of nodes in 5 minutes, while with litep2p it takes around 30 minutes to reach about 90% of connections and another 30 minutes to reach about 95%.
Why is this bad
Sparsely connected validators contribute negatively to the network in a few ways:
They won't see enough assignments and approvals to approve the candidate, so they won't vote on finality.
Because they aren't connected to at least 1/3 of the network, they won't be able to approve candidates, but assignments will still be triggered and distributed, so they will count as no-shows on other nodes. Other nodes will cover the no-show, but that still means they delay finality by at least the no-show period.
My take
If enough nodes of this type restart at the same time, we will end up stress testing the network. Theoretically we should be able to support 1/3 of the network being in this state at the same time, but if we get past the 1/3 threshold, finality will lag until the fail-safe kicks in. With that in mind, I think we should first improve litep2p on this dimension before we enable it on a significant number of validators in Kusama.
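As a back-of-the-envelope check of the 1/3 threshold mentioned above, here is a small sketch. The function name is made up for illustration, and it only encodes the threshold stated in this issue, not the actual finality logic.

```rust
/// Returns true if strictly more than one third of the validators are
/// sparsely connected (hypothetical helper, not part of polkadot-sdk).
pub fn exceeds_one_third(sparse: u32, total: u32) -> bool {
    // Integer form of `sparse / total > 1/3`; widening to u64 avoids overflow.
    u64::from(sparse) * 3 > u64::from(total)
}
```

For a set of 1000 validators, 333 sparsely connected nodes stay within the threshold, while 334 exceed it.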
cc: @paritytech/networking