litep2p authority discovery is very slow 5m vs 1h #7077

Closed
Tracked by #7076
alexggh opened this issue Jan 7, 2025 · 3 comments · Fixed by #7099

Comments

@alexggh
Contributor

alexggh commented Jan 7, 2025

Our Kusama nodes that run with litep2p discover their peers very slowly, which leaves them sparsely connected. With libp2p, a node connects to 95% of nodes within 5 minutes, while with litep2p it takes around 30 minutes to reach roughly 90% of connections and another 30 minutes to reach roughly 95%.

Screenshot 2025-01-07 at 15 47 20

Why is this bad

Sparsely connected validators contribute negatively to the network in a few ways:

  1. They won't see enough assignments and approvals to approve the candidate, so they won't vote on finality.

  2. Because they aren't connected to at least 1/3 of the network, they won't be able to approve candidates, but assignments will still be triggered and distributed, so they will count as no-shows on other nodes. Other nodes will cover the no-show, but that still means they delay finality by at least the no-show period.

My take

If enough nodes of this type restart at the same time, we will end up stress testing the network. Theoretically, we should be able to support 1/3 of the network being in this state at the same time, but if we get past the 1/3 threshold, finality will lag until the fail-safe kicks in. With that in mind, I think we should first improve litep2p on this dimension before we enable it on a significant number of validators in Kusama.
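As a rough illustration of the 1/3 threshold mentioned above, here is a minimal sketch. The function name and the validator counts are hypothetical and are not taken from polkadot-sdk; it only encodes the "more than one third" comparison.

```rust
/// Hypothetical helper (not part of polkadot-sdk): returns true when more than
/// one third of the validator set is sparsely connected at the same time.
fn exceeds_one_third(sparse: u32, total: u32) -> bool {
    // Compare sparse / total > 1 / 3 using integers to avoid rounding issues.
    // Widening to u64 prevents overflow in the multiplication.
    u64::from(sparse) * 3 > u64::from(total)
}
```

With an assumed set of 1000 validators, 333 sparsely connected nodes stay within the threshold, while 334 exceed it.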

cc: @paritytech/networking

@lexnv
Contributor

lexnv commented Jan 7, 2025

The slow discovery of authority records is related to the fact that litep2p provides found records after the kademlia query finishes execution. In contrast, libp2p provides the record as soon as it receives it from the network.

Adopting a similar approach for litep2p (providing records as soon as we discover them) leads to significant improvements, outperforming libp2p:

  • libp2p: discovering 1k records took 10 minutes (left graph)
  • litep2p without improvements: 37 minutes (middle graph)
  • litep2p with improvements: 2.5 minutes (right graph)
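The difference between the two delivery strategies can be sketched with a toy model. This is not litep2p or libp2p code: the `Query` struct, the function names, and any timings fed into it are hypothetical, chosen only to show why handing over records on completion delays every record to the end of the query.

```rust
/// Hypothetical model of a single Kademlia `GetRecord` query.
/// Illustrative only; the real backends do not expose this type.
struct Query {
    /// Seconds since query start at which each record arrived from the network.
    record_arrivals_secs: Vec<u64>,
    /// Seconds since query start at which the whole query finished.
    finished_at_secs: u64,
}

/// Delivery times when records are handed to the caller only after the query
/// finishes (the litep2p behaviour described above, before the improvement).
fn deliver_on_completion(query: &Query) -> Vec<u64> {
    query
        .record_arrivals_secs
        .iter()
        .map(|_| query.finished_at_secs)
        .collect()
}

/// Delivery times when each record is handed to the caller as soon as it is
/// received (the libp2p-style behaviour).
fn deliver_incrementally(query: &Query) -> Vec<u64> {
    query.record_arrivals_secs.clone()
}

/// Time at which the first record reaches the caller, if any record arrived.
fn first_delivery(delivery_times: &[u64]) -> Option<u64> {
    delivery_times.iter().copied().min()
}
```

For an assumed query whose records arrive at 3, 7, and 20 seconds and which finishes at 40 seconds, the incremental strategy makes the first record usable at 3 seconds, while the on-completion strategy makes all three wait until 40 seconds.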

I'll have another look tomorrow and investigate CPU consumption, which might increase with the number of propagated messages:

Screenshot 2025-01-07 at 17 57 56

@paradox-tt
Contributor

paradox-tt commented Jan 9, 2025

I appreciate that these issues are generally for bugs and the like. However, thus far my experience has been good on all thirty of my validators, minus a few hiccups at the beginning. Given that we may now have more peers using litep2p, how do the numbers look? Has the time to reach 95% of peers reduced?

@lexnv
Contributor

lexnv commented Jan 13, 2025

Thanks @paradox-tt for your help!

We are seeing good improvements with paritytech/litep2p#315, which will be reflected in the next point release of polkadot. This should reduce the time to discover ~95% of records to around 2 to 3 minutes, which outperforms libp2p (the current backend).

We have temporarily paused the litep2p rollout until we debug: #7076 (comment), thanks again for your help 🙏

lexnv added a commit to paritytech/litep2p that referenced this issue Jan 14, 2025
This PR provides the partial results of the `GetRecord` kademlia query.
- A new `GetRecordPartialResult` event is introduced for kademlia
- `GetRecordSuccess` is modified to include only the query ID
- Kademlia `GetRecord` implementation no longer stores network records
internally and forwards valid (unexpired) ones back to the user
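
A caller-side sketch of what this change implies: since Kademlia no longer stores the records, the user has to collect them from the partial-result events. The event names come from the commit message above, but the enum shape, field names, and the `RecordCollector` type are assumptions made for illustration and do not mirror the actual litep2p API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the Kademlia events named in the commit message.
/// The real litep2p types carry different fields; these shapes are assumed.
#[derive(Debug, Clone, PartialEq)]
enum KademliaEvent {
    /// A valid (unexpired) record found while the query is still running.
    GetRecordPartialResult { query_id: u64, key: Vec<u8>, value: Vec<u8> },
    /// The query finished; only the query ID is reported.
    GetRecordSuccess { query_id: u64 },
}

/// Hypothetical caller-side state that accumulates records per query.
#[derive(Default)]
struct RecordCollector {
    /// Records received so far, grouped by query ID.
    records: HashMap<u64, Vec<(Vec<u8>, Vec<u8>)>>,
    /// IDs of queries that have completed.
    finished: Vec<u64>,
}

impl RecordCollector {
    /// Processes one event: partial results are stored immediately so they
    /// can be used before the query ends; success only marks completion.
    fn on_event(&mut self, event: KademliaEvent) {
        match event {
            KademliaEvent::GetRecordPartialResult { query_id, key, value } => {
                self.records.entry(query_id).or_default().push((key, value));
            }
            KademliaEvent::GetRecordSuccess { query_id } => {
                self.finished.push(query_id);
            }
        }
    }
}
```

In this sketch a record is available to the caller as soon as its partial-result event is processed, and the success event carries no records of its own.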


The change is needed to speed up authority discovery for Substrate-based
chains.


More context can be found at:
paritytech/polkadot-sdk#7077 (comment)

### Next Steps
- [x] Adjust testing to API breaking change
- [x] Investigate CPU impact (as suggested by @dmitry-markin this should
be unnoticeable 🙏 )

---------

Signed-off-by: Alexandru Vasile <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Jan 15, 2025
This PR provides the partial results of the `GetRecord` kademlia query.

This significantly speeds up the discovery of authority records, from ~37
minutes to around 2 to 3 minutes.
In contrast, libp2p discovers authority records in around 10 minutes.

The authority discovery was slow because litep2p provided the records
only after the Kademlia query was completed. A normal Kademlia query
completes in around 40 seconds to a few minutes.
In this PR, partial records are provided as soon as they are discovered
from the network.

### Testing Done

Started a node in Kusama with `--validator` and litep2p backend.
The node discovered 996/1000 authority records in ~ 1 minute 45 seconds.

![Screenshot 2025-01-09 at 12 26 08](https://github.com/user-attachments/assets/b618bf7c-2bba-43a0-a021-4047e854c075)


### Before & After

In this image, on the left side is libp2p, in the middle litep2p without
this PR, on the right litep2p with this PR

![Screenshot 2025-01-07 at 17 57 56](https://github.com/user-attachments/assets/a8d467f7-8dc7-461c-bcff-163b94d01ae8)



Closes: #7077

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Nathy-bajo pushed a commit to Nathy-bajo/polkadot-sdk that referenced this issue Jan 21, 2025
…tech#7099)