[P2P] Optimize block pool requester retry and peer pick up logic #170

Merged
merged 8 commits into from
Dec 20, 2023

Conversation

yzang2019
Contributor

@yzang2019 yzang2019 commented Dec 19, 2023

Describe your changes and provide context

Problems:
There are several inefficiencies in the current p2p block sync module:

  • Even if a bpRequester has already downloaded the block from a peer, it can reset the block to nil whenever a retry is triggered. A retry is triggered whenever a peer disconnects or times out, which can cause a lot of extra network usage and slow down block sync. What usually happens is that the retry picks a bad peer and discards the block previously downloaded from a good peer, so the requester ends up waiting a long time for the response.
  • The block sync reactor needs the next two consecutive blocks to be available in the block pool before it can proceed with applyBlock. If either one is missing, the reactor gets stuck and waits for the bpRequester download to finish.
  • There is no timeout for a single block request. A request can take more than a few minutes to get a response from a bad or slow peer, and block sync is stuck during that time.
  • We are not using the good peer list when picking available peers.

Solution:

  • Introduce a retry reason. There are two major reasons: either a peer was removed due to disconnection, or the block failed validation in the reactor.
  • When handling a retry, check whether we already have the block. If the block is already there, we don't need to retry unless the retry reason is a bad block.
  • Introduce an extra timeout for the requester routine: while waiting for the block response, try a different peer if no response has arrived before the timeout.

Testing performed to validate your change

Tested on standalone rpc nodes

@yzang2019 yzang2019 changed the title from "P2P Improvements: Fix block sync reactor and block pool retry logic" to "P2P Improvements: Optimize block pool requester retry logic and peer pick up logic" Dec 19, 2023
@yzang2019 yzang2019 changed the title from "P2P Improvements: Optimize block pool requester retry logic and peer pick up logic" to "P2P]: Optimize block pool requester retry and peer pick up logic" Dec 19, 2023
@yzang2019 yzang2019 changed the title from "P2P]: Optimize block pool requester retry and peer pick up logic" to "[P2P] Optimize block pool requester retry and peer pick up logic" Dec 19, 2023
Comment on lines +478 to +480
if index >= len(goodPeers) {
	index = len(goodPeers) - 1
}
Contributor

nit - would this case ever be hit? Could also use modulo to have more randomness

Contributor Author

I thought math.randn is inclusive?
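
For reference on the bound being discussed: Go's standard library has no `math.randn`. If the index comes from `rand.Intn(n)` in `math/rand` (an assumption, since the surrounding code is not shown here), the result lies in the half-open interval [0, n), so it is already a valid slice index and the clamp would never be hit. A small sketch, with `pickIndex` and the peer IDs being hypothetical:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickIndex returns a uniformly random valid index for a slice of length n.
// rand.Intn(n) returns a value in [0, n) and panics if n <= 0,
// so the caller must make sure the slice is not empty.
func pickIndex(n int) int {
	return rand.Intn(n)
}

func main() {
	goodPeers := []string{"peerA", "peerB", "peerC"} // hypothetical peer IDs

	// Every draw is in range, so no clamping to len(goodPeers)-1 is needed.
	for i := 0; i < 1000; i++ {
		if idx := pickIndex(len(goodPeers)); idx < 0 || idx >= len(goodPeers) {
			panic("index out of range")
		}
	}
	fmt.Println("picked:", goodPeers[pickIndex(len(goodPeers))])
}
```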


codecov bot commented Dec 20, 2023

Codecov Report

Attention: 15 lines in your changes are missing coverage. Please review.

Comparison is base (72bb29c) 59.11% compared to head (6f4c6d3) 58.09%.
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #170      +/-   ##
==========================================
- Coverage   59.11%   58.09%   -1.03%     
==========================================
  Files         281      249      -32     
  Lines       38940    33907    -5033     
==========================================
- Hits        23021    19698    -3323     
+ Misses      14124    12645    -1479     
+ Partials     1795     1564     -231     
| Files | Coverage Δ |
|---|---|
| internal/eventbus/event_bus.go | 89.18% <100.00%> (ø) |
| internal/p2p/peermanager.go | 80.14% <0.00%> (-0.29%) ⬇️ |
| internal/blocksync/pool.go | 77.96% <78.94%> (+0.49%) ⬆️ |

... and 50 files with indirect coverage changes

@yzang2019 yzang2019 merged commit 016c1b9 into main Dec 20, 2023
24 checks passed
yzang2019 added a commit that referenced this pull request Dec 20, 2023
* P2P Improvements: Fix block sync reactor and block pool retry logic
stevenlanders added a commit that referenced this pull request Jan 4, 2024
* Perf: Increase buffer size for pubsub server to boost performance (#167)

* Increase buffer size for pubsub server

* Add more timeout for test failure

* Add more timeout

* Fix test split scripts

* Fix test split

* Fix unit test

* Unit test

* Unit test

* [P2P] Optimize block pool requester retry and peer pick up logic (#170)

* P2P Improvements: Fix block sync reactor and block pool retry logic

* Revert "Add event data to result event (#165)" (#176)

This reverts commit 72bb29c.

* Fix block sync auto restart not working as expected (#175)

* Fix edge case for blocksync (#178)

* fix evm pending nonce

* fix test

* deflake a test

* de-flake test

* Revert "merge main"

This reverts commit 58b9424, reversing
changes made to 02d1478.

* consider keep-in-cache logic when removing from cache

* undo test tweaks

---------

Co-authored-by: Yiming Zang <[email protected]>
Co-authored-by: Jeremy Wei <[email protected]>
stevenlanders added a commit that referenced this pull request Jan 4, 2024
stevenlanders added a commit that referenced this pull request Jan 4, 2024
codchen pushed a commit that referenced this pull request Jan 5, 2024
codchen pushed a commit that referenced this pull request Jan 11, 2024
stevenlanders added a commit that referenced this pull request Jan 30, 2024
stevenlanders added a commit that referenced this pull request Jan 30, 2024
* Standardize lag status response format (#187)

* Standardize lag status response format

* Fix flaky unit test

* Make ReadMaxTxs atomic (#166)

* Support pending transaction in mempool (#169)

* fix unconfirmed tx to consider pending txs (#172)

* fix pending pop (#173)

* add TTL for pending txs (#174)

* [EVM] Fix evm pending nonce (#179)

* Perf: Increase buffer size for pubsub server to boost performance (#167)

* Increase buffer size for pubsub server

* Add more timeout for test failure

* Add more timeout

* Fix test split scripts

* Fix test split

* Fix unit test

* Unit test

* Unit test

* [P2P] Optimize block pool requester retry and peer pick up logic (#170)

* P2P Improvements: Fix block sync reactor and block pool retry logic

* Revert "Add event data to result event (#165)" (#176)

This reverts commit 72bb29c.

* Fix block sync auto restart not working as expected (#175)

* Fix edge case for blocksync (#178)

* fix evm pending nonce

* fix test

* deflake a test

* de-flake test

* Revert "merge main"

This reverts commit 58b9424, reversing
changes made to 02d1478.

* consider keep-in-cache logic when removing from cache

* undo test tweaks

---------

Co-authored-by: Yiming Zang <[email protected]>
Co-authored-by: Jeremy Wei <[email protected]>

* Fix bug when popping pending TXs (#188)

* Add mempool metrics for number of pending tx and expired txs (#189)

* Add metrics for mempool pending transaction size

* Add expired tx count metrics

* [EVM] Allow multiple txs from same account in a block (#190)

* add mempool prioritization with evm nonce

* fix priority stability

* index fixes

* replace with binary search insert

* impl binary search

* fix removeTx to push next queued evm tx (#191)

* fix expire metric (#193)

* [EVM] Fix duplicate evm txs from priority queue (#195)

* debug duplicate evm tx

* add more logs

* add some \ns

* more logs

* fix swap check

* add-lockable-reap-by-gas

* add invariant checks

* fix invariant parenthesis

* fix log

* remove invalid invariant

* fix nonce ordering pain

* handle ordering of insert

* fix remove

* cleanup

* fix imports

* cleanup

* avoid getTransactionByHash(hash) panic due to index

* use Key() to compare instead of pointer

* [EVM] prevent duplicate txs from getting inserted (#196)

* prevent duplicates in mempool

* use timestamp in priority queue

* [EVM] Add logging for expiration (#198)

* add logging for expired txs

* cleanup

* [EVM] Avoid returning nil transactions on ForEach (#197)

* remove heapIndex to avoid nil scenario

* avoid returning nil in loop (mimic Peek)

---------

Co-authored-by: Yiming Zang <[email protected]>
Co-authored-by: codchen <[email protected]>
Co-authored-by: Jeremy Wei <[email protected]>
udpatil pushed a commit that referenced this pull request Feb 28, 2024
udpatil pushed a commit that referenced this pull request Mar 26, 2024
udpatil pushed a commit that referenced this pull request Apr 16, 2024
udpatil pushed a commit that referenced this pull request Apr 16, 2024
udpatil added a commit that referenced this pull request Apr 19, 2024
* Make ReadMaxTxs atomic (#166)

* Support pending transaction in mempool (#169)

* fix unconfirmed tx to consider pending txs (#172)

* fix pending pop (#173)

* add TTL for pending txs (#174)

* [EVM] Fix evm pending nonce (#179)

* Perf: Increase buffer size for pubsub server to boost performance (#167)

* Increase buffer size for pubsub server

* Add more timeout for test failure

* Add more timeout

* Fix test split scripts

* Fix test split

* Fix unit test

* Unit test

* Unit test

* [P2P] Optimize block pool requester retry and peer pick up logic (#170)

* P2P Improvements: Fix block sync reactor and block pool retry logic

* Revert "Add event data to result event (#165)" (#176)

This reverts commit 72bb29c.

* Fix block sync auto restart not working as expected (#175)

* Fix edge case for blocksync (#178)

* fix evm pending nonce

* fix test

* deflake a test

* de-flake test

* Revert "merge main"

This reverts commit 58b9424, reversing
changes made to 02d1478.

* consider keep-in-cache logic when removing from cache

* undo test tweaks

---------

Co-authored-by: Yiming Zang <[email protected]>
Co-authored-by: Jeremy Wei <[email protected]>

* Fix bug when popping pending TXs (#188)

* Add mempool metrics for number of pending tx and expired txs (#189)

* Add metrics for mempool pending transaction size

* Add expired tx count metrics

* [EVM] Allow multiple txs from same account in a block (#190)

* add mempool prioritization with evm nonce

* fix priority stability

* index fixes

* replace with binary search insert

* impl binary search

* fix removeTx to push next queued evm tx (#191)

* fix expire metric (#193)

* [EVM] Fix duplicate evm txs from priority queue (#195)

* debug duplicate evm tx

* add more logs

* add some \ns

* more logs

* fix swap check

* add-lockable-reap-by-gas

* add invariant checks

* fix invariant parenthesis

* fix log

* remove invalid invariant

* fix nonce ordering pain

* handle ordering of insert

* fix remove

* cleanup

* fix imports

* cleanup

* avoid getTransactionByHash(hash) panic due to index

* use Key() to compare instead of pointer

* [EVM] prevent duplicate txs from getting inserted (#196)

* prevent duplicates in mempool

* use timestamp in priority queue

* [EVM] Add logging for expiration (#198)

* add logging for expired txs

* cleanup

* [EVM] Avoid returning nil transactions on ForEach (#197)

* remove heapIndex to avoid nil scenario

* avoid returning nil in loop (mimic Peek)

* call callback from mempool (#200)

* separate limit for pending tx (#202)

* Add EVM txs eviction logic (#204)

* Fix debug log (#205)

* EVM transaction replacement (#206) (#208)

* Add heapIndex with safety check (#213)

* add heapIndex with safety check

* cleanup

* comment out for perf test

* add back perf improvement

* fix nil test

* Use write-lock in (*TxPriorityQueue).ReapMax funcs (#209)

ReapMaxBytesMaxGas and ReapMaxTxs funcs in TxPriorityQueue claim
> Transactions returned are not removed from the mempool transaction
> store or indexes.

However, they use a priority queue to accomplish the claim
> Transaction are retrieved in priority order.

This is accomplished by popping all items out of the whole heap, and
then pushing then back in sequentially. A copy of the heap cannot be
obtained otherwise. Both of the mentioned functions use a read-lock
(RLock) when doing this. This results in a potential scenario where
multiple executions of the ReapMax can be started in parallel, and
both would be popping items out of the priority queue.

In practice, this can be abused by executing the `unconfirmed_txs` RPC
call repeatedly. Based on our observations, running it multiple times
per millisecond results in multiple threads picking it up at the same
time. Such a scenario can be obtained via the WebSocket interface, and
spamming `unconfirmed_txs` calls there. The behavior that happens is a
`Panic in WSJSONRPC handler` when a queue item unexpectedly disappears
for `mempool.(*TxPriorityQueue).Swap`.
(`runtime error: index out of range [0] with length 0`)

This can additionally lead to a `CONSENSUS FAILURE!!!` if the race
condition occurs for `internal/consensus.(*State).finalizeCommit`
when it tries to do `mempool.(*TxPriorityQueue).RemoveTx`, but
the ReapMax has already removed all elements from the underlying
heap. (`runtime error: index out of range [-1]`)

This commit switches the lock type to a write-lock (Lock) to ensure
no parallel modifications take place. This commit additionally updates
the tests to allow parallel execution of the func calls in testing,
as to prevent regressions (in case someone wants to downgrade the locks
without considering the implications from the underlying heap usage).

---------

Co-authored-by: Valters Jansons <[email protected]>

* Pending Txs Update Condition (#214)

* Add metrics for mempool size changes (#220)

* [EVM] Adjust locking for replacement (#224)

* Remove tx from cache when canAddPendingTx fails (#230)

* add tx hash to evm info proto (#231)

---------

Co-authored-by: codchen <[email protected]>
Co-authored-by: Steven Landers <[email protected]>
Co-authored-by: Yiming Zang <[email protected]>
Co-authored-by: Jeremy Wei <[email protected]>
Co-authored-by: Valters Jansons <[email protected]>
Co-authored-by: Kartik Bhat <[email protected]>
udpatil pushed a commit that referenced this pull request Apr 19, 2024