This issue is spun off from findings we've seen with neutrino over time, either when we have a lot of active peers, or when some of the peers are connected over tor. This'll serve as a master tracking issue for a number of improvements we can make to this area of the code base. I also consider much of this to be preparatory work before we begin parallel block download in btcd in earnest. This came out of the discussion in lightningnetwork/lnd#8250.
Sampling Based Peer Ranking
Today the default peer ranking algorithm will just re-order the peers (neutrino/query/peer_rank.go, lines 20 to 26 at 42a196f):
As peers start to fill up, we'll allocate the task to the next peer in line. In the wild, we've seen that this ends up allocating tasks to slow peers.
To get around this, we should look at using a sampling-based method: we'll map the rankings into a probability that the peer will respond properly, and sample from that distribution instead of taking the peers in sorted order.
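As a rough illustration of that sampling step, here is a minimal sketch in Go. The `peerScore` struct and `pickPeer` helper are hypothetical names, not part of neutrino's query package; the point is only that selection becomes proportional to score rather than following a strict sort order.

```go
// Hypothetical sketch of sampling-based peer selection: convert each peer's
// score into a weight and sample proportionally, so fast peers dominate but
// slower peers still get picked occasionally.
package main

import (
	"fmt"
	"math/rand"
)

// peerScore pairs a peer with an illustrative score, where a higher score
// means "more likely to respond properly". Neutrino's real ranking state
// looks different; this only shows the sampling step.
type peerScore struct {
	addr  string
	score float64
}

// pickPeer samples one peer with probability proportional to its score,
// instead of always taking the top of a sorted list.
func pickPeer(peers []peerScore) string {
	var total float64
	for _, p := range peers {
		total += p.score
	}
	r := rand.Float64() * total
	for _, p := range peers {
		r -= p.score
		if r <= 0 {
			return p.addr
		}
	}
	return peers[len(peers)-1].addr
}

func main() {
	peers := []peerScore{
		{addr: "clearnet-peer", score: 0.9},
		{addr: "tor-peer", score: 0.3},
	}
	// The clearnet peer is picked ~75% of the time, the tor peer ~25%.
	fmt.Println(pickPeer(peers))
}
```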
Dynamic Ping Tuned Timeouts
Today we have a hard-coded 2 second timeout for all requests. This works with very good network connections, but tor nodes may routinely need longer than that to respond.
Instead, we should use the ping latency to tune the timeout for a given peer, based on the extrapolated RTT. This can also be fed into the peer ranking itself. We should also refresh this metric in the background (as network jitter can change perceived latency, especially on tor).
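A minimal sketch of the idea, assuming we keep a smoothed RTT per peer: the timeout becomes a multiple of the smoothed ping RTT, clamped to a floor. The `rttTracker` type and its constants are illustrative, not neutrino's API.

```go
// Hypothetical sketch of deriving a per-peer timeout from observed ping
// RTTs: keep an exponentially weighted moving average of the samples and
// scale it, with a floor so very fast peers still get a sane minimum.
package main

import (
	"fmt"
	"time"
)

type rttTracker struct {
	smoothed time.Duration
}

// observe folds a new ping sample into the smoothed RTT (EWMA with
// alpha = 1/8, the same weighting TCP uses for SRTT).
func (t *rttTracker) observe(sample time.Duration) {
	if t.smoothed == 0 {
		t.smoothed = sample
		return
	}
	t.smoothed = t.smoothed - t.smoothed/8 + sample/8
}

// timeout returns the request timeout for this peer: a multiple of the
// smoothed RTT, never less than a global minimum.
func (t *rttTracker) timeout() time.Duration {
	const (
		minTimeout = 2 * time.Second
		multiplier = 4
	)
	d := t.smoothed * multiplier
	if d < minTimeout {
		d = minTimeout
	}
	return d
}

func main() {
	var tor rttTracker
	for _, s := range []time.Duration{900, 1100, 1500, 1300} {
		tor.observe(s * time.Millisecond)
	}
	fmt.Println(tor.timeout()) // scales with the peer's measured latency
}
```

Because `observe` is cheap, it can run on every background ping, which also gives us the refreshed latency metric to feed back into the ranking.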
Work Stealing Workers
We should add work stealing to ensure that fast peers are able to dynamically take on more load and increase overall throughput. With work stealing, a peer would check at the top of its loop whether another peer has work that has yet to be claimed, and add that work to its own queue. The Go scheduler uses work stealing itself to ensure all processors are utilized: https://rakyll.org/scheduler/.
Along the way, we should modify the work queue to add better type safety via type params.
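A rough sketch of both points, with hypothetical names: a mutex-protected queue that uses type params for compile-time type safety, plus a `Steal` method a fast peer could call at the top of its loop when its own queue runs dry. This is not the current work manager API, just an illustration.

```go
// Hypothetical sketch of a type-parameterised work queue with a Steal
// method for idle workers.
package main

import (
	"fmt"
	"sync"
)

// workQueue is a minimal mutex-protected FIFO. Type params give the
// compile-time type safety mentioned above, instead of interface{}.
type workQueue[T any] struct {
	mu    sync.Mutex
	tasks []T
}

func (q *workQueue[T]) Push(t T) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.tasks = append(q.tasks, t)
}

// Pop takes the next task from the front of the queue (the owner's side).
func (q *workQueue[T]) Pop() (T, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	var zero T
	if len(q.tasks) == 0 {
		return zero, false
	}
	t := q.tasks[0]
	q.tasks = q.tasks[1:]
	return t, true
}

// Steal takes an unclaimed task from the back of another peer's queue, the
// classic work-stealing pattern (owners pop from the front, thieves from
// the back).
func (q *workQueue[T]) Steal() (T, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	var zero T
	if len(q.tasks) == 0 {
		return zero, false
	}
	t := q.tasks[len(q.tasks)-1]
	q.tasks = q.tasks[:len(q.tasks)-1]
	return t, true
}

func main() {
	slow := &workQueue[string]{}
	slow.Push("getcfilters batch 1")
	slow.Push("getcfilters batch 2")

	// A fast peer with nothing queued steals from the slow peer.
	if task, ok := slow.Steal(); ok {
		fmt.Println("stolen:", task)
	}
}
```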
As peers start to fill up, we'll allocate the task to the next peer in line. In the wild, we've seen that this ends up allocating tasks to slow peers.
I do not think so. Currently, tasks are assigned to a peer one at a time, and if the peer takes too long, the request times out and the task is reassigned to another worker (peer). In my understanding, with the current setup there is really "no work to steal".
I think the dynamic ping timeout would be useful though, in determining the most "realistic" time for a timeout given the RTT.
So we'll map the rankings into a probability that the peer will respond properly, and sample from that distribution instead of taking the peers in sorted order.
Currently, peers are awarded a score based on their history of responsiveness, which reduces or increases their likelihood of being assigned a task. Do you think that would suffice?
Along the way, we should modify the work queue to add better type safety via type params.
I do not think so. Currently, tasks are assigned to a peer one at a time, and if the peer takes too long, the request times out and the task is reassigned to another worker (peer).
So this is about more active queue management. That timeout is handled by the scheduler, rather than a peer stealing the work directly. What we have right now is a sorted list, and we assign directly from that, without being aware of any active load.
Instead, the scheduler should recognize that a peer's queue is full and allocate the task to another peer. We should also move to a thread-safe queue instead of the current channel-based mechanism.
Check out the link I posted at the end, re how work stealing works in the Go scheduler.
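To make the "aware of active load" point concrete, here is a minimal sketch under the assumption that the scheduler can see each peer's pending-task count; `peerLoad` and `assign` are hypothetical names, not the current work manager API.

```go
// Hypothetical sketch of load-aware assignment: skip peers whose queue is
// already full rather than blindly handing work to the next peer in the
// sorted list.
package main

import (
	"errors"
	"fmt"
)

type peerLoad struct {
	addr    string
	pending int // tasks already queued on this peer
	limit   int // how many in-flight tasks we allow it
}

// assign walks the ranked peer list and gives the task to the first peer
// with spare capacity, instead of the first peer regardless of load.
func assign(ranked []*peerLoad, task string) (string, error) {
	for _, p := range ranked {
		if p.pending < p.limit {
			p.pending++
			return p.addr, nil
		}
	}
	return "", errors.New("all peers are at capacity")
}

func main() {
	peers := []*peerLoad{
		{addr: "fast-peer", pending: 2, limit: 2}, // full
		{addr: "tor-peer", pending: 0, limit: 2},  // has room
	}
	addr, err := assign(peers, "getdata block")
	fmt.Println(addr, err) // tor-peer <nil>
}
```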
Currently, peers are awarded a score based on their history of responsiveness, which reduces or increases their likelihood of being assigned a task. Do you think that would suffice?
Re-reading this, I think it can be even simpler (see the sketch after this list):
Select two peers that are capable of taking on the work.
Pick one of them at random to assign the work to.
Have the peers work steal in the background to smooth things out.
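A minimal sketch of that two-step selection; the candidate list and the `pickTwoThenOne` helper are hypothetical, and in practice the capable peers would come from the existing ranking.

```go
// Hypothetical sketch of "select two capable peers, assign to one of them
// at random" from the list above.
package main

import (
	"fmt"
	"math/rand"
)

// pickTwoThenOne selects two distinct candidates uniformly at random from
// the peers that can take the work, then assigns to one of them at random.
func pickTwoThenOne(capable []string) (string, bool) {
	switch len(capable) {
	case 0:
		return "", false
	case 1:
		return capable[0], true
	}
	i := rand.Intn(len(capable))
	j := rand.Intn(len(capable) - 1)
	if j >= i {
		j++ // ensure the second choice is distinct from the first
	}
	candidates := []string{capable[i], capable[j]}
	return candidates[rand.Intn(2)], true
}

func main() {
	capable := []string{"peer-a", "peer-b", "peer-c"}
	peer, _ := pickTwoThenOne(capable)
	fmt.Println("assigned to", peer)
}
```

This is essentially the "power of two choices" load-balancing pattern: most of the benefit of full load-awareness, with very little coordination, and work stealing smooths out whatever imbalance remains.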
Along the way, we should modify the work queue to add better type safety via type params.