Update Message to support larger number of peers #114

Merged · 10 commits · Sep 30, 2022

Conversation

@lalexgap (Contributor) commented Sep 28, 2022

Fixes #111
Fixes #29

It looks like our previous implementation of the libp2p message service had some limitations. Whenever we put a large amount of load on the message service, we would get "stream reset" errors. However, we didn't notice this because we silently close the stream when we get a stream reset.
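
To illustrate the failure mode, here is a hypothetical sketch of the kind of long-running read loop that hides resets; it is not the previous contents of messaging/service.go, and all names are invented for the example:

```go
package messaging

import (
	"bufio"
	"log"

	"github.com/libp2p/go-libp2p/core/network"
)

// readLoop illustrates the long-running-stream pattern: it keeps reading
// messages until any error occurs, then closes the stream without surfacing
// the error, so "stream reset" failures go unnoticed.
func readLoop(s network.Stream) {
	reader := bufio.NewReader(s)
	for {
		msg, err := reader.ReadString('\n')
		if err != nil {
			s.Close() // silently swallow the error, including stream resets
			return
		}
		log.Printf("received message of %d bytes", len(msg))
	}
}
```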

Changes

  • Updates the message service to open and close a stream per message. I've based this on this chat example. This avoids the "stream reset" errors by not keeping long-running streams open. (A rough sketch follows this list.)
  • Adds some basic retry logic when attempting to establish a stream to a peer. If we fail, we retry up to 20 times, waiting 5 seconds between each retry.
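
A rough Go sketch of the per-message stream pattern with this retry policy, purely for illustration: the function names, constants, protocol ID, and import paths are assumptions rather than the actual contents of messaging/service.go, and they assume a recent go-libp2p with the consolidated core module.

```go
package messaging

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

// Hypothetical constants mirroring the retry policy described above.
const (
	numRetries    = 20
	retryInterval = 5 * time.Second
	protocolID    = protocol.ID("/nitro/msg/1.0.0") // placeholder protocol ID
)

// sendMessage opens a fresh stream for a single message, writes the payload,
// and closes the stream, so no long-running stream is kept around. Opening
// the stream is retried up to numRetries times, waiting retryInterval
// between attempts.
func sendMessage(ctx context.Context, h host.Host, to peer.ID, raw []byte) error {
	var lastErr error
	for i := 0; i < numRetries; i++ {
		s, err := h.NewStream(ctx, to, protocolID)
		if err != nil {
			lastErr = err
			time.Sleep(retryInterval)
			continue
		}
		if _, err := s.Write(raw); err != nil {
			_ = s.Reset()
			return err
		}
		return s.Close() // close immediately after sending
	}
	return lastErr
}
```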

Effects

After these changes I've been able to successfully run a test with 50 participants. I've also been able to run a 3-participant test with a concurrency of 10,000.

There still appear to be two failure modes when you add enough participants (more than ~50):

  • A message service fails to establish a stream within 20 attempts.
  • The Hardhat Docker instance crashes on the VM.

We might be reaching the limits of the cloud VM; a client may simply not have the computing power to handle so many TCP requests in a timely manner.

I'm still going to close #111 and #29, and I've opened a new issue for investigating the performance after this change: see #115.

Performance Comparison

Simple Scenario

A simple scenario with 3 participants, concurrency 1, run for 2 minutes.

Before: 13.7 ms

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664473900255&to=1664474074277&var-runId=ccqtme8nr2gk3i239hmg&var-jobCount=1&var-testDuration=2m0s&var-hubs=1&var-payees=1&var-jitter=0&var-latency=0&var-payers=1&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

After: 14.0 ms

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664474132546&to=1664474342866&var-runId=ccqtofgnr2gk3i239hn0&var-jobCount=1&var-testDuration=2m0s&var-hubs=1&var-payees=1&var-jitter=0&var-latency=0&var-payers=1&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

Benchmark Scenario

Our established "benchmark" scenario.

Before: 4.20 s

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664475776074&to=1664475886730&var-runId=ccqu5a0nr2gk3i239hq0&var-jobCount=10&var-testDuration=30s&var-hubs=1&var-payees=1&var-jitter=2&var-latency=15&var-payers=10&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

After: 4.66 s

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664475882755&to=1664475952310&var-runId=ccqu5h0nr2gk3i239hqg&var-jobCount=10&var-testDuration=30s&var-hubs=1&var-payees=1&var-jitter=2&var-latency=15&var-payers=10&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

Long Benchmark Scenario

Our established "benchmark" scenario but run for 2 minutes.

Before: 7.56 s

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664476483036&to=1664476752826&var-runId=ccquap0nr2gk3i239ht0&var-jobCount=10&var-testDuration=2m0s&var-hubs=1&var-payees=1&var-jitter=2&var-latency=15&var-payers=10&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

After: 7.12 s

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664476878409&to=1664477152413&var-runId=ccqueb8nr2gk3i239hug&var-jobCount=10&var-testDuration=2m0s&var-hubs=1&var-payees=1&var-jitter=2&var-latency=15&var-payers=10&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

@lalexgap changed the title from "WIP: P2P Message service no longer uses long running streams" to "Update Message to support larger number of peers" on Sep 29, 2022
@lalexgap marked this pull request as ready for review on September 29, 2022 18:32
@geoknee (Contributor) left a comment

I'd say a 5x increase in the maximum number of clients we can spin up in our test is pretty awesome progress! Still lots of digging to do, of course, but I think this change makes a lot of sense.

@geoknee (Contributor) commented Sep 30, 2022

The link to the run with 50 nodes seems to show a certain level of success, but I don't think the stats made their way into InfluxDB. I can't see any data in Grafana, and there are some errors about timeouts in the logs?

@lalexgap (Contributor, Author) replied:

50 nodes is right around the point where we see things break down, so I may have linked a run that didn't properly succeed. Here's a run with 40 participants.

@lalexgap (Contributor, Author) commented Oct 4, 2022

I think the chat example I linked is incorrect. Here's an example where the stream is closed after sending a message.
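
For completeness, a minimal sketch of what the receiving side of that close-after-send pattern could look like (illustrative only, not the code from the linked example; the names are invented):

```go
package messaging

import (
	"io"
	"log"

	"github.com/libp2p/go-libp2p/core/network"
)

// handleStream reads a single message from an incoming stream and then
// closes it. Because the sender closes its side right after writing,
// io.ReadAll returns as soon as it sees EOF.
func handleStream(s network.Stream) {
	data, err := io.ReadAll(s)
	if err != nil {
		_ = s.Reset() // signal abnormal termination to the peer
		return
	}
	_ = s.Close()
	log.Printf("received %d bytes", len(data)) // hand the message off for processing here
}
```

Such a handler would be registered once per protocol, e.g. via host.SetStreamHandler.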

Successfully merging this pull request may close these issues:

  • Tests fail/slow with a high amount of concurrency
  • Test fails with a large amount of clients