Update Message to support larger number of peers #114

Merged · 10 commits · Sep 30, 2022

Conversation

@lalexgap (Contributor) commented Sep 28, 2022

Fixes #111
Fixes #29

It looks like our previous implementation of the libp2p message service had some limitations. Whenever we put a large amount of load on the message service, we would get "stream reset" errors. However, we didn't notice this because we silently close the stream when we get a stream reset.
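
To illustrate the failure mode, here is a hypothetical sketch of the kind of long-running read loop that hides resets; it is not the previous contents of messaging/service.go, and all names are invented for the example:

```go
package messaging

import (
	"bufio"
	"log"

	"github.com/libp2p/go-libp2p/core/network"
)

// readLoop illustrates the long-running-stream pattern: it keeps reading
// messages until any error occurs, then closes the stream without surfacing
// the error, so "stream reset" failures go unnoticed.
func readLoop(s network.Stream) {
	reader := bufio.NewReader(s)
	for {
		msg, err := reader.ReadString('\n')
		if err != nil {
			s.Close() // silently swallow the error, including stream resets
			return
		}
		log.Printf("received message of %d bytes", len(msg))
	}
}
```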

Changes

  • Updates the message service to open and close a stream per message. I've based this on this chat example. This avoids the "stream reset" errors by not keeping long-running streams open. (A rough sketch follows this list.)
  • Adds some basic retry logic when attempting to establish a stream to a peer. If we fail, we retry up to 20 times, waiting 5 seconds between each retry.
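
A rough Go sketch of the per-message stream pattern with this retry policy, purely for illustration: the function names, constants, protocol ID, and import paths are assumptions rather than the actual contents of messaging/service.go, and they assume a recent go-libp2p with the consolidated core module.

```go
package messaging

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

// Hypothetical constants mirroring the retry policy described above.
const (
	numRetries    = 20
	retryInterval = 5 * time.Second
	protocolID    = protocol.ID("/nitro/msg/1.0.0") // placeholder protocol ID
)

// sendMessage opens a fresh stream for a single message, writes the payload,
// and closes the stream, so no long-running stream is kept around. Opening
// the stream is retried up to numRetries times, waiting retryInterval
// between attempts.
func sendMessage(ctx context.Context, h host.Host, to peer.ID, raw []byte) error {
	var lastErr error
	for i := 0; i < numRetries; i++ {
		s, err := h.NewStream(ctx, to, protocolID)
		if err != nil {
			lastErr = err
			time.Sleep(retryInterval)
			continue
		}
		if _, err := s.Write(raw); err != nil {
			_ = s.Reset()
			return err
		}
		return s.Close() // close immediately after sending
	}
	return lastErr
}
```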

Effects

After these changes I've been able to successfully run a test with 50 participants. I've also been able to run a 3-participant test with a concurrency of 10,000.

There still appear to be two failure modes when you add enough participants (more than ~50):

  • A message service fails to establish a stream within 20 attempts.
  • The Hardhat Docker instance crashes on the VM.

We might be reaching the limits of the cloud VM; a client may simply not have the computing power to handle so many TCP requests in a timely manner.

I'm still going to close #111 and #29, and I've opened a new issue for investigating the performance after this change: see #115.

Performance Comparison

Simple Scenario

A simple scenario with 3 participants, concurrency 1, run for 2 minutes.

Before: 13.7 ms

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664473900255&to=1664474074277&var-runId=ccqtme8nr2gk3i239hmg&var-jobCount=1&var-testDuration=2m0s&var-hubs=1&var-payees=1&var-jitter=0&var-latency=0&var-payers=1&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

After: 14.0 ms

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664474132546&to=1664474342866&var-runId=ccqtofgnr2gk3i239hn0&var-jobCount=1&var-testDuration=2m0s&var-hubs=1&var-payees=1&var-jitter=0&var-latency=0&var-payers=1&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

Benchmark Scenario

Our established "benchmark" scenario.

Before: 4.20 s

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664475776074&to=1664475886730&var-runId=ccqu5a0nr2gk3i239hq0&var-jobCount=10&var-testDuration=30s&var-hubs=1&var-payees=1&var-jitter=2&var-latency=15&var-payers=10&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

After: 4.66 s

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664475882755&to=1664475952310&var-runId=ccqu5h0nr2gk3i239hqg&var-jobCount=10&var-testDuration=30s&var-hubs=1&var-payees=1&var-jitter=2&var-latency=15&var-payers=10&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

Long Benchmark Scenario

Our established "benchmark" scenario but run for 2 minutes.

Before: 7.56 s

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664476483036&to=1664476752826&var-runId=ccquap0nr2gk3i239ht0&var-jobCount=10&var-testDuration=2m0s&var-hubs=1&var-payees=1&var-jitter=2&var-latency=15&var-payers=10&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

After: 7.12 s

http://34.168.92.245:3000/d/5OBBeW37k/time-to-first-payment?orgId=1&from=1664476878409&to=1664477152413&var-runId=ccqueb8nr2gk3i239hug&var-jobCount=10&var-testDuration=2m0s&var-hubs=1&var-payees=1&var-jitter=2&var-latency=15&var-payers=10&var-payeepayers=0&var-nitroVersion=v0.0.0-20220922174011-3e33cafaa1f3

@lalexgap changed the title from "WIP: P2P Message service no longer uses long running streams" to "Update Message to support larger number of peers" on Sep 29, 2022
@lalexgap marked this pull request as ready for review on September 29, 2022 18:32
@geoknee (Contributor) left a comment

I'd say a 5x increase in the maximum number of clients we can spin up in our test is pretty awesome progress! Still lots of digging to do, of course, but I think this change makes a lot of sense.

@geoknee (Contributor) commented Sep 30, 2022

The link to the run with 50 nodes seems to show a certain level of success, but I don't think the stats made their way into InfluxDB. I can't see any data in Grafana, and there are some errors about timeouts in the logs?

@lalexgap (Contributor, Author) replied:

50 nodes is right around the point where we see things break down, so I may have linked a run that didn't properly succeed. Here's a run with 40 participants.

@lalexgap (Contributor, Author) commented Oct 4, 2022

I think the chat example I linked is incorrect. Here's an example where the stream is closed after sending a message.
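
For completeness, a minimal sketch of what the receiving side of that close-after-send pattern could look like (illustrative only, not the code from the linked example; the names are invented):

```go
package messaging

import (
	"io"
	"log"

	"github.com/libp2p/go-libp2p/core/network"
)

// handleStream reads a single message from an incoming stream and then
// closes it. Because the sender closes its side right after writing,
// io.ReadAll returns as soon as it sees EOF.
func handleStream(s network.Stream) {
	data, err := io.ReadAll(s)
	if err != nil {
		_ = s.Reset() // signal abnormal termination to the peer
		return
	}
	_ = s.Close()
	log.Printf("received %d bytes", len(data)) // hand the message off for processing here
}
```

Such a handler would be registered once per protocol, e.g. via host.SetStreamHandler.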

Successfully merging this pull request may close these issues:

  • Tests fail/slow with a high amount of concurrency
  • Test fails with a large amount of clients