This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Protocol request: Direct group communication protocol for low-latency applications (<100ms) #446

Closed
oskarth opened this issue Aug 3, 2021 · 11 comments

Comments

@oskarth
Contributor

oskarth commented Aug 3, 2021

Problem

Some applications require lower-latency direct communication as a group. This can be due to a (soft) real-time communication requirement, for example video chat.

This can be either 1-1 or among a group of N participants.

Relay/Gossip latency

From https://research.protocol.ai/publications/gossipsub-v1.1-evaluation-report/

Gossipsub-v1.1 achieves timely delivery. In our test network, with 1k honest peers and connection RTTs of 100 ms, we have not found a case where the v1.1 protocol experienced delivery delays higher than 1.6 sec for the 99th percentile of the latency distribution, even in scenarios with Sybil:honest connection ratio as high as 40:1. The maximum latency observed was about 5s but that affected a few messages while the system was recovering from an attack.

This is what we are working with. More benchmarking etc. can be done, but gossiping over multiple hops in an open network will always incur some latency.
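As a back-of-envelope illustration of why multi-hop gossip struggles with a 100ms budget (all numbers below are assumptions for illustration, not measurements): a message traverses roughly log_d(N) hops in a mesh of degree d, and each hop costs on the order of half an RTT plus processing time.

```python
import math

# Rough sketch (assumed parameters): estimate gossip diffusion latency as
# hops * per-hop cost, with hops ~ ceil(log_d(N)) for mesh degree d.
# With RTT = 100 ms, a per-hop cost of ~50 ms already exceeds a 100 ms
# end-to-end budget for any path of 3+ hops.
def estimated_gossip_latency_ms(n_peers: int, mesh_degree: int = 8,
                                per_hop_ms: float = 50.0) -> float:
    hops = math.ceil(math.log(n_peers, mesh_degree))
    return hops * per_hop_ms

print(estimated_gossip_latency_ms(1000))  # 4 hops * 50 ms = 200.0
```

This is only a floor estimate for the happy path; the gossipsub evaluation quoted above reports 99th-percentile latencies up to 1.6s under attack, so the gap to a 100ms target is larger in practice.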

Example usage

Sketch

Basically we want to trade-off some metadata protection and flexibility for latency in a specific negotiated context.

We can use the relay protocol to discover peers to talk to, then negotiate a separate group context where all nodes can dial each other and communicate directly based on that context.

The simplest version would be a 1-1 direct voice chat, say. Initially via WebSockets but WebRTC (or possibly QUIC?) would be useful to do things like video chat in a browser.

There may be some more infrastructure work needed on libp2p to make this suitable for voice/video; cc @dryajov re this.

100ms is based on general response time limits (https://www.nngroup.com/articles/response-times-3-important-limits/) as well as intuition re things like FPS gaming for "real time feel".

Acceptance criteria

  1. Issue with more limited scope for PoC

  2. Better understanding of hard requirements and required work / reduced uncertainty on things like:

  • WebRTC and/or QUIC (plus infrastructure needed for it)
  • libp2p possibilities and limitations, especially for things like running from a browser
  • understanding of how limited group negotiation would work, with joining context/consensus etc (make simplifying assumptions for initial spec)

^ @D4nte @jm-clius @arnetheduck @staheri14 FYI

@staheri14
Contributor

staheri14 commented Aug 3, 2021

If nodes in the limited context are supposed to be trusted and to experience no churn, then, for a more advanced solution, you may want to consider a Kademlia routing overlay, which features lower storage overhead (logarithmic instead of linear) and logarithmic routing complexity. @oskarth

Update: Had another look at the issue, I think Kademlia might not be very relevant.

@D4nte
Contributor

D4nte commented Mar 10, 2022

Draft notes for potential bounty

outcome sketch:

  • app layer protocol over existing waku protocols that help establish webrtc endpoints
    • raw spec and protobufs/poc impl
  • handover to new protocol poc for 1-1 webrtc unencrypted audio/text direct
    • raw specs and protobuf/poc impl

Could possibly split this up

User story: As a user of Waku you should be able to find other nodes (e.g. in chat) and then establish a direct WebRTC connection

@zah

zah commented Apr 13, 2022

Nimbus also has a use case for this, where we would allow a group of Nimbus beacon nodes to work together in a way that ensures there is no single point of failure in the system. Low latency is key for ensuring that all validator actions are performed in time (so validator rewards don't suffer as a result of latency), and Vac/Waku seem useful in the sense that they may allow the group to be formed with almost zero network configuration. The nodes can form groups automatically based on the validator identities, and the user wouldn't have to deal with things such as public/private IP addresses, port forwarding, VPNs, etc.

@D4nte
Contributor

D4nte commented Apr 13, 2022

Thanks for the input @zah.

What is the current roadmap? Is this something you would like us to explore further?

@fryorcraken
Contributor

I wonder if the best way forward would be to create a nwaku PoC.

According to the requirements above and from https://notes.status.im/waku-vac-devcon-2022#

Nimbus is interested in an easy way for a group of beacon nodes to establish a direct P2P connection with low latency (e.g. it could be a WebRTC connection). The primary goal of operating such a group of nodes is to increase the resilience of the system (to remove single points of failure) and to address some concerns regarding possible DDoS attacks against a single beacon node. Since all nodes in the group will be owned by the same operator, there are no privacy concerns regarding the communication channel and having extremely low latency is the most desirable property. To get maximum safety, the nodes may be located in different data centers within a single city. Similarly, solo stakers may choose to run nodes from their homes, offices, etc, so another desirable property of the system is having zero configuration. This would be similar to a VPN network where the nodes can find each other regardless of the current physical network they are attached to. In Vac/Waku parlance, this could be considered a group channel with an automatically derived name (e.g. the name can be derived from the private key of the operated validators). In other words, the nodes would use the automatically determined channel to find each other to orchestrate the establishing of a full P2P mesh between them (potentially performing UDP hole punching in the process).

It seems that we still need some NAT traversal/hole punching first in nwaku/nim-libp2p for that. @jm-clius what is the status of this, and which issues are tracking it?

Some design assumptions:

  1. Sym key encryption based on private key of the validator (e.g. double hash of private key)
  2. Content topics based on validator keys (e.g. double hash of public key)
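The two design assumptions above can be sketched as follows. The hash function and encodings are my assumptions for illustration (SHA-256 double hash), and the keys are placeholders; the actual derivation would need to be specified.

```python
import hashlib

# Sketch (assumptions: SHA-256 as the hash, hex encoding of the digest).
def double_sha256(data: bytes) -> str:
    """Double hash: sha256(sha256(data)), hex-encoded."""
    return hashlib.sha256(hashlib.sha256(data).digest()).hexdigest()

validator_privkey = b"\x01" * 32   # hypothetical validator private key
validator_pubkey = b"\x02" * 33    # hypothetical validator public key

sym_key = double_sha256(validator_privkey)     # assumption 1: sym encryption key
pubkey_hash = double_sha256(validator_pubkey)  # assumption 2: topic component

disc_topic = f"/nimbus/0/disc/{pubkey_hash}/proto"
tx_topic = f"/nimbus/0/tx/{pubkey_hash}/proto"
```

Note zah's caveat below that deriving the topic from the public key hash lets third parties speculatively monitor known public keys, so this derivation would likely change.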

Possible protocol (Alice, Bob are different nodes handled by the same validator as described above)

  1. Alice connects to Waku network
  2. Alice discovers her external IP address/port (e.g. via libp2p identify)
  3. Alice broadcasts her ENR on discovery content topic /nimbus/0/disc/<pubkey hash>/proto
  4. Bob connects to Waku network, retrieves ENRs from discovery content topic /nimbus/0/disc/<pubkey hash>/proto
  5. Bob directly connects to Alice
  6. Bob/Alice add each other as Waku Relay direct peers to ensure they are in each other's gossipsub meshes
    i. OR, light push is used for even more resilience
  7. Alice/Bob use other content topics to exchange messages, i.e. /nimbus/0/tx/<pubkey hash>/proto
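The discovery steps above (3-5) can be simulated with a minimal in-memory stand-in for the relay network. Everything here is hypothetical, not a real Waku API; it only shows the publish/retrieve/dial sequence.

```python
# In-memory stand-in (all names hypothetical) for the discovery flow:
# Alice publishes her ENR on the discovery content topic, Bob retrieves
# it and dials her directly, then adds her as a direct peer.
class FakeRelay:
    """Toy topic store simulating publish/retrieve on content topics."""
    def __init__(self):
        self.topics = {}

    def publish(self, topic: str, msg: str) -> None:
        self.topics.setdefault(topic, []).append(msg)

    def retrieve(self, topic: str) -> list:
        return list(self.topics.get(topic, []))

DISC_TOPIC = "/nimbus/0/disc/<pubkey hash>/proto"

relay = FakeRelay()
relay.publish(DISC_TOPIC, "enr:alice")      # step 3: Alice broadcasts her ENR
enrs = relay.retrieve(DISC_TOPIC)           # step 4: Bob retrieves ENRs
direct_peers = [enr for enr in enrs]        # steps 5-6: dial, add as direct peer
```

A real implementation would publish on Waku Relay (or via light push) and dial the multiaddrs embedded in each ENR.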

Other ideas:

  • The nodes could use a separate pubsub topic to exchange messages post-connection, to ensure the messages are not leaked to the network (they would be encrypted either way)

@cskiraly

cskiraly commented Oct 6, 2022

A similar topic went under the name "Application-Layer Multicast" some time ago. Focusing on low latency, I could point to deadline-based schedulers (Abeni, L., Kiraly, C., Lo Cigno, R. (2009). On the Optimal Scheduling of Streaming Applications in Unstructured Meshes), and some other work we did in low-latency video distribution.
Basically, what you want to achieve is fast initial diffusion at the individual message level, plus a peer/chunk selection policy that cuts the tail of the delay distribution by making lagging messages "catch up".

This means:

  • eliminating some parallelism from message sending: since messages may span multiple packets, you don't want to interleave them too much, so that you optimize the distribution of one-hop receive delays.
  • having nodes specialize in distributing some messages while deprioritizing others: this comes from the observation that if you are early in the diffusion tree of a specific message, your contribution can be much more important than at the end of the diffusion tree, where most of your efforts just generate duplicates. This also means breaking away from the fixed-degree notion of gossipsub.
  • for catch-up of messages lagging behind, and to flatten the delay distribution, one can use deadline-based scheduling.
  • finally, one can do some limited cut-through routing. This is a controversial topic as it can propagate and amplify errors, but it can be hop-limited to contain the risk.

These together can nicely reduce the overall latency distribution.
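The deadline-based catch-up idea can be sketched with a simple priority queue: always forward the lagging chunk whose deadline is closest. The class and method names below are my own illustration, not from the cited paper.

```python
import heapq

# Minimal sketch (names/interface are assumptions) of deadline-based
# scheduling for catch-up: chunks are queued with their delivery deadline
# and the most urgent one is forwarded first, cutting the tail of the
# delay distribution by letting lagging chunks "catch up".
class DeadlineScheduler:
    def __init__(self):
        self._heap = []  # min-heap of (deadline, chunk_id)

    def enqueue(self, chunk_id: str, deadline: float) -> None:
        heapq.heappush(self._heap, (deadline, chunk_id))

    def next_chunk(self):
        # Pop the chunk with the nearest deadline, or None if idle.
        return heapq.heappop(self._heap)[1] if self._heap else None

sched = DeadlineScheduler()
sched.enqueue("chunk-a", deadline=3.0)
sched.enqueue("chunk-b", deadline=1.0)
sched.enqueue("chunk-c", deadline=2.0)
order = [sched.next_chunk() for _ in range(3)]  # most urgent first
```

A real scheduler would also drop chunks whose deadline has already passed, since forwarding them wastes bandwidth without improving the latency tail.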

@zah

zah commented Oct 6, 2022

The usefulness of the proposed Nimbus setup increases dramatically when there are at least 3 nodes in the group (you would then use 2 out of 3 threshold signing to allow one of the nodes to be offline without disrupting the system). The ideal setup would involve 5 nodes configured with 3 out of 5 threshold signing.

Using the public key hash in the topic name is not an ideal solution as this would allow other nodes on the network to speculatively monitor all public keys to discover the ENRs of the participating nodes, but this is just a detail for which we'll surely find an appropriate solution.

A setup with more nodes won't improve reliability further, but latency will increase, so I think for our use case we care about group sizes of up to 5 nodes. Due to this, I think a full mesh would be the most appropriate topology (every node sends its own messages to all other nodes).
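The arithmetic behind the full-mesh suggestion: with n nodes, a full mesh needs n(n-1)/2 bidirectional links, which stays trivially small at the group sizes discussed.

```python
# Full mesh of n nodes: every pair connected, i.e. n*(n-1)/2 links.
# For the group sizes under discussion (3 or 5 nodes) this is tiny,
# so routing overlays would add latency without saving resources.
def full_mesh_links(n: int) -> int:
    return n * (n - 1) // 2

print(full_mesh_links(3), full_mesh_links(5))  # 3 10
```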

@jm-clius
Contributor

jm-clius commented Oct 6, 2022

It seems that we still need some nat traversal/hole punching first in nwaku/nim-libp2p for that. @jm-clius what is the status for this and what issues are tracking?

This is tracked as medium-to-high priority (my interpretation) in the nim-libp2p roadmap: vacp2p/nim-libp2p#777

@Menduist

Menduist commented Oct 6, 2022

nim-libp2p can already be used as a hole punching server (AutoNAT & relay are available), but cannot hole punch itself (DCUtR is still missing for that)

@fryorcraken
Contributor

Such an enhancement would also be interesting for larger data transfers.

@jimstir
Contributor

jimstir commented Jun 13, 2024

Issue moved here

@jimstir jimstir closed this as completed Jun 13, 2024
@jimstir jimstir closed this as not planned Jun 14, 2024
9 participants