feat: enhanced end-to-end tests #10

NikitaMasych · 2024-09-09T13:24:33Z

What?

All tests refactored - added clarity on what is tested, comments, enhanced homogeneity of codebase, made it more rust-idiomatic.
Introducing shared test mocks package available with feature test-mocks.
Moved end-to-end tests to /tests as they describe integration.
Added more integration coverage.

Changes apart from tests

Fixed Bugs:

Threshold check is now ≥ instead of > as it should be in BFT requirement.
Threshold now accounts for party's own weight.

New Features:

Rate-limiting for incoming messages.
Threshold computation function to BPConConfig.
Migrate CI test running to nextest-rs.

Analysis on end-to-end coverage

Categories of parties

Good - party sends messages to other participants based on following events, and correctly receives and processes messages from other parties.
Faulty - party has troubles receiving/sending messages. These are simply mitigated by the weighed threshold and redundancy of consensus participants.
Malicious - party launches DDoS attack using unbounded sending of messages - to deal with this, we introduce rate-limiting mechanism in accepting messages inside the Party, however it is also required by integrating 'external' system, which handles P2P, to attest to this, because otherwise receiving channel may get flooded by malicious messages and block messages from other parties. Another way to cause trouble is by sending invalid messages. For this, each party has a set of checks for certain fields like current ballot number, status, etc. Additionally, if the state transition caused by incoming message errored, it does not impact the party in either way.

Note on the leader

If the leader of the ballot is faulty or malicious, the ballot deterministically fails and needs to be relaunched.

Note on the communication discrepancies

Each party has a certain period in which it may accept particular messages for a certain stage (example: having passed 1a stage, it is open for accepting only 1b messages for 2 seconds). These periods are configurable using BPConConfig.

In addition, it is possible to schedule parties to launch at a specific absolute time.

…ctions for other cases

…eparate mock data module

…n tests in Cargo.toml

…omputation function to config

aldur · 2024-09-16T04:43:44Z

Thank you @NikitaMasych, this PR definitely adds a lot of value!

While reviewing it, a few questions came to mind. Maybe the code already handles/test the cases mentioned below, in that case, great! Otherwise, I'd like to address them.

The tests I have seen focus on what happens within a single ballot. However, we have discussed that a leader (either malicious, or out of sync) could try starting another ballot, as the previous one is still in progress. Can you add a test, if it is not there already, where a leader at ballot N+1 (both rightly elected and malicious) starts a new ballot while the previous is ongoing?
I was double checking the mechanics around timers. What happens if a party, based on its internal clock, expects to be in eg stage 2a but it hasn't received the required messages for a state transition?
Have you considered alternatives to unbounded channels?

The rationale for these questions: implementing Paxos gets tricky if one wants to ensure liveness in the presence of faulty leaders (the first question) and when one needs to calibrate timers (the second questions).

NikitaMasych · 2024-09-18T14:41:39Z

Hey, @aldur! Apologies for the response delay, I was OOO a few days.

NikitaMasych · 2024-09-18T14:42:40Z

* The tests I have seen focus on what happens within a single ballot. However, we have discussed that a leader (either malicious, or out of sync) could try starting _another_ ballot, as the previous one is still in progress. Can you add a test, if it is not there already, where a leader at ballot N+1 (both rightly elected and malicious) starts a new ballot while the previous is ongoing?

Each party integrates LeaderElector component, which, in the default version, is seeded with some party state, including ballot number. Meaning that for the party with another ballot number than in leader party, elected leader would probably (depends on the amount of parties overall) be different, and no state transition will occur (code). Considering the leader is malicious and does not increment ballot number upon launching another one, as the first is still in progress, nothing bad will happen, since the only vectors of attack include, specific to elevated leader rights, the following:

send launch ballot message - if other parties are already in progress of the ballot, no state transition will occur due to status checks in each party (code)
send launch 2a message - same with status checks and, additionally, leader may send some incorrectly selected value, but that is mitigated using ValueSelector component in each party, which will error, if the elected value doesn't satisfy wanted criteria (code).

Furthermore, leader does not have control to launch ballot process in other parties - that does the integrating code (example), he may only launch 1a message triggering state change (1 bullet point)

NikitaMasych · 2024-09-18T14:43:21Z

* I was double checking the mechanics around timers. What happens if a party, based on its internal clock, expects to be in eg stage 2a but it hasn't received the required messages for a state transition?

If the party is trying to follow an event launching some stage, it needs corresponding state, which is updated mostly by other parties. In case the actual state does not match needed, the party exits the ballot with error (code).

NikitaMasych · 2024-09-18T14:44:56Z

* Have you considered alternatives to unbounded channels?

We did not, as channels seemed like a perfect abstract solution, not requiring delving into networking. Basically, it is akin to Go channels, which are quite popular in consensus algorithms.

NikitaMasych · 2024-09-18T14:48:25Z

The rationale for these questions: implementing Paxos gets tricky if one wants to ensure liveness in the presence of faulty leaders (the first question) and when one needs to calibrate timers (the second questions).

Overall, I agree with your statement, but there is not much we can do - that is mostly distributed computing problematics. Implementation allows configuring how leader will be elected, time bounds for rounds, amount of parties, their weights etc. - these are specific to integrating system requirements.

aldur · 2024-09-23T04:28:38Z

The rationale for these questions: implementing Paxos gets tricky if one wants to ensure liveness in the presence of faulty leaders (the first question) and when one needs to calibrate timers (the second questions).

Overall, I agree with your statement, but there is not much we can do - that is mostly distributed computing problematics. Implementation allows configuring how leader will be elected, time bounds for rounds, amount of parties, their weights etc. - these are specific to integrating system requirements.

Dear @NikitaMasych,

Thank you for the exhaustive answers. I am worried about leaving a lot of responsibilities (and burden) to the systems integrating this library -- for instance, when should a new ballot start? That said, I think that such considerations might be out of scope for this PR, which effectively addresses E2E tests. Thank you!

NikitaMasych · 2024-09-23T16:58:10Z

I would say that delegating responsibilities to the caller is the price of making the library abstract and flexible. In reality, I think the only unpleasant difficulty would be timing out the correct stage's launch event triggers. If you guys have any further questions, you are very welcome to ask so that I can help you with integration.

Regarding ballot process start timeout - it is a feature we enabled, not really necessary, but rather useful: this one allows you to launch all parties simultaneously, based on their clock, mitigating delay of the sequential launch and networking propagation. In case you don't need this - just set this to 0 and parties will launch instantly.

Would be thankful if you approve the PR, so that I am clear to merge it 🙂

matteojug · 2024-09-26T08:18:49Z

Hey @NikitaMasych , I was looking at the code (not only the diff here) and have a two general comments:

Messages are not signed in any way, so looking just at the code anyone can play as any other participant. Does Paxos assume anything specific on the channels requirements? Are they authenticated?
launch_ballot is one ballot, it should be repeated multiple times to overcome failures. Where is it done?

In particular, the second point is quite important as the current implementation assumes strict sync phases, eg ignoring any message that is not for the exact ballot the party is executing; but this (having well defined, sync phases across parties) would require some synchronization across parties, and I'm afraid to reach that we fall back in the consensus problem. An effect of this is that if for some reason some party fall back (eg because of some clock skew), it will start rejecting all the messages from (according to its view of current phase) future ballots, making impossible to recover it.

I also have some other general comments/questions about the code, eg:

Why do we use both bincode and rkyv?
DefaultLeaderElector::elect_leader (specifically, hash_to_range) panics for party_weights = vec![1] (or summing to 1):
...

Do you prefer I open some specific issues for them? To avoid hijacking this PR.
Thanks!

NikitaMasych · 2024-09-26T10:45:46Z

Hi @matteojug!

Signed Messages

This is a good question, haven't thought about it. Paxos specification itself does not mention anything regarding signing messages/communication channels as I understand because these are not consensus responsibility. In practice, I am not sure that we need it, because parties are managed by a centralized entity who creates them, distributes config, and in general each party has entanglement with centralized entity's P2P module regarding message communication channels, meaning that for some party to act like legit one, it would need to change receiving channel in centralized entity. So, the signatures are somewhat mitigated by rust specifics of channels entanglement.

I may be wrong, and then we'll need to add signing of the messages, but for that there will be another scope of work.

Repeated launch ballot

In case you would like to add some amount of retries before a ballot is considered failed, you may easily do that in your integrating module, which launches ballots in the first place. We have integration tests, which in primitive way demonstrate how to use the library and if you'd like to do graceful amount of runs, for example add for loop in this function (parties creation shall be excluded from loop).

Regarding your extended rationale to the question, party will fail itself ASAP if something is off with latencies etc. and you might launch it again. So, you have a set of parties on which you launch a ballot, and you need to wait for them all to exit before launching another try.

We indeed need to require good sync across parties, because there is actually no way how to do this differently. That is, you might allow bigger time periods between stages, so that each more-or-less syncing party would support the ballot.

Bincode and rkyv

We use rkyv since it is very performant, and it is our internal mechanics, meaning we don't need to worry about it being convenient to integrating code. Bincode, in contrary, allows more flexibility and is more recognisable by engineers in general, thus we use it for serde ops on Value, which should be implemented by clients.

`DefaultLeaderElector` panics for sums to 1

I missed this moment, thank you for noticing. It doesn't make practical sense though to run a ballot with such options. Each party shall have at least 1 weight, and the amount of parties overall needs to be more than 1. In case you are testing out the library, you can just implement your own mocked leader elector.

Contact vectors

I have opened discussions for this repository, feel free to message anything there / or create issues if appropriate 😇. Plus, we have a Slack channel, maybe you'd like to ask someone from your side to add yourself in it.

matteojug · 2024-09-26T12:38:32Z

Thank you for the answers, I've opened specific issues for the others to avoid mixing threads.

Signed Messages

Given the general nature of the library, it makes sense to delegate the authentication of messages outside. But to avoid surprises, can you make it an explicit assumption (in the readme and other relevant places) that the channel provider MUST authenticate the messages?
Also note that even in the current mpsc channel-based implementation, given that we don't have point to point channels, nothing prevents one party to impersonate someone else.

Repeated launch ballot

The issue is that the implementation assumes synchronicity in executing a single ballot, but parties may skip a ballot (eg, missing some message) and in general be out of sync. The BPCon algorithm (eg, looking at Lamport paper) takes this into account by having conditions (eg, an acceptor performs a ballot-b Phase 1b or 2b action only
if it has not performed an action for a higher-numbered ballot) that are more relaxed that the one used here (ie, a party ignores messages for ballots other than the one it thinks is the current one).

It's true that the process is repeated multiple times until converges, but in its current form it wouldn't work as if each one just start the next ballot, if they are out of sync then it will stay like that.

So, you have a set of parties on which you launch a ballot, and you need to wait for them all to exit before launching another try.

As we are talking about a distributed setting, we cannot really wait for them all to exit, as we fall back into the consensus problem again.

We indeed need to require good sync across parties, because there is actually no way how to do this differently

This may not be a required assumption of Paxos (based on what exactly are you referring to), but it's not just about having larger timeouts, the issue is that given the possibility of faults it may be possible that some message are lost, so it could be that multiple ballots takes place simultaneously for some party (based on its point of view).

NikitaMasych added 3 commits September 9, 2024 15:28

feat: refactored happy case test, splitting into several reusable fun…

7be28ac

…ctions for other cases

fix: threshould requirement shall allow equality

1448ec3

feat: added faulty party test

c17fd34

NikitaMasych added the enhancement New feature or request label Sep 9, 2024

NikitaMasych self-assigned this Sep 9, 2024

NikitaMasych changed the title ~~feat: enhanced end end tests~~ feat: enhanced end-to-end tests Sep 9, 2024

fix: resolve linter issue with complex return type

c1a689e

NikitaMasych force-pushed the feat/enhanced-end-end-tests branch from 5f31d60 to c1a689e Compare September 9, 2024 13:33

NikitaMasych added 15 commits September 10, 2024 14:22

fix: include own weight in threshold checks in update_state

9807e83

chore: improve update state warning source with comments

dd3e01c

feat: added rate limiting and malicious party test

8c3e5b0

feat: refactored tests - moved ballot tests to integration, created s…

e9be6d6

…eparate mock data module

feat: refactored integration tests, better results analysis

58891cd

feat: added faulty leader test for clarity

736e409

chore: update cargo.toml and readme

ad1f6fa

chore: fixed clippy check with --all-features

67fd502

feat: added check for own message receival, warnings

2d36dee

feat: improve config stages timeouts; feature-enabling for integratio…

fd0f116

…n tests in Cargo.toml

feat: added integration test for large amount of parties, threshold c…

9fca324

…omputation function to config

chore: flaky tests in rust ci

35808da

fix: threshold ceiling division

84d1d5c

chore: testing with nextest in rust workflow

3d99cc0

fix: changed scheduling to instant

b421020

NikitaMasych marked this pull request as ready for review September 11, 2024 02:26

NikitaMasych requested review from xoloki, aldur, olegfomenko and cylewitruk September 11, 2024 02:35

chore: added security considerations section to readme

4d70977

NikitaMasych merged commit a364193 into main Oct 17, 2024
11 checks passed

NikitaMasych deleted the feat/enhanced-end-end-tests branch October 17, 2024 14:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: enhanced end-to-end tests #10

feat: enhanced end-to-end tests #10

Uh oh!

NikitaMasych commented Sep 9, 2024 •

edited

Loading

Uh oh!

aldur commented Sep 16, 2024 •

edited

Loading

Uh oh!

NikitaMasych commented Sep 18, 2024

Uh oh!

NikitaMasych commented Sep 18, 2024 •

edited

Loading

Uh oh!

NikitaMasych commented Sep 18, 2024

Uh oh!

NikitaMasych commented Sep 18, 2024

Uh oh!

NikitaMasych commented Sep 18, 2024

Uh oh!

aldur commented Sep 23, 2024

Uh oh!

NikitaMasych commented Sep 23, 2024

Uh oh!

matteojug commented Sep 26, 2024

Uh oh!

NikitaMasych commented Sep 26, 2024

Uh oh!

matteojug commented Sep 26, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

feat: enhanced end-to-end tests #10

feat: enhanced end-to-end tests #10

Uh oh!

Conversation

NikitaMasych commented Sep 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What?

Changes apart from tests

Fixed Bugs:

New Features:

Analysis on end-to-end coverage

Categories of parties

Note on the leader

Note on the communication discrepancies

Uh oh!

aldur commented Sep 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NikitaMasych commented Sep 18, 2024

Uh oh!

NikitaMasych commented Sep 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NikitaMasych commented Sep 18, 2024

Uh oh!

NikitaMasych commented Sep 18, 2024

Uh oh!

NikitaMasych commented Sep 18, 2024

Uh oh!

aldur commented Sep 23, 2024

Uh oh!

NikitaMasych commented Sep 23, 2024

Uh oh!

matteojug commented Sep 26, 2024

Uh oh!

NikitaMasych commented Sep 26, 2024

Signed Messages

Repeated launch ballot

Bincode and rkyv

DefaultLeaderElector panics for sums to 1

Contact vectors

Uh oh!

matteojug commented Sep 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Signed Messages

Repeated launch ballot

Uh oh!

Uh oh!

Uh oh!

NikitaMasych commented Sep 9, 2024 •

edited

Loading

aldur commented Sep 16, 2024 •

edited

Loading

NikitaMasych commented Sep 18, 2024 •

edited

Loading

`DefaultLeaderElector` panics for sums to 1

matteojug commented Sep 26, 2024 •

edited

Loading