-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: enhanced end-to-end tests #10
Conversation
5f31d60
to
c1a689e
Compare
…eparate mock data module
…n tests in Cargo.toml
…omputation function to config
Thank you @NikitaMasych, this PR definitely adds a lot of value! While reviewing it, a few questions came to mind. Maybe the code already handles/test the cases mentioned below, in that case, great! Otherwise, I'd like to address them.
The rationale for these questions: implementing Paxos gets tricky if one wants to ensure liveness in the presence of faulty leaders (the first question) and when one needs to calibrate timers (the second questions). |
Hey, @aldur! Apologies for the response delay, I was OOO a few days. |
Each party integrates
Furthermore, leader does not have control to launch ballot process in other parties - that does the integrating code (example), he may only launch 1a message triggering state change (1 bullet point) |
If the party is trying to follow an event launching some stage, it needs corresponding state, which is updated mostly by other parties. In case the actual state does not match needed, the party exits the ballot with error (code). |
We did not, as channels seemed like a perfect abstract solution, not requiring delving into networking. Basically, it is akin to |
Overall, I agree with your statement, but there is not much we can do - that is mostly distributed computing problematics. Implementation allows configuring how leader will be elected, time bounds for rounds, amount of parties, their weights etc. - these are specific to integrating system requirements. |
Dear @NikitaMasych, Thank you for the exhaustive answers. I am worried about leaving a lot of responsibilities (and burden) to the systems integrating this library -- for instance, when should a new ballot start? That said, I think that such considerations might be out of scope for this PR, which effectively addresses E2E tests. Thank you! |
I would say that delegating responsibilities to the caller is the price of making the library abstract and flexible. In reality, I think the only unpleasant difficulty would be timing out the correct stage's launch event triggers. If you guys have any further questions, you are very welcome to ask so that I can help you with integration. Regarding ballot process start timeout - it is a feature we enabled, not really necessary, but rather useful: this one allows you to launch all parties simultaneously, based on their clock, mitigating delay of the sequential launch and networking propagation. In case you don't need this - just set this to 0 and parties will launch instantly. Would be thankful if you approve the PR, so that I am clear to merge it 🙂 |
Hey @NikitaMasych , I was looking at the code (not only the diff here) and have a two general comments:
In particular, the second point is quite important as the current implementation assumes strict sync phases, eg ignoring any message that is not for the exact ballot the party is executing; but this (having well defined, sync phases across parties) would require some synchronization across parties, and I'm afraid to reach that we fall back in the consensus problem. An effect of this is that if for some reason some party fall back (eg because of some clock skew), it will start rejecting all the messages from (according to its view of current phase) future ballots, making impossible to recover it. I also have some other general comments/questions about the code, eg:
Do you prefer I open some specific issues for them? To avoid hijacking this PR. |
Hi @matteojug! Signed MessagesThis is a good question, haven't thought about it. Paxos specification itself does not mention anything regarding signing messages/communication channels as I understand because these are not consensus responsibility. In practice, I am not sure that we need it, because parties are managed by a centralized entity who creates them, distributes config, and in general each party has entanglement with centralized entity's P2P module regarding message communication channels, meaning that for some party to act like legit one, it would need to change receiving channel in centralized entity. So, the signatures are somewhat mitigated by rust specifics of channels entanglement. I may be wrong, and then we'll need to add signing of the messages, but for that there will be another scope of work. Repeated launch ballotIn case you would like to add some amount of retries before a ballot is considered failed, you may easily do that in your integrating module, which launches ballots in the first place. We have integration tests, which in primitive way demonstrate how to use the library and if you'd like to do graceful amount of runs, for example add for loop in this function (parties creation shall be excluded from loop). Regarding your extended rationale to the question, party will fail itself ASAP if something is off with latencies etc. and you might launch it again. So, you have a set of parties on which you launch a ballot, and you need to wait for them all to exit before launching another try. We indeed need to require good sync across parties, because there is actually no way how to do this differently. That is, you might allow bigger time periods between stages, so that each more-or-less syncing party would support the ballot. Bincode and rkyvWe use rkyv since it is very performant, and it is our internal mechanics, meaning we don't need to worry about it being convenient to integrating code. Bincode, in contrary, allows more flexibility and is more recognisable by engineers in general, thus we use it for serde ops on
|
Thank you for the answers, I've opened specific issues for the others to avoid mixing threads. Signed MessagesGiven the general nature of the library, it makes sense to delegate the authentication of messages outside. But to avoid surprises, can you make it an explicit assumption (in the readme and other relevant places) that the channel provider MUST authenticate the messages? Repeated launch ballotThe issue is that the implementation assumes synchronicity in executing a single ballot, but parties may skip a ballot (eg, missing some message) and in general be out of sync. The BPCon algorithm (eg, looking at Lamport paper) takes this into account by having conditions (eg, an acceptor performs a ballot-b Phase 1b or 2b action only It's true that the process is repeated multiple times until converges, but in its current form it wouldn't work as if each one just start the next ballot, if they are out of sync then it will stay like that.
As we are talking about a distributed setting, we cannot really wait for them all to exit, as we fall back into the consensus problem again.
This may not be a required assumption of Paxos (based on what exactly are you referring to), but it's not just about having larger timeouts, the issue is that given the possibility of faults it may be possible that some message are lost, so it could be that multiple ballots takes place simultaneously for some party (based on its point of view). |
What?
All tests refactored - added clarity on what is tested, comments, enhanced homogeneity of codebase, made it more rust-idiomatic.
Introducing shared test mocks package available with feature
test-mocks
.Moved end-to-end tests to
/tests
as they describe integration.Added more integration coverage.
Changes apart from tests
Fixed Bugs:
≥
instead of>
as it should be in BFT requirement.New Features:
BPConConfig
.Analysis on end-to-end coverage
Categories of parties
Good
- party sends messages to other participants based on following events, and correctly receives and processes messages from other parties.Faulty
- party has troubles receiving/sending messages. These are simply mitigated by the weighed threshold and redundancy of consensus participants.Malicious
- party launches DDoS attack using unbounded sending of messages - to deal with this, we introduce rate-limiting mechanism in accepting messages inside theParty
, however it is also required by integrating 'external' system, which handlesP2P
, to attest to this, because otherwise receiving channel may get flooded by malicious messages and block messages from other parties. Another way to cause trouble is by sending invalid messages. For this, each party has a set of checks for certain fields like current ballot number, status, etc. Additionally, if the state transition caused by incoming message errored, it does not impact the party in either way.Note on the leader
If the
leader
of the ballot is faulty or malicious, the ballot deterministically fails and needs to be relaunched.Note on the communication discrepancies
Each party has a certain period in which it may accept particular messages for a certain stage (example: having passed 1a stage, it is open for accepting only 1b messages for 2 seconds). These periods are configurable using
BPConConfig
.In addition, it is possible to schedule parties to launch at a specific absolute time.