Skip to content
This repository has been archived by the owner on Feb 15, 2024. It is now read-only.

Minimal fixes to address NCFP-10: Do not allow "red" in-queue verifiers anymore. #25

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

EggPool
Copy link

@EggPool EggPool commented Jul 29, 2020

Reference:
#22

By voting Yes to ncfp-10, current cycle acknowledged the issue with always red verifiers being able to remain in queue with no actual verifier running, just spamming nodejoins once in "nodes" file.

This patch does

  • persist the new communicationFailureCount metric in nodes file.
  • rebalance increment/reset of failure count by reseting communicationFailureCount on successful answer only, never on incoming messages alone.
  • log nodejoins messages in a more verbose way, so external scripts can use that info (ip and short-id) to auto-ban or rate limit spamming nodes.
  • adds some context in changelog.md

This version has been tested since several days on several verifiers (both in-cycle and in-queue).
v596 and 597 have been integrated since so this PR can be merge with no conflict.

A preview of what the eligible (> 30 days) nyzo queue would look like if present at https://nyzo.today/queue from a modded verifier.
https://nyzo.today/dropped is a searchable table of "dropped" verifiers (nodes that are in the original nodes file but dropped from the modded version due to being consistently inactive for too long)

The difference with current nyzo queue is striking, and no false positive was reported so far.
This is just a first step to address huge number of verifiers abusing current code, but in no way a definitive answer.
This only patches today's most urging issue in the most minimal and side effect proof way.

@itsMarcoSolis
Copy link

itsMarcoSolis commented Jul 29, 2020

By comparing the queue before and after applying this fix it is evident that a major improvement in the joining process has been achieved. Not a single member of the community has reported a false positive so it's safe to asume this change won't impact the cycle on a negative manner.
I hope the core devs can merge this PR soon.

@bh00ker
Copy link

bh00ker commented Jul 29, 2020

I fully support this proposed PR. Been following closely. I think it addresses NCFP-10 quite well.

@m-burner
Copy link

I'm against this PR. We need a vote from the whole community.
If someone decides to make changes that cut 3/4 of the queue, a general decision is needed.

@bh00ker
Copy link

bh00ker commented Jul 29, 2020

I'm against this PR. We need a vote from the whole community.
If someone decides to make changes that cut 3/4 of the queue, a general decision is needed.

We voted that red verifiers shouldnt be allowed to participate in the queue / lottery. This addresses it, and well I might add.

@orisica
Copy link

orisica commented Jul 29, 2020

Great work, about time we remove fake verifiers from the queue and resume Proof Of Diversity which is in accordance with cycle vote.

@Syd-ai
Copy link

Syd-ai commented Jul 29, 2020

Great work, it will definitely grants a better diversity to nyzo cycle. Looking forward for a merge 🙏

@setanimals
Copy link

Supported! TY Sy for all the great work on this.

@l-u-f-o
Copy link

l-u-f-o commented Jul 29, 2020

Reviewed the code and test results. This PR has my full support. The cycle tx result of NCFP-10 also shows support by the community.

@nsejourne
Copy link

I want this PR to be merged !

@krypdkat
Copy link

krypdkat commented Aug 3, 2020

LGTM 👍 tested for 4 days. No issue.

I think you should resolve the conflict, squash all commits into 1 and move the ChangeLog to somewhere else to make it easier for the dev to build the next version upon your PR 🙂

@@ -96,7 +101,9 @@ private static void updateNode(byte[] identifier, byte[] ipAddress, int portTcp,
if (portUdp > 0) {
existingNode.setPortUdp(portUdp);
}
existingNode.markSuccessfulConnection();
if (isNodeJoinResponse) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would need to be reviewed in detail. It could likely have a negative effect on in-cycle connectivity.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing and commenting.
NCFP-10 shows the cycle considers this an important matter, with long lasting and destructive effect on the Nyzo cycle and concept itself.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't make such grand claims. We know people are worked up about this, and we sincerely appreciate your involvement and effort, but a bunch of red nodes does not have a "long lasting and destructive effect on the Nyzo cycle and concept itself." Remember that the original entrance process was a simple FIFO queue. It was, in fact, sufficient, though not ideal. The lottery is light-years ahead of the FIFO queue. As long as you have any source of scarcity as a reference for entrance, and as long as no single entity can cheaply control significantly more than half of that source of scarcity for long enough (years, with the current cycle size), then the proof of diversity is safe.

IP addresses are a source of scarcity. A lot of people feel that you should have a verifier continually running behind those IP addresses. We don't. We feel that controlling an IP address is enough, and the current system already enforced that. We also want to eliminate wastefulness as much as possible, and forcing someone to run a verifier is wasteful. Additionally, running a single verifier that receives entire blocks of node-join requests would be incredibly easy. So, you're more likely to cut out the small-time operators' red nodes while the larger blocks of red nodes are more likely to set up a single verifier to back them (and you'd only need to forward incoming 9444 requests to that verifier, leaving the IP addresses otherwise usable). This would further reward the big blocks while pushing out more of the little guys.

Overall, eliminating red nodes from the lottery, while inadvertently catching a number of white nodes in the process, drastically reduced the diversity of the lottery and harmed the diversity of the system as a result. There are currently 3,374 white nodes on https://nyzo.today/dropped.

Your idea of IP-based lottery scoring is, however, a wonderful solution. When you really dig into the idea and start looking at all of the facets of it, it's absolutely brilliant and very much a Nyzo-esque solution. It's simple, it encourages diversity, and it encourages good behavior. These are all natural properties that just fall out of the solution, not things that have to be implemented with complex machinations. We will be releasing an NTTP soon to explore this further.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please accept my apologies for the "grand claim".
With no feedback from you, no idea of what your stance on this issue - or if this was an issue at all, we tried our best to cope with what many saw as an effective attack on the queue (more than 50% joins) and therefore the cycle.
With the current cycle size indeed, it would take years. But who knows how many verifiers these groups already do control?
The first c-class ip network seen in the queue goes back to a pretty long time (jim tweeted at that time iirc, can't find the date back)

As for the destructive effect, it's also a psychological one on small operators and individuals.
How can you pay for a vps to add one ip to the queue when you see 10's of red c-classes already? This discourages all the small fishes, and I know it encouraged knowledgeable users to level up their game as well.

This PR was more of a first - temporary - answer, meant to be quickly applied and give time for a more thoughtful one (the lottery).
I'm aware that even with this PR, operators of full c-class could stay in the queue, just that would require more work and resources for them.

Your stance on wastefulness we had, but your stance of not necessarily needing to operate a full verifier is also an important data we did not have until now.
Speaking for myself, while I'm not fond of waste at all, I see the current working as hugely balanced toward isps (see odessa) or companies already owning lots of ips - for other uses - so their marginal cost to run a huge amount of in-queue (significantly more than half of the queue currently) is null.
This PR does not address that, only the c-class aware lottery can.

As for white verifiers caught in the process, the image nyzo.today/dropped shows is the result of 2 things:

  • drops due to failedCount
  • bans due to nodejoin spam
    Excessive nodejoins spams are auto-banned on this system that was heavily spammed, this could play a role.

tl;dr: I do agree this PR is not a full-featured solution and has its flaws.
It was thought as a minimal patch and impactless (so I believed) temporary solution - quick to apply - so to regain the confidence of small and average actors that can not operate bunches of 256 ips, limiting the risk on the cycle, buying time to build a more complex solution (the lottery).
Thanks a lot for your answers, can't wait for the upcoming NTTP!

@@ -462,11 +470,19 @@ private static void loadPersistedNodes() {
int portUdp = Integer.parseInt(split[3]);
long queueTimestamp = Long.parseLong(split[4]);
// long identifierChangeTimestamp = Long.parseLong(split[5]); no longer used
long communicationFailureCount = Long.parseLong(split[5]);
if (communicationFailureCount > 1000) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a smart check.

@@ -234,6 +233,7 @@ public void run() {
socket.setSoTimeout(1000);
response = readFromStream(socket.getInputStream(), socket.getInetAddress().getAddress(),
message.getType());
NodeManager.markSuccessfulConnection(hostNameOrIp);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also have a negative impact on in-cycle connectivity. During periods of stress, a verifier might be able to make a TCP connection but not be able to process and respond within the 1000ms timeout period. The behavior of this with large messages should be carefully considered, also.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the feedback.
The current working is overly optimistic regarding users using alternate code, and overly pessimistic regarding network ability.
As a result, once you're in it's way to easy to remain in "nodes" file and be lottery eligible.

Since a markSuccessfulConnection() does reset the failed count, and to drop a node you need 6 consecutive fails plus a timestamp threshold,
you don't need every request to end within the timeout or with a successful processing.
At worse, you just need just one real answer during 6 consecutive queries + timestamp threshold and all failed reset.
No matter the stress period or some long messages: it will be reset anyway for a real verifier.
Moreover, the only way to increment the "failed" count is to not open a socket on incoming message.
If the peer timeouts, failed is not even incremented.
Lastly, iirc an in-cycle that would drop of the nodes files that way will be re-added to the nodes files instantly on next nodejoin, so it looks no more harmful than an in-cycle being down for some time.

If this poses a real risk to in-cycle connectivity (runs since almost 2 weeks on some verifiers, 30% of the cycle)
then the early markSuccessfulConnection() should only be called for in-cycles, not in-queue.
I'm unsure of the computational cost for such a lookup.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an additional feedback and second thought:
failedCount should only matter for in-queue verifiers.
In-cycle can be down for some time, they are backed up by sentinels, and their performance score is what have them voted out if they are off for a long time.
This is the main mechanism for dropping in-cycle, along with no block provided - be it by verifier or sentinel.

So failed count could only be relative in-queue, as a way to make sure that ip indeed hosts a verifier, with an open and working port.
Logic would then be to simply ignore markFailedCommunicationAttempt() for in-cycle verifiers.
In-cycle connectivity would then be preserved. Once in-cycle, a node is then less sensible to stress or DoS or temp issue.
All that matters is produce a block (verifier or sentinel) and perf metrics (long term working).

Would that cause some other side effects?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failedCount is important for in-cycle verifiers because it eliminates verifiers that are no longer present. If we did not have this, we would still be sending messages to every IP ever associated with an in-cycle verifier.

This needs to be examined in the specific case of a cycle stall or near-stall. The cycle has seen no stress under this code, so it has not yet been tested. There is a delicate balance between pruning permanently dead connections quickly to eliminate excess traffic and allowing connections with temporary problems to recover efficiently.

When the blockchain is moving forward at a normal pace, everything is easy. When the cycle becomes distressed, that's when the danger happens, and that's what we always need to engineer for. This is exceptionally difficult, because simulating the current Nyzo mesh is impractical.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the feedback.

We were in the dark for some time with no hint and missed you a lot.
I now understand your concerns about these changes better.
Would address that by differentiating in-cycle from out of cycle in regard to MarkSuccessful, since danger comes from out of cycle, and in-cycle needs more failsafes.

Will answer more deeply in the second comment.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.