601 sentinel isn't working #27

Closed · orisica opened this issue Sep 7, 2020 · 14 comments

orisica commented Sep 7, 2020

Numerous people have reported that the latest sentinel versions aren't working properly, which could be the cause of recent cycle dropouts.

I had the same tracking problem with sentinels on the newest version. All is good after a downgrade. Plenty of power and multiple always-track verifiers per sentinel.

I can second that sentinels with v601 fall behind in some cases. I don't know the reason yet, because similar setups have different results.

I found my sentinel had stopped freezing blocks after I upgraded it to v601. Unfortunately it was supposed to be protecting one of my friend's verifiers and it dropped out of the cycle. I downgraded my Sentinel to v598 and it's been working fine so far.

I have just verified this myself on a fresh sentinel installation I created yesterday: the sentinel gets stuck on a certain block and doesn't move. Downgrading to v587 fixes the issue.

Here is a log:

froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
waiting for message queue to clear from thread [Thread-8], size is 1
requesting block with votes for height 8947804
[1599462571.662 (2020-09-07 07:09:31.662 UTC)]: trying to fetch BlockWithVotesRequest37 from e168...3ea5
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
[1599462571.751 (2020-09-07 07:09:31.751 UTC)]: block-with-votes response is [BlockWithVotesResponse(block=[Block: v=2, height=8947804, hash=bf18...b752, id=a132...726e], votes=0)]
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
requesting block with votes for height 8947804
[1599462574.716 (2020-09-07 07:09:34.716 UTC)]: trying to fetch BlockWithVotesRequest37 from acc0...f75a
[1599462574.861 (2020-09-07 07:09:34.861 UTC)]: block-with-votes response is [BlockWithVotesResponse(block=[Block: v=2, height=8947804, hash=bf18...b752, id=a132...726e], votes=0)]
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
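
The tell in a log like this is that the height inside the "froze block" lines never advances while the sentinel keeps requesting the next height and getting votes=0 back. Below is a minimal, hypothetical stall check over a saved copy of the log; the log path and the repeat threshold are assumptions for illustration, not anything the sentinel itself provides.

```java
// Hypothetical stall detector for a sentinel log. The log path and the repeat
// threshold are assumptions; point the path at wherever your sentinel writes its log.
// It flags a stall when the height in "froze block [... height=H ...]" lines stops
// advancing, which is the pattern shown in the log above.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentinelStallCheck {

    private static final Pattern FROZE_PATTERN =
            Pattern.compile("froze block \\[Block: v=\\d+, height=(\\d+)");

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/var/log/nyzo-sentinel.log"));  // assumed path

        long lastHeight = -1;
        int repeatsAtLastHeight = 0;
        for (String line : lines) {
            Matcher matcher = FROZE_PATTERN.matcher(line);
            if (matcher.find()) {
                long height = Long.parseLong(matcher.group(1));
                if (height == lastHeight) {
                    repeatsAtLastHeight++;
                } else {
                    lastHeight = height;
                    repeatsAtLastHeight = 0;
                }
            }
        }

        // Dozens of identical "froze block" lines at the same height, as above, suggest
        // the frozen edge is no longer advancing.
        if (repeatsAtLastHeight > 50) {
            System.out.println("possible stall: height " + lastHeight + " repeated "
                    + repeatsAtLastHeight + " times");
        } else {
            System.out.println("frozen edge appears to be advancing (last height " + lastHeight + ")");
        }
    }
}
```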

EggPool commented Sep 7, 2020

I can confirm that several users reported the newest sentinels "freezing".

I did experience this myself:
  • v595 sentinel
  • fed by several in-cycle verifiers and an in-queue verifier with "always_track_blockchain = 1"
  • plenty of disk space
  • plenty of CPU and RAM left

This sentinel regularly ends up stuck with
"Protecting verifiers: uncertain".
When frozen, all out-of-cycle verifiers show "0" and all in-cycle verifiers show "10".
Average TTL: 1.5 days.

logs look like:
froze block [Block: v=2, height=8831679, hash=82c0...fa49, id=e503...af39], efficiency: 10,4%
froze block [Block: v=2, height=8831679, hash=82c0...fa49, id=e503...af39], efficiency: 10,4%
[..repeat..]
froze block [Block: v=2, height=8831679, hash=82c0...fa49, id=e503...af39], efficiency: 10,4%
requesting block with votes for height 8831680
trying to fetch BlockWithVotesRequest37 from 34a7...04c9
froze block [Block: v=2, height=8831679, hash=82c0...fa49, id=e503...af39], efficiency: 10,4%
waiting for message queue to clear from thread [Thread-29], size is 1
waiting for message queue to clear from thread [Thread-19], size is 1
[..repeat..]
waiting for message queue to clear from thread [Thread-26], size is 1
block-with-votes response is [BlockWithVotesResponse(block=[Block: v=2, height=8831680, hash=0acb...1d6c, id=eb4c...baa9], votes=0)]
froze block [Block: v=2, height=8831679, hash=82c0...fa49, id=e503...af39], efficiency: 10,4%
[..repeat..]

The frozen block is not moving anymore (the managed verifiers, however, do work and track blocks).
I have to resync blocks from a healthy sentinel and restart.

EggPool commented Oct 7, 2020

Update with more tests, since this is happening again and again, including with the newest version.

Context:

  • I have both in-cycle and out-of-cycle verifiers
  • every verifier is managed by one or several sentinels
  • every sentinel manages a mix of in-cycle and out-of-cycle verifiers

The issue:

  • Some sentinels (always the same ones) end up stuck, requesting and freezing the same block over and over.
  • The web interface shows "10" for every in-cycle verifier and "0" for every out-of-cycle verifier.
  • Status is "Protecting verifiers: uncertain".
  • The frozen edge does not move.
  • A restart does not help; the sentinel needs to be restarted with blocks copied from a working one.
  • This happens anywhere from twice a day to once every 3 days.

The logs always show the same pattern, whether at the first stall or after a restart and a second stall.
Roughly:

  • (waiting for message queue to clear from thread [Thread-12], size is 1)
  • requesting block with votes for height 9318144
  • trying to fetch BlockWithVotesRequest37 from bbbb...eee
  • block-with-votes response is [BlockWithVotesResponse(block=[Block: v=2, height=9318144, hash=77b2...abe1, id=24f6...734c], votes=0)]
  • froze block [Block: v=2, height=9318143, hash=0ba0...477c, id=39b5...29c3], efficiency: 10.2%
  • froze block [Block: v=2, height=9318143, hash=0ba0...477c, id=39b5...29c3], efficiency: 10.2%
  • froze block [Block: v=2, height=9318143, hash=0ba0...477c, id=39b5...29c3], efficiency: 10.2%
  • and again, with the very same block.

I first thought of a resource issue, since the main failing sentinel was on an HDD machine.
I duplicated that always-failing sentinel on a fast SSD machine with a powerful 8-core CPU and 32 GB of RAM.
The next morning, both the original (v595) and the clone (v605, same managed_verifiers) were stuck, both showing the log pattern above.

So, this seems to be pointing to something verifier-related.
However, while the stuck sentinel only received "votes=0" answers, the very same in-cycle verifiers were properly answering other, working sentinels, sending votes and tracking the chain.

At this stage, I'd look at the network/message level to see whether the block-with-votes response messages could be truncated or otherwise malformed.
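
As a concrete way to test the truncation idea, one could compare the length a received message declares against the bytes actually read, before deserializing it. The sketch below is only an illustration: the 4-byte big-endian length prefix is an assumed wire layout, not a claim about Nyzo's actual message format.

```java
// Illustration only: check whether a received message could have been truncated by
// comparing its declared length with the bytes actually read, before deserializing.
// The 4-byte big-endian length prefix is an assumed layout, not Nyzo's documented format.
import java.nio.ByteBuffer;

public class MessageLengthCheck {

    /** Returns true if the buffer holds at least as many bytes as its length prefix claims. */
    static boolean looksComplete(byte[] rawMessage) {
        if (rawMessage == null || rawMessage.length < 4) {
            return false;
        }
        int declaredLength = ByteBuffer.wrap(rawMessage, 0, 4).getInt();
        return declaredLength > 0 && rawMessage.length >= declaredLength;
    }

    public static void main(String[] args) {
        // A message that claims 120 bytes but only delivered 80 would be flagged here,
        // and could be logged and re-requested instead of being deserialized.
        byte[] truncated = new byte[80];
        ByteBuffer.wrap(truncated).putInt(120);
        System.out.println("complete? " + looksComplete(truncated));
    }
}
```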

What more can I do to help troubleshoot?
I have real logs if needed.

razzle-san commented Oct 7, 2020

As a case study, here's what happens. I have 3 sentinels watching some queued and some in-cycle verifiers. The first two run 597 and 587, and neither is working despite everything I've tried; I've been fiddling with them for days without any luck. The only one that's working is a v605 tracking some in-cycle verifiers (no queued).

[screenshot]

Later edit: I've removed all of the queued verifiers from the one running 597, and although it took a while to catch up, it's now tracking just fine. This seems to be an issue related to queued verifiers. The 587 has been restarted several times; now it's showing this:
[screenshot]

NeoDimi commented Oct 24, 2020

I'm adding a comment about my experience with the same issue, in case it helps:

I couldn't get my sentinels to start protecting verifiers, so I wiped the VPS of one sentinel clean (new install of Ubuntu 18.04) and reinstalled the sentinel completely. Everything started off fine until the sentinel stopped tracking after a short while (see below).

[screenshot]

Now I'm back to 10s in the web listener on the lines of managed verifiers that are tracking the blockchain and 0s for the others.

[screenshot]

Additional info:
The managed verifiers tracking the blockchain do seem to actually be tracking. Extract from the log:

storing new vote, height=9536524, hash=fc05...50b8
freezing block [Block: v=2, height=9536524, hash=fc05...50b8, id=2614...5bd3] with standard mechanism
after registering block [Block: v=2, height=9536524, hash=fc05...50b8, id=2614...5bd3] in BlockchainMetricsManager(), cycleTransactionSum=∩10.179444, cycleLength=2319
BlockVoteManager: removing vote map of size 2301
cleaning up because block 9536524 was frozen
casting late vote for height 9536524
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 9536524
sending verifier-removal vote to node at IP: 95.216.195.20
maximum cycle transaction amount=∩100000.000000, balance=∩72709327.076724
Unable to determine top verifier. This is normal; out-of-cycle verifiers are not typically notified about new-verifier votes.
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 9536525
storing new vote, height=9536525, hash=fa0b...8b69
freezing block [Block: v=2, height=9536525, hash=fa0b...8b69, id=da0b...90e7] with standard mechanism
after registering block [Block: v=2, height=9536525, hash=fa0b...8b69, id=da0b...90e7] in BlockchainMetricsManager(), cycleTransactionSum=∩10.179444, cycleLength=2319
BlockVoteManager: removing vote map of size 2315
cleaning up because block 9536525 was frozen
casting late vote for height 9536525
^^^^^^^^^^^^^^^^^^^^^ casting vote for height 9536525
sending verifier-removal vote to node at IP: 159.69.105.183
maximum cycle transaction amount=∩100000.000000, balance=∩72709327.076724

Neither the sentinel nor the verifiers tracking the blockchain have a full disk (3%, 63%, and 72% usage respectively), and they seem to have plenty of CPU and RAM capacity available.
The sentinel has a total of 25 verifiers in its care.
The sentinel version is v606 and the verifiers are a mix of v600 and v601.

I will try to downgrade the sentinel version to see if it works.
I have an in-cycle verifier (not managed by the sentinel shown above) that is naked at the moment. I really need to get some sentinels working to protect verifiers.

orisica commented Jan 28, 2021

Numerous reports of sentinels not working keep coming in on Discord. The sentinel works for a bit, then randomly freezes on a specific block and stops protecting verifiers. Deleting the blocks and restarting fixes it temporarily, until it gets stuck again after a few hours.

I can confirm that the issue is happening on the latest version (607).

@MyAltcoins

I have tested all sentinel versions from 602 to 607, and all exhibit the same issue.
Sentinels get stuck after a few hours to a few days on a particular block and stop protecting verifiers.

It happens even when sentinels are managing in-cycle verifiers only.

A fix would be greatly appreciated.

NeoDimi commented Apr 16, 2021

I'm now running version 587 of the sentinel. It doesn't solve the problem by any means but seems to at least give me several days of uptime at a time.

The behavior is quite strange:
I have 3 sentinels and each of my verifiers is covered by 2 of them, so the sentinels actually share a significant number of verifiers. They cover both in-cycle and queued verifiers.
Typically, over 10 to 15 days, sentinel 1 may work the whole time, sentinel 2 will need a restart once or twice, and sentinel 3 will need a restart every 2 days.
The next 10 or 15 days will show the same type of pattern, except now sentinel 3 is the stable one and sentinel 1 needs almost daily restarts... It seems very random to me.
It does its job as long as I check on the sentinels twice a day.
I second MyAltcoins, a fix would be greatly appreciated.

EggPool commented Apr 16, 2021

@MyAltcoins @NeoDimi
Did you have a look at the sentinel logs once stuck? Is it the same pattern as what I describe in #27 (comment)?

EggPool commented Apr 23, 2021

Update: After days of debug and logs, I was able to make significant progress and will propose a fix very soon.

n-y-z-o (repository owner) commented May 1, 2021

This issue may be fixed in version 608, if the issue that others are seeing is the same issue we have seen.

Our experience with Nyzo is actually more limited than many others in the cycle. With the joining of Argo 2080, we have only 10 in-cycle verifiers, and we have a total of 5 sentinels (including quark.nyzo.co). We did not experience this problem with any of our sentinels until doing some testing on an early version of 608 a few months ago. That doesn't mean this hasn't been a big problem -- we just haven't been able to experience the problem the way others have because of our limited setup.

We've seen a total of 3 times that a verifier became inactive. All 3 times, an invalid block was frozen. And the invalid blocks all had the same qualities in common:

  • the block signature was invalid because a meta transaction added by the block's verifier was absent
  • the verifier of the block had recently emptied its account, transferring all of its tokens to another account and reducing its balance to zero

After adding the signature check to the in-development 608 code, we added logging to see when a block with an invalid signature was received. This happened one time after adding the logging: an invalid block was received on quark.nyzo.co from Argo 752 for height 11509752. That block was verified by verifier 86eb...1898 (nickname A1200O4898). The account for 86ec...1898 had been emptied at block 11448867. Due to the signature check, the block was discarded, fetched again, and the sentinel continued without interruption.
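
To make the mechanism concrete, here is a minimal sketch of that kind of check: verify a received block's Ed25519 signature against the verifier's public key before freezing it, and discard and re-fetch the block if verification fails. The names, the byte layout, and the X.509-encoded key input are illustrative assumptions, not the actual 608 code.

```java
// A minimal sketch (not the actual 608 code) of checking a received block's Ed25519
// signature before freezing it. `signedBytes`, `signature`, and the X.509-encoded
// public key stand in for whatever the real block carries; Nyzo's serialization and
// raw-identifier handling are not reproduced here. Requires Java 15+ for Ed25519.
import java.security.KeyFactory;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PublicKey;
import java.security.Signature;
import java.security.spec.X509EncodedKeySpec;

public class BlockSignatureCheck {

    static boolean signatureIsValid(byte[] signedBytes, byte[] signature, byte[] verifierPublicKeyX509) {
        try {
            PublicKey publicKey = KeyFactory.getInstance("Ed25519")
                    .generatePublic(new X509EncodedKeySpec(verifierPublicKeyX509));
            Signature verifier = Signature.getInstance("Ed25519");
            verifier.initVerify(publicKey);
            verifier.update(signedBytes);
            return verifier.verify(signature);
        } catch (Exception e) {
            return false;  // treat any verification problem as an invalid block
        }
    }

    public static void main(String[] args) throws Exception {
        // Self-contained demo with a freshly generated key pair standing in for a verifier.
        KeyPair keyPair = KeyPairGenerator.getInstance("Ed25519").generateKeyPair();
        byte[] blockBytes = "height=11509752, transactions=[meta, ...]".getBytes();
        Signature signer = Signature.getInstance("Ed25519");
        signer.initSign(keyPair.getPrivate());
        signer.update(blockBytes);
        byte[] signature = signer.sign();

        byte[] publicKey = keyPair.getPublic().getEncoded();
        System.out.println("intact block accepted:   " + signatureIsValid(blockBytes, signature, publicKey));

        // Simulate a block whose meta transaction was stripped in transit: the signed bytes
        // change, the signature no longer verifies, and a sentinel doing this check would
        // discard the block and request it again rather than freeze it.
        byte[] strippedBytes = "height=11509752, transactions=[...]".getBytes();
        System.out.println("stripped block rejected: " + !signatureIsValid(strippedBytes, signature, publicKey));
    }
}
```

The important property is the one described above: an invalid block is dropped and requested again instead of being frozen, so the frozen edge keeps advancing.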

The previous instances of stalls happened at blocks 11373452, 11438953, 11455852. If you look at these, you'll see the same pattern. They all contain meta transactions, and the accounts of the verifiers had been emptied recently before the blocks were verified. So, it makes sense to us that there's some protection causing the transactions to be removed from blocks because the accounts are mistaken to be empty. We added additional logging to try to figure out where the transactions were being stripped -- on the sender or in reassembly on the sentinel -- but we stopped receiving invalid blocks so we were unable to track further. We could have easily tested by deliberately producing invalid blocks, but this would have only further confirmed that the signature checks were performing as intended. It would not have gotten us any closer to understanding why invalid blocks were being received.

The lack of sentinel signature checks could have also been exploited maliciously, although not very effectively.

Again, we're not sure if we are even seeing the same issue that others are seeing, so this may not be helpful at all. If you are having problems with a particular sentinel, however, upgrading that sentinel to 608 might be worth considering.

EggPool commented May 1, 2021

This is consistent with what I traced down myself on sentinels that were stuck very frequently.
See #39 for details.

This explains more precisely why some blocks may be invalid.
I don't think it happens only when the account was recently emptied; this happened far too often for me
(every time, however, a meta transaction was missing).

I also believe the fix in 608 solves the issue indirectly, by preventing blocks with missing transactions from being frozen.
The balance list hash check, though now redundant, is an added safety.
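
For anyone curious what such a check can look like, here is a hedged sketch: rebuild the balance list implied by the previous balances plus the block's transactions, hash it, and refuse to freeze the block if the hash doesn't match the one the block references. The data model, the encoding, and the use of plain SHA-256 are simplifications for the example, not Nyzo's real balance-list format or hashing.

```java
// Hedged sketch of a balance list hash check. The balance model (identifier -> micronyzo),
// the newline encoding, and plain SHA-256 are simplifications for illustration; they are
// not Nyzo's real balance-list format, fee handling, or hashing scheme.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BalanceListHashCheck {

    record Transfer(String sender, String receiver, long microNyzo) {}

    // Deterministic encoding of sorted identifier=balance pairs, then a single SHA-256.
    static byte[] hashOfBalances(Map<String, Long> balances) throws Exception {
        StringBuilder encoded = new StringBuilder();
        for (Map.Entry<String, Long> entry : new TreeMap<>(balances).entrySet()) {
            encoded.append(entry.getKey()).append('=').append(entry.getValue()).append('\n');
        }
        return MessageDigest.getInstance("SHA-256")
                .digest(encoded.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Applies a block's transfers to the previous balances (no fees in this toy model).
    static Map<String, Long> applyBlock(Map<String, Long> previousBalances, List<Transfer> transfers) {
        Map<String, Long> balances = new TreeMap<>(previousBalances);
        for (Transfer transfer : transfers) {
            balances.merge(transfer.sender(), -transfer.microNyzo(), Long::sum);
            balances.merge(transfer.receiver(), transfer.microNyzo(), Long::sum);
        }
        return balances;
    }

    static boolean balanceListHashMatches(byte[] expectedHash, Map<String, Long> previousBalances,
                                          List<Transfer> transfers) throws Exception {
        return MessageDigest.isEqual(expectedHash, hashOfBalances(applyBlock(previousBalances, transfers)));
    }

    public static void main(String[] args) throws Exception {
        Map<String, Long> previous = Map.of("verifierA", 1_000_000L, "verifierB", 500_000L);
        List<Transfer> transfers = List.of(new Transfer("verifierA", "verifierB", 1_000_000L));
        byte[] expectedHash = hashOfBalances(applyBlock(previous, transfers));

        // The intact block matches; the same block with its transaction stripped does not,
        // so it would not be frozen.
        System.out.println("intact block:   " + balanceListHashMatches(expectedHash, previous, transfers));
        System.out.println("stripped block: " + balanceListHashMatches(expectedHash, previous, List.of()));
    }
}
```

A block whose meta transaction was stripped in transit produces a different balance list, so this check catches the same class of problem as the signature check.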

@MyAltcoins

Can confirm that EggPool's changes completely fixed the sentinel issues.

NeoDimi commented May 4, 2021

I'm updating one of my sentinels, which often needed restarts, to v608. I'll give feedback in a few days. Thanks for the fix!

NeoDimi commented May 22, 2021

> I'm updating one of my sentinels, which often needed restarts, to v608. I'll give feedback in a few days. Thanks for the fix!

It's been almost three weeks and I haven't had to restart a single sentinel running this version. Kudos and thanks!

n-y-z-o closed this as completed Feb 27, 2022