Decide what to do about Zebra bug in P2P network header behaviour #8907
Labels:
- A-compatibility (Area: Compatibility with other nodes or wallets, or standard rules)
- A-network (Area: Network protocol updates or fixes)
- C-bug (Category: This is a bug)
- C-research (Category: Engineering notes in support of design choices)
- P-High 🔥
- S-needs-investigation (Status: Needs further investigation)

In #1439, the Zebra P2P network implementation was altered to work around bitcoin/bitcoin#6755 (a bug in Bitcoin Core, inherited by `zcashd`, where if it received a `headers` message containing headers it already knew about, it would request follow-on headers that it did not need). However, this alteration meant that Zebra was intentionally violating the P2P network protocol in two ways:

- When a `getheaders` message is received by a node while it is partially-synced, it should not send a `headers` message at all (the Bitcoin-inherited P2P network protocol is not request-response). A partially-synced Zebra now "correctly" computes the intersection of the provided block locator with its very-incomplete local state, resulting in a nonsensical `headers` message that starts far in the past of the recipient.
- A `headers` message containing fewer than the maximum number of headers (2000 in Bitcoin, 160 in Zcash) is a sentinel that means "I have no more headers after this" (i.e. these headers reach the provided `hashStop`, or the node's chain tip if no `hashStop` was provided). By intentionally sending at most 158 headers, Zebra side-steps the Bitcoin Core bug in the case that it would produce duplicates (which occurs more frequently in partially-synced Zebra due to the first violation), but this also means that every Zebra (whether partially or fully synced) tells all of its peers that it has no more data when it actually does.

These two protocol violations went unnoticed because for the vast majority of Zebra's lifetime, the majority of
`zcashd` nodes in the network have been synced to an equal or greater block height than Zebra (the case that the behavioural change to Zebra was targeting). However, due to other edge cases around mining in both `zcashd` and Zebra, currently testnet has Zebra nodes that are fully synced, and all `zcashd` nodes are only partially synced. This leads to network header syncing becoming entirely reliant on block mining events, as the following communication pattern between a partially-synced `zcashd` (that is more than 158 blocks behind the chain tip) and a fully-synced `zebrad` shows:

- `zcashd` connects to `zebrad`, which sends an initial `getheaders` request;
- `zcashd` correctly ignores the `getheaders` request as it is in Initial Block Download mode.
- `zebrad` mines a block, and advertises its hash to `zcashd` via an `inv` message.
- `zcashd` sends a `getheaders` request containing its block locator, and the advertised block hash as the `hashStop`.
- `zebrad` correctly computes the intersection of the provided block locator with the node's current chain, determines that it has more than 160 blocks following that point, but only returns 158 following headers in violation of the P2P network protocol;
- `zcashd` correctly interprets the P2P network sentinel from the `headers` message as the `zebrad` node informing it that these headers reach the `zebrad` node's chain tip (as they do not include the advertised block hash, and thus `hashStop` was not reached), and does not send a follow-on `getheaders` request;
- `zcashd` is now stalled until `zebrad` mines another block.

The effect is that a
`zcashd` node will receive a burst of 158 blocks roughly every 75 seconds, which makes catching up to the chain tip impractical.

Now, this is definitely Zebra violating the current network protocol. However, we are in the process of deprecating
`zcashd`, and at that point (more specifically, at the first network upgrade that Zebra supports but `zcashd` does not), whatever network protocol Zebra implements will become the de facto correct protocol. So there are two questions:

[...] `zcashd`, or do we need to make changes to Zebra's behaviour to bring it back in line with the current network protocol?
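
To make the stall described above concrete, here is a minimal sketch of the sentinel rule, not code from Zebra or `zcashd`. It assumes a simplified model in which block heights stand in for header hashes, the block locator is reduced to the requester's current height, and the requester sends a follow-on `getheaders` only when a `headers` message is full. The names `serve_headers`, `sync_after_inv`, and `cap` are hypothetical; only the 160 and 158 limits come from the text.

```python
MAX_HEADERS_RESULTS = 160  # Zcash maximum headers per message, per the text above


def serve_headers(chain_height, locator_height, hash_stop, cap):
    """Heights a hypothetical peer returns for one getheaders request.

    Heights stand in for header hashes (an assumption of this sketch).
    The peer returns headers after the locator, up to hash_stop, its own
    chain tip, or `cap` headers, whichever comes first.
    """
    last = min(chain_height, hash_stop, locator_height + cap)
    return list(range(locator_height + 1, last + 1))


def sync_after_inv(client_height, server_height, cap):
    """Model one inv-triggered header sync from a single peer.

    Returns (client height afterwards, number of getheaders requests sent).
    """
    requests = 0
    while True:
        # The advertised block (the server's tip here) is used as hashStop.
        headers = serve_headers(server_height, client_height, server_height, cap)
        requests += 1
        if headers:
            client_height = headers[-1]
        # Sentinel rule: a message with fewer than the maximum number of
        # headers is read as "no more headers after this", so stop asking.
        if len(headers) < MAX_HEADERS_RESULTS:
            return client_height, requests


if __name__ == "__main__":
    # Peer capped at 158 headers: one request, then the client stalls.
    print(sync_after_inv(0, 1000, cap=158))
    # Peer sending full 160-header messages: the client keeps requesting.
    print(sync_after_inv(0, 1000, cap=160))
```

In this model, a peer capped at 158 headers ends every exchange after a single request, whereas a peer that fills messages to 160 lets the requester continue until it reaches the advertised block. The sketch deliberately ignores the first violation (responding to `getheaders` while partially synced) and the duplicate-header bug that the cap was meant to avoid.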