Decide what to do about Zebra bug in P2P network header behaviour #8907

str4d · 2024-10-03T03:05:26Z

In #1439, the Zebra P2P network implementation was altered to work around bitcoin/bitcoin#6755 (a bug in Bitcoin Core, inherited by zcashd, where if it received a headers message containing headers it already knew about, it would request follow-on headers that it did not need). However, this alteration meant that Zebra was intentionally violating the P2P network protocol in two ways:

If a getheaders message is received by a node while it is partially-synced, it should not send a headers message at all (the Bitcoin-inherited P2P network protocol is not request-response). A partially-synced Zebra now "correctly" computes the intersection of the provided block locator with its very-incomplete local state, resulting in a non-sensical headers message that starts far in the past of the recipient.
In the P2P network protocol inherited from Bitcoin Core, a headers message containing fewer than the maximum number of headers (2000 in Bitcoin, 160 in Zcash) is a sentinel that means "I have no more headers after this" (i.e. these headers reach the provided hashStop, or the node's chain tip if no hashStop was provided). By intentionally sending at most 158 headers, Zebra side-steps the Bitcoin Core bug in the case that it would produce duplicates (which occurs more frequently in partially-synced Zebra due to the first violation), but also means that every Zebra (whether partially or fully synced) tells all of its peers that it has no more data when it actually does.

These two protocol violations went unnoticed because for the vast majority of Zebra's lifetime, the majority of zcashd nodes in the network have been synced to an equal or greater block height than Zebra (the case that the behavioural change to Zebra was targeting). However, due to other edge cases around mining in both zcashd and Zebra, currently testnet has Zebra nodes that are fully synced, and all zcashd nodes are only partially synced. This leads to network header syncing becoming entirely reliant on block mining events, as the following communication pattern between a partially-synced zcashd (that is more than 158 blocks behind the chain tip) and a fully-synced zebrad shows:

zcashd connects to zebrad, which sends an initial getheaders request;
zcashd correctly ignores the getheaders request as it is in Initial Block Download mode.
zebrad mines a block, and advertises its hash to zcashd via an inv message.
zcashd sends a getheaders request containing its block locator, and the advertised block hash as the hashStop.
zebrad correctly computes the intersection of the provided block locator with the node's current chain, determines that it has more than 160 blocks following that point, but only returns 158 following headers in violation of the P2P network protocol;
zcashd correctly interprets the P2P network sentinel from the headers message as the zebrad node informing it that these headers reach the zebrad node's chain tip (as they do not include the advertised block hash, and thus hashStop was not reached), and does not send a follow-on getheaders request;
zcashd is now stalled until zebrad mines another block.

The effect is that a zcashd node will receive a burst of 158 blocks roughly every 75 seconds, which makes catching up to the chain tip impractical.

Now, this is definitely Zebra violating the current network protocol. However, we are in the process of deprecating zcashd, and at that point (more specifically, at the first network upgrade that Zebra supports but zcashd does not), whatever network protocol Zebra implements will become the de-facto correct protocol. So there are two questions:

Is Zebra's current behaviour stable and safe in a Zebra-only network?
Are there changes we can make so we can safely limp along for the remaining lifetime of zcashd, or do we need to make changes to Zebra's behaviour to bring it back in line with the current network protocol?

The text was updated successfully, but these errors were encountered:

daira · 2024-10-03T11:26:10Z

My view is that this is a sufficiently serious bug for the remaining lifetime of zcashd that it makes no sense not to fix it, especially since I think a fix is likely to be relatively straightforward, at least on the Zebra side.

IMHO, #1439 was poorly motivated in the first place: zebrad should absolutely not be trying to fix zcashd's efficiency bugs. In this case, "fixing" one bug turned out to be at the expense of introducing another. The effect of #1439 should just be reverted, and then zcashd can independently fix its bug in requesting headers redundantly, probably by backporting bitcoin/bitcoin#25720 (and related PRs; see zcash/zcash#6959).

It doesn't matter if an unfixed zcashd in IBD tries to download headers from a zebrad with the effect of #1439 reverted, because that case still works, just with greater bandwidth usage (which is only for headers, not for blocks). That's not zebrad's problem! zebrad would have to handle such requests gracefully anyway if it were subject to a DoS attack.

Note that this bug doesn't only affect testnet; it makes zcashd syncing from zebrad extremely slow on both testnet and mainnet. That interacts badly with zcashd's policy of only downloading headers from a single arbitrarily chosen node during IBD, because if it chooses a zebrad node (which will happen increasingly often as the distribution of nodes shifts toward Zebra) then that will affect that zcashd node's entire IBD. Note that Bitcoin Core added a timeout in bitcoin/bitcoin#10345, which zcashd tried to add in zcash/zcash#6231 but then reverted in zcash/zcash#6276. In any case that timeout wouldn't trigger in the situation at hand, where headers are downloaded in batches of 158 according to a Poisson process with parameter 75 seconds.

In general, one node implementation trying to fix another node implementation's bugs is how you get fragile, underdocumented, hard-to-maintain protocols with many places for corner-case bugs and DoS vulnerabilities to lurk. I despise it and I think it has poisoned a generation of Internet protocols. [2]

[1] I think that policy is whacked, frankly, even with the timeout from bitcoin/bitcoin#10345. But that's still not zebrad's problem (see previous paragraph). Please use zcash/zcash#6959 for any discussion of it.

[2] This affects even the base protocols of the Internet. For example, if you thought at least TCP was reasonably well specified, read RFC 4614 section 7 and weep. The inevitable consequence of implementations trying to work around each other's bugs is layers upon layers of underdocumented cruft and attack surfaces. Just say no (but still report the bug(s)).

mpguerra · 2024-10-07T14:47:11Z

blocking #8679

daira mentioned this issue Oct 3, 2024

More robust and efficient header sync during IBD zcash/zcash#6959

Open

mpguerra assigned arya2 Oct 7, 2024

arya2 mentioned this issue Oct 7, 2024

fix(network): Respond to getheaders requests with a maximum of 160 block headers #8913

Merged

7 tasks

mergify bot closed this as completed in #8913 Oct 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decide what to do about Zebra bug in P2P network header behaviour #8907

Decide what to do about Zebra bug in P2P network header behaviour #8907

str4d commented Oct 3, 2024 •

edited

Loading

daira commented Oct 3, 2024 •

edited

Loading

mpguerra commented Oct 7, 2024

Decide what to do about Zebra bug in P2P network header behaviour #8907

Decide what to do about Zebra bug in P2P network header behaviour #8907

Comments

str4d commented Oct 3, 2024 • edited Loading

daira commented Oct 3, 2024 • edited Loading

mpguerra commented Oct 7, 2024

str4d commented Oct 3, 2024 •

edited

Loading

daira commented Oct 3, 2024 •

edited

Loading