headerssync: Add headers cache #3
base: master
Conversation
Great initiative on optimizing the headers sync – definitely a worthwhile area. I'm very happy somebody's working on this, it was also on my list of things to tackle!
Welcome to the team Benchoors :D/
The idea of caching headers to skip the network redownload is solid. However, the current approach, an in-memory RAM cache with automatic sizing heuristics, raises a few key concerns:
- Disk caching (raw file or even LevelDB) seems much more scalable, avoids significant RAM pressure during PRESYNC (which could be high), and might offer comparable performance gains given that network latency is the main bottleneck. While RAM is faster, disk avoids complex sizing logic and scales better for the future. Have you considered caching in batches to disk instead?
- If we keep RAM caching, the automatic sizing (GetFreeRAM, time estimates, various bounds) is complex and potentially fragile. Would a simpler approach, like a user-configurable -headerscachesize=MB (with a conservative default), be more predictable and robust?
- Some of the logic seems ripe for extraction into a helper method, and some explanatory comments might be better suited for commit messages.
- Probably irrelevant, but I was wondering: if we're not redownloading, do we still care about "honest nodes" (as long as it's not obvious spam)?
I will review this later in more detail, only had this much energy at the end of the weekend :)
@@ -110,3 +120,52 @@ int64_t GetStartupTime()
{
    return nStartupTime;
}

std::optional<size_t> GetFreeRAM()
I'm not sure we can get away with this; in other cases the user provides the size of the desired cache.
If we keep it, I am very much against introducing dead code: the commit that adds a functionality should also show its use - at least in a test, though I prefer real usage, to be able to decide whether it's the best solution to a given problem or not.
Each commit should be self-contained and its utility immediately demonstrable. The problem should come before the solution, but currently I see the solution in a commit without a description of the problem.
Yep, it is a bit of a departure (an interesting potential direction for other settings as well, IMO). We mostly do headers-sync on initial startup; after that the memory is freed. Maybe there is some condition where we've been running offline for a while and then reconnect that would re-trigger it too. It might be good to also provide an -arg to override the otherwise dynamic cache size though, will look into that.
I think having an independent commit "lays the breadcrumbs of the story", showing there are only dependencies in one direction (HeadersSyncState -> GetFreeRAM()). But I can see your point too. Merged it into where it is used.
Still not sure why we need this, now that there's a dedicated arg for it. We're not trying to guess the dbcache size either - or whether we're on an HDD or SSD (though that would actually be a good idea :p).
We've mostly relied on expert users instead of heuristics; it seems simpler to try that here as well - we can estimate it automatically in a follow-up. I would go for the simpler solution first.
I think we probably should be "guessing" the dbcache size as well. A minor part of this PR is bringing that up for discussion. But moved to a later commit for now, so it's easy to drop if there is more push-back.
I'd do both of them in a separate PR (basing memory defaults on available memory); I don't think you want to spend your time defending this part of the code in this PR, it seems like a different concern. You'd have to test this on a gazillion operating systems, compilers, and versions...
src/headerssync.cpp
Outdated
// Don't cache more than 30 years of ~10-minute blocks for now
6 * 24 * 365 * 30,
hmmm, why?
Was somehow worried about consuming too much RAM... but we already cap it at 10% below, so removed this, which simplifies things, thanks!
src/headerssync.cpp
Outdated
6 * 24 * 365 * 30,
// We try to make sure not to use more than 10% of
// available RAM for the headers cache.
(GetFreeRAM().value_or(0) / 10) / sizeof(decltype(m_headers_cache)::value_type)))))
This may be a bit harsh, but why are we caching to memory, why not cache to disk directly? (via a small buffer, of course, like we do with blocks, see https://github.com/bitcoin/bitcoin/blob/master/src/net_processing.cpp#L114)
We can expect SSDs to be a lot faster by the time this is merged and used - but modern ones are already very close to RAM speed, basically the same order of magnitude - and still several orders of magnitude faster than redownloading:
NVMe has hit almost 4 GB/s in transfer speeds, which is really incredible. This is nowhere near modern DDR4 speeds… DDR4 is generally around 20 GB/s, with higher speeds hitting 25 GB/s.
RAM latency is lower, but we don't really need that here as far as I understood: we have a predictable read/write order (given some buffering) - very similar to the actual block download, which also saves and rereads blocks when the window is full - and we could likely read/write header batches of 10,000 or whatever; we don't have to store them separately, if my understanding is correct.
Saving and prefetching can even be done on a separate thread if need be (if indeed the iteration order is mostly deterministic).
Predictable reads could also be fast on a lousy HDD.
While (de)serialization would add some overhead, it's actually quite trivial, and header sync would mimic the pattern used by block downloads (maybe even reuse some of the setup). It could be solved automatically if we saved these to LevelDB instead of separate files - might be slow, dunno, but I think it's worth considering, since we can assume that the headers should always fit on disk, but not always into memory.
It may result in simpler code (since we can just assume that everything is cached - no need for 30-year and 10%-memory limits and redownload fallbacks), better scaling (we expect more headers in 30 years), and basically the same speed gain compared to redownloading?
My initial goal is using resources that are available on slightly beefier nodes.
Since adding logic for caching to disk adds extra complexity, I'd prefer either doing it in a follow-up PR (saving review-energy for the other aspects in this PR), or possibly in a separate commit at the end of the PR.
Probably would just use one contiguous file without LevelDB since we don't need to do lookups, just write and read in order from genesis/start to sufficient chainwork.
Might be good to have one file for every 100'000 headers in order to be able to delete them after they have been consumed (read), as I think we start downloading and storing blocks to disk in parallel with that process.
Is there some other reason for doing something closer to block storage?
Is there some other reason for doing something closer to block storage?
We don't have to, of course, but I think we need a good reason to do it differently since it's a very similar problem.
Since adding logic for caching to disk adds extra complexity, I'd prefer either doing it in a follow-up PR (saving review-energy for the other aspects in this PR), or possibly in a separate commit at the end of the PR.
I'm not sure that's the case, quite the opposite, I'd expect it to be simpler. But you've spent more time on it than I have...
Memory is finite, and I'm not sure it scales with the blockchain. The disk kinda has to, and I'm not sure caching on disk would be measurably slower - but with memory we have to worry about OOM and other attacks all the time.
hss.reset(new HeadersSyncState(0, Params().GetConsensus(), chain_start, chain_work));
(void)hss->ProcessNextHeaders(first_chain, true);
BOOST_CHECK(hss->GetState() == HeadersSyncState::State::REDOWNLOAD);
static void HappyPath(const std::vector<CBlockHeader>& first_chain, const CBlockIndex* chain_start)
Is the happy path even likely? What's the smallest thing that can ruin a happy path - a single invalid header that isn't part of the longest chain?
Won't that become a new attack vector - i.e. the lightest attack that makes the happy path unfeasible, derailing caching (which isn't problematic now, but may be in a few decades)? It's likely a lot better than previous DoS threats, but with disk caching we could likely even handle invalid attempts (e.g. 10% invalid or whatever), which would likely be more robust.
Already on master: During PRESYNC, just as in REDOWNLOAD, we check both hashPrevBlock and PermittedDifficultyTransition (nBits). We must be given headers that fulfill this in order from our selected headers-sync peer.
The happy path is the most likely. Edge-cases:
- Malicious node (SneakyRedownload test).
- Selecting another node that hasn't fully synced the chain either, giving us insufficient work (TooLittleWork test).
- Connection lost.
Don't see potential for a new attack, but maybe I'm missing something?
I still need to understand what the worst case scenario is before and after this change - will try to find out myself (e.g. what kind of attacks are possible before and after)
src/headerssync.cpp
Outdated
// buffers.
Assume(m_process_all_remaining_headers);
} else {
// If we just switched to REDOWNLOAD and m_headers_cache was
seems to me saving to disk could solve this as well
Not in this case I don't think, even if we were to permit more headers being stored to disk (5min blocktimes?), we would still have a cap on the cache size eventually - so this would still need to be handled.
Force-pushed from ef1b61b to 2de1ae5.
Thank you for your initial feedback!
I'm very happy somebody's working on this, it was also on my list of things to tackle!
Welcome to the team Benchoors :D/
Nice!
the current approach using an in-memory RAM cache with automatic sizing heuristics raises a few key concerns
- Disk caching
- ... automatic sizing (GetFreeRAM, time estimates, various bounds) ... Would a simpler approach, like a user-configurable -headerscachesize=MB

Please see inline comments.
- Some of the logic seems ripe for extraction into a helper method, and some explanatory comments might be better suited for commit messages.

Extracted ComputeHeadersCacheMax(), any other specific suggestions?
Probably irrelevant, but was wondering that if we're not redownloading, do we still care about "honest nodes" (as long as it's not obvious spam)?
We have a minimum proof of work threshold which is hardcoded for a given release. This threshold triggers the PRESYNC -> REDOWNLOAD state change. We could have a malicious node attempt to feed us an ultra-long but low work chain. If our node has the cache but is running an old release far in the future (with an old threshold), it will fill the cache and then fall back to the behavior on master of verifying headers are connected and conform to difficulty adjustments, but proceed to throw them away. So the old node is still not going to go into some pathological case unless we remove the cap on the headers cache.
A mid-level aspect I've been considering is adding an extra state: PRESYNC -> CONSUMECACHE -> REDOWNLOAD. But HeadersSyncState exposes state-specific methods publicly (GetState(), GetPresyncHeight(), GetPresyncTime(), GetPresyncWork(), NextHeadersRequestLocator()), so I've avoided it so far. Any ideas on that level?
Changes in latest push:
- Extracted ComputeHeadersCacheMax().
- Changed the HeadersSyncState ctor to take bytes instead of a header count.
- Renamed HeadersSyncState::m_download_state -> m_state as we aren't tracking any other state enums within that class.
- Extracted ProcessPresync() and ProcessRedownload() methods to reduce indentation.
- Added -headerssynccache=MB arg.
- Added reference to recent BitMEX blog entry in commit message.
- Adjusted comments and other minutiae.
Force-pushed from 2de1ae5 to c0c990f.
Latest push:
@@ -23,15 +27,34 @@ constexpr size_t REDOWNLOAD_BUFFER_SIZE{14827}; // 14827/624 = ~23.8 commitments
// re-calculate parameters if we compress further)
static_assert(sizeof(CompressedHeader) == 48);

static size_t ComputeHeadersCacheMax(const CBlockIndex* chain_start, const std::optional<size_t> cache_bytes, size_t element_size)
{
    if (cache_bytes) {
this short-circuiting is a bit counter-intuitive
Specifying the size means we override the dynamic calculation, but we still need to convert bytes -> header count. Tried to clarify in latest push.
break;
}
switch (m_state) {
case State::PRESYNC:
I will have to debug this properly, I don't understand any of this yet :/
I'll debug it properly this week, only did a lightweight sweep again to get you some early feedback. Sorry for the lack of depth, will do that shortly.
@@ -50,13 +50,11 @@ FUZZ_TARGET(headers_sync_state, .init = initialize_headers_sync_state_fuzz)
{
    SeedRandomStateForTest(SeedRand::ZEROS);
    FuzzedDataProvider fuzzed_data_provider(buffer.data(), buffer.size());
    auto mock_time{ConsumeTime(fuzzed_data_provider)};
Needs more explanation, I don't have enough context.
"lower" than what? What happens when we're "returning"?
Attempted to improve commit message:
- fuzz: Specify lower limit of mock-time instead of returning
-
- Should reduce wasted runs, through less code complexity.
+ fuzz(headerssync): Specify minimum for mock-time
+
+ That way we can always proceed instead of just exiting the fuzz target, wasting cycles. Also reduces code complexity.
Any idea why it was done like that before?
auto age = rand();
if (age < 120) { /* do stuff */ } else return false;
seems indeed simpler as
auto age = rand(0, 120);
(you could likely commit this in a separate PR as well)
My guess is either the author didn't know of the min parameter or it was added afterwards. I think it's small and non-controversial enough to sneak into this PR for now.
You can do both
It seems it was already available at the time of the PR.
@@ -272,7 +272,7 @@ class HeadersSyncState {
bool m_process_all_remaining_headers{false};

/** Current state of our headers sync. */
State m_download_state{State::PRESYNC};
State m_state{State::PRESYNC};
Yeah, this is what I mean: X x = X::x; // sets X x to x (X)
Yeah, but the field seems kind of naked if the comment is removed completely. Could change it to "Central tracker of the state machine's state", but not sure that's enough of an improvement.
It's not just the comment: state is too general, everything is a "state".
Can we find a better name - or if not, can we refactor? Not being able to find a proper name is a strong code smell.
Having a State m_state field in a C++ state machine is certainly not unheard of. Would m_current_state be nicer in your eyes?
No, it doesn't help me in understanding the behavior better. Again, it's like a variable called "thing" ... it indicates a bigger problem
src/headerssync.cpp
Outdated
ret = ProcessRedownload(received_headers, full_headers_message);
break;
case State::FINAL:
    Assume(false); // Should never be called again after entering FINAL.
can we rather avoid this instead? Or at least print something instead of a comment? I hate comments, it's a sign that we've given up :(
Changed the assume condition to Assume(m_state != State::FINAL).
Would love proper support for error strings in asserts/assume.
This is worse than before; m_state != State::FINAL is always true.
Inside case State::FINAL, m_state != State::FINAL is always false, triggering the failure. This is a common idiom for documenting which condition has failed.
Yeah, but ideally we shouldn't be able to get to an illegal state in the first place - though it's not your code that is at fault here, but maybe we can reduce the invalid possibilities
src/headerssync.h
Outdated
std::vector<CompressedHeader> m_headers_cache;

/** Maximum cache height / entry count for m_headers_cache. */
const size_t m_headers_cache_max;
Do we really need this? We should be able to pre-size this quite precisely, maybe even make it an array and use its bounds.
Changed to use vector::capacity() for now. Would be nice to have a terse way of preventing the vector from accidentally re-allocating.
Or an array, making sure it's heap allocated
I take it you mean a C array? We would end up with 3 fields in HeadersSyncState instead of 1:
const size_t m_headers_cache_size; // Equivalent to vector::capacity()
CompressedHeader* const m_headers_cache; // Fixed memory blob, nice!
size_t m_headers_cache_count; // Equivalent to vector::size()
Yeah, but passing it around as an std::span.
I don't really have enough experience (as discussed in the other thread) with std::array for big values, so I'm not sure whether we can force it to heap allocation or not. Let me know if you do.
std::array is off the table since its size is fixed at compile time, and we want to determine the size at runtime.
Maybe m_presync_height can be used in place of m_headers_cache_count, as long as it is checked against m_headers_cache_size, which it may exceed.
You're right, we have to make it dynamic, for a second I thought that if it depends on dbcache size, it's not runtime anymore, lol
And there's also the upper bound of 10% mem currently
40e9538: This is a huge commit, can you break it up meaningfully?
Broke out the RAM-querying into a later commit.
I thought about making the cache_bytes parameter to the HeadersSyncState ctor default to std::nullopt or 0 and then introducing all the setting of that value in subsequent commits, but it felt too contrived, not leaving that commit in an acceptable state. More than a handful of files have call-sites sending in that value, but they are quite simple.
Let's discuss in person why you think that's an important part of this change
@@ -64,7 +64,8 @@ HeadersSyncState::HeadersSyncState(NodeId id, const Consensus::Params& consensus
// exceeds this bound, because it's not possible for a consensus-valid
// chain to be longer than this (at the current time -- in the future we
// could try again, if necessary, to sync a longer chain).
m_max_commitments = 6*(Ticks<std::chrono::seconds>(NodeClock::now() - NodeSeconds{std::chrono::seconds{chain_start->GetMedianTimePast()}}) + MAX_FUTURE_BLOCK_TIME) / HEADER_COMMITMENT_PERIOD;
// Subtract cached headers as we don't redownload and verify commitments for them.
aren't we subtracting the max possible headers (for some reason I don't yet understand)?
No, this is just the cache, hopefully dropping "max" from the variable name helps.
Q: I don't yet understand how that can be negative (that's why we need the std::max now, right)? But I'll debug it, you don't have to explain it, just documenting my surprises. Breaking it up into smaller named symbols could help.
@@ -180,7 +181,7 @@ HeadersSyncState::ProcessingResult HeadersSyncState::ProcessRedownload(const
// we'll return a batch of headers to the caller for processing.
ret.success = true;
for (const auto& hdr : received_headers) {
if (!ValidateAndStoreRedownloadedHeader(hdr)) {
if (!ValidateAndStoreRedownloadedHeader(hdr, /*from_cache=*/false)) {
"redownload" from "cache"?
That way we can always proceed instead of just exiting the fuzz target, wasting cycles. Also reduces code complexity.
* Makes tests logically separated through extraction into functions. Also enables using function-scope instances of HeadersSyncState instead of reset()-ing unique_ptr.
* Also adds missing comment for part 4.
* Use BOOST_REQUIRE_EQUAL for HeadersSyncState::State - nicer failure output and prevents continuing the test.
* Verify HeadersSyncState::State directly after ProcessNextHeaders().
* CHECK for more, like Locator and result.pow_validated_headers.
* Extract and capitalize constants.
* Extract genesis block variable.
* m_current_chain_work -> m_presync_chain_work: See GetPresyncWork()
* m_last_header_received -> m_presync_last_header_received: See GetPresyncTime()
* m_current_height -> m_presync_height: See GetPresyncHeight()
* m_download_state -> m_state: See GetState(). We don't have other state-enums in the class, so drop the "download"-qualifier, which could mislead one to associate it with the REDOWNLOAD state.
State processing will increase in complexity in upcoming commit.
While we do want to avoid OOM-attacks through [CVE-2024-52916](https://bitcoincore.org/en/2024/07/03/disclose-header-spam/), we can still serve the happy path by caching headers up to a reasonable chain height. The reasonable chain height is calculated by taking the difference between the starting block and our current time, and assuming slightly faster than 10-minute block time averages[^1].

Cuts out the time/bandwidth spent re-downloading the same headers from the same peer - good for both the local and the remote node, especially over slow/flaky connections. Even more helpful when the local node has a flaky connection to all other nodes.

The cache is entirely RAM-based since it is ephemeral by nature, taking less than a minute to fill on a really good connection (at contemporary block heights). Also due to the fact that we throw it away upon completion, or if we end up with a peer that has too-low-PoW or feeds us a malicious chain.

The headers cache size is set by -headerssynccache=<MB>. When unset, we default to storing REDOWNLOAD_BUFFER_SIZE number of headers (~0.68 MiB) (smaller for shorter chains). This makes us able to return a chunk of headers earlier, after having switched into the REDOWNLOAD phase.

[^1]: "the Bitcoin target interval time isn’t really 10 minutes, it’s actually 10 minutes and 0.3 seconds. Not that this bug is significant, actually since Bitcoin launched, the average interval has been 9 minutes and 36 seconds, significantly less than 10 minutes. This is because on average since 2009 the hashrate has been increasing." https://blog.bitmex.com/the-timewarp-attack/ (31 Mar 2025)
As we cache headers rather than redownload them, it doesn't make sense to generate commitments for them and then verify against the same data. Also skip the duplicate PermittedDifficultyTransition check during REDOWNLOAD, as we already did it during PRESYNC for cached headers.
Query the machine for free RAM and take up to 10% of it for the temporary cache. This really enables Bitcoin Core to make use of the hardware it is given when operated by users who don't investigate every single setting.
c0c990f to f1fdc84
Pushing my first impressions while I'm reviewing
src/headerssync.h
Outdated
    std::vector<CompressedHeader> m_headers_cache;

    /** Maximum cache height / entry count for m_headers_cache. */
    const size_t m_headers_cache_max;
Or an array, making sure it's heap allocated
Let's discuss in person why you think that's an important part of this change
src/headerssync.cpp
Outdated
        ret = ProcessRedownload(received_headers, full_headers_message);
        break;
    case State::FINAL:
        Assume(false); // Should never be called again after entering FINAL.
This is worse than before, m_state != State::FINAL is always true.
@@ -272,7 +272,7 @@ class HeadersSyncState {
     bool m_process_all_remaining_headers{false};

     /** Current state of our headers sync. */
-    State m_download_state{State::PRESYNC};
+    State m_state{State::PRESYNC};
It's not just the comment: state is too general, everything is a state. Can we find a better name? Or if not, can we refactor? Not being able to find a proper name is a strong code smell.
@@ -56,6 +56,7 @@ FUZZ_TARGET(p2p_handshake, .init = ::initialize)
     PeerManager::Options{
         .reconcile_txs = true,
         .deterministic_rng = true,
+        .headerssync_cache_bytes = fuzzed_data_provider.ConsumeIntegralInRange(0, 20 << 20), // Up to 20MiB
Can we use 20_MiB here?
@@ -104,7 +104,7 @@ static void SneakyRedownload(const std::vector<CBlockHeader>& first_chain, const
     // initially and then the rest.
     headers_batch.insert(headers_batch.end(), std::next(first_chain.begin()), first_chain.end());

-    HeadersSyncState hss{0, Params().GetConsensus(), chain_start, CHAIN_WORK};
+    HeadersSyncState hss{0, Params().GetConsensus(), chain_start, CHAIN_WORK, /*cache_bytes=*/1 * sizeof(CompressedHeader)};
nit: I personally don't find the 1 * more readable than plain sizeof(CompressedHeader)
@@ -215,7 +215,7 @@ class HeadersSyncState {

     /** In REDOWNLOAD, check a header's commitment (if applicable) and add to
      * buffer for later processing */
-    bool ValidateAndStoreRedownloadedHeader(const CBlockHeader& header);
+    bool ValidateAndStoreRedownloadedHeader(const CBlockHeader& header, bool from_cache);
as stated before, we likely need to rename the method now, since we're not "downloading" from a cache anymore
@@ -110,3 +120,52 @@ int64_t GetStartupTime()
 {
     return nStartupTime;
 }

+std::optional<size_t> GetFreeRAM()
I'd do both of them in a separate PR (base memory defaults on available memory), I don't think you want to spend your time defending this part of the code in this PR, it seems like a different concern. You'd have to test this on a gazillion operating systems and compilers and versions...
@@ -50,13 +50,11 @@ FUZZ_TARGET(headers_sync_state, .init = initialize_headers_sync_state_fuzz)
 {
     SeedRandomStateForTest(SeedRand::ZEROS);
     FuzzedDataProvider fuzzed_data_provider(buffer.data(), buffer.size());
-    auto mock_time{ConsumeTime(fuzzed_data_provider)};
It seems it was already available at the time of the PR.
    SneakyRedownload(first_chain, second_chain, chain_start, chain_work);
    HappyPath(first_chain, chain_start, chain_work);
    TooLittleWork(second_chain, chain_start, chain_work);
This is a lot better than commenting all over the place, love it!
    TooLittleWork(second_chain, chain_start, chain_work);
}

static void SneakyRedownload(const std::vector<CBlockHeader>& first_chain, const std::vector<CBlockHeader>& second_chain, const CBlockIndex* chain_start, const arith_uint256& chain_work)
nit: would be great to avoid all these const vector& references and use std::span instead, but that would require prod code change as well, so maybe next time
@@ -63,15 +63,19 @@ BOOST_FIXTURE_TEST_SUITE(headers_sync_chainwork_tests, HeadersGeneratorSetup)
     // updates to the REDOWNLOAD phase successfully.
nit: HeadersGeneratorSetup::GenerateHeaders seems to have a leftover return at the end (and ;;)
-    headers_batch.clear();
+    std::vector<CBlockHeader> headers_batch;
     headers_batch.insert(headers_batch.end(), std::next(second_chain.begin(), 1), second_chain.end());
     // Tell the sync logic that the headers message was not full, implying no
     // more headers can be requested. For a low-work-chain, this should causes
-    // more headers can be requested. For a low-work-chain, this should causes
+    // more headers can be requested. For a low-work-chain, this should cause
     /** Height of m_last_header_received */
     int64_t m_current_height{0};
+    /** During phase 1 (PRESYNC), we cache a reasonable amount of received headers so
Q: what's the role of the lookahead buffer after this change?
@@ -183,6 +183,11 @@ class HeadersSyncState {
     const unsigned m_commit_offset;

+private:
+    ProcessingResult ProcessPresync(const std::vector<CBlockHeader>&
maybe this could come after the test breakup (or maybe even the same commit), it seems to follow the same logic
@@ -23,6 +23,10 @@ void ApplyArgsManOptions(const ArgsManager& argsman, PeerManager::Options& optio
     if (auto value{argsman.GetBoolArg("-capturemessages")}) options.capture_messages = *value;

     if (auto value{argsman.GetBoolArg("-blocksonly")}) options.ignore_incoming_txs = *value;

+    if (auto value{argsman.GetIntArg("-headerssynccache")}) {
+        options.headerssync_cache_bytes = *value << 20; // MiB -> byte conversion
this could be problematic on a 32-bit system, we should likely error if this is too big:
-    options.headerssync_cache_bytes = *value << 20; // MiB -> byte conversion
+    options.headerssync_cache_bytes = *value * 1_MiB;
@@ -43,7 +67,7 @@ HeadersSyncState::HeadersSyncState(NodeId id, const Consensus::Params& consensus
     // could try again, if necessary, to sync a longer chain).
     m_max_commitments = 6*(Ticks<std::chrono::seconds>(NodeClock::now() - NodeSeconds{std::chrono::seconds{chain_start->GetMedianTimePast()}}) + MAX_FUTURE_BLOCK_TIME) / HEADER_COMMITMENT_PERIOD;

-    LogDebug(BCLog::NET, "Initial headers sync started with peer=%d: height=%i, max_commitments=%i, min_work=%s\n", m_id, m_presync_height, m_max_commitments, m_minimum_required_work.ToString());
+    LogDebug(BCLog::NET, "Initial headers sync started with peer=%d: height=%i, max_commitments=%i, min_work=%s, cache=%d headers (%.1f MiB)", m_id, m_presync_height, m_max_commitments, m_minimum_required_work.ToString(), m_headers_cache.capacity(), (m_headers_cache.capacity() * sizeof(decltype(m_headers_cache)::value_type)) / (1024.0f * 1024.0f));
to avoid guarding m_headers_cache for additional pushes and querying .capacity in this awkward way, we could hide it behind a struct, something like:
template<typename T>
class FixedCache {
    std::vector<T> data;
    std::size_t const cap;
public:
    explicit FixedCache(std::size_t capacity) : cap{capacity} { data.reserve(cap); }
    bool push(const T& value) { if (data.size() == cap) return false; data.push_back(value); return true; }
};
@@ -54,6 +78,7 @@ void HeadersSyncState::Finalize()
     Assume(m_state != State::FINAL);
     ClearShrink(m_header_commitments);
     m_presync_last_header_received.SetNull();
+    std::vector<CompressedHeader>{}.swap(m_headers_cache);
this may bear a comment stating that when we detect any problem the cache is completely wiped. And can this also be a ClearShrink instead?