`lightning-block-sync`: Implement serialization logic for header `Cache` types #3600

tnull · 2025-02-13T12:41:38Z

During syncing, lightning-block-sync populates as a block header Cache that is crucial to be able to disconnected previously-connected headers cleanly in case of a reorg. Moreover, the Cache can have performance benefits as subsequently synced listeners might not necessarily need to lookup all headers again from the chain source.

While this Cache is ~crucial to the clean operation of lightning-block-sync, it was previously not possible to persist it to disk due to an absence of serialization logic implementations for the corresponding sub-types. Here, we do just that (Implement said serialization logic) to allow users to persist the Cache.

Making use of the serialization logic for all the sub-types, we also switch the UnboundedCache type to be a newtype wrapper around a HashMap (rather than a straight typedef) and implement TLV-based serialization logic on it.

During syncing, `lightning-block-sync` populates as a block header `Cache` that is crucial to be able to disconnected previously-connected headers cleanly in case of a reorg. Moreover, the `Cache` can have performance benefits as subsequently synced listeners might not necessarily need to lookup all headers again from the chain source. While this `Cache` is ~crucial to the clean operation of `lightning-block-sync`, it was previously not possible to persist it to disk due to an absence of serialization logic implementations for the corresponding sub-types. Here, we do just that (Implement said serialization logic) to allow users to persist the `Cache`. Making use of the serialization logic for all the sub-types, we also switch the `UnboundedCache` type to be a newtype wrapper around a `HashMap` (rather than a straight typedef) and implement TLV-based serialization logic on it.

TheBlueMatt · 2025-02-13T15:28:40Z

I'm not sure I quite understand the desire to serialize this. We use it to keep track of things as we fetch them, but generally shouldn't ever need the cache entries after we connect the blocks (and persist the things that we connected the blocks to).

tnull · 2025-02-13T15:35:47Z

I'm not sure I quite understand the desire to serialize this. We use it to keep track of things as we fetch them, but generally shouldn't ever need the cache entries after we connect the blocks (and persist the things that we connected the blocks to).

Talked with @jkczyz offline about this: AFAIU, we use the cache to keep headers around that might have been connected to another fork. Say we synced a listener up to a certain tip. If it now goes offline, and a reorg happens in the meantime, lightning-block-sync might not have the necessary old header data around to call block_disconnected before re-connecting the blocks of the new best chain.

Additionally, it might bring quite some performance improvements for lightning_block_sync::init::synchronize_listeners which first walks the chain for each listener individually before connecting blocks to them, and only in this last step it's actually inserting into the cache. Meaning: if we sync 6 chain listeners, we're walking the header chain 6 times completely uncached on restart before entering the 'block connection phase'. If we don't persist the Cache, that is.

jkczyz · 2025-02-13T16:23:40Z

Additionally, it might bring quite some performance improvements for lightning_block_sync::init::synchronize_listeners which first walks the chain for each listener individually before connecting blocks to them, and only in this last step it's actually inserting into the cache. Meaning: if we sync 6 chain listeners, we're walking the header chain 6 times completely uncached on restart before entering the 'block connection phase'. If we don't persist the Cache, that is.

Alternatively (or in addition), we could manually update the cache at each iteration during initialization from the ChainDifference, which would mean we'd want some AppendOnlyCache instead of a ReadOnlyCache there. That way we'd only need to query the chain source once for anything not yet cached prior to restarting.

TheBlueMatt · 2025-02-13T17:18:06Z

If it now goes offline, and a reorg happens in the meantime, lightning-block-sync might not have the necessary old header data around to call block_disconnected before re-connecting the blocks of the new best chain.

This is true if you switch to a new bitcoind, but AFAIU using the same bitcoind (data dir) will always provide us with the missing block (headers).

Additionally, it might bring quite some performance improvements for lightning_block_sync::init::synchronize_listeners which first walks the chain for each listener individually before connecting blocks to them, and only in this last step it's actually inserting into the cache. Meaning: if we sync 6 chain listeners, we're walking the header chain 6 times completely uncached on restart before entering the 'block connection phase'. If we don't persist the Cache, that is.

This seems like a bug we could fix, another way, no?

tnull · 2025-02-13T18:03:08Z

Alternatively (or in addition), we could manually update the cache at each iteration during initialization from the ChainDifference, which would mean we'd want some AppendOnlyCache instead of a ReadOnlyCache there. That way we'd only need to query the chain source once for anything not yet cached prior to restarting.

I guess we could make the cache append only, but that's ~orthogonal to this PR, IMO.

This is true if you switch to a new bitcoind, but AFAIU using the same bitcoind (data dir) will always provide us with the missing block (headers).

Right, so if we want to be able to safely handle switiching chain sources, we need to at least support a minimal persisted cache (in LDK Node we restrict it to 100 entries currently, but could even get away with ~ANTI_REORG_DELAY, IIUC).

This seems like a bug we could fix, another way, no?

I don't think it's a bug per se. We need to walk for each listener separately to make sure we find the connection points reliably. It would be nice to make use of caching during it though.

TheBlueMatt · 2025-02-14T15:09:28Z

Right, so if we want to be able to safely handle switiching chain sources, we need to at least support a minimal persisted cache (in LDK Node we restrict it to 100 entries currently, but could even get away with ~ANTI_REORG_DELAY, IIUC).

If we want to support this, IMO its pretty weird to require the cache be persisted by a downstream project, rather if we need it (we should check whether we can just remove the header in the disconnect call and not worry about it, I think we probably can) and then store a few block headers/hashes going back in the listeners themselves, cause they're the things responsible for communicating their current chain location, no?

tnull · 2025-02-14T15:23:14Z

If we want to support this, IMO its pretty weird to require the cache be persisted by a downstream project, rather if we need it (we should check whether we can just remove the header in the disconnect call and not worry about it, I think we probably can) and then store a few block headers/hashes going back in the listeners themselves, cause they're the things responsible for communicating their current chain location, no?

On LDK Node's end it would be nice to maintain the cache, also as a performance optimization. And IMO we can leave it to the user to decide if/how much data they want to persist. In any case, I'm not sure why we wouldn't want to allow for the cache to be persisted?

TheBlueMatt · 2025-02-14T15:44:32Z

I'm a bit confused, though, the cache should really only be relevant when we're reorging, which shouldn't be common? Why is it a material performance difference?

tnull · 2025-03-05T09:45:58Z

I'm a bit confused, though, the cache should really only be relevant when we're reorging, which shouldn't be common? Why is it a material performance difference?

Well, a persisted cache could help performance to resume syncing in the initial 'walking the chain' step. As also discussed offline, it's also required to allow for switching chain sources, e.g., switching from one bitcoind instance to another as we can't be sure that all nodes have seen the same forks and keep the data readily available. So even if you don't follow the former reason, I'd still argue we should support the latter in LDK Node / Server to make the chain syncing logic more robust.

TheBlueMatt · 2025-03-05T17:30:04Z

Okay, but in this case we should probably persist a fixed-size cache of, let's say, 6 blocks (the depth that we assume no reorgs will happen), no?

tnull · 2025-04-10T09:20:51Z

Okay, but in this case we should probably persist a fixed-size cache of, let's say, 6 blocks (the depth that we assume no reorgs will happen), no?

Well, maybe, but a) I think that should be up to the user's Cache implementation and b) even if we go down that road in the future, we'd still need the changes in this PR.

Would be great to get this going again, as I'd like to be able to make use of cache persistence after the next LDK release in LDK Node.

TheBlueMatt · 2025-05-13T16:07:01Z

I still really strongly disagree with the approach of persisting the cache.

In the "performance improvement" case, we shouldn't be relying on downstream-of-LDK applications to fix lightning-block-sync's performance issues. We have a header cache in synchronize_listeners and we even use it for the poller passed to find_difference, the performance issue is actually that we don't use the header cache when doing header lookups in, eg, UnboundedCache. If we fixed that, then we'd only load headers for the walk-back once (as long as all objects are at the same tip, but in any case we'll only ever load headers once). That seems fine given, in the absence of reorgs, we'd actually only ever load headers we hadn't seen when we shut down.

For the "switching chain sources" issue, I'd very much prefer to look into this approach, again because LDK requiring that downstream applications work around API limitations is a really bad thing in general:

If we want to support this, IMO its pretty weird to require the cache be persisted by a downstream project, rather if we need it (we should check whether we can just remove the header in the disconnect call and not worry about it, I think we probably can) and then store a few block headers/hashes going back in the listeners themselves, cause they're the things responsible for communicating their current chain location, no?

tnull · 2025-05-27T12:56:35Z

Excuse the delay here.

In the "performance improvement" case, we shouldn't be relying on downstream-of-LDK applications to fix lightning-block-sync's performance issues. We have a header cache in synchronize_listeners and we even use it for the poller passed to find_difference, the performance issue is actually that we don't use the header cache when doing header lookups in, eg, UnboundedCache.

No, but the cache is immutable at that state on purpose to avoid essential entries being dropped (e.g., in case of a bounded cache) as we walk back to the latest known connection point. We even have ReadOnlyCache to ensure that. AFAIU, the cache was always meant to be persisted as it's crucial for reorg handling, and not adding support for this was merely an oversight. (cc @jkczyz, please correct me on this if I'm wrong).

If we fixed that, then we'd only load headers for the walk-back once (as long as all objects are at the same tip, but in any case we'll only ever load headers once). That seems fine given, in the absence of reorgs, we'd actually only ever load headers we hadn't seen when we shut down.

No, because we walk (and arguably need to walk) each listerener individually back to the last connection point they know, and during that we can't override/remove entries from the cache as it would mess with the data we need for proper reorg handling.

I don't think we need to/should fundamentally rewrite and rethink all of lightning-block-sync just to mitigate the issues at hand. This PR is really all we need.

For the "switching chain sources" issue, I'd very much prefer to look into this approach, again because LDK requiring that downstream applications work around API limitations is a really bad thing in general:

So you'd rather have lightning-block-sync take a KVStore, with all the async/blocking messiness to come, and take care of cache persistence itself, likely then as yet another thing driven by the background processor? Do we expect that many users using 'raw' lightning-block-sync going forward that prioritizing the API over clear separation of concerns of our code makes sense? Why can't I just persist the cache and give it back to lightning-block-sync after reinit, just like LDK requires me to do with all these other objects?

TheBlueMatt · 2025-06-12T22:56:30Z

No, but the cache is immutable at that state on purpose to avoid essential entries being dropped (e.g., in case of a bounded cache) as we walk back to the latest known connection point. We even have ReadOnlyCache to ensure that. AFAIU, the cache was always meant to be persisted as it's crucial for reorg handling, and not adding support for this was merely an oversight. (cc @jkczyz, please correct me on this if I'm wrong).

No, because we walk (and arguably need to walk) each listerener individually back to the last connection point they know, and during that we can't override/remove entries from the cache as it would mess with the data we need for proper reorg handling.

Mmm, this is news to me. But even if we aren't discarding previous entries, I'm still really unclear why we can't cache more? We're doing the same thing over and over again, relying on a persisted cache to avoid doing it over and over again seems really weird? We can just add a cache on the cache to HashMap while we cache and move on, no?

So you'd rather have lightning-block-sync take a KVStore, with all the async/blocking messiness to come, and take care of cache persistence itself, likely then as yet another thing driven by the background processor? Do we expect that many users using 'raw' lightning-block-sync going forward that prioritizing the API over clear separation of concerns of our code makes sense? Why can't I just persist the cache and give it back to lightning-block-sync after reinit, just like LDK requires me to do with all these other objects?

Hmm, I think we're talking past each other, no, that wasn't my proposed alternative. For the ability to walk back, my stated goal here was to instead think about removing the need to pass the headers through when we disconnect blocks so that we no longer need the headers at all. From there, we can store something akin to Bitcoin Core's block locators (ie hashes going back a ways) in each chain::Listen implementation to make looking up fork points doable even after a reorg. Note that we need to do that in general as its not a lightning-block-sync-specific concern - if we are on a chain tip, then there's a reorg while we're offline, this is an issue whether we're syncing over Esplora or Electrum too, we still have to figure out if a reorg happened and what the fork point was.

tnull · 2025-06-13T08:40:04Z

Mmm, this is news to me. But even if we aren't discarding previous entries, I'm still really unclear why we can't cache more? We're doing the same thing over and over again, relying on a persisted cache to avoid doing it over and over again seems really weird? We can just add a cache on the cache to HashMap while we cache and move on, no?

I mean, maybe, but that's unrelated to the changes in this PR. Here we try to fix the current behavior, you propose to extend it, which we could discuss, but it's a bigger lift. And, IMO, we'd want the changes in this PR either way.

Hmm, I think we're talking past each other, no, that wasn't my proposed alternative. For the ability to walk back, my stated goal here was to instead think about removing the need to pass the headers through when we disconnect blocks so that we no longer need the headers at all. From there, we can store something akin to Bitcoin Core's block locators (ie hashes going back a ways) in each chain::Listen implementation to make looking up fork points doable even after a reorg.

Right, that would be a larger refactoring of how Listen works, and I'm generally all for it, if someone finds the time to do it. I still think this is unrelated from just fixing/completing the current behavior, which is what this PR does.

Btw, when we do this, we could at least consider a tighter integration with BDK, as they do exactly the same tracking internally and hence their interface doesn't require/allow to disconnect blocks, but rather you'd always just keep appply connected blocks (cf https://docs.rs/bdk_wallet/latest/bdk_wallet/struct.Wallet.html#method.apply_block / https://docs.rs/bdk_chain/latest/bdk_chain/local_chain/struct.LocalChain.html).

Note that we need to do that in general as its not a lightning-block-sync-specific concern - if we are on a chain tip, then there's a reorg while we're offline, this is an issue whether we're syncing over Esplora or Electrum too, we still have to figure out if a reorg happened and what the fork point was.

Huh, but transaction-based syncing is working entirely differently, no? Or are you saying if you were to use Electrum/Esplora in conjunction with Listen?

TheBlueMatt

Right, that would be a larger refactoring of how Listen works, and I'm generally all for it, if someone finds the time to do it. I still think this is unrelated from just fixing/completing the current behavior, which is what this PR does.

It wasn't a very large refactor :) #3876.

I believe now it should just be a matter of keeping locators instead of best-blocks now? Probably just updating the existing BestBlock struct we use to store more than one hash (6 at least probably?) and taking a BestBlock to the existing Confirm::best_block_updated and we're done?

Btw, when we do this, we could at least consider a tighter integration with BDK, as they do exactly the same tracking internally and hence their interface doesn't require/allow to disconnect blocks, but rather you'd always just keep appply connected blocks (cf https://docs.rs/bdk_wallet/latest/bdk_wallet/struct.Wallet.html#method.apply_block / https://docs.rs/bdk_chain/latest/bdk_chain/local_chain/struct.LocalChain.html).

I think tracking it that way would be incompatible with electrum, no? If I just keep saying best_block_updated and passing new blocks it won't let us detect when there's a reorg if you are allowed to skip blocks.

tnull · 2025-06-23T11:22:57Z

I believe now it should just be a matter of keeping locators instead of best-blocks now? Probably just updating the existing BestBlock struct we use to store more than one hash (6 at least probably?) and taking a BestBlock to the existing Confirm::best_block_updated and we're done?

I'm not sure if I'm following here, what has Confirm to do with this?

I think tracking it that way would be incompatible with electrum, no? If I just keep saying best_block_updated and passing new blocks it won't let us detect when there's a reorg if you are allowed to skip blocks.

You have me confused here, what does electrum have to do with lightning-block-sync?

TheBlueMatt · 2025-06-30T17:37:02Z

I'm not sure if I'm following here, what has Confirm to do with this?

Just to make sure we have several recent blocks even if we're syncing with electrum (so we can switch to bitcoind without issue but also so we can handle reorgs better with electrum).

As I understand it, after #3876 and a tweak to make BestBlock hold several recent blocks rather than just one the issue of restarting after a reorg with a fresh bitcoind is solved as we no longer need to know anything about blocks on the original chain to reorg back.

That leaves only the question about performance, which I still think is better handled internally in lightning-block-sync rather than forcing downstream applications to Do Work.

tnull · 2025-07-01T08:41:12Z

Just to make sure we have several recent blocks even if we're syncing with electrum (so we can switch to bitcoind without issue but also so we can handle reorgs better with electrum).

As I understand it, after #3876 and a tweak to make BestBlock hold several recent blocks rather than just one the issue of restarting after a reorg with a fresh bitcoind is solved as we no longer need to know anything about blocks on the original chain to reorg back.

I'm still lost. So you want to require users to download full blocks on Electrum/Esplora and give them via a new modified BestBlock to Confirm? I'm surely misunderstanding something?

jkczyz · 2025-07-01T13:59:49Z

I'm still lost. So you want to require users to download full blocks on Electrum/Esplora and give them via a new modified BestBlock to Confirm? I'm surely misunderstanding something?

I think @TheBlueMatt meant recent block hashes as per #3600 (review).

tnull · 2025-07-02T08:02:17Z

I think @TheBlueMatt meant recent block hashes as per #3600 (review).

Ah, so the idea would be to refactor BestBlock to be more like a BestChainSuffix, i.e., hold the information about the best 6 blocks or similar?

As noted over at #3876 (comment), I think it would be good to clearly outline the remaining steps we have in mind somewhere (e.g., a tracking issue), also to keep track of what needs to land before the next release (for one so we don't break anything, but also as I'd like to get bitcoind support in LDK Node in a fully production ready state after the 0.2 upgrade).

Furthermore, given that the present discussion might impact #3867 if we really make changes to Confirm, it might be good to sync and come to an overall conclusion soon.

TheBlueMatt · 2025-07-02T21:58:42Z

See #3876 (comment) I don't think there's much that will impact Confirm aside from the BestBlock changes.

tnull requested a review from jkczyz February 13, 2025 12:41

tnull force-pushed the 2025-02-header-cache-persistence branch from bed21f8 to a403bf9 Compare February 13, 2025 13:44

tnull force-pushed the 2025-02-header-cache-persistence branch from a403bf9 to 0207228 Compare February 13, 2025 13:52

tnull requested a review from TheBlueMatt June 12, 2025 08:11

TheBlueMatt removed their request for review June 13, 2025 02:15

tnull requested a review from TheBlueMatt June 18, 2025 08:57

TheBlueMatt reviewed Jun 18, 2025

View reviewed changes

TheBlueMatt mentioned this pull request Jul 2, 2025

Drop the need for fork headers when calling Listen's disconnect #3876

Open

lightning-block-sync: Implement serialization logic for header Cache types #3600

Are you sure you want to change the base?

lightning-block-sync: Implement serialization logic for header Cache types #3600

Uh oh!

Conversation

tnull commented Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TheBlueMatt commented Feb 13, 2025

Uh oh!

tnull commented Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkczyz commented Feb 13, 2025

Uh oh!

TheBlueMatt commented Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tnull commented Feb 13, 2025

Uh oh!

TheBlueMatt commented Feb 14, 2025

Uh oh!

tnull commented Feb 14, 2025

Uh oh!

TheBlueMatt commented Feb 14, 2025

Uh oh!

tnull commented Mar 5, 2025

Uh oh!

TheBlueMatt commented Mar 5, 2025

Uh oh!

tnull commented Apr 10, 2025

Uh oh!

TheBlueMatt commented May 13, 2025

Uh oh!

tnull commented May 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TheBlueMatt commented Jun 12, 2025

Uh oh!

tnull commented Jun 13, 2025

Uh oh!

TheBlueMatt left a comment

Choose a reason for hiding this comment

Uh oh!

tnull commented Jun 23, 2025

Uh oh!

TheBlueMatt commented Jun 30, 2025

Uh oh!

tnull commented Jul 1, 2025

Uh oh!

jkczyz commented Jul 1, 2025

Uh oh!

tnull commented Jul 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TheBlueMatt commented Jul 2, 2025

Uh oh!

Uh oh!

`lightning-block-sync`: Implement serialization logic for header `Cache` types #3600

`lightning-block-sync`: Implement serialization logic for header `Cache` types #3600

tnull commented Feb 13, 2025 •

edited

Loading

tnull commented Feb 13, 2025 •

edited

Loading

TheBlueMatt commented Feb 13, 2025 •

edited

Loading

tnull commented May 27, 2025 •

edited

Loading

tnull commented Jul 2, 2025 •

edited

Loading