-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: drop bbolt for chain storage #2395
Comments
I don't see any issues with anything presented. It's well thought out and, crucially, focuses on recovery from the failure modes. A couple of notes.
I could be wrong here, but I suspect that due to the variable size of records (aka may or may not have the full block hash due to a collision) not having entry sizes will potentially bite you since it will make iteration more expensive. It might not be too bad if you jump directly to the flag and use that to offset the iterator. I just recall that when doing similar work for dcrd many years ago, and even more generally when dealing with blocks, not having record sizes tends to make iteration much more painful.
It does not at all change your assessment, but for 100% correctness, I have historical data from 7 nodes that have been running uninterrupted since inception and there have been >1 depth reorgs, but very few. Concretely:
I agree with the assessment here, especially when the block index is in memory. I would say that it might be worth considering in the case the alternative design in appendix 1 is used though.
This one is probably a bit of a toss up. SSDs are largely going to ignore whatever you do since their controllers have their own built-in wear leveling algorithms which determine where to put the data anyway. It could potentially help a little on HDDs. |
Right, it wouldn't be unreasonable to add the size as a specific field (only 4MB per million blocks, so pretty light). My own reasoning for not including it in this first draft is because (this design of) the index file is not meant to be the target of seek operations: it's read once at DB startup (thus always iterated in sequence) and then only written to afterwards.
Ah, awesome! Thank you for these, always helpful to get additional data. |
Here's a first draft of the DB API surface.package lightchaindb
import (
"sync"
"github.com/decred/dcrd/chaincfg/chainhash"
"github.com/decred/dcrd/wire"
)
// DB represents the chain database.
type DB struct {
mu sync.RWMutex // Protects concurrent access
// TODO: Add fields for file handlers, arenas, etc.
}
// Open opens the chain database.
func Open(path string) (*DB, error) {
db := &DB{}
// TODO: Initialize file handlers, arenas, etc.
return db, nil
}
// Close closes the chain database.
func (db *DB) Close() error {
db.mu.Lock()
defer db.mu.Unlock()
// TODO: Close file handlers, release resources, etc.
return nil
}
// BlockLocators returns block locators, suitable for use in a getheaders wire
// message or dcrd JSON-RPC request, for the blocks in sidechain and saved in
// the wallet's main chain. For memory and lookup efficiency, many older
// hashes are skipped, with increasing gaps between included hashes.
//
// When sidechain has zero length, locators for only main chain blocks starting
// from the tip are returned. Otherwise, locators are created starting with
// the best (last) block of sidechain and sidechain[0] must be a child of a
// main chain block (sidechain may not contain orphan blocks).
//
// The height of the first block locator (i.e. either the mainchain tip height
// or height of the last sidechain block) is also returned.
func (db *DB) BlockLocators(sidechain []*wire.BlockHeader) ([]*chainhash.Hash, uint32, error) {
panic("needed")
}
// Add the given block and CFilter data to the DB. This may cause a
// reorganization of the chain. If that is the case, then the method will
// return true.
//
// This is usually called when the database has already completed the IBD
// procedure and is currently in a steady state. For improved performance during
// IBD, use the [AddHeaders] and [AddCFilters] functions.
func (db *DB) Add(h *wire.BlockHeader, cf CFilterProof) (reorged bool, err error) {
panic("needed")
}
// AddHeaders adds the passed set of headers to the sidechain forest. After
// adding this new set of headers, the best candidate header chain may have
// been modified.
//
// This is optimized for batch insertion of sequential headers during IBD, and
// as such the caller is expected to enforce certains pre-conditions on the
// passed batch of headers:
//
// - They are in ascending block height order.
// - They connect to each other (the PrevBlock of ith header correctly matches
// the actual block hash of the ith-1 header).
//
// Failure on the part of the caller to enforce these preconditions may put the
// DB into an inconsistent or otherwise invalid state.
//
// The targetSyncHeightHint parameter is used as a hint to the DB of the largest
// block height available. This is used to improve the performance of batch
// addition by assuming that at least that number of blocks will be ultimately
// added to the DB.
func (db *DB) AddHeaders(headers []*wire.BlockHeader, targetSyncHeightHint uint32) error {
panic("needed")
}
// AppendBestChainMissingCFilters appends to [s] the list of block hashes for
// blocks in the main chain that are missing CFilters. These blocks could be
// either blocks already considered in the main chain or blocks in the
// sidechain forest that are waiting to be evaluated.
func (db *DB) AppendBestChainMissingCFilters(s []*chainhash.Hash) ([]*chainhash.Hash, error) {
panic("needed")
}
// AddCFilters adds a batch of missing cfilters to the sidechain forest.
// [endHash] must be the block hash of the last block in the list of cfilters.
//
// If the filters and the commitment proof are valid and the filters are for
// blocks in the sidechain forest, then the chain may be reorged, which will
// be signalled in the resulting bool.
func (db *DB) AddCFilters(hashes []*chainhash.Hash, cfilters []CFilterProof,
targetSyncHeightHint uint32) (reorged bool, err error) {
panic("needed")
}
// TipHeader returns the header of the main chain.
func (db *DB) TipHeader() wire.BlockHeader {
panic("needed")
}
// GetHeader returns the header for the given hash that is expected to be on
// the mainchain.
func (db *DB) GetHeader(bh *chainhash.Hash) (wire.BlockHeader, error) {
panic("needed")
}
// GetBlockHashAt returns the hash of the block at the given height.
func (db *DB) GetBlockHashAt(height uint32) (chainhash.Hash, error) {
panic("needed")
}
// CatcherUp is an interface for catching up callers to the state of the DB.
//
// The parameters of the methods may be reused, therefore it is the
// responsibility of the caller to copy them if they are to be used outside of
// the lifetime of the calls.
type CatcherUp interface {
// Disconnect is called when a block has been disconnected from the
// main chain. The calls are made in reverse connection order: higher
// heights are called first.
Disconnect(bh *chainhash.Hash, b *wire.BlockHeader, cf CFilterProof) error
// Connect is called when a block has been connected to the main chain.
// The calls are made in block order.
Connect(bh *chainhash.Hash, b *wire.BlockHeader, cf CFilterProof) error
}
// Catchup synchronizes an external caller to the current main chain. The
// methods in c are called as needed in order to, first rollback, then to
// advance the caller until all blocks from oldTip up to the current main chain
// tip are visited.
//
// This process is performed with the DB blocked from writes, thus it is
// recommended to not perform actions that block for a signficant amount of
// time.
//
// One alternative to achieve this is to perform the catchup in batches, by
// returning early from one of the calls with an error, then re-starting the
// catchup again.
//
// Results are undefined if oldHeight and oldTip do not match to a valid block.
func (db *DB) Catchup(oldTip *chainhash.Hash, oldHeight uint32, c CatcherUp) error {
panic("needed")
}
// GetMainChainCFilters returns compact filters from the main chain, starting at
// startHash, copying as many as possible into the storage slice and returning a
// subslice for the total number of results. If the start hash is not in the
// main chain, this function errors. If inclusive is true, the startHash is
// included in the results, otherwise only blocks after the startHash are
// included.
func (db *DB) GetMainChainCFilters(startHash *chainhash.Hash, inclusive bool, storage []*BlockCFilter) ([]*BlockCFilter, error) {
panic("needed")
} During IBD:
Once the sync is complete,
During steady state operation, I believe these are the most important calls the proposed DB implementation needs to provide in order for it to be usable as a replacement for the wallet's DB. |
Proposal: drop bbolt for chain storage
Rationale: bbolt is the largest cpu and mem consumer during initial sync and that is entirely due to storing headers and cfilters and fetching blocks for difficulty validation. The features provided by bbolt (in particular: transaction atomicity) are not necessary for storage of chain data. That, plus a few other desirable features (listed below) could be better provided by a new chain db implementation.
NOTE: this is NOT about moving the entirety of dcrwallet off bbolt, ONLY the chain data (headers and cfilters).
Downsides of bbolt
The following measurements are from a sample chain sync of a new SPV dcrwallet instance connected to a single remote source known to have good connectivity and being up to date. Sync height was 885252.
Desired New Features
The following would be the new features desired out of a new chain db:
Overall Design
This is a rough first draft of the design for this DB:
The storage strategy for each arena may be different, according to what provides best throughput during IBD and ongoing sync (after IBD completes).
Long Term Arena
Short Term Arena
Long Term Arena File Format
Files
The long term arena is composed of four files. Before the actual data, each file starts with an 8 byte file header:
Headers File
File for storing headers. This file works as raw header storage. Each entry has the following format:
This file MAY have padding at the end, filled with zero bytes.
Due to headers having a fixed size, the last entry can easily be found when iterating backwards (it's the last one at a position modulo the header size that is not all zeros).
Headers are only put into the headers file of the LTA after they have at least 6 confs (see algorithms below). Due to this, the block hash of a header (other than the last one) can be found without requiring a hashing operation by simply looking at the
PrevBlock
field of the next header.CFilters file
File for storing CFilters. Each entry has the following format:
After the last valid entry, this file has a footer with the following data:
This file MAY have padding at the end, filled with zero bytes.
The goal of having a footer is to explicitly mark the end of (what should be) valid data. Note the footer itself is not a valid entry: the total entry size would be larger than the file itself and the height would be larger than the last header. The footer is meant to be easy to find when iterating backwards from the end of the file.
The goal of having the total entry size on both ends of the struct is to allow backwards iteration of the CFilters file: after identifying the start of the footer, the last valid entry can be correctly read by reading the final total entry size and prior entries can be iterated backwards.
This will be useful during DB startup, to validate the last entry of the CFilters file matches the last header.
Index file
File for storing the index of block hashes to block height and CFilter entry location. This can be rebuilt entirely from the headers and CFilters file if needed, so it could be an optional feature.
Each entry contains the following:
Due to being stored already sorted, the height of a block is implicit (so the first entry in the file corresponds to block 1, and so on). Due to the header sizes being fixed, the offset into the header file is implicit.
The "reduced block hash" is similar to the "short block key" in dcrd: it's the first 32 bits of the block hash.
The MSB of the flags field is used to indicate whether the full block hash is present in this entry or not, while bits 0 to 28 is the total CFilter entry size.
The full block hash is only present in instances where there is a collision in the reduced block hash. In this case, the first block where a particular reduced block hash occurs does not have the flag set, while later occurrences have it set. The wallet can reconstruct the index appropriately. On the current mainnet, as of block 885208, only 84 blocks have a collision in their 32 bit reduced block hash key.
This file contains a footer. The last entries of this file are:
The footer is intended as a marker that the last write operation of the index was finished correctly. When reading the index file, if the footer is not correctly present, the file may be assumed as corrupted and should be rebuilt from scratch. Both entries of the footer are invalid: the all zeros entry points to offset location 0 in the CFilters file, while the all ones points to an entry that will be at an offset larger than the CFilters file size.
This file MAY have padding filled with zero bytes after the last two entries. The footer, therefore, ensures the presence of at least 12 zero bytes at the end (but more are possible).
The goal for the index file is to allow the wallet to quickly load the known blockchain into an in-memory index, for fast header and CFilter lookup. The current mainnet chain as of height 885208 would require (at a minimum, disregarding overhead for pointers and other structures) about 10.6 MB of RAM to keep this index in memory.
There is an alternative design we may pursue for the index that does not require keeping any data in memory. The drawback for this design is that it has an increased disk space requirement and requires disk accesses for fetching data from the index. See Appendix 1 at the bottom for more details.
Sidechains File
File for storing blocks that were once part of the main chain but that have been reorged out.
The entries in this file are rarely needed - they are only used in certain scenarios that involve failed writes and multi-wallet usage of the DB.
Each entry in this file has the following format:
This file may be padded with zero bytes at the end.
Note that entries' fields are organized in reverse of what would be ordinarily expected: block hash and header appear at the end of the entry. This is meant to ease reverse iteration of this file, when looking for a block with specific hash or height.
There is no particular structure to this file. However, it is expected that later blocks will appear after earlier blocks. When a reorg greater than depth 1 happens, blocks that were reorged out will appear contiguously, in block sequence order.
Short Term Arena
This arena is meant to store the most recent 6 blocks: the tip block plus its five immediate ancestors. This arena is organized as a single file, meant to be used in a circular buffer fashion: a new tip block will cause one block to be evicted from this arena after it is written to the long term arena, and then the prior storage location for that block will be overwritten with the newly received block (assuming the newly received block is extending the chain as opposed to creating a sidechain).
This file starts with an 8 byte magic header: 0x6463727374613030 ("dcrsta00").
Each entry in the file has the following structure:
The CFilter storage space is pre-selected according to the max possible CFilter size. It is pre-determined in order to ensure each entry has a completely fixed size on-disk. If the max possible sizes of CFilters change due to a consensus rule change, then this would require a db upgrade.
The max number of CFilter inclusion proofs is set to 64 for a similar reason. While the currently deployed consensus rules enforce that the inclusion proof for CFilters has size 0, we opt to increase the max possible size of the inclusion proof in order to ensure future rules upgrades do not require a db upgrade. This would only be required if a very large number of new commitments were added to the header.
The special file checksum is special in the sense that it is not a standard file checksum/hash computed by processing the file in-order. Instead, the checksum is calculated by hashing the data for the block, plus its 5 immediate ancestors in block sequence, excluding their own special file checksum. This is important to note, because the blocks in the short term arena file are NOT stored in height order. The checksum is meant to ensure the last write operation completed successfully: during startup, while loading the blocks in the short term arena, the wallet may verify that the checksum of the tip block matches the expected one based on prior blocks, and if not, then it should discard all blocks in this arena and re-fetch them (or rewind until the checksum matches).
Any number of entries in this file may be completely filled with zeroes (for example, because of a reorg happened or because the file was recreated after a write failure operation).
Algorithms For Blockchain Evolution
Of particular importance is considering write failures before/during/after each step of operation, how that can be detected (i.e. what is the disk state and startup validation procedures) and what action should be taken.
Connecting new blocks
This is the procedure for connecting recently mined, individual blocks. This is the primary write function during steady-state operation of the DB (after IBD has completed) and is the most likely to fail due to lack of disk space (because it is the primary one that may require growing the files).
Assumptions:
Process:
Stage 6 can be roughly understood as the "commit" stage: if the checksum of the most recent block in the STA is correct and the most recent block of the LTA is at a 6 block delta from it, then the prior write operation was successful.
Failures before that stage can be detected by the prior validation check. Failures between stages 1 to 5 can be easily recovered: the write operation can simply be repeated (completely or starting from where it failed).
Failures on step 6 means the most recent block of the STA is damaged (the checksum is mismatched) and should be discarded: in that case, the client code (wallet) will fetch the most recent block after DB startup completes. The faulty entry may be cleaned up by being zeroed.
Adding an Alt Chain Tip
This happens purely in memory. The memory-only sidechain forest is updated to track the new chain tip, but otherwise no file modification is necessary.
Perform a <= 6 depth reorg
This is the procedure for connecting an alternative chain tip to the current tip that DOES cause a reorg. This happens episodically within the Decred chain when a new block is found that extends the current chain tip and it has more work than the current best chain tip.
Assumptions:
Step 3 is necessary so that, even if a failure happens in the software after the entire reorg procedure is carried out, the DB clients (wallet) will be able to fetch the header and CFilter of the reorged-out block in order to carry out their own reorg procedures (e.g. rewinding transactions). It is particularly necessary when a single DB is used for multiple wallets (which may not be online when the DB is updated).
Steps 4 and 5 are split so that, upon restart, it is obvious whether one of them failed and whether the STA is correct or not. Failures during step 4 mean only part of the old block data has been erased, while failures during step 5 mean only part of the new block data has been written. In either case, that entry's checksum (which is the last information written to disk for an entry) will fail to validate.
Doing steps 4 and 5 separately may be seen as superfluous (as the overwrite would erase the data anyway), and doing it in block sequence order (as opposed to file order) may be seen as taking a performance hit, but doing it in this fashion will increase the chances of having as many blocks as possible written correctly to disk, in case of write failures.
Note that long running nodes have never publicly documented a > 1 block depth reorg in Decred, and thus it is expected that no more than depth 1 reorgs will be needed (though this algorithm should support reorgs up to depth 6).
Perform a > 6 depth reorg
This is the hardest reorg to perform in the DB and the one that could most likely cause corruption. Therefore, this process is geared towards correctness as opposed to performance.
Assumptions:
Process:
The first disk operation must be adding the reorged-out blocks to the sidechains file to ensure any clients may fetch that data if it is needed to perform its own reorg procedure.
The rationale for zeroing out files before performing any of the write operations is the same as for regular reorgs: this increases the confidence that write failures will leave the DB in a state that is at least partially consistent in the sense that blocks earlier than one that is detected as broken are known to be correct.
The rationale for zeroing out the headers file first (as opposed to last) is that headers that are missing their CFilters should cause clients (i.e. wallet) to fetch the missing CFilters while extra CFilters without corresponding headers is a faulty state detected and corrected during DB startup (the extra CFilters and index entries may be erased). The first option would cause the client the potentially attempt to fetch CFilters that, at a minimum, are not needed anymore and on a worst case scenario won't be provided by any nodes anymore. The second is easily recoverable from after the extra data is erased - syncing continues as if the wallet is out of sync.
Note that if the process fails before step 8, the blocks that would be reorged-in (the new chain) is not recorded and will have to be downloaded again after startup.
Initial Block Download
During IBD, a very large number of headers and CFilters is fetched at once, in batches. It is therefore ideal to make use of that information to avoid performing redundant writes. In particular, writes to the Short Term Arena are unnecessary, because it is likely that a new batch of headers that will soon be received will overwrite it.
Writes to the LTA files can be batched: a batch of headers, CFilters and block index changes may be written directly, instead of being written block by block (assuming the clients have validated the correctness of the headers).
Additionally, the files can be grown directly in one operation, once the target sync height is known (however, care should be taken to limit the max file growth possible when syncing from untrusted sources).
The IBD batch addition process may be carried out until the tip is within 6 blocks from either the target sync height or the estimated blockchain height (calculated based on the target block production interval). After that, the DB should switch to the standard, steady-state operation mode.
DB Startup
Several validations must be performed during DB startup in order to ensure the data is consistent, and to fix faulty states in case of write failures. The following is a rough process for loading the DB from disk and creating its in-memory structures:
Final Discussion
This proposal is meant to establish whether want to move the wallet to use a new database implementation for its chain data.
If this is deemed as a reasonable goal, then the design in this proposal may be tweaked as necessary and implemented as a new independent package for testing. It can then be integrated into the wallet, with appropriate code to migrate the state of old wallets.
Appendix 1: Alternative Index File Design
Instead of storing the index by height (one entry per block), we may opt to store the index by reduced hash. Assume we use 24 bits of the reduced hash as key. We can create a file with 2^24 entries, each taking 16 bytes for a total of ~268MB. Each entry in this file would correspond to the presence of a block with that reduced hash value.
If the entry is all zeroes, that block does not exist. Otherwise, it may be set to one of two values (appropriately flagged):
The second value type is necessary when 24 bits are not sufficient to disambiguate between different blocks (i.e. more than one block shares the same reduced hash key). As of height 885208, there are 22815 blocks sharing the first 24 bits of their reduced hash key, with the max number of blocks sharing the same prefix being 4.
This design sacrifices disk space and query performance for reduced RAM usage.
Appendix 2: Use of Syscalls, MMap, and io_uring
The natural implementation for this DB would be using standard syscalls for file read/write. The writes are small enough and (apart from IBD) infrequent enough that using memory mapped files, io_uring or any other more advanced IO APIs should not be necessary (specially considering we need to fsync the final writes for durability).
However, it would be ideal if the data layer of the DB could be abstracted enough to support any IO API. This will need to be investigated.
Appendix 3: Use of Chain Data by Multiple Wallets
The same chain data could be used by multiple wallets (though not simultaneously) easily: during DB startup, the LTA and STA files would be locked by the running instance, ensuring only a single wallet instance can use them.
Then, after the wallet has gained exclusive access to these files, it can work as usual. During wallet startup (after DB startup), if it detects its rescan point (which is NOT stored in the chain DB) is earlier than the best chain header, this means some other wallet already updated the chain and the starting wallet should perform appropriate rescan and transaction sync duties.
dcrwallet
could be configured to use a separate (global) dir for chain data, such thatappdata
and this newchaindata
could be configured separately.This would help multi-wallet apps (such as Decrediton) by reducing the total amount of space needed for each wallet (about 1GB with the current bbolt implementation on mainnet).
Appendix 4: Concurrent Access by Multiple DB Instances
This is a tricky subject, in particular because it would allow the chain to be updated "externally" from the point of view of a running wallet: wallet A could add a block to the chain that wallet B would have to be notified about.
This would require much larger changes in the wallet, so that it was driven by the chain DB (as a syncer) instead of the existing syncer packages (RPC and SPV).
If this is deemed as a worthy goal, we can discuss further requirements and challenges, but this is currently listed as an additional feature.
Appendix 5: Alignment of CFilter data in disk
Needed? Desired?
Appendix 6: Choice of checksum function
Ideally, this would be a rolling hash function so that on new blocks, the data for the last block is removed while the data for new block is added. This needs to be investigated.
The text was updated successfully, but these errors were encountered: