[ENH][wal3] CPU parts of garbage collection #4617

Open: wants to merge 9 commits into base: main
1 change: 1 addition & 0 deletions Cargo.lock

6 changes: 6 additions & 0 deletions rust/log-service/src/lib.rs
@@ -1999,6 +1999,7 @@ mod tests {
writer: "TODO".to_string(),
acc_bytes: 0,
setsum: Setsum::default(),
collected: Setsum::default(),
snapshots: vec![],
fragments: vec![Fragment {
seq_no: FragmentSeqNo(1),
@@ -2129,6 +2130,7 @@ mod tests {
writer: "TODO".to_string(),
acc_bytes: 0,
setsum: Setsum::default(),
collected: Setsum::default(),
snapshots: vec![],
fragments: vec![Fragment {
seq_no: FragmentSeqNo(1),
@@ -2144,6 +2146,7 @@ mod tests {
writer: "TODO".to_string(),
acc_bytes: 0,
setsum: Setsum::default(),
collected: Setsum::default(),
snapshots: vec![],
fragments: vec![Fragment {
seq_no: FragmentSeqNo(1),
@@ -2232,6 +2235,7 @@ mod tests {
writer: "TODO".to_string(),
acc_bytes: 0,
setsum: Setsum::default(),
collected: Setsum::default(),
snapshots: vec![],
fragments: vec![Fragment {
seq_no: FragmentSeqNo(1),
@@ -2247,6 +2251,7 @@ mod tests {
writer: "TODO".to_string(),
acc_bytes: 0,
setsum: Setsum::default(),
collected: Setsum::default(),
snapshots: vec![],
fragments: vec![Fragment {
seq_no: FragmentSeqNo(1),
@@ -2318,6 +2323,7 @@ mod tests {
writer: "TODO".to_string(),
acc_bytes: 0,
setsum: Setsum::default(),
collected: Setsum::default(),
snapshots: vec![],
fragments: vec![Fragment {
seq_no: FragmentSeqNo(1),
2 changes: 2 additions & 0 deletions rust/wal3/Cargo.toml
@@ -25,3 +25,5 @@ chroma-config.workspace = true

[dev-dependencies]
guacamole = { version = "0.11", default-features = false }
proptest.workspace = true
rand.workspace = true
75 changes: 56 additions & 19 deletions rust/wal3/README.md
@@ -96,17 +96,33 @@ The Manifest is a JSON file that contains the following fields:
- start: The lowest log position in the fragment. Note that this embeds time and space.
- limit: The lowest log position after the fragment. Note that this embeds time and space.
- setsum: The setsum of the log fragment.
- snapshots: A list of snapshots. These are like interior nodes of a B+ tree and refer to
fragments that are further in the past. Each snapshot contains the following fields:
- path_to_snapshot: The path to the snapshot relative to the root of the log. Similar to fragments, the
full path is specified so that any bugs or changes in the path layout don't invalidate
previously-written logs.
- depth: The maximum number of snapshots between this snapshot and the fragments that serve as
leaf nodes for the tree.
- setsum: The setsum of the snapshot. This uniquely identifies the data to the degree that
sha3 does not collide.
- start: The offset of the first record maintained by this snapshot.
- limit: The offset of the first record too new to be maintained within this snapshot.
- writer: A plain-text string for debugging which process wrote the manifest.

Invariants of the manifest:

- The setsum of all fragments in a manifest plus `pruned` must add up to the `setsum` of the
manifest.
- The setsum of all snapshots+fragments in a manifest plus `pruned` must add up to the `setsum` of
the manifest.
- fragments.seq_no is sequential.
- fragment.start < fragment.limit for all fragments.
- fragment.start is strictly increasing.
- The range (fragment.start, fragment.limit) is disjoint for all fragments in a manifest. No two
fragments cover the same log position.
- snapshot.start < snapshot.limit for all snapshots.
- snapshot.start is strictly increasing.
- The range (snapshot.start, snapshot.limit) is disjoint for all snapshots in a manifest. No two
snapshots cover the same log position. Children of a snapshot will be wholly contained
within the snapshot.
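
The fragment invariants above can be sketched as a small validation pass. This is a minimal illustration, not wal3's implementation: `FragmentRange` and `check_fragments` are hypothetical names, sequence numbers and log positions are modeled as plain `u64`, each range is assumed to be half-open (`start` inclusive, `limit` exclusive), and the setsum invariant is left out.

```rust
/// Hypothetical, simplified stand-in for a fragment entry in the manifest.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FragmentRange {
    pub seq_no: u64,
    pub start: u64,
    pub limit: u64,
}

/// Checks the ordering invariants on a manifest's fragment list.
/// Returns a description of the first violation found, if any.
pub fn check_fragments(fragments: &[FragmentRange]) -> Result<(), String> {
    // fragment.start < fragment.limit for all fragments.
    for f in fragments {
        if f.start >= f.limit {
            return Err(format!("fragment {}: start >= limit", f.seq_no));
        }
    }
    for pair in fragments.windows(2) {
        let (prev, next) = (&pair[0], &pair[1]);
        // fragments.seq_no is sequential.
        if Some(next.seq_no) != prev.seq_no.checked_add(1) {
            return Err(format!("fragment {}: seq_no is not sequential", next.seq_no));
        }
        // fragment.start is strictly increasing.
        if next.start <= prev.start {
            return Err(format!("fragment {}: start is not increasing", next.seq_no));
        }
        // Ranges are disjoint (assumed half-open, so touching ranges are fine).
        if next.start < prev.limit {
            return Err(format!("fragment {}: overlaps its predecessor", next.seq_no));
        }
    }
    Ok(())
}
```

The same pass would apply to snapshots by substituting `snapshot.start` and `snapshot.limit`, minus the sequence-number check.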

### Cursor Structure

@@ -117,6 +133,21 @@ A cursor is a JSON file that contains the following fields:
microseconds since UNIX epoch.
- writer: A plain-text string for debugging which process wrote the cursor.

### Garbage File

A garbage file specifies a set of files to delete, a set of files to replace, and the hierarchical
structure that attributes each node in the tree to its parent. Conceptually, it mirrors the tree of
fragments maintained by the manifest. This hierarchy is necessary to capture the fact that the
setsum of a snapshot includes the setsum of its children. Deleting a file therefore requires
adjusting setsums up and down the tree.

The garbage collector reads the manifest and writes the garbage file. The active writer then picks
up the garbage file and applies it to the manifest on its next write.

This is done to avoid stressing the log contention path; it is not intended for a wal3 writer to
garbage collect the same log that another wal3 writer is actively working on. The result is safe
and durable, but liveness may be impacted.
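
To illustrate why deleting a file means adjusting setsums up and down the tree, here is a toy model. Everything in it is an assumption made for illustration: `ToySum` is a wrapping `u64` sum standing in for the real setsum type, `ToySnapshot` and `ToyManifest` are hypothetical, and the `collected` field name is borrowed from the `Manifest` changes in this PR on the assumption that it accumulates the setsum of everything garbage collected.

```rust
use std::ops::{Add, Sub};

/// Toy additive checksum: a wrapping u64 sum, NOT the real setsum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct ToySum(pub u64);

impl Add for ToySum {
    type Output = ToySum;
    fn add(self, rhs: ToySum) -> ToySum {
        ToySum(self.0.wrapping_add(rhs.0))
    }
}

impl Sub for ToySum {
    type Output = ToySum;
    fn sub(self, rhs: ToySum) -> ToySum {
        ToySum(self.0.wrapping_sub(rhs.0))
    }
}

fn total(sums: impl Iterator<Item = ToySum>) -> ToySum {
    sums.fold(ToySum::default(), |acc, s| acc + s)
}

/// Hypothetical snapshot: its setsum covers the fragments beneath it.
pub struct ToySnapshot {
    pub setsum: ToySum,
    pub fragments: Vec<ToySum>,
}

/// Hypothetical manifest root with a running total of collected data.
pub struct ToyManifest {
    pub setsum: ToySum,
    pub collected: ToySum,
    pub snapshots: Vec<ToySnapshot>,
}

impl ToyManifest {
    /// Every snapshot must equal the sum of its children, and the live
    /// snapshots plus `collected` must equal the manifest's setsum.
    pub fn balanced(&self) -> bool {
        let children_ok = self
            .snapshots
            .iter()
            .all(|s| total(s.fragments.iter().copied()) == s.setsum);
        let live = total(self.snapshots.iter().map(|s| s.setsum));
        children_ok && live + self.collected == self.setsum
    }

    /// Drops the oldest fragment under one snapshot, adjusting the parent
    /// snapshot (down) and the manifest's collected total (up).
    pub fn collect_oldest(&mut self, snapshot_idx: usize) -> Option<ToySum> {
        let snap = self.snapshots.get_mut(snapshot_idx)?;
        if snap.fragments.is_empty() {
            return None;
        }
        let dropped = snap.fragments.remove(0);
        snap.setsum = snap.setsum - dropped;
        self.collected = self.collected + dropped;
        Some(dropped)
    }
}
```

In this model the manifest's own `setsum` never changes during collection; only the split between live data and `collected` moves, which is why both the parent and the root must be updated together.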

## Object Store Layout

wal3 is designed to maximize the performance of object stores like S3 because it writes
@@ -141,6 +172,7 @@ wal3/log/Bucket=15000/FragmentSeqNo=15000.parquet
...
wal3/manifest/MANIFEST.json
wal3/snapshot/SNAPSHOT.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
wal3/garbage/GARBAGE
```

## Writer Arch Diagram
@@ -226,8 +258,8 @@ like:
3. Read all cursors again; if changed, goto 1.
4. Select the minimum timestamp across all cursors as the garbage collection cutoff.
5. Write a list of snapshots and fragments that hold data strictly less than the cutoff to a file
named `gc/GARBAGE.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX` where the
hex digits are the setsum of the garbage.
named `garbage/GARBAGE`. There can be only one gc in progress at a time, so gc is kicked off by
running put-if-not-exist on `garbage/GARBAGE`.
6. Wait until the writer writes a manifest that does not contain the garbage's fragments.
7. Wait a sufficiently long time so that readers cannot see the fragments.
8. Slow-delete the contents of the garbage file.
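
Steps 4 and 5 above can be sketched as follows. This is a simplified illustration under stated assumptions: `GcCandidate`, `gc_cutoff`, and `select_garbage` are hypothetical names, cursor values and log positions are modeled as directly comparable `u64`s, and an empty cursor set is assumed to mean that nothing is safe to collect.

```rust
/// Hypothetical, simplified view of a fragment for GC selection.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GcCandidate {
    pub seq_no: u64,
    pub start: u64,
    pub limit: u64,
}

/// Step 4: the cutoff is the minimum across all cursors.
/// Returns None when there are no cursors (assumed: collect nothing).
pub fn gc_cutoff(cursors: &[u64]) -> Option<u64> {
    cursors.iter().copied().min()
}

/// Step 5: pick fragments whose data lies strictly below the cutoff.
/// `limit` is the first position after the fragment, so the whole
/// fragment is below the cutoff exactly when `limit <= cutoff`.
pub fn select_garbage(fragments: &[GcCandidate], cutoff: u64) -> Vec<u64> {
    fragments
        .iter()
        .filter(|f| f.limit <= cutoff)
        .map(|f| f.seq_no)
        .collect()
}
```

A fragment that straddles the cutoff is kept, since some cursor may still need part of it.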
@@ -362,9 +394,10 @@ first level of interior nodes point to the second level, and that level points t

This is, strictly speaking, an optimization, but one that will allow us to scale the log to beyond
all foreseeable current requirements. 20-25 pointers in the root, or 2kB are all that's needed to
capture a log that's more than a petabyte in size. Compare that to 5k pointers or 329kB for a
single manifest. We're dealing with kilobytes per manifest for a log that's petabytes, but when
each manifest targets < 1MB in size, the difference at write time is apparent in the latency.
capture a log that's more than a petabyte in size if the log is written at maximum batch size.
Compare that to 5k pointers or 329kB for a single manifest. We're dealing with kilobytes per
manifest for a log that's petabytes, but when each manifest targets < 1MB in size, the difference at
write time is apparent in the latency.

Consequently, the manifest and its transitive references will be a four-level tree.

@@ -385,18 +418,6 @@ root
└── fragment_9
```

### Interplay Between Garbage Collection and Snapshots

The manifest compaction strategy is designed to reduce the cost of writing the manifest, but it
incurs a cost for garbage collection. To garbage collect an arbitrary prefix of the log
fragment-by-fragment would require rewriting the snapshots that partially cover the prefix and
contain data that is not to be garbage collected. This is complex.

To side-step this problem we will introduce intentional fragmentation of the manifest and snapshots
to align to the garbage collection interval. This will guarantee that at most one interval worth of
garbage that could be compacted is left uncompacted. In practice, this means constructing fragments
such that they are pre-aligned to garbage collection boundaries.

### Interplay Between Snapshots and Setsum

The setsum protects the snapshot mechanism. Each pointer to a snapshot embeds within the pointer
@@ -558,6 +579,22 @@ will indicate a problem.
To do this, we will construct an end-to-end, variable throughput test that we can run against wal3
to ensure that data written is readable exactly as written.

## Error Handling

Garbage collection is designed to be conservative:

- **Partial Failures**: If any step fails, no fragments are deleted
- **Verification Failures**: Setsum mismatches abort the entire operation
- **Timeout Handling**: Long-running operations have configurable timeouts
- **Retry Logic**: Transient failures trigger exponential backoff retries
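
As a rough illustration of the retry behavior described above, the following sketch shows capped exponential backoff using only the standard library. It is not taken from wal3: `backoff_delay` and `retry` are hypothetical helpers, the loop is synchronous, and every error is treated as transient, which real code would need to refine.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// saturating at `cap`.
pub fn backoff_delay(base: Duration, cap: Duration, attempt: u32) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// Runs `op` until it succeeds or `max_attempts` calls have failed,
/// sleeping with exponential backoff between calls. `op` always runs
/// at least once.
pub fn retry<T, E>(
    max_attempts: u32,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(backoff_delay(base, cap, attempt));
                attempt += 1;
            }
        }
    }
}
```

Capping the delay keeps a long outage from stretching a single wait indefinitely; production code would usually add jitter as well.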

## Performance Considerations

- **Batch Operations**: Fragment listing and deletion use batch APIs for efficiency
- **Incremental GC**: Large logs can be garbage collected incrementally over time
- **Background Processing**: GC runs asynchronously without blocking writers or readers
- **Resource Limits**: Configurable limits prevent GC from overwhelming object storage

# Multiple wal3 Instances

Thus far we've presented wal3 as if it is a singleton. In this section, we look at considerations
2 changes: 2 additions & 0 deletions rust/wal3/src/copy.rs
@@ -116,10 +116,12 @@ pub async fn copy(
.iter()
.map(|x| x.setsum)
.fold(Setsum::default(), |x, y| x + y);
let collected = Setsum::default();
let acc_bytes = snapshots.iter().map(|x| x.num_bytes).sum::<u64>()
+ fragments.iter().map(|x| x.num_bytes).sum::<u64>();
let manifest = Manifest {
setsum,
collected,
acc_bytes,
writer: "copy task".to_string(),
snapshots,