# Operation Log Compaction
Note: This article describes unfinished ideas and possibly bad approaches.
Deletion is difficult in distributed systems, especially in peer-to-peer environments. The Mytosis system is susceptible to unbounded growth in at least one way, and the addition of types like `Map` or `Document` would only exacerbate the problem. As a concrete example, you can add an unlimited number of records, but you can't truly delete them. You have to use a tombstone. Tombstones are lighter on space, but disk usage is still unbounded.
Stating the obvious: if you truly delete a record from your database, if even a single peer didn't get the memo, they'll try to sync it back to you. There's no way to distinguish deleted data from new data. You have to propagate deletions as persisted operations (aka tombstones).
That's an attack vector. A malicious actor could submit a billion new records and the best we can do is tombstone each one. Same thing if an app goes viral. A large number of new records could be created, more than your typical client device allows, effectively ruining the application. No client device would have enough resources to run it.
If Mytosis adds support for `Map` or `Document` types, the problem moves from the disk into RAM. Values can grow in size, not just keys.
Let's assume Mytosis derives database state from operations in an ordered tree. Each operation is something like `create record "id"` or `set "field" on record "id" to value "..."`. Operations are replicated between peers. The implementation of operations isn't relevant; the point is that the operation tree (the "op log") is a grow-only set.
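As a concrete sketch of that model (the operation shapes and class names here are illustrative assumptions, not Mytosis APIs):

```python
from dataclasses import dataclass

# Hypothetical operation shape -- Mytosis doesn't define these exact types.
@dataclass(frozen=True)
class Op:
    kind: str            # e.g. "create" or "set"
    record: str          # record ID the operation targets
    payload: tuple = ()  # ("field", "value") for set operations

class OpLog:
    """A grow-only set of operations: you can add, never remove."""
    def __init__(self):
        self._ops: set[Op] = set()

    def append(self, op: Op) -> None:
        self._ops.add(op)

    def merge(self, other: "OpLog") -> None:
        # Replication is just a set union -- ordering is derived elsewhere.
        self._ops |= other._ops

log_a, log_b = OpLog(), OpLog()
log_a.append(Op("create", "id"))
log_b.append(Op("set", "id", ("field", "...")))
log_a.merge(log_b)
```

Because the log only ever unions, any two peers that exchange their sets converge on the same contents, which is exactly why nothing ever shrinks.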
The objective is to prune operations in a replication-safe way so the op log doesn't grow infinitely in size.
The problem is fundamentally one of consistency. We don't want to purge an operation from the disk until everyone agrees that it should be purged from the disk. Otherwise it'll just reappear the next time a device rejoins the network.
Just about every idea uses the following notion of compaction:
- Operations have an ID which is the sha-256 value of the operation content.
- Select a set of operations to compact and order them deterministically.
- Generate an ID by concatenating all source operation IDs and taking the sha-256 value.
- Using the source operation set, derive a state snapshot and create an operation which will generate it.
- Assign the ID to the new operation. Replace the source operation set with the new, generated operation.
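The steps above can be sketched as follows (the hex encoding and sort-based ordering are assumptions; the article doesn't fix a canonical form):

```python
import hashlib

def op_id(op_content: bytes) -> str:
    # Operation ID: the sha-256 of the operation content.
    return hashlib.sha256(op_content).hexdigest()

def compaction_id(source_ids: list[str]) -> str:
    # Order the source set deterministically, concatenate the IDs,
    # and hash the result. Every peer holding the same set of source
    # operations derives the same compaction ID.
    joined = "".join(sorted(source_ids))
    return hashlib.sha256(joined.encode()).hexdigest()

ids = [op_id(b'create record "id"'), op_id(b'set "field" ...')]
# Same set, any arrival order -> same compaction ID.
assert compaction_id(ids) == compaction_id(list(reversed(ids)))
```

Since the compaction ID is a hash over its children's hashes, verifying that a compaction came from a claimed set of operations is just recomputing the hash.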
That approach has a few advantages:
- Given the same set of operations, all peers will produce an identical compaction operation.
- You can cryptographically verify whether a compaction was created from a specific set of operations (compactions are branches in a Merkle tree).
The real challenge is figuring out when to run compaction and what to do if a conflict happens.
Here's a summary of the ideas:
- Run compaction at predetermined intervals. On conflict, deterministically choose a winner.
- Problems: drops writes and causes irreconcilable merge conflicts with other peers under split-brain network partitions.
- Reference count writable handles to operations. Only compact when that count drops to zero.
- Problems: compaction never runs if a peer with a writable handle drops off the network permanently.
- Hard-code a constant `N`.
- Every `N` operations, compact operations from `0` to `N`.
- Skip the first round. There should always be a gap of at least `N` operations between you and a compaction.

If the constant `N` is 100, then on the 200th operation you'd compact 0-100. The result is 101 operations (1 compaction node + `N` gap).
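The trigger could look something like this (the window arithmetic is my reading of the scheme, not a spec, and it ignores that earlier rounds were already folded into a compaction node):

```python
N = 100  # hard-coded compaction constant

def should_compact(op_count: int) -> bool:
    # Compact every N operations, but skip the first round so there's
    # always a gap of at least N operations behind the newest write.
    return op_count >= 2 * N and op_count % N == 0

def compaction_range(op_count: int) -> tuple[int, int]:
    # Compact everything older than the trailing N-operation gap.
    return (0, op_count - N)

assert not should_compact(100)            # first round is skipped
assert should_compact(200)                # 200th operation triggers
assert compaction_range(200) == (0, 100)  # compact 0-100, keep the gap
```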
Unfortunately this creates tension with other peers. If you receive a compaction operation that disagrees with you, there still has to be a deterministic resolution. The most reliable way is to just pick one. That means that even though you maintain Strong Eventual Consistency, operations are still irreparably lost.
In addition, a malicious actor could craft a compaction operation which wins your conflict-resolution function, essentially replacing your entire database.
Not great.
- You can't write data unless you create an intent-to-write operation and include it with your read request.
- The receiving peer will store your intent-to-write alongside the most recent operation.
- Compaction occurs continuously unless a peer still has write access to that operation.
- If the receiving peer sees an update from you at a newer logical clock, that operation will replace your old intent-to-write, dropping the reference count by 1.
Downsides: if a peer sends an intent-to-write request and then permanently drops off the network, the write reference will never be cleared and compaction is impossible.
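A rough sketch of the bookkeeping on the receiving peer (class and method names are assumptions; the clock comparison is one plausible reading of "newer"):

```python
class WriteRefs:
    """Tracks intent-to-write references held against the op log."""
    def __init__(self):
        # peer ID -> logical clock of that peer's outstanding intent.
        self._intents: dict[str, int] = {}

    def register_intent(self, peer: str, clock: int) -> None:
        # Stored alongside the most recent operation at read time.
        self._intents[peer] = clock

    def record_update(self, peer: str, clock: int) -> None:
        # A write from the peer at a newer logical clock replaces its
        # old intent-to-write, dropping the reference count by one.
        if peer in self._intents and clock > self._intents[peer]:
            del self._intents[peer]

    def can_compact(self) -> bool:
        # Compaction only runs once no peer holds a write reference.
        return not self._intents

refs = WriteRefs()
refs.register_intent("laptop", clock=5)
assert not refs.can_compact()
refs.record_update("laptop", clock=6)
assert refs.can_compact()
```

The failure mode described above falls straight out of this sketch: if "laptop" registers an intent and never sends another update, `can_compact` stays false forever.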
That architecture might be ideal if you're the only user (e.g. replicating notes from a laptop to a phone), but it would be crippling in a larger system with more users.
Maybe there's a middle ground, like a centralized server that tracks nothing but client connectivity. If a client drops off the network for 6 months, permanently blacklist them after logical clock X. Purge the op log of anything newer and reject any future updates unless they renew their intent to write.
To be clear, I really don't like that idea. Mytosis loses a ton of value the instant it relies on a centralized service.
I'm not satisfied with those answers yet. Here are some fuzzier alternatives I'm still investigating:
- Do a Paxos exchange every `N` operations where all contributors to that operation set are elected voters. Requires rollback, which might not be possible 😬
- Replace Lamport timestamps with a fork-join logical clock (ITC, for example). The intent-to-write is formalized by `fork(...)`.
- Seek inspiration from blockchain consensus models (Bitcoin's longest chain doesn't fit, but there could be others).
- Only delete operations locally. Keep the full op log in backups (e.g. in IPFS) and reconcile those logs on compaction conflicts.