Skip to content

feat: Kona Rollup Node Architecture Doc #264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
291 changes: 291 additions & 0 deletions protocol/kona-node-arch.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,291 @@
# Kona Rollup Node Architecture

# Summary

The Kona Rollup Node (aka `kona-node`) is a Rust implementation
of the OP Stack rollup node that functions as a Consensus Layer
client for OP Stack chains.

It builds upon the architectural principles of the go-based `op-node`
while introducing improvements in performance, memory safety,
and concurrency through Rust's language features. The `kona-node` will
be responsible for the same core functions as `op-node`: building,
relaying, and verifying the canonical chain of blocks, working in
conjunction with an execution layer client like `op-geth` and `op-reth`.

This design leverages existing work in the Rust ecosystem, particularly
the Kona monorepo for OP Stack types, components, and services, while
introducing a modular architecture that allows for component-level
customization and optimization. The implementation will prioritize
compatibility with the OP Stack protocol while exploring Rust-specific
optimizations for improved performance and resource utilization.

# Problem Statement + Context

The current OP Stack ecosystem relies on a single implementation of the
rollup node (`op-node`) written in Go. While functional, this creates
several challenges:

1. **Single Implementation Risk**: Relying on a single implementation
introduces systemic risk if critical bugs or vulnerabilities are discovered.
Testing the `op-node` is limited to unit tests, action tests, and integration
tests. The individuals implementing features are usually the same ones writing
tests.

2. **Proof Coverage**: "Stage 1.4" or the `kona-proof` implementation uses its own
derivation pipeline implementation. Unlike the `op-node` derivation pipeline,
Kona's derivation pipeline only has test coverage through the Optimism monorepo's
**proof** "action tests".

3. **Performance Limitations**: go-ethereum (aka "geth") was built as a singular
piece of client software. It wasn't built to be extensible for rollups. On the other
hand, reth was built as an sdk with modularity as a first-class citizen. As such,
`op-geth` and the `op-node` are limited in performance by design.

4. **Language-Specific Constraints**: Go's garbage collection and concurrency
model may introduce performance bottlenecks in specific high-throughput scenarios.
Rust's ownership model and lack of garbage collection can provide more predictable
performance characteristics, especially under load.

5. **Ecosystem Growth**: A growing number of Rust crates implement
OP Stack specifications. Some of these are used in `op-reth` as well as "op-revm".
There's an increasing security risk to rely on these implementations without having
increased coverage through an alternate rust rollup node implementation.

The OP Stack protocol itself has been designed with client diversity in mind.
Clear specifications allow for multiple implementations in different languages.
The Rollup Node specification provides a comprehensive guide for implementing a
compatible client, making a Rust implementation both feasible and valuable.

# Proposed Solution

The `kona-node` will be a modular Rust implementation of the rollup node that
adheres to the OP Stack protocol specifications while leveraging Rust's strengths
in performance, safety, and concurrency. The implementation will:

- **Maintain Protocol Compatibility**
- **Leverage Rust Ecosystem Crates**
- **Optimize for Performance**
- **Enhance Modularity**
- **Improve Testability**

## Architecture

### Building Blocks from First Principles

The Rollup Node is a core component of the OP Stack.
It is responsible for constructing the canonical safe L2 chain.
In order to produce the chain, the Rollup Node listens to two sources
of information.

1. **Data Availability Layer**: As new L1 blocks are produced, the receipts
are parsed for new L2 chain updates. L2 inputs (L2 transaction batches + deposits)
are then derived from the data availability (DA) layer.

2. **L2 Sequencer**: The L2 sequencer produces unsafe L2 blocks and sends them over
p2p gossip to other rollup nodes.

What this looks like currently is two sources feeding into derivation,
somehow producing the L2 Chain.

```
┌────────────┐
│L2 Sequencer│
│ ├────────────────────┐
│ Gossip │ │ ┌────────────┐
└────────────┘ │ │ │
├──► │ ??? │──► < L2 Chain >
┌────────────┐ ┌────────────┐ │ │ │
│ DA │ │ │ │ └────────────┘
│ ├──► │ Derivation ├──┘
│ Watcher │ │ │
└────────────┘ └────────────┘
```

From these sources, the rollup node imports "unsafe" blocks from the L2 sequencer
and safe blocks from the L2 derivation pipeline. Both unsafe and safe blocks
are imported into the L2 execution layer via the Engine API.

```
┌────────────┐
│L2 Sequencer│
│ ├────────────────────┐
│ Gossip │ │ ┌────────────┐
└────────────┘ │ │ │
├──► │ Engine API │──► < L2 Chain >
┌────────────┐ ┌────────────┐ │ │ │
│ DA │ │ │ │ └────────────┘
│ ├──► │ Derivation ├──┘
│ Watcher │ │ │
└────────────┘ └────────────┘
```

The [Holocene Hardfork][holocene] introduced steady block derivation.
This change allows a payload to be replaced with a deposits-only block
to allow the L2 chain to progress even if an invalid payload is derived.

Steady block derivation has significant architectural impacts for the
rollup node. With Holocene, the derivation component needs to be updated
if a payload is replaced with a deposits-only block. This requires the
Engine API component to send a message to the derivation component,
introducing an edge in the modular component graph.

[holocene]: https://specs.optimism.io/protocol/holocene/derivation.html

```
┌────────────┐
│L2 Sequencer│
│ ├───┐
│ Gossip │ │ ┌────────────┐ ┌────────────┐
└────────────┘ │ │ │ │ │
├──►│ Derivation │──►│ Engine API │──► < L2 Chain >
┌────────────┐ │ │ │ │ │
│ DA │ │ └────────────┘ └──────┬─────┘
│ ├───┘ ▲ │
│ Watcher │ └────────────────┘
└────────────┘
```

### A Modular Architecture

The components described above are all critical to the rollup node.
Whether these components are part of the architecture is not up for discussion.
What this document addresses is how the components communicate. Namely, do
components share memory or use an actor-based architecture with message-passing.
The remainder of this section will provide reasoning to choose the latter.

In an actor-based system, each component is constructed as an "Actor". There's
a Derivation Actor, DA Watcher Actor, P2P Actor (L2 sequencer gossip),
and Engine Actor.

Instead of having some top-level object "own" components, actors are spawned
as threads, and communication between actors happens through channels with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these channels synchronous? If not, what's the queueing policy and how do events/ message receipts handled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The answer is it really depends on the actors and the relationships between those actors. Since the channels don't have to be uniform, I haven't specified anything here. Most likely, it'll change as we go along.

messages. In this architecture, a top-level layer "orchestrates" the various
actor to handle wiring up communication between actors. The primary benefit
of using an actor-based system is tasks can actually be parallelized. As opposed
to kona's derivation pipeline which uses an ownership model where each stage
in the derivation pipeline owns the previous stage. Because of this ownership,
derivation pipeline work is single-threaded and blocking. Pipeline stages cannot
make progress in parallel.

If the orchestration layer is called "Rollup Node", a full modular architecture
for the Kona Rollup Node looks like the following.

```
┌────────────────────────────────────────────────────────────────────┐
│ │
┌───┤ Rollup Node Service │
│ │ │
│ └──────────────────────────┬────────────────┬─────────────────┬──────┘
│ │ │ │
│ ┌────────────┐ │ │ │
│ │L2 Sequencer│ │ │ │
├─► │ ├───┐ ▼ ▼ ▼
│ │ Gossip │ │ ┌────────────┐ ┌────────────┐ ┌───────────┐
│ └────────────┘ │ │ │ │ │ │ │
│ ├──►│ Derivation │──►│ Engine API │ │ Rpc Actor │
│ ┌────────────┐ │ │ │ │ │ │ │
│ │ DA │ │ └────────────┘ └──────┬─────┘ └───────────┘
└─► │ ├───┘ ▲ │
│ Watcher │ └────────────────┘
└────────────┘
```

Notice, the "Rpc Actor" doesn't have communication lines drawn to other actors.
This is because the JSON RPC server has rpc modules that it registers which
handle various rpc requests. In effect, each actor registers its rpc module
with the Rpc Actor at construction. The rpc modules in turn grab or perform
actions on other actors via message passing. Effectively, the Rpc Actor
messages with all other actors.

### Sequencer vs Validator Mode

Up to this point, Kona's rollup node architecture has been considered
from the perspective of a _validator_ node. That is, a rollup node
that only derives the l2 chain, but doesn't _build_ (aka "sequence")
the L2 chain. In order to extend support for sequencing the L2 chain,
the rollup node needs to accept transactions, build l2 blocks, and
broadcast these blocks as gossip over the p2p network.

Part of the architecture to consider is being able to plug in
"block building" through a simple API. The `op-node` separates
the sequencer logic from the rest of the node driver, allowing
it to toggle sequencing on and off via a simple CLI flag. Kona
can take this one step further. Using a minimal API, the
`kona-node` should allow sequencing to be toggled on and off,
but also let users easily slot in their own block building and
sequencing logic using the given API.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a really interesting concept. Sounds like this would be pretty interesting for sequencers that would want to have custom MEV algorithms or those that want to do some advanced profit estimation based on transaction inclusion. Are these the use-cases you thought of? Or did you have something different in mind

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, we should definitely seek to remove the rollup-boost shim in favor of a native layer that allows for custom block building functionality, as an example. Things like direct rbuilder integration would be incredible.


Consider sequencing for the `kona-node`. A new "Sequencer"
actor would be introduced, extending the architecture as follows.

```
┌────────────────────────────────────────────────────────────────────────────────────┐
│ │
┌───┤ Rollup Node Service │
│ │ │
│ └──────────────────────────┬────────────────┬────────────────┬────────────────┬──────┘
│ │ │ │ │
│ ┌────────────┐ │ │ │ │
│ │ DA │ │ │ │ │
├─► │ ├───┐ ▼ ▼ ▼ ▼
│ │ Watcher │ │ ┌────────────┐ ┌────────────┐ ┌───────────┐ ┌───────────┐
│ └────────────┘ │ │ │ │ │ │ │ │ │
│ ├──►│ Derivation │──►│ Engine API │ │ Sequencer │ │ Rpc Actor │
│ ┌────────────┐ │ │ │ │ │ │ │ │ │
│ │L2 Sequencer│ │ └────────────┘ └──────┬─────┘ └─┬───┬─────┘ └───────────┘
└─► │ ├───┘ ▲ │ ▲ │ │
│ Gossip │ └────────────────┘ └───────┘ │
└────────────┘ │
▲ │
└──────────────────────────────────────────────────────┤
┌───────────┐
│ │
│ Conductor │
│ │
└───────────┘
```

Similarly to the `op-node`, the `kona-node` should make use of the
Conductor abstraction (the `op-conductor` is the implementation).
The conductor API allows the sequencer actor to commit unsafe
payloads to the L2 chain. It also acts as an auxiliary service to
the Rollup Node, performing leader election and chain reconciliation.
Regardless, consensus and `op-conductor` guarantees are periphery
to this document. The key idea here is for `kona-node` to use the
same `op-conductor` abstraction.

# Alternatives Considered

Instead of allowing actors to "pass messages" between each other, it was
considered to route all messages through the orchestration service.

Concretely, this would look like a large-variant enum that holds all
message types. Actors would then only send and receive messages from the
"Rollup Node Service" orchestrator. This more closely resembles the `op-node`
architecture where "derivers" are registered via a "registry". Various
event types are emitted and broadcasted to rollup node components.

While this works effectively for the `op-node` it introduces significant
overhead and risk for Kona's Rollup Node. Since the `kona-node` is
parallized, mishandling or even spontaneous flakes where messages are dropped,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
parallized, mishandling or even spontaneous flakes where messages are dropped,
parallelized, mishandling or even spontaneous flakes where messages are dropped,

I think..?

can result in an unrecoverable deadlock. By establishing messaging channels
directly between actors, there's less "surface area" for message passing
to be improperly configured.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think insisting on the parallelization is a great idea. I was wondering if the optimism spec was built to handle internal component parallelism or if this is something we should be careful of in the implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be something we need to be very careful with. The good thing is we have so many methods of stress testing this architecture through kurtosis, local syncing, action tests, etc etc against a matrix of chains. Testing early is how I hope we can find any architecture-related bugs quicker.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parallelism for op-node is in a draft PR here: ethereum-optimism/optimism#11100


# Risks & Uncertainties

Actor-based systems aren't a free lunch. Where parallelization is introduced,
so is the ability for actors to become stuck in a deadlock. Given the cycle
between the Engine Actor and Derivation Actor, this is a real risk. You can
imagine the Derivation Actor isn't reset when a holocene deposits-only block
is created. Meanwhile, the Engine Actor is waiting for safe blocks from the
derivation pipeline.

In a similar way, the `op-node`'s event system architecture also has a risk
of deadlock. With strongly typed messages and explicit well-tested message
handling and emission, this risk is mitigated. One way to extend this testing
to the Kona Rollup Node is to support node action tests like proof action
tests support running the `kona-proof`.