
indexer-alt: Alternative indexer architecture POC #19868

Merged: 33 commits into main from amnn/idx-poc on Oct 18, 2024

Conversation

amnn (Member) commented Oct 15, 2024

Description

A ground up rewrite/proof-of-concept of an alternative indexer pipeline, with the following major changes implemented, or planned:

Ingestion

  • Instrumented to count the number of bytes, checkpoints, transactions, events, and objects read, and their latencies.
  • Handles transient errors with back-off, and checkpoints that are not found yet with a constant delay before retrying (see the sketch after this section).
  • Doesn't impose a hard timeout, as that would only lead to wasted work (dropping partially fetched data and refetching it).
  • Hardcodes the remote store as the only ingestion source for now.
  • No longer writes anything to disk (no watermarking, no GC). Progress is tracked locally, and the service relies on whatever is running it to tell it where to pick back up from.
  • Re-uses futures::stream and tokio channels to handle communication and parallelism between ingestion processing and commit workers, which simplifies and decouples a lot of the implementation.
  • This also means there is no more reducer functionality: with each producer having its own committer, it was not necessary.
  • Tested using a mock HTTP server.

On my local machine, the new ingestion framework very slightly outperforms the existing one: 160 checkpoints per second at checkpoint 64M and 60 to 70 cps at checkpoint 9M, compared to 130 and 50 cps respectively on the existing framework. The new framework is limited by my local bandwidth.
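
For illustration, here is a minimal sketch of the retry policy described above -- transient errors back off, missing checkpoints wait a constant delay. The names (`FetchError`, `fetch_with_retry`) and the delay values are hypothetical, not the actual implementation:

```rust
use std::time::Duration;
use tokio::time::sleep;

/// Hypothetical outcome of a single fetch attempt against the remote store.
enum FetchError {
    /// The checkpoint has not been produced or uploaded yet.
    NotFound,
    /// Network hiccup, bad status, failed deserialization, etc.
    Transient,
}

/// Keep retrying a single checkpoint until it is successfully fetched:
/// transient errors back off (capped), while "not found" waits a constant
/// delay, since the checkpoint simply doesn't exist yet.
async fn fetch_with_retry<F, Fut>(mut fetch: F) -> Vec<u8>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<Vec<u8>, FetchError>>,
{
    let mut backoff = Duration::from_millis(100);
    loop {
        match fetch().await {
            Ok(bytes) => return bytes,
            Err(FetchError::NotFound) => sleep(Duration::from_millis(500)).await,
            Err(FetchError::Transient) => {
                sleep(backoff).await;
                backoff = (backoff * 2).min(Duration::from_secs(10));
            }
        }
    }
}
```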

Writer

  • Again, lots of instrumentation: Every pipeline automatically instruments the number of checkpoints it's processed, how many rows it's created, how long that took, how long it took to write them out, how many retries it took, etc.
  • Writes to each table are split into their own concurrent pipelines.
  • Pipelines commit data for checkpoints out-of-order and rely on keeping the watermark table up-to-date to maintain consistency for readers.
  • Uniform framework for defining pipelines, where each pipeline just describes how it processes, how it commits to the DB, and how to tune the concurrency parameters for its needs (see the trait sketch after this list).
  • Avoid cloning data as much as possible.
  • No more partition management: everything that was partitioned before is destined for the KV store, so we can do away with partitioning for now.
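
As a sketch of what such a uniform pipeline definition could look like -- the trait and method names below are illustrative (only `CHUNK_SIZE` also appears in the code quoted later in the review), and the placeholder types stand in for the real ones:

```rust
use async_trait::async_trait;

// Placeholders standing in for the real checkpoint data and DB connection types.
struct CheckpointData;
struct DbConnection;

/// Each pipeline describes how it turns a checkpoint into rows, how it
/// commits those rows to the DB, and how its batching/concurrency is tuned.
#[async_trait]
trait Handler {
    /// Pipeline name, used for metrics and the watermarks table.
    const NAME: &'static str;

    /// Maximum number of rows to write in a single DB statement.
    const CHUNK_SIZE: usize = 1000;

    /// Number of concurrent committers for this pipeline.
    const CONCURRENCY: usize = 1;

    /// The row type this pipeline produces.
    type Row: Send + Sync + 'static;

    /// Turn one checkpoint's worth of data into rows (pure computation).
    fn process(checkpoint: &CheckpointData) -> anyhow::Result<Vec<Self::Row>>;

    /// Write a batch of rows out, returning how many rows were written.
    async fn commit(rows: &[Self::Row], conn: &mut DbConnection) -> anyhow::Result<usize>;
}
```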

Schema

  • Removes redundancies kept around for backwards compatibility.
  • Makes naming conventions consistent across the codebase.
  • Applies all the schema changes we had been wanting to make.
  • Avoids redundant writes: the object contents do not need to be written to all of objects, objects_snapshot, objects_history and full_objects_history.
  • Tries as much as possible to separate tables that are overwritten from tables that are appended to. In particular, the objects table is split into objects (live object IDs, versions and owners), objects_type (append-only) and coins (only changes when coins are involved), to avoid marking more rows dead than necessary (see the sketch below).
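
A hypothetical sketch of that split, expressed as Diesel `table!` declarations; the column names and types here are illustrative only, not the actual schema:

```rust
// One row per live object: overwritten as versions and owners change.
diesel::table! {
    objects (object_id) {
        object_id -> Bytea,
        object_version -> Int8,
        owner_kind -> Int2,
        owner_id -> Nullable<Bytea>,
    }
}

// Append-only: an object's type never changes after creation.
diesel::table! {
    objects_type (object_id) {
        object_id -> Bytea,
        package -> Bytea,
        module -> Text,
        name -> Text,
    }
}

// Only touched when coins are involved, to limit the rows marked dead.
diesel::table! {
    coins (object_id) {
        object_id -> Bytea,
        coin_type -> Text,
        coin_balance -> Int8,
        owner_id -> Bytea,
    }
}
```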

To be done in the near term

Experiments that this POC now unlocks:

  • Setting up dashboards to visualise all the data coming from the indexer.
  • Running the indexer in production and tuning its parameters, to see how it compares with the existing pipeline in the best case.
  • Investigating whether @lxfind's finding about parallelism on writes to the DB holds up in this new architecture (in the old architecture, 5 parallel writers yielded the best performance; does that still hold true here, given that there is currently only one writer per table?).
  • COPY ... FROM -- can we use this instead of INSERT to speed up pipelines that write to "append-only" tables?
  • Splitting out key-value tables into their own DB.

To be done in the medium term

If this seems like a promising architecture, there is still much that needs to be done to make it a full indexer:

  • Pruning and watermarking
  • Starting from formal snapshot
  • Compatibility checks and running migrations
  • Integration into our testing setup
  • Integration into our existing readers
  • Completing the schema
  • Handing over between reading checkpoints from the remote store and from a fullnode, based on how close the indexer is to the tip of the network.

Test plan

Lots of experiments to do (and some new unit tests)


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:


lxfind (Contributor) left a comment:

Also curious: where do you think is the best place to instrument and log TPS in this implementation?

amnn (Member, Author) commented Oct 16, 2024

> Also curious: where do you think is the best place to instrument and log TPS in this implementation?

My thinking is that you should already be able to extract rates of transaction ingestion/processing/commit from the existing metrics:

  • The ingestion service tracks the number of transactions it sees coming in.
  • Each pipeline tracks the number of rows it produces from processing, and how many have been committed.

So for pipelines whose progress is measured in transactions, this should come out naturally from looking at the commit rate for their rows. For pipelines that are not inherently transaction-based, it makes slightly less sense to track transactions, but we can still do it, and I would do it by instrumenting the watermarking flow (i.e. each pipeline includes a gauge for its checkpoint and transaction high watermark). I guess this is why you were interested in the watermarking TODO?
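
As a rough sketch of what that instrumentation could boil down to with the prometheus crate (the metric and label names here are illustrative, not necessarily the ones the indexer uses):

```rust
use prometheus::{IntCounterVec, IntGaugeVec, Opts, Registry};

/// Per-pipeline counters and watermark gauges; TPS-style rates can then be
/// derived on the dashboard side from the counter's rate of change.
struct PipelineMetrics {
    committed_rows: IntCounterVec,
    watermark_checkpoint: IntGaugeVec,
    watermark_transaction: IntGaugeVec,
}

impl PipelineMetrics {
    fn new(registry: &Registry) -> prometheus::Result<Self> {
        let committed_rows = IntCounterVec::new(
            Opts::new("indexer_committed_rows", "Rows committed, by pipeline"),
            &["pipeline"],
        )?;
        let watermark_checkpoint = IntGaugeVec::new(
            Opts::new("indexer_watermark_checkpoint", "Checkpoint high watermark"),
            &["pipeline"],
        )?;
        let watermark_transaction = IntGaugeVec::new(
            Opts::new("indexer_watermark_transaction", "Transaction high watermark"),
            &["pipeline"],
        )?;

        registry.register(Box::new(committed_rows.clone()))?;
        registry.register(Box::new(watermark_checkpoint.clone()))?;
        registry.register(Box::new(watermark_transaction.clone()))?;

        Ok(Self { committed_rows, watermark_checkpoint, watermark_transaction })
    }
}
```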

crates/sui-indexer-alt/src/handlers/mod.rs (resolved)

let indexed = entry.get_mut();
let values = &mut indexed.values;
if batch_values.len() + values.len() > H::CHUNK_SIZE {
Contributor commented:

Today the indexer also splits commit batches in some special cases:

  • At epoch boundaries, so that each batch stays within one epoch. One reason for that is to avoid moving data among partitions -- extra complexity that partitioning introduced; as we discussed, I will experiment and see if we can get rid of that. Another potential reason might be that the epoch-end state is needed for some governance API -- do you know if we still have a dependency left there?
  • The intentional lag for objects_snapshot -- is that handled?

amnn (Member, Author) replied:

Let's leave objects_snapshot aside, because the way I would solve that is by adding an explicit buffer task on top of an existing objects pipeline. Let's also set partitioning aside, because all the tables this pipeline implementation caters to are either KV tables (which won't be in Postgres in the long run) or tables that aren't currently dynamically partitioned. But tell me more about the requirement related to the Governance API: I don't have enough context to answer that, and it may be an interesting exercise to see how it fits into this architecture.

Contributor replied:

Okay, we can leave partitioning and objects_snapshot aside for now. @emmazzz probably knows more about the Governance API.

crates/sui-indexer-alt/src/handlers/mod.rs (resolved)
wlmyng (Contributor) left a comment:

first pass

crates/sui-indexer-alt/src/lib.rs (resolved)
Comment on lines +177 to +181
// If an override has been provided, start ingestion from there, otherwise start ingestion
// from just after the lowest committer watermark across all enabled pipelines.
let first_checkpoint = self
.first_checkpoint
.unwrap_or(self.first_checkpoint_from_watermark);
Contributor commented:

Do we actually do this right now? We do set the watermark when adding a new pipeline, but Indexer's first_checkpoint_from_watermark doesn't seem to be updated based on the min committer watermark across all pipelines.

amnn (Member, Author) replied:

> Indexer's first_checkpoint_from_watermark doesn't seem to be updated based on the min committer watermark across all pipelines

What about L146-149 above?

crates/sui-indexer-alt/src/models/watermarks.rs (outdated, resolved)
gegaowp (Contributor) left a comment:

Looks good to me to merge as the first PR, with the planned changes as follow-ups.

## Description

Adding tests for the ingestion service. In particular, how it handles
various error scenarios, back-pressure, buffering, etc.

## Test plan

```
sui$ cargo nextest run -p sui-indexer-alt
```
## Description

Assume that whatever store the checkpoint data is being fetched from is
correct. If it was not possible to get a full response back from the
store, or that response could not be successfully deserialized, back off
and try again.

## Test plan

New unit test:

```
sui$ cargo nextest run -p sui-indexer-alt -- retry_on_deserialization_error
```
## Description

Add a label to the transient retries counter to mark the reason why
we're retrying, in case there are trends in particular retries.
## Description

Add a helper function to log retrying a fetched checkpoint. Updates a
counter, and traces the error and reason.
## Description

Recover from failures to send a request to the remote store in the first
place (Initially, I mistakenly thought that `send` would only fail if
the request was malformed, and that we would otherwise get a Response of
some kind, but this is not true).

## Test plan

New unit test:

```
sui$ cargo nextest run -p sui-indexer-alt -- retry_on_request_error
```
## Description

If ingestion is talking to a server that repeatedly produces transient
errors, it can get stuck. This change gives us a way to tell the
ingestion service, from the outside, to give up.

## Test plan

New unit tests:

```
sui$ cargo nextest run -p sui-indexer-alt -- fail_on_cancel
```
## Description

Initial set-up integrating a DB connection pool into the indexer.

## Test plan

Run the indexer and check that the stats from the DB pool are propagated
to Prometheus.
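
As a rough illustration of the set-up (the real code may use an async pool; the pool size here is illustrative), a Diesel r2d2 pool for Postgres might be constructed like this:

```rust
use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool};

/// Build a Postgres connection pool. Minimal sketch under assumed settings,
/// not the indexer's actual configuration.
fn db_pool(database_url: &str) -> Pool<ConnectionManager<PgConnection>> {
    let manager = ConnectionManager::<PgConnection>::new(database_url);
    Pool::builder()
        .max_size(16) // illustrative; would be tuned per deployment
        .build(manager)
        .expect("failed to build DB connection pool")
}
```

Pool statistics (e.g. `pool.state()`, which reports total and idle connections) can then be exported as metrics, which is what the test plan above checks.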
## Description

Handle the case where one part of the indexer panics and we need to
cleanly shutdown.

## Test plan

Introduced a panic in the dummy ingester and made sure everything winds
down correctly.
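
A minimal sketch of this kind of shutdown pattern, using `tokio::task::JoinSet` and `tokio_util::sync::CancellationToken`; the task bodies are placeholders, not the indexer's actual tasks:

```rust
use tokio::task::JoinSet;
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let cancel = CancellationToken::new();
    let mut tasks = JoinSet::new();

    // Placeholder workers: real tasks do their work until cancelled.
    for name in ["ingestion", "handler", "committer"] {
        let cancel = cancel.clone();
        tasks.spawn(async move {
            cancel.cancelled().await;
            println!("{name} wound down cleanly");
        });
    }

    // A task that simulates a panicking pipeline.
    tasks.spawn(async { panic!("simulated pipeline failure") });

    // If any task panics (or otherwise fails to join), cancel everything else
    // so the whole indexer winds down instead of hanging.
    while let Some(result) = tasks.join_next().await {
        if let Err(e) = result {
            if e.is_panic() {
                eprintln!("a task panicked; cancelling remaining tasks");
            }
            cancel.cancel();
        }
    }
}
```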
## Description

Introduce a framework for writing indexing pipelines and use it to add a
pipeline for checkpoints.

## Test plan

Tested the pipeline locally.
## Description
A pipeline to fill what was previously `full_objects_history`.

## Test plan
Pipeline was run locally.
## Description

Port over an idealised KV transactions table:

- Removes the following fields:
  - `transaction_digest`
  - `transaction_kind`
  - `success_command_count`
- Standardise naming for `tx_sequence_number` and `cp_sequence_number`.
- Standardise representation of arrays of BCS encoded values (use a
  BCS-encoded array instead).

So that, in the end, the schema matches the one we would eventually
store in the KV store.

This PR also re-implements balance change calculation (rather than
relying on the implementation in `sui-json-rpc`).

## Test plan

Pipeline run locally.
## Description

Introduce an `Indexer` type to reduce the boilerplate of introducing a
new pipeline, and gracefully shutting everything down.

This also prepares the codebase for watermarks support: The initial
checkpoint to start ingestion at can be calculated by taking the MIN
checkpoint in the watermark table across all pipelines.

## Test plan

Unit tests:

```
sui$ cargo nextest run -p sui-indexer-alt
```

And run the indexer locally.
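
For illustration, the watermark lookup might boil down to a Diesel query along these lines; the table and column names here are assumptions, not necessarily the real schema:

```rust
use diesel::dsl::min;
use diesel::prelude::*;

// Hypothetical watermarks schema; the real column names may differ.
diesel::table! {
    watermarks (pipeline) {
        pipeline -> Text,
        checkpoint_hi_inclusive -> Int8,
    }
}

/// First checkpoint to ingest: just after the lowest committed watermark
/// across all enabled pipelines, or 0 if no pipeline has written one yet.
fn first_checkpoint_from_watermark(conn: &mut PgConnection) -> QueryResult<u64> {
    let lowest: Option<i64> = watermarks::table
        .select(min(watermarks::checkpoint_hi_inclusive))
        .first(conn)?;

    Ok(lowest.map_or(0, |cp| cp as u64 + 1))
}
```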
## Description

Introduce the watermarks table, and update the committer to update it,
taking into account out-of-order writes to the underlying tables.

This change also allows the indexer to pick up where it left off by
consulting the watermarks table for the high watermarks it had
previously written.

It also introduces the ability to configure the range of checkpoints
being indexed (setting an upper and lower bound).

The metrics introduced by this change can be used to track per-pipeline
checkpoint rate, epoch rate and transaction rate.

Some TODOs have also been added for various tests and experiments that
should be run based on this work.

## Test plan

A lot of manual testing. Some TODOs have been left for more extensive
testing once there is more of a test set-up (depends on tempdb and
auto-migration work).
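
For illustration, the out-of-order bookkeeping could look something like the following sketch, with checkpoints as bare `u64`s; it is not the actual committer code:

```rust
use std::collections::BTreeSet;

/// Track a contiguous high watermark while checkpoints are committed out of
/// order: completed checkpoints are buffered in a set, and the watermark only
/// advances across a gap-free prefix.
struct Watermark {
    /// Exclusive: everything below this has been committed.
    next: u64,
    /// Checkpoints committed ahead of the contiguous prefix.
    pending: BTreeSet<u64>,
}

impl Watermark {
    fn new(start: u64) -> Self {
        Self { next: start, pending: BTreeSet::new() }
    }

    /// Record a committed checkpoint, returning the new (exclusive) watermark
    /// if it advanced -- the point at which the watermarks table is updated.
    fn commit(&mut self, checkpoint: u64) -> Option<u64> {
        self.pending.insert(checkpoint);
        let before = self.next;
        while self.pending.remove(&self.next) {
            self.next += 1;
        }
        (self.next != before).then_some(self.next)
    }
}

fn main() {
    let mut wm = Watermark::new(0);
    assert_eq!(wm.commit(1), None);    // gap at 0, no progress yet
    assert_eq!(wm.commit(0), Some(2)); // 0 and 1 are now contiguous
    assert_eq!(wm.commit(2), Some(3));
}
```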
## Description

This makes it so that calls to `diesel setup` or `diesel migration run`
will not forget to add the copyright notice to the schema file.

## Test plan

Run the following:

```
sui-indexer-alt$ diesel migration run --database-url="postgres://postgres:postgrespw@localhost:5432/sui_indexer_alt" --migration-dir migrations
```

Notice that the schema doesn't change (the copyright notice is not
removed).
## Description

Add the sequence number for the checkpoint that the pipeline processed
to the tracing message, so it's easier to follow along.

## Test plan

Run the indexer locally.
## Description
Previously, the service would just hang after having wound down all
pipelines during a graceful shutdown. This is because the graceful
shutdown process was left waiting for an explicit interrupt signal to
unblock the whole shutdown process.

This change relaxes that constraint: If the service is already shutting
down, then there is no need to wait for the explicit Ctrl-C. This makes
it feasible to run the indexer for specific ranges of checkpoints.

Note that we don't want to call the global cancellation token once
ingestion is done, because that might stop a downstream task before it
has acted on the downloaded checkpoints. Instead, the ingestion service
communicates completion by closing its channel, which propagates through
the handlers and committers.

## Test plan

```
RUST_LOG="info,sui_indexer_alt=debug" cargo run -p sui-indexer-alt --release --  \
  --database-url "postgres://postgres:postgrespw@localhost:5432/sui_indexer_alt" \
  --remote-store-url https://checkpoints.mainnet.sui.io                          \
  --last-checkpoint 1000
```
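
A minimal sketch of that completion signal, assuming plain `tokio` mpsc channels and `u64`s standing in for checkpoints: the sender side simply drops its handle when ingestion is done, and each downstream stage drains what is left before exiting.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u64>(16);

    let producer = tokio::spawn(async move {
        for cp in 0..1000u64 {
            if tx.send(cp).await.is_err() {
                break; // receiver has gone away
            }
        }
        // `tx` is dropped here, closing the channel.
    });

    let consumer = tokio::spawn(async move {
        // `recv` returns None only once the channel is closed *and* drained,
        // so no buffered checkpoints are lost during shutdown.
        while let Some(cp) = rx.recv().await {
            let _ = cp; // handle / commit the checkpoint
        }
        println!("channel closed, committer winding down");
    });

    let _ = tokio::join!(producer, consumer);
}
```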
## Description

This makes it easier to perform selective backfills.

## Test plan

```
sui$ RUST_LOG="info" cargo run -p sui-indexer-alt --release --                   \
  --database-url "postgres://postgres:postgrespw@localhost:5432/sui_indexer_alt" \
  --remote-store-url https://checkpoints.mainnet.sui.io                          \
  --last-checkpoint 1000                                                         \
  --pipeline kv_transactions                                                     \
  --pipeline tx_affected_objects
```

Note that for now, the system does clip off the end of the checkpoint
stream, rather than waiting for all workers to finish, and that still
needs to be fixed.
## Description

When running the indexer on a finite range of checkpoints, make sure
committers' buffers are completely empty before shutting down their
tasks; otherwise we may not fully write out all the intended rows for
the range of checkpoints provided (there may be some data left at the
bottom of the barrel).

## Test plan

Ran the following:

```
cargo run -p sui-indexer-alt --release -- \
  --database-url "postgres://postgres:postgrespw@localhost:5432/sui_indexer_alt" \
  --remote-store-url https://checkpoints.mainnet.sui.io \
  --last-checkpoint 2000
```

Corroborated that the data in the DB at the end of the run:

- Stops at the expected checkpoint (not before or after)
- Matches counts of rows in the production mainnet DB for the equivalent
  tables at the same checkpoints.

This can/should be made into an automated test, but that requires tempdb
and migrations to be implemented (a comment has been added to this
effect).
## Description

Add a flag to skip writing to the watermark table. This is useful when
an indexer is running in parallel with another indexer, and it is not
productive for it to write to the watermarks table because it will
always be behind.

## Test plan

Tested locally.
amnn merged commit 27595b4 into main on Oct 18, 2024, with 50 checks passed.
amnn deleted the amnn/idx-poc branch on Oct 18, 2024 at 11:09.