From 424c3bab93d896b848167e37a2c1f2cc954e1b07 Mon Sep 17 00:00:00 2001
From: Miguel Palhas
Date: Wed, 6 Dec 2023 22:58:38 +0100
Subject: [PATCH] Readme (#17)

---
 README.md | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..7ccf6f6
--- /dev/null
+++ b/README.md
@@ -0,0 +1,103 @@

# Iron Indexer

[reth]: https://paradigmxyz.github.io/reth/intro.html
[reth-indexer]: https://github.com/joshstevens19/reth-indexer
[iron]: https://iron-wallet.xyz
[miguel]: https://twitter.com/naps62
[cuckoo]: https://en.wikipedia.org/wiki/Cuckoo_filter

A parallel Reth indexer.

Reads transaction history directly from [reth][reth]'s DB (straight from the filesystem, skipping network & JSON-RPC overhead). It can index a dynamic set of addresses, which can grow at runtime, by spawning parallel self-optimizing backfill jobs.

**Note**: Kudos to [reth-indexer][reth-indexer], the original implementation that served as a basis for this.

## Disclaimer

This is currently a prototype, built to serve a yet-to-be-released feature of [Iron wallet][iron]. All development so far has been with that goal in mind. Don't expect a plug-and-play indexing solution for every use case (at least not right now).

## How to use

🚧 TODO 🚧

For now, check `iron-indexer.toml`, which should help you get started. Feel free to contact [me][miguel] or open an issue with any questions.

## Why

Fetching on-chain data can be a painful process. A simple query such as _"what is the transaction history for my wallet address?"_ translates into a time-consuming walk of the entire chain.
It's also not enough to sync the `from` and `to` fields of every transaction (which would already be costly). Relevant transactions for a wallet are also determined by emitted topics, such as ERC20 transfers.

On top of this, most indexers require a predetermined set of topics to index, and any change requires a new full walk of the chain.

Instead, `iron-indexer` takes a different approach: new addresses can be added to the sync list at runtime, and self-optimizing backfill jobs are registered to fetch the missing history for each incoming address.

## How

### Forward & Backfill workers

Let's illustrate this with an example: say we're currently indexing only `alice`'s address. A regular syncing process is running, waiting for new blocks to process.

After block 10, `bob`'s address is added to the set. From block 11 onwards, both `alice` and `bob` will be matched. But we missed blocks 1 through 10 for `bob`. At this point we register a new backfill job for the missing data.

We're now at this state:

| job             | account set    | block range     |
| --------------- | -------------- | --------------- |
| **Forward**     | `[alice, bob]` | waiting for #11 |
| **Backfill #1** | `[bob]`        | `[1, 10]`       |

The new job starts immediately, processing blocks in reverse order.

A few moments later, `carol`'s address joins too. By now both existing jobs have advanced a bit:

| job             | account set    | block range     | notes                                      |
| --------------- | -------------- | --------------- | ------------------------------------------ |
| **Forward**     | `[alice, bob]` | waiting for #16 |                                            |
| **Backfill #1** | `[bob]`        | `[1, 5]`        | We've synced from 10 to 6 in the meantime  |
| **Backfill #2** | `[carol]`      | `[1, 15]`       |                                            |

The naive approach would be to start the new job and run all 3 concurrently.
This has one drawback though: both backfill jobs would fetch redundant blocks (1 through 5).

Instead of starting right away, we run a [reorganization step](https://github.com/iron-wallet/indexer/blob/main/src/rearrange.rs):

| job             | account set    | block range     | notes                                  |
| --------------- | -------------- | --------------- | -------------------------------------- |
| **Forward**     | `[alice, bob]` | waiting for #16 |                                        |
| **Backfill #3** | `[bob, carol]` | `[1, 5]`        | The overlapping range in one job...    |
| **Backfill #4** | `[carol]`      | `[6, 15]`       | ...and carol's unique range in another |

This ensures we never attempt to fetch the same block twice, keeping IO as efficient as possible.
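For illustration, here is a minimal sketch of that reorganization idea. This is not the actual `rearrange.rs` code; the `Job` type and the `reorganize` function are assumptions made for this example:

```rust
// A minimal sketch of the reorganization step: overlapping ranges are merged
// into one job with the union of both address sets, and the remainders become
// separate jobs. Illustrative only, not the actual `rearrange.rs` code.
use std::collections::BTreeSet;

/// A backfill job: a set of addresses to match, over an inclusive block range.
#[derive(Debug, Clone)]
struct Job {
    addresses: BTreeSet<String>,
    from: u64,
    to: u64,
}

impl Job {
    fn new(addresses: &[&str], from: u64, to: u64) -> Self {
        Self {
            addresses: addresses.iter().map(|s| s.to_string()).collect(),
            from,
            to,
        }
    }
}

/// Combine a new job with an existing one so that no block range is covered twice.
fn reorganize(existing: &Job, incoming: &Job) -> Vec<Job> {
    let overlap_from = existing.from.max(incoming.from);
    let overlap_to = existing.to.min(incoming.to);

    // No overlap: both jobs can run as-is.
    if overlap_from > overlap_to {
        return vec![existing.clone(), incoming.clone()];
    }

    let mut jobs = Vec::new();

    // Overlapping range: one job matching the union of both address sets.
    let mut merged = existing.addresses.clone();
    merged.extend(incoming.addresses.iter().cloned());
    jobs.push(Job { addresses: merged, from: overlap_from, to: overlap_to });

    // Parts of the incoming range not covered by the existing job.
    if incoming.from < overlap_from {
        jobs.push(Job { addresses: incoming.addresses.clone(), from: incoming.from, to: overlap_from - 1 });
    }
    if incoming.to > overlap_to {
        jobs.push(Job { addresses: incoming.addresses.clone(), from: overlap_to + 1, to: incoming.to });
    }

    // Parts of the existing range not covered by the incoming job keep their set.
    if existing.from < overlap_from {
        jobs.push(Job { addresses: existing.addresses.clone(), from: existing.from, to: overlap_from - 1 });
    }
    if existing.to > overlap_to {
        jobs.push(Job { addresses: existing.addresses.clone(), from: overlap_to + 1, to: existing.to });
    }

    jobs
}

fn main() {
    // The example from the tables above: bob's job has advanced to [1, 5],
    // and carol joins needing [1, 15].
    let backfill_1 = Job::new(&["bob"], 1, 5);
    let backfill_2 = Job::new(&["carol"], 1, 15);

    for job in reorganize(&backfill_1, &backfill_2) {
        println!("{job:?}");
    }
    // => [bob, carol] over [1, 5], then [carol] over [6, 15]
}
```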
### Cuckoo filters

We make use of [Cuckoo filters][cuckoo] to efficiently check data inclusion. This is similar to how Bloom filters work, with additional benefits such as the ability to remove items and lower space overhead. The particular [implementation being used](https://docs.rs/scalable_cuckoo_filter/0.2.3/scalable_cuckoo_filter/index.html) also supports automatic scaling.

## Future Work

### To be done next

- [ ] Finish the API
- [ ] Add EIP-712 based authentication
- [ ] Document this a bit better
- [ ] Benchmark on a real mainnet node

### Future optimizations

A few potential optimizations are still to be done, and should help improve throughput even further:

- [ ] Split workers into producers/consumers. Currently, each worker alternates between fetching a block and processing it, which is not optimal for IO. (Question: is this worth it, or can we saturate read capacity just by setting up more workers?)
- [ ] Work-stealing. If we have a single backfill job walking N blocks, we can split it into Y jobs of N/Y blocks each. This can be done directly in the reorganization step.

## Benchmarks

🚧 TODO 🚧

## Requirements

- A reth node running on the same machine (the indexer requires access to the same filesystem)
- PostgreSQL

## License

[MIT](./LICENSE) License