Skip to content

Commit

Permalink
architechture doc of logtail and lock
Browse files Browse the repository at this point in the history
  • Loading branch information
lacrimosaprinz committed Oct 16, 2023
1 parent 173b404 commit d443ad0
Show file tree
Hide file tree
Showing 2 changed files with 233 additions and 0 deletions.
94 changes: 94 additions & 0 deletions docs/MatrixOne/Overview/architecture/architecture-logtail.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
# Logtail Protocol Architecture

The Logtail protocol is a logging synchronization protocol between Computation Nodes (CN) and Transaction Nodes (TN). It serves as the foundation for the collaboration between CN and TN in the context of the MatrixOne cloud-native transaction and analytics engine. This article provides an in-depth look at the Logtail protocol's core purpose, contents, and generation process.

TAE, or the MatrixOne cloud-native transaction and analytics engine, currently involves the following responsibilities related to TN nodes:

- Processing committed transactions.
- Providing Logtail services to CN.
- Storing the most recent transaction data in object storage and pushing log windows.

Steps 1 and 3 lead to state changes, such as successful data writes into memory or object storage. Logtail is a logging synchronization protocol designed to synchronize partial states from TN to CN, allowing CN to reconstruct the required readable data at a low cost locally. As a key protocol in the MatrixOne storage-compute separation architecture, Logtail possesses the following characteristics:

- It is a log-replication state machine connecting CN and TN in tandem.
- Logtail supports two retrieval modes: pull and push.

- Push offers higher real-time capabilities by continuously synchronizing incremental logs from TN to CN.
- Pull supports snapshot synchronization for specified time intervals and can also synchronize subsequent incremental log information on demand.

- Logtail supports table-level subscriptions and collections, enhancing flexibility in multiple CN support and contributing to load balancing.

## Logtail Protocol Contents

The Logtail protocol consists of in-memory data and metadata, with the primary distinction being whether the data has been archived for object storage.

Updates generated by a single transaction `commit` exist in Logtail as in-memory data before being archived to object storage. Any modifications to data can be categorized as inserts and deletes:

- For inserts, Logtail information includes the data row's row-id, commit timestamp, and the columns defined in the table.
- Logtail information contains the row-id, commit timestamp, and primary vital columns for deletes. Once CN receives this in-memory data, it organizes it into a memory-based B-tree structure to facilitate query services.

In-memory data cannot be retained indefinitely in memory, as this would increase memory pressure. Due to time or capacity constraints, in-memory data is flushed to object storage, forming an object. Each object comprises one or more data blocks, with each block being the smallest storage unit for table data, and the number of rows in each block does not exceed a fixed limit, currently set at 8192 rows. Once the flush is complete, Logtail passes the metadata of the blocks to CN, which filters out the visible block list based on transaction timestamps, reads the block contents, and combines them with in-memory data to obtain a complete data view at a specific moment.

The process above outlines the basic steps. With the introduction of performance optimizations, more details will become apparent, including:

### 1. Checkpoints

When TN has been running for a while, it performs a checkpoint at a specific moment, archiving all data before that time to object storage. Therefore, all this metadata is consolidated into a "compression package." When a newly launched CN connects to TN and requests Logtail for the first time, if the subscription timestamp is greater than the checkpoint timestamp, TN can directly transmit checkpoint metadata (a string) to Logtail, allowing CN to read block information generated before the checkpoint now, reducing the network burden of sharing block metadata from scratch and relieving TN's I/O pressure.

### 2. Memory Cleanup

When TN passes block metadata to CN, it cleans up the previously transmitted in-memory data based on block identifiers. However, data updates may occur during TN transaction flushing, such as new delete operations appearing on flushed blocks. If the strategy is rollback and retried, the already written data becomes invalid. In the case of loads with frequent updates, a significant amount of rollback operations may occur, wasting TN resources. To avoid this situation, TN continues to commit, rendering in-memory data generated after the start of the flush process unremovable from CN. Logtail's block metadata includes a timestamp, indicating that in-memory data for the block can only be cleared from memory before this timestamp. These uncleared updates will be asynchronously written to disk during the next flush and subsequently removed by CN.

### 3. Faster Reads

Blocks that have already been archived to object storage may continue to generate delete operations. Therefore, when reading these blocks, in-memory deleted data must be considered. CN maintains an additional block B-tree index to determine which blocks need to be combined with in-memory data. When applying Logtail to modify this index, caution should be exercised, increasing index entries when processing in-memory data and decreasing them when processing block metadata. This index includes only the blocks that need to be checked against in-memory data, which can result in a significant performance boost for a large number of blocks.

## Logtail Generation

As mentioned earlier, Logtail can be obtained through pull and push. These two modes have different characteristics, and they will be explained separately.

### 1. Pull

As previously explained, pull effectively synchronizes table snapshots and generates incremental log information.

To achieve this, TN maintains a sorted list of transaction handles, the logtail table, ordered by transaction preparation time. Given any moment, a binary search can be performed on the transaction handles within the range, which provides which blocks have been updated by that transaction. By iterating through these blocks, a complete log can be obtained. To expedite the search process, transaction handles are paginated, with the bornTs of each page being the minimum preparation time of the transaction handles within that page. The first level of binary search is performed on these pages.

Based on the logtail table, from the moment a pull request is received, the primary workflow is as follows:

1. Adjust the request's time range based on existing checkpoints, which may provide data earlier than the request.

2. Take a snapshot of the logtail table, iterate through the relevant transaction handles within this snapshot using a RespBuilder in visitor mode, and collect the committed log information.

3. Convert the collected log information into the Logtail protocol format and return it as a response to CN.

![Pull Workflow](https://github.com/matrixorigin/artwork/blob/main/docs/overview/architecture/logtail-arch-1.png)

```
type RespBuilder interface {
OnDatabase(database *DBEntry) error
OnPostDatabase(database *DBEntry) error
OnTable(table *TableEntry) error
OnPostTable(table *TableEntry) error
OnPostSegment(segment *SegmentEntry) error
OnSegment(segment *SegmentEntry) error
OnBlock(block *BlockEntry) error
BuildResp() (api.SyncLogtailResp, error)
Close()
}
```

### 2. Push

The primary purpose of push is to synchronize incremental logs from TN to CN in a more real-time manner. The process is divided into three stages: subscription, collection, and push.

- **Subscription**: This process is necessary when a new CN starts. The CN establishes an RPC stream with the TN and subscribes to catalog-related tables as a client. The CN can provide external services when database, table, and column information has been synchronized. When TN receives a request to subscribe to

A table goes through the pull process, including all Logtail up to the previous push timestamp in the subscription response. For a CN, Logtail's subscription, unsubscription, and data sending all occur on a single RPC stream. If there is an exception, the CN enters a reconnection process until it recovers. Once the subscription is successful, subsequent Logtail operations involve pushing incremental logs.

- **Collection**: A transaction's completion within TN triggers a callback to collect Logtail within the current transaction. The primary process involves iterating through TxnEntry in the workspace, which serves as a fundamental container for transaction updates directly involved in the commit pipeline. Depending on its type, the corresponding log information is collected and transformed into a Logtail protocol data format. This collection process occurs through a pipeline and runs concurrently with WAL's fsync, reducing blocking.

- **Push**: A filtering process is mainly conducted during the push stage. If it is found that a CN has not subscribed to a specific table, it is skipped to avoid pushing data to that CN.

If a table has not been updated for a long time, how does a CN become aware of it? Here, a heartbeat mechanism is introduced, with a default of 2 ms. In TN's commit queue, a heartbeat transaction is placed, which performs no substantial work but consumes a timestamp, triggering a Logtail send to notify CN that all table data had updates sent previously, pushing the CN's timestamp watermark.

![Push Workflow](https://github.com/matrixorigin/artwork/blob/main/docs/overview/architecture/logtail-arch-2.png)
Loading

0 comments on commit d443ad0

Please sign in to comment.