Skip to content

Commit

Permalink
Add global sort related docs. (#14801)
Browse files Browse the repository at this point in the history
  • Loading branch information
Benjamin2037 authored Oct 9, 2023
1 parent 53174a7 commit 0387a25
Show file tree
Hide file tree
Showing 5 changed files with 117 additions and 0 deletions.
1 change: 1 addition & 0 deletions TOC-tidb-cloud.md
Original file line number Diff line number Diff line change
Expand Up @@ -614,6 +614,7 @@
- [Table Filter](/table-filter.md)
- [Resource Control](/tidb-resource-control.md)
- [TiDB Backend Task Distributed Execution Framework](/tidb-distributed-execution-framework.md)
- [TiDB Global Sort](/tidb-global-sort.md)
- [DDL Execution Principles and Best Practices](/ddl-introduction.md)
- [Troubleshoot Inconsistency Between Data and Indexes](/troubleshoot-data-inconsistency-errors.md)
- [Support](/tidb-cloud/tidb-cloud-support.md)
Expand Down
1 change: 1 addition & 0 deletions TOC.md
Original file line number Diff line number Diff line change
Expand Up @@ -994,6 +994,7 @@
- [Schedule Replicas by Topology Labels](/schedule-replicas-by-topology-labels.md)
- Internal Components
- [TiDB Backend Task Distributed Execution Framework](/tidb-distributed-execution-framework.md)
- [TiDB Global Sort](/tidb-global-sort.md)
- FAQs
- [FAQ Summary](/faq/faq-overview.md)
- [TiDB FAQs](/faq/tidb-faq.md)
Expand Down
Binary file added media/dist-task/global-sort.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
28 changes: 28 additions & 0 deletions system-variables.md
Original file line number Diff line number Diff line change
Expand Up @@ -1571,6 +1571,34 @@ mysql> SELECT job_info FROM mysql.analyze_jobs ORDER BY end_time DESC LIMIT 1;
- Starting from TiDB v7.2.0, the framework supports distributedly executing the [`IMPORT INTO`](https://docs.pingcap.com/tidb/v7.2/sql-statement-import-into) statement for import jobs of TiDB Self-Hosted. For TiDB Cloud, the `IMPORT INTO` statement is not applicable.
- This variable is renamed from `tidb_ddl_distribute_reorg`.

### tidb_cloud_storage_uri <span class="version-mark">New in v7.4.0</span>

> **Warning:**
>
> This feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed or removed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub.

- Scope: GLOBAL
- Persists to cluster: Yes
- Applies to hint [SET_VAR](/optimizer-hints.md#set_varvar_namevar_value): No
- Default value: `""`

<CustomContent platform="tidb">

- This variable is used to specify the cloud storage URI to enable [Global Sort](/tidb-global-sort.md). After enabling the [distributed execution framework](/tidb-distributed-execution-framework.md), you can use the Global Sort feature by configuring the URI and pointing it to an appropriate cloud storage path with the necessary permissions to access the storage. For more details, see [URI format](/br/backup-and-restore-storages.md#uri-format).
- The following statements can use the Global Sort feature.
- The [`ADD INDEX`](/sql-statements/sql-statement-add-index.md) statement.
- The [`IMPORT INTO`](/sql-statements/sql-statement-import-into.md) statement for import jobs of TiDB Self-Hosted. For TiDB Cloud, the `IMPORT INTO` statement is not applicable.

</CustomContent>
<CustomContent platform="tidb-cloud">

- This variable is used to specify the cloud storage URI to enable [Global Sort](/tidb-global-sort.md). After enabling the [distributed execution framework](/tidb-distributed-execution-framework.md), you can use the Global Sort feature by configuring the URI and pointing it to an appropriate cloud storage path with the necessary permissions to access the storage. For more details, see [URI format](https://docs.pingcap.com/tidb/stable/backup-and-restore-storages#uri-format).
- The following statements can use the Global Sort feature.
- The [`ADD INDEX`](/sql-statements/sql-statement-add-index.md) statement.
- The [`IMPORT INTO`](https://docs.pingcap.com/tidb/stable/sql-statement-import-into) statement for import jobs of TiDB Self-Hosted. For TiDB Cloud, the `IMPORT INTO` statement is not applicable.

</CustomContent>

### tidb_ddl_error_count_limit

> **Note:**
Expand Down
87 changes: 87 additions & 0 deletions tidb-global-sort.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
---
title: TiDB Global Sort
summary: Learn the use cases, limitations, usage, and implementation principles of the TiDB Global Sort.
---

<!-- markdownlint-disable MD029 -->
<!-- markdownlint-disable MD046 -->

# TiDB Global Sort

> **Warning:**
>
> This feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed or removed without prior notice. If you find a bug, you can report an [issue](https://github.com/pingcap/tidb/issues) on GitHub.
<CustomContent platform="tidb-cloud">

> **Note:**
>
> Currently, this feature is only applicable to TiDB Dedicated clusters. You cannot use it on TiDB Serverless clusters.
</CustomContent>

## Overview

The TiDB Global Sort feature enhances the stability and efficiency of data import and DDL (Data Definition Language) operations. It serves as a general operator in the [distributed execution framework](/tidb-distributed-execution-framework.md), providing a global sort service on cloud.

The Global Sort feature currently only supports using Amazon S3 as cloud storage. In future releases, it will be extended to support multiple shared storage interfaces, such as POSIX, enabling seamless integration with different storage systems. This flexibility enables efficient and adaptable data sorting for various use cases.

## Use cases

The Global Sort feature enhances the stability and efficiency of `IMPORT INTO` and `CREATE INDEX`. By globally sorting the data that are processed by the tasks, it improves the stability, controllability, and scalability of writing data to TiKV. This provides an enhanced user experience for data import and DDL tasks, as well as higher-quality services.

The Global Sort feature executes tasks within the unified distributed parallel execution framework, ensuring efficient and parallel sorting of data on a global scale.

## Limitations

Currently, the Global Sort feature is not used as a component of the query execution process responsible for sorting query results.

## Usage

To enable Global Sort, follow these steps:

1. Enable the distributed execution framework by setting the value of [`tidb_enable_dist_task`](/system-variables.md#tidb_enable_dist_task-new-in-v710) to `ON`:

```sql
SET GLOBAL tidb_enable_dist_task = ON;
```

<CustomContent platform="tidb">

2. Set [`tidb_cloud_storage_uri`](/system-variables.md#tidb_cloud_storage_uri-new-in-v740) to a correct cloud storage path. See [an example](/br/backup-and-restore-storages.md).

</CustomContent>
<CustomContent platform="tidb-cloud">

2. Set [`tidb_cloud_storage_uri`](/system-variables.md#tidb_cloud_storage_uri-new-in-v740) to a correct cloud storage path. See [an example](https://docs.pingcap.com/tidb/stable/backup-and-restore-storages).

</CustomContent>

```sql
SET GLOBAL tidb_cloud_storage_uri = 's3://my-bucket/test-data?role-arn=arn:aws:iam::888888888888:role/my-role'
```

## Implementation principles

The algorithm of the Global Sort feature is as follows:

![Algorithm of Global Sort](/media/dist-task/global-sort.jpeg)

The detailed implementation principles are as follows:

### Step 1: Scan and prepare data

1. After TiDB nodes scan a specific range of data (the data source can be either CSV data or table data in TiKV):

1. TiDB nodes encode them into Key-Value pairs.
2. TiDB nodes sort Key-Value pairs into several block data segments (the data segments are locally sorted), where each segment is one file and is uploaded into the cloud storage.

2. The TiDB node also records a serial actual Key-Value ranges for each segment (referred to as a statistics file), which is a key preparation for scalable sort implementation. These files are then uploaded into the cloud storage along with the real data.

### Step 2: Sort and distribute data

From step 1, the Global Sort program obtains a list of sorted blocks and their corresponding statistics files, which provide the number of locally sorted blocks. The program also has a real data scope that can be used by PD to split and scatter. The following steps are performed:

1. Sort the records in the statistics file to divide them into nearly equal-sized ranges, which are subtasks that will be executed in parallel.
2. Distribute the subtasks to TiDB nodes for execution.
3. Each TiDB node independently sorts the data of subtasks into ranges and ingests them into TiKV without overlap.

0 comments on commit 0387a25

Please sign in to comment.