diff --git a/alert-rules.md b/alert-rules.md
index f068025b3bc9c..b7ed316d10c96 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -234,7 +234,7 @@ This section gives the alert rules for the PD component.
* Description:
- The number of Region replicas is smaller than the value of `max-replicas`. When a TiKV machine is down and its downtime exceeds `max-down-time`, it usually leads to missing replicas for some Regions during a period of time.
+ The number of Region replicas is smaller than the value of `max-replicas`.
* Solution:
diff --git a/best-practices/java-app-best-practices.md b/best-practices/java-app-best-practices.md
index 9cf2a00bdf7a2..564050db22519 100644
--- a/best-practices/java-app-best-practices.md
+++ b/best-practices/java-app-best-practices.md
@@ -233,7 +233,7 @@ The application needs to return the connection after finishing using it. It is a
### Probe configuration
-The connection pool maintains persistent connections to TiDB. TiDB does not proactively close client connections by default (unless an error is reported), but generally there will be network proxies such as LVS or HAProxy between the client and TiDB. Usually, these proxies will proactively clean up connections that are idle for a certain period of time. In addition to paying attention to the idle configuration of the proxies, the connection pool also needs to keep alive or probe connections.
+The connection pool maintains persistent connections to TiDB. TiDB does not proactively close client connections by default (unless an error is reported), but generally there will be network proxies such as LVS or HAProxy between the client and TiDB. Usually, these proxies will proactively clean up connections that are idle for a certain period of time (controlled by the proxy's idle configuration). In addition to paying attention to the idle configuration of the proxies, the connection pool also needs to keep alive or probe connections.
If you often see the following error in your Java application:
diff --git a/br/backup-and-restore-use-cases.md b/br/backup-and-restore-use-cases.md
index d94f1637174d4..bfbff79b4df61 100644
--- a/br/backup-and-restore-use-cases.md
+++ b/br/backup-and-restore-use-cases.md
@@ -87,7 +87,7 @@ The BR tool already supports self-adapting to GC. It automatically registers `ba
For the detailed usage of the `br backup` command, refer to [Use BR Command-line for Backup and Restoration](/br/use-br-command-line-tool.md).
1. Before executing the `br backup` command, ensure that no DDL is running on the TiDB cluster.
-2. Ensure that the storage device where the backup will be created has sufficient space.
+2. Ensure that the storage device where the backup will be created has sufficient space (no less than 1/3 of the disk space of the backup cluster).
### Preparation for restoration
diff --git a/daily-check.md b/daily-check.md
index 351e428a5702a..62dd3ded55a8b 100644
--- a/daily-check.md
+++ b/daily-check.md
@@ -41,7 +41,7 @@ You can locate the slow SQL statement executed in the cluster. Then you can opti
+ `miss-peer-region-count`: The number of Regions without enough replicas. This value is not always greater than `0`.
+ `extra-peer-region-count`: The number of Regions with extra replicas. These Regions are generated during the scheduling process.
+ `empty-region-count`: The number of empty Regions, generated by executing the `TRUNCATE TABLE`/`DROP TABLE` statement. If this number is large, you can consider enabling `Region Merge` to merge Regions across tables.
-+ `pending-peer-region-count`: The number of Regions with outdated Raft logs. It is normal that a few pending peers are generated in the scheduling process. However, it is not normal if this value is large for a period of time.
++ `pending-peer-region-count`: The number of Regions with outdated Raft logs. It is normal that a few pending peers are generated in the scheduling process. However, it is not normal if this value is large for a period of time (longer than 30 minutes).
+ `down-peer-region-count`: The number of Regions with an unresponsive peer reported by the Raft leader.
+ `offline-peer-region-count`: The number of Regions during the offline process.
diff --git a/dm/deploy-a-dm-cluster-using-tiup.md b/dm/deploy-a-dm-cluster-using-tiup.md
index dec90cf83a8fb..f9eece25f6744 100644
--- a/dm/deploy-a-dm-cluster-using-tiup.md
+++ b/dm/deploy-a-dm-cluster-using-tiup.md
@@ -16,7 +16,7 @@ TiUP supports deploying DM v2.0 or later DM versions. This document introduces h
## Prerequisites
-When DM performs a full data replication task, the DM-worker is bound with only one upstream database. The DM-worker first exports the full amount of data locally, and then imports the data into the downstream database. Therefore, the worker's host needs sufficient storage space (The storage path is specified later when you create the task).
+When DM performs a full data replication task, the DM-worker is bound with only one upstream database. The DM-worker first exports the full amount of data locally, and then imports the data into the downstream database. Therefore, the worker's host space must be large enough to store all upstream tables to be exported. The storage path is specified later when you create the task.
In addition, you need to meet the [hardware and software requirements](/dm/dm-hardware-and-software-requirements.md) when deploying a DM cluster.
diff --git a/dumpling-overview.md b/dumpling-overview.md
index d74b48fd7069b..6d08bfd407050 100644
--- a/dumpling-overview.md
+++ b/dumpling-overview.md
@@ -318,9 +318,9 @@ When Dumpling is exporting a large single table from TiDB, Out of Memory (OOM) m
+ Reduce the value of `--tidb-mem-quota-query` to `8589934592` (8 GB) or lower. `--tidb-mem-quota-query` controls the memory usage of a single query statement in TiDB.
+ Adjust the `--params "tidb_distsql_scan_concurrency=5"` parameter. [`tidb_distsql_scan_concurrency`](/system-variables.md#tidb_distsql_scan_concurrency) is a session variable which controls the concurrency of the scan operations in TiDB.
-### TiDB GC settings when exporting a large volume of data
+### TiDB GC settings when exporting a large volume of data (more than 1 TB)
-When exporting data from TiDB, if the TiDB version is later than or equal to v4.0.0 and Dumpling can access the PD address of the TiDB cluster, Dumpling automatically extends the GC time without affecting the original cluster.
+When exporting a large volume of data (more than 1 TB) from TiDB, if the TiDB version is later than or equal to v4.0.0 and Dumpling can access the PD address of the TiDB cluster, Dumpling automatically extends the GC time without affecting the original cluster.
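+
+For instance, you can check the current GC lifetime before starting a large export (a minimal check; `tikv_gc_life_time` in the `mysql.tidb` system table is where TiDB keeps this setting):
+
+```sql
+SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME = 'tikv_gc_life_time';
+```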
In other scenarios, if the data size is very large, to avoid export failure due to GC during the export process, you can extend the GC time in advance:
diff --git a/faq/manage-cluster-faq.md b/faq/manage-cluster-faq.md
index 41fea527c682e..6d51685219eed 100644
--- a/faq/manage-cluster-faq.md
+++ b/faq/manage-cluster-faq.md
@@ -427,6 +427,6 @@ This section describes common problems you may encounter during backup and resto
### How to back up data in TiDB?
-Currently, for the backup of a large volume of data, the preferred method is using [BR](/br/backup-and-restore-tool.md). Otherwise, the recommended tool is [Dumpling](/dumpling-overview.md). Although the official MySQL tool `mysqldump` is also supported in TiDB to back up and restore data, its performance is worse than [BR](/br/backup-and-restore-tool.md) and it needs much more time to back up and restore large volumes of data.
+Currently, for the backup of a large volume of data (more than 1 TB), the preferred method is using [BR](/br/backup-and-restore-tool.md). Otherwise, the recommended tool is [Dumpling](/dumpling-overview.md). Although the official MySQL tool `mysqldump` is also supported in TiDB to back up and restore data, its performance is worse than [BR](/br/backup-and-restore-tool.md) and it needs much more time to back up and restore large volumes of data.
For more FAQs about BR, see [BR FAQs](/br/backup-and-restore-faq.md).
diff --git a/migrate-large-mysql-to-tidb.md b/migrate-large-mysql-to-tidb.md
index 642cd8dd01a1f..86acbddf1baa2 100644
--- a/migrate-large-mysql-to-tidb.md
+++ b/migrate-large-mysql-to-tidb.md
@@ -28,7 +28,7 @@ This document describes how to migrate large datasets from MySQL to TiDB. The wh
**Disk space**:
-- Dumpling requires enough disk space to store the whole data source. SSD is recommended.
+- Dumpling requires enough disk space to store the whole data source, that is, all upstream tables to be exported. SSD is recommended. To calculate the required space, see [Downstream storage space requirements](/tidb-lightning/tidb-lightning-requirements.md#downstream-storage-space-requirements).
- During the import, TiDB Lightning needs temporary space to store the sorted key-value pairs. The disk space should be enough to hold the largest single table from the data source.
- If the full data volume is large, you can increase the binlog storage time in the upstream. This is to ensure that the binlogs are not lost during the incremental replication.
@@ -78,7 +78,7 @@ The target TiKV cluster must have enough disk space to store the imported data.
|-`B` or `--database` |Specifies a database to be exported|
|`-f` or `--filter` |Exports tables that match the pattern. Refer to [table-filter](/table-filter.md) for the syntax.|
- Make sure `${data-path}` has enough space to store the exported data. To prevent the export from being interrupted by a large table consuming all the spaces, it is strongly recommended to use the `-F` option to limit the size of a single file.
+ Make sure `${data-path}` has sufficient space to store all exported upstream tables. To calculate the required space, see [Downstream storage space requirements](/tidb-lightning/tidb-lightning-requirements.md#downstream-storage-space-requirements). To prevent the export from being interrupted by a large table consuming all the space, it is strongly recommended to use the `-F` option to limit the size of a single file.
2. View the `metadata` file in the `${data-path}` directory. This is a Dumpling-generated metadata file.
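+
+The content of the `metadata` file looks roughly like the following (an illustrative sketch; the binlog name and position depend on your upstream instance):
+
+```
+Started dump at: 2020-11-10 10:40:19
+SHOW MASTER STATUS:
+    Log: mysql-bin.000002
+    Pos: 634
+    GTID:
+Finished dump at: 2020-11-10 10:40:20
+```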
Record the binlog position information, which is required for the incremental replication in Step 3.
diff --git a/pd-control.md b/pd-control.md
index f8766babeab76..2391378d23cb5 100644
--- a/pd-control.md
+++ b/pd-control.md
@@ -258,19 +258,19 @@ Usage:
>> config set region-schedule-limit 2 // 2 tasks of Region scheduling at the same time at most
```
-- `replica-schedule-limit` controls the number of tasks scheduling the replica at the same time. This value affects the scheduling speed when the node is down or removed. A larger value means a higher speed and setting the value to 0 closes the scheduling. Usually the replica scheduling has a large load, so do not set a too large value.
+- `replica-schedule-limit` controls the number of tasks scheduling the replica at the same time. This value affects the scheduling speed when the node is down or removed. A larger value means a higher speed and setting the value to 0 closes the scheduling. Usually the replica scheduling has a large load, so do not set a too large value. Note that this configuration item is usually kept at the default value. If you want to change the value, you need to try a few values to see which one works best according to the real situation.
```bash
>> config set replica-schedule-limit 4 // 4 tasks of replica scheduling at the same time at most
```
-- `merge-schedule-limit` controls the number of Region Merge scheduling tasks. Setting the value to 0 closes Region Merge. Usually the Merge scheduling has a large load, so do not set a too large value.
+- `merge-schedule-limit` controls the number of Region Merge scheduling tasks. Setting the value to 0 closes Region Merge. Usually the Merge scheduling has a large load, so do not set a too large value. Note that this configuration item is usually kept at the default value. If you want to change the value, you need to try a few values to see which one works best according to the real situation.
```bash
>> config set merge-schedule-limit 16 // 16 tasks of Merge scheduling at the same time at most
```
-- `hot-region-schedule-limit` controls the hot Region scheduling tasks that are running at the same time. Setting its value to `0` means to disable the scheduling. It is not recommended to set a too large value, otherwise it might affect the system performance.
+- `hot-region-schedule-limit` controls the hot Region scheduling tasks that are running at the same time. Setting its value to `0` means disabling the scheduling. It is not recommended to set a too large value. Otherwise, it might affect the system performance. Note that this configuration item is usually kept at the default value. If you want to change the value, you need to try a few values to see which one works best according to the real situation.
```bash
>> config set hot-region-schedule-limit 4 // 4 tasks of hot Region scheduling at the same time at most
```
diff --git a/releases/release-5.3.0.md b/releases/release-5.3.0.md
index 1bf11d5bc3890..d8d77147bf68f 100644
--- a/releases/release-5.3.0.md
+++ b/releases/release-5.3.0.md
@@ -117,7 +117,7 @@ In v5.3, the key new features or improvements are as follows:
Support the `ALTER TABLE [PARTITION] ATTRIBUTES` statement that allows you to set attributes for a table or partition. Currently, TiDB only supports setting the `merge_option` attribute. By adding this attribute, you can explicitly control the Region merge behavior.
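+
+For example, a sketch of denying Region merge for a table (the table name `t` is illustrative):
+
+```sql
+ALTER TABLE t ATTRIBUTES 'merge_option=deny';
+```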
- User scenarios: When you perform the `SPLIT TABLE` operation, if no data is inserted after a certain period of time, the empty Regions are automatically merged by default. In this case, you can set the table attribute to `merge_option=deny` to avoid the automatic merging of Regions.
+ User scenarios: When you perform the `SPLIT TABLE` operation, if no data is inserted after a certain period of time (controlled by the PD parameter [`split-merge-interval`](/pd-configuration-file.md#split-merge-interval)), the empty Regions are automatically merged by default. In this case, you can set the table attribute to `merge_option=deny` to avoid the automatic merging of Regions.
[User document](/table-attributes.md), [#3839](https://github.com/tikv/pd/issues/3839)
diff --git a/schedule-replicas-by-topology-labels.md b/schedule-replicas-by-topology-labels.md
index fb8c741aef0d3..4aaf4c4b91aa6 100644
--- a/schedule-replicas-by-topology-labels.md
+++ b/schedule-replicas-by-topology-labels.md
@@ -168,7 +168,7 @@ Then, assume that the number of cluster replicas is 5 (`max-replicas=5`). Becaus
In the case of the 5-replica configuration, if z3 fails or is isolated as a whole, and cannot be recovered after a period of time (controlled by `max-store-down-time`), PD will make up the 5 replicas through scheduling. At this time, only 4 hosts are available. This means that host-level isolation cannot be guaranteed and that multiple replicas might be scheduled to the same host. But if the `isolation-level` value is set to `zone` instead of being left empty, this specifies the minimum physical isolation requirements for Region replicas. That is to say, PD will ensure that replicas of the same Region are scattered among different zones. PD will not perform corresponding scheduling even if following this isolation restriction does not meet the requirement of `max-replicas` for multiple replicas.
-For example, a TiKV cluster is distributed across three data zones z1, z2, and z3. Each Region has three replicas as required, and PD distributes the three replicas of the same Region to these three data zones respectively. If a power outage occurs in z1 and cannot be recovered after a period of time, PD determines that the Region replicas on z1 are no longer available. However, because `isolation-level` is set to `zone`, PD needs to strictly guarantee that different replicas of the same Region will not be scheduled on the same data zone. Because both z2 and z3 already have replicas, PD will not perform any scheduling under the minimum isolation level restriction of `isolation-level`, even if there are only two replicas at this moment.
+For example, a TiKV cluster is distributed across three data zones z1, z2, and z3. Each Region has three replicas as required, and PD distributes the three replicas of the same Region to these three data zones respectively. If a power outage occurs in z1 and cannot be recovered after a period of time (controlled by [`max-store-down-time`](/pd-configuration-file.md#max-store-down-time), which is 30 minutes by default), PD determines that the Region replicas on z1 are no longer available. However, because `isolation-level` is set to `zone`, PD needs to strictly guarantee that different replicas of the same Region will not be scheduled on the same data zone. Because both z2 and z3 already have replicas, PD will not perform any scheduling under the minimum isolation level restriction of `isolation-level`, even if there are only two replicas at this moment.
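+
+For reference, a minimal sketch of applying this setting with pd-ctl (assuming `zone` is already included in `location-labels`):
+
+```bash
+pd-ctl config set isolation-level zone
+```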
Similarly, when `isolation-level` is set to `rack`, the minimum isolation level applies to different racks in the same data center. With this configuration, the isolation at the zone layer is guaranteed first if possible. When the isolation at the zone level cannot be guaranteed, PD tries to avoid scheduling different replicas to the same rack in the same zone. The scheduling works similarly when `isolation-level` is set to `host` where PD first guarantees the isolation level of rack, and then the level of host.
diff --git a/shard-row-id-bits.md b/shard-row-id-bits.md
index 1aba21c2417b3..9877c6d9e2255 100644
--- a/shard-row-id-bits.md
+++ b/shard-row-id-bits.md
@@ -11,7 +11,7 @@ This document introduces the `SHARD_ROW_ID_BITS` table attribute, which is used
For the tables with a non-integer primary key or no primary key, TiDB uses an implicit auto-increment row ID. When a large number of `INSERT` operations are performed, the data is written into a single Region, causing a write hot spot.
-To mitigate the hot spot issue, you can configure `SHARD_ROW_ID_BITS`. The row IDs are scattered and the data are written into multiple different Regions. But setting an overlarge value might lead to an excessively large number of RPC requests, which increases the CPU and network overheads.
+To mitigate the hot spot issue, you can configure `SHARD_ROW_ID_BITS`. The row IDs are scattered and the data are written into multiple different Regions.
- `SHARD_ROW_ID_BITS = 4` indicates 16 shards
- `SHARD_ROW_ID_BITS = 6` indicates 64 shards
diff --git a/sql-plan-management.md b/sql-plan-management.md
index 8f7be10ccd162..700dc254e8858 100644
--- a/sql-plan-management.md
+++ b/sql-plan-management.md
@@ -373,7 +373,7 @@ Insert filtering conditions into the system table `mysql.capture_plan_baselines_
Before upgrading a TiDB cluster, you can use baseline capturing to prevent regression of execution plans by performing the following steps:
-1. Enable baseline capturing and keep it working for a period of time.
+1. Enable baseline capturing and keep it working.
> **Note:**
>
diff --git a/statement-summary-tables.md b/statement-summary-tables.md
index d33eb02e064cc..110cd35f0ec96 100644
--- a/statement-summary-tables.md
+++ b/statement-summary-tables.md
@@ -133,11 +133,11 @@ The `statements_summary_evicted` table records the recent 24 periods during whic
> **Note:**
>
-> The `tidb_stmt_summary_history_size`, `tidb_stmt_summary_max_stmt_count`, and `tidb_stmt_summary_max_sql_length` configuration items affect memory usage. It is recommended that you adjust these configurations based on your needs. It is not recommended to set them too large values.
+> The `tidb_stmt_summary_history_size`, `tidb_stmt_summary_max_stmt_count`, and `tidb_stmt_summary_max_sql_length` configuration items affect memory usage. It is recommended that you adjust these configurations based on your needs, the SQL size, SQL count, and machine configuration. It is not recommended to set them to overly large values. You can calculate the memory usage using `tidb_stmt_summary_history_size` \* `tidb_stmt_summary_max_stmt_count` \* `tidb_stmt_summary_max_sql_length` \* `3`.
### Set a proper size for statement summary
-After the system has run for a period of time, you can check the `statement_summary` table to see whether SQL eviction has occurred.
+After the system has run for a period of time (depending on the system load), you can check the `statement_summary` table to see whether SQL eviction has occurred.
For example:
```sql
select @@global.tidb_stmt_summary_max_stmt_count;
diff --git a/storage-engine/titan-overview.md b/storage-engine/titan-overview.md
index d1fa7ecac018b..c31e688a440c4 100644
--- a/storage-engine/titan-overview.md
+++ b/storage-engine/titan-overview.md
@@ -7,7 +7,7 @@ summary: Learn the overview of the Titan storage engine.
[Titan](https://github.com/pingcap/rocksdb/tree/titan-5.15) is a high-performance [RocksDB](https://github.com/facebook/rocksdb) plugin for key-value separation. Titan can reduce write amplification in RocksDB when large values are used.
-When the value size in Key-Value pairs is large, Titan performs better than RocksDB in write, update, and point read scenarios. However, Titan gets a higher write performance by sacrificing storage space and range query performance. As the price of SSDs continues to decrease, this trade-off will be more and more meaningful.
+When the value size in Key-Value pairs is large (larger than 1 KB, or 512 B in some situations), Titan performs better than RocksDB in write, update, and point read scenarios. However, Titan gets a higher write performance by sacrificing storage space and range query performance. As the price of SSDs continues to decrease, this trade-off will be more and more meaningful.
## Key features
@@ -29,7 +29,7 @@ The prerequisites for enabling Titan are as follows:
- The average size of values is large, or the size of all large values accounts for much of the total value size. Currently, the size of a value greater than 1 KB is considered as a large value. In some situations, this number (1 KB) can be 512 B. Note that a single value written to TiKV cannot exceed 8 MB due to the limitation of the TiKV Raft layer. You can adjust the [`raft-entry-max-size`](/tikv-configuration-file.md#raft-entry-max-size) configuration value to relax the limit.
- No range query will be performed or you do not need a high performance of range query. Because the data stored in Titan is not well-ordered, its performance of range query is poorer than that of RocksDB, especially for the query of a large range. According to PingCAP's internal test, Titan's range query performance is 40% to a few times lower than that of RocksDB.
-- Sufficient disk space, because Titan reduces write amplification at the cost of disk space. In addition, Titan compresses values one by one, and its compression rate is lower than that of RocksDB. RocksDB compresses blocks one by one. Therefore, Titan consumes more storage space than RocksDB, which is expected and normal. In some situations, Titan's storage consumption can be twice that of RocksDB.
+- Sufficient disk space (consider reserving disk space twice the RocksDB disk consumption with the same data volume). This is because Titan reduces write amplification at the cost of disk space. In addition, Titan compresses values one by one, and its compression rate is lower than that of RocksDB. RocksDB compresses blocks one by one. Therefore, Titan consumes more storage space than RocksDB, which is expected and normal. In some situations, Titan's storage consumption can be twice that of RocksDB.
If you want to improve the performance of Titan, see the blog post [Titan: A RocksDB Plugin to Reduce Write Amplification](https://pingcap.com/blog/titan-storage-engine-design-and-implementation/).
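+
+As a reference, Titan is switched on through the TiKV configuration file. The following is a minimal sketch (the `rocksdb.titan.enabled` and `rocksdb.defaultcf.titan.min-blob-size` items are the usual switches; verify them against your TiKV version):
+
+```toml
+[rocksdb.titan]
+# Enables the Titan engine.
+enabled = true
+
+[rocksdb.defaultcf.titan]
+# Values no smaller than this size are stored in Titan blob files.
+min-blob-size = "1KB"
+```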
diff --git a/system-variables.md b/system-variables.md
index 2152c8a82b1dc..40914da1c7f6f 100644
--- a/system-variables.md
+++ b/system-variables.md
@@ -714,7 +714,7 @@ Constraint checking is always performed in place for pessimistic transactions (d
- Unit: Rows
- This variable is used to set the batch size during the `re-organize` phase of the DDL operation. For example, when TiDB executes the `ADD INDEX` operation, the index data needs to be backfilled by `tidb_ddl_reorg_worker_cnt` (the number) concurrent workers. Each worker backfills the index data in batches.
- If many updating operations such as `UPDATE` and `REPLACE` exist during the `ADD INDEX` operation, a larger batch size indicates a larger probability of transaction conflicts. In this case, you need to adjust the batch size to a smaller value. The minimum value is 32.
- - If the transaction conflict does not exist, you can set the batch size to a large value. This can increase the speed of the backfilling data, but the write pressure on TiKV also becomes higher.
+ - If the transaction conflict does not exist, you can set the batch size to a large value (consider the worker count; see [Interaction Test on Online Workloads and `ADD INDEX` Operations](/benchmark/online-workloads-and-add-index-operations.md) for reference). This can increase the speed of backfilling data, but the write pressure on TiKV also becomes higher.
### tidb_ddl_reorg_priority
@@ -760,7 +760,7 @@ Constraint checking is always performed in place for pessimistic transactions (d
- This variable is used to set the concurrency of the `scan` operation.
- Use a bigger value in OLAP scenarios, and a smaller value in OLTP scenarios.
- For OLAP scenarios, the maximum value should not exceed the number of CPU cores of all the TiKV nodes.
-- If a table has a lot of partitions, you can reduce the variable value appropriately to avoid TiKV becoming out of memory (OOM).
+- If a table has a lot of partitions, you can reduce the variable value appropriately (determined by the size of the data to be scanned and the frequency of the scan) to avoid TiKV becoming out of memory (OOM).
### tidb_dml_batch_size
diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 15e35e0ba066d..3d6c01d6249b8 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -103,39 +103,16 @@ A replication task might be interrupted in the following known scenarios:
2. Use the new task configuration file and add the `ignore-txn-start-ts` parameter to skip the transaction corresponding to the specified `start-ts`.
3. Stop the old replication task via HTTP API. Execute `cdc cli changefeed create` to create a new task and specify the new task configuration file. Specify `checkpoint-ts` recorded in step 1 as the `start-ts` and start a new task to resume the replication.
-- In TiCDC v4.0.13 and earlier versions, when TiCDC replicates the partitioned table, it might encounter an error that leads to replication interruption.
+- In TiCDC v4.0.13 and earlier versions, when TiCDC replicates the partitioned table, it might encounter an error that leads to replication interruption.
- In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
- Handling procedures:
- 1. Pause the replication task by executing `cdc cli changefeed pause -c `.
+ 1. Pause the replication task by executing `cdc cli changefeed pause -c `.
2. Wait for about one minute, and then resume the replication task by executing `cdc cli changefeed resume -c `.
### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption?
-- Update your TiDB cluster and TiCDC cluster to the latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the latest versions**.
-
-- In the above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. For example:
-
-{{< copyable "shell-regular" >}}
-
-```shell
-cdc cli changefeed update -c --sort-engine="unified" --pd=http://10.0.10.25:2379
-```
-
-If you fail to update your cluster to the above new versions, you can still enable Unified Sorter in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example:
-
-{{< copyable "shell-regular" >}}
-
-```shell
-cdc cli changefeed update -c --sort-engine="unified" --sort-dir="/data/cdc/sort" --pd=http://10.0.10.25:2379
-```
-
-> **Note:**
->
-> + Since v4.0.9, TiCDC supports the unified sorter engine.
-> + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings.
-> + `sort-dir` has different behaviors in different versions. Refer to [compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configure it with caution.
-> + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication.
+- Update your TiDB cluster and TiCDC cluster to the latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the latest versions**.
## What is `gc-ttl` in TiCDC?
diff --git a/tidb-configuration-file.md b/tidb-configuration-file.md
index 7aa065f1a3894..87f9adac0c558 100644
--- a/tidb-configuration-file.md
+++ b/tidb-configuration-file.md
@@ -15,7 +15,7 @@ The TiDB configuration file supports more options than command-line parameters.
- Determines whether to create a separate Region for each table.
- Default value: `true`
-- It is recommended to set it to `false` if you need to create a large number of tables.
+- It is recommended to set it to `false` if you need to create a large number of tables (for example, more than 100 thousand tables).
### `token-limit`
@@ -676,7 +676,7 @@ For pessimistic transaction usage, refer to [TiDB Pessimistic Transaction Mode](
+ Determines the transaction mode that the auto-commit transaction uses when the pessimistic transaction mode is globally enabled (`tidb_txn_mode='pessimistic'`). By default, even if the pessimistic transaction mode is globally enabled, the auto-commit transaction still uses the optimistic transaction mode.
After enabling `pessimistic-auto-commit` (set to `true`), the auto-commit transaction also uses pessimistic mode, which is consistent with the other explicitly committed pessimistic transactions.
+ For scenarios with conflicts, after enabling this configuration, TiDB includes auto-commit transactions into the global lock-waiting management, which avoids deadlocks and mitigates the latency spike brought by deadlock-causing conflicts.
-+ For scenarios with no conflicts, if there are many auto-commit transactions, and a single transaction operates a large data volume, enabling this configuration causes performance regression. For example, the auto-commit `INSERT INTO SELECT` statement.
++ For scenarios with no conflicts, if there are many auto-commit transactions (the specific number is determined by the real scenarios; for example, auto-commit transactions account for more than half of the total transactions of the application), and a single transaction operates a large data volume, enabling this configuration causes performance regression. For example, the auto-commit `INSERT INTO SELECT` statement.
+ Default value: `false`
## experimental
diff --git a/tidb-lightning/tidb-lightning-distributed-import.md b/tidb-lightning/tidb-lightning-distributed-import.md
index 4ff9da8c93226..f4cc8b3e75a14 100644
--- a/tidb-lightning/tidb-lightning-distributed-import.md
+++ b/tidb-lightning/tidb-lightning-distributed-import.md
@@ -134,7 +134,7 @@ nohup tiup tidb-lightning -config tidb-lightning.toml > nohup.out &
During parallel import, TiDB Lightning automatically performs the following checks after starting the task.
-- Check whether there is enough space on the local disk and on the TiKV cluster for importing data. TiDB Lightning samples the data sources and estimates the percentage of the index size from the sample result. Because indexes are included in the estimation, there may be cases where the size of the source data is less than the available space on the local disk, but still the check fails.
+- Check whether there is enough space on the local disk (controlled by the `sort-kv-dir` configuration) and on the TiKV cluster for importing data. To learn the required disk space, see [Downstream storage space requirements](/tidb-lightning/tidb-lightning-requirements.md#downstream-storage-space-requirements) and [Resource requirements](/tidb-lightning/tidb-lightning-requirements.md#resource-requirements). TiDB Lightning samples the data sources and estimates the percentage of the index size from the sample result. Because indexes are included in the estimation, there might be cases where the size of the source data is less than the available space on the local disk, but the check still fails.
- Check whether the regions in the TiKV cluster are distributed evenly and whether there are too many empty regions. If the number of empty regions exceeds max(1000, number of tables * 3), i.e. greater than the bigger one of "1000" or "3 times the number of tables ", then the import cannot be executed.
- Check whether the data is imported in order from the data sources. The size of `mydumper.batch-size` is automatically adjusted based on the result of the check. Therefore, the `mydumper.batch-size` configuration is no longer available.
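+
+For instance, a minimal sketch of the related TiDB Lightning configuration (recent TiDB Lightning versions name the local sort directory `sorted-kv-dir` under `[tikv-importer]`; the path is illustrative):
+
+```toml
+[tikv-importer]
+# Local-backend import: key-value pairs are sorted in this directory before being ingested into TiKV.
+backend = "local"
+sorted-kv-dir = "/mnt/ssd/sorted-kv-dir"
+```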
diff --git a/tidb-lightning/tidb-lightning-prechecks.md b/tidb-lightning/tidb-lightning-prechecks.md
index 7d75972b37501..c4e50292934aa 100644
--- a/tidb-lightning/tidb-lightning-prechecks.md
+++ b/tidb-lightning/tidb-lightning-prechecks.md
@@ -12,7 +12,7 @@ The following table describes each check item and detailed explanation.
| Check Items | Supported Version| Description |
| ---- | ---- |---- |
| Cluster version and status| >= 5.3.0 | Check whether the cluster can be connected in the configuration, and whether the TiKV/PD/TiFlash version supports the Local import mode when the backend mode is Local. |
-| Disk space | >= 5.3.0 | Check whether there is enough space on the local disk and on the TiKV cluster for importing data. TiDB Lightning samples the data sources and estimates the percentage of the index size from the sample result. Because indexes are included in the estimation, there may be cases where the size of the source data is less than the available space on the local disk, but still the check fails. When the backend is Local, it also checks whether the local storage is sufficient because external sorting needs to be done locally. |
+| Disk space | >= 5.3.0 | Check whether there is enough space on the local disk and on the TiKV cluster for importing data. TiDB Lightning samples the data sources and estimates the percentage of the index size from the sample result. Because indexes are included in the estimation, there might be cases where the size of the source data is less than the available space on the local disk, but the check still fails. When the backend is Local, it also checks whether the local storage is sufficient because external sorting needs to be done locally. For more details about the TiKV cluster space and local storage space (controlled by `sort-kv-dir`), see [Downstream storage space requirements](/tidb-lightning/tidb-lightning-requirements.md#downstream-storage-space-requirements) and [Resource requirements](/tidb-lightning/tidb-lightning-requirements.md#resource-requirements). |
| Region distribution status | >= 5.3.0 | Check whether the Regions in the TiKV cluster are distributed evenly and whether there are too many empty Regions. If the number of empty Regions exceeds max(1000, number of tables * 3), i.e. greater than the bigger one of "1000" or "3 times the number of tables ", then the import cannot be executed. |
| Exceedingly Large CSV files in the data file | >= 5.3.0 | When there are CSV files larger than 10 GiB in the backup file and auto-slicing is not enabled (StrictFormat=false), it will impact the import performance. The purpose of this check is to remind you to ensure the data is in the right format and to enable auto-slicing. |
| Recovery from breakpoints | >= 5.3.0 | This check ensures that no changes are made to the source file or schema in the database during the breakpoint recovery process that would result in importing the wrong data. |
diff --git a/tidb-troubleshooting-map.md b/tidb-troubleshooting-map.md
index b4a53d1c1b129..f77e5f6457016 100644
--- a/tidb-troubleshooting-map.md
+++ b/tidb-troubleshooting-map.md
@@ -14,7 +14,7 @@ This document summarizes common issues in TiDB and other components. You can use
- 1.1.1 The `Region is Unavailable` error is usually because a Region is not available for a period of time. You might encounter `TiKV server is busy`, or the request to TiKV fails due to `not leader` or `epoch not match`, or the request to TiKV time out. In such cases, TiDB performs a `backoff` retry mechanism.
When the `backoff` exceeds a threshold (20s by default), the error will be sent to the client. Within the `backoff` threshold, this error is not visible to the client.
-- 1.1.2 Multiple TiKV instances are OOM at the same time, which causes no Leader in a Region for a period of time. See [case-991](https://github.com/pingcap/tidb-map/blob/master/maps/diagnose-case-study/case991.md) in Chinese.
+- 1.1.2 Multiple TiKV instances are OOM at the same time, which causes some Regions to have no Leader during the OOM period. See [case-991](https://github.com/pingcap/tidb-map/blob/master/maps/diagnose-case-study/case991.md) in Chinese.
- 1.1.3 TiKV reports `TiKV server is busy`, and exceeds the `backoff` time. For more details, refer to [4.3](#43-the-client-reports-the-server-is-busy-error). `TiKV server is busy` is a result of the internal flow control mechanism and should not be counted in the `backoff` time. This issue will be fixed.
diff --git a/tiflash/troubleshoot-tiflash.md b/tiflash/troubleshoot-tiflash.md
index 1b7a11c83026a..7946a14f8f977 100644
--- a/tiflash/troubleshoot-tiflash.md
+++ b/tiflash/troubleshoot-tiflash.md
@@ -211,10 +211,6 @@ After deploying a TiFlash node and starting replication (by performing the ALTER
- If the keyword is found, the PD schedules properly.
- If not, the PD does not schedule properly. Contact PingCAP technical support for help.
-> **Note:**
->
-> When there are many small Regions in the table to be replicated, and the `region merge` parameter is enabled or set to a large value, the replication progress might stay unchanged or be reduced in a period of time.
## Data replication gets stuck
If data replication on TiFlash starts normally but then all or some data fails to be replicated after a period of time, you can confirm or resolve the issue by performing the following steps:
diff --git a/tikv-configuration-file.md b/tikv-configuration-file.md
index 634640386de40..0386d47af216b 100644
--- a/tikv-configuration-file.md
+++ b/tikv-configuration-file.md
@@ -1620,7 +1620,7 @@ For pessimistic transaction usage, refer to [TiDB Pessimistic Transaction Mode](
Configuration items related to Quota Limiter.
-Suppose that your machine on which TiKV is deployed has limited resources, for example, with only 4v CPU and 16 G memory. In this situation, if the foreground of TiKV processes too many read and write requests, the CPU resources used by the background are occupied to help process such requests, which affects the performance stability of TiKV. To avoid this situation, you can use the quota-related configuration items to limit the CPU resources to be used by the foreground. When a request triggers Quota Limiter, the request is forced to wait for a while for TiKV to free up CPU resources. The exact waiting time depends on the number of requests, and the maximum waiting time is no longer than the value of [`max-delay-duration`](#max-delay-duration-new-in-v600).
+Suppose that your machine on which TiKV is deployed has limited resources, for example, with only 4v CPU and 16 G memory. In this situation, the foreground of TiKV might process too many read and write requests so that the CPU resources used by the background are occupied to help process such requests, which affects the performance stability of TiKV. To avoid this situation, you can use the quota-related configuration items to limit the CPU resources to be used by the foreground. When a request triggers Quota Limiter, the request is forced to wait for a while for TiKV to free up CPU resources. The exact waiting time depends on the number of requests, and the maximum waiting time is no longer than the value of [`max-delay-duration`](#max-delay-duration-new-in-v600).
> **Warning:**
>
diff --git a/troubleshoot-hot-spot-issues.md b/troubleshoot-hot-spot-issues.md
index a35f2ce280637..3f5f4868066e8 100644
--- a/troubleshoot-hot-spot-issues.md
+++ b/troubleshoot-hot-spot-issues.md
@@ -89,7 +89,7 @@ Hover over the bright block, you can see what table or index has a heavy load. F
For a non-integer primary key or a table without a primary key or a joint primary key, TiDB uses an implicit auto-increment RowID. When a large number of `INSERT` operations exist, the data is written into a single Region, resulting in a write hotspot.
-By setting `SHARD_ROW_ID_BITS`, RowID are scattered and written into multiple Regions, which can alleviates the write hotspot issue. However, if you set `SHARD_ROW_ID_BITS` to an over large value, the number of RPC requests will be enlarged, increasing CPU and network overhead.
+By setting `SHARD_ROW_ID_BITS`, row IDs are scattered and written into multiple Regions, which can alleviate the write hotspot issue.
```
SHARD_ROW_ID_BITS = 4 # Represents 16 shards.