Rehash of statistics page #16869

Merged
merged 36 commits into from
Apr 16, 2024
Changes from 13 commits
Commits
1ae708f
commit-message: Rehash of statistics page
terry1purcell Mar 27, 2024
61ad158
review updates1
terry1purcell Mar 27, 2024
5a78732
Merge branch 'pingcap:master' into statsupdate
terry1purcell Mar 28, 2024
0d6b563
Merge branch 'pingcap:master' into statsupdate
terry1purcell Mar 29, 2024
93d978a
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 2, 2024
e37a89b
review comments april 2
terry1purcell Apr 2, 2024
5462e05
review comments april 2 #2
terry1purcell Apr 2, 2024
d65b6bf
review comments april 2 #3
terry1purcell Apr 2, 2024
8610eb1
review comments april 2 #4
terry1purcell Apr 2, 2024
f7bff7f
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 3, 2024
f0de3ad
review comments april 3
terry1purcell Apr 3, 2024
b87c17c
Merge branch 'statsupdate' of github.com:terry1purcell/docs into stat…
terry1purcell Apr 3, 2024
0806236
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 7, 2024
1b65128
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 8, 2024
5c5c7c6
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 9, 2024
8c8c5fc
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 10, 2024
735ca8a
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 11, 2024
71a533d
Apply suggestions from code review
lilin90 Apr 11, 2024
310a4fd
Update format and fix typos
lilin90 Apr 11, 2024
b5c39fd
Update wording and format
lilin90 Apr 11, 2024
70fb323
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 12, 2024
1a617b9
review comments april 11
terry1purcell Apr 12, 2024
8d3e5d5
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 12, 2024
89bbc2f
review comments april 11 #2
terry1purcell Apr 12, 2024
89e72c1
Merge branch 'statsupdate' of github.com:terry1purcell/docs into stat…
terry1purcell Apr 12, 2024
9f0f8a7
Apply suggestions from code review
lilin90 Apr 12, 2024
b3a823f
Update format
lilin90 Apr 12, 2024
6b7f21f
Remove unnecessary list indentation to fix display format
lilin90 Apr 12, 2024
561b574
review comments april 12
terry1purcell Apr 12, 2024
85930f7
Merge branch 'statsupdate' of github.com:terry1purcell/docs into stat…
terry1purcell Apr 12, 2024
108a6f4
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 12, 2024
012d730
Merge branch 'pingcap:master' into statsupdate
terry1purcell Apr 15, 2024
7ee7dee
review comments april 15
terry1purcell Apr 15, 2024
049d460
Apply suggestions from code review
lilin90 Apr 16, 2024
9393f2d
Update list format
lilin90 Apr 16, 2024
0539cbb
Remove an extra line
lilin90 Apr 16, 2024
61 changes: 29 additions & 32 deletions statistics.md
@@ -12,7 +12,7 @@ TiDB uses statistics as input to the optimizer to estimate the number of rows pr

### Automatic update

For the `INSERT`, `DELETE`, or `UPDATE` statements, TiDB automatically updates the number of rows and modified rows.
For the `INSERT`, `DELETE`, or `UPDATE` statements, TiDB automatically updates the number of rows and modified rows in statistics.

<CustomContent platform="tidb">

@@ -26,9 +26,7 @@ TiDB persists the update information every 60 seconds.

</CustomContent>

Based upon the number of changes to a table, TiDB will automatically schedule `ANALYZE` to collect statistics on those tables. This is controlled by the [`tidb_enable_auto_analyze`](/system-variables.md#tidb_enable_auto_analyze-new-in-v610) system variable and `tidb_auto_analyze%` variables.

The following system variables related to automatic update of statistics are as follows:
Based upon the number of changes to a table, TiDB will automatically schedule [`ANALYZE`](/sql-statements/sql-statement-analyze-table.md) to collect statistics on those tables. This is controlled by the [`tidb_enable_auto_analyze`](/system-variables.md#tidb_enable_auto_analyze-new-in-v610) system variable and the following `tidb_auto_analyze%` variables.

| System Variable | Default Value | Description |
|---|---|---|
@@ -68,15 +66,15 @@ You can perform full collection using the following syntax.

`WITH NUM SAMPLES` and `WITH FLOAT_NUM SAMPLERATE` correspond to two different algorithms of collecting samples.

See [Histograms](#histogram), [Top-N](#top-n-values) and [CMSketch](#count-min-sketch) (Count-Min Sketch) for detailed explanations. For `SAMPLES`/`SAMPLERATE`, see [Improving collection performance](#improving-collection-performance).
See [Histograms](#histogram), [Top-N](#top-n-values) and [CMSketch](#count-min-sketch) (Count-Min Sketch) for detailed explanations. For `SAMPLES`/`SAMPLERATE`, see [Improve collection performance](#improve-collection-performance).

For information on persisting the options used to allow for easier reuse, refer to [Persist ANALYZE configurations](#persist-analyze-configurations).
For information on persisting the options for easier reuse, see [Persist ANALYZE configurations](#persist-analyze-configurations).

## Types of statistics

### Histogram

Histogram statistics are used by the optimizer to estimate selectivity of an interval or range predicate, and may be used to determine the number of distinct values for a column for estimation of equal/IN predicates in Version 2 of statistics (refer to [Versions of Statistics](#versions-of-statistics)).
Histogram statistics are used by the optimizer to estimate selectivity of an interval or range predicate, and might also be used to determine the number of distinct values within a column for estimation of equal/IN predicates in Version 2 of statistics (refer to [Versions of Statistics](#versions-of-statistics)).

A histogram is an approximate representation of the distribution of data. It divides the entire range of values into a series of buckets, and uses simple data to describe each bucket, such as the number of values falling in the bucket. In TiDB, an equal-depth histogram is created for the specific columns of each table. The equal-depth histogram can be used to estimate the interval query.
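
As a rough illustration, the statements below collect a histogram with an explicit bucket count and then inspect the buckets that were built. The table name `t` and the value `256` are placeholders assumed for this sketch, not recommended settings.

```sql
-- Hypothetical table `t`; 256 is an illustrative bucket count only.
ANALYZE TABLE t WITH 256 BUCKETS;

-- List the histogram buckets recorded for the columns and indexes of `t`.
SHOW STATS_BUCKETS WHERE table_name = 't';
```

In an equal-depth histogram, each bucket is expected to cover roughly the same number of rows, so the bucket boundaries rather than the bucket counts reflect the data distribution.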

@@ -89,7 +87,8 @@ For details about the parameter that determines the upper limit to the number of
### Count-Min Sketch

> **Note:**
> Count-Min Sketch is used in statistics Version 1 only for equal/IN predicate selectivity estimation. In Version 2, other available statistics are used due to challenges in management of count-min sketch to avoid collisions as discussed below.
>
> Count-Min Sketch is used in statistics Version 1 only for equal/IN predicate selectivity estimation. In Version 2, other statistics are used instead, because managing Count-Min Sketch to avoid collisions is challenging, as discussed below.

Count-Min Sketch is a hash structure. When an equivalence query contains `a = 1` or `IN` query (for example, `a IN (1, 2, 3)`), TiDB uses this data structure for estimation.

@@ -102,7 +101,7 @@ A hash collision might occur since Count-Min Sketch is a hash structure. In the

Top-N values are values with the top N occurrences in a column or index. Top-N statistics are often referred to as frequency statistics or data skew.

TiDB records the values and occurrences of Top-N values. The default value is 20, meaning the top 20 most frequent values are collected. The maximum value is 1024. For details about the parameter that determines the number of values collected, refer to [Manual collection](#manual-collection).
TiDB records the values and occurrences of Top-N values. The default value is 20, meaning the top 20 most frequent values are collected. The maximum value is 1024. For details about the parameter that determines the number of values collected, see [Manual collection](#manual-collection).
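
A minimal sketch of overriding the number of Top-N values at collection time and then viewing what was recorded. The table name `t` and the value `100` are assumptions made for illustration; any value up to the maximum stated above could be used.

```sql
-- Hypothetical table `t`; collect the 100 most frequent values
-- instead of relying on the default.
ANALYZE TABLE t WITH 100 TOPN;

-- Show the recorded Top-N values and their occurrence counts.
SHOW STATS_TOPN WHERE table_name = 't';
```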

## Selective statistics collection

@@ -124,7 +123,7 @@ When `IndexNameList` is empty, this syntax collects statistics on all indexes in

In most cases, the optimizer only uses statistics on columns in the `WHERE`, `JOIN`, `ORDER BY`, and `GROUP BY` statements. These columns can be referred to as `PREDICATE COLUMNS`.

If a table has many columns, collecting statistics on all the columns can cause a large overhead. To reduce the overhead, you can collect statistics on only specific columns (that you choose) or `PREDICATE COLUMNS` to be used by the optimizer. To persist the column list of any subset of columns for reuse in future, refer to [Persisting column configurations](#persisting-column-configurations).
If a table has many columns, collecting statistics on all the columns can cause a large overhead. To reduce the overhead, you can collect statistics on only specific columns (that you choose) or `PREDICATE COLUMNS` to be used by the optimizer. To persist the column list of any subset of columns for reuse in future, see [Persist column configurations](#persist-column-configurations).
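
The two approaches described above might look like the following. The table `t` and the columns `a` and `b` are hypothetical names used only for this sketch.

```sql
-- Collect statistics only on columns that you choose explicitly.
ANALYZE TABLE t COLUMNS a, b;

-- Collect statistics only on the columns that TiDB has recorded
-- as PREDICATE COLUMNS for this table.
ANALYZE TABLE t PREDICATE COLUMNS;
```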

> **Note:**
>
@@ -218,33 +217,32 @@ If a table has many columns, collecting statistics on all the columns can cause

When accessing partitioned tables in [dynamic pruning mode](/partitioned-table.md#dynamic-pruning-mode) (which is the default since v6.3.0), TiDB collects table-level statistics, which is called GlobalStats. Currently, GlobalStats is aggregated from statistics of all partitions. In dynamic pruning mode, a statistics update of any partitioned table can trigger the GlobalStats to be updated.

If partitions are empty, or columns for some partitions are missing, then the collection behavior is controlled by variable [`tidb_skip_missing_partition_stats`](/system-variables.md#tidb_skip_missing_partition_stats-new-in-v730):
- When GlobalStats update is triggered and [`tidb_skip_missing_partition_stats`](/system-variables.md#tidb_skip_missing_partition_stats-new-in-v730) is `OFF`:
If partitions are empty, or columns for some partitions are missing, then the collection behavior is controlled by the [`tidb_skip_missing_partition_stats`](/system-variables.md#tidb_skip_missing_partition_stats-new-in-v730) variable:

- If some partitions have no statistics (such as a new partition that has never been analyzed), GlobalStats generation is interrupted and a warning message is displayed saying that no statistics are available on partitions.
- If some partitions have no statistics (such as a new partition that has never been analyzed), GlobalStats generation is interrupted and a warning message is displayed saying that no statistics are available on partitions.
- If statistics of some columns are absent in specific partitions (different columns are specified for analyzing in these partitions), GlobalStats generation is interrupted when statistics of these columns are aggregated, and a warning message is displayed saying that statistics of some columns are absent in specific partitions.
- If statistics of some columns are absent in specific partitions (different columns are specified for analyzing in these partitions), GlobalStats generation is interrupted when statistics of these columns are aggregated, and a warning message is displayed saying that statistics of some columns are absent in specific partitions.

- When GlobalStats update is triggered and [`tidb_skip_missing_partition_stats`](/system-variables.md#tidb_skip_missing_partition_stats-new-in-v730) is `ON`:
When GlobalStats update is triggered and [`tidb_skip_missing_partition_stats`](/system-variables.md#tidb_skip_missing_partition_stats-new-in-v730) is `ON`:

If statistics of all or some columns are missing for some partitions, TiDB skips these missing partition statistics when generating GlobalStats so the generation of GlobalStats is not affected.
If statistics of all or some columns are missing for some partitions, TiDB skips these missing partition statistics when generating GlobalStats so the generation of GlobalStats is not affected.

- In dynamic pruning mode, the Analyze configurations of partitions and tables should be the same. Therefore, if you specify the `COLUMNS` configuration following the `ANALYZE TABLE TableName PARTITION PartitionNameList` statement or the `OPTIONS` configuration following `WITH`, TiDB will ignore them and return a warning.
In dynamic pruning mode, the `ANALYZE` configurations of partitions and tables should be the same. Therefore, if you specify the `COLUMNS` configuration following the `ANALYZE TABLE TableName PARTITION PartitionNameList` statement or the `OPTIONS` configuration following `WITH`, TiDB ignores them and returns a warning.
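
As a hedged sketch of the `ON` behavior described above, the following statements allow GlobalStats to be generated even when some partitions have no statistics yet. The partitioned table `t` and the partition `p0` are hypothetical, and setting the variable at session scope is an assumption of this example; check the variable's documented scope for your version.

```sql
-- Skip partitions with missing statistics when GlobalStats is merged
-- (session scope assumed here; SET GLOBAL would apply to new sessions).
SET SESSION tidb_skip_missing_partition_stats = ON;

-- Analyze a single partition of hypothetical table `t`. No COLUMNS or
-- WITH options are given, because TiDB ignores them for partition-level
-- ANALYZE in dynamic pruning mode.
ANALYZE TABLE t PARTITION p0;
```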

## Improving collection performance
## Improve collection performance

> **Note:**
>
> - The execution time of `ANALYZE TABLE` in TiDB may be longer than in MySQL or InnoDB. In InnoDB, only a small number of pages are sampled, while by default in TiDB a comprehensive set of statistics are completely rebuilt.
> - Starting from v7.5.0, the [Fast Analyze feature (`tidb_enable_fast_analyze`)](/system-variables.md#tidb_enable_fast_analyze) and the [incremental collection feature](https://docs.pingcap.com/tidb/v7.4/statistics#incremental-collection) for statistics are deprecated.
> - The execution time of `ANALYZE TABLE` in TiDB might be longer than that in MySQL or InnoDB. In InnoDB, only a small number of pages are sampled, while by default in TiDB a comprehensive set of statistics is completely rebuilt.

TiDB provides two options to improve the performance of statistics collection:

1. Collecting statistics on a subset of the columns. See [Collecting Statistics on Some Columns](#collect-statistics-on-some-columns).
2. Sampling.
- Collecting statistics on a subset of the columns. See [Collecting statistics on some columns](#collect-statistics-on-some-columns).
- Sampling.

### Statistics sampling

Sampling is available via two separate options of the ANALYZE statement - with each corresponding to a different collection algorithm:
Sampling is available via two separate options of the `ANALYZE` statement, each corresponding to a different collection algorithm:

- `WITH NUM SAMPLES` specifies the size of the sampling set, which is implemented in the reservoir sampling method in TiDB. When a table is large, it is not recommended to use this method to collect statistics. Because the intermediate result set of the reservoir sampling contains redundant results, it causes additional pressure on resources such as memory.
- `WITH FLOAT_NUM SAMPLERATE` is a sampling method introduced in v5.3.0. With the value range `(0, 1]`, this parameter specifies the sampling rate. It is implemented in the way of Bernoulli sampling in TiDB, which is more suitable for sampling larger tables and performs better in collection efficiency and resource usage.
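
The two sampling options above could be used as follows. The table name `t`, the sample size `10000`, and the rate `0.1` are illustrative assumptions, not tuned values.

```sql
-- Reservoir sampling: fix the size of the sampling set.
-- Not recommended for large tables, as noted above.
ANALYZE TABLE t WITH 10000 SAMPLES;

-- Bernoulli sampling: sample about 10% of the rows.
-- The rate must fall within the range (0, 1].
ANALYZE TABLE t WITH 0.1 SAMPLERATE;
```
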
@@ -282,6 +280,7 @@ To set a proper value of `tidb_mem_quota_analyze`, consider the data size of the
> **Note:**
>
> The following suggestions are for reference only. You need to configure the values based on the real scenario.
>
> - Minimum value: should be greater than the maximum memory usage when TiDB collects statistics from the table with the most columns. An approximate reference: when TiDB collects statistics from a table with 20 columns using the default configuration, the maximum memory usage is about 800 MiB; when TiDB collects statistics from a table with 160 columns using the default configuration, the maximum memory usage is about 5 GiB.
> - Maximum value: should be less than the available memory when TiDB is not collecting statistics.

@@ -324,7 +323,7 @@ SELECT sample_num, sample_rate, buckets, topn, column_choice, column_ids FROM my

TiDB will overwrite the previously recorded persistent configuration using the new configurations specified by the latest `ANALYZE` statement. For example, if you run `ANALYZE TABLE t WITH 200 TOPN;`, it will set the top 200 values in the `ANALYZE` statement. Subsequently, executing `ANALYZE TABLE t WITH 0.1 SAMPLERATE;` will set both the top 200 values and a sampling rate of 0.1 for auto `ANALYZE` statements, similar to `ANALYZE TABLE t WITH 200 TOPN, 0.1 SAMPLERATE;`.

#### Disable ANALYZE configuration persistence
### Disable ANALYZE configuration persistence

To disable the `ANALYZE` configuration persistence feature, set the `tidb_persist_analyze_options` system variable to `OFF`. Because the `ANALYZE` configuration persistence feature is not applicable to `tidb_analyze_version = 1`, setting `tidb_analyze_version = 1` can also disable the feature.
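
For example, the current setting can be checked and the feature turned off as sketched below. This assumes the variable is set at global scope.

```sql
-- Check whether ANALYZE configuration persistence is currently enabled.
SHOW VARIABLES LIKE 'tidb_persist_analyze_options';

-- Disable persistence of ANALYZE configurations.
SET GLOBAL tidb_persist_analyze_options = OFF;
```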

@@ -334,7 +333,7 @@ After disabling the `ANALYZE` configuration persistence feature, TiDB does not c
>
> When you enable the `ANALYZE` configuration persistence feature again, if the previously recorded persistence configurations are no longer applicable to the latest data, you need to execute the `ANALYZE` statement manually and specify the new persistence configurations.

### Persisting column configurations
### Persist column configurations

If you want to persist the column configuration in the `ANALYZE` statement (including `COLUMNS ColumnNameList`, `PREDICATE COLUMNS`, and `ALL COLUMNS`), set the value of the `tidb_persist_analyze_options` system variable to `ON` to enable the [ANALYZE configuration persistence](#persist-analyze-configurations) feature. After enabling the ANALYZE configuration persistence feature:

@@ -406,7 +405,7 @@ The `tidb_analyze_version` variable controls the statistics collected by TiDB. C
- For TiDB Cloud, the default value of this variable changes from `1` to `2` starting from v6.5.0.
- If your cluster is upgraded from an earlier version, the default value of `tidb_analyze_version` does not change after the upgrade.

Version 2 is preferred, and will continue to be enhanced to ultimately replace Version 1 completely. Compared to Version 1, Version 2 improves the accuracy of many of the statistics collected for larger data volumes. Version 2 also improves collection performance by removing the need to collect Count-Min sketch statistics for predicate selectivity estimation, and also supporting automated collection only on selected columns (see section on [Collecting Statistics on Some Columns](#collect-statistics-on-some-columns)).
Version 2 is preferred, and will continue to be enhanced to ultimately replace Version 1 completely. Compared to Version 1, Version 2 improves the accuracy of many of the statistics collected for larger data volumes. Version 2 also improves collection performance by removing the need to collect Count-Min sketch statistics for predicate selectivity estimation, and also supporting automated collection only on selected columns (see [Collecting statistics on some columns](#collect-statistics-on-some-columns)).

The following table lists the information collected by each version for usage in the optimizer estimates:

@@ -419,11 +418,11 @@ The following table lists the information collected by each version for usage in
| The average length of columns | √ | √ |
| The average length of indexes | √ | √ |

### Switching between statistics versions
### Switch between statistics versions

It is recommended to ensure that all tables/indexes (and partitions) utilize statistics collection from the same version. Version 2 is recommended, however it is not recommended to switch from one version to another without a justifiable reason such as an issue experienced with the version in use. A switch between versions may involve a period of time when no statistics are available until all tables have been analyzed with the new version, which may negatively impact the optimizer plan choices if statistics aren't available.
It is recommended to ensure that all tables/indexes (and partitions) utilize statistics collection from the same version. Version 2 is recommended. However, switching from one version to another is not recommended without a justifiable reason, such as an issue experienced with the version in use. During a switch, there might be a period when no statistics are available until all tables have been analyzed with the new version, which might negatively affect the optimizer's plan choices.

Examples of justifications to switch may include - with Version 1, there could be inaccuracies in equal/IN predicate estimation due to hash collisions when collecting count-min sketch statistics. Solutions are listed in the section on [Count-Min Sketch](#count-min-sketch). Alternatively, setting `tidb_analyze_version = 2` and rerunning `ANALYZE` on all objects is also a solution. Early release of Version 2, there was a risk of memory overflow after `ANALYZE`. This issue is resolved, but initially, one solution was to `set tidb_analyze_version = 1` and rerunning `ANALYZE` on all objects.
Justifications to switch might include the following. With Version 1, equal/IN predicate estimation can be inaccurate due to hash collisions when collecting Count-Min Sketch statistics. Solutions are listed in the [Count-Min Sketch](#count-min-sketch) section. Alternatively, setting `tidb_analyze_version = 2` and rerunning `ANALYZE` on all objects is also a solution. In early releases of Version 2, there was a risk of memory overflow after `ANALYZE`. This issue is resolved, but initially, one solution was to set `tidb_analyze_version = 1` and rerun `ANALYZE` on all objects.
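
A minimal sketch of moving to Version 2, assuming a single hypothetical table `t`. In practice, every table, index, and partition would need to be reanalyzed, and the preparation steps that follow still apply.

```sql
-- Check which statistics version is currently in effect.
SHOW VARIABLES LIKE 'tidb_analyze_version';

-- Use Version 2 for new sessions and for the current session.
SET GLOBAL tidb_analyze_version = 2;
SET SESSION tidb_analyze_version = 2;

-- Recollect statistics with the new version (repeat for each table).
ANALYZE TABLE t;
```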

To prepare `ANALYZE` for switching between versions:

@@ -923,16 +922,14 @@ The following table describes the behaviors of locking statistics:
| A partitioned table and the whole table is locked | The lock is invalid | The lock is invalid because TiDB deletes the old table, so the lock information is also deleted | The old partition lock information is invalid, and the new partition is automatically locked | The new partition is automatically locked | The lock information of the deleted partition is cleared, and the lock of the whole table continues to take effect | The lock information of the deleted partition is cleared, and the new partition is automatically locked | The lock information is transferred to the exchanged table, and the new partition is automatically locked |
| A partitioned table and only some partitions are locked | The lock is invalid | The lock is invalid because TiDB deletes the old table, so the lock information is also deleted | The lock is invalid because TiDB deletes the old table, so the lock information is also deleted | / | The deleted partition lock information is cleared | The deleted partition lock information is cleared | The lock information is transferred to the exchanged table |

## Managing ANALYZE tasks and concurrency
## Manage ANALYZE tasks and concurrency

### Terminate background `ANALYZE` tasks

Since TiDB v6.0, TiDB supports using the `KILL` statement to terminate an `ANALYZE` task running in the background. If you find that an `ANALYZE` task running in the background consumes a lot of resources and affects your application, you can terminate the `ANALYZE` task by taking the following steps:

1. Execute the following SQL statement:

{{< copyable "sql" >}}

```sql
SHOW ANALYZE STATUS
```