
stats, sql: fix format and update wording #18192

Merged
merged 14 commits into from
Jul 19, 2024
2 changes: 1 addition & 1 deletion extended-statistics.md
@@ -105,7 +105,7 @@ Other TiDB nodes will read this change and delete the object in their memory cac

### Export and import extended statistics

The way of exporting or importing extended statistics is the same as exporting or importing basic statistics. See [Introduction to Statistics - Import and export statistics](/statistics.md#import-and-export-statistics) for details.
The way of exporting or importing extended statistics is the same as exporting or importing basic statistics. See [Introduction to Statistics - Import and export statistics](/statistics.md#export-and-import-statistics) for details.

## Usage examples for correlation-type extended statistics

2 changes: 1 addition & 1 deletion sql-statements/sql-statement-show-stats-histograms.md
@@ -12,7 +12,7 @@ Currently, the `SHOW STATS_HISTOGRAMS` statement returns the following columns:

| Column name | Description |
| -------- | ------------- |
| Db_name | Database name |
| `Db_name` | Database name |
| `Table_name` | The table name |
| `Partition_name` | The partition name |
| `Column_name` | The column name |
56 changes: 28 additions & 28 deletions statistics.md
@@ -30,8 +30,8 @@ Based upon the number of changes to a table, TiDB will automatically schedule [`

| System Variable | Default Value | Description |
|---|---|---|
| [`tidb_enable_auto_analyze`](/system-variables.md#tidb_enable_auto_analyze-new-in-v610) | true | Controls whether TiDB automatically executes ANALYZE. |
| [`tidb_auto_analyze_ratio`](/system-variables.md#tidb_auto_analyze_ratio) | 0.5 | The threshold value of automatic update |
| [`tidb_enable_auto_analyze`](/system-variables.md#tidb_enable_auto_analyze-new-in-v610) | `ON` | Controls whether TiDB automatically executes ANALYZE. |
| [`tidb_auto_analyze_ratio`](/system-variables.md#tidb_auto_analyze_ratio) | `0.5` | The threshold value of automatic update |
| [`tidb_auto_analyze_start_time`](/system-variables.md#tidb_auto_analyze_start_time) | `00:00 +0000` | The start time in a day when TiDB can perform automatic update |
| [`tidb_auto_analyze_end_time`](/system-variables.md#tidb_auto_analyze_end_time) | `23:59 +0000` | The end time in a day when TiDB can perform automatic update |
| [`tidb_auto_analyze_partition_batch_size`](/system-variables.md#tidb_auto_analyze_partition_batch_size-new-in-v640) | `128` | The number of partitions that TiDB automatically analyzes when analyzing a partitioned table (that is, when automatically updating statistics on a partitioned table) |
@@ -43,11 +43,11 @@ To avoid the situation that modifying data on a small table frequently triggers

> **Note:**
>
> Currently, the automatic update does not record the configuration items input at manual `ANALYZE`. Therefore, when you use the `WITH` syntax to control the collecting behavior of `ANALYZE`, you need to manually set scheduled tasks to collect statistics.
> Currently, the automatic update does not record the configuration items input at manual `ANALYZE`. Therefore, when you use the [`WITH`](/sql-statements/sql-statement-analyze-table.md) syntax to control the collecting behavior of `ANALYZE`, you need to manually set scheduled tasks to collect statistics.
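
As a rough illustration of the variables in the table above, the following sketch inspects the current automatic-update settings and restricts automatic `ANALYZE` to a nightly window. The window values are illustrative assumptions, not recommendations.

```sql
-- List the current values of the automatic-update variables.
SHOW GLOBAL VARIABLES LIKE 'tidb%auto_analyze%';

-- Hypothetical window: allow automatic update only between 01:00 and 05:00 UTC.
SET GLOBAL tidb_auto_analyze_start_time = '01:00 +0000';
SET GLOBAL tidb_auto_analyze_end_time = '05:00 +0000';
```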

### Manual collection

Currently, TiDB collects statistical information as a full collection. You can execute the `ANALYZE TABLE` statement to collect statistics.
Currently, TiDB collects statistics as a full collection. You can execute the `ANALYZE TABLE` statement to collect statistics.

You can perform full collection using the following syntax.

@@ -68,7 +68,7 @@ You can perform full collection using the following syntax.

See [Histograms](#histogram), [Top-N](#top-n-values) and [CMSketch](#count-min-sketch) (Count-Min Sketch) for detailed explanations. For `SAMPLES`/`SAMPLERATE`, see [Improve collection performance](#improve-collection-performance).

For information on persisting the options for easier reuse, see [Persist ANALYZE configurations](#persist-analyze-configurations).
For information on persisting the options for easier reuse, see [Persist `ANALYZE` configurations](#persist-analyze-configurations).
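
As a hedged illustration of the options referenced above, a manual full collection might look like the following. The table name `t` is hypothetical and the option values are arbitrary examples.

```sql
-- `t` is a hypothetical table name; option values are illustrative only.
-- Collect statistics with 256 histogram buckets and the 100 most frequent values.
ANALYZE TABLE t WITH 256 BUCKETS, 100 TOPN;

-- Alternatively, sample roughly 10% of the rows.
ANALYZE TABLE t WITH 0.1 SAMPLERATE;
```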

## Types of statistics

@@ -90,7 +90,7 @@ For details about the parameter that determines the upper limit to the number of
>
> Count-Min Sketch is used in statistics Version 1 only for equal/IN predicate selectivity estimation. In Version 2, other statistics are used due to challenges in managing Count-Min sketch to avoid collisions as discussed below.

Count-Min Sketch is a hash structure. When an equivalence query contains `a = 1` or `IN` query (for example, `a IN (1, 2, 3)`), TiDB uses this data structure for estimation.
Count-Min Sketch is a hash structure. When processing an equivalence query such as `a = 1` or an `IN` query (for example, `a IN (1, 2, 3)`), TiDB uses this data structure for estimation.
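
To check how close such an estimate is, one option is to compare the estimated and actual row counts. A minimal sketch, assuming a hypothetical table `t` with a column `a`:

```sql
-- `t` and `a` are hypothetical names.
-- Compare the estRows and actRows columns in the output.
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;
EXPLAIN ANALYZE SELECT * FROM t WHERE a IN (1, 2, 3);
```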

A hash collision might occur since Count-Min Sketch is a hash structure. In the [`EXPLAIN`](/sql-statements/sql-statement-explain.md) statement, if the estimate of the equivalent query deviates greatly from the actual value, it can be considered that a larger value and a smaller value have been hashed together. In this case, you can take one of the following ways to avoid the hash collision:

@@ -101,7 +101,7 @@ A hash collision might occur since Count-Min Sketch is a hash structure. In the

Top-N values are values with the top N occurrences in a column or index. Top-N statistics are often referred to as frequency statistics or data skew.

TiDB records the values and occurrences of Top-N values. The default value is 20, meaning the top 20 most frequent values are collected. The maximum value is 1024. For details about the parameter that determines the number of values collected, see [Manual collection](#manual-collection).
TiDB records the values and occurrences of Top-N values. Here `N` is controlled by the `WITH NUM TOPN` parameter. The default value is 20, meaning the top 20 most frequent values are collected. The maximum value is 1024. For details about the parameter, see [Manual collection](#manual-collection).
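
For example, the following sketch collects the 50 most frequent values for a hypothetical table `t` and then inspects them. The value 50 is an arbitrary choice within the documented maximum of 1024.

```sql
-- `t` is a hypothetical table name.
ANALYZE TABLE t WITH 50 TOPN;

-- View the Top-N values that were recorded.
SHOW STATS_TOPN WHERE table_name = 't';
```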

## Selective statistics collection

@@ -148,13 +148,13 @@ If a table has many columns, collecting statistics on all the columns can cause

<CustomContent platform="tidb">

After the setting, TiDB writes the `PREDICATE COLUMNS` information to the `mysql.column_stats_usage` system table every 100 * [`stats-lease`](/tidb-configuration-file.md#stats-lease).
After the setting, TiDB writes the `PREDICATE COLUMNS` information to the [`mysql.column_stats_usage`](/mysql-schema.md#statistics-system-tables) system table every 100 * [`stats-lease`](/tidb-configuration-file.md#stats-lease).
Contributor review comment:

The docs for the mysql schema could be improved:

But this doesn't need to be fixed in this PR


</CustomContent>

<CustomContent platform="tidb-cloud">

After the setting, TiDB writes the `PREDICATE COLUMNS` information to the `mysql.column_stats_usage` system table every 300 seconds.
After the setting, TiDB writes the `PREDICATE COLUMNS` information to the [`mysql.column_stats_usage`](/mysql-schema.md#statistics-system-tables) system table every 300 seconds.

</CustomContent>

@@ -168,7 +168,7 @@ If a table has many columns, collecting statistics on all the columns can cause

> **Note:**
>
> - If the [`mysql.column_stats_usage`](/mysql-schema.md) system table does not contain any `PREDICATE COLUMNS` recorded for that table, the preceding syntax collects statistics on all columns and all indexes in that table.
> - If the [`mysql.column_stats_usage`](/mysql-schema.md#statistics-system-tables) system table does not contain any `PREDICATE COLUMNS` recorded for that table, the preceding syntax collects statistics on all columns and all indexes in that table.
> - Any columns excluded from collection (either by manually listing columns or using `PREDICATE COLUMNS`) will not have their statistics overwritten. When executing a new type of SQL query, the optimizer will use the old statistics for such columns if they exist, or pseudo column statistics if the columns never had statistics collected. The next `ANALYZE` using `PREDICATE COLUMNS` will collect the statistics on those columns.

- To collect statistics on all columns and indexes, use the following syntax:
@@ -217,7 +217,7 @@ If partitions are empty, or columns for some partitions are missing, then the co

- If statistics of all or some columns are missing for some partitions, TiDB skips these missing partition statistics when generating GlobalStats so the generation of GlobalStats is not affected.

In dynamic pruning mode, the Analyze configurations of partitions and tables should be the same. Therefore, if you specify the `COLUMNS` configuration following the `ANALYZE TABLE TableName PARTITION PartitionNameList` statement or the `OPTIONS` configuration following `WITH`, TiDB will ignore them and return a warning.
In dynamic pruning mode, the `ANALYZE` configurations of partitions and tables should be the same. Therefore, if you specify the `COLUMNS` configuration following the `ANALYZE TABLE TableName PARTITION PartitionNameList` statement or the `OPTIONS` configuration following `WITH`, TiDB will ignore them and return a warning.
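
A minimal sketch of the behavior described above, assuming a hypothetical partitioned table `t` with partitions `p0` and `p1`, and dynamic pruning mode enabled:

```sql
-- `t`, `p0`, and `p1` are hypothetical names.
-- Analyze selected partitions only.
ANALYZE TABLE t PARTITION p0, p1;

-- Per the paragraph above, the WITH option here is expected to be ignored with a warning.
ANALYZE TABLE t PARTITION p0 WITH 100 TOPN;
SHOW WARNINGS;
```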

## Improve collection performance

@@ -258,9 +258,9 @@ To set a proper value of `tidb_mem_quota_analyze`, consider the data size of the
> **Note:**
>
> The following suggestions are for reference only. You need to configure the values based on the real scenario.
>
> - Minimum value: should be greater than the maximum memory usage when TiDB collects statistics from the table with the most columns. An approximate reference: when TiDB collects statistics from a table with 20 columns using the default configuration, the maximum memory usage is about 800 MiB; when TiDB collects statistics from a table with 160 columns using the default configuration, the maximum memory usage is about 5 GiB.
> - Maximum value: should be less than the available memory when TiDB is not collecting statistics.

- Minimum value: should be greater than the maximum memory usage when TiDB collects statistics from the table with the most columns. An approximate reference: when TiDB collects statistics from a table with 20 columns using the default configuration, the maximum memory usage is about 800 MiB; when TiDB collects statistics from a table with 160 columns using the default configuration, the maximum memory usage is about 5 GiB.
- Maximum value: should be less than the available memory when TiDB is not collecting statistics.
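
As an illustration only, the following sets the quota to 1 GiB. This is an assumed value chosen to sit above the approximate 800 MiB reference for a 20-column table, and it assumes the variable takes a value in bytes. Adjust it to the actual workload.

```sql
-- 1 GiB expressed in bytes (1024 * 1024 * 1024); an illustrative value, not a recommendation.
SET GLOBAL tidb_mem_quota_analyze = 1073741824;

-- Confirm the new value.
SELECT @@global.tidb_mem_quota_analyze;
```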

## Persist ANALYZE configurations

@@ -270,10 +270,10 @@ The following are the `ANALYZE` configurations that support persistence:

| Configurations | Corresponding ANALYZE syntax |
| --- | --- |
| The number of histogram buckets | WITH NUM BUCKETS |
| The number of Top-N | WITH NUM TOPN |
| The number of samples | WITH NUM SAMPLES |
| The sampling rate | WITH FLOATNUM SAMPLERATE |
| The number of histogram buckets | `WITH NUM BUCKETS` |
| The number of Top-N | `WITH NUM TOPN` |
| The number of samples | `WITH NUM SAMPLES` |
| The sampling rate | `WITH FLOATNUM SAMPLERATE` |
| The `ANALYZE` column type | AnalyzeColumnOption ::= ( 'ALL COLUMNS' \| 'PREDICATE COLUMNS' \| 'COLUMNS' ColumnNameList ) |
| The `ANALYZE` column | ColumnNameList ::= Identifier ( ',' Identifier )* |
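
As a sketch, assuming persistence is controlled by the `tidb_persist_analyze_options` system variable and `t` is a hypothetical table, options given in one manual `ANALYZE` are recorded for later reuse:

```sql
-- Assumption: persistence is controlled by this variable.
SET GLOBAL tidb_persist_analyze_options = ON;

-- `t` is a hypothetical table name; the option value is illustrative.
ANALYZE TABLE t WITH 200 TOPN;

-- A later run without options is expected to reuse the persisted configuration.
ANALYZE TABLE t;
```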

@@ -374,18 +374,18 @@ The following table lists the information collected by each version for usage in

| Information | Version 1 | Version 2|
| --- | --- | ---|
| The total number of rows in the table | | |
| Equal/IN predicate estimation | (Column/Index Top-N & Count-Min Sketch) | (Column/Index Top-N & Histogram) |
| Range predicate estimation | (Column/Index Top-N & Histogram) | (Column/Index Top-N & Histogram) |
| `NULL` predicate estimation | | |
| The average length of columns | | |
| The average length of indexes | | |
| The total number of rows in the table | | |
| Equal/IN predicate estimation | (Column/Index Top-N & Count-Min Sketch) | (Column/Index Top-N & Histogram) |
| Range predicate estimation | (Column/Index Top-N & Histogram) | (Column/Index Top-N & Histogram) |
| `NULL` predicate estimation | | |
| The average length of columns | | |
| The average length of indexes | | |

### Switch between statistics versions

It is recommended that all tables, indexes, and partitions use statistics collected with the same version. Version 2 is recommended; however, switching from one version to another is not recommended without a justifiable reason, such as an issue experienced with the version in use. During a switch, there might be a period when no statistics are available until all tables have been analyzed with the new version, which might negatively affect the optimizer's plan choices.

Examples of justifications to switch might include - with Version 1, there could be inaccuracies in equal/IN predicate estimation due to hash collisions when collecting Count-Min sketch statistics. Solutions are listed in the [Count-Min Sketch](#count-min-sketch) section. Alternatively, setting `tidb_analyze_version = 2` and rerunning `ANALYZE` on all objects is also a solution. In the early release of Version 2, there was a risk of memory overflow after `ANALYZE`. This issue is resolved, but initially, one solution was to `set tidb_analyze_version = 1` and rerun `ANALYZE` on all objects.
Examples of justifications to switch might include - with Version 1, there could be inaccuracies in equal/IN predicate estimation due to hash collisions when collecting Count-Min sketch statistics. Solutions are listed in the [Count-Min Sketch](#count-min-sketch) section. Alternatively, setting `tidb_analyze_version = 2` and rerunning `ANALYZE` on all objects is also a solution. In the early release of Version 2, there was a risk of memory overflow after `ANALYZE`. This issue is resolved, but initially, one solution was to set `tidb_analyze_version = 1` and rerun `ANALYZE` on all objects.
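
A minimal sketch of setting the version and rerunning `ANALYZE`, as mentioned above. The table name `t` is hypothetical, and the preparation steps for switching still apply.

```sql
-- Check the version currently in effect.
SELECT @@global.tidb_analyze_version, @@session.tidb_analyze_version;

-- The global value applies to sessions opened afterwards; the session value applies to this session.
SET GLOBAL tidb_analyze_version = 2;
SET SESSION tidb_analyze_version = 2;

-- Re-collect statistics; repeat for every table that needs it.
ANALYZE TABLE t;
```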

To prepare `ANALYZE` for switching between versions:

@@ -453,7 +453,7 @@ You can use the [`SHOW STATS_HISTOGRAMS`](/sql-statements/sql-statement-show-sta

### Buckets of histogram

You can use the [`SHOW STATS_BUCKETS`](/sql-statements/sql-statement-show-stats-buckets.md statement to view each bucket of the histogram.
You can use the [`SHOW STATS_BUCKETS`](/sql-statements/sql-statement-show-stats-buckets.md) statement to view each bucket of the histogram.
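
For instance, to limit the output to a single hypothetical table `t`:

```sql
-- `t` is a hypothetical table name.
SHOW STATS_BUCKETS WHERE table_name = 't';
```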

### Top-N information

@@ -503,7 +503,7 @@ After enabling the synchronously loading statistics feature, you can control how

</CustomContent>

## Import and export statistics
## Export and import statistics

<CustomContent platform="tidb-cloud">

@@ -548,7 +548,7 @@ Loading statistics can be done with the [`LOAD STATS`](/sql-statements/sql-state
For example:

```sql
LOAD STATS 'file_name'
LOAD STATS 'file_name';
```

`file_name` is the file name of the statistics to be imported.