From 7ec85ec78c8b8e3f09e7cfd997d70c9ad124de5b Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Fri, 3 Nov 2023 16:05:09 +0800
Subject: [PATCH] add snappy restriction note (#15241)

---
 dumpling-overview.md                          | 4 ++++
 sql-statements/sql-statement-import-into.md   | 4 ++++
 storage-engine/titan-overview.md              | 1 +
 ticdc/ticdc-sink-to-kafka.md                  | 2 +-
 tidb-lightning/tidb-lightning-data-source.md  | 5 +++--
 tidb-lightning/troubleshoot-tidb-lightning.md | 6 +++++-
 tikv-configuration-file.md                    | 4 ++++
 tune-tikv-memory-performance.md               | 2 +-
 8 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/dumpling-overview.md b/dumpling-overview.md
index 6697372356c89..f03c5b931e7fa 100644
--- a/dumpling-overview.md
+++ b/dumpling-overview.md
@@ -156,6 +156,10 @@ You can use the `--compress <format>` option to compress the CSV and SQL data an
 - This option can save disk space, but it also slows down the export speed and increases CPU consumption. Use this option with caution in scenarios where the export speed is critical.
 - For TiDB Lightning v6.5.0 and later versions, you can use compressed files exported by Dumpling as the data source without additional configuration.

+> **Note:**
+>
+> The Snappy compressed file must be in the [official Snappy format](https://github.com/google/snappy). Other variants of Snappy compression are not supported.
+
 ### Format of exported files

 - `metadata`: The start time of the exported files and the position of the master binary log.
diff --git a/sql-statements/sql-statement-import-into.md b/sql-statements/sql-statement-import-into.md
index 4e78ec03a7e48..225a8a9fc3f10 100644
--- a/sql-statements/sql-statement-import-into.md
+++ b/sql-statements/sql-statement-import-into.md
@@ -149,6 +149,10 @@ The supported options are described as follows:
 | `.zstd`, `.zst` | ZStd compression format |
 | `.snappy` | Snappy compression format |

+> **Note:**
+>
+> The Snappy compressed file must be in the [official Snappy format](https://github.com/google/snappy). Other variants of Snappy compression are not supported.
+
 ## Global sorting

 `IMPORT INTO` splits the data import job of a source data file into multiple sub-jobs, each sub-job independently encoding and sorting data before importing. If the encoded KV ranges of these sub-jobs have significant overlap (to learn how TiDB encodes data to KV, see [TiDB computing](/tidb-computing.md)), TiKV needs to keep compaction during import, leading to a decrease in import performance and stability.
diff --git a/storage-engine/titan-overview.md b/storage-engine/titan-overview.md
index 67fde1da882e6..ae7adb16d208a 100644
--- a/storage-engine/titan-overview.md
+++ b/storage-engine/titan-overview.md
@@ -54,6 +54,7 @@ A blob file mainly consists of blob records, meta blocks, a meta index block, an
 > + The Key-Value pairs in the blob file are stored in order, so that when the Iterator is implemented, the sequential reading performance can be improved via prefetching.
 > + Each blob record keeps a copy of the user key corresponding to the value. This way, when Titan performs Garbage Collection (GC), it can query the user key and identify whether the corresponding value is outdated. However, this process introduces some write amplification.
 > + BlobFile supports compression at the blob record level. Titan supports multiple compression algorithms, such as [Snappy](https://github.com/google/snappy), [LZ4](https://github.com/lz4/lz4), and [Zstd](https://github.com/facebook/zstd). Currently, the default compression algorithm Titan uses is LZ4.
+> + The Snappy compressed file must be in the [official Snappy format](https://github.com/google/snappy). Other variants of Snappy compression are not supported.

 ### TitanTableBuilder

diff --git a/ticdc/ticdc-sink-to-kafka.md b/ticdc/ticdc-sink-to-kafka.md
index ff3331756ee66..d22f418e72fcc 100644
--- a/ticdc/ticdc-sink-to-kafka.md
+++ b/ticdc/ticdc-sink-to-kafka.md
@@ -58,7 +58,7 @@ The following are descriptions of sink URI parameters and values that can be con
 | `max-message-bytes` | The maximum size of data that is sent to Kafka broker each time (optional, `10MB` by default). From v5.0.6 and v4.0.6, the default value has changed from `64MB` and `256MB` to `10MB`. |
 | `replication-factor` | The number of Kafka message replicas that can be saved (optional, `1` by default). This value must be greater than or equal to the value of [`min.insync.replicas`](https://kafka.apache.org/33/documentation.html#brokerconfigs_min.insync.replicas) in Kafka. |
 | `required-acks` | A parameter used in the `Produce` request, which notifies the broker of the number of replica acknowledgements it needs to receive before responding. Value options are `0` (`NoResponse`: no response, only `TCP ACK` is provided), `1` (`WaitForLocal`: responds only after local commits are submitted successfully), and `-1` (`WaitForAll`: responds after all replicated replicas are committed successfully. You can configure the minimum number of replicated replicas using the [`min.insync.replicas`](https://kafka.apache.org/33/documentation.html#brokerconfigs_min.insync.replicas) configuration item of the broker). (Optional, the default value is `-1`). |
-| `compression` | The compression algorithm used when sending messages (value options are `none`, `lz4`, `gzip`, `snappy`, and `zstd`; `none` by default). |
+| `compression` | The compression algorithm used when sending messages (value options are `none`, `lz4`, `gzip`, `snappy`, and `zstd`; `none` by default). Note that the Snappy compressed file must be in the [official Snappy format](https://github.com/google/snappy). Other variants of Snappy compression are not supported. |
 | `protocol` | The protocol with which messages are output to Kafka. The value options are `canal-json`, `open-protocol`, `canal`, `avro`, and `maxwell`. |
 | `auto-create-topic` | Determines whether TiCDC creates the topic automatically when the `topic-name` passed in does not exist in the Kafka cluster (optional, `true` by default). |
 | `enable-tidb-extension` | Optional. `false` by default. When the output protocol is `canal-json`, if the value is `true`, TiCDC sends [WATERMARK events](/ticdc/ticdc-canal-json.md#watermark-event) and adds the [TiDB extension field](/ticdc/ticdc-canal-json.md#tidb-extension-field) to Kafka messages. From v6.1.0, this parameter is also applicable to the `avro` protocol. If the value is `true`, TiCDC adds [three TiDB extension fields](/ticdc/ticdc-avro-protocol.md#tidb-extension-fields) to the Kafka message. |
diff --git a/tidb-lightning/tidb-lightning-data-source.md b/tidb-lightning/tidb-lightning-data-source.md
index e4073ad33fcdc..df3db7ec2e33b 100644
--- a/tidb-lightning/tidb-lightning-data-source.md
+++ b/tidb-lightning/tidb-lightning-data-source.md
@@ -24,7 +24,7 @@ When TiDB Lightning is running, it looks for all files that match the pattern of
 | Schema file | Contains the `CREATE DATABASE` DDL statement | `${db_name}-schema-create.sql` |
 | Data file | If the data file contains data for a whole table, the file is imported into a table named `${db_name}.${table_name}` | \${db_name}.\${table_name}.\${csv\|sql\|parquet} |
 | Data file | If the data for a table is split into multiple data files, each data file must be suffixed with a number in its filename | \${db_name}.\${table_name}.001.\${csv\|sql\|parquet} |
-| Compressed file | If the file contains a compression suffix, such as `gzip`, `snappy`, or `zstd`, TiDB Lightning will decompress the file before importing it. | \${db_name}.\${table_name}.\${csv\|sql\|parquet}.{compress} |
+| Compressed file | If the file contains a compression suffix, such as `gzip`, `snappy`, or `zstd`, TiDB Lightning will decompress the file before importing it. Note that the Snappy compressed file must be in the [official Snappy format](https://github.com/google/snappy). Other variants of Snappy compression are not supported. | \${db_name}.\${table_name}.\${csv\|sql\|parquet}.{compress} |

 TiDB Lightning processes data in parallel as much as possible. Because files must be read in sequence, the data processing concurrency is at the file level (controlled by `region-concurrency`). Therefore, when the imported file is large, the import performance is poor. It is recommended to limit the size of the imported file to no greater than 256 MiB to achieve the best performance.

@@ -296,7 +296,8 @@ TiDB Lightning currently supports compressed files exported by Dumpling or compr
 > - Because TiDB Lightning cannot concurrently decompress a single large compressed file, the size of the compressed file affects the import speed. It is recommended that a source file is no greater than 256 MiB after decompression.
 > - TiDB Lightning only imports individually compressed data files and does not support importing a single compressed file with multiple data files included.
 > - TiDB Lightning does not support `parquet` files compressed through another compression tool, such as `db.table.parquet.snappy`. If you want to compress `parquet` files, you can configure the compression format for the `parquet` file writer.
-> - TiDB Lightning v6.4.0 and later versions only support `.bak` files and the following compressed data files: `gzip`, `snappy`, and `zstd`. Other types of files cause errors. For those unsupported files, you need to modify the file names in advance, or move those files out of the import data directory to avoid such errors.
+> - TiDB Lightning v6.4.0 and later versions only support the following compressed data files: `gzip`, `snappy`, and `zstd`. Any unsupported compressed file in the directory that stores the source data files causes the task to report an error. You can move such files out of the import data directory to avoid these errors.
+> - The Snappy compressed file must be in the [official Snappy format](https://github.com/google/snappy). Other variants of Snappy compression are not supported.

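The restriction this patch adds in each file is easiest to see in code. Below is a minimal sketch of producing a Lightning-compatible `.snappy` data file in Go, assuming the framed stream format implemented by `github.com/golang/snappy` (one of the formats defined in the official google/snappy repository). The file names are hypothetical, and whether a given component expects the framed variant or the raw block variant is an assumption here, not something the patch states.

```go
// Sketch: compress a CSV data file into a .snappy file using the framed
// Snappy stream format from github.com/golang/snappy. Assumption: this is
// one of the official google/snappy formats; Hadoop-style Snappy framing
// is a different, unsupported variant.
package main

import (
	"io"
	"log"
	"os"

	"github.com/golang/snappy"
)

func main() {
	src, err := os.Open("test.tbl.001.csv") // hypothetical source data file
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.Create("test.tbl.001.csv.snappy") // suffix that Lightning matches on
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	w := snappy.NewBufferedWriter(dst) // writes the official framing format
	if _, err := io.Copy(w, src); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil { // Close flushes the final frame
		log.Fatal(err)
	}
}
```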
 ## Match customized files

diff --git a/tidb-lightning/troubleshoot-tidb-lightning.md b/tidb-lightning/troubleshoot-tidb-lightning.md
index 2acdf0d7ee950..28d67340bab72 100644
--- a/tidb-lightning/troubleshoot-tidb-lightning.md
+++ b/tidb-lightning/troubleshoot-tidb-lightning.md
@@ -208,4 +208,8 @@ TiDB does not support all MySQL character sets. Therefore, TiDB Lightning report

 ### `invalid compression type ...`

-- TiDB Lightning v6.4.0 and later versions only support `.bak` files and the following compressed data files: `gzip`, `snappy`, and `zstd`. Other types of files cause errors. For those unsupported files, you need to modify the file names in advance, or move those files out of the import data directory to avoid such errors. For more details, see [Compressed files](/tidb-lightning/tidb-lightning-data-source.md#compressed-files).
+- TiDB Lightning v6.4.0 and later versions only support the following compressed data files: `gzip`, `snappy`, and `zstd`. Any unsupported compressed file in the directory that stores the source data files causes the task to report an error. You can move such files out of the import data directory to avoid these errors. For more details, see [Compressed files](/tidb-lightning/tidb-lightning-data-source.md#compressed-files).
+
+> **Note:**
+>
+> The Snappy compressed file must be in the [official Snappy format](https://github.com/google/snappy). Other variants of Snappy compression are not supported.
diff --git a/tikv-configuration-file.md b/tikv-configuration-file.md
index f5ecd59acc2f9..8a6244b73cb30 100644
--- a/tikv-configuration-file.md
+++ b/tikv-configuration-file.md
@@ -1603,6 +1603,10 @@ Configuration items related to `rocksdb.defaultcf.titan`.
 + Optional values: `"no"`, `"snappy"`, `"zlib"`, `"bzip2"`, `"lz4"`, `"lz4hc"`, `"zstd"`
 + Default value: `"lz4"`

+> **Note:**
+>
+> The Snappy compressed file must be in the [official Snappy format](https://github.com/google/snappy). Other variants of Snappy compression are not supported.
+
 ### `blob-cache-size`

 + The cache size of a Blob file
diff --git a/tune-tikv-memory-performance.md b/tune-tikv-memory-performance.md
index c36b40162e1f0..4bd4e9d1ef00e 100644
--- a/tune-tikv-memory-performance.md
+++ b/tune-tikv-memory-performance.md
@@ -149,7 +149,7 @@ max-manifest-file-size = "20MB"
 block-size = "64KB"

 # The compression algorithm of each level of RocksDB data. The optional values include no, snappy, zlib,
-# bzip2, lz4, lz4hc, and zstd.
+# bzip2, lz4, lz4hc, and zstd. Note that the Snappy compressed file must be in the official Snappy format (https://github.com/google/snappy). Other variants of Snappy compression are not supported.
 # "no:no:lz4:lz4:lz4:zstd:zstd" indicates there is no compression for level0 and level1; the lz4 compression algorithm is used
 # from level2 to level4; and the zstd compression algorithm is used from level5 to level6.
 # "no" means no compression. "lz4" is a compression algorithm with moderate speed and compression ratio. The
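As a companion to the `invalid compression type ...` troubleshooting entry above, the following sketch checks whether a `.snappy` file decodes as official Snappy at all, trying the raw block format first and then the framed stream format. Both formats are defined in the google/snappy repository; which of the two a particular component accepts is an assumption left open here, so a passing check is a necessary rather than a sufficient condition.

```go
// Sketch: sanity-check that a .snappy file is official Snappy (raw block
// or framed stream format) before handing it to an import tool. Files in
// variant formats such as Hadoop-snappy should fail both checks.
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"

	"github.com/golang/snappy"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: snappycheck <file.snappy>")
		os.Exit(2)
	}
	raw, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Raw block format: the whole file is a single Snappy block.
	if _, err := snappy.Decode(nil, raw); err == nil {
		fmt.Println("official Snappy: raw block format")
		return
	}

	// Framed stream format, as produced by snappy.NewBufferedWriter.
	if _, err := io.Copy(io.Discard, snappy.NewReader(bytes.NewReader(raw))); err == nil {
		fmt.Println("official Snappy: framed stream format")
		return
	}

	fmt.Println("not official Snappy; possibly an unsupported variant such as Hadoop-snappy")
	os.Exit(1)
}
```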