diff --git a/tidb-lightning/tidb-lightning-data-source.md b/tidb-lightning/tidb-lightning-data-source.md
index 47b58ac7a8a40..8e82fcb3a7af5 100644
--- a/tidb-lightning/tidb-lightning-data-source.md
+++ b/tidb-lightning/tidb-lightning-data-source.md
@@ -27,6 +27,76 @@ When TiDB Lightning is running, it looks for all files that match the pattern of
 
 TiDB Lightning processes data in parallel as much as possible. Because files must be read in sequence, the data processing concurrency is at the file level (controlled by `region-concurrency`). Therefore, when the imported file is large, the import performance is poor. It is recommended to limit the size of the imported file to no greater than 256 MiB to achieve the best performance.
 
+## Rename databases and tables
+
+TiDB Lightning follows filename patterns to import data to the corresponding database and table. If the database or table names change, you can either rename the files and then import them, or use regular expressions to replace the names online.
+
+### Rename files in batch
+
+If you are using Red Hat Linux or a distribution based on Red Hat Linux, you can use the `rename` command to batch rename files in the `data-source-dir` directory.
+
+For example:
+
+```shell
+rename srcdb. tgtdb. *.sql
+```
+
+After you modify the database name, it is recommended that you delete the `${db_name}-schema-create.sql` file that contains the `CREATE DATABASE` DDL statement from the `data-source-dir` directory. If you want to modify the table name as well, you also need to modify the table name in the `${db_name}.${table_name}-schema.sql` file that contains the `CREATE TABLE` DDL statement.
+
+### Use regular expressions to replace names online
+
+To use regular expressions to replace names online, you can use the `pattern` configuration within `[[mydumper.files]]` to match filenames, and replace `schema` and `table` with your desired names. For more information, see [Match customized files](#match-customized-files).
+
+The following is an example of using regular expressions to replace names online. In this example:
+
+- The match rule for the data file `pattern` is `^({schema_regex})\.({table_regex})\.(?:{file_serial_regex})\.(csv|parquet|sql)`.
+- Specify `schema` as `'$1'`, which means that the value matched by the first capture group `schema_regex` remains unchanged. Or specify `schema` as a string, such as `'tgtdb'`, which means a fixed target database name.
+- Specify `table` as `'$2'`, which means that the value matched by the second capture group `table_regex` remains unchanged. Or specify `table` as a string, such as `'t1'`, which means a fixed target table name.
+- Specify `type` as `'$3'`, which means the data file type. For schema files, specify `type` as either `"table-schema"` (representing the `schema.sql` file) or `"schema-schema"` (representing the `schema-create.sql` file).
+
+```toml
+[mydumper]
+data-source-dir = "/some-subdir/some-database/"
+[[mydumper.files]]
+pattern = '^(srcdb)\.(.*?)-schema-create\.sql'
+schema = 'tgtdb'
+type = "schema-schema"
+[[mydumper.files]]
+pattern = '^(srcdb)\.(.*?)-schema\.sql'
+schema = 'tgtdb'
+table = '$2'
+type = "table-schema"
+[[mydumper.files]]
+pattern = '^(srcdb)\.(.*?)\.(?:[0-9]+)\.(csv|parquet|sql)'
+schema = 'tgtdb'
+table = '$2'
+type = '$3'
+```
+
+If you are using `gzip` to back up data files, you need to configure the compression format accordingly. The matching rule of the data file `pattern` is `'^({schema_regex})\.({table_regex})\.(?:{file_serial_regex})\.(csv|parquet|sql)\.(gz)'`. You can specify `compression` as `'$4'` to represent the compressed file format. For example:
+
+```toml
+[mydumper]
+data-source-dir = "/some-subdir/some-database/"
+[[mydumper.files]]
+pattern = '^(srcdb)\.(.*?)-schema-create\.(sql)\.(gz)'
+schema = 'tgtdb'
+type = "schema-schema"
+compression = '$4'
+[[mydumper.files]]
+pattern = '^(srcdb)\.(.*?)-schema\.(sql)\.(gz)'
+schema = 'tgtdb'
+table = '$2'
+type = "table-schema"
+compression = '$4'
+[[mydumper.files]]
+pattern = '^(srcdb)\.(.*?)\.(?:[0-9]+)\.(sql)\.(gz)'
+schema = 'tgtdb'
+table = '$2'
+type = '$3'
+compression = '$4'
+```
+
 ## CSV
 
 ### Schema
@@ -276,7 +346,7 @@ TiDB Lightning currently only supports Parquet files generated by Amazon Aurora
 ```
 [[mydumper.files]]
 # The expression needed for parsing Amazon Aurora parquet files
-pattern = '(?i)^(?:[^/]*/)*([a-z0-9_]+)\.([a-z0-9_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$'
+pattern = '(?i)^(?:[^/]*/)*([a-z0-9\-_]+).([a-z0-9\-_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$'
 schema = '$1'
 table = '$2'
 type = '$3'
@@ -308,14 +378,14 @@ Take the Aurora snapshot exported to S3 as an example. The complete path of the
 
 Usually, `data-source-dir` is set to `S3://some-bucket/some-subdir/some-database/` to import the `some-database` database.
 
-Based on the preceding Parquet file path, you can write a regular expression like `(?i)^(?:[^/]*/)*([a-z0-9_]+)\.([a-z0-9_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$` to match the files. In the match group, `index=1` is `some-database`, `index=2` is `some-table`, and `index=3` is `parquet`.
+Based on the preceding Parquet file path, you can write a regular expression like `(?i)^(?:[^/]*/)*([a-z0-9\-_]+).([a-z0-9\-_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$` to match the files. In the match group, `index=1` is `some-database`, `index=2` is `some-table`, and `index=3` is `parquet`.
 
 You can write the configuration file according to the regular expression and the corresponding index so that TiDB Lightning can recognize the data files that do not follow the default naming convention. For example:
 
 ```toml
 [[mydumper.files]]
 # The expression needed for parsing the Amazon Aurora parquet file
-pattern = '(?i)^(?:[^/]*/)*([a-z0-9_]+)\.([a-z0-9_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$'
+pattern = '(?i)^(?:[^/]*/)*([a-z0-9\-_]+).([a-z0-9\-_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$'
 schema = '$1'
 table = '$2'
 type = '$3'
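
Review note: before importing, it can help to dry-run the rename rules against a few filenames. The following is a minimal sketch of that check, not TiDB Lightning's implementation. It uses Python's `re` module as a stand-in for the regular expression engine that TiDB Lightning uses, copies two rules from the `gzip` example in this diff, and assumes that rules are tried in order and the first match wins. The names `RULES`, `expand`, and `route` are hypothetical and exist only for this illustration.

```python
import re

# Two rules copied from the gzip example in the diff (table schema files and
# data files). Trying them in order, first match wins, is an assumption.
RULES = [
    {
        "pattern": r"^(srcdb)\.(.*?)-schema\.(sql)\.(gz)",
        "schema": "tgtdb",
        "table": "$2",
        "type": "table-schema",
        "compression": "$4",
    },
    {
        "pattern": r"^(srcdb)\.(.*?)\.(?:[0-9]+)\.(sql)\.(gz)",
        "schema": "tgtdb",
        "table": "$2",
        "type": "$3",
        "compression": "$4",
    },
]


def expand(template, match):
    """Replace each $N in template with capture group N of match."""
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), template)


def route(filename, rules=RULES):
    """Return the resolved fields for filename, or None if no rule matches."""
    for rule in rules:
        match = re.search(rule["pattern"], filename)
        if match is None:
            continue
        # Every field except the pattern itself may contain $N references.
        return {
            key: expand(value, match)
            for key, value in rule.items()
            if key != "pattern"
        }
    return None


if __name__ == "__main__":
    for name in (
        "srcdb.t1-schema.sql.gz",
        "srcdb.t1.000000001.sql.gz",
        "other.t1.000000001.sql.gz",
    ):
        print(name, "->", route(name))
```

Because `(?:[0-9]+)` is a non-capturing group, the file type is group 3 and the compression suffix is group 4, which is why the rules use `'$3'` and `'$4'`. A filename that no rule matches resolves to `None` in this sketch.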