title | summary |
---|---|
IMPORT INTO |
An overview of the usage of IMPORT INTO in TiDB. |
The IMPORT INTO
statement is used to import data in formats such as CSV
, SQL
, and PARQUET
into an empty table in TiDB via the Physical Import Mode of TiDB Lightning.
Note:
This feature is not available on TiDB Serverless clusters.
For TiDB Self-Hosted, IMPORT INTO
supports importing data from files stored in Amazon S3, GCS, and the TiDB local storage. For TiDB Dedicated, IMPORT INTO
supports importing data from files stored in Amazon S3 and GCS.
-
For data files stored in Amazon S3 or GCS,
IMPORT INTO
supports running in the TiDB Distributed eXecution Framework (DXF).- When this DXF is enabled (tidb_enable_dist_task is
ON
),IMPORT INTO
splits a data import job into multiple sub-jobs and distributes these sub-jobs to different TiDB nodes for execution to improve the import efficiency. - When this DXF is disabled,
IMPORT INTO
only supports running on the TiDB node where the current user is connected.
- When this DXF is enabled (tidb_enable_dist_task is
-
For data files stored locally in TiDB,
IMPORT INTO
only supports running on the TiDB node where the current user is connected. Therefore, the data files need to be placed on the TiDB node where the current user is connected. If you access TiDB through a proxy or load balancer, you cannot import data files stored locally in TiDB.
- For TiDB Self-Hosted,
IMPORT INTO
supports importing data within 10 TiB. For TiDB Dedicated,IMPORT INTO
supports importing data within 50 GiB. IMPORT INTO
only supports importing data into existing empty tables in the database.IMPORT INTO
does not support transactions or rollback. ExecutingIMPORT INTO
within an explicit transaction (BEGIN
/END
) will return an error.- The execution of
IMPORT INTO
blocks the current connection until the import is completed. To execute the statement asynchronously, you can add theDETACHED
option. IMPORT INTO
does not support working simultaneously with features such as Backup & Restore,FLASHBACK CLUSTER
, acceleration of adding indexes, data import using TiDB Lightning, data replication using TiCDC, or Point-in-Time Recovery (PITR).- Only one
IMPORT INTO
job can run on a cluster at a time. AlthoughIMPORT INTO
performs a precheck for running jobs, it is not a hard limit. Starting multiple import jobs might work when multiple clients executeIMPORT INTO
simultaneously, but you need to avoid that because it might result in data inconsistency or import failures. - During the data import process, do not perform DDL or DML operations on the target table, and do not execute
FLASHBACK DATABASE
for the target database. These operations can lead to import failures or data inconsistencies. In addition, it is NOT recommended to perform read operations during the import process, as the data being read might be inconsistent. Perform read and write operations only after the import is completed. - The import process consumes system resources significantly. For TiDB Self-Hosted, to get better performance, it is recommended to use TiDB nodes with at least 32 cores and 64 GiB of memory. TiDB writes sorted data to the TiDB temporary directory during import, so it is recommended to configure high-performance storage media for TiDB Self-Hosted, such as flash memory. For more information, see Physical Import Mode limitations.
- For TiDB Self-Hosted, the TiDB temporary directory is expected to have at least 90 GiB of available space. It is recommended to allocate storage space that is equal to or greater than the volume of data to be imported.
- One import job supports importing data into one target table only. To import data into multiple target tables, after the import for a target table is completed, you need to create a new job for the next target table.
IMPORT INTO
is not supported during TiDB cluster upgrades.- When the Global Sort feature is used for data import, the data size of a single row after encoding must not exceed 32 MiB.
- When the Global Sort feature is used for data import, if the target TiDB cluster is deleted before the import task is completed, temporary data used for global sorting might remain on Amazon S3. In this case, you need to delete the residual data manually to avoid increasing S3 storage costs.
- Ensure that the data to be imported does not contain any records with primary key or non-null unique index conflicts. Otherwise, the conflicts can result in import task failures.
- If an
IMPORT INTO
task scheduled by the Distributed eXecution Framework (DXF) is already running, it cannot be scheduled to a new TiDB node. If the TiDB node that executes the data import task is restarted, it will no longer execute the data import task, but transfers the task to another TiDB node to continue executing. However, if the imported data is from a local file, the task will not be transferred to another TiDB node to continue executing. - Known issue: the
IMPORT INTO
task might fail if the PD address in the TiDB node configuration file is inconsistent with the current PD topology of the cluster. This inconsistency can arise in situations such as that PD was scaled in previously, but the TiDB configuration file was not updated accordingly or the TiDB node was not restarted after the configuration file update.
Before using IMPORT INTO
to import data, make sure the following requirements are met:
- The target table to be imported is already created in TiDB and it is empty.
- The target cluster has sufficient space to store the data to be imported.
- For TiDB Self-Hosted, the temporary directory of the TiDB node connected to the current session has at least 90 GiB of available space. If
tidb_enable_dist_task
is enabled, also make sure that the temporary directory of each TiDB node in the cluster has sufficient disk space.
Executing IMPORT INTO
requires the SELECT
, UPDATE
, INSERT
, DELETE
, and ALTER
privileges on the target table. To import files in TiDB local storage, the FILE
privilege is also required.
ImportIntoStmt ::=
'IMPORT' 'INTO' TableName ColumnNameOrUserVarList? SetClause? FROM fileLocation Format? WithOptions?
ColumnNameOrUserVarList ::=
'(' ColumnNameOrUserVar (',' ColumnNameOrUserVar)* ')'
SetClause ::=
'SET' SetItem (',' SetItem)*
SetItem ::=
ColumnName '=' Expr
Format ::=
'CSV' | 'SQL' | 'PARQUET'
WithOptions ::=
'WITH' OptionItem (',' OptionItem)*
OptionItem ::=
optionName '=' optionVal | optionName
It specifies how each field in the data file corresponds to the columns in the target table. You can also use it to map fields to variables to skip certain fields for the import, or use it in SetClause
.
- If this parameter is not specified, the number of fields in each row of the data file must match the number of columns in the target table, and the fields will be imported to the corresponding columns in order.
- If this parameter is specified, the number of specified columns or variables must match the number of fields in each row of the data file.
It specifies how the values of target columns are calculated. In the right side of the SET
expression, you can reference the variables specified in ColumnNameOrUserVarList
.
In the left side of the SET
expression, you can only reference a column name that is not included in ColumnNameOrUserVarList
. If the target column name already exists in ColumnNameOrUserVarList
, the SET
expression is invalid.
It specifies the storage location of the data file, which can be an Amazon S3 or GCS URI path, or a TiDB local file path.
-
Amazon S3 or GCS URI path: for URI configuration details, see URI Formats of External Storage Services.
-
TiDB local file path: it must be an absolute path, and the file extension must be
.csv
,.sql
, or.parquet
. Make sure that the files corresponding to this path are stored on the TiDB node connected by the current user, and the user has theFILE
privilege.
Note:
If SEM is enabled in the target cluster, the
fileLocation
cannot be specified as a local file path.
In the fileLocation
parameter, you can specify a single file or use the *
wildcard to match multiple files for import. Note that the wildcard can only be used in the file name, because it does not match directories or recursively match files in subdirectories. Taking files stored on Amazon S3 as examples, you can configure the parameter as follows:
- Import a single file:
s3://<bucket-name>/path/to/data/foo.csv
- Import all files in a specified path:
s3://<bucket-name>/path/to/data/*
- Import all files with the
.csv
suffix in a specified path:s3://<bucket-name>/path/to/data/*.csv
- Import all files with the
foo
prefix in a specified path:s3://<bucket-name>/path/to/data/foo*
- Import all files with the
foo
prefix and the.csv
suffix in a specified path:s3://<bucket-name>/path/to/data/foo*.csv
The IMPORT INTO
statement supports three data file formats: CSV
, SQL
, and PARQUET
. If not specified, the default format is CSV
.
You can use WithOptions
to specify import options and control the data import process. For example, to execute the import asynchronously in the backend, you can enable the DETACHED
mode for the import by adding the WITH DETACHED
option to the IMPORT INTO
statement.
The supported options are described as follows:
Option name | Supported data formats | Description |
---|---|---|
CHARACTER_SET='<string>' |
CSV | Specifies the character set of the data file. The default character set is utf8mb4 . The supported character sets include binary , utf8 , utf8mb4 , gb18030 , gbk , latin1 , and ascii . |
FIELDS_TERMINATED_BY='<string>' |
CSV | Specifies the field separator. The default separator is , . |
FIELDS_ENCLOSED_BY='<char>' |
CSV | Specifies the field delimiter. The default delimiter is " . |
FIELDS_ESCAPED_BY='<char>' |
CSV | Specifies the escape character for fields. The default escape character is \ . |
FIELDS_DEFINED_NULL_BY='<string>' |
CSV | Specifies the value that represents NULL in the fields. The default value is \N . |
LINES_TERMINATED_BY='<string>' |
CSV | Specifies the line terminator. By default, IMPORT INTO automatically identifies \n , \r , or \r\n as line terminators. If the line terminator is one of these three, you do not need to explicitly specify this option. |
SKIP_ROWS=<number> |
CSV | Specifies the number of rows to skip. The default value is 0 . You can use this option to skip the header in a CSV file. If you use a wildcard to specify the source files for import, this option applies to all source files that are matched by the wildcard in fileLocation . |
SPLIT_FILE |
CSV | Splits a single CSV file into multiple smaller chunks of around 256 MiB for parallel processing to improve import efficiency. This parameter only works for non-compressed CSV files and has the same usage restrictions as that of TiDB Lightning strict-format . |
DISK_QUOTA='<string>' |
All formats | Specifies the disk space threshold that can be used during data sorting. The default value is 80% of the disk space in the TiDB temporary directory. If the total disk size cannot be obtained, the default value is 50 GiB. When specifying DISK_QUOTA explicitly, make sure that the value does not exceed 80% of the disk space in the TiDB temporary directory. |
DISABLE_TIKV_IMPORT_MODE |
All formats | Specifies whether to disable switching TiKV to import mode during the import process. By default, switching TiKV to import mode is not disabled. If there are ongoing read-write operations in the cluster, you can enable this option to avoid impact from the import process. |
THREAD=<number> |
All formats | Specifies the concurrency for import. The default value is 50% of the CPU cores, with a minimum value of 1. You can explicitly specify this option to control the resource usage, but make sure that the value does not exceed the number of CPU cores. To import data into a new cluster without any data, it is recommended to increase this concurrency appropriately to improve import performance. If the target cluster is already used in a production environment, it is recommended to adjust this concurrency according to your application requirements. |
MAX_WRITE_SPEED='<string>' |
All formats | Controls the write speed to a TiKV node. By default, there is no speed limit. For example, you can specify this option as 1MiB to limit the write speed to 1 MiB/s. |
CHECKSUM_TABLE='<string>' |
All formats | Configures whether to perform a checksum check on the target table after the import to validate the import integrity. The supported values include "required" (default), "optional" , and "off" . "required" means performing a checksum check after the import. If the checksum check fails, TiDB will return an error and the import will exit. "optional" means performing a checksum check after the import. If an error occurs, TiDB will return a warning and ignore the error. "off" means not performing a checksum check after the import. |
DETACHED |
All Formats | Controls whether to execute IMPORT INTO asynchronously. When this option is enabled, executing IMPORT INTO immediately returns the information of the import job (such as the Job_ID ), and the job is executed asynchronously in the backend. |
CLOUD_STORAGE_URI |
All formats | Specifies the target address where encoded KV data for Global Sort is stored. When CLOUD_STORAGE_URI is not specified, IMPORT INTO determines whether to use Global Sort based on the value of the system variable tidb_cloud_storage_uri . If this system variable specifies a target storage address, IMPORT INTO uses this address for Global Sort. When CLOUD_STORAGE_URI is specified with a non-empty value, IMPORT INTO uses that value as the target storage address. When CLOUD_STORAGE_URI is specified with an empty value, local sorting is enforced. Currently, the target storage address only supports S3. For details about the URI configuration, see Amazon S3 URI format. When this feature is used, all TiDB nodes must have read and write access for the target S3 bucket. |
IMPORT INTO
supports importing compressed CSV
and SQL
files. It can automatically determine whether a file is compressed and the compression format based on the file extension:
Extension | Compression format |
---|---|
.gz , .gzip |
gzip compression format |
.zstd , .zst |
ZStd compression format |
.snappy |
snappy compression format |
Note:
- The Snappy compressed file must be in the official Snappy format. Other variants of Snappy compression are not supported.
- Because TiDB Lightning cannot concurrently decompress a single large compressed file, the size of the compressed file affects the import speed. It is recommended that a source file is no greater than 256 MiB after decompression.
Warning:
The Global Sort feature is experimental. It is not recommended to use it in production environments.
IMPORT INTO
splits the data import job of a source data file into multiple sub-jobs, each sub-job independently encoding and sorting data before importing. If the encoded KV ranges of these sub-jobs have significant overlap (to learn how TiDB encodes data to KV, see TiDB computing), TiKV needs to keep compaction during import, leading to a decrease in import performance and stability.
In the following scenarios, there can be significant overlap in KV ranges:
- If rows in the data file assigned to each sub-job have overlapping primary key ranges, the data KV generated by the encoding of each sub-job will also overlap.
IMPORT INTO
splits sub-jobs based on the traversal order of data files, usually sorted by file name in lexicographic order.
- If the target table has many indexes, or the index column values are scattered in the data file, the index KV generated by the encoding of each sub-job will also overlap.
When the TiDB Distributed eXecution Framework (DXF) is enabled, you can enable Global Sort by specifying the CLOUD_STORAGE_URI
option in the IMPORT INTO
statement or by specifying the target storage address for encoded KV data using the system variable tidb_cloud_storage_uri
. Note that currently, only S3 is supported as the Global Sort storage address. When Global Sort is enabled, IMPORT INTO
writes encoded KV data to the cloud storage, performs Global Sort in the cloud storage, and then parallelly imports the globally sorted index and table data into TiKV. This prevents problems caused by KV overlap and enhances import stability.
Global Sort consumes a significant amount of memory resources. Before the data import, it is recommended to configure the tidb_server_memory_limit_gc_trigger
and tidb_server_memory_limit
variables, which avoids golang GC being frequently triggered and thus affecting the import efficiency.
SET GLOBAL tidb_server_memory_limit_gc_trigger=0.99;
SET GLOBAL tidb_server_memory_limit='88%';
Note:
- If the KV range overlap in a source data file is low, enabling Global Sort might decrease import performance. This is because when Global Sort is enabled, TiDB needs to wait for the completion of local sorting in all sub-jobs before proceeding with the Global Sort operations and subsequent import.
- After an import job using Global Sort completes, the files stored in the cloud storage for Global Sort are cleaned up asynchronously in a background thread.
When IMPORT INTO
completes the import or when the DETACHED
mode is enabled, IMPORT INTO
will return the current job information in the output, as shown in the following examples. For the description of each field, see SHOW IMPORT JOB(s)
.
When IMPORT INTO
completes the import, the example output is as follows:
IMPORT INTO t FROM '/path/to/small.csv';
+--------+--------------------+--------------+----------+-------+----------+------------------+---------------+----------------+----------------------------+----------------------------+----------------------------+------------+
| Job_ID | Data_Source | Target_Table | Table_ID | Phase | Status | Source_File_Size | Imported_Rows | Result_Message | Create_Time | Start_Time | End_Time | Created_By |
+--------+--------------------+--------------+----------+-------+----------+------------------+---------------+----------------+----------------------------+----------------------------+----------------------------+------------+
| 60002 | /path/to/small.csv | `test`.`t` | 363 | | finished | 16B | 2 | | 2023-06-08 16:01:22.095698 | 2023-06-08 16:01:22.394418 | 2023-06-08 16:01:26.531821 | root@% |
+--------+--------------------+--------------+----------+-------+----------+------------------+---------------+----------------+----------------------------+----------------------------+----------------------------+------------+
When the DETACHED
mode is enabled, executing the IMPORT INTO
statement will immediately return the job information in the output. From the output, you can see that the status of the job is pending
, which means waiting for execution.
IMPORT INTO t FROM '/path/to/small.csv' WITH DETACHED;
+--------+--------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+
| Job_ID | Data_Source | Target_Table | Table_ID | Phase | Status | Source_File_Size | Imported_Rows | Result_Message | Create_Time | Start_Time | End_Time | Created_By |
+--------+--------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+
| 60001 | /path/to/small.csv | `test`.`t` | 361 | | pending | 16B | NULL | | 2023-06-08 15:59:37.047703 | NULL | NULL | root@% |
+--------+--------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+
For an import job with the DETACHED
mode enabled, you can use SHOW IMPORT
to view its current job progress.
After an import job is started, you can cancel it using CANCEL IMPORT JOB <job-id>
.
IMPORT INTO t FROM '/path/to/file.csv' WITH skip_rows=1;
IMPORT INTO t FROM '/path/to/file.csv' WITH DETACHED;
Assume that your data file is in the CSV format and its content is as follows:
id,name,age
1,Tom,23
2,Jack,44
And assume that the target table schema for the import is CREATE TABLE t(id int primary key, name varchar(100))
. To skip importing the age
field in the data file to the table t
, you can execute the following SQL statement:
IMPORT INTO t(id, name, @1) FROM '/path/to/file.csv' WITH skip_rows=1;
Assume that there are three files named file-01.csv
, file-02.csv
, and file-03.csv
in the /path/to/
directory. To import these three files into a target table t
using IMPORT INTO
, you can execute the following SQL statement:
IMPORT INTO t FROM '/path/to/file-*.csv'
-
Import data files from Amazon S3:
IMPORT INTO t FROM 's3://bucket-name/test.csv?access-key=XXX&secret-access-key=XXX';
-
Import data files from GCS:
IMPORT INTO t FROM 'gs://import/test.csv?credentials-file=${credentials-file-path}';
For details about the URI path configuration for Amazon S3 or GCS, see URI Formats of External Storage Services.
Assume that your data file is in the CSV format and its content is as follows:
id,name,val
1,phone,230
2,book,440
And assume that the target table schema for the import is CREATE TABLE t(id int primary key, name varchar(100), val int)
. If you want to multiply the val
column values by 100 during the import, you can execute the following SQL statement:
IMPORT INTO t(id, name, @1) SET val=@1*100 FROM '/path/to/file.csv' WITH skip_rows=1;
IMPORT INTO t FROM '/path/to/file.sql' FORMAT 'sql';
To limit the write speed to a TiKV node to 10 MiB/s, execute the following SQL statement:
IMPORT INTO t FROM 's3://bucket/path/to/file.parquet?access-key=XXX&secret-access-key=XXX' FORMAT 'parquet' WITH MAX_WRITE_SPEED='10MiB';
This statement is a TiDB extension to MySQL syntax.