[Improve][Connector-V2] Support read archive compress file (#7633)
corgy-w committed Sep 20, 2024
1 parent bc0326c commit 3f98cd8
Showing 50 changed files with 2,579 additions and 44 deletions.
12 changes: 12 additions & 0 deletions docs/en/connector-v2/source/CosFile.md
@@ -66,6 +66,7 @@ To use this connector you need put hadoop-cos-{hadoop.version}-{version}.jar and
| xml_use_attr_format | boolean | no | - |
| file_filter_pattern | string | no | - |
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| common-options | | no | - |

@@ -284,6 +285,17 @@ The compress codec of files and the details that supported as the following show
- orc/parquet:
automatically recognizes the compression type, no additional settings required.

### archive_compress_codec [string]

The compression codec of archive files. The supported codecs are as follows:

| archive_compress_codec | file_format | archive_compress_suffix |
|------------------------|--------------------|-------------------------|
| ZIP | txt,json,excel,xml | .zip |
| TAR | txt,json,excel,xml | .tar |
| TAR_GZ | txt,json,excel,xml | .tar.gz |
| NONE | all | .* |

### encoding [string]

Only used when file_format_type is json,text,csv,xml.
12 changes: 12 additions & 0 deletions docs/en/connector-v2/source/FtpFile.md
@@ -60,6 +60,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
| xml_use_attr_format | boolean | no | - |
| file_filter_pattern | string | no | - |
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| common-options | | no | - |

@@ -265,6 +266,17 @@ The compress codec of files and the details that supported as the following show
- orc/parquet:
automatically recognizes the compression type, no additional settings required.

### archive_compress_codec [string]

The compression codec of archive files. The supported codecs are as follows:

| archive_compress_codec | file_format | archive_compress_suffix |
|------------------------|--------------------|-------------------------|
| ZIP | txt,json,excel,xml | .zip |
| TAR | txt,json,excel,xml | .tar |
| TAR_GZ | txt,json,excel,xml | .tar.gz |
| NONE | all | .* |

### encoding [string]

Only used when file_format_type is json,text,csv,xml.
14 changes: 13 additions & 1 deletion docs/en/connector-v2/source/HdfsFile.md
@@ -63,7 +63,8 @@ Read data from hdfs file system.
| xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file, only used when file_format is xml. |
| xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
| compress_codec | string | no | none | The compress codec of files |
| encoding | string | no | UTF-8 |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 | |
| common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

### delimiter/field_delimiter [string]
@@ -80,6 +81,17 @@ The compress codec of files and the details that supported as the following show
- orc/parquet:
automatically recognizes the compression type, no additional settings required.

### archive_compress_codec [string]

The compression codec of archive files. The supported codecs are as follows:

| archive_compress_codec | file_format | archive_compress_suffix |
|------------------------|--------------------|-------------------------|
| ZIP | txt,json,excel,xml | .zip |
| TAR | txt,json,excel,xml | .tar |
| TAR_GZ | txt,json,excel,xml | .tar.gz |
| NONE | all | .* |

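As a rough sketch of how this option might be combined with the existing HDFS settings, the block below reads JSON files packed in `.tar` archives. The namenode address, path, and schema fields are hypothetical placeholders, and the codec value is copied from the table above; treat it as an illustration under those assumptions rather than a verified job.

```hocon
source {
  HdfsFile {
    # Hypothetical namenode address and directory (assumptions for illustration)
    fs.defaultFS = "hdfs://namenode001"
    path = "/data/archives"
    # Format of the files contained in each archive
    file_format_type = "json"
    # Treat each matching file as a TAR archive and read the files inside it
    archive_compress_codec = "TAR"
    # Hypothetical schema for the JSON records
    schema {
      fields {
        name = string
        age = int
      }
    }
  }
}
```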
### encoding [string]

Only used when file_format_type is json,text,csv,xml.
12 changes: 12 additions & 0 deletions docs/en/connector-v2/source/LocalFile.md
@@ -60,6 +60,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
| xml_use_attr_format | boolean | no | - |
| file_filter_pattern | string | no | - |
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| common-options | | no | - |
| tables_configs | list | no | used to define a multiple table task |
@@ -263,6 +264,17 @@ The compress codec of files and the details that supported as the following show
- orc/parquet:
automatically recognizes the compression type, no additional settings required.

### archive_compress_codec [string]

The compression codec of archive files. The supported codecs are as follows:

| archive_compress_codec | file_format | archive_compress_suffix |
|------------------------|--------------------|-------------------------|
| ZIP | txt,json,excel,xml | .zip |
| TAR | txt,json,excel,xml | .tar |
| TAR_GZ | txt,json,excel,xml | .tar.gz |
| NONE | all | .* |

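A minimal sketch of how this option might be used: the source block below reads text files packed inside `.zip` archives from a local directory. The path is a hypothetical placeholder and the codec value is copied from the table above, so this is an illustration rather than a tested configuration.

```hocon
source {
  LocalFile {
    # Hypothetical directory that holds .zip archives (assumption for illustration)
    path = "/data/archives"
    # Format of the files contained in each archive
    file_format_type = "text"
    # Treat each matching file as a ZIP archive and read the files inside it
    archive_compress_codec = "ZIP"
  }
}
```

Leaving `archive_compress_codec` at its default of `none` should keep the previous behavior, where files are read directly without being unpacked.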
### encoding [string]

Only used when file_format_type is json,text,csv,xml.
12 changes: 12 additions & 0 deletions docs/en/connector-v2/source/OssJindoFile.md
@@ -70,6 +70,7 @@ It only supports hadoop version **2.9.X+**.
| xml_use_attr_format | boolean | no | - |
| file_filter_pattern | string | no | - |
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| common-options | | no | - |

@@ -276,6 +277,17 @@ The compress codec of files and the details that supported as the following show
- orc/parquet:
automatically recognizes the compression type, no additional settings required.

### archive_compress_codec [string]

The compression codec of archive files. The supported codecs are as follows:

| archive_compress_codec | file_format | archive_compress_suffix |
|------------------------|--------------------|-------------------------|
| ZIP | txt,json,excel,xml | .zip |
| TAR | txt,json,excel,xml | .tar |
| TAR_GZ | txt,json,excel,xml | .tar.gz |
| NONE | all | .* |

### encoding [string]

Only used when file_format_type is json,text,csv,xml.
40 changes: 26 additions & 14 deletions docs/en/connector-v2/source/S3File.md
@@ -20,14 +20,14 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
- [x] [parallelism](../../concept/connector-v2-features.md)
- [ ] [support user-defined split](../../concept/connector-v2-features.md)
- [x] file format type
- [x] text
- [x] csv
- [x] parquet
- [x] orc
- [x] json
- [x] excel
- [x] xml
- [x] binary
- [x] text
- [x] csv
- [x] parquet
- [x] orc
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Description

@@ -196,7 +196,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto

## Options

| name | type | required | default value | Description |
| name | type | required | default value | Description |
|---------------------------------|---------|----------|-------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| path | string | yes | - | The s3 path that needs to be read can have sub paths, but the sub paths need to meet certain format requirements. Specific requirements can be referred to "parse_partition_from_path" option |
| file_format_type | string | yes | - | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` |
@@ -217,8 +217,9 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
| sheet_name | string | no | - | Read the sheet of the workbook. Only used when file_format is excel. |
| xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file, only valid for XML files. |
| xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only valid for XML files. |
| compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| compress_codec | string | no | none | |
| archive_compress_codec | string | no | none | |
| encoding | string | no | UTF-8 | |
| common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

### delimiter/field_delimiter [string]
@@ -235,6 +236,17 @@ The compress codec of files and the details that supported as the following show
- orc/parquet:
automatically recognizes the compression type, no additional settings required.

### archive_compress_codec [string]

The compression codec of archive files. The supported codecs are as follows:

| archive_compress_codec | file_format | archive_compress_suffix |
|------------------------|--------------------|-------------------------|
| ZIP | txt,json,excel,xml | .zip |
| TAR | txt,json,excel,xml | .tar |
| TAR_GZ | txt,json,excel,xml | .tar.gz |
| NONE | all | .* |

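For S3, a comparable sketch reads text files from `.tar.gz` archives. The bucket, endpoint, path, and credentials below are placeholders invented for illustration, and the codec value is copied from the table above; adjust all of them to the actual environment.

```hocon
source {
  S3File {
    # Placeholder bucket, endpoint, and credentials (assumptions for illustration)
    bucket = "s3a://seatunnel-test"
    fs.s3a.endpoint = "s3.cn-north-1.amazonaws.com.cn"
    fs.s3a.aws.credentials.provider = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
    access_key = "xxxxxxxxxxxxxxxxx"
    secret_key = "xxxxxxxxxxxxxxxxx"
    # Hypothetical prefix that holds the .tar.gz archives
    path = "/seatunnel/archives"
    # Format of the files contained in each archive
    file_format_type = "text"
    # Treat each matching file as a gzip-compressed TAR archive
    archive_compress_codec = "TAR_GZ"
  }
}
```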
### encoding [string]

Only used when file_format_type is json,text,csv,xml.
@@ -346,8 +358,8 @@ sink {
### Next version

- [Feature] Support S3A protocol ([3632](https://github.com/apache/seatunnel/pull/3632))
- Allow user to add additional hadoop-s3 parameters
- Allow the use of the s3a protocol
- Decouple hadoop-aws dependencies
- Allow user to add additional hadoop-s3 parameters
- Allow the use of the s3a protocol
- Decouple hadoop-aws dependencies
- [Feature]Set S3 AK to optional ([3688](https://github.com/apache/seatunnel/pull/))

12 changes: 12 additions & 0 deletions docs/en/connector-v2/source/SftpFile.md
@@ -92,6 +92,7 @@ The File does not have a specific type list, and we can indicate which SeaTunnel
| xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
| schema | Config | No | - | Please check #schema below |
| compress_codec | String | No | None | The compress codec of files and the details that supported as the following shown: <br/> - txt: `lzo` `None` <br/> - json: `lzo` `None` <br/> - csv: `lzo` `None` <br/> - orc: `lzo` `snappy` `lz4` `zlib` `None` <br/> - parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `None` <br/> Tips: excel type does Not support any compression format |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
| common-options | | No | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

@@ -176,6 +177,17 @@ The compress codec of files and the details that supported as the following show
- orc/parquet:
automatically recognizes the compression type, no additional settings required.

### archive_compress_codec [string]

The compression codec of archive files. The supported codecs are as follows:

| archive_compress_codec | file_format | archive_compress_suffix |
|------------------------|--------------------|-------------------------|
| ZIP | txt,json,excel,xml | .zip |
| TAR | txt,json,excel,xml | .tar |
| TAR_GZ | txt,json,excel,xml | .tar.gz |
| NONE | all | .* |

### encoding [string]

Only used when file_format_type is json,text,csv,xml.