
Commit

Merge branch 'dev' into fix_doc
jiamin13579 committed Sep 20, 2024
2 parents e91f0d2 + 4a2e272 commit 25c8252
Showing 77 changed files with 4,911 additions and 914 deletions.
22 changes: 14 additions & 8 deletions docs/en/concept/JobEnvConfig.md
@@ -21,14 +21,26 @@ You can configure whether the task is in batch or stream mode through `job.mode`.

### checkpoint.interval

Gets the interval in which checkpoints are periodically scheduled.
Gets the interval (milliseconds) in which checkpoints are periodically scheduled.

In `STREAMING` mode, checkpointing is required; if you do not set it, it will be obtained from the application configuration file `seatunnel.yaml`. In `BATCH` mode, you can disable checkpointing by not setting this parameter.
In `STREAMING` mode, checkpointing is required; if you do not set it, it will be obtained from the application configuration file `seatunnel.yaml`. In `BATCH` mode, you can disable checkpointing by not setting this parameter. In Zeta `STREAMING` mode, the default value is 30000 milliseconds.

### checkpoint.timeout

The timeout (in milliseconds) for a checkpoint. If the checkpoint is not completed before the timeout, the job will fail. In Zeta, the default value is 30000 milliseconds.

### parallelism

This parameter configures the parallelism of source and sink.
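
For illustration only, a minimal `env` block sketch combining the options above; the values are placeholders, not recommended defaults:

```hocon
env {
  # run as a streaming job; use "BATCH" for batch mode
  job.mode = "STREAMING"
  # schedule a checkpoint every 60 seconds (value is in milliseconds)
  checkpoint.interval = 60000
  # fail the checkpoint (and the job) if it takes longer than 2 minutes
  checkpoint.timeout = 120000
  # parallelism applied to sources and sinks
  parallelism = 2
}
```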

### shade.identifier

Specifies the encryption method. If you do not need to encrypt or decrypt config files, this option can be ignored.

For more details, you can refer to the documentation [Config Encryption Decryption](../connector-v2/Config-Encryption-Decryption.md)
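
As a hedged sketch (assuming the built-in `base64` identifier described in the linked document), enabling config-file encryption might look like:

```hocon
env {
  # use the built-in base64 identifier; leave unset if encryption is not needed
  shade.identifier = "base64"
}
```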

## Zeta Engine Parameter

### job.retry.times

Used to control the default retry times when a job fails. The default value is 3, and it only works in the Zeta engine.
@@ -43,12 +55,6 @@ This parameter is used to specify the location of the savemode when the job is executed.
The default value is `CLUSTER`, which means that the savemode is executed on the cluster. If you want to execute the savemode on the client,
you can set it to `CLIENT`. Please use `CLUSTER` mode whenever possible, because once `CLUSTER` mode is free of problems, `CLIENT` mode will be removed.
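
A short, hedged sketch of these Zeta-only options in an `env` block (placeholder values):

```hocon
env {
  # Zeta only: retry a failed job up to 3 times before giving up
  job.retry.times = 3
  # run SaveMode handling on the cluster (preferred); CLIENT mode may be removed in the future
  savemode.execute.location = "CLUSTER"
}
```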

### shade.identifier

Specifies the encryption method. If you do not need to encrypt or decrypt config files, this option can be ignored.

For more details, you can refer to the documentation [Config Encryption Decryption](../connector-v2/Config-Encryption-Decryption.md)

## Flink Engine Parameter

Listed here are some SeaTunnel parameter names and the corresponding names in Flink; this is not an exhaustive list. Please refer to the official [Flink Documentation](https://flink.apache.org/).
2 changes: 1 addition & 1 deletion docs/en/connector-v2/sink/Assert.md
@@ -4,7 +4,7 @@
## Description

A Flink sink plugin which can detect illegal data according to user-defined rules
A sink plugin which can detect illegal data according to user-defined rules

## Key Features

113 changes: 97 additions & 16 deletions docs/en/connector-v2/sink/Hudi.md
@@ -14,46 +14,91 @@ Used to write data to Hudi.

## Options

| name | type | required | default value |
Base configuration:

| name | type | required | default value |
|----------------------------|---------|----------|-----------------------------|
| table_dfs_path | string | yes | - |
| conf_files_path | string | no | - |
| table_list | Array | no | - |
| auto_commit | boolean | no | true |
| schema_save_mode | enum | no | CREATE_SCHEMA_WHEN_NOT_EXIST|
| common-options | Config | no | - |

Table list configuration:

| name | type | required | default value |
|----------------------------|--------|----------|---------------|
| table_name | string | yes | - |
| table_dfs_path | string | yes | - |
| conf_files_path | string | no | - |
| database | string | no | default |
| table_type | enum | no | COPY_ON_WRITE |
| op_type | enum | no | insert |
| record_key_fields | string | no | - |
| partition_fields | string | no | - |
| table_type | enum | no | copy_on_write |
| op_type | enum | no | insert |
| batch_interval_ms | Int | no | 1000 |
| batch_size | Int | no | 1000 |
| insert_shuffle_parallelism | Int | no | 2 |
| upsert_shuffle_parallelism | Int | no | 2 |
| min_commits_to_keep | Int | no | 20 |
| max_commits_to_keep | Int | no | 30 |
| common-options | config | no | - |
| index_type | enum | no | BLOOM |
| index_class_name | string | no | - |
| record_byte_size | Int | no | 1024 |

Note: When the configuration corresponds to a single table, you can flatten the configuration items in `table_list` to the outer layer.

### table_name [string]

`table_name` The name of the Hudi table.

### database [string]

`database` The database of the Hudi table.

### table_dfs_path [string]

`table_dfs_path` The dfs root path of the Hudi table, such as 'hdfs://nameserivce/data/hudi/hudi_table/'.
`table_dfs_path` The dfs root path of the Hudi table, such as 'hdfs://nameserivce/data/hudi/'.

### table_type [enum]

`table_type` The type of the Hudi table. The value is 'copy_on_write' or 'merge_on_read'.
`table_type` The type of the Hudi table. The value is `COPY_ON_WRITE` or `MERGE_ON_READ`.

### record_key_fields [string]

`record_key_fields` The record key fields of the Hudi table; they are used to generate the record key. This must be configured when `op_type` is `UPSERT`.

### partition_fields [string]

`partition_fields` The partition key fields of the Hudi table; they are used to generate partitions.

### index_type [string]

`index_type` The index type of the Hudi table. Currently, `BLOOM`, `SIMPLE`, and `GLOBAL SIMPLE` are supported.

### index_class_name [string]

`index_class_name` The custom index class of the Hudi table, for example `org.apache.seatunnel.connectors.seatunnel.hudi.index.CustomHudiIndex`.

### record_byte_size [Int]

`record_byte_size` The byte size of each record. This value can be used to help calculate the approximate number of records in each Hudi data file. Adjusting this value can effectively reduce write amplification for Hudi data files.

### conf_files_path [string]

`conf_files_path` The environment conf file path list (local paths), used to initialize the HDFS client for reading Hudi table files. For example: '/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml'.

### op_type [enum]

`op_type` The operation type of the Hudi table. The value is 'insert', 'upsert', or 'bulk_insert'.
`op_type` The operation type of the Hudi table. The value is `insert`, `upsert`, or `bulk_insert`.

### batch_interval_ms [Int]

`batch_interval_ms` The time interval between batch writes to the Hudi table.

### batch_size [Int]

`batch_size` The batch size for writes to the Hudi table.

### insert_shuffle_parallelism [Int]

`insert_shuffle_parallelism` The parallelism for inserting data into the Hudi table.
@@ -70,19 +115,35 @@

`max_commits_to_keep` The maximum number of commits to keep for the Hudi table.

### auto_commit [boolean]

`auto_commit` Whether to commit transactions automatically; enabled by default.

### schema_save_mode [Enum]

Before the synchronization task starts, choose how to handle the existing table structure on the target side (see the sketch after this list).
Option introduction:
`RECREATE_SCHEMA`: the table is created when it does not exist, and dropped and recreated when it already exists.
`CREATE_SCHEMA_WHEN_NOT_EXIST`: the table is created when it does not exist, and skipped when it already exists.
`ERROR_WHEN_SCHEMA_NOT_EXIST`: an error is reported when the table does not exist.
`IGNORE`: the table is not handled.
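
A hedged single-table sketch showing how `schema_save_mode` and `auto_commit` might be combined with the upsert-related options above (the field name `id` and the paths are placeholders):

```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameserivce/data/hudi/"
    database = "st"
    table_name = "test_table"
    op_type = "upsert"
    # record_key_fields is required because op_type is upsert
    record_key_fields = "id"
    index_type = "BLOOM"
    # create the table only if it does not already exist
    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
    auto_commit = true
  }
}
```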

### common options

Sink plugin common parameters, please refer to [Sink Common Options](../sink-common-options.md) for details.

## Examples

### single table
```hocon
sink {
Hudi {
table_dfs_path = "hdfs://nameserivce/data/hudi/hudi_table/"
table_dfs_path = "hdfs://nameserivce/data/"
database = "st"
table_name = "test_table"
table_type = "copy_on_write"
table_type = "COPY_ON_WRITE"
conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
batch_size = 10000
use.kerberos = true
kerberos.principal = "test_user@xxx"
kerberos.principal.file = "/home/test/test_user.keytab"
@@ -91,9 +152,6 @@
```

### Multiple table

#### example1

```hocon
env {
parallelism = 1
@@ -116,9 +174,32 @@ transform {
sink {
Hudi {
table_dfs_path = "hdfs://nameserivce/data/"
conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
table_list = [
{
database = "st1"
table_name = "role"
table_type = "COPY_ON_WRITE"
op_type="INSERT"
batch_size = 10000
},
{
database = "st1"
table_name = "user"
table_type = "COPY_ON_WRITE"
op_type="UPSERT"
      # when op_type is 'UPSERT', record_key_fields must be configured
record_key_fields = "user_id"
batch_size = 10000
},
{
database = "st1"
table_name = "Bucket"
table_type = "MERGE_ON_READ"
}
]
...
table_dfs_path = "hdfs://nameserivce/data/hudi/hudi_table/"
table_name = "${table_name}_test"
}
}
```
29 changes: 25 additions & 4 deletions docs/en/connector-v2/source/Paimon.md
@@ -9,7 +9,7 @@ Read data from Apache Paimon.
## Key features

- [x] [batch](../../concept/connector-v2-features.md)
- [ ] [stream](../../concept/connector-v2-features.md)
- [x] [stream](../../concept/connector-v2-features.md)
- [ ] [exactly-once](../../concept/connector-v2-features.md)
- [ ] [column projection](../../concept/connector-v2-features.md)
- [ ] [parallelism](../../concept/connector-v2-features.md)
@@ -157,9 +157,30 @@ source {
```

## Changelog
If you want to read the changelog with this connector, your Paimon sink table must have the option `changelog-producer=input` set; see [Paimon changelog](https://paimon.apache.org/docs/master/primary-key-table/changelog-producer/).
Currently, only the `input` and `none` changelog-producer modes are supported. If the changelog producer is `input`, streaming reads from the connector will generate -U, +U, +I, and +D records; if the changelog producer is `none`, streaming reads will generate +I, +U, and +D records.

### next version

- Add Paimon Source Connector
- Support projection for Paimon Source

### Streaming read example
```hocon
env {
  parallelism = 1
  job.mode = "Streaming"
}

source {
Paimon {
warehouse = "/tmp/paimon"
database = "full_type"
table = "st_test"
}
}
sink {
Paimon {
warehouse = "/tmp/paimon"
database = "full_type"
table = "st_test_sink"
paimon.table.primary-keys = "c_tinyint"
}
}
```
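
Note that the Paimon table being read must have been created with `changelog-producer=input` for the -U/+U stream above. A hedged sketch of setting this through the sink's table write properties (assuming the `paimon.table.write-props` option accepts raw Paimon table properties) might look like:

```hocon
sink {
  Paimon {
    warehouse = "/tmp/paimon"
    database = "full_type"
    table = "st_test"
    paimon.table.primary-keys = "c_tinyint"
    # assumed: pass Paimon table properties so the table is created with
    # an 'input' changelog producer
    paimon.table.write-props = {
      changelog-producer = "input"
    }
  }
}
```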
12 changes: 6 additions & 6 deletions docs/en/start-v2/docker/docker.md
@@ -146,14 +146,14 @@ docker run --rm -it apache/seatunnel bash -c '<YOUR_FLINK_HOME>/bin/start-cluste

There are 2 ways to create a cluster within docker.

### 1. Use Docker Directly
### Use Docker Directly

1. create a network
#### create a network
```shell
docker network create seatunnel-network
```

2. start the nodes
#### start the nodes
- start master node
```shell
## start master and expose port 5801
@@ -213,7 +213,7 @@ docker run -d --name seatunnel_worker_1 \
```


### 2. Use Docker-compose
### Use Docker-compose

> docker cluster mode only supports the zeta engine.
@@ -368,7 +368,7 @@ and run `docker-compose up -d` command, the new worker node will start, and the

### Job Operation on cluster

1. use docker as a client
#### use docker as a client
- submit job :
```shell
docker run --name seatunnel_client \
@@ -393,7 +393,7 @@ For more commands, please refer to [user-command](../../seatunnel-engine/user-command.md)



2. use rest api
#### use rest api

Please refer to [Submit A Job](../../seatunnel-engine/rest-api.md#submit-a-job).

2 changes: 1 addition & 1 deletion docs/en/start-v2/locally/deployment.md
@@ -69,7 +69,7 @@ You can download the source code from the [download page](https://seatunnel.apac

```shell
cd seatunnel
sh ./mvnw clean package -DskipTests -Dskip.spotless=true
sh ./mvnw clean install -DskipTests -Dskip.spotless=true
# get the binary package
cp seatunnel-dist/target/apache-seatunnel-2.3.8-bin.tar.gz /The-Path-You-Want-To-Copy

22 changes: 14 additions & 8 deletions docs/zh/concept/JobEnvConfig.md
@@ -21,14 +21,26 @@

### checkpoint.interval

Gets the time interval at which checkpoints are periodically scheduled.
Gets the time interval (in milliseconds) at which checkpoints are periodically scheduled.

In `STREAMING` mode, checkpointing is required; if it is not set, it will be obtained from the application configuration file `seatunnel.yaml`. In `BATCH` mode, you can disable checkpointing by not setting this parameter.
In `STREAMING` mode, checkpointing is required; if it is not set, it will be obtained from the application configuration file `seatunnel.yaml`. In `BATCH` mode, you can disable checkpointing by not setting this parameter. In Zeta `STREAMING` mode, the default value is 30000 milliseconds.

### checkpoint.timeout

The timeout (in milliseconds) for a checkpoint. If the checkpoint is not completed before the timeout, the job will fail. In Zeta, the default value is 30000 milliseconds.

### parallelism

This parameter configures the parallelism of source and sink.

### shade.identifier

Specifies the encryption method. If you do not need to encrypt or decrypt config files, this option can be ignored.

For more details, refer to [Config Encryption Decryption](../../en/connector-v2/Config-Encryption-Decryption.md).

## Zeta Engine Parameter

### job.retry.times

Used to control the default number of retries when a job fails. The default value is 3, and it only applies to the Zeta engine.
@@ -44,12 +56,6 @@
When the value is `CLIENT`, the SaveMode operation is executed during job submission. When a job is submitted with the shell script, this runs in the shell process that submits the job; when a job is submitted via the REST API, it runs in the thread handling the HTTP request.
Please use `CLUSTER` mode whenever possible, because once `CLUSTER` mode is free of problems, `CLIENT` mode will be removed.

### shade.identifier

Specifies the encryption method. If you do not need to encrypt or decrypt config files, this option can be ignored.

For more details, refer to [Config Encryption Decryption](../../en/connector-v2/Config-Encryption-Decryption.md).

## Flink Engine Parameter

Listed here are some SeaTunnel parameter names and the corresponding names in Flink; this is not an exhaustive list. Please refer to the official [Flink Documentation](https://flink.apache.org/).
