
Commit

Merge branch 'dev' into seatunnel-flink-multitable-new
PeppaPage authored Sep 19, 2024
2 parents db08187 + 4f5d27f commit 44b5399
Showing 59 changed files with 3,982 additions and 653 deletions.
22 changes: 14 additions & 8 deletions docs/en/concept/JobEnvConfig.md
You can configure whether the task is in batch or stream mode through `job.mode`.

### checkpoint.interval

Gets the interval (in milliseconds) at which checkpoints are periodically scheduled.

In `STREAMING` mode, checkpoints are required; if you do not set this parameter, it will be obtained from the application configuration file `seatunnel.yaml`. In `BATCH` mode, you can disable checkpoints by not setting this parameter. In Zeta `STREAMING` mode, the default value is 30000 milliseconds.

### checkpoint.timeout

The timeout (in milliseconds) for a checkpoint. If the checkpoint is not completed before the timeout, the job will fail. In Zeta, the default value is 30000 milliseconds.

### parallelism

This parameter configures the parallelism of source and sink.
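
Taken together, the options above live in the `env` block of a job config. A minimal sketch (the values below are illustrative only):

```hocon
env {
  # STREAMING keeps the job running continuously; checkpoints are required in this mode.
  job.mode = "STREAMING"
  # Trigger a checkpoint every 10 seconds and fail it if it takes longer than 60 seconds.
  checkpoint.interval = 10000
  checkpoint.timeout = 60000
  # Parallelism applied to the source and sink.
  parallelism = 2
}
```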

### shade.identifier

Specifies the encryption method. If you have no requirement for encrypting or decrypting config files, this option can be ignored.

For more details, you can refer to the documentation [Config Encryption Decryption](../connector-v2/Config-Encryption-Decryption.md).
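
For illustration, declaring the built-in `base64` identifier described in that document might look like this (a sketch; check the encryption doc for the identifiers your version actually supports):

```hocon
env {
  # Tell SeaTunnel that sensitive options in this file were encrypted
  # with the built-in base64 shade, so it can decrypt them while parsing.
  shade.identifier = "base64"
}
```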

## Zeta Engine Parameters

### job.retry.times

Used to control the default number of retries when a job fails. The default value is 3, and it only works in the Zeta engine.

### savemode.execute.location

This parameter is used to specify where the savemode operation is executed when the job runs.
The default value is `CLUSTER`, which means that the savemode is executed on the cluster. If you want to execute the savemode on the client,
you can set it to `CLIENT`. Please use `CLUSTER` mode as much as possible, because `CLIENT` mode will be removed once `CLUSTER` mode is proven stable.
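
As a sketch, the two Zeta options above combine in the `env` block like this (the values are illustrative):

```hocon
env {
  job.mode = "BATCH"
  # Zeta only: retry a failed job up to 5 times instead of the default 3.
  job.retry.times = 5
  # Zeta only: run the savemode action on the cluster (the default and preferred value).
  savemode.execute.location = "CLUSTER"
}
```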

## Flink Engine Parameters

Here are some SeaTunnel parameter names corresponding to the names in Flink; this is not a complete list. For more details, please refer to the official [Flink Documentation](https://flink.apache.org/).
2 changes: 1 addition & 1 deletion docs/en/connector-v2/sink/Assert.md
## Description

A sink plugin which can assert illegal data by user-defined rules.

## Key Features

113 changes: 97 additions & 16 deletions docs/en/connector-v2/sink/Hudi.md
Used to write data to Hudi.

## Options

Base configuration:

| name | type | required | default value |
|----------------------------|---------|----------|-----------------------------|
| table_dfs_path | string | yes | - |
| conf_files_path | string | no | - |
| table_list | Array | no | - |
| auto_commit | boolean | no | true |
| schema_save_mode | enum | no | CREATE_SCHEMA_WHEN_NOT_EXIST|
| common-options | Config | no | - |

Table list configuration:

| name | type | required | default value |
|----------------------------|--------|----------|---------------|
| table_name                 | string | yes      | -             |
| database                   | string | no       | default       |
| table_type                 | enum   | no       | COPY_ON_WRITE |
| op_type                    | enum   | no       | insert        |
| record_key_fields          | string | no       | -             |
| partition_fields           | string | no       | -             |
| batch_interval_ms          | Int    | no       | 1000          |
| batch_size                 | Int    | no       | 1000          |
| insert_shuffle_parallelism | Int    | no       | 2             |
| upsert_shuffle_parallelism | Int    | no       | 2             |
| min_commits_to_keep        | Int    | no       | 20            |
| max_commits_to_keep        | Int    | no       | 30            |
| index_type                 | enum   | no       | BLOOM         |
| index_class_name           | string | no       | -             |
| record_byte_size           | Int    | no       | 1024          |

Note: When the configuration targets a single table, the configuration items in `table_list` can be flattened into the outer layer.

### table_name [string]

`table_name` The name of the Hudi table.

### database [string]

`database` The database of the Hudi table.

### table_dfs_path [string]

`table_dfs_path` The DFS root path of the Hudi table, such as 'hdfs://nameservice/data/hudi/'.

### table_type [enum]

`table_type` The type of the Hudi table. The value is `COPY_ON_WRITE` or `MERGE_ON_READ`.

### record_key_fields [string]

`record_key_fields` The record key fields of the Hudi table; they are used to generate the record key. This must be configured when `op_type` is `UPSERT`.

### partition_fields [string]

`partition_fields` The partition key fields of the Hudi table; they are used to generate the partition.

### index_type [string]

`index_type` The index type of the Hudi table. Currently, `BLOOM`, `SIMPLE`, and `GLOBAL SIMPLE` are supported.

### index_class_name [string]

`index_class_name` The custom index class name of the Hudi table, for example `org.apache.seatunnel.connectors.seatunnel.hudi.index.CustomHudiIndex`.

### record_byte_size [Int]

`record_byte_size` The approximate byte size of each record. This value is used to estimate the number of records in each Hudi data file; adjusting it can effectively reduce write amplification for Hudi data files.
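
For illustration, the three index-related options above might be set like this in a flattened single-table config (a sketch; the custom class is the hypothetical example from above and is shown commented out):

```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    table_name = "test_table"
    # Use one of the built-in index types ...
    index_type = "BLOOM"
    # ... or plug in a custom index implementation instead:
    # index_class_name = "org.apache.seatunnel.connectors.seatunnel.hudi.index.CustomHudiIndex"
    # Assume roughly 1 KB per record when sizing Hudi data files.
    record_byte_size = 1024
  }
}
```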

### conf_files_path [string]

`conf_files_path` The list of environment configuration file paths (local paths), used to initialize the HDFS client for reading Hudi table files. Example: '/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml'.

### op_type [enum]

`op_type` The operation type of the Hudi table. The value is `insert`, `upsert`, or `bulk_insert`.

### batch_interval_ms [Int]

`batch_interval_ms` The time interval between batch writes to the Hudi table.

### batch_size [Int]

`batch_size` The batch size for writes to the Hudi table.

### insert_shuffle_parallelism [Int]

`insert_shuffle_parallelism` The parallelism for inserting data into the Hudi table.

### upsert_shuffle_parallelism [Int]

`upsert_shuffle_parallelism` The parallelism for upserting data into the Hudi table.

### min_commits_to_keep [Int]

`min_commits_to_keep` The minimum number of commits to keep for the Hudi table.

### max_commits_to_keep [Int]

`max_commits_to_keep` The maximum number of commits to keep for the Hudi table.

### auto_commit [boolean]

`auto_commit` Whether to commit the transaction automatically. It is enabled by default.

### schema_save_mode [Enum]

Before the synchronization task starts, choose how to handle an existing table structure on the target side.
Option introduction:
`RECREATE_SCHEMA`: the table is created when it does not exist, and dropped and recreated when it exists
`CREATE_SCHEMA_WHEN_NOT_EXIST`: the table is created when it does not exist, and skipped when it exists
`ERROR_WHEN_SCHEMA_NOT_EXIST`: an error is reported when the table does not exist
`IGNORE`: the table is left untouched
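
For example, to drop and recreate the target table on every run (a sketch; the other enum values are set the same way):

```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    table_name = "test_table"
    # Recreate the table structure on the target side before writing.
    schema_save_mode = "RECREATE_SCHEMA"
  }
}
```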

### common options

Sink plugin common parameters, please refer to [Sink Common Options](../sink-common-options.md) for details.

## Examples

### Single table
```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    database = "st"
    table_name = "test_table"
    table_type = "COPY_ON_WRITE"
    conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
    batch_size = 10000
    use.kerberos = true
    kerberos.principal = "test_user@xxx"
    kerberos.principal.file = "/home/test/test_user.keytab"
  }
}
```

### Multiple tables

```hocon
env {
  parallelism = 1
  ...
}

...

transform {
  ...
}

sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
    table_list = [
      {
        database = "st1"
        table_name = "role"
        table_type = "COPY_ON_WRITE"
        op_type = "INSERT"
        batch_size = 10000
      },
      {
        database = "st1"
        table_name = "user"
        table_type = "COPY_ON_WRITE"
        op_type = "UPSERT"
        # when op_type is 'UPSERT', record_key_fields must be configured
        record_key_fields = "user_id"
        batch_size = 10000
      },
      {
        database = "st1"
        table_name = "Bucket"
        table_type = "MERGE_ON_READ"
      }
    ]
  }
}
```
12 changes: 6 additions & 6 deletions docs/en/start-v2/docker/docker.md

There are two ways to create a cluster with Docker.

### Use Docker Directly

#### Create a network
```shell
docker network create seatunnel-network
```

#### Start the nodes
- start master node
```shell
## start master and export 5801 port
...
```


### Use Docker-compose

> Docker cluster mode only supports the Zeta engine.
Run the `docker-compose up -d` command, and the new worker node will start.

### Job Operation on cluster

#### Use Docker as a client
- submit a job:
```shell
docker run --name seatunnel_client \
  ...
```

For more commands, please refer to [user-command](../../seatunnel-engine/user-command.md).



#### Use the REST API

Please refer to [Submit A Job](../../seatunnel-engine/rest-api.md#submit-a-job).
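
For instance, a job could be submitted to the master's REST endpoint roughly like this (a sketch based on the rest-api doc; the host, port, and payload fields should be checked against your SeaTunnel version):

```shell
# Submit a FakeSource -> Console job to a Zeta master exposed on port 5801.
curl -X POST 'http://127.0.0.1:5801/hazelcast/rest/maps/submit-job?jobName=docker-rest-demo' \
  -H 'Content-Type: application/json' \
  -d '{
    "env":    { "parallelism": 1, "job.mode": "BATCH" },
    "source": [{ "plugin_name": "FakeSource", "result_table_name": "fake",
                 "row.num": 10,
                 "schema": { "fields": { "name": "string", "age": "int" } } }],
    "sink":   [{ "plugin_name": "Console", "source_table_name": "fake" }]
  }'
```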

2 changes: 1 addition & 1 deletion docs/en/start-v2/locally/deployment.md
You can download the source code from the [download page](https://seatunnel.apache.org/download).

```shell
cd seatunnel
sh ./mvnw clean install -DskipTests -Dskip.spotless=true
# get the binary package
cp seatunnel-dist/target/apache-seatunnel-2.3.8-bin.tar.gz /The-Path-You-Want-To-Copy

```
22 changes: 14 additions & 8 deletions docs/zh/concept/JobEnvConfig.md

### checkpoint.interval

Gets the interval (in milliseconds) at which checkpoints are periodically scheduled.

In `STREAMING` mode, checkpoints are required; if not set, the value will be obtained from the application configuration file `seatunnel.yaml`. In `BATCH` mode, you can disable checkpoints by not setting this parameter. In Zeta `STREAMING` mode, the default value is 30000 milliseconds.

### checkpoint.timeout

The timeout (in milliseconds) for a checkpoint. If the checkpoint is not completed before the timeout, the job will fail. In Zeta, the default value is 30000 milliseconds.

### parallelism

This parameter configures the parallelism of the source and sink.

### shade.identifier

Specifies the encryption method. If you have no requirement for encrypting or decrypting config files, this option can be ignored.

For more details, you can refer to the documentation [Config Encryption Decryption](../../en/connector-v2/Config-Encryption-Decryption.md).

## Zeta Engine Parameters

### job.retry.times

Used to control the default number of retries when a job fails. The default value is 3, and it only works in the Zeta engine.

### savemode.execute.location

When the value is `CLIENT`, the SaveMode operation is executed while the job is being submitted: when the job is submitted with the shell script, it runs in the shell process that submits the job; when the job is submitted via the REST API, it runs in the thread that handles the HTTP request.
Please use `CLUSTER` mode as much as possible, because `CLIENT` mode will be removed once `CLUSTER` mode is proven stable.

## Flink Engine Parameters

Here are some SeaTunnel parameter names corresponding to the names in Flink; this is not a complete list. For more, please refer to the official [Flink Documentation](https://flink.apache.org/).