From 413a9184a58771af0add9022a56fe99b98afbc77 Mon Sep 17 00:00:00 2001
From: tcodehuber
Date: Fri, 12 Jul 2024 23:41:19 +0800
Subject: [PATCH] [Docs] Update transform about and faq related docs info

---
 docs/en/about.md                       | 26 +++++++++++++-------------
 docs/en/faq.md                         | 24 ++++++++++++------------
 docs/en/transform-v2/common-options.md |  2 +-
 docs/en/transform-v2/sql-udf.md        |  2 +-
 docs/zh/about.md                       | 20 ++++++++++----------
 docs/zh/faq.md                         | 20 ++++++++++----------
 6 files changed, 47 insertions(+), 47 deletions(-)

diff --git a/docs/en/about.md b/docs/en/about.md
index 5164dc081c0..a2262d6355b 100644
--- a/docs/en/about.md
+++ b/docs/en/about.md
@@ -9,7 +9,7 @@ SeaTunnel is a very easy-to-use, ultra-high-performance, distributed data integr
 synchronization of massive data. It can synchronize tens of billions of data stably and efficiently every day, and has been used in production by nearly 100 companies.
 
-## Why we need SeaTunnel
+## Why We Need SeaTunnel
 
 SeaTunnel focuses on data integration and data synchronization, and is mainly designed to solve common problems in the field of data integration:
 
@@ -18,29 +18,29 @@ SeaTunnel focuses on data integration and data synchronization, and is mainly de
 - High resource demand: Existing data integration and data synchronization tools often require vast computing resources or JDBC connection resources to complete real-time synchronization of massive small tables. This has increased the burden on enterprises.
 - Lack of quality and monitoring: Data integration and synchronization processes often experience loss or duplication of data. The synchronization process lacks monitoring, and it is impossible to intuitively understand the real situation of the data during the task process.
 - Complex technology stack: The technology components used by enterprises are different, and users need to develop corresponding synchronization programs for different components to complete data integration.
-- Difficulty in management and maintenance: Limited to different underlying technology components (Flink/Spark), offline synchronization and real-time synchronization often have be developed and managed separately, which increases the difficulty of management and maintainance.
+- Difficulty in management and maintenance: Limited to different underlying technology components (Flink/Spark), offline synchronization and real-time synchronization often have to be developed and managed separately, which increases the difficulty of management and maintenance.
 
 ## Features of SeaTunnel
 
-- Rich and extensible Connector: SeaTunnel provides a Connector API that does not depend on a specific execution engine. Connectors (Source, Transform, Sink) developed based on this API can run on many different engines, such as SeaTunnel Engine, Flink, and Spark, that are currently supported.
-- Connector plug-in: The plug-in design allows users to easily develop their own Connector and integrate it into the SeaTunnel project. Currently, SeaTunnel supports more than 100 Connectors, and the number is surging. Here is the list of [currently-supported connectors](Connector-v2-release-state.md)
+- Rich and extensible Connector: SeaTunnel provides a Connector API that does not depend on a specific execution engine. Connectors (Source, Transform, Sink) developed based on this API can run on many different engines, such as SeaTunnel Engine (Zeta), Flink, and Spark.
+- Connector plugin: The plugin design allows users to easily develop their own Connector and integrate it into the SeaTunnel project. Currently, SeaTunnel supports more than 100 Connectors, and the number is surging. Here is the list of [currently supported connectors](Connector-v2-release-state.md).
 - Batch-stream integration: Connectors developed based on the SeaTunnel Connector API are perfectly compatible with offline synchronization, real-time synchronization, full-synchronization, incremental synchronization and other scenarios. They greatly reduce the difficulty of managing data integration tasks.
 - Supports a distributed snapshot algorithm to ensure data consistency.
-- Multi-engine support: SeaTunnel uses the SeaTunnel Engine for data synchronization by default. SeaTunnel also supports the use of Flink or Spark as the execution engine of the Connector to adapt to the existing technical components of the enterprise. SeaTunnel supports multiple versions of Spark and Flink.
+- Multi-engine support: SeaTunnel uses the SeaTunnel Engine (Zeta) for data synchronization by default. SeaTunnel also supports the use of Flink or Spark as the execution engine of the Connector to adapt to the enterprise's existing technical components. SeaTunnel supports multiple versions of Spark and Flink.
 - JDBC multiplexing, database log multi-table parsing: SeaTunnel supports multi-table or whole database synchronization, which solves the problem of over-JDBC connections; and supports multi-table or whole database log reading and parsing, which solves the need for CDC multi-table synchronization scenarios to deal with problems with repeated reading and parsing of logs.
 - High throughput and low latency: SeaTunnel supports parallel reading and writing, providing stable and reliable data synchronization capabilities with high throughput and low latency.
 - Perfect real-time monitoring: SeaTunnel supports detailed monitoring information of each step in the data synchronization process, allowing users to easily understand the number of data, data size, QPS and other information read and written by the synchronization task.
 - Two job development methods are supported: coding and canvas design. The SeaTunnel web project https://github.com/apache/seatunnel-web provides visual management of jobs, scheduling, running and monitoring capabilities.
 
-## SeaTunnel work flowchart
+## SeaTunnel Work Flowchart
 
-![SeaTunnel work flowchart](../images/architecture_diagram.png)
+![SeaTunnel Work Flowchart](../images/architecture_diagram.png)
 
 The runtime process of SeaTunnel is shown in the figure above.
 
 The user configures the job information and selects the execution engine to submit the job.
 
-The Source Connector is responsible for parallel reading the data and sending the data to the downstream Transform or directly to the Sink, and the Sink writes the data to the destination. It is worth noting that Source, Transform and Sink can be easily developed and extended by yourself.
+The Source Connector is responsible for reading the data in parallel and sending it to the downstream Transform or directly to the Sink, and the Sink writes the data to the destination. It is worth noting that Source, Transform and Sink can be easily developed and extended by yourself.
 
 SeaTunnel is an EL(T) data integration platform. Therefore, in SeaTunnel, Transform can only be used to perform some simple transformations on data, such as converting the data of a column to uppercase or lowercase, changing the column name, or splitting a column into multiple columns.
 
@@ -48,15 +48,15 @@ The default engine use by SeaTunnel is [SeaTunnel Engine](seatunnel-engine/about
 
 ## Connector
 
-- **Source Connectors** SeaTunnel supports reading data from various relational, graph, NoSQL, document, and memory databases; distributed file systems such as HDFS; and a variety of cloud storage solutions, such as S3 and OSS. We also support data reading of many common SaaS services. You can access the detailed list [here](connector-v2/source). If you want, You can develop your own source connector and easily integrate it into SeaTunnel.
+- **Source Connectors** SeaTunnel supports reading data from various relational, graph, NoSQL, document, and memory databases; distributed file systems such as HDFS; and a variety of cloud storage solutions, such as S3 and OSS. We also support data reading of many common SaaS services. You can access the detailed list [here](connector-v2/source). If you want, you can develop your own source connector and easily integrate it into SeaTunnel.
 
 - **Transform Connector** If the schema is different between source and Sink, You can use the Transform Connector to change the schema read from source and make it the same as the Sink schema.
 
 - **Sink Connector** SeaTunnel supports writing data to various relational, graph, NoSQL, document, and memory databases; distributed file systems such as HDFS; and a variety of cloud storage solutions, such as S3 and OSS. We also support writing data to many common SaaS services. You can access the detailed list [here](connector-v2/sink). If you want, you can develop your own Sink connector and easily integrate it into SeaTunnel.
 
-## Who uses SeaTunnel
+## Who Uses SeaTunnel
 
 SeaTunnel has lots of users. You can find more information about them in [users](https://seatunnel.apache.org/user).
 
 ## Landscapes
 
diff --git a/docs/en/faq.md b/docs/en/faq.md
index 953cc2a9569..2e50c9d4618 100644
--- a/docs/en/faq.md
+++ b/docs/en/faq.md
@@ -6,7 +6,7 @@ SeaTunnel now uses computing engines such as Spark and Flink to complete resourc
 
 ## I have a question, and I cannot solve it by myself
 
-I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue list](https://github.com/apache/seatunnel/issues) or [mailing list](https://lists.apache.org/list.html?dev@seatunnel.apache.org) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [these ways](https://github.com/apache/seatunnel#contact-us).
+I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in the [issue list](https://github.com/apache/seatunnel/issues) or [mailing list](https://lists.apache.org/list.html?dev@seatunnel.apache.org) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [these ways](https://github.com/apache/seatunnel#contact-us).
 
 ## How do I declare a variable?
 
@@ -61,7 +61,7 @@ your string 1
 
 Refer to: [lightbend/config#456](https://github.com/lightbend/config/issues/456).
 
-## Is SeaTunnel supportted in Azkaban, Oozie, DolphinScheduler?
+## Is SeaTunnel supported in Azkaban, Oozie, DolphinScheduler?
 
 Of course! See the screenshot below:
 
@@ -93,7 +93,7 @@ sink {
 
 ## Are there any HBase plugins?
 
-There is an hbase input plugin. You can download it from here: https://github.com/garyelephant/waterdrop-input-hbase .
+There is an HBase input plugin. You can download it from here: https://github.com/garyelephant/waterdrop-input-hbase .
 
 ## How can I use SeaTunnel to write data to Hive?
 
@@ -184,7 +184,7 @@ The following conclusions can be drawn:
 
 3. In general, both M and N are determined, and the conclusion can be drawn from 2: The size of `spark.streaming.kafka.maxRatePerPartition` is positively correlated with the size of `spark.executor.cores` * `spark.executor.instances`, and it can be increased while increasing the resource `maxRatePerPartition` to speed up consumption.
 
-![kafka](../images/kafka.png)
+![Kafka](../images/kafka.png)
 
 ## How can I solve the Error `Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE`?
 
@@ -203,11 +203,11 @@ spark {
   }
 }
 ```
 
-## How do I specify a different JDK version for SeaTunnel on Yarn?
+## How do I specify a different JDK version for SeaTunnel on YARN?
 
 For example, if you want to set the JDK version to JDK8, there are two cases:
 
-- The Yarn cluster has deployed JDK8, but the default JDK is not JDK8. Add two configurations to the SeaTunnel config file:
+- The YARN cluster has deployed JDK8, but the default JDK is not JDK8. Add two configurations to the SeaTunnel config file:
 
 ```
 env {
 ...
 java.home = "/your/java_8_home/directory"
 java.opts = "-XX:+IgnoreUnrecognizedVMOptions"
 ...
 }
 ```
-- Yarn cluster does not deploy JDK8. At this time, start SeaTunnel attached with JDK8. For detailed operations, see:
+- The YARN cluster does not deploy JDK8. In this case, start SeaTunnel with JDK8 attached. For detailed operations, see:
   https://www.cnblogs.com/jasondan/p/spark-specific-jdk-version.html
 
 ## What should I do if OOM always appears when running SeaTunnel in Spark local[*] mode?
 
-If you run in local mode, you need to modify the `start-seatunnel.sh` startup script. After `spark-submit`, add a parameter `--driver-memory 4g` . Under normal circumstances, local mode is not used in the production environment. Therefore, this parameter generally does not need to be set during On Yarn. See: [Application Properties](https://spark.apache.org/docs/latest/configuration.html#application-properties) for details.
+If you run in local mode, you need to modify the `start-seatunnel.sh` startup script. After `spark-submit`, add the parameter `--driver-memory 4g`. Under normal circumstances, local mode is not used in the production environment. Therefore, this parameter generally does not need to be set when running on YARN. See: [Application Properties](https://spark.apache.org/docs/latest/configuration.html#application-properties) for details.
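As a rough sketch of that change (only the `--driver-memory 4g` flag is the actual fix; the surrounding arguments are illustrative placeholders, not the script's real contents), the submit line in `start-seatunnel.sh` would end up looking something like:

```bash
# Sketch only: insert --driver-memory right after spark-submit in start-seatunnel.sh.
# "$@" stands in for whatever arguments the script already forwards.
exec "${SPARK_HOME}/bin/spark-submit" \
  --driver-memory 4g \
  "$@"
```

Since local mode runs the entire job inside the driver JVM, raising the driver's memory (rather than executor memory) is what actually relieves the OOM.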
 ## Where can I place self-written plugins or third-party jdbc.jars to be loaded by SeaTunnel?
 
@@ -236,14 +236,14 @@
 cp third-part.jar plugins/my_plugins/lib
 
 `my_plugins` can be any string.
 
-## How do I configure logging-related parameters in SeaTunnel-v1(Spark)?
+## How do I configure logging-related parameters in SeaTunnel-V1(Spark)?
 
 There are three ways to configure logging-related parameters (such as Log Level):
 
 - [Not recommended] Change the default `$SPARK_HOME/conf/log4j.properties`.
   - This will affect all programs submitted via `$SPARK_HOME/bin/spark-submit`.
 - [Not recommended] Modify logging related parameters directly in the Spark code of SeaTunnel.
-  - This is equivalent to writing dead, and each change needs to be recompiled.
+  - This is equivalent to hardcoding, and each change needs to be recompiled.
 - [Recommended] Use the following methods to change the logging configuration in the SeaTunnel configuration file (The change only takes effect if SeaTunnel >= 1.5.5 ):
 
 ```
@@ -283,7 +283,7 @@ log4j.appender.console.layout=org.apache.log4j.PatternLayout
 log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
 ```
 
-## How do I configure logging related parameters in SeaTunnel-v2(Spark, Flink)?
+## How do I configure logging-related parameters in SeaTunnel-V2(Spark, Flink)?
 
 Currently, they cannot be set directly. you need to modify the SeaTunnel startup script. The relevant parameters are specified in the task submission command. For specific parameters, please refer to the official documents:
 
@@ -309,7 +309,7 @@ For example, if you want to output more detailed logs of E2E Test, just downgrad
 
 In SeaTunnel, the data type will not be actively converted. After the Input reads the data, the corresponding Schema. When writing ClickHouse, the field type needs to be strictly matched, and the mismatch needs to be resolved.
 
-Data conversion can be achieved through the following two plug-ins:
+Data conversion can be achieved through the following two plugins:
 
 1. Filter Convert plugin
 2. Filter Sql plugin
 
diff --git a/docs/en/transform-v2/common-options.md b/docs/en/transform-v2/common-options.md
index ce88ce8528f..7c13bac4f00 100644
--- a/docs/en/transform-v2/common-options.md
+++ b/docs/en/transform-v2/common-options.md
@@ -4,7 +4,7 @@
 
 | Name              | Type   | Required | Default | Description |
 |-------------------|--------|----------|---------|-------------|
-| result_table_name | String | No       | -       | When `source_table_name` is not specified, the current plug-in processes the data set `(dataset)` output by the previous plug-in in the configuration file;
When `source_table_name` is specified, the current plugin is processing the data set corresponding to this parameter. | +| result_table_name | String | No | - | When `source_table_name` is not specified, the current plugin processes the data set `(dataset)` output by the previous plugin in the configuration file;
When `source_table_name` is specified, the current plugin processes the data set corresponding to this parameter. |
 | source_table_name | String | No | - | When `result_table_name` is not specified, the data processed by this plugin will not be registered as a data set that can be directly accessed by other plugins, or called a temporary table `(table)`;
When `result_table_name` is specified, the data processed by this plugin will be registered as a data set `(dataset)` that can be directly accessed by other plugins, or called a temporary table `(table)` . The dataset registered here can be directly accessed by other plugins by specifying `source_table_name` . |
 
 ## Task Example
 
diff --git a/docs/en/transform-v2/sql-udf.md b/docs/en/transform-v2/sql-udf.md
index 78810c11b53..df5d3b93fe5 100644
--- a/docs/en/transform-v2/sql-udf.md
+++ b/docs/en/transform-v2/sql-udf.md
@@ -4,7 +4,7 @@
 
 ## Description
 
-Use UDF SPI to extends the SQL transform functions lib.
+Use UDF SPI to extend the SQL transform function library.
 
 ## UDF API
 
diff --git a/docs/zh/about.md b/docs/zh/about.md
index ae789d4d7f7..93c7f877168 100644
--- a/docs/zh/about.md
+++ b/docs/zh/about.md
@@ -7,7 +7,7 @@
 
 SeaTunnel是一个非常易用、超高性能的分布式数据集成平台,支持实时海量数据同步。 每天可稳定高效同步数百亿数据,已被近百家企业应用于生产。
 
-## 我们为什么需要 SeaTunnel
+## 为什么需要 SeaTunnel
 
 SeaTunnel专注于数据集成和数据同步,主要旨在解决数据集成领域的常见问题:
 
@@ -18,21 +18,21 @@ SeaTunnel专注于数据集成和数据同步,主要旨在解决数据集成
 - 技术栈复杂:企业使用的技术组件不同,用户需要针对不同组件开发相应的同步程序来完成数据集成。
 - 管理和维护困难:受限于底层技术组件(Flink/Spark)不同,离线同步和实时同步往往需要分开开发和管理,增加了管理和维护的难度。
 
-## Features of SeaTunnel
+## SeaTunnel 相关特性
 
-- 丰富且可扩展的Connector:SeaTunnel提供了不依赖于特定执行引擎的Connector API。 基于该API开发的Connector(Source、Transform、Sink)可以运行在很多不同的引擎上,例如目前支持的SeaTunnel Engine、Flink、Spark等。
+- 丰富且可扩展的Connector:SeaTunnel提供了不依赖于特定执行引擎的Connector API。 基于该API开发的Connector(Source、Transform、Sink)可以运行在很多不同的引擎上,例如目前支持的SeaTunnel引擎(Zeta)、Flink、Spark等。
 - Connector插件:插件式设计让用户可以轻松开发自己的Connector并将其集成到SeaTunnel项目中。 目前,SeaTunnel 支持超过 100 个连接器,并且数量正在激增。 这是[当前支持的连接器]的列表(Connector-v2-release-state.md)
 - 批流集成:基于SeaTunnel Connector API开发的Connector完美兼容离线同步、实时同步、全量同步、增量同步等场景。 它们大大降低了管理数据集成任务的难度。
 - 支持分布式快照算法,保证数据一致性。
-- 多引擎支持:SeaTunnel默认使用SeaTunnel引擎进行数据同步。 SeaTunnel还支持使用Flink或Spark作为Connector的执行引擎,以适应企业现有的技术组件。 SeaTunnel 支持 Spark 和 Flink 的多个版本。
+- 多引擎支持:SeaTunnel默认使用SeaTunnel引擎(Zeta)进行数据同步。 SeaTunnel还支持使用Flink或Spark作为Connector的执行引擎,以适应企业现有的技术组件。 SeaTunnel 支持 Spark 和 Flink 的多个版本。
 - JDBC复用、数据库日志多表解析:SeaTunnel支持多表或全库同步,解决了过度JDBC连接的问题; 支持多表或全库日志读取解析,解决了CDC多表同步场景下需要处理日志重复读取解析的问题。
 - 高吞吐量、低延迟:SeaTunnel支持并行读写,提供稳定可靠、高吞吐量、低延迟的数据同步能力。
 - 完善的实时监控:SeaTunnel支持数据同步过程中每一步的详细监控信息,让用户轻松了解同步任务读写的数据数量、数据大小、QPS等信息。
 - 支持两种作业开发方法:编码和画布设计。 SeaTunnel Web 项目 https://github.com/apache/seatunnel-web 提供作业、调度、运行和监控功能的可视化管理。
 
-## SeaTunnel work flowchart
+## SeaTunnel 工作流图
 
-![SeaTunnel work flowchart](../images/architecture_diagram.png)
+![SeaTunnel Work Flowchart](../images/architecture_diagram.png)
 
 SeaTunnel的运行流程如上图所示。
 
@@ -52,11 +52,11 @@ SeaTunnel 使用的默认引擎是 [SeaTunnel Engine](seatunnel-engine/about.md)
 
 - **Sink Connector** SeaTunnel 支持将数据写入各种关系型、图形、NoSQL、文档和内存数据库; 分布式文件系统,例如HDFS; 以及各种云存储解决方案,例如S3和OSS。 我们还支持将数据写入许多常见的 SaaS 服务。 您可以在[此处]访问详细列表。 如果您愿意,您可以开发自己的 Sink 连接器并轻松将其集成到 SeaTunnel 中。
 
-## Who uses SeaTunnel
+## 谁在使用 SeaTunnel
 
 SeaTunnel 拥有大量用户。 您可以在[用户](https://seatunnel.apache.org/user)中找到有关他们的更多信息.
 
-## Landscapes
+## 展望



@@ -65,6 +65,6 @@ SeaTunnel 拥有大量用户。 您可以在[用户](https://seatunnel.apache.or SeaTunnel 丰富了CNCF 云原生景观

-## Learn more
+## 了解更多
 
-您可以参阅[快速入门](/docs/category/start-v2/locally/deployment) 了解后续步骤。
+您可以参阅[快速入门](/docs/category/start-v2/locally/deployment) 了解后续相关步骤。
 
diff --git a/docs/zh/faq.md b/docs/zh/faq.md
index 5fdb06c2800..3be6ce38e56 100644
--- a/docs/zh/faq.md
+++ b/docs/zh/faq.md
@@ -93,7 +93,7 @@ sink {
 
 ## 有 HBase 插件吗?
 
-有一个 hbase 输入插件。 您可以从这里下载:https://github.com/garyelephant/waterdrop-input-hbase
+有一个 HBase 输入插件。 您可以从这里下载:https://github.com/garyelephant/waterdrop-input-hbase
 
 ## 如何使用SeaTunnel将数据写入Hive?
 
@@ -136,7 +136,7 @@ sink {
 }
 ```
 
-3. Configure multiple instances in the configuration:
+3. 在配置文件中配置多个ClickHouse实例:
 
 ```
 {
@@ -149,7 +149,7 @@ sink {
 }
 }
 ```
-4. Use cluster mode:
+4. 使用集群模式:
 
 ```
 {
@@ -185,7 +185,7 @@ sink {
 
 3、一般来说,M和N都确定了,从2可以得出结论:`spark.streaming.kafka.maxRatePerPartition` 的大小与 `spark.executor.cores` * `spark.executor.instances` 的大小正相关,可以在增加资源的同时增加 `maxRatePerPartition`,以加快消费。
 
-![kafka](../images/kafka.png)
+![Kafka](../images/kafka.png)
 
 ## 如何解决错误 `Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE`?
 
@@ -204,11 +204,11 @@ spark {
 }
 ```
 
-## 如何为 Yarn 上的 SeaTunnel 指定不同的 JDK 版本?
+## 如何为 YARN 上的 SeaTunnel 指定不同的 JDK 版本?
 
 例如要设置JDK版本为JDK8,有两种情况:
 
-- Yarn集群已部署JDK8,但默认JDK不是JDK8。 在 SeaTunnel 配置文件中添加两个配置:
+- YARN集群已部署JDK8,但默认JDK不是JDK8。 在 SeaTunnel 配置文件中添加两个配置:
 
 ```
 env {
 ...
 java.home = "/your/java_8_home/directory"
 java.opts = "-XX:+IgnoreUnrecognizedVMOptions"
 ...
 }
 ```
-- Yarn集群未部署JDK8。 此时,启动附带JDK8的SeaTunnel。 详细操作参见:
+- YARN集群未部署JDK8。 此时,启动附带JDK8的SeaTunnel。 详细操作参见:
   https://www.cnblogs.com/jasondan/p/spark-specific-jdk-version.html
 
 ## Spark local[*]模式运行SeaTunnel时总是出现OOM怎么办?
 
-如果以本地模式运行,则需要修改`start-seatunnel.sh`启动脚本。 在 `spark-submit` 之后添加参数 `--driver-memory 4g` 。 一般情况下,生产环境中不使用本地模式。 因此,On Yarn时一般不需要设置该参数。 有关详细信息,请参阅:[应用程序属性](https://spark.apache.org/docs/latest/configuration.html#application-properties)。
+如果以本地模式运行,则需要修改`start-seatunnel.sh`启动脚本。 在 `spark-submit` 之后添加参数 `--driver-memory 4g`。 一般情况下,生产环境中不使用本地模式。 因此,在 YARN 上运行时一般不需要设置该参数。 有关详细信息,请参阅:[应用程序属性](https://spark.apache.org/docs/latest/configuration.html#application-properties)。
 
 ## 我可以在哪里放置自己编写的插件或第三方 jdbc.jar 以供 SeaTunnel 加载?
 
@@ -237,7 +237,7 @@ cp third-part.jar plugins/my_plugins/lib
 
 `my_plugins` 可以是任何字符串。
 
-## 如何在 SeaTunnel-v1(Spark) 中配置日志记录相关参数?
+## 如何在 SeaTunnel-V1(Spark) 中配置日志记录相关参数?
 
 可以通过三种方式配置日志相关参数(例如日志级别):
 
@@ -284,7 +284,7 @@ log4j.appender.console.layout=org.apache.log4j.PatternLayout
 log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
 ```
 
-## 如何在 SeaTunnel-v2(Spark、Flink) 中配置日志记录相关参数?
+## 如何在 SeaTunnel-V2(Spark、Flink) 中配置日志记录相关参数?
 
 目前,无法直接设置它们。 您需要修改SeaTunnel启动脚本。 相关参数在任务提交命令中指定。 具体参数请参考官方文档: