[Enhancement] Add a parameter that controls the number of StreamLoad tasks committed per partition #92

Open · 3 tasks done
baishaoisde opened this issue Apr 6, 2023 · 1 comment · May be fixed by #99
baishaoisde commented Apr 6, 2023

Search before asking

  • I had searched in the issues and found no similar issues.

Description

When the amount of data in a partition exceeds INSERT_BARCH_SIZE, each task submits multiple StreamLoad jobs. If the task fails and is retried, all of the partition's data is resubmitted through StreamLoad, including the data that was already written successfully, which causes data duplication.

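For illustration, here is a minimal sketch of the failure mode described above, assuming a simplified per-partition writer (the names writePartition and flakyLoad are hypothetical, not the connector's actual API): every group of batchSize rows becomes its own StreamLoad request, so a whole-task retry re-sends batches that had already succeeded.

```scala
object DuplicationSketch {
  // Hypothetical per-partition writer: flushes one StreamLoad request per
  // `batchSize` rows, mirroring the behaviour described in the issue.
  def writePartition(rows: Seq[String], batchSize: Int)(streamLoad: Seq[String] => Unit): Unit =
    rows.grouped(batchSize).foreach(streamLoad)

  def main(args: Array[String]): Unit = {
    val partition = (1 to 10).map(i => s"row-$i")
    var requests  = 0

    // Fake StreamLoad endpoint that fails on the third request.
    def flakyLoad(batch: Seq[String]): Unit = {
      requests += 1
      if (requests == 3) throw new RuntimeException("StreamLoad failed")
      println(s"loaded ${batch.mkString(", ")}")
    }

    try writePartition(partition, batchSize = 3)(flakyLoad)
    catch {
      case _: RuntimeException =>
        // A task retry replays the whole partition, so the first two batches
        // (already loaded successfully) are sent to Doris a second time.
        writePartition(partition, batchSize = 3)(flakyLoad)
    }
  }
}
```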

Solution

My suggestion is to add a parameter that, when enabled, forces each partition to submit only one StreamLoad, guaranteeing that data is not committed more than once.

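A minimal sketch of the proposed switch, again with hypothetical names (the option name in the comment is illustrative only, not an existing connector parameter): when the switch is on, the whole partition is buffered and submitted as exactly one StreamLoad, so a task retry can only ever replay that single request.

```scala
object SinglePartitionLoadSketch {
  // Hypothetical sketch of the proposed option, e.g. something like
  // "doris.sink.single.streamload.per.partition" (name is illustrative only).
  def writePartition(rows: Seq[String],
                     batchSize: Int,
                     singleLoadPerPartition: Boolean)(streamLoad: Seq[String] => Unit): Unit =
    if (singleLoadPerPartition)
      streamLoad(rows)                             // one request for the whole partition
    else
      rows.grouped(batchSize).foreach(streamLoad)  // current behaviour: one request per batch
}
```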

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

JNSimba (Member) commented Jul 11, 2024

Hello @baishaoisde, can turning on 2PC (two-phase commit) solve this problem?
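If 2PC does address the retry duplication, a hedged sketch of what enabling it from Spark might look like is below; the option key "doris.sink.enable-2pc", the connection settings, and the table name are assumptions that should be checked against the connector version in use.

```scala
import org.apache.spark.sql.SparkSession

object TwoPcWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("doris-2pc-sketch").getOrCreate()
    val df    = spark.range(10).selectExpr("id", "cast(id as string) as name")

    // Requires the Doris Spark connector on the classpath.
    df.write
      .format("doris")
      .option("doris.fenodes", "fe_host:8030")      // placeholder FE address
      .option("doris.table.identifier", "db.tbl")   // placeholder target table
      .option("user", "root")
      .option("password", "")
      .option("doris.sink.enable-2pc", "true")      // assumed option key for two-phase commit
      .save()

    spark.stop()
  }
}
```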
