[HUDI-8920] Optimized SerDe costs of Flink write, simple bucket and non bucket cases #12796

Open
wants to merge 5 commits into
base: master

@geserdugarov geserdugarov commented Feb 6, 2025

Change Logs

Changes to Flink stream writes into Hudi tables with the simple bucket index, corresponding to #12697.

Benchmark description

The lineitem table from the TPC-H benchmark was used: 60 million rows, of which 20 million are unique.

Performance estimation results

| | current with Kryo | HoodieFlinkRecord | Optimization |
|---|---|---|---|
| **Non bucket** | | | |
| Data passed, GB | 43.9 | 29.3 | 33.6% |
| Total time, s | 570 | 358 | 37.2% |
| **Simple bucket index** | | | |
| Data passed, GB | 19.4 | 13.6 | 29.9% |
| Total time, s | 344 | 237 | 31.1% |
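
The Optimization column appears to be the relative reduction against the Kryo baseline, that is, (before - after) / before. A minimal sketch of that arithmetic under this assumption follows. The class and method names are illustrative and not part of the PR, and since the published figures are rounded, recomputed values may differ slightly from the reported percentages.

```java
// Illustrative helper, not part of the PR: recomputes the "Optimization" column
// under the assumption that it is the relative reduction versus the Kryo baseline.
final class ReductionCalc {
    private ReductionCalc() {}

    /** Relative reduction in percent: (before - after) / before * 100. */
    static double reductionPercent(double before, double after) {
        if (before <= 0) {
            throw new IllegalArgumentException("before must be positive");
        }
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // "Total time, s" values taken from the benchmark results above.
        System.out.printf("Non bucket: %.1f%%%n", reductionPercent(570, 358));
        System.out.printf("Simple bucket index: %.1f%%%n", reductionPercent(344, 237));
    }
}
```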

Flink operators

Current with Kryo:
[Image: 0 operators - 1 reference]

After switch to HoodieFlinkRecord:
[Image: 0 operators - 2 HoodieFlinkRecord]

Impact

Flink write performance improvement.
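
To illustrate why a dedicated serializer can reduce the amount of data passed between operators, here is a rough, self-contained sketch. It is not the PR's `HoodieFlinkRecordSerializer`: `DemoRecord` and its fields are hypothetical, and Java's built-in object serialization is used only as a stand-in for a generic serializer (it is not Kryo). The point is that a serializer which knows the record layout writes field values only, while a generic one typically also has to carry type information.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerDeSizeSketch {
    private SerDeSizeSketch() {}

    /** Hypothetical record; the real HoodieFlinkRecord layout is not shown in this PR page. */
    static final class DemoRecord implements Serializable {
        private static final long serialVersionUID = 1L;
        final String recordKey;
        final String partitionPath;
        final int bucketId;

        DemoRecord(String recordKey, String partitionPath, int bucketId) {
            this.recordKey = recordKey;
            this.partitionPath = partitionPath;
            this.bucketId = bucketId;
        }
    }

    /** Generic path (stand-in): the stream carries class metadata as well as values. */
    static byte[] genericBytes(DemoRecord r) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Dedicated path: the field order is fixed, so only the values are written. */
    static byte[] customBytes(DemoRecord r) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(r.recordKey);
            out.writeUTF(r.partitionPath);
            out.writeInt(r.bucketId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        DemoRecord r = new DemoRecord("1-155190", "1996-03-13", 7);
        System.out.println("generic bytes: " + genericBytes(r).length);
        System.out.println("custom bytes:  " + customBytes(r).length);
    }
}
```

The absolute sizes here say nothing about the benchmark above; the sketch only shows the direction of the effect for a single small record.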

Risk level (write none, low, medium or high below)

Low

Documentation Update

After merge

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Feb 6, 2025
@geserdugarov changed the title from "[HUDI-8946] [HUDI-8921] Optimized SerDe costs of Flink write, simple bucket and non bucket cases" to "[HUDI-8920] Optimized SerDe costs of Flink write, simple bucket and non bucket cases" Feb 6, 2025
@geserdugarov force-pushed the master-serde-non-bucket branch from 33cc3b7 to 68028fc, February 6, 2025 16:04
…RecordTypeInfo` and `HoodieFlinkRecordSerializer` for `HoodieFlinkRecord`
@geserdugarov force-pushed the master-serde-non-bucket branch from 68028fc to 9377d36, February 6, 2025 16:28
.booleanType()
.defaultValue(false)
.withDescription("Optimized Flink write into Hudi table, which uses customized serialization/deserialization. "
+ "Note, that only SIMPLE BUCKET index is supported for now.");

PR's title says "simple bucket and non bucket cases"

@geserdugarov geserdugarov Feb 7, 2025

Missed it. Thanks! Fixed in 5a81536.

geserdugarov commented Feb 7, 2025

I've added integration tests in 6ac8c9f and also manually checked restoring from a Flink checkpoint for the non-bucket case. Restoring from the checkpoint was successful.

geserdugarov commented Feb 10, 2025

This draft PR contains exactly the same commits as the ones here, plus one extra commit that turns write.fast.mode on by default. CI is almost green: only the integration tests in test-flink (flink1.20, 1.11.3) failed.
I will look into the failures and check whether they are related to the hardcoded use of HoodieRecord in tests or to cases that are not supported yet, such as consistent hashing and bounded context.
