[HUDI-8799] Design of RFC-84, Optimized SerDe of DataStream in Flink operators #12697

Open
wants to merge 2 commits into
base: master

geserdugarov
Contributor

Change Logs

Proposed optimization to reduce SerDe costs for Flink operators.

A proof of concept was presented in the corresponding RFC claim.
For streaming writes into Hudi with the simple bucket index, total write time decreased by 15%, which is significant for stream processing.

Impact

None for this stage.

Risk level (write none, low, medium or high below)

None for this stage.

Documentation Update

No need for this stage.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

github-actions bot added the size:M (PR with lines of changes in (100, 300]) label on Jan 23, 2025
### Potential problems

1. Key generators are tightly coupled with Avro `GenericRecord`.
Therefore, to support all key generators, we will have to do an intermediate conversion into Avro in the operator that is responsible for getting the Hudi key.
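
A minimal sketch of the coupling described in this item, under loud assumptions: none of the types below are Hudi, Flink, or Avro APIs. `EngineRow` is a hypothetical stand-in for an engine-native row, `AvroLikeRecord` for Avro `GenericRecord`, and `recordKey` for a key generator that accepts only the Avro-like type. It only illustrates why the key-extracting operator would need an extra per-row conversion.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyGenCouplingSketch {

    /** Hypothetical stand-in for an engine-native row: values addressed by position. */
    static final class EngineRow {
        final Object[] values;
        EngineRow(Object... values) { this.values = values; }
    }

    /** Hypothetical stand-in for Avro GenericRecord: values addressed by field name. */
    static final class AvroLikeRecord {
        final Map<String, Object> fields = new LinkedHashMap<>();
        Object get(String name) { return fields.get(name); }
    }

    /** Hypothetical key generator: it accepts only the Avro-like record type. */
    static String recordKey(AvroLikeRecord record, String keyField) {
        return String.valueOf(record.get(keyField));
    }

    /** The intermediate conversion: copies every field, even those the key does not need. */
    static AvroLikeRecord toAvroLike(EngineRow row, List<String> fieldNames) {
        AvroLikeRecord record = new AvroLikeRecord();
        for (int i = 0; i < fieldNames.size(); i++) {
            record.fields.put(fieldNames.get(i), row.values[i]);
        }
        return record;
    }

    /** What the key-extracting operator would do per row: convert first, then ask the key generator. */
    static String keyFor(EngineRow row, List<String> fieldNames, String keyField) {
        return recordKey(toAvroLike(row, fieldNames), keyField);
    }

    public static void main(String[] args) {
        List<String> fieldNames = List.of("id", "name", "ts");
        EngineRow row = new EngineRow(42L, "alice", 1700000000L);
        System.out.println(keyFor(row, fieldNames, "id"));
    }
}
```

In this toy version the conversion touches all fields of the row just to read one key field, which is the kind of extra cost the item refers to.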
Contributor

Is there already a POC? Can we run a micro-benchmark to prove the gains?

Contributor Author

Yes, but I will work on it a little more and will provide benchmark results with profiling this week. I will also provide a corresponding PR so the code changes can be checked.

Contributor Author

geserdugarov commented Jan 28, 2025

@danny0405, the PR is ready: #12722.
In the description of that PR, I've mentioned the costs of this extra conversion:

These costs can be accepted for now because they are small: 5,798 CPU samples out of 183,236 in total, which is about 3%.

Total performance improvement:

  • total write time decreased from 344 s to 265 s, which is about 23%,
  • data passed between Flink operators decreased from 19.4 GB to 12.9 GB, which is about 33.5%.
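
The percentages quoted in this comment can be re-derived from the raw figures. The following is a small sketch for checking the arithmetic only; the class and method names are illustrative and not part of the PR.

```java
import java.util.Locale;

public class ImprovementCheck {

    /** Relative decrease from `before` to `after`, in percent. */
    static double percentDecrease(double before, double after) {
        return (before - after) / before * 100.0;
    }

    /** Share of `part` in `total`, in percent. */
    static double percentShare(long part, long total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        // Figures as quoted in the comment: seconds, gigabytes, CPU samples.
        System.out.printf(Locale.ROOT, "write time decrease: %.1f%%%n", percentDecrease(344, 265));
        System.out.printf(Locale.ROOT, "data volume decrease: %.1f%%%n", percentDecrease(19.4, 12.9));
        System.out.printf(Locale.ROOT, "conversion CPU share: %.1f%%%n", percentShare(5798, 183236));
    }
}
```

The results agree with the rounded values stated above: roughly 23% for write time, 33.5% for data passed between operators, and a little over 3% of CPU samples for the extra conversion.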

Contributor

@cshuo has recently been working on a new RFC to add basic abstractions of schema, data type, and expressions to Hudi, so that we can integrate with the engine-specific "row" for both the writer and the reader. The design doc is coming out, and I will cc you if you are interested in it. It's a huge task, and maybe you can help with it.

@geserdugarov
Contributor Author

@hudi-bot run azure

geserdugarov force-pushed the master-rfc-flink-serde branch from cae2226 to af1ce0f on January 29, 2025 03:01
geserdugarov force-pushed the master-rfc-flink-serde branch from 845b519 to 5e0f4ee on February 4, 2025 08:54
hudi-bot commented Feb 4, 2025

CI report:

Bot commands: @hudi-bot supports the following commands:
  • `@hudi-bot run azure`: re-run the last Azure build


3 participants