[HUDI-8799] Design of RFC-84, Optimized SerDe of DataStream in Flink operators #12697
base: master
### Potential problems

1. Key generators are tightly coupled with Avro `GenericRecord`.
Therefore, to support all key generators, we will have to do an intermediate conversion into Avro in the operator that is responsible for getting the Hudi key.
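
The intermediate conversion described in this item can be pictured with a minimal, language-neutral sketch in Python. Everything in it is an assumption made for illustration: `AvroLikeRecord` stands in for Avro `GenericRecord`, a positional tuple stands in for the engine-native row, and `simple_key_generator`, `row_to_avro_like`, and `record_key_for_row` are hypothetical names, not Hudi or Flink APIs. The key format is also invented for the example.

```python
from typing import Any, Dict, Sequence

# Hypothetical stand-in for an Avro GenericRecord: field name -> value.
AvroLikeRecord = Dict[str, Any]


def simple_key_generator(record: AvroLikeRecord, key_fields: Sequence[str]) -> str:
    """Hypothetical key generator that only understands the Avro-like record."""
    return ",".join(f"{name}:{record[name]}" for name in key_fields)


def row_to_avro_like(row: Sequence[Any], field_names: Sequence[str]) -> AvroLikeRecord:
    """Intermediate conversion: positional engine-native row -> Avro-like record."""
    if len(row) != len(field_names):
        raise ValueError("row arity does not match the schema")
    return dict(zip(field_names, row))


def record_key_for_row(
    row: Sequence[Any], field_names: Sequence[str], key_fields: Sequence[str]
) -> str:
    # The conversion happens only inside the step that needs the key;
    # the row itself is left untouched in its native form.
    return simple_key_generator(row_to_avro_like(row, field_names), key_fields)


if __name__ == "__main__":
    print(record_key_for_row((1, "a", 3.5), ["id", "name", "price"], ["id", "name"]))
```

The point of the sketch is only that the extra conversion is confined to the single step that computes the key, which is why its cost can be measured in isolation.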
Is there already a POC? Can we do a micro-benchmark to prove the gains?
Yes, but I will work on it a little more and will provide benchmark results with profiling this week. I will also provide a corresponding PR so the code changes can be checked.
@danny0405, the PR is ready: #12722.
In the description of that PR, I've listed the costs of this extra conversion.
These costs are acceptable for now: 5,798 CPU samples out of 183,236 in total, which is about 3%.
Total performance improvement:
- total write time decreased from 344 s to 265 s, which is about 23%,
- data passed between Flink operators decreased from 19.4 GB to 12.9 GB, which is about 33.5%.
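
The percentages quoted in this comment can be re-derived from the raw numbers it gives. A minimal check in Python follows; the helper name `relative_decrease` is made up for this sketch, and only the figures come from the comment.

```python
def relative_decrease(before: float, after: float) -> float:
    """Return the decrease from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100.0


# Share of CPU samples spent in the extra Avro conversion.
conversion_share = 5798 / 183236 * 100.0

# Total write time: 344 s -> 265 s.
write_time_gain = relative_decrease(344, 265)

# Data passed between Flink operators: 19.4 GB -> 12.9 GB.
traffic_gain = relative_decrease(19.4, 12.9)

if __name__ == "__main__":
    print(f"conversion share: {conversion_share:.1f}%")
    print(f"write time decrease: {write_time_gain:.1f}%")
    print(f"operator traffic decrease: {traffic_gain:.1f}%")
```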
@cshuo has recently been working on a new RFC that adds basic abstractions for schema, data types, and expressions to Hudi, so that we can integrate with the engine-specific "row" for both the writer and the reader. The design doc will be coming out soon, and I will cc you if you are interested. It's a huge task, and maybe you can help with it.
e375fd5 to cae2226
@hudi-bot run azure
cae2226 to af1ce0f
845b519 to 5e0f4ee
Change Logs
Proposes an optimization to reduce SerDe costs for Flink operators.
A proof of concept is presented in the corresponding claim of the RFC.
For streaming writes into Hudi with the simple bucket index, total write time decreased by 15%, which is significant for stream processing.
Impact
None for this stage.
Risk level (write none, low medium or high below)
None for this stage.
Documentation Update
No need for this stage.
Contributor's checklist