[HUDI-8799] Design of RFC-84, Optimized SerDe of DataStream in Flink operators #12697

Open
wants to merge 3 commits into master from master-rfc-flink-serde
Conversation

geserdugarov (Contributor)

Change Logs

This PR proposes an optimization to reduce SerDe costs between Flink operators.

A proof of concept is presented in the corresponding section of the RFC.
For streaming writes into Hudi with the simple bucket index, total write time decreased by 15%, which is significant for stream processing.

Impact

None at this stage.

Risk level (write none, low, medium, or high below)

None at this stage.

Documentation Update

Not needed at this stage.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

github-actions bot added the size:M label (PR with lines of changes in (100, 300]) on Jan 23, 2025
### Potential problems

1. Key generators are tightly coupled with Avro `GenericRecord`.
Therefore, to support all key generators, we will have to perform an intermediate conversion to Avro in the operator responsible for computing the Hudi key; see the sketch below.
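
As an illustration of this intermediate conversion, here is a minimal sketch. It assumes Flink's `RowDataToAvroConverters` (from `flink-avro`) for the `RowData`-to-Avro step and Hudi's Avro-based `KeyGenerator` API; the helper class and its structure are illustrative, not the actual RFC code.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.formats.avro.RowDataToAvroConverters;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.RowType;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.keygen.KeyGenerator;

/**
 * Illustrative helper: computes the HoodieKey for an incoming RowData by first
 * converting it to an Avro GenericRecord, because Hudi key generators only
 * accept Avro records. This conversion is the extra CPU cost discussed below.
 */
public class RowDataKeyExtractor {

  private final Schema avroSchema;
  private final RowDataToAvroConverters.RowDataToAvroConverter converter;
  private final KeyGenerator keyGenerator;

  public RowDataKeyExtractor(RowType rowType, Schema avroSchema, KeyGenerator keyGenerator) {
    this.avroSchema = avroSchema;
    // Runtime converter from Flink's internal RowData to Avro for the given row type.
    this.converter = RowDataToAvroConverters.createConverter(rowType);
    this.keyGenerator = keyGenerator;
  }

  /** Converts RowData to Avro and delegates key extraction to the configured key generator. */
  public HoodieKey getKey(RowData row) {
    GenericRecord avroRecord = (GenericRecord) converter.convert(avroSchema, row);
    return keyGenerator.getKey(avroRecord);
  }
}
```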
Contributor:

Is there already a POC now? Can we do a micro-benchmark to prove the gains?

geserdugarov (Contributor Author):

Yes, but I will work on it a little more and will provide benchmark results with profiling this week. I will also open a corresponding PR so the code changes can be reviewed.

geserdugarov (Contributor Author), Jan 28, 2025:

@danny0405 , the PR is ready: #12722.
In the description of that PR, I've mentioned the cost of this extra conversion:

These costs can be accepted for now, given the acceptable values: 5 798 CPU samples out of 183 236 in total, which is about 3%.

Total performance improvement:

  • total write time decreased from 344 s to 265 s, which is about 23%,
  • data passed between Flink operators decreased from 19.4 GB to 12.9 GB, which is about 33.5%.

Contributor:

@cshuo is currently working on a new RFC to add basic abstractions of schema/data type/expressions to Hudi, so that we can integrate with the engine-specific "row" for both the writer and the reader. The design doc will be coming out soon; I will cc you if you are interested. It's a huge task, and maybe you can help with it.

geserdugarov (Contributor Author), Feb 5, 2025:

@cshuo , hi! If it doesn't bother you, could you please check this RFC for design conflicts with your in-progress RFC? If there are no conflicts, I propose to move these changes forward.
Otherwise, could you please share some drafts so we can start working in collaboration.

Contributor:

@geserdugarov Sorry for the late reply, I just came back from the Lunar New Year holiday :) I'll take a look at your RFC and benchmark code soon.

Contributor:

@geserdugarov here is the RFC PR: #12795. You are welcome to review it and get involved in the collaboration.

geserdugarov (Contributor Author):

@hudi-bot run azure

geserdugarov force-pushed the master-rfc-flink-serde branch from cae2226 to af1ce0f on January 29, 2025, 03:01
github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) and removed the size:M label (PR with lines of changes in (100, 300]) on Feb 6, 2025
geserdugarov (Contributor Author) commented Feb 6, 2025:

@danny0405 , @cshuo , I've updated the description here with the added structure of HoodieFlinkRecordTypeInfo and HoodieFlinkRecordSerializer.

I've already implemented the optimization for the simple bucket index and the non-bucket case. Only the consistent hashing case is left. I've opened the corresponding PR: #12796.

The assumption that I could do everything without a custom serializer was wrong. I faced SerDe issues during the conversion of DataStream into KeyedStream. With HoodieFlinkRecordTypeInfo and HoodieFlinkRecordSerializer implemented, everything works correctly, and for the non-bucket case I got a 31% performance improvement. A sketch of such a type information is shown below.
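
For reference, here is a minimal sketch of what a dedicated type information can look like, so that the keyed exchange does not fall back to Kryo but uses the custom serializer. The names `HoodieFlinkRecordTypeInfo` and `HoodieFlinkRecordSerializer` follow the RFC; the record class `HoodieFlinkRecord` and all implementation details here are illustrative assumptions, not the actual PR code.

```java
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.table.types.logical.RowType;

/**
 * Illustrative TypeInformation that makes Flink use a dedicated binary serializer
 * for HoodieFlinkRecord instead of the generic Kryo fallback, which is where the
 * SerDe savings between operators come from.
 */
public class HoodieFlinkRecordTypeInfo extends TypeInformation<HoodieFlinkRecord> {

  private final RowType rowType;  // row type of the record payload

  public HoodieFlinkRecordTypeInfo(RowType rowType) {
    this.rowType = rowType;
  }

  @Override public boolean isBasicType() { return false; }
  @Override public boolean isTupleType() { return false; }
  @Override public int getArity() { return 1; }
  @Override public int getTotalFields() { return 1; }
  @Override public Class<HoodieFlinkRecord> getTypeClass() { return HoodieFlinkRecord.class; }
  @Override public boolean isKeyType() { return false; }

  @Override
  public TypeSerializer<HoodieFlinkRecord> createSerializer(ExecutionConfig config) {
    // Dedicated serializer from the RFC; avoids the Kryo fallback during keyBy/shuffle.
    return new HoodieFlinkRecordSerializer(rowType);
  }

  @Override public boolean canEqual(Object obj) { return obj instanceof HoodieFlinkRecordTypeInfo; }

  @Override public boolean equals(Object obj) {
    return obj instanceof HoodieFlinkRecordTypeInfo
        && ((HoodieFlinkRecordTypeInfo) obj).rowType.equals(rowType);
  }

  @Override public int hashCode() { return rowType.hashCode(); }

  @Override public String toString() { return "HoodieFlinkRecordTypeInfo(" + rowType + ")"; }
}
```

With the type information attached to the mapped stream, for example via `returns(new HoodieFlinkRecordTypeInfo(rowType))` before `keyBy`, the DataStream-to-KeyedStream conversion would serialize records through `HoodieFlinkRecordSerializer` rather than Kryo.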

hudi-bot commented Feb 6, 2025:

CI report:

Bot commands: @hudi-bot supports the following commands:
  • `@hudi-bot run azure`: re-run the last Azure build

Labels: size:L PR with lines of changes in (300, 1000]
4 participants