
Are there any plans to support serialization for pyflink's Java and Python inter-process communication? #1919

Open
kaori-seasons opened this issue Oct 31, 2024 · 5 comments

Comments


kaori-seasons commented Oct 31, 2024

Feature Request

We would like to improve the performance of PyFlink by using Fury serialization.

In the Python benchmark, the average serialization time of each serializer is as follows:

[benchmark chart: avg_serde_time_1m_objects]

The screenshots below show where most of the time goes when we use PyFlink: it is mainly spent in pickle encoding and decoding.

[profiling screenshots: error1, error2]

At present our company holds about 13 million historical records, and each message is between 60 KB and 75 KB. After discussing with the PyFlink community, the recommendation was to use pemja for cross-language calls instead of going through Beam.

With this approach, Python's native pickle serialization is very slow.
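To make the gap concrete, here is a minimal micro-benchmark sketch. The record layout is a hypothetical stand-in for our 60-75 KB point messages, and the pyfury usage follows its README (serialize/deserialize on a Fury instance):

```python
import pickle
import time

import pyfury  # assumed available; API per the Fury Python README

# Hypothetical stand-in for one of our 60-75 KB point messages.
message = {"point_id": "p-001", "values": [float(i) for i in range(8000)]}

fury = pyfury.Fury(ref_tracking=True)

def bench(name, dumps, loads, n=1000):
    start = time.perf_counter()
    for _ in range(n):
        loads(dumps(message))
    print(f"{name}: {time.perf_counter() - start:.3f}s for {n} round trips")

bench("pickle", pickle.dumps, pickle.loads)
bench("pyfury", fury.serialize, fury.deserialize)
```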

Test methodology:

  • Run tests with different data volumes, measured in days (one record per point per second, i.e. 86,400 records per day)
  • Measure performance with 3 operators, 5 operators, and 10 operators respectively
    • In each case, compare performance when every operator carries the output_type parameter against when none of them do

[screenshot: test1]

  • Test code where every operator has the output_type parameter:
    [screenshot: test2]

  • Test code where no operator has the output_type parameter:

[screenshot: error3]

output_type defines the type of each field of the transferred data; it is declared as shown in the screenshot below:
[screenshot: test3]
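For reference, a minimal sketch of how output_type is declared on a DataStream operator; the field names and types here are hypothetical stand-ins for the real point schema:

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Hypothetical point record: (point_id, timestamp, value).
record_type = Types.TUPLE([Types.STRING(), Types.LONG(), Types.DOUBLE()])

ds = env.from_collection([("p-001", 1730000000, 1.0)], type_info=record_type)

# With output_type, the operator output is handled by a typed serializer;
# without it, PyFlink falls back to pickling generic Python objects.
typed = ds.map(lambda t: (t[0], t[1], t[2] * 2), output_type=record_type)

typed.print()
env.execute("output_type demo")
```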

a) Test with 3 operators

| Duration | With output_type (s) | Without output_type (s) | Improvement |
| --- | --- | --- | --- |
| 1 day | 9.456 | 9.973 | 5.18% |
| 3 days | 14.532 | 18.187 | 20.10% |
| 5 days | 28.911 | 38.786 | 25.46% |
| 7 days | 34.397 | 51.691 | 33.46% |

b) Test with 5 operators

| Duration | With output_type (s) | Without output_type (s) | Improvement |
| --- | --- | --- | --- |
| 1 day | 9.971 | 10.401 | 4.13% |
| 3 days | 20.308 | 25.744 | 21.12% |
| 5 days | 30.166 | 40.305 | 25.16% |
| 7 days | 40.340 | 54.405 | 25.85% |

c) Test with 10 operators

| Duration | With output_type (s) | Without output_type (s) | Improvement |
| --- | --- | --- | --- |
| 1 day | 11.468 | 12.130 | 5.45% |
| 3 days | 23.697 | 31.121 | 23.85% |
| 5 days | 38.015 | 49.508 | 23.21% |
| 7 days | 48.859 | 65.140 | 24.99% |

Judging from the test results, explicitly specifying the output_type parameter in the PyFlink DataStream API significantly improves serialization performance, and the improvement is more pronounced when the data volume is large and there are many operators. Using output_type reduces serialization/deserialization overhead and avoids type-inference work, thereby improving performance.

We hope Fury can improve this situation further. @chaokunyang, do you have any suggestions?


chaokunyang (Collaborator) commented Oct 31, 2024

Fury has two formats that fit your scenario:

  • Xlang object graph format: this format is faster if you have shared values, and it is the only option if your data contains circular references.
  • Binary row format: this is a zero-copy format; you can read individual fields without parsing the other fields.

If your message body is 60 KB, do you need to read all of those fields? If not, I would suggest using the binary row format.
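For reference, a sketch of the binary row format on the Python side, following the pyfury README; the dataclass is a hypothetical stand-in for the real message layout:

```python
from dataclasses import dataclass
from typing import List

import pyarrow as pa
import pyfury

# Hypothetical stand-in for one 60-75 KB message.
@dataclass
class PointBatch:
    point_id: str
    values: List[pa.float64]

encoder = pyfury.encoder(PointBatch)
binary = encoder.to_row(PointBatch("p-001", [float(i) for i in range(8000)])).to_bytes()

# Zero-copy read: access a single field without parsing the rest of the row.
row = pyfury.RowData(encoder.schema, binary)
print(row.values[100])
```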

kaori-seasons (Author) commented:

> If your message body is 60 KB, do you need to read all of those fields?

All fields need to be read

chaokunyang (Collaborator) commented:

In such cases, the xlang object graph format will be faster.
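A minimal sketch of the xlang object graph format on the Python side (pyfury API assumed per its README; the Java side would create its Fury instance with Language.XLANG so both runtimes share the wire format):

```python
from pyfury import Fury, Language

# Cross-language object graph serialization with reference tracking, so shared
# or circular references in the payload are preserved.
fury = Fury(language=Language.XLANG, ref_tracking=True)

# Hypothetical point record.
record = {"point_id": "p-001", "ts": 1730000000, "values": [1.0, 2.0, 3.0]}
data = fury.serialize(record)
assert fury.deserialize(data) == record
```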

kaori-seasons (Author) commented Nov 3, 2024

@chaokunyang Due to work commitments, I may only be able to start on this next month. Since I am not familiar with the xlang object graph format, there may be some obstacles in integrating it into PyFlink. Would you mind taking some time to help me with Fury performance and type mapping?

chaokunyang (Collaborator) commented:

Yeah, I'd love to help. Feel free to reach out to me.
