Are there any plans to support serialization of pyflink's Java and python inter-process communication? #1919
Fury has two formats for your scenarios:
If your message body is 60 KB, do you need to read all of its fields? If not, I would suggest using the binary row format.
All fields need to be read.
In that case, the xlang object graph format will be faster.
@chaokunyang Due to work commitments, I may only be able to start on this next month. Since I am not familiar with the xlang object graph, there may be some obstacles in implementing Fury's xlang object graph in pyflink. Would you mind taking the time to give me any help on Fury performance and type mapping?
Yeah, I'd love to help. Feel free to reach out to me.
Feature Request
We expect to improve pyflink's performance through Fury serialization.
In the Python benchmark test, the time spent per serialization is as follows:
This is the code location where most of the time is spent when we use pyflink; it is mainly consumed in pickle encoding and decoding.
At present, our company has a backlog of 13 million historical records, and each message is between 60 KB and 75 KB. After discussions with the pyflink community, it was recommended to use pemja for cross-language calls instead of beam.
With this approach, Python's native pickle serialization is very slow.
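To make the cost concrete, here is a stdlib-only sketch that measures pickle round-trip time on a hypothetical message of roughly the size described above (~60 KB); the field names and payload are made up for illustration:

```python
import pickle
import random
import string
import timeit

# Hypothetical message, sized like the 60-75 KB records from the issue.
message = {
    "id": 12345,
    "payload": "".join(random.choices(string.ascii_letters, k=60 * 1024)),
    "tags": ["a", "b", "c"],
}

encoded = pickle.dumps(message, protocol=pickle.HIGHEST_PROTOCOL)
print(f"encoded size: {len(encoded)} bytes")

# Time full round trips; this per-message cost is paid on every
# Java <-> Python hop, which is what dominates pyflink throughput here.
n = 1000
seconds = timeit.timeit(
    lambda: pickle.loads(pickle.dumps(message, protocol=pickle.HIGHEST_PROTOCOL)),
    number=n,
)
print(f"avg round trip: {seconds / n * 1e6:.1f} microseconds")
```

Multiplying the per-message time by 13 million records gives a rough sense of the end-to-end serialization budget.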
Test method:
Test code where each operator carries the output_type parameter:
Test code where each operator omits the output_type parameter:
Here output_type defines the type of each field of the transmitted data; it is declared as shown in the figure below:
a) Test with 3 operators
b) Test with 5 operators
c) Test with 10 operators
Judging from the test results, explicitly specifying the output_type parameter in the PyFlink DataStream API can significantly improve serialization performance; the improvement is more pronounced when the data volume is large and there are many operators. Using output_type reduces serialization and deserialization overhead and avoids type-inference work, thus improving performance.
Now we hope that Fury can improve this situation further. Does @chaokunyang have any good suggestions?
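The reason a known schema helps can be shown with a stdlib-only analogy: when field types are fixed up front, encoding is a fixed-layout copy with no per-record type tags or inference. The schema below is hypothetical, chosen only to illustrate the contrast with pickle:

```python
import pickle
import struct

# Hypothetical fixed record schema: int64 id, float64 value, 16-byte name.
SCHEMA = struct.Struct("<qd16s")

def encode_typed(record_id: int, value: float, name: bytes) -> bytes:
    # Fixed layout: no type tags are written, because both sides
    # already agree on the schema (the role output_type plays).
    return SCHEMA.pack(record_id, value, name.ljust(16, b"\x00"))

def decode_typed(buf: bytes):
    record_id, value, name = SCHEMA.unpack(buf)
    return record_id, value, name.rstrip(b"\x00")

typed = encode_typed(7, 3.14, b"sensor-01")
untyped = pickle.dumps((7, 3.14, b"sensor-01"))
# Compare encoded sizes; the fixed-layout form is typically smaller
# for fixed schemas because it carries no self-describing metadata.
print(len(typed), len(untyped))
```

Fury's binary row format applies the same idea, but with a richer schema language and zero-copy field access.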
Is your feature request related to a problem? Please describe
No response
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response