[Core]Support async lookup in hash store #4423

neuyilan · 2024-10-31T12:41:50Z

Currently, lookup join does not really support asynchronous access. The purpose of this PR is to support asynchronous lookup join.

This PR only supports async lookup for the hash store. I will add async lookup for the RocksDB store in the next PR.

neuyilan · 2024-11-01T02:13:16Z

@JingsongLi PTAL, thanks

JingsongLi

Hi @neuyilan, I took a rough look and found many thread-safety risks in certain areas. Can we go back to the requirement? Is multi-threaded access really necessary? Is it effective? Why not increase Flink's parallelism?

JingsongLi · 2024-11-04T05:11:14Z

...-flink-common/src/main/java/org/apache/paimon/flink/lookup/PrimaryKeyPartialLookupTable.java

+    }
+
+    private synchronized Triple<BinaryRow, Integer, InternalRow> extractPartitionAndBucket(
+            InternalRow key) {
        InternalRow adjustedKey = key;
        if (keyRearrange != null) {
            adjustedKey = keyRearrange.replaceRow(adjustedKey);


It is reused

I have encapsulated this logic in a separate method called extractPartitionAndBucket and declared it synchronized, so there will be no thread-safety issues.
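For readers following the review, here is a minimal sketch of the pattern under discussion: a stateful helper that is reused across calls, guarded by a synchronized method so that "set the record, then read the result" is one critical section. The `Extractor` class, `extractBucket`, and `countMismatches` are hypothetical stand-ins written for illustration; they are not the code in this PR. The sketch returns a primitive, so nothing shared escapes the lock. If a returned value were itself a reused mutable buffer, it would presumably need to be copied before leaving the synchronized method.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ReusedExtractorSketch {

    /** Hypothetical stateful helper: callers must set a record before reading the bucket. */
    static final class Extractor {
        private int record;

        void setRecord(int record) {
            this.record = record;
        }

        int bucket(int numBuckets) {
            return Math.floorMod(record, numBuckets);
        }
    }

    private static final int NUM_BUCKETS = 16;

    // One instance shared (reused) by every caller, as in the reviewed code path.
    private final Extractor extractor = new Extractor();

    /** Set-then-read happens under one lock, so threads cannot interleave between the two steps. */
    synchronized int extractBucket(int key) {
        extractor.setRecord(key);
        return extractor.bucket(NUM_BUCKETS);
    }

    /** Runs concurrent extractions and counts results that do not match the expected bucket. */
    static int countMismatches(int threads, int keysPerThread) {
        ReusedExtractorSketch table = new ReusedExtractorSketch();
        AtomicInteger mismatches = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int base = t * keysPerThread;
            pool.submit(() -> {
                for (int k = base; k < base + keysPerThread; k++) {
                    if (table.extractBucket(k) != Math.floorMod(k, NUM_BUCKETS)) {
                        mismatches.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return mismatches.get();
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(4, 10_000));
    }
}
```

Removing `synchronized` from `extractBucket` lets two threads interleave between `setRecord` and `bucket`, which is the kind of risk the "It is reused" comments point at. Note that the lock also serializes this step, so it limits how much of the lookup can actually run in parallel.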

JingsongLi · 2024-11-04T05:11:18Z

...-flink-common/src/main/java/org/apache/paimon/flink/lookup/PrimaryKeyPartialLookupTable.java

+    }
+
+    private synchronized Triple<BinaryRow, Integer, InternalRow> extractPartitionAndBucket(
+            InternalRow key) {
        InternalRow adjustedKey = key;
        if (keyRearrange != null) {
            adjustedKey = keyRearrange.replaceRow(adjustedKey);
        }
        extractor.setRecord(adjustedKey);
        int bucket = extractor.bucket();
        BinaryRow partition = extractor.partition();


It is reused

the same as above

JingsongLi · 2024-11-04T05:11:23Z

...-flink-common/src/main/java/org/apache/paimon/flink/lookup/PrimaryKeyPartialLookupTable.java

        InternalRow adjustedKey = key;
        if (keyRearrange != null) {
            adjustedKey = keyRearrange.replaceRow(adjustedKey);
        }
        extractor.setRecord(adjustedKey);
        int bucket = extractor.bucket();
        BinaryRow partition = extractor.partition();
-
        InternalRow trimmedKey = key;
        if (trimmedKeyRearrange != null) {
            trimmedKey = trimmedKeyRearrange.replaceRow(trimmedKey);


It is reused

the same as above

neuyilan · 2024-11-04T07:31:05Z

> Hi @neuyilan, I took a rough look and found many thread-safety risks in certain areas. Can we go back to the requirement? Is multi-threaded access really necessary? Is it effective? Why not increase Flink's parallelism?

Yes, the current design has many thread-safety issues, so access is essentially synchronous. Even if 'lookup.async' is set to true in the Flink connector, it has no acceleration effect.

I think supporting asynchronous multi-threaded access is necessary, and in our testing it was effective. Compared to increasing parallelism in Flink, asynchronous multi-threaded access has the following benefits:

- Reduced memory usage. Each Flink subtask occupies additional memory, so the more subtasks there are, the more memory is used. Take one 4G TM: in our scenario, enabling asynchronous multi-threaded access improves performance by approximately 7 times. Reaching the same throughput by raising parallelism would take 7 TMs, which occupy 4G × 7 = 28G of memory. This matters even more with elastic resources. My job has a hard limit on memory (at most 4G, managed by Yarn), while CPU can be exceeded (cgroup soft limit). When memory is the bottleneck, asynchronous multi-threaded access speeds up the job without adding resources, so efficiency improves at no additional cost.
- Reduced cache disk usage. Cached data is currently exclusive to each task, so serving multiple threads from one task's cache reduces cache disk usage.
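The two points above can be illustrated with a small sketch, assuming a single task that owns one cache and serves lookups from a fixed thread pool instead of being replicated across more subtasks. Everything here (`AsyncLookupSketch`, `asyncLookup`, `put`, `demo`, the in-memory map standing in for the local store) is hypothetical and is not the implementation in this PR; it only shows the shape of "several threads, one shared cache".

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncLookupSketch implements AutoCloseable {

    // Stand-in for the per-task local store: one copy, shared by all lookup threads.
    private final Map<Integer, String> sharedCache = new ConcurrentHashMap<>();

    // Lookup threads live inside the task, so no extra TaskManagers are needed.
    private final ExecutorService lookupPool;

    AsyncLookupSketch(int threads) {
        this.lookupPool = Executors.newFixedThreadPool(threads);
    }

    void put(int key, String value) {
        sharedCache.put(key, value);
    }

    /** Returns immediately; the lookup itself runs on one of the pool threads. */
    CompletableFuture<List<String>> asyncLookup(int key) {
        return CompletableFuture.supplyAsync(
                () -> {
                    String value = sharedCache.get(key);
                    return value == null ? List.<String>of() : List.of(value);
                },
                lookupPool);
    }

    @Override
    public void close() {
        lookupPool.shutdown();
    }

    /** Small driver: loads one row, then looks up the given key asynchronously. */
    static List<String> demo(int key) {
        try (AsyncLookupSketch table = new AsyncLookupSketch(4)) {
            table.put(1, "a");
            return table.asyncLookup(key).join();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(1));
    }
}
```

The sketch relies on `ConcurrentHashMap` for safe concurrent reads. A real store with reused row objects or non-thread-safe readers would need its own protection, which is the concern raised in the review.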

neuyilan · 2024-11-06T07:47:10Z

Hi @JingsongLi, could you please review it again?

support async lookup

a258761

neuyilan changed the title ~~[Core]support async lookup in hash store~~ [Core]Support async lookup in hash store Oct 31, 2024

neuyilan marked this pull request as draft October 31, 2024 12:44

add it tests

36b3dde

neuyilan marked this pull request as ready for review October 31, 2024 13:06

neuyilan added 2 commits October 31, 2024 21:07

remove useless codes

1e777e2

fix the tests

776538e

neuyilan closed this Oct 31, 2024

neuyilan reopened this Oct 31, 2024

refine the code

d08058d

neuyilan mentioned this pull request Nov 1, 2024

[Core]Support async lookup #4341

Closed

JingsongLi closed this Nov 1, 2024

JingsongLi reopened this Nov 1, 2024

JingsongLi requested changes Nov 4, 2024

merge master

e14a9e8

neuyilan requested a review from JingsongLi November 6, 2024 08:03
