Join multiprocessing take 2. #139

Open
root-11 opened this issue Mar 1, 2024 · 0 comments
root-11 commented Mar 1, 2024

@realratchet: in commit aee872b, line 452, you wrote:
"""
Ratchet:

    I thought real good about it and it is not possible to mapping tasks be
    multi-processed/constant memory, because every slice must know the entire right table dictionary
    and anything that is not the first slice must also have all previous other slice indices.

    Best we can do is reduce the RAM usage via using hash of a string or reduce memory usage via native implementation.
"""

Please correct me if I misunderstand, but as far as I see it, it is not necessary to hold the entire right table dictionary in memory.

In the process illustrated below, there are 5 tasks:

Task 1: create a sparse index for one slice of LEFT and one slice of RIGHT. The example shows 8 such tasks, which can be executed concurrently (if memory permits).

Task 2: concatenate the outputs of the 8 instances of Task 1. This is a single task and cannot run in parallel.

Task 3: [optional] sort the result from Task 2.
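Tasks 1 to 3 could be sketched roughly as follows. This is a minimal illustration and not the code in the commit: it assumes numpy, and `sparse_index`, `slices`, the slice sizes and the key values are all made up for the example. The point is that each Task 1 call only ever sees one LEFT slice and one RIGHT slice.

```python
import numpy as np
from itertools import product


def sparse_index(left_keys, right_keys, left_offset, right_offset):
    """Task 1 (hypothetical helper): match one LEFT slice against one RIGHT slice.

    Returns an (n, 2) array of global (left_row, right_row) pairs.
    Only the two slices are held in memory, never the whole RIGHT table.
    """
    lookup = {}
    for j, key in enumerate(right_keys):
        lookup.setdefault(key, []).append(right_offset + j)
    pairs = [
        (left_offset + i, j)
        for i, key in enumerate(left_keys)
        for j in lookup.get(key, ())
    ]
    return np.array(pairs, dtype=np.int64).reshape(len(pairs), 2)


def slices(n, size):
    """Split range(n) into consecutive (start, stop) slices of at most `size` rows."""
    return [(start, min(start + size, n)) for start in range(0, n, size)]


# Made-up key columns: 4 LEFT slices x 2 RIGHT slices = 8 independent tasks.
left = [1, 2, 3, 4, 2, 5, 6, 1]
right = [2, 1, 7, 2, 4, 9]
tasks = list(product(slices(len(left), 2), slices(len(right), 3)))

# Task 1: run sequentially here; the calls share no state.
results = [
    sparse_index(left[a:b], right[c:d], a, c)
    for (a, b), (c, d) in tasks
]

# Task 2: a single concatenation step.
sparse = np.concatenate(results)

# Task 3 (optional): sort by left row, then by right row.
order = np.lexsort((sparse[:, 1], sparse[:, 0]))
sparse = sparse[order]
```

Because the Task 1 calls are independent, they could be submitted to a process pool such as `concurrent.futures.ProcessPoolExecutor` instead of the list comprehension, with the number of workers limited by available memory.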

Task 4: construct the reindex table based on join type:

  • Inner = sparse table.
  • Left = left table range + duplicates in the sparse table.
  • Outer = left table range repeated by right table range.

This task runs on a single core.
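A rough sketch of Task 4 for the inner and left cases, assuming numpy and a sparse table of (left_row, right_row) pairs. The function name `reindex` and the `MISSING` marker are hypothetical, and the left case reflects one reading of "left table range + duplicates in the sparse table": every left row appears at least once, and rows with several matches are repeated. The outer case is not covered here.

```python
import numpy as np

MISSING = -1  # hypothetical marker meaning "no matching row in RIGHT"


def reindex(sparse, n_left, how):
    """Build the reindex table from a sparse (left_row, right_row) table."""
    if how == "inner":
        # Inner join: the sparse table already is the reindex.
        return sparse.copy()
    if how == "left":
        # Find left rows that never appear in the sparse table.
        matched = np.zeros(n_left, dtype=bool)
        matched[sparse[:, 0]] = True
        unmatched = np.flatnonzero(~matched)
        extra = np.column_stack([unmatched, np.full(len(unmatched), MISSING)])
        out = np.concatenate([sparse, extra])
        # Stable sort keeps the existing order among repeated left rows.
        return out[np.argsort(out[:, 0], kind="stable")]
    raise ValueError(f"join type not covered by this sketch: {how}")


# Made-up sparse table, sorted by left row, for a LEFT table of 8 rows.
sparse = np.array(
    [[0, 1], [1, 0], [1, 3], [3, 4], [4, 0], [4, 3], [7, 1]], dtype=np.int64
)
inner = reindex(sparse, 8, "inner")
left_join = reindex(sparse, 8, "left")
```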

Task 5: for each column in the new table, a task is created for each page-sized slice of the corresponding left or right column in the reindex.

  • To construct the new page, each process first has to gather the required rows by reading from multiple slices. In the example below, REINDEX RIGHT illustrates this with blue and orange background colours, which mark the 1st and 2nd page of the table RIGHT respectively.
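The gather step in Task 5 might look something like this. It is only a sketch: it assumes numpy, fixed-size pages held as a list of arrays, and a negative row id as the marker for a missing row. `gather_page` and the sample data are hypothetical.

```python
import numpy as np


def gather_page(pages, page_size, row_ids, missing=None):
    """Build one output page by reading each required source page once.

    pages     : list of numpy arrays, one per page of the source column
    page_size : number of rows per source page
    row_ids   : slice of one reindex column; negative means "no row"
    """
    out = np.empty(len(row_ids), dtype=object)
    out[:] = missing
    valid = row_ids >= 0
    page_no = row_ids // page_size
    for p in np.unique(page_no[valid]):
        mask = valid & (page_no == p)
        page = pages[p]  # the only read of this source page
        out[mask] = page[row_ids[mask] - p * page_size]
    return out


# Made-up RIGHT column stored as two pages of three rows each.
right_pages = [np.array(["a", "b", "c"]), np.array(["d", "e", "f"])]
reindex_right = np.array([1, 0, 3, -1, 4, 0])
new_page = gather_page(right_pages, 3, reindex_right)
```

Each output page is independent of the others, so one such call per column and page-sized slice of the reindex can run as its own task.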

image

PS: If the key is a table-crossing key or any column contains strings, it makes sense to represent the key as a cryptographic hash.
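One way to do this, as a sketch only: hash the key values with `hashlib.sha256` so every key becomes a fixed 32-byte digest regardless of how many columns it spans or how long the strings are. The helper name `key_digest` is hypothetical, and reading "table crossing key" as a key made up of several columns is an assumption.

```python
import hashlib


def key_digest(*values):
    """Return a fixed-size digest for a key made of one or more column values."""
    h = hashlib.sha256()
    for v in values:
        data = repr(v).encode("utf-8")
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.digest()


k1 = key_digest("ab", "c")
k2 = key_digest("a", "bc")
k3 = key_digest("ab", "c")
```

Note that `repr` distinguishes `1` from `"1"` and from `1.0`, so values that compare equal across types would need to be normalised first if they should join.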
