Join with constant memory footprint #124

Closed · wants to merge 139 commits into from
Conversation

root-11 (Owner) commented Jan 8, 2024

@realratchet @cerv15

Join has been optimized for speed using numpy and now maintains a constant memory footprint of 4 pages (4M values).
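For reference, the constant-footprint idea can be sketched as producing the join index in fixed-size numpy chunks instead of materializing the whole result at once. The names below (`PAGE_SIZE`, `left_join_paged`) are illustrative assumptions, not tablite's actual internals:

```python
# Illustrative sketch only -- names and page size are assumptions,
# not tablite's real API. Assumes right_keys is non-empty.
import numpy as np

PAGE_SIZE = 1_000_000  # one "page" of values

def left_join_paged(left_keys: np.ndarray, right_keys: np.ndarray):
    """Yield (left_idx, right_idx) index pages for a left join,
    materializing at most one page of indices at a time."""
    # Sort the right side once so each left page can be matched
    # with a binary search instead of a per-page hash table.
    order = np.argsort(right_keys, kind="stable")
    sorted_right = right_keys[order]
    for start in range(0, len(left_keys), PAGE_SIZE):
        chunk = left_keys[start:start + PAGE_SIZE]
        pos = np.searchsorted(sorted_right, chunk)
        pos = np.minimum(pos, len(sorted_right) - 1)
        hit = sorted_right[pos] == chunk
        # -1 marks "no match"; a real join must also fan out
        # duplicate keys on the right side, which this sketch skips.
        right_idx = np.where(hit, order[pos], -1)
        left_idx = np.arange(start, start + len(chunk), dtype=np.int64)
        yield left_idx, right_idx
```

Each yielded index page can be used to copy the matching values out to a disk page and then be freed, which is what keeps memory bounded at a handful of pages regardless of table size.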

Occasionally (about 1 in 100 runs) I can trigger the following error during multiprocessing:

/home/bjorn/github/tablite/tests/test_join.py::test_left_join_mp failed: def test_left_join_mp():
        Config.MULTIPROCESSING_MODE = Config.FORCE
>       do_left_join()

tests/test_join.py:52: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_join.py:29: in do_left_join
    left_join.show()
tablite/base.py:1748: in show
    print(self.to_ascii(slice_=slice_, blanks=blanks, dtype=dtype))
tablite/base.py:1712: in to_ascii
    for name, values in self.display_dict(slice_=slice_, blanks=blanks, dtype=dtype).items():
tablite/base.py:1663: in display_dict
    data[name] = list(chain(iter(col), repeat(blanks, times=n - len(col))))[slc]
tablite/base.py:694: in __iter__
    data = page.get()
tablite/base.py:147: in get
    array = load_numpy(self.path)
tablite/utils.py:472: in load_numpy
    return np.load(path, allow_pickle=True, fix_imports=False)
../../venv/py310tablite/lib/python3.10/site-packages/numpy/lib/npyio.py:432: in load
    return format.read_array(fid, allow_pickle=allow_pickle,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

fp = <_io.BufferedReader name='/tmp/tablite-tmp/pid-30773/pages/3.npy'>
allow_pickle = True, pickle_kwargs = {'encoding': 'ASCII', 'fix_imports': False}

    def read_array(fp, allow_pickle=False, pickle_kwargs=None, *,
                   max_header_size=_MAX_HEADER_SIZE):
        """
        Read an array from an NPY file.
    
        ...
    
        """
        if allow_pickle:
            # Effectively ignore max_header_size, since `allow_pickle` indicates
            # that the input is fully trusted.
            max_header_size = 2**64
    
        version = read_magic(fp)
        _check_version(version)
        shape, fortran_order, dtype = _read_array_header(
                fp, version, max_header_size=max_header_size)
        if len(shape) == 0:
            count = 1
        else:
            count = numpy.multiply.reduce(shape, dtype=numpy.int64)
    
        # Now read the actual data.
        if dtype.hasobject:
            # The array contained Python objects. We need to unpickle the data.
            if not allow_pickle:
                raise ValueError("Object arrays cannot be loaded when "
                                 "allow_pickle=False")
            if pickle_kwargs is None:
                pickle_kwargs = {}
            try:
>               array = pickle.load(fp, **pickle_kwargs)
E               _pickle.UnpicklingError: unpickling stack underflow

../../venv/py310tablite/lib/python3.10/site-packages/numpy/lib/format.py:792: UnpicklingError

I believe this is not a join bug, but rather a multiprocessing problem. @realratchet?
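If the root cause is indeed a race where one process loads a page file that another process is still writing, the truncated pickle stream would produce exactly this kind of `UnpicklingError`. One common mitigation (not necessarily what tablite does; `save_numpy_atomic` is a hypothetical stand-in) is to write the `.npy` to a temporary file and atomically rename it into place, so a reader only ever sees a complete file:

```python
# Hypothetical sketch of an atomic page write -- not tablite's actual utils.
import os
import tempfile
import numpy as np

def save_numpy_atomic(array: np.ndarray, path: str) -> None:
    """Write `array` to `path` so that concurrent readers never
    observe a partially written file."""
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".npy.tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            np.save(fh, array, allow_pickle=True)
            fh.flush()
            os.fsync(fh.fileno())  # ensure the bytes are on disk
        # Atomic on POSIX: a concurrent load_numpy() sees either the
        # old file or the new one, never a half-written page.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

That would also fit the roughly 1-in-100 failure rate: a reader only fails when it races the writer inside the narrow window between page creation and the final byte being written.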

Ratchet and others added 3 commits January 25, 2024 15:47
- Fix issue with fail table column order and more granular tqdm

Ratchet and others added 2 commits January 26, 2024 16:13

Ratchet and others added 4 commits January 29, 2024 13:45
- fix issue with string slicing
- …oader couldn't load it
- added tests for page loading parity i forgot to add
- Fix scalar pages, fix unicode slicer
realratchet (Collaborator) commented:

I'll close this and re-open it from my fork; it seems my git decided to push to the master repo instead of the fork for whatever reason.
