lazily map over generators #19

johnfarina · 2018-09-15T04:22:44Z

First, great little library!

I am often in a situation where I want to compute something like (f(x) for x in xs) using multiple cores, where xs is a generator yielding a very large number of objects (say, billions), so I don't want to materialize xs in memory all at once. Unfortunately, the methods of multiprocessing.Pool() do exactly that when they populate the underlying work queue.

To get around this, I've been using a little helper function like the following to manually break up an iterator into slices and call Pool.map on each slice:

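import itertools
import multiprocessing

from tqdm import tqdm


def lazymap(f, xs, chunksize=1000):
    # The length is only used as the progress bar total. Counting a
    # generator (e.g. via tee) would buffer every item, so skip it.
    try:
        n = len(xs)
    except TypeError:
        n = None

    # Use one shared iterator so each islice() resumes where the last stopped.
    xs = iter(xs)
    with tqdm(total=n) as pbar, multiprocessing.Pool() as p:
        while True:
            # Only one chunk of inputs is materialized at a time.
            rs = p.map(f, itertools.islice(xs, chunksize))
            if not rs:
                break
            pbar.update(len(rs))
            yield from rs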

but this is pretty kludgy. It would be great if parmap could do something similar, so that it avoids consuming the entire input iterator before multiprocessing starts doing work!

It might be as simple as avoiding calling len on line 239 and refactoring the way that pool.map_async is used, but I admit I don't understand the code all that well.

Thoughts?


zeehio · 2018-09-15T07:39:14Z

I am editing this answer after rereading the multiprocessing implementation.

If you know the length in advance you can wrap the generator in a custom class that provides a __len__ method as shown at https://stackoverflow.com/a/7460929/446149
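A minimal sketch of such a wrapper, assuming the caller already knows how many items the generator will yield. The names SizedIterable and squares are hypothetical and only serve the illustration; the wrapper reports whatever length it is given and does not verify it.

```python
class SizedIterable:
    """Wrap an iterable and report a length supplied by the caller."""

    def __init__(self, iterable, length):
        self._iterable = iterable
        self._length = length  # trusted as-is; not checked against the data

    def __iter__(self):
        return iter(self._iterable)

    def __len__(self):
        return self._length


def squares(n):
    """Hypothetical generator standing in for a large lazy input."""
    for i in range(n):
        yield i * i


# len() now works without consuming the generator.
wrapped = SizedIterable(squares(5), 5)
```

Code that only calls len() on its input, for example to size a progress bar, can then accept the wrapped generator without exhausting it first.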

I am looking for alternative solutions for when the length is unknown but I don't have much hope...

zeehio · 2018-09-15T11:38:57Z

It seems imap is the way to go. https://stackoverflow.com/questions/26520781/multiprocessing-pool-whats-the-difference-between-map-async-and-imap

I will look into it when I have time
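A rough sketch of what the imap approach could look like, using only the standard library. The helper lazy_results and the generator numbers are hypothetical names, and this is not how parmap is implemented. In CPython, Pool.map calls list() on an input that has no __len__, whereas Pool.imap iterates over it and returns results as an iterator. The pool's internal feeder thread may still read ahead of the workers, so this is not a hard bound on memory use.

```python
import math
import multiprocessing


def lazy_results(pool, f, xs, chunksize=100):
    """Yield f(x) for each x in xs, in input order, via pool.imap."""
    # imap returns an iterator rather than a fully built list of results.
    for result in pool.imap(f, xs, chunksize):
        yield result


def numbers(n):
    """Hypothetical stand-in for a generator of unknown length."""
    for i in range(n):
        yield i


if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        # Results are consumed one at a time; no result list is built.
        total = sum(lazy_results(pool, math.sqrt, numbers(10_000)))
    print(total)
```

Passing the pool in as an argument keeps the helper independent of how the pool is created. If the order of results does not matter, Pool.imap_unordered takes the same arguments.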

zeehio added the enhancement and pull-request-welcome labels · May 17, 2019
