lazily map over generators #19

Open
johnfarina opened this issue Sep 15, 2018 · 2 comments
Labels
enhancement, pull-request-welcome (Happy to review a PR for this feature)

Comments

@johnfarina

First, great little library!

I am often in the situation where I want to compute something like (f(x) for x in xs) on multiple cores, where xs is a generator yielding a very large number of objects (say, billions), so I don't want to materialize xs in memory all at once. Unfortunately, multiprocessing.Pool's map methods do exactly that when they populate the underlying work queue.

To get around this, I've been using a little helper function like the following to manually break up an iterator into slices and call Pool.map on each slice:

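import itertools
import multiprocessing

from tqdm import tqdm

def lazymap(f, xs, chunksize=1000):
    # Use the length for the progress bar only when it is cheap to get;
    # counting a generator would consume (and buffer) all of it up front.
    try:
        n = len(xs)
    except TypeError:
        n = None

    # Take a single iterator so each islice resumes where the last one ended
    # (slicing a list directly would return the first chunk forever).
    xs = iter(xs)

    with tqdm(total=n) as pbar, multiprocessing.Pool() as p:
        while True:
            # Only one chunk of inputs is materialized at a time.
            rs = p.map(f, itertools.islice(xs, chunksize))
            if not rs:
                break
            pbar.update(len(rs))
            yield from rs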

but this is pretty kludgy. It would be great if parmap could do something similar, so that it does not consume the entire input iterator before multiprocessing starts doing work!

It might be as simple as avoiding calling len on line 239 and refactoring the way that pool.map_async is used, but I admit I don't understand the code all that well.

Thoughts?

@zeehio
Owner

zeehio commented Sep 15, 2018

I am editing this answer as I reread the multiprocessing implementation.

If you know the length in advance you can wrap the generator in a custom class that provides a __len__ method as shown at https://stackoverflow.com/a/7460929/446149
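A minimal sketch of such a wrapper, assuming the length really is known up front. The class name SizedIterable is hypothetical; it is not part of parmap and is not copied from the linked answer. The wrapper only reports the length it is given and hands out the underlying iterator unchanged, so nothing is materialized by the wrapper itself.

```python
class SizedIterable:
    """Hypothetical helper: an iterable that reports a length known in advance."""

    def __init__(self, iterable, length):
        self._iterable = iterable
        self._length = length

    def __len__(self):
        # The caller is trusted here; the value is never checked
        # against the number of items actually produced.
        return self._length

    def __iter__(self):
        # Hand out the underlying iterator without copying its items.
        return iter(self._iterable)


# Example: a generator of squares whose length we happen to know.
squares = SizedIterable((x * x for x in range(5)), 5)
```

Code that only needs len() for sizing and then iterates once should accept such an object in place of a list. If the reported length is wrong, anything derived from it (progress totals, chunk sizes) will be wrong too.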

I am looking for alternative solutions for when the length is unknown but I don't have much hope...

@zeehio
Owner

zeehio commented Sep 15, 2018

It seems Pool.imap is the way to go: https://stackoverflow.com/questions/26520781/multiprocessing-pool-whats-the-difference-between-map-async-and-imap

I will look into it when I have time.
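For reference, the imap idea can be sketched with the standard library alone. This is not parmap code: the function name lazy_map and the pool_factory parameter are assumptions made for illustration. Pool.imap does not call len() on its input and does not turn it into a list, and it yields results in input order as they complete. Note that its internal feeder thread may still read ahead of the workers, so this avoids building the full input list but is not a strict bound on memory.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool


def lazy_map(f, xs, chunksize=1, pool_factory=multiprocessing.Pool):
    """Yield f(x) for each x in xs, in input order, without listing xs first.

    pool_factory is a hypothetical hook so a thread pool can be swapped in
    (for example in tests); by default a process pool is used.
    """
    with pool_factory() as pool:
        # imap pulls items from xs as it dispatches tasks,
        # rather than converting xs to a list up front.
        yield from pool.imap(f, xs, chunksize)


# Small demonstration with a thread pool and a generator input.
demo = list(lazy_map(abs, (x - 3 for x in range(6)), pool_factory=ThreadPool))
```

With the default pool_factory the work runs in separate processes, so f must be picklable and, on platforms that spawn workers, the call should sit under an `if __name__ == "__main__":` guard. A larger chunksize reduces inter-process overhead at the cost of coarser scheduling.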

@zeehio added the enhancement and pull-request-welcome (Happy to review a PR for this feature) labels on May 17, 2019