
multiprocessing considerations #15

Open
troyraen opened this issue Aug 9, 2024 · 14 comments

troyraen commented Aug 9, 2024

Testing the NEOWISE parquet catalog notebooks (IRSA-6202, https://github.com/IPAC-SW/ipac-sp-notebooks/pull/69) revealed an issue with python's multiprocessing (thanks to Chris Gelino for running into it) that affects other notebooks in this repo and also the rendered pages. It's a bit involved so I'll start with the summary.

Summary

Notebooks in this repo that use Python multiprocessing for parallelization will not work out of the box on Mac or Windows. They'll work just fine on Linux, which includes both the Fornax and IRSA science platforms. Conversely, our notebooks in fornax-demo-notebooks that use multiprocessing should work out of the box everywhere because we made a different design choice in that repo. The tradeoff is that the Fornax notebooks will be more difficult to serve.

@bsipocz and I talked this over and agree those are probably the right choices for both repos, at least for now. We'll see how it goes both ways and can adjust if needed. Posting here so others are aware of it for other notebooks and also to have a larger discussion if needed.

The multiprocessing problem

Linux and Mac have different defaults for how child processes are started (Linux defaults to the "fork" start method, macOS to "spawn"), and unfortunately the Linux default, which works great for these notebooks, is not exactly reproduced on a Mac even when choosing the same option. Specifically, the problem on Mac is that functions defined in the notebook (which runs in the main process) do not get copied to the pool workers (child processes), so the multiprocessing cell throws an AttributeError complaining that it can't find the function it's supposed to run. The easiest solution is to save the functions in a separate .py file and import them; then the workers can find them.

The docs say that the Linux default option ("fork") is not even available on Windows. I don't have a Windows machine to test with, but my best guess is that the same solution will be best there as well (though I never quite know with Windows).
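To make the failure mode concrete, here is a minimal sketch (the `square` function is illustrative, not from the notebooks). Run as a script, or with the function imported from a .py file, this works under every start method; pasted into a notebook cell on macOS or Windows, the `pool.map` call is where the AttributeError surfaces, because "spawn" workers try to re-import the function from a main module that doesn't exist on disk.

```python
import multiprocessing as mp

def square(x):
    # Under "spawn", workers locate this function by re-importing the
    # module that defines it. That works for a .py file or a script,
    # but a function defined in a notebook cell lives only in the
    # kernel's memory, so the lookup fails with an AttributeError.
    return x * x

if __name__ == "__main__":
    # Linux defaults to "fork" (workers inherit the parent's memory);
    # macOS defaults to "spawn" since Python 3.8; Windows supports
    # "spawn" only.
    print("start method:", mp.get_start_method())

    with mp.Pool(processes=2) as pool:
        print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
```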

The serving problem

Unfortunately, the solution of defining functions in a separate .py file collides with our preference that notebooks in this repo avoid relying on ancillary files whenever possible. This is because ancillary files are difficult to serve: the notebook rendering process will work fine, but the public webpages will not show the ancillary files. We'd have to leave instructions for users to manually download them from the repo. @bsipocz can say more here.

We made the opposite choice for the fornax-demo-notebooks repo, where all of the notebooks have ancillary .py files that store most of the code. Those notebooks require a whole lot more code than notebooks in this repo, at least so far, so I think the separate files make a lot of sense there; those notebooks would be huge if they included all the code directly.


troyraen commented Aug 9, 2024

For neowise-source-table-lightcurves.md, I left all the code in the notebook and added instructions for Mac/Windows users (copied below). The use case in that notebook is a bit too big for a laptop anyway, and is still only a small version of the types of use cases that those bulk-access Parquet files are really good for. So I'm not really worried about accommodating Mac/Windows there.

This also affects Parallelize_Convolution.md so we should probably leave a similar Mac/Windows note in there. I don't see any reason to do anything else.

Also, we'll need to decide on #8. Do we still want to parallelize it?


Note in NEOWISE notebook:

> These functions will be used in the next section. Defining them here in the notebook is useful for demonstration and should work seamlessly on Linux, which includes most science platforms. Mac and Windows users should see the note at the end of the notebook.
>
> [*] Note to Mac and Windows users:
>
> You will need to copy the functions and imports from this notebook into a separate '.py' file and then import them in order to use the multiprocessing pool for parallelization. In addition, you may need to load neowise_ds separately for each child process (i.e., worker) by adding that code to the init_worker function instead of passing it in as an argument. This has to do with differences in what does / does not get copied into the child processes on different platforms.
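That per-worker loading could look roughly like the sketch below; `init_worker`, `process_chunk`, and the dictionary standing in for `neowise_ds` are illustrative placeholders (the real initializer would open the NEOWISE Parquet dataset, e.g. with pyarrow). The point is that each worker builds its own dataset handle instead of receiving it as a pickled argument, since "spawn" does not copy the parent's in-memory state.

```python
import multiprocessing as mp

_dataset = None  # per-worker global, populated by the initializer

def init_worker():
    # Hypothetical loader: each child process runs this once at startup
    # and builds its own dataset handle, rather than inheriting one
    # from the notebook's main process.
    global _dataset
    _dataset = {"loaded": True}

def process_chunk(chunk_id):
    # Worker function that reads the worker-local dataset instead of a
    # pickled argument passed in from the parent.
    return (chunk_id, _dataset["loaded"])

if __name__ == "__main__":
    with mp.Pool(processes=2, initializer=init_worker) as pool:
        print(pool.map(process_chunk, range(3)))
```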


jkrick commented Aug 9, 2024

I'm mostly clueless about how all of this comes together. @bsipocz @troyraen Would it make sense to do both things? I.e., keep the functions inside the notebook for ease of readability and ease of serving, but also package them ourselves into .py files that are available in the same repo so someone could grab them if necessary? Then the instructions would say "grab this file and put it in the same directory if you are working on Mac/Windows".

I'm sure it's a weakness of mine, but the need to copy and paste into a .py file is too big an ask for me for a demo notebook.

Or... because I think we are headed there anyway according to @bsipocz: what if we did have an interactive notebook, and the instructions said "if you are working on Mac/Windows, run the notebook here; and if you want to run this on your own machine after testing here, make sure to copy the functions to .py files"? My aversion to copying things around is less if I am going to alter the notebook anyway.


bsipocz commented Aug 9, 2024

One of the main differences is in the target audience of these tutorials vs the fornax use cases. The latter are expected to be cloned as a full repo into the FSC, so the files will be available and properly set up for the users. These, on the other hand, are more like long-form narrative how-tos for basic IRSA functionality that we expect to work for a much wider variety of user setups, where the primary workflow is expected to be either downloading individual notebooks via the HTML renderings or simply copy-pasting code out of them. Therefore, working with semi-library type code under the hood is not practical here.

So for that purpose, I feel it's fully valid to say that this way of doing multiprocessing will only work in special circumstances; others, please refer to XYZ tutorials outside of IRSA (and then we can point them to more generic tutorials in Python land).


bsipocz commented Aug 9, 2024

(I mean, at some point we have to stop pretending that all of the big data cases can be carried out fully on an old Windows laptop, etc., so not going all the way to accommodate an impractical corner case feels acceptable.)


troyraen commented Aug 9, 2024

Thanks both @jkrick, @bsipocz.

@jkrick I mostly agree with everything you said, though I would argue against duplicate code. I feel it's one of those things that never sounds so bad for one narrow case, but it easily balloons into more manual management than was first intended, and/or the duplicate code just gets forgotten, then breaks and causes confusion. OTOH, I agree that if the user really needs the separate file they would appreciate us providing it; I don't think that's unique to you at all. I also agree that once users can interact with notebooks directly on the rendered site this will be less of a problem. (And I also understand this to be a goal, though not a short-term one.)

I think specific choices can be made on a notebook-by-notebook basis, though. To make the non-duplicate options explicit: 1) parallelize with the code in the notebook (plus a note); 2) parallelize with the code in a .py file; 3) just don't parallelize.

The two notebooks that actually have this issue right now are not really problems IMO (different reasons for each) and not worth more effort on our side than just leaving a note.

But #8 is less clear. Hiding the code in a .py file seems counterproductive there, but option 1 vs 3? Probably a scientist who's attached to that notebook just needs to make a decision (and if they really want to push for duplicate code, they can).


bsipocz commented Aug 9, 2024

> interact with notebooks directly on the rendered site

This is very much on my wishlist, but bear in mind that actually pulling this off in the near future (e.g., the next 1-2 FY) will likely only be feasible for the IRSA notebooks and not for the fornax ones, at least not for the ones that require any specialized compiled dependency (which at the moment definitely includes tractor, but also hipscat, as I don't see us getting healpy to work in the browser any time soon).


bsipocz commented Aug 9, 2024

Also, is my impression correct that this only affects multiprocessing but not ray/dask/etc. solutions?

We could still set up the notebook so that a cell will/can dump out a .py file on the fly, so it won't require two copies to be maintained in the repo, etc. But so far this is just a workaround idea.
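A rough sketch of that idea, with `mp_helpers` and `square` as hypothetical names (in a notebook, the `%%writefile` cell magic could do the file-writing step):

```python
import importlib
import pathlib
import sys
import tempfile

# Generate the helper module on the fly so that "spawn" pool workers
# can import the functions; only one copy of the code lives in the
# notebook, and the .py file is a build artifact, not a second copy
# to maintain.
tmpdir = tempfile.mkdtemp()
pathlib.Path(tmpdir, "mp_helpers.py").write_text(
    "def square(x):\n"
    "    return x * x\n"
)

sys.path.insert(0, tmpdir)  # make the generated module importable
mp_helpers = importlib.import_module("mp_helpers")

# Workers started with "spawn" can now find the function, e.g.
# multiprocessing.Pool().map(mp_helpers.square, range(5))
print(mp_helpers.square(4))  # 16
```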


troyraen commented Aug 9, 2024

> Also, is my impression correct that this only affects multiprocessing but not ray/dask/etc solutions.

I haven't tried at all, so just don't know.

> we could still set up the notebook that a cell will/can dump out a py file on the fly, so it won't require two copies to be maintained in the repo, etc. But so far this is just a workaround idea.

That could be interesting.

jkrick commented Aug 9, 2024

Ah yes, I see the problem with duplicate code. I hadn't thought of that, so good points.
And thanks, Brigitta, for the timescale for interactive code.
I like this newest idea where we have a cell which says "if you get the following error, we know about it, and here is the workaround... save to an output.py file." If I am hearing Troy correctly, then this is maybe only an interesting solution for the #8 notebook.


bsipocz commented Aug 9, 2024

> Also, is my impression correct that this only affects multiprocessing but not ray/dask/etc solutions.

> I haven't tried at all, so just don't know.

Yes, both Ray and Dask pass for the convolution notebook; it's only the pool.map cell that fails: https://github.com/Caltech-IPAC/irsa-tutorials/actions/runs/10207653669/job/28242820480?pr=11#step:5:335


bsipocz commented Aug 9, 2024

(In fact, it times out on CI, but I recall it had issues locally on my OSX machine, too.)

troyraen commented:

> maybe only an interesting solution for the #8 notebook.

And any future notebooks that want to use multiprocessing.

troyraen commented:

> In fact, it times out on CI

My guess is that it hits this same AttributeError but just keeps trying until the timeout and the real error is never surfaced. I'd be quite surprised if it were anything else.

troyraen commented:

> > maybe only an interesting solution for the #8 notebook.
>
> And any future notebooks that want to use multiprocessing.

And actually, it could be a better solution for the two current notebooks than the note I left in NEOWISE. I'd have to think through exactly how it would work.
