multiprocessing considerations #15
Comments
For neowise-source-table-lightcurves.md, I left all the code in the notebook and added instructions for Mac/Windows users (copied below). The use case in that notebook is a bit too big for a laptop anyway, and it's still only a small version of the types of use cases that those bulk-access Parquet files are really good for, so I'm not really worried about accommodating Mac/Windows there. This also affects Parallelize_Convolution.md, so we should probably leave a similar Mac/Windows note in there. I don't see any reason to do anything else. Also, we'll need to decide on #8. Do we still want to parallelize it? Note in NEOWISE notebook:
I'm mostly clueless about how all of this comes together. @bsipocz @troyraen Would it make sense to do both things? I.e., keep the functions inside the notebook for ease of readability and ease of serving them, but also make them into .py files that are available in the same repo so someone could grab them if necessary? Then the instructions would say "grab this file and put it in the same directory if you are working on Mac/Windows". I'm sure it's a weakness of mine, but the need to copy and paste into a .py file is too great for me to do for a demo notebook. Or, because I think we are headed there anyway according to @bsipocz: what if we did have an interactive notebook and the instructions say "if you are working on Mac/Windows, run the notebook here. And if you want to run this on your own machine after testing here, make sure to copy the functions to .py files"? My aversion to copying things around is less if I am going to alter the notebook anyway.
One of the main differences is the target audience of these tutorials vs the Fornax use cases. The latter is expected to be cloned as a full repo into the FSC, so the files will be available and properly set up for the users. These, on the other hand, are more like long-form narrative how-tos for basic IRSA functionality that we expect to work for a much wider variety of user setups, and the primary workflow is expected to be either downloading individual notebooks via the HTML renderings or simply copy-pasting code out of them. Therefore, working with semi-library-type code under the hood is not practical here. So for that purpose, I feel it's fully valid to say that this way of doing multiprocessing will only work in special circumstances; others, please refer to XYZ tutorials outside of IRSA (and then we can point them to more generic tutorials in Python land).
(I mean, at some point we need to stop pretending that all of the big data cases can be carried out fully on an old Windows laptop, etc., so not going all the way to accommodate an impractical corner case feels acceptable.)
Thanks both @jkrick, @bsipocz. @jkrick I mostly agree with everything you said, though I would argue against duplicate code. I feel that it's one of those things that, for one narrow case, never sounds so bad, but it easily balloons into more manual management than was first intended and/or the duplicate code just gets forgotten, then breaks and causes confusion. OTOH, I agree that if users really need the separate file they would appreciate us providing it. I don't think that's unique to you at all. I also agree that once users can interact with notebooks directly on the rendered site this will be less of a problem. (And I also understand this to be a goal, though not a short-term one.) I think specific choices can be made on a notebook-by-notebook basis though. To make the non-duplicate options explicit: 1) parallelize with code in the notebook (plus a note); 2) parallelize with code in a .py file; 3) just don't parallelize it. The two notebooks that actually have this issue right now are not really problems IMO (for different reasons each) and not worth more effort on our side than just leaving a note. But #8 is less clear. Hiding the code in a .py file seems counterproductive there, but option 1 vs 3? Probably a scientist who's attached to that notebook just needs to make a decision (and if they really want to push for duplicate code, they can).
This is very much my wishlist, but bear in mind that actually pulling this off in the near future (e.g., the next 1-2 FY) will likely only be feasible for the IRSA notebooks and not for the Fornax ones, at least not for the ones that require any specialized compiled dependency (at the moment that definitely includes tractor, but also hipscat, as I don't see us getting healpy to work in the browser any time soon).
Also, is my impression correct that this only affects … ?

I haven't tried at all, so I just don't know. That could be interesting.
Ah yes, I see the problem with duplicate code. I hadn't thought of that, so good points.
Yes, both the Ray and Dask passes work for the convolution notebook; it's only the …

(In fact, it times out on CI, but I recall it had issues locally on my OSX, too.)

And any future notebooks that want to use …

My guess is that it hits this same …

And actually, it could be a better solution for the two current notebooks than the note I left in NEOWISE. I'd have to think through exactly how it would work.
Testing the NEOWISE parquet catalog notebooks (IRSA-6202, https://github.com/IPAC-SW/ipac-sp-notebooks/pull/69) revealed an issue with Python's `multiprocessing` (thanks to Chris Gelino for running into it) that affects other notebooks in this repo and also the rendered pages. It's a bit involved, so I'll start with the summary.

Summary

Notebooks in this repo that use Python `multiprocessing` for parallelization will not work out of the box on Mac or Windows. They'll work just fine on Linux, which includes both Fornax and the IRSA science platform. Conversely, our notebooks in fornax-demo-notebooks that use `multiprocessing` should work out of the box everywhere because we made a different design choice in that repo. The tradeoff is that Fornax notebooks will be more difficult to serve.

@bsipocz and I talked this over and agree those are probably the right choices for both repos, at least for now. We'll see how it goes both ways and can adjust if needed. Posting here so others are aware of it for other notebooks and also to have a larger discussion if needed.
The `multiprocessing` problem

Linux and Mac have different defaults for how child processes are started, and unfortunately the Linux default ("fork"), which works great for these notebooks, is not exactly reproduced on a Mac even when choosing the same option. Specifically, the problem on Mac is that the functions defined in the notebook (which runs in the main process) do not get copied to the pool workers (child processes). So the multiprocessing cell throws an `AttributeError` complaining that it can't find the function it's supposed to run. The easiest solution is to save the functions in a separate .py file and import them. Then the workers can find them.

The docs say that the Linux default option is not even available on Windows. I don't have a Windows machine to try with, but my best guess is that the same solution will be best there as well (though I never quite know with Windows).
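To make the failure mode concrete, here is a minimal sketch (the function name `brighten` and the values are made up for illustration, not taken from the notebooks). With the "spawn" start method, which is the default on macOS and Windows, the child processes re-import the module that defines the target function, so that function must live at module top level in an importable .py file rather than being defined interactively in a notebook cell:

```python
import multiprocessing as mp

def brighten(mag):
    # Hypothetical per-source worker function, a stand-in for the real
    # notebook functions. Because it is defined at module top level in a
    # .py file, "spawn"-started workers can re-import and find it; a
    # function defined only in a notebook cell would raise AttributeError.
    return mag - 1.0

if __name__ == "__main__":
    # Request "spawn" explicitly so the behavior matches the macOS/Windows
    # default even when this script is run on Linux.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        results = pool.map(brighten, [10.0, 12.5])
    print(results)  # -> [9.0, 11.5]
```

Run as a script, this works on all three platforms; paste the same `brighten` definition into a notebook cell on a Mac and the `pool.map` call fails, which is exactly the problem described above.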
The serving problem
Unfortunately, the solution of defining functions in a separate .py file collides with our preference that notebooks in this repo avoid relying on ancillary files whenever possible, because ancillary files are difficult to serve. The notebook rendering process will work fine, but the public webpages will not show the ancillary files, so we'd have to leave instructions for users to manually download them from the repo. @bsipocz can say more here.
We made the opposite choice for the fornax-demo-notebooks repo, where all of the notebooks have ancillary .py files that store most of the code. Those notebooks require a whole lot more code than notebooks in this repo, at least so far, so I think the separate files there make a lot of sense; those notebooks would be huge if they included all the code directly.
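One possible middle ground, sketched below purely as an assumption (it is not what either repo currently does): have the notebook itself write the worker functions to a .py file and import them, so the code stays visible in the notebook and no separate file needs to be served. In a notebook the write step could be a `%%writefile` cell; here it is plain Python so the sketch is self-contained. The module name `nb_helpers` and the function `brighten` are hypothetical:

```python
import importlib
import pathlib
import sys

# Hypothetical helper source; in a real notebook this string would be a
# %%writefile cell, keeping the code readable in the rendered page.
HELPER_SRC = '''
def brighten(mag):
    """Hypothetical per-source worker function."""
    return mag - 1.0
'''

# Write the module next to the notebook and make sure it is importable.
pathlib.Path("nb_helpers.py").write_text(HELPER_SRC)
sys.path.insert(0, ".")

import nb_helpers
importlib.reload(nb_helpers)  # pick up edits when the cell is re-run

if __name__ == "__main__":
    from multiprocessing import get_context
    # "spawn"-started workers can re-import nb_helpers from disk, so this
    # pattern works on Mac/Windows as well as Linux.
    with get_context("spawn").Pool(2) as pool:
        out = pool.map(nb_helpers.brighten, [10.0, 12.5])
    print(out)  # -> [9.0, 11.5]
```

The tradeoff is that the notebook still creates a file on the user's machine, but no ancillary file has to be downloaded or served alongside the rendered page.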