JobManager: create & start in parallel #719
Comments
@jdries @soxofaan could we run the job starts across multiple threads, e.g. via "from threading import Thread"?
Another option might be to look into asyncio, which would run a single thread but jump between the various job creations/starts.
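The asyncio idea could be sketched roughly like this (all names are hypothetical; "start_job" here is a stand-in for the real create-and-start call, with the network round-trip simulated by asyncio.sleep):

```python
# Sketch of the asyncio alternative: one thread, but the event loop
# interleaves the (simulated) waits of several job starts.
import asyncio
import time

async def start_job(job_id: str) -> str:
    # Simulate the network round-trip of creating/starting a job.
    await asyncio.sleep(0.1)
    return f"{job_id}:started"

async def start_all(job_ids):
    # All starts run concurrently on a single thread.
    return await asyncio.gather(*(start_job(j) for j in job_ids))

t0 = time.perf_counter()
results = asyncio.run(start_all(["job-1", "job-2", "job-3"]))
elapsed = time.perf_counter() - t0
print(results)  # three "started" results, after ~0.1s total instead of ~0.3s
```

The catch, as noted below, is that every blocking call on the code path (network requests, database queries) has to be made awaitable for this to pay off.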
Threading won't work I'm afraid, because in Python only one thread can be active at a time, and the current "start_job" requests are blocking, so the execution would not actually be parallel. We would have to use a non-blocking request library like https://www.python-httpx.org, or use multiprocessing to get effective parallelism. Multiprocessing might be the easiest route for now (I'm not sure how easy it would be to switch from our classic "requests"-based implementation to httpx).
Indeed, that would probably be a more modern approach, but it's not trivial to migrate everything we already have (or at least a well-chosen subset) to this new paradigm.
Reading a bit deeper into it: if we want full performance, we need to make sure that all network requests, database queries, etc. can run asynchronously. This might make the code overly complex, given that as a standard we only support 2 parallel jobs...
Ok, I did some testing with "requests" in threads, and apparently it does work to do requests in parallel that way (the GIL is released while a thread waits on network I/O, so blocking "requests" calls can overlap). I was probably confusing it with another threading problem I had before.
Do we know what the upper limit is on the number of threads we could use? Being able to add 20 jobs in parallel would already make a big difference; LCFM would probably prefer 100 at once.
Note that it might also be counter-productive to do too much in parallel: I would default to something like 5, and maybe scale up a bit if you know what you are doing. Threading tutorials typically point to thread pools (with a fixed size limit) as an easy way to enforce this.
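The fixed-limit thread pool mentioned above could look something like this (a minimal sketch; "start_job" is a placeholder for the real blocking create-and-start request):

```python
# Sketch of the thread-pool idea: cap parallelism at a fixed number of
# workers (here 5), regardless of how many jobs are queued.
from concurrent.futures import ThreadPoolExecutor, as_completed

def start_job(job_id: str) -> str:
    # In reality this would be a blocking HTTP request via "requests".
    return f"{job_id}:started"

job_ids = [f"job-{i}" for i in range(20)]

with ThreadPoolExecutor(max_workers=5) as pool:
    # At most 5 start_job calls are in flight at any moment.
    futures = [pool.submit(start_job, j) for j in job_ids]
    results = [f.result() for f in as_completed(futures)]

print(len(results))  # 20
```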
Just to be sure: is this with the STAC-based implementation of the job manager? Pandas dataframes are not thread-safe, so unless you explicitly add locks, this might go awry at scale.
Same concern here (and in PR #723) about pandas. While pandas as a kind of database API was handy in the proof-of-concept phase of the job manager, I have the feeling it is now actually making progress harder than it should be. We had various struggles in the past when implementing new job manager features, and it is now making threading-based features quite challenging. I think we should try to get/keep pandas out of the code paths that we want to run in threads.
So there are a couple of options we could explore. Instead of pandas we could look at Dask or Modin dataframes, which both support parallelism in a more suitable way; nevertheless, I am uncertain whether these would allow parallel writes. Another option is to use a defaultdict, just as is already done for the stats... Lastly, we could consider only supporting threading for the STAC-based job manager?
I would avoid the complexity of shared data structures and all the consistency challenges around that:
I see, so we'd make launch_job more narrowly focused on launching the job, and pull the status update out of it. That way we can launch the jobs concurrently without needing threaded access to the dataframe.
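That split could be sketched as follows (all names hypothetical): worker threads only launch jobs and hand back results; the job table (the pandas dataframe in the real job manager, a plain dict here to keep the sketch self-contained) is updated by the main thread only, so no locking is needed.

```python
# Sketch: threads do the launching, the main thread does all table writes.
from concurrent.futures import ThreadPoolExecutor

def launch_job(row_id: int) -> tuple:
    # Placeholder for the blocking create+start request; returns the
    # backend job id so the caller can record it afterwards.
    return row_id, f"backend-job-{row_id}"

job_table = {i: {"status": "not_started", "job_id": None} for i in range(4)}

with ThreadPoolExecutor(max_workers=2) as pool:
    # Threads only launch; they never touch job_table.
    results = list(pool.map(launch_job, job_table))

# Single-threaded table update after the launches return.
for row_id, backend_id in results:
    job_table[row_id] = {"status": "queued", "job_id": backend_id}

print(all(row["status"] == "queued" for row in job_table.values()))  # True
```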
Creating and starting a job takes some time, which means there's an interval during which the job manager is creating new jobs and resources are potentially sitting unused.
If we can start jobs in parallel, we can shrink this interval.
Do note that we typically have rate-limiting in place on backends, so we have to be resilient there.