Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
There is now a central manager that computes rows remaining, and then sends those rows over IPC to the process/threadpool, instead of each worker independently trying to compute its shard of remaining work. This means that each process no longer needs to make its own duckdb query, so that memory usage of duckdb is limited to just the manager process. It also removes all sorts of continuation bugs where changing the filter/limit settings might cause shard reassignment and shards unable to find their own previous work. Joblib doesn't have an easy way to do job ids (see joblib/joblib#1008) so I have dropped the job id functionality. I can now run a 10 million row map (peak memory consumption 24GB, still all due to DuckDB) with 10 processes. At HEAD we would freeze on a 1million row map with 10 processes. Task progress reporting is vastly simplified since the manager bottlenecks and receives the finished work as a reply from each worker, allowing a trivial wrap in tqdm.
- Loading branch information