Fix map (#994)

There is now a central manager that computes the rows remaining and sends those rows over IPC to the process/thread pool, instead of each worker independently trying to compute its shard of the remaining work. Each process no longer needs to make its own DuckDB query, so DuckDB's memory usage is limited to the manager process. This also removes a class of continuation bugs where changing the filter/limit settings could cause shard reassignment, leaving shards unable to find their own previous work.

Joblib doesn't have an easy way to do job ids (see joblib/joblib#1008), so I have dropped the job id functionality.

I can now run a 10 million row map with 10 processes (peak memory consumption 24GB, still all due to DuckDB). At HEAD we would freeze on a 1 million row map with 10 processes.

Task progress reporting is vastly simplified: the manager is the single point that receives the finished work as a reply from each worker, which allows a trivial wrap in tqdm.
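The manager/worker split described above can be illustrated with a minimal sketch. This is not the code from this commit: it uses the standard library's `concurrent.futures` thread pool as a stand-in for joblib, and the names `compute_remaining`, `chunked`, `process_batch`, and `run_map` are hypothetical. It only shows the shape of the idea, where the manager decides once which rows remain, hands batches to workers, and collects each reply in a single loop.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Set, Tuple


def compute_remaining(all_ids: Iterable[int], done_ids: Set[int]) -> List[int]:
    """Manager-side step: decide once which rows still need work."""
    return [row_id for row_id in all_ids if row_id not in done_ids]


def chunked(rows: List[int], size: int) -> List[List[int]]:
    """Split the remaining rows into batches to send to workers."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def process_batch(fn: Callable[[int], int], batch: List[int]) -> List[Tuple[int, int]]:
    """Worker-side step: apply fn to the rows it was handed, nothing else."""
    return [(row_id, fn(row_id)) for row_id in batch]


def run_map(
    fn: Callable[[int], int],
    all_ids: Iterable[int],
    done_ids: Set[int],
    num_workers: int = 4,
    batch_size: int = 3,
) -> Dict[int, int]:
    """Hypothetical manager: workers never query for their own shard."""
    remaining = compute_remaining(all_ids, done_ids)
    results: Dict[int, int] = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(process_batch, fn, batch)
            for batch in chunked(remaining, batch_size)
        ]
        # Every finished batch arrives here, so this loop is the one place
        # where progress needs to be counted.
        for future in as_completed(futures):
            for row_id, value in future.result():
                results[row_id] = value
    return results


if __name__ == "__main__":
    # Rows 0-2 are treated as already processed by an earlier run.
    print(run_map(lambda x: x * x, range(10), done_ids={0, 1, 2}))
```

Because all replies flow through the single `as_completed` loop, a progress bar only has to wrap that one iterator, which is presumably what makes the tqdm integration trivial in the real implementation.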
databricks · Dec 26, 2023 · e8b2827
1 parent 46351ea
Showing 19 changed files with 330 additions and 711 deletions.
diff --git a/development.md b/development.md
@@ -320,3 +320,9 @@ rm memray.bin memray-flamegraph-memray.html; \
 ```bash
 poetry run python -X importtime -c "import lilac" 2> import.log && poetry run tuna import.log
 ```
+
+#### Profiling lilac base library memory consumption
+
+rm memray.bin memray-flamegraph-memray.html; \
+echo "import lilac" > test_import_script.py && poetry run memray run -o memray.bin test_import_script.py \
+&& poetry run memray flamegraph memray.bin && open memray-flamegraph-memray.html