Fix map (#994)
There is now a central manager that computes the rows remaining, then
sends those rows over IPC to the process/threadpool, instead of each
worker independently trying to compute its own shard of the remaining work.

This means that each worker process no longer needs to make its own
DuckDB query, so DuckDB's memory usage is limited to just the manager
process. It also removes a whole class of continuation bugs where
changing the filter/limit settings could cause shard reassignment,
leaving shards unable to find their own previous work.
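
A minimal sketch of the new shape (function and variable names here are illustrative, not lilac's actual API):

```python
from concurrent.futures import ProcessPoolExecutor

def _map_chunk(row_ids: list[int]) -> int:
  """Hypothetical worker: runs the map function over one chunk of rows."""
  # The worker never opens its own DuckDB connection, so its memory stays flat.
  for _row_id in row_ids:
    pass  # map_fn(row) would run here.
  return len(row_ids)

def run_map(remaining_row_ids: list[int], chunk_size: int = 1024) -> None:
  # Only the manager computes `remaining_row_ids` (e.g. with a single DuckDB
  # query); workers just receive plain row ids over IPC.
  chunks = [remaining_row_ids[i:i + chunk_size]
            for i in range(0, len(remaining_row_ids), chunk_size)]
  with ProcessPoolExecutor(max_workers=10) as pool:
    for _rows_done in pool.map(_map_chunk, chunks):
      pass  # Progress handling is shown below.
```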

Joblib doesn't have an easy way to do job ids (see
joblib/joblib#1008), so I have dropped the job
id functionality.

I can now run a 10-million-row map (peak memory consumption 24 GB, still
all due to DuckDB) with 10 processes. At HEAD we would freeze on a
1-million-row map with 10 processes.

Task progress reporting is vastly simplified: the manager is the single
bottleneck and receives the finished work as a reply from each worker,
allowing a trivial wrap in tqdm.
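
For example, the single-bottleneck design makes the progress bar roughly this simple (a sketch with made-up names, not the actual code):

```python
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm

def _map_chunk(row_ids: list[int]) -> int:
  return len(row_ids)  # Stand-in for running map_fn over the chunk.

def run_with_progress(chunks: list[list[int]]) -> None:
  total = sum(len(chunk) for chunk in chunks)
  with ProcessPoolExecutor(max_workers=10) as pool, tqdm(total=total) as pbar:
    # Every finished chunk comes back through the manager, so a single
    # tqdm bar in the manager process tracks all workers.
    for rows_done in pool.map(_map_chunk, chunks):
      pbar.update(rows_done)
```
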
brilee authored Dec 26, 2023
1 parent 46351ea commit e8b2827
Showing 19 changed files with 330 additions and 711 deletions.
6 changes: 6 additions & 0 deletions development.md
@@ -320,3 +320,9 @@ rm memray.bin memray-flamegraph-memray.html; \
```bash
poetry run python -X importtime -c "import lilac" 2> import.log && poetry run tuna import.log
```

#### Profiling lilac base library memory consumption

```bash
rm memray.bin memray-flamegraph-memray.html; \
echo "import lilac" > test_import_script.py && poetry run memray run -o memray.bin test_import_script.py \
&& poetry run memray flamegraph memray.bin && open memray-flamegraph-memray.html
```
