fix for to_polygons when using processes instead of threads in dask #756

Open

wants to merge 1 commit into main
Conversation

ArneDefauw
Contributor

I noticed that when configuring dask to use 'processes' instead of 'threads', spatialdata.to_polygons fails because of the line changed in this PR: the result of dask.compute() is lost when using processes, which is expected behaviour (worker processes do not share memory with the parent, so in-place side effects are never propagated back).
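
To make the failure mode concrete, here is a minimal sketch (a hypothetical toy function, not the spatialdata code) of a delayed task whose side effect is visible under the 'threads' scheduler but silently dropped under 'processes', while the return value of dask.compute() is correct in both cases:

import dask

results = []

@dask.delayed
def _double(x):
    # The append happens in a worker; with the "processes" scheduler the
    # worker has its own copy of `results`, so the parent list stays empty.
    results.append(x)
    return x * 2

if __name__ == "__main__":
    dask.config.set(scheduler="processes")
    computed = dask.compute(*[_double(i) for i in range(4)])
    print(computed)  # (0, 2, 4, 6) -- return values survive the process boundary
    print(results)   # [] -- the in-place append does not

Whatever the exact shape of the patched code, the pattern that is safe under any scheduler is to consume the tuple returned by dask.compute() rather than relying on in-place mutation.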

Note that using 'processes' instead of 'threads' considerably speeds up vectorizing labels for large masks, because the function being parallelized does not release the GIL.

Example:

import os
from pathlib import Path

import dask
import pooch
from pooch import Pooch

from spatialdata import read_zarr, to_polygons

BASE_URL = "https://objectstor.vib.be/spatial-hackathon-public/sparrow/public_datasets"

def _get_registry(path: str | Path | None = None) -> Pooch:
    return pooch.create(
        path=pooch.os_cache("sparrow") if path is None else path,
        base_url=BASE_URL,
        version="0.0.1",
        registry={
            "transcriptomics/vizgen/mouse/_sdata_2D.zarr.zip": "e1f36061e97e74ad131eb709ca678658829dc4385a444923ef74835e783d63bc",
        },
    )

registry = _get_registry(path=None)  # set path if you want to download the data somewhere else
unzip_path = registry.fetch("transcriptomics/vizgen/mouse/_sdata_2D.zarr.zip", processor=pooch.Unzip())
sdata = read_zarr(os.path.commonpath(unzip_path))
sdata.path = None

dask.config.set(scheduler="processes")
gdf = to_polygons(sdata["segmentation_mask_full"])
# finishes in around 3 min locally on a Mac M2

dask.config.set(scheduler="threads")
gdf = to_polygons(sdata["segmentation_mask_full"])
# finishes in around 8 min locally on a Mac M2

"segmentation_mask_full' contains the masks from a merscope experiment, around 300k labels.


codecov bot commented Nov 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.89%. Comparing base (27bb4a7) to head (d7c91a1).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #756      +/-   ##
==========================================
- Coverage   91.89%   91.89%   -0.01%     
==========================================
  Files          45       45              
  Lines        6919     6918       -1     
==========================================
- Hits         6358     6357       -1     
  Misses        561      561              
Files with missing lines                      | Coverage Δ
----------------------------------------------|-----------------------------
src/spatialdata/_core/operations/vectorize.py | 93.75% <100.00%> (-0.04%) ⬇️
