
Datatype Support in Quality Control and Impute #865

Draft
wants to merge 9 commits into main

Conversation

@aGuyLearning (Collaborator) commented Feb 5, 2025

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes
As discussed in #861, this refactors the quality control functions to use single dispatch for future datatype support.

To this end, the explicit_impute function has already been reworked. Closes #848

Technical details
This pull request introduces significant changes to the ehrapy preprocessing module, focusing on imputation and quality control. The main changes are support for dask arrays, the use of singledispatch for function overloading, and improved test coverage for different array types.
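Roughly, the dispatch pattern looks like this (a minimal sketch; the helper name _compute_missing_values and its signature are illustrative, not the final API):

from functools import singledispatch

import numpy as np

@singledispatch
def _compute_missing_values(mtx, axis=0):
    # fallback for array types we have no implementation for
    raise NotImplementedError(f"Unsupported array type: {type(mtx)}")

@_compute_missing_values.register
def _(mtx: np.ndarray, axis=0):
    # eager numpy implementation
    return np.isnan(mtx).sum(axis=axis)

try:
    import dask.array as da

    @_compute_missing_values.register
    def _(mtx: da.Array, axis=0):
        # lazy dask implementation, evaluated chunk by chunk
        return da.isnan(mtx).sum(axis=axis).compute()
except ImportError:
    pass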

Possible enhancements
Let's discuss the changes, as I am not yet fully happy with them.

  • adata.X datatype checker wrapper for the parent function, plus more restrictive datatypes (a possible shape is sketched below)
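For the checker wrapper, something like this could sit in front of the public functions (hypothetical sketch; all names are placeholders):

from functools import wraps

import numpy as np

SUPPORTED_TYPES = (np.ndarray,)  # extend with da.Array, scipy.sparse, ...

def _check_adata_x_type(func):
    # fail fast with a clear error before any singledispatch fallback fires
    @wraps(func)
    def wrapper(adata, *args, **kwargs):
        if not isinstance(adata.X, SUPPORTED_TYPES):
            raise NotImplementedError(
                f"{func.__name__} does not support adata.X of type {type(adata.X)}"
            )
        return func(adata, *args, **kwargs)
    return wrapper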

@aGuyLearning aGuyLearning requested a review from eroell February 5, 2025 16:32
@aGuyLearning aGuyLearning self-assigned this Feb 5, 2025
@aGuyLearning aGuyLearning linked an issue Feb 5, 2025 that may be closed by this pull request
@aGuyLearning aGuyLearning marked this pull request as draft February 5, 2025 16:38
obs_metrics = pd.DataFrame(index=adata.obs_names)
var_metrics = pd.DataFrame(index=adata.var_names)
mtx = adata.X if layer is None else adata.layers[layer]
mtx = copy.deepcopy(arr.astype(object))
Collaborator:

Is this deepcopy because mtx gets written onto if "encoding_mode" in adata.var? Or is there something else you spotted here requiring this?

Collaborator:

From a first glance I think mtx was written to if "encoding_mode" in adata.var, which would be a bug. If this is also your reason here @aGuyLearning, we might look for something cheaper than copying the entire mtx

Collaborator (Author):

I removed the copy and renamed the method argument to mtx, since it now uses the array directly. In addition, mtx.astype(object) has been moved back into the "encoding_mode" block.

Should I rename the _compute_var_metrics argument as well?
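If I read the change right, it now looks roughly like this (sketch based on the quoted diff; surrounding code assumed):

mtx = adata.X if layer is None else adata.layers[layer]

if "encoding_mode" in adata.var:
    # astype(object) already returns a new array, so the separate
    # deepcopy of mtx is no longer needed
    mtx = mtx.astype(object)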

@eroell (Collaborator) commented Feb 7, 2025

As of now, explicit_impute raises a NotImplementedError for dask solely because the required function from quality control is not yet available, right?
I see that you, @aGuyLearning, have improved explicit_impute to support dask once the quality control block is out of the way 👍
I'm looking into the quality control right now; there is some potential to daskify further.
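For reference, the replacement step itself can stay lazy under dask; a sketch (the helper name _replace_explicit_dask is made up):

import dask.array as da

def _replace_explicit_dask(arr: da.Array, replacement) -> da.Array:
    # da.where builds the result lazily, chunk by chunk, without
    # materializing the full array
    return da.where(da.isnan(arr), replacement, arr)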

categorical_indices = np.ndarray([0], dtype=int)
mtx = copy.deepcopy(arr.astype(object))
Collaborator (Author):

Previously, the array was copied inside the for loop of the "encoding_mode" block. This version passes all tests and requires less compute. Am I misunderstanding something, or is this fine?

@eroell (Collaborator) commented Feb 7, 2025

Daskification of qc_metrics.
Profiling for an AnnData of shape 30,000 × 10,000 reduces peak memory consumption from 2 GB to 350 MB.

numpy

import anndata as ad
import numpy as np
import ehrapy as ep

rng = np.random.default_rng(42)
X_np = rng.random((30_000, 1_000))
X_np[X_np <= 0.1] = np.nan

adata_np = ad.AnnData(X_np)
df = ep.pp.qc_metrics(adata_np)[0]
[memory profile plot: numpy run, peak ≈ 2 GB]

dask

import anndata as ad
import dask.array as da
import numpy as np

import ehrapy as ep

X_dask = da.random.random(size=(30_000, 1_000), chunks=(1000, 1000))
X_dask[X_dask <= 0.1] = np.nan

adata_da = ad.AnnData(X_dask)
df = ep.pp.qc_metrics(adata_da)[0]
[memory profile plot: dask run, peak ≈ 350 MB]
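The saving comes from dask evaluating the metrics chunk by chunk instead of holding the full matrix in memory. For anyone reproducing the numbers, peak usage can be approximated with tracemalloc (a sketch; tracemalloc only sees allocations routed through Python's tracker, which numpy's are):

import tracemalloc

tracemalloc.start()
df = ep.pp.qc_metrics(adata_np)[0]
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak: {peak / 1e6:.0f} MB")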

explicit_impute(impute_num_adata, replacement=1011, copy=True)


@pytest.mark.parametrize(
Collaborator:

Yes, that would be great! If we want to check da.from_array, passing chunks=1000 by default helps, since dask complains in some cases when da.from_array is called on arrays whose dtype is simply object. Would you be interested in adding such a fixture, @aGuyLearning?
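A possible shape for that fixture (names and parametrization are suggestions, not the final API):

import dask.array as da
import numpy as np
import pytest

@pytest.fixture(
    params=[
        np.asarray,
        lambda x: da.from_array(np.asarray(x), chunks=1000),
    ],
    ids=["numpy", "dask"],
)
def array_type(request):
    # chunks=1000 sidesteps dask's chunk inference, which can fail
    # for plain object-dtype inputs
    return request.param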

"array_type, expected_error",
[
(np.array, None),
# (da.array, NotImplementedError),
Collaborator:

With the new fixture mentioned above, the dask test could be activated! It will most likely fail if da.from_array is called with no chunks argument specified.

"array_type, expected_error",
[
(np.array, None),
# (da.array, NotImplementedError),
@eroell (Collaborator) commented Feb 7, 2025:

Here, too, the test could be activated for dask once the fixture specifying the chunks argument is available.

@eroell (Collaborator) left a review:

See comments; this is coming along very well!

@Zethson (Member) left a review:

Cool work!

@@ -19,6 +21,13 @@
if TYPE_CHECKING:
from anndata import AnnData

try:
Member:

We now have this check in every file. Should it be done once, somewhere central? Probably just do what scanpy does. If they also check in every file, fine.
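One option, assuming we follow the single-module pattern (module name hypothetical):

# ehrapy/_compat.py: one place for the optional dask import
try:
    import dask.array as da

    DASK_AVAILABLE = True
except ImportError:
    da = None
    DASK_AVAILABLE = False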

@@ -46,6 +48,11 @@ def _base_check_imputation(
Raises:
AssertionError: If any of the checks fail.
"""
# if .x of the AnnData is a dask array, convert it to a numpy array
Member:

Suggested change:
- # if .x of the AnnData is a dask array, convert it to a numpy array
+ # if .X of the AnnData is a dask array, convert it to a numpy array

Member:

But I can see that in the code. Can you write WHY you're doing this, please? That would be much more useful.
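Something along these lines, assuming the conversion exists so the assertions can stay simple (DaskArray stands for whatever dask-array check the helper already uses):

# Materialize dask arrays up front so the element-wise assertions below
# can rely on plain numpy indexing and comparison semantics
if isinstance(adata.X, DaskArray):
    adata.X = adata.X.compute()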

Development

Successfully merging this pull request may close these issues.

Compatibility of ep.pp.explicit_impute with different datatypes