initial suggestions of array type handling on example of normalization methods #835

Draft
wants to merge 9 commits into base: main

Conversation

eroell
Collaborator

@eroell eroell commented Dec 4, 2024

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked: Fixes Array type handling: Normalization #834
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes

This should serve as an example for discussion and iteration on one case (the normalization suite), showing how rolling out consistent data type support could look.

This should become an example of how Step 1 in #829 could look, and be used as a guide to implement the rest.

Technical details

Inspired by e.g. scanpy's scale, but with less performance engineering; the focus here is on leveraging frameworks such as dask-ml.

Additional context

@eroell eroell changed the title from "initial suggestions of array type checks on example of scale_norm" to "initial suggestions of array type handling on example of scale_norm" Dec 4, 2024
@eroell eroell changed the title from "initial suggestions of array type handling on example of scale_norm" to "initial suggestions of array type handling on example of normalization methods" Dec 4, 2024
import dask_ml.preprocessing as daskml_pp

DASK_AVAILABLE = True
Collaborator Author

What is your take on such a flag?
Could it be worth introducing dask as a dependency, now that we are trying to introduce it systematically?
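
For reference, the flag from the snippet above would typically sit in an optional-import guard along these lines (a sketch, not necessarily the PR's exact code):

try:
    import dask_ml.preprocessing as daskml_pp

    DASK_AVAILABLE = True
except ImportError:
    daskml_pp = None
    DASK_AVAILABLE = False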

Member

anndata[dask], I guess, to be compatible; but fine with me.

Collaborator Author

Yes, good point. If dask becomes a dependency, then as anndata[dask].

Member

I wonder whether we should have that flag here, though, and not somewhere more global. Otherwise we need that check often. If we check globally, however, we might import dask when it's not needed, which is a performance penalty.


Depends on how slow dask is to import. This varies wildly depending on the project.

E.g. AnnData is a little slow because of pandas. In scanpy we lazy-import some sklearn subpackage because that would have a huge impact, and without it we’re only slower than anndata because of sklearn.utils and numba.
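
As an illustration of the lazy-import option (not code from this PR; the function name is made up): the dask-ml import can be pushed into the implementation that needs it, so users who never pass dask arrays never pay the import cost.

def _scale_norm_dask(arr):
    # lazy import: dask-ml is loaded only when a dask array actually reaches this code path
    import dask_ml.preprocessing as daskml_pp

    return daskml_pp.StandardScaler().fit_transform(arr)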

@@ -69,6 +75,23 @@ def _scale_func_group(
return None


@singledispatch
Collaborator Author

What do you think about systematically introducing singledispatch in this manner? It might be a bit of overkill here, but I think that in the long run this structure reduces code complexity by establishing a regularly recurring pattern.
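
For concreteness, a minimal sketch of the pattern being proposed (names and the exact scaling calls are illustrative, not the PR's code):

from functools import singledispatch

import dask.array as da
import numpy as np
from sklearn.preprocessing import StandardScaler


@singledispatch
def _scale_norm_function(arr):
    # fallback for any type without a registered implementation
    raise NotImplementedError(f"scale_norm does not support data of type {type(arr)}")


@_scale_norm_function.register
def _(arr: np.ndarray):
    # in-memory arrays go through scikit-learn
    return StandardScaler().fit_transform(arr)


@_scale_norm_function.register
def _(arr: da.Array):
    # dask arrays are delegated to dask-ml
    import dask_ml.preprocessing as daskml_pp

    return daskml_pp.StandardScaler().fit_transform(arr)

In the real module, the da.Array registration would presumably be guarded by the DASK_AVAILABLE flag discussed above; every additional normalization method then follows the same register-per-type layout.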

Member

Yes, I agree with your argument

@@ -87,12 +89,35 @@ def test_vars_checks(adata_to_norm):
ep.pp.scale_norm(adata_to_norm, vars=["String1"])


@pytest.mark.parametrize("array_type", ARRAY_TYPES)
def test_norm_scale(array_type, adata_to_norm):
    # TODO: list the supported array types centrally?
Collaborator Author

@eroell eroell Dec 6, 2024

These two lists are not used right now. Should we have a central lookup for which function works with which array type?

Right now, the "ground truth" is in the parameterization of the test_<normalization-method>_array_types tests.

Member

Should we have a central lookup for which function works with which array type?

wdym with lookup here?

Collaborator Author

For example, a supported_array_types dict or dataframe in e.g. ehrapy.core._supported_array_types.py:

supported_array_types = {
    "ep.pp.scale_norm": {"np.ndarray": True, "dask.array.Array": True, "sp.sparse.spmatrix": False},
    ...
}

Could be used in the test parametrization, instead of writing them up individually all the time.

But I have not seen something like that anywhere before, and there might be good reasons not to do this.

Member

@Zethson Zethson Dec 9, 2024

from functools import singledispatch

@singledispatch
def func(arg):
    pass

print(func.registry.keys())

You can construct that dynamically. Maybe you can use that to parameterize the tests etc?
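
A sketch of how the registry could feed the test parametrization (the import path and test name are assumptions, and the real tests would still need converters rather than bare types):

import pytest

from ehrapy.preprocessing._normalization import _scale_norm_function  # assumed location of the dispatch base

# every registered type except the object fallback
SCALE_NORM_SUPPORTED_TYPES = [t for t in _scale_norm_function.registry if t is not object]


@pytest.mark.parametrize("array_type", SCALE_NORM_SUPPORTED_TYPES)
def test_norm_scale_supported_types(array_type, adata_to_norm):
    ...  # convert adata_to_norm.X to array_type, run ep.pp.scale_norm, assert the result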

Collaborator Author

Very good point. In favor of using dynamically constructed registries indeed.



# TODO: check this for each function, with just default settings?
@pytest.mark.parametrize(
Collaborator Author

I picked three array types to be tested strictly: np.ndarray, dask.array.Array, scipy.sparse.spmatrix. I am not sure if we should test for more. I personally favor being strict about these three and not guaranteeing anything else.
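
For context, the ARRAY_TYPES used in the test parametrization could be a small tuple of converters for exactly these three types (a guess, not the PR's actual definition):

import dask.array as da
import numpy as np
import scipy.sparse as sp

# one converter per array type the suite commits to supporting
ARRAY_TYPES = (
    np.asarray,
    lambda x: da.from_array(np.asarray(x)),
    sp.csr_matrix,
)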

Member

Yeah totally fine. cupy arrays are for the future

@eroell
Collaborator Author

eroell commented Dec 6, 2024

Very happy about your input on any of the questions I raised in the comments @flying-sheep :)

@eroell eroell requested a review from nicolassidoux December 6, 2024 17:10
@eroell
Collaborator Author

eroell commented Dec 6, 2024

Also interested in what @nicolassidoux thinks here! :)

Member

@Zethson Zethson left a comment

  1. You added lots of NotImplementedError messages. I wonder whether we can DRY this more easily?
  2. Those error messages should also always suggest which types are supported for that function, as a sort of solution.
  3. adata_to_norm_casted = adata_to_norm.copy(): why do we always need those copies? Doesn't the fixture already create new copies?

@@ -113,6 +133,23 @@ def scale_norm(
)


@singledispatch
def _minmax_norm_function(arr):
    raise NotImplementedError(f"minmax_norm does not support data to be of type {type(arr)}")
Member

Ideally we should always suggest which types are supported.


you can automate this by accessing _minmax_norm_function.registry.keys()
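
A sketch of how that could be centralized (the helper name is made up, not from the PR):

def _raise_array_type_not_implemented(func, arr_type):
    # every registered type except the object fallback counts as supported
    supported = [t.__name__ for t in func.registry if t is not object]
    raise NotImplementedError(
        f"{func.__name__} does not support data of type {arr_type.__name__}. "
        f"Supported types are: {', '.join(supported)}."
    )

The @singledispatch fallbacks could then all call _raise_array_type_not_implemented(_minmax_norm_function, type(arr)), which would also address the DRY point from the review above.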


@eroell
Collaborator Author

eroell commented Dec 9, 2024

You added lots of NotImplementedError messages. I wonder whether we can DRY this more easily?

Yes!

Those error messages should also always suggest which types are supported for that function, as a sort of solution.

Good point, the dynamic registries help here again.

adata_to_norm_casted = adata_to_norm.copy(): why do we always need those copies? Doesn't the fixture already create new copies?

Because I just wasn't paying attention to this yet; I'll remove :)

@github-actions github-actions bot added the chore label Dec 12, 2024