initial suggestions of array type handling on example of normalization methods #835

Draft
wants to merge 9 commits into base: main

Conversation

eroell
Collaborator

@eroell eroell commented Dec 4, 2024

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked: Fixes Array type handling: Normalization #834
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes

This should serve as an example for discussion and iteration on one case (the normalization suite), showing how rolling out consistent data type support could look.

This should become an example of how Step 1 in #829 could look, and be used as a guide to implement the rest.

Technical details

Inspired by e.g. scanpy's scale, but with less performance engineering; the focus here is on leveraging frameworks such as dask-ml.

Additional context

@eroell eroell changed the title from "initial suggestions of array type checks on example of scale_norm" to "initial suggestions of array type handling on example of scale_norm" Dec 4, 2024
@eroell eroell changed the title from "initial suggestions of array type handling on example of scale_norm" to "initial suggestions of array type handling on example of normalization methods" Dec 4, 2024
import dask_ml.preprocessing as daskml_pp

DASK_AVAILABLE = True
Collaborator Author

What is your take on such a flag?
Could it be worth introducing dask as a dependency, now that we are trying to introduce it systematically?
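
For reference, the flag from the snippet above would typically sit in an optional-import guard along these lines (a sketch, not necessarily the PR's exact code):

try:
    import dask_ml.preprocessing as daskml_pp

    DASK_AVAILABLE = True
except ImportError:
    daskml_pp = None
    DASK_AVAILABLE = False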

Member

anndata[dask], I guess, to be compatible; but fine with me.

Collaborator Author

Yes, good point. If dask becomes a dependency, then as anndata[dask].

Member

I wonder whether we should have that flag here, though, and not somewhere more global. Otherwise we need that check often. If we check globally, however, we might import dask when it's not needed, which is a performance penalty.


Depends on how slow dask is to import. This varies wildly depending on the project.

E.g. AnnData is a little slow because of pandas. In scanpy we lazy-import some sklearn subpackage because that would have a huge impact, and without it we’re only slower than anndata because of sklearn.utils and numba.
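
As an illustration of the lazy-import option (not code from this PR; the function name is made up): the dask-ml import can be pushed into the implementation that needs it, so users who never pass dask arrays never pay the import cost.

def _scale_norm_dask(arr):
    # lazy import: dask-ml is loaded only when a dask array actually reaches this code path
    import dask_ml.preprocessing as daskml_pp

    return daskml_pp.StandardScaler().fit_transform(arr)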

@@ -69,6 +75,23 @@ def _scale_func_group(
return None


@singledispatch
Collaborator Author

What do you think about systematically introducing singledispatch in this manner? It might be a bit of overkill here, but I think that in the long run this structure reduces code complexity by establishing a regularly recurring pattern.
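
For concreteness, a minimal sketch of the pattern being proposed (names and the exact scaling calls are illustrative, not the PR's code):

from functools import singledispatch

import dask.array as da
import numpy as np
from sklearn.preprocessing import StandardScaler


@singledispatch
def _scale_norm_function(arr):
    # fallback for any type without a registered implementation
    raise NotImplementedError(f"scale_norm does not support data of type {type(arr)}")


@_scale_norm_function.register
def _(arr: np.ndarray):
    # in-memory arrays go through scikit-learn
    return StandardScaler().fit_transform(arr)


@_scale_norm_function.register
def _(arr: da.Array):
    # dask arrays are delegated to dask-ml
    import dask_ml.preprocessing as daskml_pp

    return daskml_pp.StandardScaler().fit_transform(arr)

In the real module, the da.Array registration would presumably be guarded by the DASK_AVAILABLE flag discussed above; every additional normalization method then follows the same register-per-type layout.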

Member

Yes, I agree with your argument

@@ -87,12 +89,35 @@ def test_vars_checks(adata_to_norm):
ep.pp.scale_norm(adata_to_norm, vars=["String1"])


@pytest.mark.parametrize("array_type", ARRAY_TYPES)
def test_norm_scale(array_type, adata_to_norm):
    # TODO: list the supported array types centrally?
Collaborator Author

@eroell eroell Dec 6, 2024

These two lists are not used right now. Should we have a central lookup for which function works with which array type?

Right now, the "ground truth" is in the parameterization of the test_<normalization-method>_array_types tests.

Member

Should we have a central lookup for which function works with which array type?

wdym with lookup here?

Collaborator Author

For example, a supported_array_types dict or dataframe in e.g. ehrapy.core._supported_array_types.py:

supported_array_types = {
    "ep.pp.scale_norm": {"np.ndarray": True, "dask.array.Array": True, "sp.sparse.spmatrix": False},
    ...
}

Could be used in the test parametrization, instead of writing them up individually all the time.

But I have not seen something like that anywhere before, and there might be good reasons not to do this.

Member

@Zethson Zethson Dec 9, 2024

from functools import singledispatch

@singledispatch
def func(arg):
    pass

print(func.registry.keys())

You can construct that dynamically. Maybe you can use that to parameterize the tests etc?
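
A sketch of how the registry could feed the test parametrization (the import path and test name are assumptions, and the real tests would still need converters rather than bare types):

import pytest

from ehrapy.preprocessing._normalization import _scale_norm_function  # assumed location of the dispatch base

# every registered type except the object fallback
SCALE_NORM_SUPPORTED_TYPES = [t for t in _scale_norm_function.registry if t is not object]


@pytest.mark.parametrize("array_type", SCALE_NORM_SUPPORTED_TYPES)
def test_norm_scale_supported_types(array_type, adata_to_norm):
    ...  # convert adata_to_norm.X to array_type, run ep.pp.scale_norm, assert the result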

Collaborator Author

Very good point. In favor of using dynamically constructed registries indeed.



# TODO: check this for each function, with just default settings?
@pytest.mark.parametrize(
Collaborator Author

I picked three array types to be tested strictly: np.ndarray, dask.array.Array, scipy.sparse.spmatrix. I am not sure if we should test for more. I personally favor being strict about these three and not guaranteeing anything else.
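
For context, the ARRAY_TYPES used in the test parametrization could be a small tuple of converters for exactly these three types (a guess, not the PR's actual definition):

import dask.array as da
import numpy as np
import scipy.sparse as sp

# one converter per array type the suite commits to supporting
ARRAY_TYPES = (
    np.asarray,
    lambda x: da.from_array(np.asarray(x)),
    sp.csr_matrix,
)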

Member

Yeah totally fine. cupy arrays are for the future

@eroell
Collaborator Author

eroell commented Dec 6, 2024

Very happy about your input on any of the questions I raised in the comments @flying-sheep :)

@eroell eroell requested a review from nicolassidoux December 6, 2024 17:10
@eroell
Collaborator Author

eroell commented Dec 6, 2024

Also interested in what @nicolassidoux thinks here! :)

Member

@Zethson Zethson left a comment

  1. You added lots of NotImplementedError messages. I wonder whether we can DRY this more easily?
  2. Those error messages should also always suggest which types are supported for that function, as a sort of solution.
  3. adata_to_norm_casted = adata_to_norm.copy(): why do we always need those copies? Doesn't the fixture already create new copies?

@@ -113,6 +133,23 @@ def scale_norm(
)


@singledispatch
def _minmax_norm_function(arr):
    raise NotImplementedError(f"minmax_norm does not support data to be of type {type(arr)}")
Member

Ideally we should always suggest which types are supported.


you can automate this by accessing _minmax_norm_function.registry.keys()
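
A sketch of how that could be centralized (the helper name is made up, not from the PR):

def _raise_array_type_not_implemented(func, arr_type):
    # every registered type except the object fallback counts as supported
    supported = [t.__name__ for t in func.registry if t is not object]
    raise NotImplementedError(
        f"{func.__name__} does not support data of type {arr_type.__name__}. "
        f"Supported types are: {', '.join(supported)}."
    )

The @singledispatch fallbacks could then all call _raise_array_type_not_implemented(_minmax_norm_function, type(arr)), which would also address the DRY point from the review above.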


@eroell
Collaborator Author

eroell commented Dec 9, 2024

You added lots of NotImplementedError messages. I wonder whether we can DRY this more easily?

Yes!

Those error messages should also always suggest which types are supported for that function, as a sort of solution.

Good point, the dynamic registries help here again.

adata_to_norm_casted = adata_to_norm.copy(): why do we always need those copies? Doesn't the fixture already create new copies?

Because I just wasn't paying attention to this yet; I'll remove :)

@github-actions github-actions bot added the chore label Dec 12, 2024