initial suggestions of array type handling on example of normalization methods #835
base: main
Conversation
import dask_ml.preprocessing as daskml_pp

DASK_AVAILABLE = True
What is your take on such a flag?
Could it be worth introducing dask as a dependency, now that we are trying to introduce it systematically?
anndata[dask], I guess, to stay compatible, but fine with me.
Yes, good point. If we add dask as a dependency, then as anndata[dask].
I wonder whether we should have that flag here, though, rather than somewhere more global. Otherwise we need that check often. If we check globally, though, we might import dask when it's not needed, which is a performance penalty.
Depends on how slow dask is to import. This varies wildly depending on the project. E.g. AnnData is a little slow because of pandas. In scanpy we lazy-import some sklearn subpackage because that would have a huge impact, and without it we're only slower than anndata because of sklearn.utils and numba.
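For reference, a minimal sketch of what such an availability flag typically looks like; the guard helper and the install hint below are illustrative assumptions, not the PR's actual code:

try:
    import dask.array as da  # noqa: F401
    import dask_ml.preprocessing as daskml_pp  # noqa: F401

    DASK_AVAILABLE = True
except ImportError:
    DASK_AVAILABLE = False


def _check_dask_available() -> None:
    # Raise a helpful error only at the point where dask support is actually needed.
    if not DASK_AVAILABLE:
        raise ImportError(
            "This functionality requires dask; install it e.g. via `pip install anndata[dask] dask-ml`."
        )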
@@ -69,6 +75,23 @@ def _scale_func_group(
    return None


@singledispatch
What do you think about systematically introducing singledispatch in this manner? It might be a bit of overkill here, but I think that in the long run this structure reduces code complexity by introducing a regularly appearing pattern.
Yes, I agree with your argument
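To make the pattern under discussion concrete, here is a hedged sketch of a singledispatch-based scale helper; the function names, the fallback message, and the dask-ml usage are illustrative assumptions rather than the PR's exact code:

from functools import singledispatch

import numpy as np
from sklearn.preprocessing import scale as sklearn_scale


@singledispatch
def _scale_norm_function(arr):
    # Fallback for unregistered types: fail loudly and list what is supported.
    supported = [t.__name__ for t in _scale_norm_function.registry if t is not object]
    raise NotImplementedError(
        f"scale_norm does not support data of type {type(arr)}; supported types: {supported}"
    )


@_scale_norm_function.register
def _(arr: np.ndarray):
    return sklearn_scale(arr)


try:
    import dask.array as da
    from dask_ml.preprocessing import StandardScaler as DaskStandardScaler

    @_scale_norm_function.register
    def _(arr: da.Array):
        return DaskStandardScaler().fit_transform(arr)

except ImportError:
    pass

The generic implementation doubles as the error path, and the dask branch is only registered when dask and dask-ml are importable.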
@@ -87,12 +89,35 @@ def test_vars_checks(adata_to_norm):
        ep.pp.scale_norm(adata_to_norm, vars=["String1"])


@pytest.mark.parametrize("array_type", ARRAY_TYPES)
def test_norm_scale(array_type, adata_to_norm):
    # TODO: list the supported array types centrally?
These two lists are not used right now. Should we have a central lookup for which function works with which array type? Right now, the "ground truth" is in the parameterization of the test_<normalization-method>_array_types tests.
Should we have a central lookup for which function works with which array type?
What do you mean by a lookup here?
For example, a supported_array_types dict or dataframe in e.g. ehrapy.core._supported_array_types.py:

supported_array_types = {
    "ep.pp.scale_norm": {"np.array": True, "dask.array.Array": True, "sp.sparse.spmatrix": False},
    ...
}

It could be used in the test parametrization, instead of writing the types up individually all the time.
But I have not seen something like that anywhere before, and there might be good reasons not to do this.
from functools import singledispatch

@singledispatch
def func(arg):
    pass

print(func.registry.keys())

You can construct that dynamically. Maybe you can use that to parameterize the tests etc.?
Very good point. In favor of using dynamically constructed registries indeed.
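A hedged sketch of how such a dynamically constructed registry could feed the test parametrization, assuming a dispatch function like the hypothetical _scale_norm_function sketched earlier (the test name and assertion are placeholders):

import pytest

# Every concretely registered type, skipping the generic `object` fallback.
SUPPORTED_SCALE_TYPES = [t for t in _scale_norm_function.registry if t is not object]


@pytest.mark.parametrize("array_type", SUPPORTED_SCALE_TYPES)
def test_scale_norm_array_types(array_type):
    # A real test would cast a fixture to `array_type` and call ep.pp.scale_norm;
    # the point here is only that the parametrization is derived from the registry
    # instead of a hand-maintained list.
    assert array_type in _scale_norm_function.registry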
# TODO: check this for each function, with just default settings?
@pytest.mark.parametrize(
I picked three array types to be tested strictly: np.ndarray, dask.array.Array, scipy.sparse.spmatrix. Not sure if we should test for more. I personally favor being strict about these three and not guaranteeing anything else.
Yeah totally fine. cupy arrays are for the future
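For illustration, that three-type test matrix could be written as a list of casting callables, roughly like the following; the names and chunking are assumptions, not the PR's actual fixtures:

import dask.array as da
import numpy as np
from scipy import sparse

# One casting callable per array type we guarantee support for.
ARRAY_TYPES = [
    np.asarray,
    lambda x: da.from_array(np.asarray(x), chunks=2),
    lambda x: sparse.csr_matrix(np.asarray(x)),
]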
Very happy about your input on any of the questions I raised in the comments, @flying-sheep :)
Also interested in what @nicolassidoux thinks here! :)
- You added lots of NotImplementedError messages. I wonder whether we can DRY this more easily?
- Those error messages should also always suggest which types are supported for that function, as a sort of solution.

adata_to_norm_casted = adata_to_norm.copy()

Why do we always need those copies? Doesn't the fixture already create new copies?
@@ -113,6 +133,23 @@ def scale_norm(
    )


@singledispatch
def _minmax_norm_function(arr):
    raise NotImplementedError(f"minmax_norm does not support data to be of type {type(arr)}")
Ideally we should always suggest which types are supported.
You can automate this by accessing _minmax_norm_function.registry.keys().
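A hedged sketch of that automation; it assumes the _minmax_norm_function dispatcher from the diff above, with only the message construction added:

from functools import singledispatch


@singledispatch
def _minmax_norm_function(arr):
    # List all concretely registered types, skipping the generic `object` fallback.
    supported = ", ".join(
        t.__name__ for t in _minmax_norm_function.registry if t is not object
    )
    raise NotImplementedError(
        f"minmax_norm does not support data of type {type(arr)}; supported types are: {supported}"
    )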
Yes! Good point, the dynamic registries help here again.

Because I was just not paying attention to this yet; I'll remove it :)
PR Checklist
- docs is updated

Description of changes
This should serve as an example for discussion and iteration on one case (the normalization suite), showing how rolling out consistent data type support could look.
It should become an example of how Step 1 in #829 could look, and be used as a guide for implementing the rest.
Technical details
Inspired by e.g. scanpy's scale, but with less performance engineering, instead focusing on leveraging frameworks such as dask-ml.
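As a rough illustration of that approach, a minmax normalization could dispatch to dask-ml for dask arrays and to sklearn for numpy arrays; the public wrapper and all names below are a sketch under these assumptions, not the PR's actual implementation:

from __future__ import annotations

from functools import singledispatch

import numpy as np
from anndata import AnnData
from sklearn.preprocessing import MinMaxScaler


@singledispatch
def _minmax_norm(arr):
    raise NotImplementedError(f"minmax_norm does not support data of type {type(arr)}")


@_minmax_norm.register
def _(arr: np.ndarray):
    return MinMaxScaler().fit_transform(arr)


try:
    import dask.array as da
    from dask_ml.preprocessing import MinMaxScaler as DaskMinMaxScaler

    @_minmax_norm.register
    def _(arr: da.Array):
        # dask-ml supplies the out-of-core implementation, so no custom
        # performance engineering is needed on ehrapy's side.
        return DaskMinMaxScaler().fit_transform(arr)

except ImportError:
    pass


def minmax_norm(adata: AnnData, copy: bool = False) -> AnnData | None:
    """Min-max normalize adata.X, dispatching on the underlying array type."""
    adata = adata.copy() if copy else adata
    adata.X = _minmax_norm(adata.X)
    return adata if copy else None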
Additional context