Datatype Support in Quality Control and Impute #865
Conversation
…and improve readability
obs_metrics = pd.DataFrame(index=adata.obs_names)
var_metrics = pd.DataFrame(index=adata.var_names)
mtx = adata.X if layer is None else adata.layers[layer]
mtx = copy.deepcopy(arr.astype(object))
Is this deepcopy because mtx gets written onto if `"encoding_mode" in adata.var`? Or is there something else you spotted here requiring this?
From a first glance I think mtx was written to if `"encoding_mode" in adata.var`, which would be a bug. If this is also your reason here @aGuyLearning, we might look for something cheaper than copying the entire mtx.
I removed the copy and renamed the method argument to mtx, as it will directly use the array. In addition, the `mtx.astype(object)` has been moved back into the "encoding_mode" block.
Should I rename the _compute_var_metrics argument as well? As of now,
categorical_indices = np.ndarray([0], dtype=int)
mtx = copy.deepcopy(arr.astype(object))
Previously it copied the array in the for loop of the "encoding_mode" block. This passes all tests and requires less compute. Am I misunderstanding something, or is this fine?
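For context, the cheaper alternative under discussion is to cast to object dtype only for the encoded columns, rather than deep-copying the whole matrix up front. A minimal sketch, with hypothetical names (`count_missing`, `encoded_cols`) that are not ehrapy's actual internals:

```python
import numpy as np

def count_missing(mtx: np.ndarray, encoded_cols: set) -> np.ndarray:
    """Count missing entries per row, casting to object dtype only for
    the encoded columns instead of deep-copying the entire matrix."""
    out = np.zeros(mtx.shape[0], dtype=int)
    for j in range(mtx.shape[1]):
        col = mtx[:, j]
        if j in encoded_cols:
            # only this column is materialized as object; the rest stays numeric
            col = col.astype(object)
            out += np.array([v is None or v != v for v in col], dtype=int)
        else:
            out += np.isnan(col.astype(float)).astype(int)
    return out
```

The per-column cast keeps memory proportional to one column rather than the full matrix.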
explicit_impute(impute_num_adata, replacement=1011, copy=True)

@pytest.mark.parametrize(
Yes, that would be great! If we want to check `da.from_array`, passing the argument `chunks=1000` by default helps, as dask complains in some cases when `da.from_array` is called and the dtype is simply object. Would you be interested in adding such a fixture @aGuyLearning?
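Such a fixture might look like the following sketch; the names (`as_numpy`, `as_dask`, `array_type`) and the guarded dask import are assumptions, not the project's actual test setup:

```python
import numpy as np
import pytest

try:
    import dask.array as da
except ImportError:  # keep the tests runnable without dask installed
    da = None

def as_numpy(x):
    return np.asarray(x)

def as_dask(x):
    # an explicit chunks=1000 sidesteps dask's complaints about
    # automatic chunking when the dtype is plain object
    return da.from_array(np.asarray(x), chunks=1000)

ARRAY_TYPES = [as_numpy] + ([as_dask] if da is not None else [])

@pytest.fixture(params=ARRAY_TYPES)
def array_type(request):
    """Convert test data into each supported array type in turn."""
    return request.param
```

Tests then take `array_type` as an argument and run once per array backend.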
"array_type, expected_error",
[
    (np.array, None),
    # (da.array, NotImplementedError),
With the new fixture mentioned above, the dask test could be activated! It will most likely fail if just calling `da.from_array` with no `chunks` argument specified.
"array_type, expected_error",
[
    (np.array, None),
    # (da.array, NotImplementedError),
Here, too, the test could be activated for dask once the fixture specifying the `chunks` argument is available.
See comments; this is coming along very well!
Cool work!
@@ -19,6 +21,13 @@
if TYPE_CHECKING:
    from anndata import AnnData

try:
We have this check in every file now. Should this be done once somewhere else? Probably just do what scanpy does; if they also check in every file, fine.
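One common way to centralize the check is a shared compatibility module that every file imports. This is a sketch under that assumption; the module name (`_compat`) and helper are hypothetical, not ehrapy's actual layout:

```python
# Hypothetical shared module (e.g. _compat.py), imported everywhere
# instead of repeating the try/except in each file.
try:
    import dask.array as da

    DASK_AVAILABLE = True
except ImportError:
    da = None
    DASK_AVAILABLE = False

def is_dask_array(x) -> bool:
    """True only if dask is installed and x is a dask array."""
    return DASK_AVAILABLE and isinstance(x, da.Array)
```

Call sites then use `is_dask_array(adata.X)` without touching the import machinery themselves.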
@@ -46,6 +48,11 @@ def _base_check_imputation(
Raises:
    AssertionError: If any of the checks fail.
"""
# if .x of the AnnData is a dask array, convert it to a numpy array
- # if .x of the AnnData is a dask array, convert it to a numpy array
+ # if .X of the AnnData is a dask array, convert it to a numpy array
But I can see this in the code. Can you write WHY you're doing this, please? That would be much more useful.
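For illustration, the conversion could carry its rationale directly in the comment. A sketch of the idea, not the PR's actual helper:

```python
import numpy as np

def to_dense(x):
    # The imputation checks compare entries one by one; doing that lazily
    # would schedule one dask task per access, so we materialize the
    # array once up front instead.
    return x.compute() if hasattr(x, "compute") else np.asarray(x)
```

The `hasattr` duck-typing keeps the helper usable whether or not dask is installed.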
PR Checklist
- docs is updated

Description of changes
As discussed in #861, this refactors the quality control functions to use single dispatch for future datatype support. To this end, the explicit_impute function has already been reworked. Closes #848.
Technical details
This pull request introduces significant changes to the ehrapy preprocessing module, focusing on improving imputation and quality control functionalities. The main changes involve the addition of support for dask arrays, the use of singledispatch for function overloading, and the enhancement of test coverage for different array types.

Possible enhancements
Please let's discuss the changes, as I am not yet fully happy with them.