
Datatype Support in Quality Control and Impute #865

Draft
wants to merge 9 commits into main

Conversation

@aGuyLearning (Collaborator) commented Feb 5, 2025

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes
As discussed in #861, this refactors the quality control functions to use single dispatch for future datatype support.

To this end, the explicit_impute function has already been reworked. Closes #848

Technical details
This pull request introduces significant changes to the ehrapy preprocessing module, focusing on imputation and quality control. The main changes are support for dask arrays, the use of singledispatch for function overloading, and improved test coverage for different array types.
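Roughly, the dispatch pattern looks like this (a minimal sketch; the helper name _compute_missing_values and its signature are illustrative, not the final API):

from functools import singledispatch

import numpy as np

@singledispatch
def _compute_missing_values(mtx, axis=0):
    # fallback for array types we have no implementation for
    raise NotImplementedError(f"Unsupported array type: {type(mtx)}")

@_compute_missing_values.register
def _(mtx: np.ndarray, axis=0):
    # eager numpy implementation
    return np.isnan(mtx).sum(axis=axis)

try:
    import dask.array as da

    @_compute_missing_values.register
    def _(mtx: da.Array, axis=0):
        # lazy dask implementation, evaluated chunk by chunk
        return da.isnan(mtx).sum(axis=axis).compute()
except ImportError:
    pass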

Possible enhancements
Let's discuss the changes, as I am not yet fully happy with them.

  • adata.X datatype checker wrapper for the parent function, plus more restrictive datatypes (a possible shape is sketched below)
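For the checker wrapper, something like this could sit in front of the public functions (hypothetical sketch; all names are placeholders):

from functools import wraps

import numpy as np

SUPPORTED_TYPES = (np.ndarray,)  # extend with da.Array, scipy.sparse, ...

def _check_adata_x_type(func):
    # fail fast with a clear error before any singledispatch fallback fires
    @wraps(func)
    def wrapper(adata, *args, **kwargs):
        if not isinstance(adata.X, SUPPORTED_TYPES):
            raise NotImplementedError(
                f"{func.__name__} does not support adata.X of type {type(adata.X)}"
            )
        return func(adata, *args, **kwargs)
    return wrapper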

@aGuyLearning aGuyLearning requested a review from eroell February 5, 2025 16:32
@aGuyLearning aGuyLearning self-assigned this Feb 5, 2025
@aGuyLearning aGuyLearning linked an issue Feb 5, 2025 that may be closed by this pull request
@aGuyLearning aGuyLearning marked this pull request as draft February 5, 2025 16:38
obs_metrics = pd.DataFrame(index=adata.obs_names)
var_metrics = pd.DataFrame(index=adata.var_names)
mtx = adata.X if layer is None else adata.layers[layer]
mtx = copy.deepcopy(arr.astype(object))
Collaborator:

Is this deepcopy because mtx gets written onto if "encoding_mode" in adata.var? Or is there something else you spotted here requiring this?

Collaborator:

From a first glance I think mtx was written to if "encoding_mode" in adata.var, which would be a bug. If this is also your reason here @aGuyLearning, we might look for something cheaper than copying the entire mtx

Collaborator (Author):

I removed the copy and renamed the method argument to mtx, since it now uses the array directly. In addition, mtx.astype(object) has been moved back into the "encoding_mode" block.

Should I rename the _compute_var_metrics argument as well?
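If I read the change right, it now looks roughly like this (sketch based on the quoted diff; surrounding code assumed):

mtx = adata.X if layer is None else adata.layers[layer]

if "encoding_mode" in adata.var:
    # astype(object) already returns a new array, so the separate
    # deepcopy of mtx is no longer needed
    mtx = mtx.astype(object)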

@eroell (Collaborator) commented Feb 7, 2025

As of now, explicit_impute raises a NotImplementedError for dask solely because the required function from quality control is not yet available, right?
I see that you, @aGuyLearning, have improved explicit_impute to support dask once the quality control block is out of the way 👍
I'm looking into the quality control right now; there is some potential to daskify further.
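For reference, the replacement step itself can stay lazy under dask; a sketch (the helper name _replace_explicit_dask is made up):

import dask.array as da

def _replace_explicit_dask(arr: da.Array, replacement) -> da.Array:
    # da.where builds the result lazily, chunk by chunk, without
    # materializing the full array
    return da.where(da.isnan(arr), replacement, arr)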

categorical_indices = np.ndarray([0], dtype=int)
mtx = copy.deepcopy(arr.astype(object))
Collaborator (Author):

Previously, the array was copied inside the for loop of the "encoding_mode" block. This version passes all tests and requires less compute. Am I misunderstanding something, or is this fine?

@eroell (Collaborator) commented Feb 7, 2025

Daskification of qc_metrics.
Profiling for an AnnData of shape 30,000 × 10,000 reduces peak memory consumption from 2 GB to 350 MB.

numpy

import anndata as ad
import numpy as np
import ehrapy as ep

rng = np.random.default_rng(42)
X_np = rng.random((30_000, 1_000))
X_np[X_np <= 0.1] = np.nan

adata_np = ad.AnnData(X_np)
df = ep.pp.qc_metrics(adata_np)[0]
[memory profile plot: numpy run, peak ≈ 2 GB]

dask

import anndata as ad
import dask.array as da
import numpy as np

import ehrapy as ep

X_dask = da.random.random(size=(30_000, 1_000), chunks=(1000, 1000))
X_dask[X_dask <= 0.1] = np.nan

adata_da = ad.AnnData(X_dask)
df = ep.pp.qc_metrics(adata_da)[0]
[memory profile plot: dask run, peak ≈ 350 MB]
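The saving comes from dask evaluating the metrics chunk by chunk instead of holding the full matrix in memory. For anyone reproducing the numbers, peak usage can be approximated with tracemalloc (a sketch; tracemalloc only sees allocations routed through Python's tracker, which numpy's are):

import tracemalloc

tracemalloc.start()
df = ep.pp.qc_metrics(adata_np)[0]
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak: {peak / 1e6:.0f} MB")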

explicit_impute(impute_num_adata, replacement=1011, copy=True)


@pytest.mark.parametrize(
Collaborator:

Yes, that would be great! If we want to check da.from_array, passing chunks=1000 by default helps, since dask complains in some cases when da.from_array is called on arrays whose dtype is simply object. Would you be interested in adding such a fixture, @aGuyLearning?
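A possible shape for that fixture (names and parametrization are suggestions, not the final API):

import dask.array as da
import numpy as np
import pytest

@pytest.fixture(
    params=[
        np.asarray,
        lambda x: da.from_array(np.asarray(x), chunks=1000),
    ],
    ids=["numpy", "dask"],
)
def array_type(request):
    # chunks=1000 sidesteps dask's chunk inference, which can fail
    # for plain object-dtype inputs
    return request.param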

"array_type, expected_error",
[
(np.array, None),
# (da.array, NotImplementedError),
Collaborator:

With the new fixture mentioned above, the dask test could be activated! It will most likely fail if da.from_array is called with no chunks argument specified.

"array_type, expected_error",
[
(np.array, None),
# (da.array, NotImplementedError),
@eroell (Collaborator) commented Feb 7, 2025:

Here, too, the test could be activated for dask once the fixture specifying the chunks argument is available.

@eroell (Collaborator) left a review:

See comments; this is coming along very well!

@Zethson (Member) left a review:

Cool work!

@@ -19,6 +21,13 @@
if TYPE_CHECKING:
from anndata import AnnData

try:
Member:

We now have this check in every file. Should it be done once, somewhere central? Probably just do what scanpy does. If they also check in every file, fine.
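One option, assuming we follow the single-module pattern (module name hypothetical):

# ehrapy/_compat.py: one place for the optional dask import
try:
    import dask.array as da

    DASK_AVAILABLE = True
except ImportError:
    da = None
    DASK_AVAILABLE = False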

@@ -46,6 +48,11 @@ def _base_check_imputation(
Raises:
AssertionError: If any of the checks fail.
"""
# if .x of the AnnData is a dask array, convert it to a numpy array
Member:

Suggested change:
- # if .x of the AnnData is a dask array, convert it to a numpy array
+ # if .X of the AnnData is a dask array, convert it to a numpy array

Member:

But I can see that in the code. Can you write WHY you're doing this, please? That would be much more useful.
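Something along these lines, assuming the conversion exists so the assertions can stay simple (DaskArray stands for whatever dask-array check the helper already uses):

# Materialize dask arrays up front so the element-wise assertions below
# can rely on plain numpy indexing and comparison semantics
if isinstance(adata.X, DaskArray):
    adata.X = adata.X.compute()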

Development

Successfully merging this pull request may close these issues.

Compatibility of ep.pp.explicit_impute with different datatypes