Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make all imputation methods consistent in regard to encoding requirements #827

Merged
merged 20 commits into from
Nov 25, 2024

Conversation

nicolassidoux
Copy link
Collaborator

See #824 for the details of the change.

Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very very much! This is great. I am a fan of simplifying the code (by removing semi important features such as the progress bars in general) and really think that generalizing the imputation tests is an awesome idea.

I have a few questions and concerns though:

  1. I am not a fan of adding more _utils_* modules. I argue that all and any utils modules are a code smell. It is never clear to developers what is actually in there and it should probably be somewhere else. Let's just have descriptive module names (or scripts) that better describe what can be found in there.
  2. I don't know how I feel about the removal of the imputation spinners. We had added them because some imputation functions can run for a looooong time and it just gives feedback that the code is still running. I know - it's not a progress bar but it's at least something? I don't have the strongest opinion on this but wonder what other people think @Imipenem @eroell
  3. Removing some docstrings about raised Errors is okay with me because the errors themselves can document the behavior when raised. The docstrings are more useful for other developers if they want to build on top because they don't have to check the whole code for any potential raised Errors.
  4. Instead of printing to stdout and returning a bool in _base_check_imputation I wonder whether this function should just raise errors? prints are not for errors - they are for informative messages. Pytest will resolve the stack and point to the part that raised the error.

ehrapy/_utils_available.py Outdated Show resolved Hide resolved
ehrapy/_utils_data.py Outdated Show resolved Hide resolved
ehrapy/_utils_data.py Outdated Show resolved Hide resolved
ehrapy/_utils_data.py Outdated Show resolved Hide resolved
ehrapy/_utils_data.py Outdated Show resolved Hide resolved
ehrapy/preprocessing/_imputation.py Outdated Show resolved Hide resolved
ehrapy/preprocessing/_imputation.py Outdated Show resolved Hide resolved
tests/preprocessing/test_imputation.py Outdated Show resolved Hide resolved
tests/preprocessing/test_imputation.py Outdated Show resolved Hide resolved
tests/preprocessing/test_imputation.py Outdated Show resolved Hide resolved
nicolassidoux and others added 3 commits November 25, 2024 09:40
@nicolassidoux
Copy link
Collaborator Author

nicolassidoux commented Nov 25, 2024

  1. I am not a fan of adding more _utils_* modules. I argue that all and any utils modules are a code smell. It is never clear to developers what is actually in there and it should probably be somewhere else. Let's just have descriptive module names (or scripts) that better describe what can be found in there.

Well, this is always a big topic. In a way you're right, having these utils is a hint something could be better organized. But it doesn't mean it is in our control. For example: _is_val_missing is a function that would better fit in AnnData itself. So I would consider those as "extensions" more than "utils".

  1. I don't know how I feel about the removal of the imputation spinners. We had added them because some imputation functions can run for a looooong time and it just gives feedback that the code is still running. I know - it's not a progress bar but it's at least something? I don't have the strongest opinion on this but wonder what other people think @Imipenem @eroell

I never saw it spinning 😆 But maybe it's an issue with my stdout.

  1. Removing some docstrings about raised Errors is okay with me because the errors themselves can document the behavior when raised. The docstrings are more useful for other developers if they want to build on top because they don't have to check the whole code for any potential raised Errors.

Again a bit topic 😄 As I said, it sounds like a good idea, but almost not maintenable in practice. From my experience.

  1. Instead of printing to stdout and returning a bool in _base_check_imputation I wonder whether this function should just raise errors? prints are not for errors - they are for informative messages. Pytest will resolve the stack and point to the part that raised the error.

This is what I would have done for production code, but usually for tests, I like that kind of output. But alright, I'll modify that.

@nicolassidoux
Copy link
Collaborator Author

Looks the Test / test (ubuntu-latest, 3.11, --pre) (pull_request) fails...

@eroell
Copy link
Collaborator

eroell commented Nov 25, 2024

Looks the Test / test (ubuntu-latest, 3.11, --pre) (pull_request) fails...

Nothing to do with us, its a dependency & they're already on it :)

@eroell
Copy link
Collaborator

eroell commented Nov 25, 2024

I don't know how I feel about the removal of the imputation spinners. We had added them because some imputation functions can run for a looooong time and it just gives feedback that the code is still running. I know - it's not a progress bar but it's at least something? I don't have the strongest opinion on this but wonder what other people think @Imipenem @eroell

I'm split... if we were to keep it, I'd move it to a decorator? That would make the functions way more readable, and reduce redundancy

I never saw it spinning 😆 But maybe it's an issue with my stdout.

You run usually scripts if I recall correctly? If you run it from a jupyter notebook, on this 6 dots on the image, you can see something like a snake walking around for the spinning display
image

Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example: _is_val_missing is a function that would better fit in AnnData itself. So I would consider those as "extensions" more than "utils".

Then @nicolassidoux they should be in https://github.com/theislab/ehrapy/blob/main/ehrapy/anndata/anndata_ext.py right? I still expect a lot of that to eventually make it into @eroell ehrdata but for now it should probably be there.

Could you please move the new functions?

I'm okay with the spinners being removed unless you deem them worthwhile now @nicolassidoux and want to write a decorator.

After the functions have been moved, I'd love to merge this.

Thank you very much!

ehrapy/preprocessing/_imputation.py Show resolved Hide resolved
ehrapy/_utils_data.py Outdated Show resolved Hide resolved
ehrapy/_utils_data.py Outdated Show resolved Hide resolved
ehrapy/preprocessing/_imputation.py Outdated Show resolved Hide resolved
ehrapy/preprocessing/_imputation.py Outdated Show resolved Hide resolved
ehrapy/_settings.py Show resolved Hide resolved
ehrapy/_utils_data.py Outdated Show resolved Hide resolved
@nicolassidoux
Copy link
Collaborator Author

@Zethson

Then @nicolassidoux they should be in https://github.com/theislab/ehrapy/blob/main/ehrapy/anndata/anndata_ext.py right? I still expect a lot of that to eventually make it into @eroell ehrdata but for now it should probably be there.

Oh well I should have noticed this file indeed 😄

I'm okay with the spinners being removed unless you deem them worthwhile now @nicolassidoux and want to write a decorator.

Let me assess how much work it is to make that done.

@nicolassidoux
Copy link
Collaborator Author

Hm of course all notebooks tests or almost fail because of yaspin missing. This spinner is nice because it works for both scripts and notebook. What do you guys think?

@Zethson
Copy link
Member

Zethson commented Nov 25, 2024

Rich is much much more popular and also used in packages such as pip itself. Therefore, users will have it anyways. That's why I think it'd be better if we stuck to Rich?

@nicolassidoux
Copy link
Collaborator Author

Ok, it doesn't work well on my stdout, but ok in notebooks.

@nicolassidoux
Copy link
Collaborator Author

@eroell @Zethson Ready for final review, I think

@eroell
Copy link
Collaborator

eroell commented Nov 25, 2024

Spinners work, cool. Thanks for making the decorator!

Nice to have these additional tests, aim of the PR is fulfilled.

I have one final point on the return type. Before, all imputation methods returned a new adata if copy=True, else None. This is the "scverse"/"scanpy" style behavior. miceforest was an (I suppose accidental) exception, which always returned the adata.

Right now, it has been changed to all imputation methods returning adata. I suggest to rather have all functions following the "scverse"/"scanpy" style instead.

Even if this could mean a small breaking change for users doing adata = ep.pp.mice_forest_impute(adata, copy=False), the fix being to just do ep.pp.mice_forest_impute(adata, copy=False), I think going consistent here now helps sharpening our API for the future.

If you agree with this or convince me otherwise, this looks ready to merge for me!

@nicolassidoux
Copy link
Collaborator Author

@eroell

I have one final point on the return type.

Fine for me, anyway I advocated for removing this copy thing, but that's a separate debate.
I'll modify that right now so the issue can be closed ASAP.

@nicolassidoux
Copy link
Collaborator Author

Done, I also noticed miss_forest_impute really needed a rewrite.

@nicolassidoux
Copy link
Collaborator Author

Hmm pre-commit doesn't like there is no return statement if copy=False. Should i modify like:

    ...
    if copy:
        return adata
    else:
        return

@Zethson
Copy link
Member

Zethson commented Nov 25, 2024

Hmm pre-commit doesn't like there is no return statement if copy=False. Should i modify like:

    ...
    if copy:
        return adata
    else:
        return

Also how scanpy does it: https://github.com/scverse/scanpy/blob/main/src/scanpy/tools/_dpt.py#L199C5-L199C35

@eroell
Copy link
Collaborator

eroell commented Nov 25, 2024

Thanks for the final polish here - Like your suggestion, or like scanpy does it, both fine!
Then this looks good to merge for me :)

@nicolassidoux
Copy link
Collaborator Author

Done!

@Zethson
Copy link
Member

Zethson commented Nov 25, 2024

The RTD build seems to time out because of dependencies somehow urgh

@Zethson
Copy link
Member

Zethson commented Nov 25, 2024

I'll merge this now and I'll take a look myself soon.

@Zethson Zethson merged commit 67fedbf into main Nov 25, 2024
15 of 17 checks passed
@Zethson Zethson deleted the enhancement/issue-824 branch November 25, 2024 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make all imputation methods consistent in regard to encoding requirements
3 participants