-
Notifications
You must be signed in to change notification settings - Fork 600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug fix for scanpy.tl.score_genes #3167
Conversation
changes the ranking system so number of genes in each bin succeeds when data is scaled and when there are many zeros in dataset.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3167 +/- ##
=======================================
Coverage 76.54% 76.54%
=======================================
Files 109 109
Lines 12490 12492 +2
=======================================
+ Hits 9560 9562 +2
Misses 2930 2930
|
Hi, that certainly seems like an improvement! I took the liberty to split out two bugs: #3168 and #3169 We had a similar fix #2875, which is hidden behind a I think since that fix is not yet released, we should rename the flag and unify both fixes behind it. Could you please
|
for more information, see https://pre-commit.ci
Hi, Thank you!
Thanks again |
Sure, I’ll happily elaborate! As you can see, our test suite is failing. That’s because we have a test Now of course we’d also like to fix things eventually, so we’d implement your fix behind an option (so people have to opt-in to the changes). Eventually, we’d switch to the new behavior (likely scanpy 2.0). That’s why I propose to rename the Another thing: We need to test that this works as intended. Can you create a reproducer with built-in datasets (or synthetic data that you create using def test_score_genes():
adata = TODO # create test data here
gene_list = TODO
gene_pool = TODO
gene_list, gene_pool, get_subset = _check_score_genes_args(
adata, gene_list, gene_pool, use_raw=use_raw, layer=layer
)
bins = list(_score_genes_bins(
gene_list,
gene_pool,
ctrl_as_ref=False, # needs to be renamed
ctrl_size=50,
n_bins=25,
get_subset=get_subset,
))
assert 0 not in map(len, bins) |
Hi @gmvaccaro, it would be awesome if you can give us some code to reproduce the problem! Originally, you said
but what code would reproduce the problem with the zero-sized bin from these data? |
From the data you used, I assume you tried this, but I don’t get an empty bin: https://gist.github.com/flying-sheep/b2ae449ab70a9358e07a82f284de5dca |
Hi! Here is the source code with data Nestorowa et al., 2016 with added diagnostic tests to produce these results. |
Wonderful, thank you! |
Hm, I adapted your reproducer to use scanpy 1.10.3’s code and it doesn’t seem to be an issue: https://gist.github.com/flying-sheep/b2ae449ab70a9358e07a82f284de5dca#file-score_genes_diagnostics_tests2-ipynb I’m going to assume that this is fixed in 1.10.3. If you can reproduce it with 1.10.3, we can reopen it! |
The score_genes procedure currently uses a ranking system to split genes into bins of similar expression levels. The current approach fails with some datasets.