Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix for scanpy.tl.score_genes #3167

Closed
wants to merge 3 commits into from
Closed

Conversation

gmvaccaro
Copy link

@gmvaccaro gmvaccaro commented Jul 25, 2024

The score_genes procedure currently uses a ranking system to split genes into bins of similar expression levels. The current approach fails with some datasets.

  • Release notes not necessary because:

changes the ranking system so number of genes in each bin succeeds when data is scaled and when there are many zeros in dataset.
Copy link

codecov bot commented Jul 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.54%. Comparing base (62454de) to head (51f2f5f).
Report is 39 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3167   +/-   ##
=======================================
  Coverage   76.54%   76.54%           
=======================================
  Files         109      109           
  Lines       12490    12492    +2     
=======================================
+ Hits         9560     9562    +2     
  Misses       2930     2930           
Files with missing lines Coverage Δ
src/scanpy/tools/_score_genes.py 86.40% <100.00%> (+0.26%) ⬆️

@flying-sheep
Copy link
Member

Hi, that certainly seems like an improvement! I took the liberty to split out two bugs: #3168 and #3169

We had a similar fix #2875, which is hidden behind a ctrl_as_ref flag.

I think since that fix is not yet released, we should rename the flag and unify both fixes behind it.

Could you please

  1. check if issue score_genes fails completely when the gene set has zero expression in some cells #3169 is already fixed by installing scanpy’s git version and setting ctrl_as_ref=True

  2. Our backwards compatibility means that changes to the scoring need to be optional. This is why tests/test_score_genes.py::test_score_with_reference is failing here, and that’s what ctrl_as_ref is for.

    So since that option is not yet released and in order to fix the test, we should probably change that option to incorporate both improvements. We can rename it to reflect the two things it does.

@gmvaccaro
Copy link
Author

Hi,

Thank you!

  1. I've installed scanpy's git version and set ctrl_as_ref=True with the same results, it does not fix score_genes fails completely when the gene set has zero expression in some cells #3169

  2. I am not entirely sure what you mean here, would you mind elaborating? This bug fix is specific to the way score_genes() ranks the genes into bins & when there's a lot of zero expression in cells, I'm a bit unsure how this connects to ctrl_as_ref and what you would like me to do.

Thanks again

@flying-sheep
Copy link
Member

flying-sheep commented Aug 2, 2024

Sure, I’ll happily elaborate!

As you can see, our test suite is failing. That’s because we have a test tests/test_score_genes.py::test_score_with_reference which checks if the scores emitted by the functions are exactly equal to older versions of the function. We have that test because we’d like people to be able to rely on consistent calculations.

Now of course we’d also like to fix things eventually, so we’d implement your fix behind an option (so people have to opt-in to the changes). Eventually, we’d switch to the new behavior (likely scanpy 2.0).

That’s why I propose to rename the ctrl_as_ref option and use it to gate both the change it already affects as well as your change.


Another thing: We need to test that this works as intended. Can you create a reproducer with built-in datasets (or synthetic data that you create using numpy) that would show the degraded binning behavior with the old behavior? We could then add a test like this:

def test_score_genes():
    adata = TODO  # create test data here
    gene_list = TODO
    gene_pool = TODO
    gene_list, gene_pool, get_subset = _check_score_genes_args(
        adata, gene_list, gene_pool, use_raw=use_raw, layer=layer
    )

    bins = list(_score_genes_bins(
        gene_list,
        gene_pool,
        ctrl_as_ref=False,  # needs to be renamed
        ctrl_size=50,
        n_bins=25,
        get_subset=get_subset,
    ))

    assert 0 not in map(len, bins)

@flying-sheep flying-sheep modified the milestones: 1.10.3, 1.10.4 Sep 17, 2024
@flying-sheep
Copy link
Member

Hi @gmvaccaro, it would be awesome if you can give us some code to reproduce the problem!

Originally, you said

Utilizing the murine hematopoietic progenitors from Nestorowa et al., 2016, as well as the regev_lab_cell_cycle_genes.txt, one issue is apparent.

but what code would reproduce the problem with the zero-sized bin from these data?

@flying-sheep
Copy link
Member

From the data you used, I assume you tried this, but I don’t get an empty bin:

https://gist.github.com/flying-sheep/b2ae449ab70a9358e07a82f284de5dca

@gmvaccaro
Copy link
Author

Hi! Here is the source code with data Nestorowa et al., 2016 with added diagnostic tests to produce these results.

https://github.com/gmvaccaro/scanpy.tl.score_genes_fix/blob/main/score_genes_diagnostics_tests2.ipynb

@flying-sheep
Copy link
Member

Wonderful, thank you!

@flying-sheep
Copy link
Member

flying-sheep commented Oct 15, 2024

Hm, I adapted your reproducer to use scanpy 1.10.3’s code and it doesn’t seem to be an issue: https://gist.github.com/flying-sheep/b2ae449ab70a9358e07a82f284de5dca#file-score_genes_diagnostics_tests2-ipynb

I’m going to assume that this is fixed in 1.10.3. If you can reproduce it with 1.10.3, we can reopen it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants