Bug fix for scanpy.tl.score_genes #3167

gmvaccaro · 2024-07-25T20:54:41Z

The score_genes procedure currently uses a ranking system to split genes into bins of similar expression levels. The current approach fails with some datasets.

Currently the code doesn’t produce the expected number of bins of equal or approximately equal size, see score_genes doesn’t produce the expected number of bins of equal or approximately equal size #3168
The code fails completely when the gene set has zero expression in some cells, see score_genes fails completely when the gene set has zero expression in some cells #3169

Closes score_genes doesn’t produce the expected number of bins of equal or approximately equal size #3168, closes score_genes fails completely when the gene set has zero expression in some cells #3169
Tests included or not required because:

Release notes not necessary because:

changes the ranking system so number of genes in each bin succeeds when data is scaled and when there are many zeros in dataset.

codecov · 2024-07-25T21:12:02Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.54%. Comparing base (62454de) to head (51f2f5f).
Report is 39 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3167   +/-   ##
=======================================
  Coverage   76.54%   76.54%           
=======================================
  Files         109      109           
  Lines       12490    12492    +2     
=======================================
+ Hits         9560     9562    +2     
  Misses       2930     2930

Files with missing lines	Coverage Δ
src/scanpy/tools/_score_genes.py	`86.40% <100.00%> (+0.26%)`	⬆️

flying-sheep · 2024-07-26T08:16:45Z

Hi, that certainly seems like an improvement! I took the liberty to split out two bugs: #3168 and #3169

We had a similar fix #2875, which is hidden behind a ctrl_as_ref flag.

I think since that fix is not yet released, we should rename the flag and unify both fixes behind it.

Could you please

check if issue score_genes fails completely when the gene set has zero expression in some cells #3169 is already fixed by installing scanpy’s git version and setting ctrl_as_ref=True
Our backwards compatibility means that changes to the scoring need to be optional. This is why tests/test_score_genes.py::test_score_with_reference is failing here, and that’s what ctrl_as_ref is for.

So since that option is not yet released and in order to fix the test, we should probably change that option to incorporate both improvements. We can rename it to reflect the two things it does.

for more information, see https://pre-commit.ci

gmvaccaro · 2024-08-01T19:15:42Z

Hi,

Thank you!

I've installed scanpy's git version and set ctrl_as_ref=True with the same results, it does not fix score_genes fails completely when the gene set has zero expression in some cells #3169
I am not entirely sure what you mean here, would you mind elaborating? This bug fix is specific to the way score_genes() ranks the genes into bins & when there's a lot of zero expression in cells, I'm a bit unsure how this connects to ctrl_as_ref and what you would like me to do.

Thanks again

flying-sheep · 2024-08-02T07:33:53Z

Sure, I’ll happily elaborate!

As you can see, our test suite is failing. That’s because we have a test tests/test_score_genes.py::test_score_with_reference which checks if the scores emitted by the functions are exactly equal to older versions of the function. We have that test because we’d like people to be able to rely on consistent calculations.

Now of course we’d also like to fix things eventually, so we’d implement your fix behind an option (so people have to opt-in to the changes). Eventually, we’d switch to the new behavior (likely scanpy 2.0).

That’s why I propose to rename the ctrl_as_ref option and use it to gate both the change it already affects as well as your change.

Another thing: We need to test that this works as intended. Can you create a reproducer with built-in datasets (or synthetic data that you create using numpy) that would show the degraded binning behavior with the old behavior? We could then add a test like this:

def test_score_genes():
    adata = TODO  # create test data here
    gene_list = TODO
    gene_pool = TODO
    gene_list, gene_pool, get_subset = _check_score_genes_args(
        adata, gene_list, gene_pool, use_raw=use_raw, layer=layer
    )

    bins = list(_score_genes_bins(
        gene_list,
        gene_pool,
        ctrl_as_ref=False,  # needs to be renamed
        ctrl_size=50,
        n_bins=25,
        get_subset=get_subset,
    ))

    assert 0 not in map(len, bins)

flying-sheep · 2024-09-23T15:10:13Z

Hi @gmvaccaro, it would be awesome if you can give us some code to reproduce the problem!

Originally, you said

Utilizing the murine hematopoietic progenitors from Nestorowa et al., 2016, as well as the regev_lab_cell_cycle_genes.txt, one issue is apparent.

but what code would reproduce the problem with the zero-sized bin from these data?

flying-sheep · 2024-09-24T10:08:13Z

From the data you used, I assume you tried this, but I don’t get an empty bin:

https://gist.github.com/flying-sheep/b2ae449ab70a9358e07a82f284de5dca

gmvaccaro · 2024-09-30T19:28:46Z

Hi! Here is the source code with data Nestorowa et al., 2016 with added diagnostic tests to produce these results.

https://github.com/gmvaccaro/scanpy.tl.score_genes_fix/blob/main/score_genes_diagnostics_tests2.ipynb

flying-sheep · 2024-10-01T08:32:05Z

Wonderful, thank you!

flying-sheep · 2024-10-15T14:57:52Z

Hm, I adapted your reproducer to use scanpy 1.10.3’s code and it doesn’t seem to be an issue: https://gist.github.com/flying-sheep/b2ae449ab70a9358e07a82f284de5dca#file-score_genes_diagnostics_tests2-ipynb

I’m going to assume that this is fixed in 1.10.3. If you can reproduce it with 1.10.3, we can reopen it!

fix_score_genes.py

d99991e

changes the ranking system so number of genes in each bin succeeds when data is scaled and when there are many zeros in dataset.

This was referenced Jul 26, 2024

score_genes doesn’t produce the expected number of bins of equal or approximately equal size #3168

Closed

score_genes fails completely when the gene set has zero expression in some cells #3169

Closed

flying-sheep added this to the 1.10.3 milestone Jul 26, 2024

flying-sheep and others added 2 commits July 26, 2024 12:59

Merge branch 'main' into patch-1

9aa2f83

[pre-commit.ci] auto fixes from pre-commit.com hooks

51f2f5f

for more information, see https://pre-commit.ci

flying-sheep modified the milestones: 1.10.3, 1.10.4 Sep 17, 2024

flying-sheep closed this Oct 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug fix for scanpy.tl.score_genes #3167

Bug fix for scanpy.tl.score_genes #3167

gmvaccaro commented Jul 25, 2024 •

edited by flying-sheep

Loading

codecov bot commented Jul 25, 2024 •

edited

Loading

flying-sheep commented Jul 26, 2024

gmvaccaro commented Aug 1, 2024

flying-sheep commented Aug 2, 2024 •

edited

Loading

flying-sheep commented Sep 23, 2024

flying-sheep commented Sep 24, 2024

gmvaccaro commented Sep 30, 2024

flying-sheep commented Oct 1, 2024

flying-sheep commented Oct 15, 2024 •

edited

Loading

Bug fix for scanpy.tl.score_genes #3167

Bug fix for scanpy.tl.score_genes #3167

Conversation

gmvaccaro commented Jul 25, 2024 • edited by flying-sheep Loading

codecov bot commented Jul 25, 2024 • edited Loading

Codecov Report

flying-sheep commented Jul 26, 2024

gmvaccaro commented Aug 1, 2024

flying-sheep commented Aug 2, 2024 • edited Loading

flying-sheep commented Sep 23, 2024

flying-sheep commented Sep 24, 2024

gmvaccaro commented Sep 30, 2024

flying-sheep commented Oct 1, 2024

flying-sheep commented Oct 15, 2024 • edited Loading

gmvaccaro commented Jul 25, 2024 •

edited by flying-sheep

Loading

codecov bot commented Jul 25, 2024 •

edited

Loading

flying-sheep commented Aug 2, 2024 •

edited

Loading

flying-sheep commented Oct 15, 2024 •

edited

Loading