score_genes doesn’t produce the expected number of bins of equal or approximately equal size #3168

Closed
@flying-sheep

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

(extracted from #3167)

Using the murine hematopoietic progenitor dataset from Nestorowa et al., 2016, together with regev_lab_cell_cycle_genes.txt, one issue is apparent.

Currently the code doesn’t produce the expected number of bins of equal or approximately equal size. Bin 24 is empty when n_bins = 25.
[Figure: current_hist]

The current binning code within score_genes():

n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
obs_cut = obs_avg.rank(method="min") // n_items
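
To make the behaviour easier to reproduce, here is a minimal sketch that runs the current logic on synthetic data. The Series of 1001 distinct values is only a hypothetical stand-in for obs_avg (it is not the Nestorowa data), and the names n_genes and bin_sizes are introduced here for illustration.

```python
import numpy as np
import pandas as pd

n_bins = 25
n_genes = 1001  # hypothetical gene count, chosen only for illustration

# Synthetic stand-in for obs_avg: distinct values, so ranks are exactly 1..n_genes
rng = np.random.default_rng(0)
obs_avg = pd.Series(
    rng.permutation(n_genes).astype(float),
    index=[f"gene_{i}" for i in range(n_genes)],
)

# Current logic, copied from the snippet above
n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
obs_cut = obs_avg.rank(method="min") // n_items

# Genes per bin, including bins that received no genes
bin_sizes = obs_cut.astype(int).value_counts().reindex(range(n_bins), fill_value=0)
print(bin_sizes.to_string())
print("non-empty bins:", int((bin_sizes > 0).sum()))
```

With these inputs n_items is 42 and the ranks start at 1, so the first bin gets 41 genes and the largest rank floor-divided by 42 is 23. Only bins 0 to 23 are populated and bin 24 stays empty.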

The modified code in #3167:

obs_avg.sort_values(ascending=True, inplace=True)
n_items = int(np.ceil(len(obs_avg) / (n_bins)))
rank = np.repeat(np.arange(n_bins), n_items)[:len(obs_avg)]
obs_cut = pd.Series(rank, index=obs_avg.index)

The modified code performs as expected, producing 25 bins that each contain approximately the same number of genes. When the total number of genes is not evenly divisible by 25, the last bin can hold up to 24 fewer genes than the others.
[Figure: modified_hist]
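
The same kind of check can be sketched for the modified logic. Again, the 1001 synthetic values are a hypothetical stand-in for obs_avg, picked so that the gene count is not divisible by 25. The names n_genes and bin_sizes are assumptions for illustration, and sort_values is used without inplace=True here only to keep the example free of side effects.

```python
import numpy as np
import pandas as pd

n_bins = 25
n_genes = 1001  # hypothetical gene count, not divisible by n_bins

# Synthetic stand-in for obs_avg with distinct values
rng = np.random.default_rng(0)
obs_avg = pd.Series(
    rng.permutation(n_genes).astype(float),
    index=[f"gene_{i}" for i in range(n_genes)],
)

# Modified logic from #3167: sort, then assign bin labels in equal-sized runs
obs_avg = obs_avg.sort_values(ascending=True)
n_items = int(np.ceil(len(obs_avg) / n_bins))
rank = np.repeat(np.arange(n_bins), n_items)[: len(obs_avg)]
obs_cut = pd.Series(rank, index=obs_avg.index)

# Genes per bin, including bins that received no genes
bin_sizes = obs_cut.value_counts().reindex(range(n_bins), fill_value=0)
print(bin_sizes.to_string())
```

Here n_items is 41, so bins 0 to 23 hold 41 genes each and bin 24 holds the remaining 17, which is the worst case of 24 fewer than a full bin.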

Minimal code sample

TODO

Error output

No response

Versions

Python 3.10.12 
scanpy==1.10.2 anndata==0.10.8 umap==0.5.6 numpy==1.26.4 scipy==1.14.0 pandas==2.2.2 scikit-learn==1.5.0 statsmodels==0.14.2 igraph==0.11.6 louvain==0.8.2 pynndescent==0.5.13
