score_genes doesn’t produce the expected number of bins of equal or approximately equal size #3168

Closed
2 of 3 tasks
flying-sheep opened this issue Jul 26, 2024 · 1 comment
Labels
Bug 🐛 Needs info❔ More information needed

flying-sheep commented Jul 26, 2024

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

(extracted from #3167)

When scoring the murine hematopoietic progenitors from Nestorowa et al., 2016, with the gene list in regev_lab_cell_cycle_genes.txt, one issue becomes apparent.

The current code doesn’t produce the expected number of bins of equal or approximately equal size: with n_bins = 25, bin 24 is empty.
[Image: current_hist]

The current ranking code within score_genes():

n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
obs_cut = obs_avg.rank(method="min") // n_items
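
To make the empty bin concrete, here is a minimal sketch that runs the two lines above on synthetic data. The 1000 distinct values in obs_avg are an assumption standing in for the real per-gene averages (this is not the Nestorowa data), and the exact bin sizes depend on the number of genes.

```python
import numpy as np
import pandas as pd

n_bins = 25

# Assumption: 1000 distinct synthetic values stand in for the per-gene averages.
obs_avg = pd.Series(np.random.default_rng(0).permutation(1000).astype(float))

# The two lines quoted from score_genes().
n_items = int(np.round(len(obs_avg) / (n_bins - 1)))  # round(1000 / 24) = 42
obs_cut = obs_avg.rank(method="min") // n_items

# Count how many genes land in each bin label.
bin_sizes = obs_cut.value_counts().sort_index()

print("non-empty bins:", len(bin_sizes))        # 24, not 25
print("largest bin label:", obs_cut.max())      # 23.0, so label 24 is never used
print("first bin size:", bin_sizes.iloc[0])     # 41 (ranks start at 1, not 0)
print("last bin size:", bin_sizes.iloc[-1])     # 35
```

With this input the divisor uses n_bins - 1, so the labels only reach 23, and the first and last bins are smaller than the 42-gene bins in between.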

The modified code in #3167:

obs_avg.sort_values(ascending=True, inplace=True)
n_items = int(np.ceil(len(obs_avg) / (n_bins)))
rank = np.repeat(np.arange(n_bins), n_items)[:len(obs_avg)]
obs_cut = pd.Series(rank, index=obs_avg.index)

The modified code performs as expected, producing 25 bins containing an approximately equal number of genes. The last bin can hold up to 24 fewer genes than the others when the total number of genes is not evenly divisible by 25.
[Image: modified_hist]
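
As a rough check of the modified binning, the sketch below wraps the lines from #3167 in a helper and counts the genes per bin. The helper name equal_size_bins and the 1010 synthetic values are assumptions for illustration only; sort_values is used without inplace=True so the caller's Series is left untouched.

```python
import numpy as np
import pandas as pd


def equal_size_bins(obs_avg: pd.Series, n_bins: int) -> pd.Series:
    """Hypothetical helper wrapping the binning lines proposed in #3167."""
    obs_avg = obs_avg.sort_values(ascending=True)
    n_items = int(np.ceil(len(obs_avg) / n_bins))
    # Label the sorted genes 0, 0, ..., 1, 1, ... and trim to the gene count.
    rank = np.repeat(np.arange(n_bins), n_items)[: len(obs_avg)]
    return pd.Series(rank, index=obs_avg.index)


# Assumption: 1010 distinct synthetic values, chosen so 25 does not divide evenly.
obs_avg = pd.Series(np.random.default_rng(0).permutation(1010).astype(float))
obs_cut = equal_size_bins(obs_avg, n_bins=25)

bin_sizes = obs_cut.value_counts().sort_index()
print("non-empty bins:", len(bin_sizes))      # 25
print("regular bin size:", bin_sizes.iloc[0])  # ceil(1010 / 25) = 41
print("last bin size:", bin_sizes.iloc[-1])    # 1010 - 24 * 41 = 26
```

Here all 25 labels are used, and only the last bin is short (by 15 genes for this input).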

Minimal code sample

TODO

Error output

No response

Versions

Python 3.10.12 
scanpy==1.10.2 anndata==0.10.8 umap==0.5.6 numpy==1.26.4 scipy==1.14.0 pandas==2.2.2 scikit-learn==1.5.0 statsmodels==0.14.2 igraph==0.11.6 louvain==0.8.2 pynndescent==0.5.13
@flying-sheep flying-sheep added this to the 1.10.3 milestone Jul 26, 2024
@flying-sheep flying-sheep self-assigned this Aug 8, 2024
@flying-sheep flying-sheep added the Needs info❔ More information needed label Aug 12, 2024
@flying-sheep flying-sheep modified the milestones: 1.10.3, 1.10.4 Sep 17, 2024
@flying-sheep
Member Author

See discussion in #3167
