[BUG] umap needs init_pos = 'random' for larger datasets #196
Comments
Please try setting 'init_pos="random"'. This usually works for me.
Yep, that seems to resolve it. Known issue with spectral init?
I don't know, actually. I have run into this too for very large datasets (1M cells or more). I will forward this to the RAPIDS team at NVIDIA.
I updated the title so people know how to fix it. I'll leave this open for now. I'll also update UMAP so that it has a new default, auto, that sets init_pos to random if you have more than 1M cells.
The issue remains when I set init_pos = 'random'.
There will be a further update with rapids-24.10. I hope this will completely eliminate the problem.
Thanks for your quick reply!
Hi, has this been updated?
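For readers following the thread, the "auto" default described in the comments can be sketched roughly as below. This is only an illustration of the rule as stated in the comment (random initialization above 1M cells), not the code that ships in rapids_singlecell. The helper name `resolve_init_pos`, the `threshold` parameter, and the "spectral" fallback are assumptions made for this example.

```python
def resolve_init_pos(init_pos: str, n_obs: int, threshold: int = 1_000_000) -> str:
    """Hypothetical helper: pick the UMAP initialization for a dataset.

    Mirrors the rule described in the thread: with init_pos="auto",
    use "random" when there are more than `threshold` cells.
    The "spectral" fallback is an assumption for this sketch.
    """
    if init_pos != "auto":
        # An explicit user choice is passed through unchanged.
        return init_pos
    return "random" if n_obs > threshold else "spectral"


if __name__ == "__main__":
    print(resolve_init_pos("auto", 2_500_000))
    print(resolve_init_pos("auto", 50_000))
```

The resolved value would then be passed on as the init_pos argument of tl.umap. As one commenter notes, setting init_pos to random did not remove the outliers in every case, so this is a workaround rather than a guaranteed fix.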
Describe the bug
rapids_singlecell.tl.umap will occasionally generate embeddings with spuriously large values (e.g., a small number of points with very large coordinates).
Details: I start with a slice of the Tabula Muris data and follow the standard workflow for generating a UMAP embedding: filter, highly_variable_genes, pca, neighbors, umap. When I use the Scanpy CPU implementation of UMAP, things look as expected. However, if I use the RSC tl.umap, I get embeddings in which a small number of points have very large values (i.e., they appear as small, very dispersed clusters). This occurs regardless of which implementation I use for the other steps (e.g., PCA); only the call to umap appears to vary in its behavior.
Example plot using scanpy.tl.umap:
Example plot using rapids_singlecell.tl.umap:
Running numpy.histogram on the embedding shows the effect.
Scanpy generated umap, points are well distributed:
RSC generated umap, small number of outliers:
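A check along these lines can be sketched with NumPy alone. The function names, the bin count, and the median/MAD cutoff of `k=10` are assumptions chosen for illustration, not part of the original report, and the synthetic array stands in for the real embedding (which, in the usual Scanpy convention, would be read from `adata.obsm["X_umap"]`).

```python
import numpy as np


def axis_histograms(embedding, bins=10):
    """Return (counts, edges) from numpy.histogram for each embedding axis."""
    embedding = np.asarray(embedding, dtype=float)
    return [np.histogram(embedding[:, axis], bins=bins)
            for axis in range(embedding.shape[1])]


def count_outliers(embedding, k=10.0):
    """Count points lying more than k median absolute deviations from the
    per-axis median on at least one axis (an illustrative cutoff)."""
    embedding = np.asarray(embedding, dtype=float)
    median = np.median(embedding, axis=0)
    deviation = np.abs(embedding - median)
    mad = np.median(deviation, axis=0)
    mad = np.where(mad == 0, 1.0, mad)  # avoid a zero threshold on constant axes
    return int((deviation > k * mad).any(axis=1).sum())


if __name__ == "__main__":
    # Synthetic stand-in: a compact cloud plus a few far-away points.
    rng = np.random.default_rng(0)
    bulk = rng.normal(size=(1000, 2))
    stray = np.full((5, 2), 1e4)
    embedding = np.vstack([bulk, stray])

    for axis, (counts, edges) in enumerate(axis_histograms(embedding)):
        print(f"axis {axis}: counts={counts.tolist()}")
    print("outliers:", count_outliers(embedding))
```

With a well-behaved embedding the histogram counts spread across the bins. With a handful of extreme points, nearly all counts fall into a single bin, which matches the pattern described above.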
I have found that this issue only occurs with specific data. For example, some slices of the above dataset generate good embeddings, others do not. For datasets where it fails, it fails reliably (so it appears to be something in the dataset or parameterization that triggers the bug).
When this occurs, the actual embedding in the central mass of points looks good. It is only the additional outliers which are problematic.
Steps/Code to reproduce bug
Installed on Linux (popos/ubuntu), using mamba and the rsc_rapids_24.04.yml recipe. I have a sample dataset that reliably fails and which I can make available.
Test code:
Environment details (please complete the following information):
pip list