Does LSI require all peaks for optimal results, especially when dealing with ultra-scale datasets? #108
Comments
Hi @YH-Zheng. Thanks for your interest in GLUE! Yes, empirically speaking, LSI does work better when more peaks are included. Selecting highly variable peaks in the LSI step would likely result in lower cell type resolution. If the number of cells is too large, one solution might be to obtain the loading matrix from subsampled cells and apply it to all cells, which is exactly what we did with the human fetal atlas integration.
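A minimal numpy sketch of this subsample-then-project idea (this is not GLUE's actual implementation; the TF-IDF transform, subsample size, and component count here are illustrative, and real ATAC matrices would be sparse):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a cell-by-peak count matrix.
X = rng.poisson(0.3, size=(1000, 50)).astype(float)

def tfidf(X):
    # A simple TF-IDF transform of the kind commonly applied before LSI on ATAC data.
    tf = X / np.maximum(X.sum(axis=1, keepdims=True), 1)
    idf = X.shape[0] / (1 + (X > 0).sum(axis=0))
    return tf * np.log(1 + idf)

# 1. Fit the loading matrix (right singular vectors) on a random subsample of cells.
sub = rng.choice(X.shape[0], size=200, replace=False)
Xs = tfidf(X[sub])
_, _, Vt = np.linalg.svd(Xs - Xs.mean(axis=0), full_matrices=False)
loadings = Vt[:20].T            # peaks x components

# 2. Project ALL cells with the same loadings.
Xa = tfidf(X)
embedding = (Xa - Xa.mean(axis=0)) @ loadings
print(embedding.shape)          # (1000, 20)
```

Because the SVD cost is driven by the subsampled matrix rather than the full one, this keeps all peaks in play while avoiding a decomposition over millions of cells.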
Thanks for your reply @Jeff1995. So you mean it would be better to include all peaks in the LSI step? Is the subsampling scheme you mentioned similar to the Metacell approach, where cell type labels are obtained at the metacell level and then propagated down to the constituent single cells?
I was thinking about randomly subsampling single cells, but using metacells would theoretically be a better choice, as there is less information loss. However, using metacells to obtain the LSI loading matrix might be a bit tricky and needs extra caution: the aggregated ATAC profile of metacells may deviate from the distribution of the underlying single cells, so a loading matrix fitted on metacells could be suboptimal when applied to single cells.
Hi @Jeff1995, I noticed in the code that you used downsampled data for the initial training of the GLUE model and treated it as pretraining for the entire dataset. Should I follow a similar approach, i.e., downsample by a certain ratio, train once, and save the results as pretraining input for the full dataset? I also attempted to use other computing frameworks to accelerate the LSI computation, such as the Mars framework. It performs well on small-scale data, but it seems challenging to create the required tensor from atac.X for the subsequent computations, so I had to abandon computing LSI over all ATAC peaks. I have successfully run LSI dimensionality reduction using peaks mapped from the RNA highly variable genes, but the final results seem somewhat mediocre. I would appreciate discussing the details further with you. Thanks a lot!
Sorry for the late reply! Regarding the first problem: the code in our experiment downsampled cells per organ to balance the organ distribution across modalities. You wouldn't need to do that unless you also have highly unbalanced cell types; simple random downsampling would work. As for the second problem: yes, I'd recommend pre-training the model on downsampled data if downsampling still retains a decent number of cells (say 10^4 cells), mainly because it saves time (you get the opportunity to check whether the model alignment is reasonable before tuning it on the whole dataset).
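The per-organ balancing described above can be sketched as a capped per-group draw; this is an illustrative numpy-only example (the labels, cap, and sizes are made up, not taken from the fetal atlas code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels standing in for per-cell organ / cell-type annotations.
labels = rng.choice(["heart", "lung", "liver"], size=100_000, p=[0.7, 0.2, 0.1])

def balanced_subsample(labels, per_group, rng):
    # Draw at most `per_group` cells from each label so that no group dominates.
    keep = []
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        n = min(per_group, idx.size)
        keep.append(rng.choice(idx, size=n, replace=False))
    return np.sort(np.concatenate(keep))

keep = balanced_subsample(labels, per_group=3000, rng=rng)
print(keep.size)                # 9000 (3 groups x 3000 cells each)
```

With balanced cell types, a plain `rng.choice(n_cells, size=10_000, replace=False)` achieves the same goal with less machinery.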
Hello, I currently have scATAC data with approximately 3.43 million cells and around 160,000 peaks. When I attempt LSI dimensionality reduction using all peaks, it takes an incredibly long time (seemingly more than a day; I eventually terminated it).
However, when I use guidance to map highly variable genes from RNA to ATAC, involving 15,868 highly variable peaks, LSI takes less time, and I successfully complete the model training. The final cell type transfer seems to work well, but when I visualize the merged ATAC and RNA, I notice that the cell subtypes aren't completely separated, unlike in the downsampled ATAC dataset. I wonder if this is due to the use of highly variable peaks.
As for training with RNA data, my dataset is also large. Currently, I'm employing random downsampling. Do you have any suggestions for handling such ultra-scale datasets?