The number of topics has reached 350, but the log-likelihood is still not optimal. #114
Hi @YH-Zheng. Indeed, choosing the correct number of topics can be tricky and subjective. I would not run models with a larger number of topics. In your case I would choose the model with 200 topics, since that is the point where most metrics are maximised. After selecting the model, you should check whether your topics represent your cell types well. Based on plotting cell-topic probabilities: do you have a topic that is specific for each cell type? Based on motif enrichment: are the regions in the topics enriched for the motifs you expect? All the best, Seppe
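The cell-type check described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not pycisTopic code: the function names `mean_topic_prob_by_cell_type` and `top_topic_per_cell_type` and the two input dictionaries are hypothetical, and it assumes the cell-topic probabilities and the cell type annotation have already been exported per cell barcode.

```python
from collections import defaultdict


def mean_topic_prob_by_cell_type(cell_topic, cell_types):
    """Average the topic probabilities of all cells sharing a label.

    cell_topic: dict mapping barcode -> list of topic probabilities (hypothetical input).
    cell_types: dict mapping barcode -> cell type label (hypothetical input).
    Returns a dict mapping label -> list of mean probabilities, one per topic.
    """
    sums = {}
    counts = defaultdict(int)
    for barcode, probs in cell_topic.items():
        label = cell_types[barcode]
        if label not in sums:
            sums[label] = [0.0] * len(probs)
        sums[label] = [s + p for s, p in zip(sums[label], probs)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def top_topic_per_cell_type(means):
    """Return, for each label, the index of the topic with the highest mean probability."""
    return {
        label: max(range(len(values)), key=lambda i: values[i])
        for label, values in means.items()
    }
```

If several cell types end up with the same top topic, that may be a hint that the model does not separate them well, although plotting the full cell-topic distribution remains the more informative check.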
Thanks for your reply. Do you mean that the appropriate number of topics is the one where all four metrics are as large as possible? Both of the metrics you mentioned in the tutorial (Arun_2010, Cao_Juan_2009) are ones where a better model gives a lower value.
If the chosen number of topics does not separate my ATAC data according to my cell type annotation, would it be better to divide the cells into subsets and run topic modeling separately on each (e.g., B cells, CD4 T cells, and the many smaller subsets within these large groups)? Or should I increase the number of topics? Best wishes, Yuhui
Hi @YH-Zheng. You are correct about those two metrics; however, for plotting them we invert their values (hence the "inv" prefix). I would not run topic modelling separately per cell type, because you need the background of the other cell types to be able to identify cell type specific regions. In that case I would indeed increase the number of topics. All the best, Seppe
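To make the "invert, then look for the maximum" idea concrete, here is a small sketch in plain Python. It is an assumption-laden illustration, not the library's implementation: the thread does not state which transformation is used for the inverted metrics, so this simply min-max scales each metric and takes `1 - scaled` for the lower-is-better ones. The names `min_max_scale` and `pick_n_topics` are hypothetical.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]; constant lists map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pick_n_topics(n_topics, metrics, lower_is_better=()):
    """Pick the topic count with the highest summed, rescaled metric score.

    n_topics: list of candidate topic counts.
    metrics: dict mapping metric name -> list of values aligned with n_topics.
    lower_is_better: names of metrics to invert (assumed inversion: 1 - scaled).
    """
    total = [0.0] * len(n_topics)
    for name, values in metrics.items():
        scaled = min_max_scale(values)
        if name in lower_is_better:
            # Flip the scale so that a higher score always means a better model.
            scaled = [1.0 - s for s in scaled]
        total = [t + s for t, s in zip(total, scaled)]
    best = max(range(len(n_topics)), key=lambda i: total[i])
    return n_topics[best]
```

A call would look like `pick_n_topics([100, 200, 350], metrics, lower_is_better={"Arun_2010", "Cao_Juan_2009"})`, where `metrics` holds the values from your own model evaluation. Summing the rescaled metrics with equal weight is only one of several reasonable choices; inspecting the plots by eye, as suggested above, is still advisable.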
I randomly sampled 1k cells from each cell type in a data set of 4 million cells and obtained an ATAC matrix of 55243 cells x 165804 peaks. However, when I run the topic models, the log-likelihood has still not reached its optimum when the number of topics reaches 350. Does this mean I need to increase the number of topics? 350 is already a large value compared with the example, so how do I pick the optimal number of topics?
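For reference, the per-cell-type subsampling described above could look something like the following. This is only a sketch in plain Python under stated assumptions: `sample_per_cell_type` and its `labels` input (a mapping from cell barcode to cell type) are hypothetical names, and cell types with fewer cells than the requested number are kept in full.

```python
import random
from collections import defaultdict


def sample_per_cell_type(labels, n_per_type=1000, seed=0):
    """Randomly pick at most n_per_type barcodes for each cell type.

    labels: dict mapping cell barcode -> cell type label (hypothetical input).
    Returns a list of the selected barcodes.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_type = defaultdict(list)
    for barcode, cell_type in labels.items():
        by_type[cell_type].append(barcode)

    selected = []
    for cell_type in sorted(by_type):
        barcodes = sorted(by_type[cell_type])
        k = min(n_per_type, len(barcodes))  # small cell types are kept whole
        selected.extend(rng.sample(barcodes, k))
    return selected
```

The selected barcodes can then be used to subset the count matrix before running the topic models.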