Increase robustness of cell type prioritization #30

Open
tkapello opened this issue Sep 13, 2023 · 5 comments
@tkapello

Hi,

Thank you for your interesting package. In my dataset of ~20,000 cells and 26,897 genes, the tool reported 20,000 unique significant genes. Does this make sense? The number seems relatively high (>60% of all genes). Apart from adjusting the trees, is there a way to tailor the analysis to increase the robustness of the genes used to prioritize cell types?

Best,
Theo

@skinnider
Collaborator

skinnider commented Sep 13, 2023

Hi Theo - I'm not entirely sure I understand your question. Augur doesn't test for statistical significance; it simply returns the feature importance from the random forest algorithm. There are a number of reasons to take these importances with a grain of salt, and if you are interested in identifying statistically significant differences, a conventional differential expression (DE) analysis as implemented in our Libra package might make more sense.

Beyond that, you can set augur_mode='velocity' to disable feature selection and use alternative feature selection methods, or at the expense of a longer runtime you can just run Augur on all genes.

@tkapello
Author

Sorry, I was not clear enough. I guess my question can be rephrased as: what does "feature importance" actually mean, and how can it be interpreted?

@skinnider
Collaborator

I am probably not going to give a better explanation than in the randomForest documentation. In Augur, the importance values are then averaged over repeated subsamples for each cell type. In general, I would recommend using the results of a DE analysis with Libra to identify genes that are changing between conditions within individual cell types, rather than relying on feature importance.
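For readers who want a concrete picture, one common definition is permutation importance: the drop in accuracy when a single feature's values are shuffled. The sketch below is a toy illustration of that idea only, not Augur's or randomForest's actual code; the data, the `predict` stand-in classifier, and the function names are all hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, feature, rng):
    """Drop in accuracy after shuffling one feature column."""
    baseline = accuracy(predict, rows, labels)
    column = [row[feature] for row in rows]
    rng.shuffle(column)
    shuffled = [row[:feature] + [v] + row[feature + 1:]
                for row, v in zip(rows, column)]
    return baseline - accuracy(predict, shuffled, labels)

rng = random.Random(0)

# Toy data (assumption): gene 0 separates the two conditions, gene 1 is noise.
labels = [i % 2 for i in range(200)]
rows = [[y + rng.gauss(0, 0.3), rng.gauss(0, 1)] for y in labels]

# Hypothetical stand-in for a trained classifier: thresholds gene 0 only.
def predict(row):
    return 1 if row[0] > 0.5 else 0

imp_gene0 = permutation_importance(predict, rows, labels, 0, rng)
imp_gene1 = permutation_importance(predict, rows, labels, 1, rng)
print(imp_gene0, imp_gene1)
```

Here the informative gene gets a clearly positive importance, while the gene the classifier ignores gets exactly zero. With a real model that has partly fit to noise, shuffling a feature can even improve accuracy by chance, which gives a slightly negative value.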

@tkapello
Author

Thank you again. I was wondering more about your interpretation of the usability of "feature importance" in cell type prioritization. For example, would 20,000 important features indicate higher robustness than 15,000 features? Or would you say there is a lower threshold of features that signals more confidence, e.g. "AUC=0.8 based on 18,000 important features" compared to "AUC=0.7 based on 10,000 important features"?

@skinnider
Collaborator

In general I would say I don't really factor this in and go solely by the AUC. Many subsamples of equal size are drawn for each cell type (50 by default), so the fact that 18,000 features have an assigned feature importance doesn't mean that all 18,000 were used by every classifier trained for that cell type. Feature importance can also be zero or negative, so just because a feature importance is assigned doesn't mean that gene is actually a feature that has a positive impact on classification.
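The point about counts can be sketched with a toy simulation. Everything here is an assumption made for illustration (the sizes, the random gene subsets per classifier, and the zero-centred importance values); it is not how Augur selects features or computes importances.

```python
import random
from collections import defaultdict

rng = random.Random(1)

# Toy sizes (assumptions, not Augur defaults except the 50 subsamples).
n_genes, n_subsamples, genes_per_fit = 1000, 50, 300

importances = defaultdict(list)
for _ in range(n_subsamples):
    # Hypothetical: each classifier sees a different subset of genes.
    used = rng.sample(range(n_genes), genes_per_fit)
    for g in used:
        # Toy importance centred on zero, so many values are <= 0.
        importances[g].append(rng.gauss(0, 1))

# Average each gene's importance over the subsamples in which it appeared.
mean_importance = {g: sum(v) / len(v) for g, v in importances.items()}

assigned = len(mean_importance)
used_every_time = sum(1 for v in importances.values() if len(v) == n_subsamples)
positive = sum(1 for m in mean_importance.values() if m > 0)
print(assigned, used_every_time, positive)
```

In this toy setup nearly every gene ends up with an assigned average importance, yet no gene is used by every classifier and only a fraction of the averages are positive. The number of genes with an assigned importance therefore says little on its own about how robust the AUC is.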
