Increase robustness of cell type prioritization #30

tkapello · 2023-09-13T13:36:10Z

Hi,

Thank you for your interesting package. In my dataset of ~20,000 cells and 26,897 genes, the tool reported 20,000 unique significant genes. Does this make sense? The number seems relatively high (>60% of all genes). Apart from adjusting the trees, is there a way to tailor the analysis so that the genes used to prioritize cell types are more robust?

Best,
Theo

skinnider · 2023-09-13T22:11:36Z

Hi Theo - I'm not entirely sure I understand your question. Augur doesn't test for statistical significance; it simply returns the feature importances from the random forest algorithm. There are a number of reasons to take these importances with a grain of salt, and if you are interested in identifying statistically significant differences, a conventional differential expression (DE) analysis, as implemented in our Libra package, might make more sense.

Beyond that, you can set augur_mode='velocity' to disable feature selection and use alternative feature selection methods, or, at the expense of a longer runtime, you can just run Augur on all genes.

tkapello · 2023-09-15T14:00:00Z

Sorry, I was not clear enough. I guess my question can be rephrased as: what does "feature importance" actually mean, and how can it be interpreted?

skinnider · 2023-09-16T14:02:16Z

I am probably not going to give a better explanation than in the randomForest documentation. In Augur, the importance values are then averaged over repeated subsamples for each cell type. In general, I would recommend using the results of a DE analysis with Libra to identify genes that are changing between conditions within individual cell types, rather than relying on feature importance.
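To make the averaging step concrete, here is a minimal Python sketch. It is not Augur's code (Augur is an R package), and the gene names and importance values are made up. It assumes a gene is averaged only over the subsamples in which it received an importance value; this thread does not say how Augur handles genes missing from a subsample.

```python
from collections import defaultdict
from statistics import mean


def average_importance(per_subsample):
    """Average each gene's importance over the subsamples that scored it.

    per_subsample: list of dicts mapping gene name -> importance value,
    one dict per subsample (hypothetical structure, for illustration only).
    """
    collected = defaultdict(list)
    for importances in per_subsample:
        for gene, value in importances.items():
            collected[gene].append(value)
    return {gene: mean(values) for gene, values in collected.items()}


# Made-up importances from three subsamples of a single cell type.
subsamples = [
    {"GeneA": 0.5, "GeneB": 0.25},
    {"GeneA": 0.25, "GeneC": -0.125},
    {"GeneA": 0.75, "GeneB": 0.75},
]

averaged = average_importance(subsamples)
for gene, value in sorted(averaged.items()):
    print(gene, value)
```

Note that GeneC ends up with an averaged importance even though it was scored in only one subsample, and that value is negative.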

tkapello · 2023-09-16T15:14:23Z

Thank you again. I was wondering more about how you interpret the usability of "feature importance" in cell type prioritization. For example, would 20,000 important features indicate higher robustness than 15,000 features? Or would you say there is a minimum number of features that signals more confidence, e.g. "AUC=0.8 based on 18,000 important features" compared to "AUC=0.7 based on 10,000 important features"?

skinnider · 2023-09-16T19:50:01Z

In general I would say I don't really factor this in and go solely by the AUC. Many subsamples of equal size are drawn for each cell type (50 subsamples by default), so the fact that 18,000 features have an assigned feature importance doesn't mean that all 18,000 were used by every classifier trained for that cell type. Feature importance can also be zero or negative, so just because a feature importance is assigned doesn't mean that the gene actually has a positive impact on classification.
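The distinction between "has an assigned importance" and "helps classification" can be illustrated with a small Python sketch. This is not Augur's code, and the data structure, gene names, and values are hypothetical. It counts how many genes were assigned an importance at all, how many were scored in every subsample, and how many have a positive mean importance.

```python
def summarize_importance(per_subsample):
    """Return (assigned, in_every_subsample, positive_mean) gene sets.

    per_subsample: list of dicts mapping gene name -> importance value,
    one dict per subsample (hypothetical structure, for illustration only).
    """
    n_scored = {}
    total = {}
    for importances in per_subsample:
        for gene, value in importances.items():
            n_scored[gene] = n_scored.get(gene, 0) + 1
            total[gene] = total.get(gene, 0.0) + value

    assigned = set(total)
    in_every = {g for g in assigned if n_scored[g] == len(per_subsample)}
    positive = {g for g in assigned if total[g] / n_scored[g] > 0}
    return assigned, in_every, positive


# Made-up importances from three subsamples of a single cell type.
example = [
    {"GeneA": 0.5, "GeneB": 0.0, "GeneC": -0.25},
    {"GeneA": 0.25, "GeneD": 0.125},
    {"GeneA": 0.75, "GeneB": 0.0},
]

assigned, in_every, positive = summarize_importance(example)
print(len(assigned), "genes assigned an importance")
print(len(in_every), "genes scored in every subsample")
print(len(positive), "genes with a positive mean importance")
```

In this toy input, four genes carry an importance value, but only one was scored in every subsample and only two have a positive mean, which is why the raw count of "important" genes says little on its own.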
