
7. Pseudo Bulk Analysis Using Additional Gumshoe Functions

Gurveer Gill edited this page Jul 19, 2024 · 23 revisions

Gumshoe Tutorial

Introduction

This tutorial is based on aggregated single-cell data from the following paper:

  • Usoskin D, Furlan A, Islam S, Abdo H, Lönnerberg P, Lou D, Hjerling-Leffler J, Haeggström J, Kharchenko O, Kharchenko PV, Linnarsson S, Ernfors P (2015) Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat Neurosci 18:145–153.

To keep things simple, and because Gumshoe focuses on supplemental utilities for Sleuth, we skip the step of transcript abundance quantification using Kallisto. We have already run this step, and the output files are listed below together with a corresponding metadata file. The data is from a single-cell experiment on colonic sensory neurons in mice. It has been combined and converted into bulk RNA-Seq data, with four tissue sources (NF, NP, PEP, TH) and two sexes (male and female). This tutorial builds upon the previous 'Gumshoe Functions' tutorial and aims to showcase additional Gumshoe package functions and how to identify optimal regression models.

NOTE: This tutorial assumes that you have reviewed the previous tutorial.


Downloading The Data

Tutorial data can be found in the data folder of the package. For this tutorial, you should find the following in the folder:

Tutorial_Pseudo_Bulk
├── gumshoe_pseudo_bulk_analysis.R
├── gumshoe_pseudo_bulk_tutorial.ipynb
├── metadata.txt
├── output
│   ├── female_NF.0.kallisto
│   ├── female_NF.1.kallisto
│   ├── female_NF.2.kallisto
│   ├── female_NP.0.kallisto
│   ├── female_NP.1.kallisto
│   ├── female_NP.2.kallisto
│   ├── female_PEP.0.kallisto
│   ├── female_PEP.1.kallisto
│   ├── female_PEP.2.kallisto
│   ├── female_TH.0.kallisto
│   ├── female_TH.1.kallisto
│   ├── female_TH.2.kallisto
│   ├── male_NF.0.kallisto
│   ├── male_NF.1.kallisto
│   ├── male_NF.2.kallisto
│   ├── male_NP.0.kallisto
│   ├── male_NP.1.kallisto
│   ├── male_NP.2.kallisto
│   ├── male_PEP.0.kallisto
│   ├── male_PEP.1.kallisto
│   ├── male_PEP.2.kallisto
│   ├── male_TH.0.kallisto
│   ├── male_TH.1.kallisto
│   └── male_TH.2.kallisto
└── plot
    ├── factor_combination_density.pdf
    ├── histamine_itch.pdf
    └── histamine_itch_t2g.pdf

Analysis Using Sleuth and Gumshoe

Loading Libraries

After downloading or generating the data, our first step is to load the libraries we will be using and to set the working directory.

# Libraries ----
library(tidyverse)
library(sleuth)
library(biomaRt)
library(gumshoe)
library(ggrepel)
library(multcompView)
library(rcompanion)
library(ggpubr)
library(reshape2)
library(ComplexHeatmap)
library(circlize)
library(ggridges)

# Working Directory ----
# NOTE: Please ensure that your working directory is the current directory, as this is critical for the remainder of the tutorial.
# setwd("")

Metadata File

For the sake of simplicity, we will be using the metadata file included with the tutorial data. In our case, the metadata file contains the sample name, path, sex, and tissue type. First, let's load this metadata file and then convert the tissue column to a factor and create a new variable to hold our re-leveled metadata.

# Metadata File ----
## Load Metadata File ----
metadata <- read.table("metadata.txt", header = TRUE)

## Re-level Metadata ----
# Convert the tissue type column to a factor. R orders factor levels alphabetically by default, so NF will be the first (reference) level.
metadata$tissue <- as.factor(metadata$tissue)

# Create a new variable that will contain a re-leveled version of the metadata
metadata_releveled <- metadata
metadata_releveled$tissue <- relevel(metadata_releveled$tissue, ref = "TH")
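
To see what re-leveling actually changes, here is a minimal sketch on a toy metadata table (the values are made up and only the column names mirror the tutorial metadata). The reference level is the one absorbed into the intercept, so it is the level that disappears from the design matrix columns.

```r
# Toy metadata for illustration only; not the tutorial's metadata.txt.
toy <- data.frame(
  sex    = rep(c("F", "M"), each = 4),
  tissue = factor(rep(c("NF", "NP", "PEP", "TH"), times = 2))
)

# Default ordering is alphabetical, so NF is the reference level.
levels(toy$tissue)

# Re-level so that TH becomes the reference level.
toy_releveled <- toy
toy_releveled$tissue <- relevel(toy_releveled$tissue, ref = "TH")
levels(toy_releveled$tissue)

# The reference level has no column of its own in the design matrix.
nf_cols <- colnames(model.matrix(~ sex + tissue, data = toy))
th_cols <- colnames(model.matrix(~ sex + tissue, data = toy_releveled))
nf_cols
th_cols
```

With NF as the baseline, the tissue coefficients describe NP, PEP, and TH relative to NF; with TH as the baseline, they describe NF, NP, and PEP relative to TH.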

Creating a Data Frame for Gumshoe

The steps we've taken up to this point are common to any Sleuth analysis. To utilize Gumshoe to its full potential, we must deviate from the traditional approach and construct a data structure to be used with the sleuth_interpret Gumshoe function. These steps might seem redundant, but suppose we wanted to build upon our previous tutorial analysis and compare the results to an analysis that uses transcript to gene mapping with classical transcript aggregation. A normal approach would require far more lines of code, whereas here we only need to add the extra analysis to our data frame.

We can now create a data frame that contains the information that Gumshoe will use to automate the analysis. In this case, we can add the transcript to gene mapping arguments to the model_parameters variable.

# Gumshoe Data Frame Creation ----
## Analysis Information ----
# Define the metadata file name(s), associated model name(s), the corresponding model(s) to be used, and the model parameters.
# NOTE: The order of the model data must align with the order of the model names.
metadata_names <- c("metadata_nominal_NF_baseline",
                    "metadata_nominal_TH_baseline")
model_names <- c("NF_model_sex,NF_model_sex_tissue,NF_model_interaction",
                 "NF_t2g_model_sex,NF_t2g_model_sex_tissue,NF_t2g_model_interaction",
                 "TH_model_sex,TH_model_sex_tissue,TH_model_interaction",
                 "TH_t2g_model_sex,TH_t2g_model_sex_tissue,TH_t2g_model_interaction")
model_data <- c("~sex, ~sex + tissue, ~sex*tissue")
model_parameters <- c("",
                      "target_map = t2g, aggregation_column = 'ext_gene', gene_mode = TRUE")

## Data Frame Creation ----
# Take the metadata names, model names, and model data information and combine it into a single data frame.
analysis_data <- data.frame(metadata_name = metadata_names,
                            metadata_file = tibble(list(metadata)),
                            model_name = model_names,
                            model_data = model_data,
                            model_parameters = model_parameters)

NOTE: The model names must be unique, or earlier results will be overwritten.
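
Because both requirements above (unique names, and formulae aligned with names) are easy to violate when editing long comma-separated strings, a quick sanity check can help before running the analysis. The sketch below is only a check written for this tutorial; split_entries is a hypothetical helper, and this is not a description of how Gumshoe parses these strings internally.

```r
model_names <- c("NF_model_sex,NF_model_sex_tissue,NF_model_interaction",
                 "NF_t2g_model_sex,NF_t2g_model_sex_tissue,NF_t2g_model_interaction",
                 "TH_model_sex,TH_model_sex_tissue,TH_model_interaction",
                 "TH_t2g_model_sex,TH_t2g_model_sex_tissue,TH_t2g_model_interaction")
model_data <- "~sex, ~sex + tissue, ~sex*tissue"

# Hypothetical helper: split a comma-separated string and trim whitespace.
split_entries <- function(x) trimws(strsplit(x, ",", fixed = TRUE)[[1]])

names_per_row <- lapply(model_names, split_entries)
formulas      <- split_entries(model_data)

# Every model name must be unique across all rows.
all_names <- unlist(names_per_row)
stopifnot(anyDuplicated(all_names) == 0)

# Every row must name exactly as many models as there are formulae.
stopifnot(all(lengths(names_per_row) == length(formulas)))

# Every formula string must be valid R formula syntax.
parsed <- lapply(formulas, as.formula)
```

If any of the stopifnot calls fails, the offending string can be fixed before any time is spent fitting models.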

You can confirm the contents of the analysis_data by running View(analysis_data), and it should look like this:

                 metadata_name list.metadata.                                                        model_name                       model_data                                                    model_parameters
1 metadata_nominal_NF_baseline   c("femal....             NF_model_sex,NF_model_sex_tissue,NF_model_interaction ~sex, ~sex + tissue, ~sex*tissue
2 metadata_nominal_TH_baseline   c("femal.... NF_t2g_model_sex,NF_t2g_model_sex_tissue,NF_t2g_model_interaction ~sex, ~sex + tissue, ~sex*tissue target_map = t2g, aggregation_column = 'ext_gene', gene_mode = TRUE
3 metadata_nominal_NF_baseline   c("femal....             TH_model_sex,TH_model_sex_tissue,TH_model_interaction ~sex, ~sex + tissue, ~sex*tissue
4 metadata_nominal_TH_baseline   c("femal.... TH_t2g_model_sex,TH_t2g_model_sex_tissue,TH_t2g_model_interaction ~sex, ~sex + tissue, ~sex*tissue target_map = t2g, aggregation_column = 'ext_gene', gene_mode = TRUE

At this point, we've defined two metadata names corresponding to two versions of the metadata that differ in their factor baseline, named three models for each row, and created the formulae associated with each model name. Each model uses different factors to run the analysis. The first model is solely sex, the second is sex and tissue type, and the third is sex, tissue type, and the interaction between both. In total the data frame has four rows, with a transcript to gene aggregation variant for both the NF and TH model names.

We can now replace the standard metadata file with our re-leveled version.

# Replace the metadata file with the re-leveled version.
analysis_data[[2]][[2]] <- metadata_releveled
analysis_data[[2]][[4]] <- metadata_releveled

To confirm that we have the correct metadata in the analysis_data data frame, we can check the factor levels of the tissue column in the metadata files contained in the analysis_data data frame.

levels(analysis_data[[2]][[1]]$tissue)
# "NF"  "NP"  "PEP" "TH"
levels(analysis_data[[2]][[2]]$tissue)
# "TH"  "NF"  "NP"  "PEP"
levels(analysis_data[[2]][[3]]$tissue)
# "NF"  "NP"  "PEP" "TH"
levels(analysis_data[[2]][[4]]$tissue)
# "TH"  "NF"  "NP"  "PEP"

Transcript to Gene Mapping

Finally, we must create a dataframe that maps all the transcripts to genes. This can be accomplished with the biomaRt package using the following lines of code.

mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
t2g <- getBM(attributes = c("ensembl_transcript_id_version", "ensembl_gene_id","external_gene_name"), mart = mart)
colnames(t2g) <- c("target_id", "ens_gene", "ext_gene")
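
The t2g dataframe is used later in this tutorial to go from gene names to transcript IDs. The sketch below shows that lookup pattern on a small made-up table, so it runs without a biomaRt connection; the identifiers and counts are placeholders, and only the three column names match the real t2g.

```r
# Toy stand-in for t2g; identifiers are placeholders, not real Ensembl IDs.
t2g_toy <- data.frame(
  target_id = c("TX_A.1", "TX_A.2", "TX_B.1", "TX_C.1"),
  ens_gene  = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  ext_gene  = c("GeneA", "GeneA", "GeneB", "GeneC")
)

# Toy transcript-level table keyed by target_id, two samples per transcript.
obs_toy <- data.frame(
  target_id  = rep(t2g_toy$target_id, each = 2),
  sample     = rep(c("s1", "s2"), times = 4),
  est_counts = c(10, 12, 0, 1, 7, 9, 3, 2)
)

# Step 1: gene name -> all transcript IDs that belong to it.
wanted_ids <- t2g_toy$target_id[t2g_toy$ext_gene %in% c("GeneA")]

# Step 2: keep only the rows for those transcripts.
hits <- obs_toy[obs_toy$target_id %in% wanted_ids, ]
hits
```

The same two-step indexing works on any transcript-level table that has a target_id column.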

Using Gumshoe

Finally, with a few lines of code, we can analyze all the data. We will use the sleuth_interpret function that comes with Gumshoe. We don't need to worry about assigning the function output to a variable, as this step is also automated for you. A quick tip: when running Sleuth through RStudio, the number of cores utilized is automatically set to 1, but this can be bypassed by running Sys.setenv("RSTUDIO" = 0).

Although the Gumshoe package functions aim to be as flexible as possible, two restrictions ensure the success of downstream functions: the t2g variable columns must be named as in the lines of code above, and transcript aggregation can't be performed using the Lancaster method. However, sleuth_interpret can be used as a standalone function and doesn't have these restrictions.

sleuth_interpret(analysis_data, num_cores = 1)

Comparing Regression Models

Residual Sum of Squares

One of the simplest ways to compare and identify better regression models is by checking the residual sum of squares (RSS) for each sleuth_object. We can do this object by object, but for simplicity, we've made a dataframe of the model names and the associated RSS values.

# Check the model RSS values
# NOTE: The order of the model names must match the order of the sleuth_model_rss calls.
rss_df <- data.frame(models = c("so_NF_model_sex", "so_TH_model_sex", "so_NF_t2g_model_sex", "so_TH_t2g_model_sex",
                                "so_NF_model_sex_tissue", "so_TH_model_sex_tissue", "so_NF_t2g_model_sex_tissue", "so_TH_t2g_model_sex_tissue",
                                "so_NF_model_interaction", "so_TH_model_interaction", "so_NF_t2g_model_interaction", "so_TH_t2g_model_interaction"),
                     model_rss = c(sleuth_model_rss(so_NF_model_sex), sleuth_model_rss(so_TH_model_sex), sleuth_model_rss(so_NF_t2g_model_sex), sleuth_model_rss(so_TH_t2g_model_sex),
                                   sleuth_model_rss(so_NF_model_sex_tissue), sleuth_model_rss(so_TH_model_sex_tissue), sleuth_model_rss(so_NF_t2g_model_sex_tissue), sleuth_model_rss(so_TH_t2g_model_sex_tissue),
                                   sleuth_model_rss(so_NF_model_interaction), sleuth_model_rss(so_TH_model_interaction), sleuth_model_rss(so_NF_t2g_model_interaction), sleuth_model_rss(so_TH_t2g_model_interaction)))

When we run View(rss_df), we should now see a dataframe with 2 columns, with the first being the model names and the second being the RSS value for the respective model. It should resemble the following:

                        models        model_rss
1              so_NF_model_sex 52959.9894200399
2              so_TH_model_sex 52959.9894200399
3          so_NF_t2g_model_sex 16709.3668898258
4          so_TH_t2g_model_sex 16709.3668898258
5       so_NF_model_sex_tissue 26341.8264855065
6       so_TH_model_sex_tissue 26341.8264855065
7   so_NF_t2g_model_sex_tissue 7542.60054997578
8   so_TH_t2g_model_sex_tissue 7542.60054997578
9      so_NF_model_interaction 1247.34784170957
10     so_TH_model_interaction 1247.34784170957
11 so_NF_t2g_model_interaction 225.594523780623
12 so_TH_t2g_model_interaction 225.594523780623
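
Two patterns in a table like this are worth understanding: models that differ only in their factor baseline share the same RSS, and adding terms to a model can only lower the RSS. The sketch below reproduces both with ordinary least squares (lm) on simulated data. This is an analogy under stated assumptions: Sleuth fits its own measurement error model per target, the rss helper here is hypothetical, and the numbers bear no relation to the tutorial data.

```r
set.seed(1)

# Simulated design: 2 sexes x 4 tissues x 3 replicates.
toy <- data.frame(
  sex    = rep(c("F", "M"), each = 12),
  tissue = factor(rep(rep(c("NF", "NP", "PEP", "TH"), each = 3), times = 2))
)
toy$y <- rnorm(nrow(toy), mean = as.integer(toy$tissue))

# Hypothetical helper: residual sum of squares of an ordinary linear model.
rss <- function(formula, data) sum(residuals(lm(formula, data = data))^2)

# Same model, different baseline: fitted values, and so the RSS, are unchanged.
toy_th <- toy
toy_th$tissue <- relevel(toy_th$tissue, ref = "TH")
rss_nf <- rss(y ~ sex + tissue, toy)
rss_th <- rss(y ~ sex + tissue, toy_th)
all.equal(rss_nf, rss_th)

# Nested models: each added term can only reduce the RSS.
rss_sex         <- rss(y ~ sex, toy)
rss_sex_tissue  <- rss(y ~ sex + tissue, toy)
rss_interaction <- rss(y ~ sex * tissue, toy)
c(sex = rss_sex, sex_tissue = rss_sex_tissue, interaction = rss_interaction)
```

Because a richer model always fits at least as well, a lower RSS alone does not show that the extra terms are justified; that is what the tests discussed next are for.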

Gumshoe includes sleuth_test_wt and sleuth_test_lrt to automate running the Wald test (WT) and likelihood-ratio test (LRT) for each coefficient in the models of a sleuth_object. The output of each function is assigned back to the sleuth_object the function was run on. Gumshoe follows this notion of automation throughout all of its functions, and sleuth_model_rss is no exception: after running either the WT or the LRT, it reports the RSS for each coefficient in the models of the sleuth_object.

Although the WT model coefficient RSS values are rarely useful, the RSS values following the LRT can be very helpful for understanding how the regression model factors change the regression. Treat the following as an exercise to become more comfortable with the functions in Gumshoe. Run sleuth_test_lrt on the so_NF_model_sex_tissue model and then compare the RSS dataframes returned by sleuth_model_rss for the so_NF_model_sex_tissue and so_NF_model_sex models. You might notice that the RSS value for the so_NF_model_sex model is the same as that of the no_tissue fit for the so_NF_model_sex_tissue model. So, starting with a complex model and performing the LRT can save time in identifying how individual regression model factors change the RSS. Here are the RSS values based on the exercise:

# so_NF_model_sex_tissue RSS after LRT
          Parameters              RSS
2               full 1247.34784170957
3          no_tissue 52959.9894200399
4             no_sex 36951.4458116443

# so_NF_model_sex RSS
    Parameters              RSS
2         full 52959.9894200399
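
The match between the no_tissue fit and the sex-only model is not a coincidence: dropping a term from the full model yields exactly the smaller model. A minimal sketch with lm on simulated data illustrates this; lm stands in for Sleuth's fitting procedure here, the rss_of helper is hypothetical, and the values are unrelated to the tutorial output.

```r
set.seed(2)

# Simulated design: 2 sexes x 4 tissues x 3 replicates.
toy <- data.frame(
  sex    = rep(c("F", "M"), each = 12),
  tissue = factor(rep(rep(c("NF", "NP", "PEP", "TH"), each = 3), times = 2))
)
toy$y <- rnorm(nrow(toy), mean = 2 * as.integer(toy$tissue))

# Hypothetical helper: residual sum of squares of a fitted linear model.
rss_of <- function(fit) sum(residuals(fit)^2)

# Full model and the two reduced models obtained by dropping one term each.
full      <- lm(y ~ sex + tissue, data = toy)
no_tissue <- update(full, . ~ . - tissue)
no_sex    <- update(full, . ~ . - sex)

rss_table <- data.frame(
  Parameters = c("full", "no_tissue", "no_sex"),
  RSS        = c(rss_of(full), rss_of(no_tissue), rss_of(no_sex))
)
rss_table

# The no_tissue fit is the same model as fitting ~sex directly.
sex_only <- lm(y ~ sex, data = toy)
all.equal(rss_table$RSS[2], rss_of(sex_only))
```

The size of the jump in RSS when a term is dropped gives a rough sense of how much that term contributes to the fit.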

Adding Target Mapping Post-hoc

In this tutorial, we generated six sleuth_objects that don't contain any target mapping or gene aggregation information. To showcase some of the other functions, however, we need non-gene-aggregated sleuth_objects that still contain target mapping information. We could rerun sleuth_interpret(analysis_data) after changing model_parameters to include this information, but this would take a lot of time, and it's important to note that Sleuth doesn't utilize target mapping for any purpose if gene aggregation is off. So, we can simply insert the target mapping information into the existing sleuth_objects that lack it by assigning the t2g dataframe to them using the following commands:

so_NF_model_sex$target_mapping <- t2g
so_NF_model_sex_tissue$target_mapping <- t2g
so_NF_model_interaction$target_mapping <- t2g
so_TH_model_sex$target_mapping <- t2g
so_TH_model_sex_tissue$target_mapping <- t2g
so_TH_model_interaction$target_mapping <- t2g

Performing Pseudo-Bulk Analysis

Rationale Behind Pseudo-Bulk Analysis

Our pseudo-bulk dataset was generated using single-cell RNA-sequencing, which is known to suffer from observed zero values, or dropout. We can't truly tell whether these zero values arise because reads went undetected or because expression is genuinely absent. Using pseudo-bulk analysis, we can aggregate expression and reduce the dropout. As such, the analysis in this tutorial allows us to examine the elusive itch pathways mentioned in the paper, or potentially to identify population-specific differentially expressed genes that weren't identified in the publication due to the dropout rate. We will tackle the former and leave the latter as an exercise.

The publication investigates and predicts several itch pathways, such as:

  1. LPAR3 and LPAR5 in uniquely NP neurons.
  2. Chloroquine mediated MRGPRA3 and MRGPRX1 activation in uniquely NP neurons.
  3. Serotonin-induced itch HTR1F and HTR2A in uniquely NP neurons.

However, some itch pathways remained elusive: the histamine receptor H1 (HRH1) was found to be lowly expressed in the NP group, and protease-activated receptor (PAR) 2 (F2RL1)-dependent itch couldn't be assigned to a population. Therefore, it is worthwhile to reassess the detection of both HRH1 and PAR2 in the pseudo-bulk data and investigate their expression within the particular types of sensory neurons.

Investigating Histamine and PAR Itch Associated Genes in Sensory Neurons

To assess and visualize the expression of HRH1 and F2RL1, we can utilize a function in Gumshoe that builds on the ComplexHeatmap package to display gene and transcript level data in a format designed specifically for RNA-sequencing analysis. First, though, we can check the expression of these genes using the filtered_scaled_transcript_counts function, which reports counts regardless of whether they're statistically significant. Running filtered_scaled_transcript_counts(so_NF_t2g_model_interaction, c("Hrh1", "F2rl1")) surprisingly returns no counts. Hope is not lost just yet, so it is worth checking the non-gene-aggregated sleuth_object using filtered_scaled_transcript_counts(so_NF_model_interaction, c("Hrh1", "F2rl1")), but again we don't see any counts.

filtered_scaled_transcript_counts(so_NF_t2g_model_interaction, c("Hrh1", "F2rl1"))
# [1] target_id             ens_gene              ext_gene              sample                scaled_reads_per_base tpm                   est_counts
# <0 rows> (or 0-length row.names)

filtered_scaled_transcript_counts(so_NF_model_interaction, c("Hrh1", "F2rl1"))
# [1] target_id             ens_gene              ext_gene              sample                scaled_reads_per_base tpm                   est_counts
# <0 rows> (or 0-length row.names)

This was quite surprising, so it might be worthwhile to check the number of normalized counts in the samples.

so_NF_t2g_model_interaction$obs_norm[so_NF_t2g_model_interaction$obs_norm$target_id %in% c('Hrh1', 'F2rl1'),]
so_NF_model_interaction$obs_norm[so_NF_model_interaction$obs_norm$target_id %in% t2g$target_id[t2g$ext_gene %in% c('Hrh1', 'F2rl1')],]

Based on these results, it doesn't seem as though the pseudo-bulk dataset captured the low counts for HRH1 and F2RL1 expression.

PLCβ3, TRPV1, and TRPV3 Expression

We can still further investigate both histamine and PAR itch by evaluating the expression of other genes involved in these processes, such as PLCβ3 and TRPV1 for histamine itch and TRPV3 (see https://doi.org/10.1016/j.jid.2020.01.012) for PAR itch. So, let's check the scaled counts of these genes with filtered_scaled_transcript_counts(so_NF_t2g_model_interaction, genes = c('Plcb3', 'Trpv1', 'Trpv3')) or filtered_scaled_transcript_counts(so_NF_model_interaction, genes = c('Plcb3', 'Trpv1', 'Trpv3')). Did you notice anything strange with either output? It's quite interesting that we don't see TRPV1/3 present even though they're known to be highly expressed in sensory neurons, so let's check the normalized counts and see why this might be happening.

so_NF_t2g_model_interaction$obs_norm[so_NF_t2g_model_interaction$obs_norm$target_id %in% c('Trpv1'),]
so_NF_t2g_model_interaction$obs_norm[so_NF_t2g_model_interaction$obs_norm$target_id %in% c('Trpv3'),]
so_NF_model_interaction$obs_norm[so_NF_model_interaction$obs_norm$target_id %in% t2g$target_id[t2g$ext_gene %in% c('Trpv1')],]
so_NF_model_interaction$obs_norm[so_NF_model_interaction$obs_norm$target_id %in% t2g$target_id[t2g$ext_gene %in% c('Trpv3')],]

The output showcases that there are abundant counts for TRPV1, but not TRPV3. This leaves us unable to further explore the PAR itch pathway at the current time, but it also presents us with another question. If TRPV1 counts at the transcript and gene-aggregation level are so high, why are we unable to see them using the filtered_scaled_transcript_counts function? Well, there are two reasons for this issue:

  1. The filtered_scaled_transcript_counts function uses the obs_norm_filt dataframe in a sleuth_object, which only contains targets that passed the Sleuth filter.
  2. The transcripts and genes aren't passing the filter (see https://achri.blogspot.com/2018/02/are-you-losing-important-genes-in-your.html), which requires at least 5 mapped reads to a transcript in at least 47% of the samples.
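
To make the second point concrete, here is a minimal sketch of a filter that follows the rule as described above (at least 5 reads in at least 47% of samples). The function name basic_filter_sketch and the count vectors are made up for illustration.

```r
# Keep a transcript only if enough samples reach the read threshold.
basic_filter_sketch <- function(row, min_reads = 5, min_prop = 0.47) {
  mean(row >= min_reads) >= min_prop
}

# 24 samples; transcript expressed in only 6 of them (25% of samples).
restricted <- c(rep(0, 18), rep(50, 6))
basic_filter_sketch(restricted)

# 24 samples; transcript expressed in 12 of them (50% of samples).
broad <- c(rep(0, 12), rep(50, 12))
basic_filter_sketch(broad)
```

A transcript that is strongly expressed in only one or two factor combinations can therefore be removed, no matter how high its counts are in those samples.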

However, a quick check of the gene-aggregated sleuth_object shows 12 rows for TRPV1 with more than 5 scaled reads per base, which at first glance should be enough to pass this filter:

trpv1_t2g_expression <- so_NF_t2g_model_interaction$obs_norm[so_NF_t2g_model_interaction$obs_norm$target_id %in% c('Trpv1'),]
nrow(trpv1_t2g_expression[trpv1_t2g_expression$scaled_reads_per_base > 5,])
# [1] 12

So, why is TRPV1 not showing up when we run the filtered_scaled_transcript_counts function? To cut to the chase, it is because Sleuth filters per transcript and not per gene, even when gene aggregation is enabled. As single-cell data is typically used to find cluster-specific differences, it is not surprising that generating pseudo-bulk data from these clusters leads to low or absent expression of certain transcripts or genes in certain groupings of samples. Is there any way to rectify this shortcoming? Short answer: definitely. We could re-run the analysis with a more relaxed filter in sleuth_prep, or we could run sleuth_prep with no filter and manually filter the obs_norm_filt dataframe prior to sleuth_fit. However, we must be careful with the approaches we use to filter on transcript- or gene-specific expression, to prevent an increase in type I or type II errors. In this tutorial, we will modify the filter for the purpose of investigating itch pathways. The limitations of our filtering approach are outside the scope of this tutorial.

Better Transcript Filtering

In this tutorial, we have a targeted approach: we only want to assess the expression level of a few particular genes, namely PLCβ3 and TRPV1. The fast solution would be to re-process the data without any filter, but Gumshoe also includes a design_filter function that is explained in the following article: https://achri.blogspot.com/2018/02/are-you-losing-important-genes-in-your.html. Using this filter retains transcripts with 5 reads in a minimum of 47% of the samples of a given experimental factor combination, for example, the male NP samples. We can perform this step quite easily by running the following code:

reanalysis_data <- analysis_data
reanalysis_data <- reanalysis_data[-c(2,4),]
reanalysis_data$model_name[1] <- 'NF_model_interaction'; reanalysis_data$model_name[2] <- 'NF_t2g_model_interaction'
reanalysis_data$model_data <- '~sex*tissue'
reanalysis_data$model_parameters[1] <- "filter_fun=function(x){design_filter(metadata, ~sex*tissue, x)}"
reanalysis_data$model_parameters[2] <- "target_map = t2g, aggregation_column = 'ext_gene', gene_mode = TRUE, filter_fun=function(x){design_filter(metadata, ~sex*tissue, x)}"

sleuth_interpret(reanalysis_data, num_cores = 1)

We can now run the WT on both the so_NF_model_interaction and so_NF_t2g_model_interaction models and then check if our change in the filter design allows us to investigate TRPV1 expression.

sleuth_test_wt(so_NF_model_interaction)
sleuth_test_wt(so_NF_t2g_model_interaction)

so_NF_model_interaction$target_mapping <- t2g

head(filtered_scaled_transcript_counts(so_NF_model_interaction, "Trpv1"))
head(filtered_scaled_transcript_counts(so_NF_t2g_model_interaction, "Trpv1"))
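
The idea behind filtering per factor combination rather than across all samples can be sketched in a few lines of base R. This is not Gumshoe's design_filter implementation; group_filter_sketch, its arguments, and the counts are hypothetical and only illustrate the rule described above.

```r
# Keep a transcript if ANY factor combination has enough samples over the threshold.
group_filter_sketch <- function(row, groups, min_reads = 5, min_prop = 0.47) {
  prop_by_group <- tapply(row >= min_reads, groups, mean)
  any(prop_by_group >= min_prop)
}

# 8 factor combinations (sex x tissue) with 3 replicates each.
groups <- rep(c("F_NF", "F_NP", "F_PEP", "F_TH",
                "M_NF", "M_NP", "M_PEP", "M_TH"), each = 3)

# A transcript expressed only in the male NP samples.
np_only <- ifelse(groups == "M_NP", 40, 0)

# Across all samples only 3 of 24 reach the threshold, so a global rule rejects it.
passes_global <- mean(np_only >= 5) >= 0.47
passes_global

# Within the male NP group all 3 replicates reach it, so the group-wise rule keeps it.
passes_group <- group_filter_sketch(np_only, groups)
passes_group
```

This is why population-specific transcripts, which are common in pseudo-bulk data built from single-cell clusters, can survive a design-aware filter while failing a global one.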

Heatmap Visualization and Non-Parametric Experimental Factor Combination Testing

We can now assess the expression of PLCβ3 and TRPV1 in the various neuronal populations, such as NF, NP, PEP, and TH, to evaluate histamine itch. We can visualize and determine if gene or transcript expression is significant using the heatmap_plot function in Gumshoe. To perform this, we can run the following code:

# For all the transcripts passing the filter and associated with the gene
heatmap_plot(so_NF_model_interaction, genes = c("Plcb3", "Trpv1"),
             grouping_colours = list(sex = c("F" = "deeppink1", "M" = "dodgerblue2"), tissue = c("NF" = "chartreuse1", "NP" = "blue", "PEP" = "darkorange", "TH" = "gray40")), q_max = FALSE,
             clusterRows = TRUE, clusterColumn = FALSE)

# For all the genes
heatmap_plot(so_NF_t2g_model_interaction, genes = c("Plcb3", "Trpv1"),
             grouping_colours = list(sex = c("F" = "deeppink1", "M" = "dodgerblue2"), tissue = c("NF" = "chartreuse1", "NP" = "blue", "PEP" = "darkorange", "TH" = "gray40")), q_max = FALSE,
             clusterRows = TRUE, clusterColumn = FALSE)

Based on the heatmap plot, we find that our results confirm those in the Usoskin publication and build upon them by identifying factor combination specific differences, for example, significant interactions between sex and both NP and PEP tissue. Further, we see a significant interaction between sex and specific PLCβ3 transcripts (ENSMUST00000237808.2 and ENSMUST00000025912.10) in TH tissue. This does make us wonder whether the transcript counts of the factor combinations are similar to one another or, in other words, whether they originate from the same distribution. To assess this, we can perform a Kruskal-Wallis test of the transcript counts by factor combination for a selected gene using the sleuth_kruskal_wallis function. Both PLCβ3 and TRPV1 are of interest here; we run the test for TRPV1 below and leave PLCβ3 as an exercise for the reader.

kw_result <- sleuth_kruskal_wallis(so_NF_model_interaction, "Trpv1")

# Joining with `by = join_by(sample)`
#
# 	Kruskal-Wallis rank sum test
#
# data:  est_counts by factor_group
# Kruskal-Wallis chi-squared = 33.779, df = 7, p-value = 1.894e-05
#
#
# 	Pairwise comparisons using Wilcoxon rank sum test with continuity correction
#
# data:  data$est_counts and data$factor_group
#
#       F_NF  F_NP  F_PEP F_TH  M_NF  M_NP  M_PEP
# F_NP  0.111 -     -     -     -     -     -
# F_PEP 0.051 0.704 -     -     -     -     -
# F_TH  1.000 0.111 0.051 -     -     -     -
# M_NF  0.237 0.408 0.408 0.237 -     -     -
# M_NP  0.013 0.074 0.013 0.013 0.013 -     -
# M_PEP 0.013 0.019 0.013 0.013 0.013 0.443 -
# M_TH  0.051 0.704 0.935 0.051 0.408 0.013 0.013
#
# P value adjustment method: BH
#  F_NF  F_NP F_PEP  F_TH  M_NF  M_NP M_PEP  M_TH
#   "a"  "ab"   "a"   "a"   "a"  "bc"   "c"   "a"

We find that the Kruskal-Wallis test yields a p-value of less than 0.001, so we can follow up with pairwise comparisons using BH correction. It is sometimes difficult to identify what differs following a pairwise comparison, so it is helpful to express these differences using a letter-based representation, in which groups that share a letter are not significantly different from one another. The results show differences that withstand multiple testing correction in two cases: the male NP samples differ from every other group except the female NP and male PEP samples, and the male PEP samples differ from every other group except the male NP samples. No other pairwise comparison is significant. We can visualize the count density by using the following code:

ggplot(kw_result, aes(x = est_counts, y = factor_group)) +
   geom_density_ridges2()

As such, it seems as though the TRPV1 counts are greater in the male PEP and NP samples than in the female PEP and NP samples, which may reflect an actual difference in TRPV1 expression or may be the result of several other factors.
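
The two tests named in the output above, a Kruskal-Wallis rank sum test followed by pairwise Wilcoxon tests with BH adjustment, are available in base R. The sketch below runs them on a small made-up table; it makes no claim about how sleuth_kruskal_wallis is implemented, and the counts are invented, with only the column names mirroring the output above.

```r
# Made-up counts for three factor combinations; not the tutorial data.
toy <- data.frame(
  factor_group = factor(rep(c("F_NP", "M_NP", "M_PEP"), each = 6)),
  est_counts   = c(3, 5, 4, 6, 2, 7,
                   58, 61, 55, 64, 60, 57,
                   66, 59, 63, 70, 62, 68)
)

# Omnibus test: do all groups come from the same distribution?
kw <- kruskal.test(est_counts ~ factor_group, data = toy)
kw

# Follow-up: which pairs differ, with Benjamini-Hochberg adjustment.
pw <- pairwise.wilcox.test(toy$est_counts, toy$factor_group,
                           p.adjust.method = "BH")
pw$p.value
```

The omnibus test only says that at least one group differs; the adjusted pairwise matrix is what identifies which groups those are.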

Conclusion

This takes us to the end of the tutorial, and based on our dataset, we've covered nearly all applicable functions found in Gumshoe. If you have any questions, suggestions, or feedback, feel free to send an email ([email protected])!