diff --git a/geneset_randomHist.png b/geneset_randomHist.png
new file mode 100644
index 0000000..c2a91b7
Binary files /dev/null and b/geneset_randomHist.png differ
diff --git a/index.html b/index.html
index b3f7b1f..ec7ec63 100644
--- a/index.html
+++ b/index.html
@@ -2209,6 +2209,10 @@

Enrichment and Pathways Analysis

relationship between KEGG pathways. There is however a complex network between genes belonging to the same pathway which does not exist in GO.

+

The choice of database does not actually affect how the statistical testing works. We test for significant collections regardless of how the collections have been defined.

@@ -2245,7 +2249,7 @@

Over-representation analysis (ORA)

with a fold-change cut-off)

The question we are asking here is;

-

“Are the number of DE genes associated with Theme X +

“Is the number of DE genes associated with Gene Set X significantly greater than what we might expect by chance alone?”

@@ -2298,8 +2302,54 @@

Over-representation analysis (ORA)

-

In this first test, our genes will be grouped together according to their Gene Ontology (GO) terms:- http://www.geneontology.org/

+

As a worked example, consider a Gene Set with 634 genes. After performing differential expression analysis, we find that our list of differentially-expressed genes comprises 4595 genes. Amongst this gene list, 233 belong to our Gene Set. Plugging in the numbers we get:-

| | Differentially Expressed | Not Differentially Expressed |
|-----------------|--------------------------|------------------------------|
| In Gene Set | 233 | 388 |
| Not in Gene Set | 4362 | 22196 |
+

This yields a significant p-value with Fisher's exact test.
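The test on the table above can be sketched in a few lines of base R. The object names `ora_tab` and `res` are illustrative and do not come from the course material, and the one-sided alternative is an assumption about what an over-representation question calls for rather than a statement of how the page computed its p-value.

```r
## 2x2 table from the worked example
## rows: gene set membership, columns: differential expression status
ora_tab <- matrix(c(233,   388,
                    4362, 22196),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(GeneSet = c("In", "NotIn"),
                                  DE      = c("Yes", "No")))

## one-sided Fisher's exact test (assumed alternative):
## are gene set members over-represented among the DE genes?
res <- fisher.test(ora_tab, alternative = "greater")

res$p.value   ## p-value for over-representation
res$estimate  ## conditional estimate of the odds ratio
```

Swapping `alternative = "greater"` for the default two-sided test would also flag depletion of the gene set, which is usually not the question being asked in ORA.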

+

Another way of thinking about this is to randomly pick a set of 4595 genes (i.e. without using a p-value cut-off) and see how many belong to our gene set.

+

The first time we do this, we get 100 genes in our set. The second time we get 112 and so on…

+

If we repeat this enough times we can make a histogram:-

+

+

We see that a value of 233 is extremely unlikely. In other words, using our p-value cut-off to generate our gene list has resulted in about twice as many genes from our gene set as we would expect by chance.
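The "about twice" figure can also be checked by hand. As a sketch, assume the random draw is sampling without replacement from the 27179 genes in the table, so that the count of gene set members follows a hypergeometric distribution. Writing a, b, c, d for the four cells of the table (a = 233, b = 388, c = 4362, d = 22196) and n for their sum, the expected count is:

$$
\mathbb{E}[a] = \frac{(a + b)(a + c)}{n} = \frac{621 \times 4595}{27179} \approx 105
$$

The observed 233 is therefore roughly 233 / 105 ≈ 2.2 times the expected value, in line with the random draws of 100 and 112 quoted earlier. Note that the table contains 621 gene set members (233 + 388) rather than the full 634.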

+

Using WebGestalt for ORA

diff --git a/index.md b/index.md
index 12012b5..dc09fed 100644
--- a/index.md
+++ b/index.md
@@ -427,6 +427,8 @@ The ontologies are split into three categories
 
 The KEGG database also defines sets of genes. There is no defined relationship between KEGG pathways. There is however a complex network between genes belonging to the same pathway which does not exist in GO.
 
+- [e.g. Pathways in cancer](https://www.genome.jp/kegg-bin/show_pathway?hsa05200)
+
 The choice of database does not actually affect how the statistical testing works. We test for significant collections regardless of how the collections have been defined.
 
 
@@ -450,7 +452,7 @@ There are two different approaches one might use, and we will cover the theory b
 
 The question we are asking here is;
 
-> ***"Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?"***
+> ***"Is the number of DE genes associated with Gene Set X significantly greater than what we might expect by chance alone?"***
 
 We can answer this question by knowing
 
@@ -471,7 +473,31 @@ with:-
 | Not in Gene Set | c | d | c + d |
 | **Total** | **a + c** | **b +d** | **a + b + c + d (=n)** |
 
-In this first test, our genes will be grouped together according to their Gene Ontology (GO) terms:-
+
+As a worked example, consider a Gene Set with **634** genes. After performing differential expression analysis, we find that our list of differentially-expressed genes comprises **4595** genes. Amongst this gene list, **233** belong to our Gene Set. Plugging in the numbers we get:-
+
+
+
+| | Differentially Expressed | Not Differentially Expressed |
+|-----------------|--------------------------|------------------------------|
+| In Gene Set | 233 | 388 |
+| Not in Gene Set | 4362 | 22196 |
+
+
+This yields a **significant p-value** with Fisher's exact test.
+
+Another way of thinking about this is to *randomly* pick a set of **4595** genes (i.e. without using a p-value cut-off) and see how many belong to our gene set.
+
+
+The first time we do this, we get **100** genes in our set. The second time we get **112** and so on...
+
+If we repeat this enough times we can make a histogram:-
+
+![](geneset_randomHist.png)
+
+We see that a value of 233 is extremely unlikely. In other words, using our p-value cut-off to generate our gene list has resulted in about **twice as many genes from our gene set as we would expect by chance**.
+
+- [R script for those that are interested...](sample_geneset.R)
 
 ## Using WebGestalt for ORA
 
diff --git a/sample_geneset.R b/sample_geneset.R
new file mode 100644
index 0000000..c2b9538
--- /dev/null
+++ b/sample_geneset.R
@@ -0,0 +1,57 @@
+## check if packages are found and install if not
+
+if(!require(BiocManager)) install.packages("BiocManager")
+if(!require(tidyverse)) install.packages("tidyverse")
+if(!require(org.Mm.eg.db)) BiocManager::install("org.Mm.eg.db")
+
+
+## load our packages
+library(org.Mm.eg.db)
+library(tidyverse)
+
+## find genes belonging to GO:0007049 - cell cycle
+my_genes <- AnnotationDbi::select(org.Mm.eg.db,
+                                  keys = "GO:0007049",
+                                  keytype = "GO",
+                                  columns = c("SYMBOL","ENTREZID")) %>% pull(SYMBOL)
+
+length(my_genes)
+## 634
+
+## load our entire set of results
+results <- read_csv("background.csv")
+n_sig <- sum(results$FDR < 0.05)
+
+# 4595
+
+n_obs <- sum(results$FDR < 0.05 & results$SYMBOL %in% my_genes)
+## 233
+
+tab <- table(results$SYMBOL %in% my_genes, results$FDR < 0.05)
+tab
+#
+#       FALSE  TRUE
+# FALSE 22196  4362
+# TRUE    388   233
+
+
+chisq.test(tab)
+
+test_genes <- NULL
+
+## set seed so the results are reproducible
+set.seed(1233)
+
+for(i in 1:1000){
+
+  test_genes[[i]] <- results %>%
+    slice_sample(n = n_sig) %>% ## pick n_sig rows at random
+    mutate(InPathway = SYMBOL %in% my_genes) %>% ## add extra column flagging whether each selected gene is in our gene set
+    count(InPathway) %>% ## count up the number in our gene set
+    filter(InPathway) %>%
+    pull(n) ## extract n - the number of genes in our gene set for this random set
+}
+
+data.frame(N = unlist(test_genes)) |>
+  ggplot(aes(x = N)) + geom_histogram(binwidth = 1) + geom_vline(xintercept = n_obs,col="red",lty=2)
+ggsave("geneset_randomHist.png")