diff --git a/geneset_randomHist.png b/geneset_randomHist.png
new file mode 100644
index 0000000..c2a91b7
Binary files /dev/null and b/geneset_randomHist.png differ
diff --git a/index.html b/index.html
index b3f7b1f..ec7ec63 100644
--- a/index.html
+++ b/index.html
@@ -2209,6 +2209,10 @@
Enrichment and Pathways Analysis
relationship between KEGG pathways. There is however a complex network
between genes belonging to the same pathway which does not exist in
GO.
+
The choice of database does not actually affect how the statistical
testing works. We test of significant collections regardless of how the
collections have been defined.
@@ -2245,7 +2249,7 @@ Over-representation analysis (ORA)
with a fold-change cut-off)
The question we are asking here is;
-“Are the number of DE genes associated with Theme X
+“Is the number of DE genes associated with Gene Set X
significantly greater than what we might expect by chance
alone?”
@@ -2298,8 +2302,54 @@ Over-representation analysis (ORA)
-In this first test, our genes will be grouped together according to
-their Gene Ontology (GO) terms:- http://www.geneontology.org/
+As a worked example, consider a Gene Set with 634
+genes. After performing differential expression, we find that our list
+of differentially-expressed genes comprises 4595 genes.
+Amongst this gene list, 233 belong to our Gene Set.
+Plugging in the numbers, we get:-
+
+
+
+
+
+
+
+
+
+
+
+In Gene Set |
+233 |
+388 |
+
+
+Not in Gene Set |
+4362 |
+22196 |
+
+
+
+This yields a significant p-value with Fisher’s exact
+test.
+Another way of thinking about this is to randomly pick a set
+of 4595 genes (i.e. without using a p-value cut-off)
+and see how many belong to our gene set.
+The first time we do this, we get 100 genes in our
+set. The second time we get 112 and so on…
+If we repeat this enough times, we can make a histogram:-
+
+We see that a value of 233 is extremely unlikely. In other words,
+using our p-value cut-off to generate our gene list has resulted in
+about twice as many genes from our gene set as we would expect by
+chance.
+
Using WebGestalt for ORA
diff --git a/index.md b/index.md
index 12012b5..dc09fed 100644
--- a/index.md
+++ b/index.md
@@ -427,6 +427,8 @@ The ontologies are split into three categories
The KEGG database also defines sets of genes. There is no defined relationship between KEGG pathways. There is however a complex network between genes belonging to the same pathway which does not exist in GO.
+- [e.g. Pathways in cancer](https://www.genome.jp/kegg-bin/show_pathway?hsa05200)
+
The choice of database does not actually affect how the statistical testing works. We test of significant collections regardless of how the collections have been defined.
@@ -450,7 +452,7 @@ There are two different approaches one might use, and we will cover the theory b
The question we are asking here is;
-> ***"Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?"***
+> ***"Is the number of DE genes associated with Gene Set X significantly greater than what we might expect by chance alone?"***
We can answer this question by knowing
@@ -471,7 +473,31 @@ with:-
| Not in Gene Set | c | d | c + d |
| **Total** | **a + c** | **b +d** | **a + b + c + d (=n)** |
-In this first test, our genes will be grouped together according to their Gene Ontology (GO) terms:-
+
+As a worked example, consider a Gene Set with **634** genes. After performing differential expression, we find that our list of differentially-expressed genes comprises **4595** genes. Amongst this gene list, **233** belong to our Gene Set. Plugging in the numbers, we get:-
+
+
+
+| | Differentially Expressed | Not Differentially Expressed |
+|-----------------|--------------------------|------------------------------|
+| In Gene Set | 233 | 388 |
+| Not in Gene Set | 4362 | 22196 |
+
+
+This yields a **significant p-value** with Fisher's exact test.
+
+Another way of thinking about this is to *randomly* pick a set of **4595** genes (i.e. without using a p-value cut-off) and see how many belong to our gene set.
+
+
+The first time we do this, we get **100** genes in our set. The second time we get **112** and so on...
+
+If we repeat this enough times, we can make a histogram:-
+
+![](geneset_randomHist.png)
+
+We see that a value of 233 is extremely unlikely. In other words, using our p-value cut-off to generate our gene list has resulted in about **twice as many genes from our gene set as we would expect by chance**.
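+
+As a rough check on "about twice as many", the expected number of gene set members in a random draw is the hypergeometric mean. The sketch below uses the notation of the contingency table above, with the totals taken from the worked example, and assumes the page renders LaTeX math:
+
+$$
+E[a] = \frac{(a+b)(a+c)}{n} = \frac{621 \times 4595}{27179} \approx 105
+$$
+
+The observed value of 233 is therefore roughly 2.2 times the count expected by chance.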
+
+- [R script for those that are interested...](sample_geneset.R)
## Using WebGestalt for ORA
diff --git a/sample_geneset.R b/sample_geneset.R
new file mode 100644
index 0000000..c2b9538
--- /dev/null
+++ b/sample_geneset.R
@@ -0,0 +1,57 @@
+## check if packages are found and install if not
+
+if(!require(BiocManager)) install.packages("BiocManager")
+if(!require(tidyverse)) install.packages("tidyverse")
+if(!require(org.Mm.eg.db)) BiocManager::install("org.Mm.eg.db")
+
+
+## load our packages
+library(org.Mm.eg.db)
+library(tidyverse)
+
+## find genes belonging to GO:0007049 - cell cycle
+my_genes <- AnnotationDbi::select(org.Mm.eg.db,
+ keys = "GO:0007049",
+ keytype = "GO",
+ columns = c("SYMBOL","ENTREZID")) %>% pull(SYMBOL)
+
+length(my_genes)
+## 634
+
+## load our entire set of results
+results <- read_csv("background.csv")
+n_sig <- sum(results$FDR < 0.05)
+
+## 4595
+
+n_obs <- sum(results$FDR < 0.05 & results$SYMBOL %in% my_genes)
+## 233
+
+tab <- table(results$SYMBOL %in% my_genes, results$FDR < 0.05)
+tab
+#
+# FALSE TRUE
+# FALSE 22196 4362
+# TRUE 388 233
+
+
+chisq.test(tab)
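+
+## Sketch only (not part of the original analysis): the accompanying text
+## refers to Fisher's exact test, so the same 2x2 counts can also be tested
+## that way. The counts are re-entered by hand from the table printed above,
+## and `tab_fisher` is a hypothetical name. The one-sided alternative is an
+## assumption, chosen because ORA asks about over-representation.
+tab_fisher <- matrix(c(233, 388,
+                       4362, 22196),
+                     nrow = 2, byrow = TRUE,
+                     dimnames = list(GeneSet = c("In", "Not in"),
+                                     DE = c("Yes", "No")))
+fisher.test(tab_fisher, alternative = "greater")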
+
+test_genes <- NULL
+
+## set seed so the results are reproducible
+set.seed(1233)
+
+for(i in 1:1000){
+
+ test_genes[[i]] <- results %>%
+ slice_sample(n = n_sig) %>% ## pick n_sig rows at random
+ mutate(InPathway = SYMBOL %in% my_genes) %>% ## add a TRUE/FALSE column flagging which selected genes are in our gene set
+ count(InPathway) %>% ## count up the number in our gene set
+ filter(InPathway) %>%
+ pull(n) ## extract n - the number of genes in our gene set for this random set
+}
+
+data.frame(N = unlist(test_genes)) |>
+ ggplot(aes(x = N)) + geom_histogram(binwidth = 1) + geom_vline(xintercept = n_obs,col="red",lty=2)
+ggsave("geneset_randomHist.png")