Add example of sampling genes

sheffield-bioinformatics-core · Oct 24, 2023 · 6647c56
1 parent 3ee1daa
Showing 4 changed files with 138 additions and 5 deletions.
diff --git a/geneset_randomHist.png b/geneset_randomHist.png
diff --git a/index.html b/index.html
diff --git a/index.md b/index.md
@@ -427,6 +427,8 @@ The ontologies are split into three categories
 
 The KEGG database also defines sets of genes. There is no defined relationship between KEGG pathways. There is however a complex network between genes belonging to the same pathway which does not exist in GO.
 
+- [e.g. Pathways in cancer](https://www.genome.jp/kegg-bin/show_pathway?hsa05200)
+
 The choice of database does not actually affect how the statistical testing works. We test for significant collections regardless of how the collections have been defined.
 
 
@@ -450,7 +452,7 @@ There are two different approaches one might use, and we will cover the theory b
 
 The question we are asking here is;
 
-> ***"Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?"***
+> ***"Is the number of DE genes associated with Gene Set X significantly greater than what we might expect by chance alone?"***
 
 We can answer this question by knowing
 
@@ -471,7 +473,31 @@ with:-
 |       Not in Gene Set        |                c                 |     d     |         c + d          |
 |          **Total**           |            **a + c**             | **b +d**  | **a + b + c + d (=n)** |
 
-In this first test, our genes will be grouped together according to their Gene Ontology (GO) terms:- <http://www.geneontology.org/>
+
+As a worked example, consider a Gene Set with **634** genes. After performing differential expression analysis, we find that our list of differentially-expressed genes comprises **4595** genes. Amongst this gene list, **233** belong to our Gene Set. Plugging in the numbers we get:-
+
+
+
+|                 | Differentially Expressed | Not Differentially Expressed |
+|-----------------|--------------------------|------------------------------|
+| In Gene Set     | 233                      | 388                          |
+| Not in Gene Set | 4362                     | 22196                        |
+
+
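+This yields a **significant p-value** with Fisher's exact test. Note that the In Gene Set row sums to 621 rather than 634: the table only counts genes that appear in our table of results.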
+
+Another way of thinking about this is to *randomly* pick a set of **4595** genes (i.e. without using a p-value cut-off) and see how many belong to our gene set.
+
+
+The first time we do this, we get **100** genes in our set. The second time we get **112** and so on...
+
+If we repeat this enough times, we can make a histogram:-
+
+![](geneset_randomHist.png)
+
+We see that a value of 233 is extremely unlikely. In other words, using our p-value cut-off to generate our gene list has resulted in about **twice as many genes from our gene set as we would expect by chance**.
+
+- [R script for those that are interested...](sample_geneset.R)
 
 ## Using WebGestalt for ORA
 

diff --git a/sample_geneset.R b/sample_geneset.R
@@ -0,0 +1,57 @@
+## check if packages are found and install if not
+
+if(!require(BiocManager)) install.packages("BiocManager")
+if(!require(tidyverse)) install.packages("tidyverse")  
+if(!require(org.Mm.eg.db)) BiocManager::install("org.Mm.eg.db")
+
+
+## load our packages
+library(org.Mm.eg.db)
+library(tidyverse)
+
+## find genes belonging to GO:0007049  - cell cycle
+my_genes <- AnnotationDbi::select(org.Mm.eg.db, 
+              keys = "GO:0007049",
+              keytype = "GO",
+              columns = c("SYMBOL","ENTREZID")) %>% pull(SYMBOL)
+
+length(my_genes)
+## 634
+
+## load our entire set of results
+results <- read_csv("background.csv")
+n_sig <- sum(results$FDR < 0.05)
+## 4595
+
+n_obs <- sum(results$FDR < 0.05 & results$SYMBOL %in% my_genes)
+## 233
+
+tab <- table(results$SYMBOL %in% my_genes, results$FDR < 0.05)
+tab
+#
+#         FALSE  TRUE
+# FALSE 22196  4362
+# TRUE    388   233
+
+
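+## Fisher's exact test, matching the test named in index.md
+## (chisq.test(tab) is a large-sample alternative)
+fisher.test(tab)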
+
+test_genes <- NULL
+
+## set seed so the results are reproducible
+set.seed(1233)
+
+for(i in 1:1000){
+
+  test_genes[[i]] <- results %>% 
+    slice_sample(n = n_sig) %>% ## pick n_sig rows at random
+    mutate(InPathway = SYMBOL %in% my_genes) %>% ## flag whether each selected gene is in our gene set
+    count(InPathway) %>% ## count up the number in our gene set
+    filter(InPathway) %>% 
+    pull(n) ## extract n - the number of genes in our gene set for this random set
+}
+
+data.frame(N = unlist(test_genes)) |>
+  ggplot(aes(x  = N)) + geom_histogram(binwidth = 1) + geom_vline(xintercept = n_obs,col="red",lty=2) 
+ggsave("geneset_randomHist.png")
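
The four counts in the 2x2 table are enough to check the claims in index.md without access to `background.csv`. The sketch below is not part of the commit. It uses base R only, takes its counts from the table in the diff, and its variable names (`counts`, `expected_in_set`, `sims`) are illustrative. It assumes a one-sided Fisher's exact test is the intended test, and it uses `rhyper()` as a stand-in for the `slice_sample()` loop, on the grounds that drawing `n_sig` rows without replacement and counting gene set members gives a hypergeometric count.

```r
## 2x2 table from index.md (rows: gene set membership, columns: DE status)
counts <- matrix(c(233, 4362, 388, 22196), nrow = 2,
                 dimnames = list(GeneSet = c("In", "Not in"),
                                 DE = c("Yes", "No")))

n_total  <- sum(counts)            # all genes in the results
n_in_set <- sum(counts["In", ])    # gene set members present in the results
n_sig    <- sum(counts[, "Yes"])   # differentially expressed genes
n_obs    <- counts["In", "Yes"]    # observed overlap

## expected overlap if the DE list were a random draw of n_sig genes
expected_in_set <- n_sig * n_in_set / n_total

## one-sided Fisher's exact test for over-representation (assumed direction)
p_fisher <- fisher.test(counts, alternative = "greater")$p.value

## hypergeometric stand-in for the resampling loop in sample_geneset.R
set.seed(1233)
sims <- rhyper(1000, m = n_in_set, n = n_total - n_in_set, k = n_sig)

round(expected_in_set, 1)   # expected number of gene set members
n_obs / expected_in_set     # fold enrichment of observed over expected
p_fisher
range(sims)                 # spread of the simulated overlaps
```

With these counts the expected overlap is roughly 105 genes, so the observed 233 is a little over twice the chance expectation, in line with the histogram and the text in index.md.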