Add example of sampling genes
markdunning committed Oct 24, 2023
1 parent 3ee1daa commit 6647c56
Showing 4 changed files with 138 additions and 5 deletions.
Binary file added geneset_randomHist.png
56 changes: 53 additions & 3 deletions index.html

Large diffs are not rendered by default.

30 changes: 28 additions & 2 deletions index.md
@@ -427,6 +427,8 @@ The ontologies are split into three categories

The KEGG database also defines sets of genes. There is no defined relationship between KEGG pathways. There is, however, a complex network between genes belonging to the same pathway, which does not exist in GO.

- [e.g. Pathways in cancer](https://www.genome.jp/kegg-bin/show_pathway?hsa05200)

The choice of database does not actually affect how the statistical testing works. We test for significant collections regardless of how the collections have been defined.


@@ -450,7 +452,7 @@ There are two different approaches one might use, and we will cover the theory b

The question we are asking here is;

> ***"Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?"***
> ***"Is the number of DE genes associated with Gene Set X significantly greater than what we might expect by chance alone?"***
We can answer this question by knowing

@@ -471,7 +473,31 @@ with:-
| Not in Gene Set | c | d | c + d |
| **Total** | **a + c** | **b + d** | **a + b + c + d (=n)** |

In this first test, our genes will be grouped together according to their Gene Ontology (GO) terms:- <http://www.geneontology.org/>

As a worked example, consider a Gene Set with **634** genes. After performing differential expression analysis, we find that our list of differentially-expressed genes comprises **4595** genes. Amongst this gene list, **233** belong to our Gene Set. Plugging in the numbers, we get:-



| | Differentially Expressed | Not Differentially Expressed |
|-----------------|--------------------------|------------------------------|
| In Gene Set | 233 | 388 |
| Not in Gene Set | 4362 | 22196 |

Note that the *In Gene Set* row sums to 621 rather than 634, because the table only counts genes from the set that appear in our results.


This yields a **significant p-value** with Fisher's exact test.
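
For those who want to try this themselves, here is a minimal sketch of how such a test could be run in R. It enters the counts from the table above into a 2 x 2 matrix and passes it to `fisher.test` from base R. The object names (`worked_example`, `ft`) are ours and purely illustrative, and `alternative = "greater"` is an assumption that we only care about over-representation.

```r
## counts from the worked example
## rows = gene set membership, columns = differential expression status
worked_example <- matrix(c(233,   388,
                           4362, 22196),
                         nrow = 2, byrow = TRUE,
                         dimnames = list(GeneSet = c("In", "Not in"),
                                         DE = c("Yes", "No")))

## one-sided Fisher's exact test (assumption: testing for over-representation only)
ft <- fisher.test(worked_example, alternative = "greater")

ft$p.value   ## the p-value for the test
ft$estimate  ## estimated odds ratio; values above 1 suggest enrichment
```

Dropping the `alternative` argument gives the default two-sided test, which would also flag gene sets that are under-represented in the gene list.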

Another way of thinking about this is to *randomly* pick a set of **4595** genes (i.e. without using a p-value cut-off) and see how many belong to our gene set.


The first time we do this, we get **100** genes in our set. The second time we get **112** and so on...

If we repeat this enough times, we can make a histogram:-

![](geneset_randomHist.png)

We see that a value of 233 is extremely unlikely. In other words, using our p-value cut-off to generate our gene list has resulted in about **twice as many genes from our gene set as we would expect by chance**.
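
The same idea can be sketched without any data files, using only the counts from the worked example. Picking genes at random without replacement follows a hypergeometric distribution, so base R's `rhyper` and `phyper` can stand in for the repeated sampling. This is an illustration under that assumption rather than the script used to make the plot, and the object names are ours.

```r
## counts taken from the worked example table
n_total  <- 233 + 388 + 4362 + 22196  ## all genes in the results
n_in_set <- 233 + 388                 ## gene set members present in the results
n_drawn  <- 4595                      ## size of each random gene list

## number of gene set members we expect in a random list of this size
expected <- n_drawn * n_in_set / n_total
expected

## simulate 1000 random gene lists and record how many gene set members each contains
set.seed(1233)
sim <- rhyper(1000, m = n_in_set, n = n_total - n_in_set, k = n_drawn)
summary(sim)

## probability of seeing 233 or more gene set members purely by chance
p_upper <- phyper(233 - 1, m = n_in_set, n = n_total - n_in_set, k = n_drawn,
                  lower.tail = FALSE)
p_upper
```

The upper-tail probability computed here is the same quantity that a one-sided Fisher's exact test reports, which is why the two ways of thinking about the problem agree.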

- [R script for those that are interested...](sample_geneset.R)

## Using WebGestalt for ORA

57 changes: 57 additions & 0 deletions sample_geneset.R
@@ -0,0 +1,57 @@
## check if packages are found and install if not

if(!require(BiocManager)) install.packages("BiocManager")
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(org.Mm.eg.db)) BiocManager::install("org.Mm.eg.db")


## load our packages
library(org.Mm.eg.db)
library(tidyverse)

## find genes belonging to GO:0007049 - cell cycle
my_genes <- AnnotationDbi::select(org.Mm.eg.db,
                                  keys = "GO:0007049",
                                  keytype = "GO",
                                  columns = c("SYMBOL","ENTREZID")) %>% pull(SYMBOL)

length(my_genes)
## 634

## load our entire set of results
results <- read_csv("background.csv")
n_sig <- sum(results$FDR < 0.05)

## 4595

n_obs <- sum(results$FDR < 0.05 & results$SYMBOL %in% my_genes)
## 233

tab <- table(results$SYMBOL %in% my_genes, results$FDR < 0.05)
tab
#
#         FALSE  TRUE
#   FALSE 22196  4362
#   TRUE    388   233


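## chi-squared test of independence on the 2 x 2 table
## (fisher.test(tab) would run the Fisher's exact test mentioned in the text)
chisq.test(tab)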

## list to hold the result of each random draw
test_genes <- vector("list", 1000)

## set seed so the results are reproducible
set.seed(1233)

for(i in seq_along(test_genes)){

  test_genes[[i]] <- results %>%
    slice_sample(n = n_sig) %>% ## pick n_sig rows at random
    mutate(InPathway = SYMBOL %in% my_genes) %>% ## flag whether each selected gene is in our gene set
    count(InPathway) %>% ## count the genes inside and outside our gene set
    filter(InPathway) %>% ## keep only the count for genes in our gene set
    pull(n) ## extract n - the number of genes in our gene set for this random set
}

## histogram of the random counts, with the observed count marked in red
data.frame(N = unlist(test_genes)) |>
  ggplot(aes(x = N)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = n_obs, col = "red", lty = 2)
ggsave("geneset_randomHist.png")
