Skip to content

Latest commit

 

History

History
153 lines (115 loc) · 3.04 KB

creating_cohort_gene_expression_matrices.md

File metadata and controls

153 lines (115 loc) · 3.04 KB
title author date output
Pulling down small matrices
David L Gibbs
March 28, 2016
html_document

The goal is to use a predifined cohort, and a gene set, along with bigquery to create a small matrix of gene expression.

If you don't have our ISBCGCExamples, check out https://github.com/isb-cgc/examples-R

library(ISBCGCExamples)
library(stringr)
library(dplyr)

Let's suppose you've already created a cohort using the web-app or "save_cohorts" R function. Earlier, I created a cohort of all TCGA samples from brain tissue. I'll use the cohort functions to wrangle that data.

mytoken <- isb_init()       # get authorized

mycohorts <- list_cohorts(mytoken) # get my list of cohorts
## Auto-refreshing stale OAuth token.
mycohorts$count             # number of cohorts I've saved (3)
## [1] "3"
mycohorts$items[[2]]$name   # name of the cohort
## [1] "brain_age_10_to_39"
mycohorts$items[[2]]$id     # ID number of the cohort.
## [1] "69"

To get the list of barcodes, use the barcodes_from_cohort function, substituting the cohort ID. Your cohort IDs are found in the list_cohorts output.

mytoken <- isb_init()       # get authorized
barcodes  <- barcodes_from_cohort("69", mytoken)
barcodes$sample_count # number of samples available
## [1] "126"

barcodes is a list with both sample and patient barcodes, and counts of both.

sample_barcodes <- unlist(barcodes$samples)

Secondly, let's suppose you have a list of genes in a file.. on your desktop.

gene_table <- read.table("notch_pathway_genes.txt", sep="\t", header=T, stringsAsFactors=F)
notch_genes <- gene_table$Gene.Name

Now we'll build the query. I'm first going to define a helper function that creates lists of things in the SQL format.

qlist <- function(g) {
  x <- sapply(g, function(gi) paste("\'", gi, "\'", sep=""))
  y <- c()
  for(i in 1:(length(g)-1)){
    y <- c(y, x[i], ",")
  }
  y <- c(y, x[length(x)])
  z <- paste(y, collapse="")
  paste0('(', z, ')')
}

Now depending on whether you want to work with ParticipantBarcodes or SampleBarcodes, you can modify the below query.

require(bigrquery) || install.packages("bigrquery")
## [1] TRUE
querySql <- paste0("
SELECT
  ParticipantBarcode,
  SampleBarcode,
  Study,
  HGNC_gene_symbol as gene_symbol,
  LOG2(normalized_count+1) AS log2_expr
FROM
  [isb-cgc:tcga_201510_alpha.mRNA_UNC_HiSeq_RSEM]
WHERE
  ( SampleTypeLetterCode='TP'
    AND HGNC_gene_symbol in ", qlist(notch_genes), "
    AND SampleBarcode in ", qlist(sample_barcodes), "
   )
GROUP BY
ParticipantBarcode,
SampleBarcode,
Study,
gene_symbol,
log2_expr
")
result <- query_exec(querySql, project=project)

Then we just have to convert the table to a matrix.

library(tidyr)

data_matrix <- result %>% select(SampleBarcode, gene_symbol, log2_expr)  %>%
    group_by(SampleBarcode, gene_symbol) %>%
    mutate(row=1:n()) %>%
    spread(gene_symbol, log2_expr)

Now we have a data frame of gene expression for downstream processing.