Update readme files for example data sets; update figure about data formats; update gitignore; update main readme.
romanhaa committed Oct 3, 2019
1 parent b994ad1 commit 5cc493f
Showing 12 changed files with 120 additions and 83 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
.DS_Store
Docker/log.stdout
source/node_modules
source/Cerebro
source/Rplots.pdf
4 changes: 3 additions & 1 deletion README.md
@@ -210,7 +210,9 @@ docker run -p <port_of_choice>:<port_of_choice> -v <export_folder>:/plots romanh

We provide documentation and commands for the following example data sets:

* [`pbmc_10k_v3`](examples/pbmc_10k_v3/)
* [`pbmc_10k_v3`](examples/pbmc_10k_v3): single sample of human peripheral blood mononuclear cells
* [`GSE108041`](examples/GSE108041): 4 samples of A549 cells before and after infection with influenza virus
* [`GSE129845`](examples/GSE129845): 3 samples of human bladder cells from 3 patients

## Conversion of other single cell data formats

3 changes: 1 addition & 2 deletions examples/GSE108041/README.md
@@ -1,6 +1,6 @@
# `GSE108041` data set

This data set comes from the publication "Extreme heterogeneity of influenza virus infection in single cells" by Russell *et al.*, eLIFE (2018) ([DOI](https://doi.org/10.7554/eLife.32303), [GEO submission](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108041)).
This data set is taken from the publication "Extreme heterogeneity of influenza virus infection in single cells" by Russell *et al.*, eLife (2018) ([DOI](https://doi.org/10.7554/eLife.32303), [GEO submission](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108041)).
It contains ~13,000 cells from 4 samples, taken before infection and 6, 8, and 10 hours after infection with influenza virus.

To test Cerebro, download the `.crb` file from either [Seurat v3](Seurat_v3) or [scanpy](scanpy) and load it into Cerebro.
@@ -35,5 +35,4 @@ Lastly, from the Seurat object we export a Cerebro file (`.crb` extension) that
## How to reproduce

The example data sets were generated using the official Cerebro Docker image ([Docker Hub](https://cloud.docker.com/u/romanhaa/repository/docker/romanhaa/cerebro)), which was built with Docker and imported into [Singularity](https://singularity.lbl.gov/) (here I used Singularity 2.6.0).
The workflows for Seurat v2 and Seurat v3 are conceptually identical with some differences due to changes in the Seurat package.
Details and descriptions for all workflows can be found in the respective directories: [Seurat v3](Seurat_v3) and [scanpy](scanpy).
17 changes: 12 additions & 5 deletions examples/GSE108041/Seurat_v3/README.md
@@ -1,11 +1,11 @@
# Seurat v3 workflow for `GSE108041` data set

Here, we analyze a [`GSE108041`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108041) data set which was published by [Russell *et al.* in 2018 (eLIFE)](https://doi.org/10.7554/eLife.32303) using [Seurat](https://satijalab.org/seurat/) framework, following the basic [Seurat](https://satijalab.org/seurat/) workflow.
Here, we analyze the `GSE108041` data set ("Extreme heterogeneity of influenza virus infection in single cells", Russell *et al.*, eLife (2018), [DOI](https://doi.org/10.7554/eLife.32303), [GEO submission](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108041)) using the [Seurat](https://satijalab.org/seurat/) framework, following the basic Seurat workflow.

## Preparation

Before starting, we clone the Cerebro repository (or manually download it) because it contains the raw data of our example data set.
One (optional) step of our analysis will require us to provide some gene sets in a GMT file.
One (optional) step of our analysis will require us to provide some gene sets in a `GMT` file.
We manually download the `c2.all.v7.0.symbols.gmt` file from [MSigDB](http://software.broadinstitute.org/gsea/downloads.jsp#msigdb) and put it in our current working directory.
Then, we pull the Docker image from the Docker Hub, convert it to Singularity, and start an R session inside.

@@ -36,7 +36,7 @@ library('cerebroApp')

## Load transcript counts

We load the sparse transcript count matrices downloaded from the 10x Genomics website, add the respective sample info to the cell barcode, and merge them into a big matrix.
For each of the four samples we load the transcript count matrix (`.h5` format), add a tag for the sample of origin to the cellular barcodes, and then merge the transcript counts together.
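
A minimal sketch of this tag-and-merge pattern is shown below; the helper function, sample tags, and all file names other than the uninfected sample are illustrative assumptions rather than the code used in this workflow.

```r
library(Seurat)

# Hypothetical helper: read one .h5 count matrix and append a sample tag to every
# cell barcode so that barcodes remain unique after merging the four samples.
load_sample <- function(path, tag) {
  counts <- Read10X_h5(path)
  colnames(counts) <- paste0(colnames(counts), '-', tag)
  counts
}

sample_uninfected <- load_sample('raw_data/GSM2888370_Uninfected.h5', 'uninfected')
# ...load the 6h, 8h and 10h samples the same way (placeholder paths), then merge by column:
# transcripts <- cbind(sample_uninfected, sample_6h, sample_8h, sample_10h)
```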

```r
sample_uninfected <- Read10X_h5('raw_data/GSM2888370_Uninfected.h5') %>%
@@ -80,8 +80,15 @@ feature_matrix <- dplyr::select(feature_matrix, -gene)

## Pre-processing with Seurat

With the merged transcript count matrix ready, we create a Seurat object, add sample info to meta data, and remove cells with less than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including normalization, identifying highly variably genes, scaling expression values and regressing out the number of transcripts per cell, perform principal component analysis (PCA), find neighbors and clusters.
With the merged transcript count matrix ready, we create a Seurat object, add sample info to the meta data, and remove cells with fewer than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including...

* normalization,
* identifying highly variable genes,
* scaling expression values and regressing out the number of transcripts per cell,
* performing principal component analysis (PCA), and
* finding neighbors and clusters.

Furthermore, we build a cluster tree that represents the similarity between clusters and create a dedicated `cluster` column in the meta data.
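
A minimal sketch of these steps with standard Seurat v3 calls, applied to the Seurat object created from the merged matrix, is shown below; the parameter values (number of dimensions, clustering resolution) are illustrative placeholders rather than the settings used for this data set.

```r
# Sketch of the standard Seurat v3 workflow; parameter values are placeholders.
seurat <- NormalizeData(seurat)
seurat <- FindVariableFeatures(seurat)
seurat <- ScaleData(seurat, vars.to.regress = 'nCount_RNA')  # regress out transcripts per cell
seurat <- RunPCA(seurat)
seurat <- FindNeighbors(seurat, dims = 1:30)
seurat <- FindClusters(seurat, resolution = 0.5)

# cluster similarity tree and a dedicated `cluster` column in the meta data
seurat <- BuildClusterTree(seurat, dims = 1:30, reorder = TRUE, reorder.numeric = TRUE)
seurat@meta.data$cluster <- Idents(seurat)
```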

```r
18 changes: 9 additions & 9 deletions examples/GSE108041/scanpy/README.md
@@ -1,12 +1,12 @@
# scanpy workflow for `GSE108041` data set

Here, we analyze the `GSE108041` data set using [scanpy](https://scanpy.readthedocs.io), following the [basics workflow](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) described on their website which includes similar steps as those performed in Seurat.
Then, import the [AnnData](https://anndata.readthedocs.io/en/stable) object produced by scanpy, import it into Seurat, and from there export it to Cerebro.
Here, we analyze the `GSE108041` data set ("Extreme heterogeneity of influenza virus infection in single cells", Russell *et al.*, eLife (2018), [DOI](https://doi.org/10.7554/eLife.32303), [GEO submission](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108041)) using [scanpy](https://scanpy.readthedocs.io), following the [basic workflow described on the scanpy website](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html), which includes steps similar to those performed in Seurat.
Then, we import the [AnnData](https://anndata.readthedocs.io/en/stable) object produced by scanpy into Seurat and from there export it to Cerebro.

## Preparation

Before starting, we clone the Cerebro repository (or manually download it) because it contains the raw data of our example data set.
One (optional) step of our analysis will require us to provide some gene sets in a GMT file.
One (optional) step of our analysis will require us to provide some gene sets in a `GMT` file.
We manually download the `c2.all.v7.0.symbols.gmt` file from [MSigDB](http://software.broadinstitute.org/gsea/downloads.jsp#msigdb) and put it in our current working directory.
Then, we pull the Docker image from the Docker Hub, convert it to Singularity, and start an R session inside.

@@ -32,7 +32,7 @@ import scanpy as sc

## Load data

For each of the three samples we load the transcript count matrix and then merge them together.
For each of the four samples we load the transcript count matrix (`.h5` format), make feature names unique (some gene IDs share the same gene name), and then merge the transcript counts together.

```python
adata_uninfected = sc.read_10x_h5('raw_data/GSM2888370_Uninfected.h5')
@@ -71,7 +71,7 @@ adata.obs['sample'].cat.reorder_categories(

Now, we...

* remove cells with less than `100` transcripts or fewer than `50` expressed genes,
* remove cells with fewer than `100` transcripts or fewer than `50` expressed genes,
* calculate the number of transcripts per cell, and
* remove genes expressed in fewer than `10` cells.

@@ -91,7 +91,7 @@ np.savetxt('scanpy/raw_counts_genes.tsv', adata.var.index, fmt = '%s', delimiter
np.savetxt('scanpy/raw_counts_cells.tsv', adata.obs.index, fmt = '%s', delimiter = '\t')
```

What follows is the standard pre-processing procedure of...
What follows is the standard pre-processing procedure, including the following steps...

* normalizing transcript counts per cell,
* bringing transcript counts to log-scale,
@@ -151,7 +151,7 @@ sc.logging.print_versions()
Next,...

* we hop into R,
* set up some parameters,
* set some parameters,
* load packages, and
* import the `.h5ad` file we just wrote to disk using the `ReadH5AD()` function from the Seurat package.
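
A minimal sketch of this import step follows; the `.h5ad` file name is a placeholder rather than the one used in this workflow.

```r
library(Seurat)

# Placeholder file name; ReadH5AD() converts the AnnData object written by scanpy
# into a Seurat object that we keep working with below.
seurat <- ReadH5AD('scanpy/adata.h5ad')
```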

@@ -181,7 +181,7 @@ levels(seurat@meta.data$phase) <- c('G1','G2M','S')

## Optional (but recommended) steps

We could already export this object and visualize the contained in Cerebro.
We could already export this object and visualize the contained data in Cerebro.
However, data exploration in Cerebro would greatly benefit from additional data generated by the functions of cerebroApp.
What follows is a set of (mostly) optional steps.

@@ -234,7 +234,7 @@ seurat@meta.data$tree.ident <- NULL

### Add 3D projections

Let's also add 3D dimensional reductions for tSNE and UMAP.
We also add 3D dimensional reductions made with tSNE and UMAP.
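
A minimal sketch of what these calls could look like with Seurat v3; the reduction names, keys, and number of input dimensions are illustrative and not necessarily the settings used here.

```r
# Three-dimensional tSNE and UMAP stored under separate reduction names (placeholders).
seurat <- RunTSNE(seurat, reduction.name = 'tSNE_3D', reduction.key = 'tSNE3D_',
                  dims = 1:30, dim.embed = 3)
seurat <- RunUMAP(seurat, reduction.name = 'UMAP_3D', reduction.key = 'UMAP3D_',
                  dims = 1:30, n.components = 3)
```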

```r
seurat <- RunTSNE(
1 change: 0 additions & 1 deletion examples/GSE129845/README.md
@@ -35,5 +35,4 @@ Lastly, from the Seurat object we export a Cerebro file (`.crb` extension) that
## How to reproduce

The example data sets were generated using the official Cerebro Docker image ([Docker Hub](https://cloud.docker.com/u/romanhaa/repository/docker/romanhaa/cerebro)), which was built with Docker and imported into [Singularity](https://singularity.lbl.gov/) (here I used Singularity 2.6.0).
The workflows for Seurat v2 and Seurat v3 are conceptually identical with some differences due to changes in the Seurat package.
Details and descriptions for the workflow can be found in the respective directory [Seurat v3](Seurat_v3).
25 changes: 20 additions & 5 deletions examples/GSE129845/Seurat_v3/README.md
@@ -1,11 +1,11 @@
# Seurat v3 workflow for `GSE129845` data set

Here, we analyze a [`GSE129845`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129845) data set which was published by [Russell *et al.* in 2018 (eLIFE)](https://doi.org/10.7554/eLife.32303) using [Seurat](https://satijalab.org/seurat/) framework, following the basic [Seurat](https://satijalab.org/seurat/) workflow.
Here, we analyze the `GSE129845` data set ("Single-Cell Transcriptomic Map of the Human and Mouse Bladders", Yu *et al.*, J Am Soc Nephrol (2019), [DOI](https://doi.org/10.1681/ASN.2019040335), [GEO submission](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129845)) using the [Seurat](https://satijalab.org/seurat/) framework, following the basic Seurat workflow.

## Preparation

Before starting, we clone the Cerebro repository (or manually download it) because it contains the raw data of our example data set.
One (optional) step of our analysis will require us to provide some gene sets in a GMT file.
One (optional) step of our analysis will require us to provide some gene sets in a `GMT` file.
We manually download the `c2.all.v7.0.symbols.gmt` file from [MSigDB](http://software.broadinstitute.org/gsea/downloads.jsp#msigdb) and put it in our current working directory.
Then, we pull the Docker image from the Docker Hub, convert it to Singularity, and start an R session inside.

@@ -36,10 +36,12 @@ library('cerebroApp')

## Load transcript counts

We load the sparse transcript count matrices downloaded from the 10x Genomics website, add the respective sample info to the cell barcode, and merge them into a big matrix.
For each of the three patient samples we load the transcript count matrix (`.mtx` format), add a tag for the sample of origin to the cellular barcodes, merge transcripts from genes with the same name, and then merge the transcript counts from the different patients together.
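
A minimal sketch of the loading pattern that is repeated for each patient below; the file names inside the GEO directories and the barcode-tagging scheme are assumptions, not necessarily what this workflow uses.

```r
library(Matrix)

# Hypothetical reader for one patient directory; the actual file names in the
# GEO archive may differ (e.g. gzipped matrix/barcodes/genes files).
read_patient <- function(path, tag) {
  counts <- readMM(file.path(path, 'matrix.mtx'))
  barcodes <- readLines(file.path(path, 'barcodes.tsv'))
  genes <- read.table(file.path(path, 'genes.tsv'), sep = '\t', stringsAsFactors = FALSE)[[2]]
  rownames(counts) <- genes
  colnames(counts) <- paste0(barcodes, '-', tag)
  counts
}

counts_patient_1 <- read_patient('./raw_data/GSM3723357', 'patient_1')
```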

### Patient 1

Load transcript counts from patient 1.

```r
path_to_data <- "./raw_data/GSM3723357"

@@ -69,6 +71,8 @@ feature_matrix_patient_1 <- feature_matrix

### Patient 2

Load transcript counts from patient 2.

```r
path_to_data <- "./raw_data/GSM3723358"

@@ -98,6 +102,8 @@ feature_matrix_patient_2 <- feature_matrix

### Patient 3

Load transcript counts from patient 3.

```r
path_to_data <- "./raw_data/GSM3723359"

@@ -127,6 +133,8 @@ feature_matrix_patient_3 <- feature_matrix

### Merge patient samples

Merge transcript counts from all three patients.

```r
feature_matrix <- dplyr::full_join(feature_matrix_patient_1, feature_matrix_patient_2, by = 'gene') %>%
dplyr::full_join(feature_matrix_patient_3, by = 'gene')
@@ -136,8 +144,15 @@ feature_matrix <- dplyr::select(feature_matrix, -gene)

## Pre-processing with Seurat

With the merged transcript count matrix ready, we create a Seurat object, add sample info to meta data, and remove cells with less than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including normalization, identifying highly variably genes, scaling expression values and regressing out the number of transcripts per cell, perform principal component analysis (PCA), find neighbors and clusters.
With the merged transcript count matrix ready, we create a Seurat object, add sample info to the meta data, and remove cells with fewer than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including...

* normalization,
* identifying highly variable genes,
* scaling expression values and regressing out the number of transcripts per cell,
* performing principal component analysis (PCA), and
* finding neighbors and clusters.

Furthermore, we build a cluster tree that represents the similarity between clusters and create a dedicated `cluster` column in the meta data.

```r
Expand Down
2 changes: 1 addition & 1 deletion examples/README.md
@@ -2,6 +2,6 @@

Examples of the Cerebro workflow are available for the following public data sets:

* [`pbmc_10k_v3`](pbmc_10k_v3): single sample of peripheral blood mononuclear cells
* [`pbmc_10k_v3`](pbmc_10k_v3): single sample of human peripheral blood mononuclear cells
* [`GSE108041`](GSE108041): 4 samples of A549 cells before and after infection with influenza virus
* [`GSE129845`](GSE129845): 3 samples of human bladder cells from 3 patients
59 changes: 39 additions & 20 deletions examples/pbmc_10k_v3/Seurat_v2/README.md
@@ -2,8 +2,10 @@

Here, we analyze the `pbmc_10k_v3` data set using the [Seurat](https://satijalab.org/seurat/) framework, following the basic Seurat workflow.

## Preparation

Before starting, we clone the Cerebro repository (or manually download it) because it contains the raw data of our example data set.
One (optional) step of our analysis will require us to provide some gene sets in a GMT file.
One (optional) step of our analysis will require us to provide some gene sets in a `GMT` file.
We manually download the `c2.all.v7.0.symbols.gmt` file from [MSigDB](http://software.broadinstitute.org/gsea/downloads.jsp#msigdb) and put it in our current working directory.
Then, we pull the Docker image from the Docker Hub, convert it to Singularity, and start an R session inside.

@@ -34,7 +36,7 @@ library('cerebroApp')

## Load transcript counts

Unfortunately, the `Read10X_h5()` function of Seurat v2 has problems with the `.h5` file downloaded from the 10x Genomics website so instead we load it manually and convert it to a sparse matrix.
Unfortunately, the `Read10X_h5()` function of Seurat v2 has problems with the `.h5` file downloaded from the 10x Genomics website, so instead we load it manually, convert it to a sparse matrix, and merge transcripts from genes with the same name.

```r
h5_data <- hdf5r::H5File$new('raw_data/filtered_feature_bc_matrix.h5', mode = 'r')
@@ -50,17 +52,44 @@ feature_matrix <- Matrix::sparseMatrix(
dims = h5_data[['matrix/shape']][],
index1 = FALSE
)

genes <- rownames(feature_matrix)

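# collapse rows that share a gene name by summing their counts (via a dense data frame)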
feature_matrix <- feature_matrix %>%
as.matrix() %>%
as.data.frame() %>%
dplyr::mutate(gene = genes) %>%
dplyr::select(gene, dplyr::everything()) %>%
dplyr::group_by(gene) %>%
dplyr::summarise_all(sum) %>%
dplyr::ungroup()

genes <- feature_matrix$gene

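# convert back to a sparse matrix and restore the gene names as row names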
feature_matrix <- feature_matrix %>%
dplyr::select(-gene) %>%
as.matrix() %>%
as('sparseMatrix')

rownames(feature_matrix) <- genes
```

## Pre-processing with Seurat

With the transcript counts loaded, we create a Seurat object and remove cells with fewer than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including normalization, identifying highly variably genes, scaling expression values and regressing out the number of transcripts per cell, perform principal component analysis (PCA), find neighbors and clusters.
Then, we follow the standard Seurat workflow, including...

* normalization,
* identifying highly variable genes,
* scaling expression values and regressing out the number of transcripts per cell,
* performing principal component analysis (PCA), and
* finding neighbors and clusters.

Furthermore, we build a cluster tree that represents the similarity between clusters and create a dedicated `cluster` column in the meta data.
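
A minimal sketch of these steps with Seurat v2 function names, applied to the Seurat object created in the next block, is shown below; the parameter values are placeholders and not necessarily the settings used here.

```r
# Sketch of the standard Seurat v2 workflow; parameter values are placeholders.
seurat <- NormalizeData(seurat, normalization.method = 'LogNormalize', scale.factor = 10000)
seurat <- FindVariableGenes(seurat)
seurat <- ScaleData(seurat, vars.to.regress = 'nUMI')  # regress out transcripts per cell
seurat <- RunPCA(seurat, pc.genes = seurat@var.genes)
seurat <- FindClusters(seurat, reduction.type = 'pca', dims.use = 1:30, resolution = 0.5)

# cluster similarity tree and a dedicated `cluster` column in the meta data
seurat <- BuildClusterTree(seurat, do.reorder = TRUE, reorder.numeric = TRUE)
seurat@meta.data$cluster <- seurat@ident
```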

```r
seurat <- CreateSeuratObject(
project = 'PBMC_10k_v3',
project = 'pbmc_10k_v3',
raw.data = feature_matrix,
min.cells = 10
)
@@ -162,24 +191,14 @@ seurat <- RunUMAP(

## Meta data

This example data set consists of a single sample.
To highlight the functionality of Cerebro when working with a multi-sample data set, we the cells of clusters 1-5 to `sample_A`, those in clusters 6-10 to `sample_B`, and those of clusters 11-16 to `sample_C`.
This example data set consists of a single sample, so we just add that name to the meta data.
Moreover, so that we can later trace how the analysis was done, we add some meta data to the `misc` slot of the Seurat object.

```r
meta_sample <- seurat@meta.data$cluster %>% as.character()
meta_sample[which(meta_sample %in% c('1','2','3','4','5'))] <- 'sample_A'
meta_sample[which(meta_sample %in% c('6','7','8','9','10'))] <- 'sample_B'
meta_sample[which(meta_sample %in% c('11','12','13','14','15','16'))] <- 'sample_C'
seurat@meta.data$sample <- factor(meta_sample, levels = c('sample_A','sample_B','sample_C'))
```
seurat@meta.data$sample <- factor('pbmc_10k_v3', levels = 'pbmc_10k_v3')

## Preparation

In order to later be able to understand how we did the analysis, we add some meta data to the `misc` slot of the Seurat object.

```r
seurat@misc$experiment <- list(
experiment_name = 'PBMC_10k',
experiment_name = 'pbmc_10k_v3',
organism = 'hg',
date_of_analysis = Sys.Date()
)
@@ -313,8 +332,8 @@ Finally, we use the `exportFromSeurat()` function of cerebroApp to export our Se
```r
cerebroApp::exportFromSeurat(
seurat,
experiment_name = 'PBMC_10k',
file = paste0('Seurat_v2/cerebro_PBMC_10k_', Sys.Date(), '.crb'),
experiment_name = 'pbmc_10k_v3',
file = paste0('Seurat_v2/cerebro_pbmc_10k_v3_', Sys.Date(), '.crb'),
organism = 'hg',
column_cell_cycle_seurat = 'Phase'
)