Update readme files for example data sets; update figure about data formats; update gitignore; update main readme.
romanhaa committed Oct 3, 2019
1 parent b994ad1 commit 5cc493f
Showing 12 changed files with 120 additions and 83 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
.DS_Store
Docker/log.stdout
source/node_modules
source/Cerebro
source/Rplots.pdf
4 changes: 3 additions & 1 deletion README.md
@@ -210,7 +210,9 @@ docker run -p <port_of_choice>:<port_of_choice> -v <export_folder>:/plots romanh

We provide documentation and commands for the following example data sets:

* [`pbmc_10k_v3`](examples/pbmc_10k_v3/)
* [`pbmc_10k_v3`](examples/pbmc_10k_v3): single sample of human peripheral blood mononuclear cells
* [`GSE108041`](examples/GSE108041): 4 samples of A549 cells before and after infection with influenza virus
* [`GSE129845`](examples/GSE129845): 3 samples of human bladder cells from 3 patients

## Conversion of other single cell data formats

3 changes: 1 addition & 2 deletions examples/GSE108041/README.md
@@ -1,6 +1,6 @@
# `GSE108041` data set

This data set comes from the publication "Extreme heterogeneity of influenza virus infection in single cells" by Russell *et al.*, eLIFE (2018) ([DOI](https://doi.org/10.7554/eLife.32303), [GEO submission](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108041)).
This data set is taken from the publication "Extreme heterogeneity of influenza virus infection in single cells" by Russell *et al.*, eLife (2018) ([DOI](https://doi.org/10.7554/eLife.32303), [GEO submission](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108041)).
It contains ~13,000 cells from 4 samples, taken before infection and 6, 8, and 10 hours after infection with influenza virus.

To test Cerebro, download the `.crb` file from either [Seurat v3](Seurat_v3) or [scanpy](scanpy) and load it into Cerebro.
@@ -35,5 +35,4 @@ Lastly, from the Seurat object we export a Cerebro file (`.crb` extension) that
## How to reproduce

The example data sets were generated using the official Cerebro Docker image ([Docker Hub](https://cloud.docker.com/u/romanhaa/repository/docker/romanhaa/cerebro)), which was built with Docker and imported into [Singularity](https://singularity.lbl.gov/) (here I used Singularity 2.6.0).
The workflows for Seurat v2 and Seurat v3 are conceptually identical with some differences due to changes in the Seurat package.
Details and descriptions for all workflows can be found in the respective directories: [Seurat v3](Seurat_v3) and [scanpy](scanpy).
17 changes: 12 additions & 5 deletions examples/GSE108041/Seurat_v3/README.md
@@ -1,11 +1,11 @@
# Seurat v3 workflow for `GSE108041` data set

Here, we analyze a [`GSE108041`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108041) data set which was published by [Russell *et al.* in 2018 (eLIFE)](https://doi.org/10.7554/eLife.32303) using [Seurat](https://satijalab.org/seurat/) framework, following the basic [Seurat](https://satijalab.org/seurat/) workflow.
Here, we analyze the `GSE108041` data set ("Extreme heterogeneity of influenza virus infection in single cells", Russell *et al.*, eLife (2018), [DOI](https://doi.org/10.7554/eLife.32303), [GEO submission](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108041)) using the [Seurat](https://satijalab.org/seurat/) framework, following the basic Seurat workflow.

## Preparation

Before starting, we clone the Cerebro repository (or manually download it) because it contains the raw data of our example data set.
One (optional) step of our analysis will require us to provide some gene sets in a GMT file.
One (optional) step of our analysis will require us to provide some gene sets in a `GMT` file.
We manually download the `c2.all.v7.0.symbols.gmt` file from [MSigDB](http://software.broadinstitute.org/gsea/downloads.jsp#msigdb) and put it in our current working directory.
Then, we pull the Docker image from the Docker Hub, convert it to Singularity, and start an R session inside.

@@ -36,7 +36,7 @@ library('cerebroApp')

## Load transcript counts

We load the sparse transcript count matrices downloaded from the 10x Genomics website, add the respective sample info to the cell barcode, and merge them into a big matrix.
For each of the four samples we load the transcript count matrix (`.h5` format), add a tag for the sample of origin to the cellular barcodes, and then merge the transcript counts together.
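
A minimal sketch of this tag-and-merge pattern is shown below; the helper function, sample tags, and all file names other than the uninfected sample are illustrative assumptions rather than the code used in this workflow.

```r
library(Seurat)

# Hypothetical helper: read one .h5 count matrix and append a sample tag to every
# cell barcode so that barcodes remain unique after merging the four samples.
load_sample <- function(path, tag) {
  counts <- Read10X_h5(path)
  colnames(counts) <- paste0(colnames(counts), '-', tag)
  counts
}

sample_uninfected <- load_sample('raw_data/GSM2888370_Uninfected.h5', 'uninfected')
# ...load the 6h, 8h and 10h samples the same way (placeholder paths), then merge by column:
# transcripts <- cbind(sample_uninfected, sample_6h, sample_8h, sample_10h)
```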

```r
sample_uninfected <- Read10X_h5('raw_data/GSM2888370_Uninfected.h5') %>%
@@ -80,8 +80,15 @@ feature_matrix <- dplyr::select(feature_matrix, -gene)

## Pre-processing with Seurat

With the merged transcript count matrix ready, we create a Seurat object, add sample info to meta data, and remove cells with less than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including normalization, identifying highly variably genes, scaling expression values and regressing out the number of transcripts per cell, perform principal component analysis (PCA), find neighbors and clusters.
With the merged transcript count matrix ready, we create a Seurat object, add sample info to the meta data, and remove cells with fewer than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including...

* normalization,
* identifying highly variable genes,
* scaling expression values and regressing out the number of transcripts per cell,
* performing principal component analysis (PCA), and
* finding neighbors and clusters.

Furthermore, we build a cluster tree that represents the similarity between clusters and create a dedicated `cluster` column in the meta data.
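
A minimal sketch of these steps with standard Seurat v3 calls, applied to the Seurat object created from the merged matrix, is shown below; the parameter values (number of dimensions, clustering resolution) are illustrative placeholders rather than the settings used for this data set.

```r
# Sketch of the standard Seurat v3 workflow; parameter values are placeholders.
seurat <- NormalizeData(seurat)
seurat <- FindVariableFeatures(seurat)
seurat <- ScaleData(seurat, vars.to.regress = 'nCount_RNA')  # regress out transcripts per cell
seurat <- RunPCA(seurat)
seurat <- FindNeighbors(seurat, dims = 1:30)
seurat <- FindClusters(seurat, resolution = 0.5)

# cluster similarity tree and a dedicated `cluster` column in the meta data
seurat <- BuildClusterTree(seurat, dims = 1:30, reorder = TRUE, reorder.numeric = TRUE)
seurat@meta.data$cluster <- Idents(seurat)
```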

```r
18 changes: 9 additions & 9 deletions examples/GSE108041/scanpy/README.md
@@ -1,12 +1,12 @@
# scanpy workflow for `GSE108041` data set

Here, we analyze the `GSE108041` data set using [scanpy](https://scanpy.readthedocs.io), following the [basics workflow](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) described on their website which includes similar steps as those performed in Seurat.
Then, import the [AnnData](https://anndata.readthedocs.io/en/stable) object produced by scanpy, import it into Seurat, and from there export it to Cerebro.
Here, we analyze the `GSE108041` data set ("Extreme heterogeneity of influenza virus infection in single cells", Russell *et al.*, eLife (2018), [DOI](https://doi.org/10.7554/eLife.32303), [GEO submission](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108041)) using [scanpy](https://scanpy.readthedocs.io), following the [basic workflow described on the scanpy website](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html), which includes steps similar to those performed in Seurat.
Then, we import the [AnnData](https://anndata.readthedocs.io/en/stable) object produced by scanpy into Seurat and from there export it to Cerebro.

## Preparation

Before starting, we clone the Cerebro repository (or manually download it) because it contains the raw data of our example data set.
One (optional) step of our analysis will require us to provide some gene sets in a GMT file.
One (optional) step of our analysis will require us to provide some gene sets in a `GMT` file.
We manually download the `c2.all.v7.0.symbols.gmt` file from [MSigDB](http://software.broadinstitute.org/gsea/downloads.jsp#msigdb) and put it in our current working directory.
Then, we pull the Docker image from the Docker Hub, convert it to Singularity, and start an R session inside.

@@ -32,7 +32,7 @@ import scanpy as sc

## Load data

For each of the three samples we load the transcript count matrix and then merge them together.
For each of the four samples we load the transcript count matrix (`.h5` format), make feature names unique (some gene IDs share the same gene name), and then merge the transcript counts together.

```python
adata_uninfected = sc.read_10x_h5('raw_data/GSM2888370_Uninfected.h5')
@@ -71,7 +71,7 @@ adata.obs['sample'].cat.reorder_categories(

Now, we...

* remove cells with less than `100` transcripts or fewer than `50` expressed genes,
* remove cells with fewer than `100` transcripts or fewer than `50` expressed genes,
* calculate the number of transcripts per cell, and
* remove genes expressed in fewer than `10` cells.

@@ -91,7 +91,7 @@ np.savetxt('scanpy/raw_counts_genes.tsv', adata.var.index, fmt = '%s', delimiter
np.savetxt('scanpy/raw_counts_cells.tsv', adata.obs.index, fmt = '%s', delimiter = '\t')
```

What follows is the standard pre-processing procedure of...
What follows is the standard pre-processing procedure, including the following steps...

* normalizing transcript counts per cell,
* bringing transcript counts to log-scale,
@@ -151,7 +151,7 @@ sc.logging.print_versions()
Next,...

* we hop into R,
* set up some parameters,
* set some parameters,
* load packages, and
* import the `.h5ad` file we just wrote to disk using the `ReadH5AD()` function from the Seurat package.
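
A minimal sketch of this import step follows; the `.h5ad` file name is a placeholder rather than the one used in this workflow.

```r
library(Seurat)

# Placeholder file name; ReadH5AD() converts the AnnData object written by scanpy
# into a Seurat object that we keep working with below.
seurat <- ReadH5AD('scanpy/adata.h5ad')
```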

@@ -181,7 +181,7 @@ levels(seurat@meta.data$phase) <- c('G1','G2M','S')

## Optional (but recommended) steps

We could already export this object and visualize the contained in Cerebro.
We could already export this object and visualize the contained data in Cerebro.
However, data exploration in Cerebro would greatly benefit from additional data generated by the functions of cerebroApp.
What follows is a set of (mostly) optional steps.

@@ -234,7 +234,7 @@ seurat@meta.data$tree.ident <- NULL

### Add 3D projections

Let's also add 3D dimensional reductions for tSNE and UMAP.
We also add 3D dimensional reductions made with tSNE and UMAP.
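
A minimal sketch of what these calls could look like with Seurat v3; the reduction names, keys, and number of input dimensions are illustrative and not necessarily the settings used here.

```r
# Three-dimensional tSNE and UMAP stored under separate reduction names (placeholders).
seurat <- RunTSNE(seurat, reduction.name = 'tSNE_3D', reduction.key = 'tSNE3D_',
                  dims = 1:30, dim.embed = 3)
seurat <- RunUMAP(seurat, reduction.name = 'UMAP_3D', reduction.key = 'UMAP3D_',
                  dims = 1:30, n.components = 3)
```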

```r
seurat <- RunTSNE(
1 change: 0 additions & 1 deletion examples/GSE129845/README.md
@@ -35,5 +35,4 @@ Lastly, from the Seurat object we export a Cerebro file (`.crb` extension) that
## How to reproduce

The example data sets were generated using the official Cerebro Docker image ([Docker Hub](https://cloud.docker.com/u/romanhaa/repository/docker/romanhaa/cerebro)), which was built with Docker and imported into [Singularity](https://singularity.lbl.gov/) (here I used Singularity 2.6.0).
The workflows for Seurat v2 and Seurat v3 are conceptually identical with some differences due to changes in the Seurat package.
Details and descriptions for the workflow can be found in the respective directory [Seurat v3](Seurat_v3).
25 changes: 20 additions & 5 deletions examples/GSE129845/Seurat_v3/README.md
@@ -1,11 +1,11 @@
# Seurat v3 workflow for `GSE129845` data set

Here, we analyze a [`GSE129845`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129845) data set which was published by [Russell *et al.* in 2018 (eLIFE)](https://doi.org/10.7554/eLife.32303) using [Seurat](https://satijalab.org/seurat/) framework, following the basic [Seurat](https://satijalab.org/seurat/) workflow.
Here, we analyze the `GSE129845` data set ("Single-Cell Transcriptomic Map of the Human and Mouse Bladders", Yu *et al.*, J Am Soc Nephrol (2019), [DOI](https://doi.org/10.1681/ASN.2019040335), [GEO submission](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129845)) using the [Seurat](https://satijalab.org/seurat/) framework, following the basic Seurat workflow.

## Preparation

Before starting, we clone the Cerebro repository (or manually download it) because it contains the raw data of our example data set.
One (optional) step of our analysis will require us to provide some gene sets in a GMT file.
One (optional) step of our analysis will require us to provide some gene sets in a `GMT` file.
We manually download the `c2.all.v7.0.symbols.gmt` file from [MSigDB](http://software.broadinstitute.org/gsea/downloads.jsp#msigdb) and put it in our current working directory.
Then, we pull the Docker image from the Docker Hub, convert it to Singularity, and start an R session inside.

@@ -36,10 +36,12 @@ library('cerebroApp')

## Load transcript counts

We load the sparse transcript count matrices downloaded from the 10x Genomics website, add the respective sample info to the cell barcode, and merge them into a big matrix.
For each of the three patient samples we load the transcript count matrix (`.mtx` format), add a tag for the sample of origin to the cellular barcodes, merge transcripts from genes with the same name, and then merge the transcript counts from the different patients together.
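
A minimal sketch of the loading pattern that is repeated for each patient below; the file names inside the GEO directories and the barcode-tagging scheme are assumptions, not necessarily what this workflow uses.

```r
library(Matrix)

# Hypothetical reader for one patient directory; the actual file names in the
# GEO archive may differ (e.g. gzipped matrix/barcodes/genes files).
read_patient <- function(path, tag) {
  counts <- readMM(file.path(path, 'matrix.mtx'))
  barcodes <- readLines(file.path(path, 'barcodes.tsv'))
  genes <- read.table(file.path(path, 'genes.tsv'), sep = '\t', stringsAsFactors = FALSE)[[2]]
  rownames(counts) <- genes
  colnames(counts) <- paste0(barcodes, '-', tag)
  counts
}

counts_patient_1 <- read_patient('./raw_data/GSM3723357', 'patient_1')
```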

### Patient 1

Load transcript counts from patient 1.

```r
path_to_data <- "./raw_data/GSM3723357"

@@ -69,6 +71,8 @@ feature_matrix_patient_1 <- feature_matrix

### Patient 2

Load transcript counts from patient 2.

```r
path_to_data <- "./raw_data/GSM3723358"

@@ -98,6 +102,8 @@ feature_matrix_patient_2 <- feature_matrix

### Patient 3

Load transcript counts from patient 3.

```r
path_to_data <- "./raw_data/GSM3723359"

@@ -127,6 +133,8 @@ feature_matrix_patient_3 <- feature_matrix

### Merge patient samples

Merge transcript counts from all three patients.

```r
feature_matrix <- dplyr::full_join(feature_matrix_patient_1, feature_matrix_patient_2, by = 'gene') %>%
dplyr::full_join(feature_matrix_patient_3, by = 'gene')
@@ -136,8 +144,15 @@ feature_matrix <- dplyr::select(feature_matrix, -gene)

## Pre-processing with Seurat

With the merged transcript count matrix ready, we create a Seurat object, add sample info to meta data, and remove cells with less than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including normalization, identifying highly variably genes, scaling expression values and regressing out the number of transcripts per cell, perform principal component analysis (PCA), find neighbors and clusters.
With the merged transcript count matrix ready, we create a Seurat object, add sample info to the meta data, and remove cells with fewer than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including...

* normalization,
* identifying highly variable genes,
* scaling expression values and regressing out the number of transcripts per cell,
* performing principal component analysis (PCA), and
* finding neighbors and clusters.

Furthermore, we build a cluster tree that represents the similarity between clusters and create a dedicated `cluster` column in the meta data.

```r
Expand Down
2 changes: 1 addition & 1 deletion examples/README.md
@@ -2,6 +2,6 @@

Examples of the Cerebro workflow are available for the following public data sets:

* [`pbmc_10k_v3`](pbmc_10k_v3): single sample of peripheral blood mononuclear cells
* [`pbmc_10k_v3`](pbmc_10k_v3): single sample of human peripheral blood mononuclear cells
* [`GSE108041`](GSE108041): 4 samples of A549 cells before and after infection with influenza virus
* [`GSE129845`](GSE129845): 3 samples of human bladder cells from 3 patients
59 changes: 39 additions & 20 deletions examples/pbmc_10k_v3/Seurat_v2/README.md
@@ -2,8 +2,10 @@

Here, we analyze the `pbmc_10k_v3` data set using the [Seurat](https://satijalab.org/seurat/) framework, following the basic Seurat workflow.

## Preparation

Before starting, we clone the Cerebro repository (or manually download it) because it contains the raw data of our example data set.
One (optional) step of our analysis will require us to provide some gene sets in a GMT file.
One (optional) step of our analysis will require us to provide some gene sets in a `GMT` file.
We manually download the `c2.all.v7.0.symbols.gmt` file from [MSigDB](http://software.broadinstitute.org/gsea/downloads.jsp#msigdb) and put it in our current working directory.
Then, we pull the Docker image from the Docker Hub, convert it to Singularity, and start an R session inside.

@@ -34,7 +36,7 @@ library('cerebroApp')

## Load transcript counts

Unfortunately, the `Read10X_h5()` function of Seurat v2 has problems with the `.h5` file downloaded from the 10x Genomics website so instead we load it manually and convert it to a sparse matrix.
Unfortunately, the `Read10X_h5()` function of Seurat v2 has problems with the `.h5` file downloaded from the 10x Genomics website, so instead we load it manually, convert it to a sparse matrix, and merge transcripts from genes with the same name.

```r
h5_data <- hdf5r::H5File$new('raw_data/filtered_feature_bc_matrix.h5', mode = 'r')
@@ -50,17 +52,44 @@ feature_matrix <- Matrix::sparseMatrix(
dims = h5_data[['matrix/shape']][],
index1 = FALSE
)

genes <- rownames(feature_matrix)

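# collapse rows that share a gene name by summing their counts (via a dense data frame)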
feature_matrix <- feature_matrix %>%
as.matrix() %>%
as.data.frame() %>%
dplyr::mutate(gene = genes) %>%
dplyr::select(gene, dplyr::everything()) %>%
dplyr::group_by(gene) %>%
dplyr::summarise_all(sum) %>%
dplyr::ungroup()

genes <- feature_matrix$gene

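# convert back to a sparse matrix and restore the gene names as row names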
feature_matrix <- feature_matrix %>%
dplyr::select(-gene) %>%
as.matrix() %>%
as('sparseMatrix')

rownames(feature_matrix) <- genes
```

## Pre-processing with Seurat

With the transcript counts loaded, we create a Seurat object and remove cells with fewer than `100` transcripts or fewer than `50` expressed genes.
Then, we follow the standard Seurat workflow, including normalization, identifying highly variably genes, scaling expression values and regressing out the number of transcripts per cell, perform principal component analysis (PCA), find neighbors and clusters.
Then, we follow the standard Seurat workflow, including...

* normalization,
* identifying highly variable genes,
* scaling expression values and regressing out the number of transcripts per cell,
* performing principal component analysis (PCA), and
* finding neighbors and clusters.

Furthermore, we build a cluster tree that represents the similarity between clusters and create a dedicated `cluster` column in the meta data.
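
A minimal sketch of these steps with Seurat v2 function names, applied to the Seurat object created in the next block, is shown below; the parameter values are placeholders and not necessarily the settings used here.

```r
# Sketch of the standard Seurat v2 workflow; parameter values are placeholders.
seurat <- NormalizeData(seurat, normalization.method = 'LogNormalize', scale.factor = 10000)
seurat <- FindVariableGenes(seurat)
seurat <- ScaleData(seurat, vars.to.regress = 'nUMI')  # regress out transcripts per cell
seurat <- RunPCA(seurat, pc.genes = seurat@var.genes)
seurat <- FindClusters(seurat, reduction.type = 'pca', dims.use = 1:30, resolution = 0.5)

# cluster similarity tree and a dedicated `cluster` column in the meta data
seurat <- BuildClusterTree(seurat, do.reorder = TRUE, reorder.numeric = TRUE)
seurat@meta.data$cluster <- seurat@ident
```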

```r
seurat <- CreateSeuratObject(
project = 'PBMC_10k_v3',
project = 'pbmc_10k_v3',
raw.data = feature_matrix,
min.cells = 10
)
@@ -162,24 +191,14 @@ seurat <- RunUMAP(

## Meta data

This example data set consists of a single sample.
To highlight the functionality of Cerebro when working with a multi-sample data set, we the cells of clusters 1-5 to `sample_A`, those in clusters 6-10 to `sample_B`, and those of clusters 11-16 to `sample_C`.
This example data set consists of a single sample, so we just add that name to the meta data.
Moreover, so that we can later trace how the analysis was done, we add some meta data to the `misc` slot of the Seurat object.

```r
meta_sample <- seurat@meta.data$cluster %>% as.character()
meta_sample[which(meta_sample %in% c('1','2','3','4','5'))] <- 'sample_A'
meta_sample[which(meta_sample %in% c('6','7','8','9','10'))] <- 'sample_B'
meta_sample[which(meta_sample %in% c('11','12','13','14','15','16'))] <- 'sample_C'
seurat@meta.data$sample <- factor(meta_sample, levels = c('sample_A','sample_B','sample_C'))
```
seurat@meta.data$sample <- factor('pbmc_10k_v3', levels = 'pbmc_10k_v3')

## Preparation

In order to later be able to understand how we did the analysis, we add some meta data to the `misc` slot of the Seurat object.

```r
seurat@misc$experiment <- list(
experiment_name = 'PBMC_10k',
experiment_name = 'pbmc_10k_v3',
organism = 'hg',
date_of_analysis = Sys.Date()
)
@@ -313,8 +332,8 @@ Finally, we use the `exportFromSeurat()` function of cerebroApp to export our Se
```r
cerebroApp::exportFromSeurat(
seurat,
experiment_name = 'PBMC_10k',
file = paste0('Seurat_v2/cerebro_PBMC_10k_', Sys.Date(), '.crb'),
experiment_name = 'pbmc_10k_v3',
file = paste0('Seurat_v2/cerebro_pbmc_10k_v3_', Sys.Date(), '.crb'),
organism = 'hg',
column_cell_cycle_seurat = 'Phase'
)