This repository contains metadata necessary for alignment and analysis of the human NSCLC CITEseq dataset presented in Leader, A.M., Grout, J., et al. Cancer Cell, in press (2021), also cited in Maier, B., Leader, A.M., Chen, S.T. et al. A conserved dendritic-cell regulatory program limits antitumour immunity. Nature 580, 257–262 (2020). https://doi.org/10.1038/s41586-020-2134-y
All human sequencing data is available on NCBI with BioProject ID PRJNA609924 and GEO accession GSE154826.
Alignment was performed with Cellranger v3.1.0 using feature barcoding. Feature barcode tables for the alignment are in the Leader, et al. supplemental tables.
Table S1 contains sample metadata for each sample included in the study.
Please contact [email protected] with any questions.
A .csv file with cell-ID to cluster and cell-ID to sample associations can be downloaded from /input_tables/cell_metadata.csv. Cluster annotations are available in /input_tables/annots_list.csv (column "sub_lineage"). Sample-level metadata is available in the published Table S1 and provided in /input_tables/table_s1_sample_table.csv.
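The two tables can be combined so that each cell carries its cluster annotation. The following is a minimal sketch, not part of the repository: the helper `annotate_cells` and the default column name `"cluster"` are hypothetical, and only the `sub_lineage` column is named in this README. Inspect the real column names before joining.

```r
# Attach the "sub_lineage" annotation to each cell by matching cluster IDs.
# cell_cluster_col and annot_cluster_col are hypothetical column names;
# check names(cell_metadata) and names(annots) and adjust them as needed.
annotate_cells <- function(cell_metadata, annots,
                           cell_cluster_col = "cluster",
                           annot_cluster_col = "cluster") {
  stopifnot(cell_cluster_col %in% names(cell_metadata),
            annot_cluster_col %in% names(annots),
            "sub_lineage" %in% names(annots))
  # Compare IDs as character so numeric and string cluster IDs still match
  idx <- match(as.character(cell_metadata[[cell_cluster_col]]),
               as.character(annots[[annot_cluster_col]]))
  cell_metadata$sub_lineage <- annots$sub_lineage[idx]
  cell_metadata
}

# Paths are relative to the repository root
cell_file <- "input_tables/cell_metadata.csv"
annot_file <- "input_tables/annots_list.csv"
if (file.exists(cell_file) && file.exists(annot_file)) {
  cell_metadata <- read.csv(cell_file, stringsAsFactors = FALSE)
  annots <- read.csv(annot_file, stringsAsFactors = FALSE)
  print(names(cell_metadata))  # inspect the real column names first
  print(names(annots))
}
```

Once the actual column names are known, `annotate_cells(cell_metadata, annots, cell_cluster_col = ..., annot_cluster_col = ...)` returns the cell table with an added `sub_lineage` column.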
The full dataset can be downloaded automatically by running the script that reproduces the figures (see below). Alternatively, the .rd file can be downloaded directly using the following Dropbox link:
human NSCLC scRNA & CITEseq data: https://www.dropbox.com/s/vjbide8ro5iwrfh/lung_ldm.rd?dl=1
This link will download an R data structure, the components of which contain the count matrices and cell metadata.
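For a manual download, a sketch along these lines may help. It assumes the .rd file was written with R's `save()` and can therefore be read with `load()`; the helper names `fetch_rd` and `load_rd` and the destination path `data/lung_ldm.rd` are illustrative choices, not part of the repository.

```r
# Download the .rd file once; mode = "wb" keeps the binary file intact on Windows.
fetch_rd <- function(url, dest) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

# Load an .rd file into a private environment and return its objects as a
# named list, so nothing in the global workspace is overwritten.
# Assumption: the file was created with save().
load_rd <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)
  mget(object_names, envir = env)
}

# Usage (not run here because the file is large):
# rd_path <- fetch_rd("https://www.dropbox.com/s/vjbide8ro5iwrfh/lung_ldm.rd?dl=1",
#                     "data/lung_ldm.rd")
# objects <- load_rd(rd_path)
# names(objects)  # check which objects the file actually contains
```

Loading into a separate environment makes it easy to confirm the object names before assigning anything into the workspace.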
The data structure is called "lung_ldm" and has the following components:
- model -> containing the following elements:
- models: a matrix with average cluster expression values
- params: a list of parameters used in the initial clustering
- dataset -> The entire Mount Sinai 10x chromium 3' dataset presented in the paper including CITEseq data, with the following elements:
- umitab: raw scRNA count data of filtered cells
- adt_by_sample: list of raw CITEseq adt count data by sample
- hto_by_sample: list of raw CITEseq hto count data by sample
- ds: matrix of cells downsampled to 2000 UMI each
- cell_to_sample: array of cell to sample associations
- ll: log-likelihood scores for each cell mapping to each cluster
- ds_numis: the number of UMIs to which ds is downsampled
- gated_out_umitabs: raw count data for barcodes filtered during the QC filtering step
- counts: 3-dimensional array (samples x genes x clusters) of total UMIs observed
- samples: array of samples included in the dataset
- numis_before_filtering: list of arrays of number of UMIs observed per barcode in each sample prior to the filtering step
- max_umis: upper threshold of total UMIs per barcode used for QC filtering
- noise_counts: 3-dimensional array (samples x genes x clusters) of the estimated number of UMIs attributed to noise
- noise_models: total average signal per sample, used as the noise component in the modified multinomial model for probabilistic classification of cells to clusters
- min_umis: lower threshold of total UMIs per barcode used for QC filtering
- avg_numis_per_sample_model: matrix of samples x clusters with values representing the average number of UMIs per cluster in each sample
- cell_to_cluster: array with cell to cluster associations
- alpha_noise: estimated noise fraction in each sample
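The components listed above can be queried with base R. The sketch below makes assumptions that should be checked against the loaded object: that `cell_to_cluster` and `cell_to_sample` are vectors named by cell ID, and that `umitab` is a genes x cells matrix whose column names are cell IDs. The helper names are hypothetical.

```r
# Count cells per sample and cluster.
# Assumption: cell_to_cluster and cell_to_sample are named by cell ID.
cells_per_sample_cluster <- function(ldm) {
  ds <- ldm$dataset
  cells <- intersect(names(ds$cell_to_cluster), names(ds$cell_to_sample))
  table(sample = ds$cell_to_sample[cells], cluster = ds$cell_to_cluster[cells])
}

# Extract the raw UMI counts of all cells assigned to one cluster.
# Assumption: umitab is genes x cells with cell IDs as column names.
umis_for_cluster <- function(ldm, cluster) {
  ds <- ldm$dataset
  cells <- names(ds$cell_to_cluster)[ds$cell_to_cluster == cluster]
  ds$umitab[, intersect(cells, colnames(ds$umitab)), drop = FALSE]
}

# Usage, once lung_ldm has been loaded:
# head(cells_per_sample_cluster(lung_ldm))
# dim(umis_for_cluster(lung_ldm, "1"))  # "1" is a placeholder cluster ID
```

If `umitab` is stored as a sparse matrix, the Matrix package needs to be attached before subsetting.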
Tested on Windows 10
- R
- R packages:
- gplots
- MASS
- Matrix
- matrixStats
- Matrix.utils
- mixtools
- plotrix
- data.table
- tidyr
- CePa
- seriation
- sp
- scales
- skmeans
- RColorBrewer
- R.devices
- TCGAbiolinks
- SummarizedExperiment
- GenomicDataCommons
- viridis
- scDissector
- Downloaded and unzipped version of this repository on a local path.
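Before running the figure script, it can be useful to check which of the listed packages are missing. This is a small sketch using only base R; `missing_packages` is a hypothetical helper, and how each package is installed is left to the reader (TCGAbiolinks, SummarizedExperiment and GenomicDataCommons are Bioconductor packages, and scDissector is assumed here to be installed separately from its own repository).

```r
# Packages listed in the requirements above
required <- c("gplots", "MASS", "Matrix", "matrixStats", "Matrix.utils",
              "mixtools", "plotrix", "data.table", "tidyr", "CePa",
              "seriation", "sp", "scales", "skmeans", "RColorBrewer",
              "R.devices", "TCGAbiolinks", "SummarizedExperiment",
              "GenomicDataCommons", "viridis", "scDissector")

# Return the subset of pkgs that cannot be loaded in this R installation
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

print(missing_packages(required))
```

An empty result means every listed package can be loaded.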
Assuming Leader_et_al is the local path of the repository, load the script files:
source("scripts/figures_main.R")
The file at the Dropbox link referenced above will be downloaded automatically to a new /data/ directory, along with the additional data files necessary to reproduce the plots.
Figures will be generated in a new directory:
- output/figures/
- Figure S1A: This panel is generated during clustering, in the call to cluster() in the run_clustering.R script
- Figures S1B, S1C: Functions to generate these plots are in the figure_s1bc.R script, but specific reproduction of these panels in the figures_main.R script has not yet been implemented.
- Figures 7E, F and S7C, D require downloading additional TCGA expression and mutation data. The script figure_7ef_s7cd.R performs all downloading and analysis but is not called from figures_main.R because the downloading step is time- and memory-intensive and sometimes quits unexpectedly.
- Figures 7G-J and S7E-H analyze data from the Genentech POPLAR trial, which was not publicly available at the time of our publication.
Tested on a Linux LSF HPC cluster. Because some of the dependencies are not supported there, the script cannot run on macOS or Windows.
- R
- R packages:
- Matrix
- Matrix.utils
- gplots
- seriation
- tglkmeans
- scDissector
- Downloaded and unzipped version of this repository on a local path.
Running the scripts in R
Assuming Leader_et_al/ is the local path of the repository, the following script will run the clustering in a distributed fashion on LSF:
source("scripts/clustering/run_clustering.r")
Note: Each run of the clustering might produce slightly different results due to different random seeds.