Data Processing Pipeline for PRISM Sequencing.
Package can be installed using devtools
by running the following command:
devtools::install_github('https://github.com/cmap/sushi')
Note that the package requires ShortRead
from BioconductR which may not be automatically installed.
Instructions for installing ShortRead
can be found on the BioconductR page
Each PrismSeq project should have its own folder within which to process data and save output files. For the purposes of this tutorial, we will assume you are working with the example_project folder. To begin, your project folder must contains a sample_meta.csv file and a fastq folder with 3 files per sample (index1, index2, read).
A template for the sample_meta.csv file may be found here. Create a copy of the Sequencing Metadata Template [internal] tab, fill it out according to the specifications in the instructions tab, and export the metadata sheet as a csv to your project directory.
Please contact [email protected] for access, if needed.
In contrast to the conventional MTS assay which uses Luminex detection, the Extended Assay (EPS) uses sequencing-based detection.
Raw data for EPS is thus first processed using SUSHI (yellow), the PRISM sequencing data processing pipeline, to obtain normalized counts and log fold-change data.
This data is then passed to the MTS pipeline (orange) for downstream processing such as dose-response fitting and biomarker analysis via the EPS conversion modules (green). Documentation for the MTS pipeline can be found here.
The processed data and visualizations are then available on the PRISM Portal (blue).
In order to run the PrismSeq processing pipeline, you must have copies of the follow three metadata files. For the purposes of this tutorial, we will assume they are saved in the metadata folder.
- cell_line_meta.csv
- cell_set_meta.csv
- CB_meta.csv
For additional information on the outputs of the PrismSeq processing pipeline see the notes and FAQs here.
Please contact [email protected] to request metadata files and for access to the notes and FAQs document, if needed.
The PrismSeq processing pipeline may run in R using the following functions.
- Generate a read count table from fastq files
The function write_df_from_fastq reads in fastq files associated with the project and organizes the reads into a table or dataframe saved as raw_counts.csv.
For assistance in generating fastq files from BCL files please contact [email protected], if needed.
# define invariant strings to identify fastq files
index_1 = "_I1_"
index_2 = "_I2_"
barcode_suffix = "_R1_"
read_directory_contents = c("example_project/fastq/") %>%
purrr::map(list.files, full.names = T) %>%
purrr::reduce(union)
barcode_read_files = read_directory_contents %>%
purrr::keep(stringr::str_detect, fixed(barcode_suffix)) %>%
sort()
index_1_files = read_directory_contents %>%
purrr::keep(stringr::str_detect, fixed(index_1)) %>%
sort()
index_2_files = read_directory_contents %>%
purrr::keep(stringr::str_detect, fixed(index_2)) %>%
sort()
raw_counts = write_df_from_fastq(forward_read_fastq_files = barcode_read_files,
index_1_files = index_1_files,
index_2_files = index_2_files,
write_interval = NA) # define write_interval to intermitently write out raw_counts file
raw_counts %>% write.csv("example_project/raw_counts.csv", row.names=F, quote=F)
- Filter raw read counts
The function filter_raw_reads filters raw_counts.csv for only the reads associated with the project as defined in the submitted sample meta.
# load relevant metadata
sample_meta = read.csv("example_project/sample_meta.csv")
cell_line_meta = read.csv("metadata/cell_line_meta.csv")
cell_set_meta = read.csv("metadata/cell_set_meta.csv")
CB_meta = read.csv("metadata/CB_meta.csv")
filtered_counts = filter_raw_reads(raw_counts,
sample_meta,
cell_line_meta,
cell_set_meta,
CB_meta,
id_cols=c('cell_set', 'treatment', 'dose','dose_unit','day','bio_rep','tech_rep')) # change the id_cols parameter to designate which metadata column uniquely define each profile
QC_table = filtered_counts$qc_table
annotated_counts= filtered_counts$annotated_counts
filtered_counts = filtered_counts$filtered_counts
QC_table %>% write.csv("example_project/QC_table.csv", row.names=F, quote=F)
annotated_counts %>% write.csv("example_project/annotated_counts.csv", row.names=F, quote=F)
filtered_counts %>% write.csv("example_project/filtered_counts.csv", row.names=F, quote=F)
- [Optional] Normalize filtered counts
The function normalize fits a linear model to a set of control barcodes (usually 10) supplied at specific doses. This model is then used to calculate the equivalent dose of cell line barcodes detected in each well. The normalized values are in the units of control barcode doses, and are not interpretable on their own.
This function can only be used if control barcodes are included in the run. If control barcodes were not used, this function can be skipped.
normalized_counts = normalize(filtered_counts,
CB_meta$Name)
normalized_counts %>% write.csv("example_project/normalized_counts.csv", row.names=F, quote=F)
- Generate log-fold change values
The function l2fc computes the log2 fold change between treatments and negative controls. This is done by collecting the negative control and collapsing across all replicates. The negative controls are then joined with the treatments to calculate log2 fold change values for each treatment replicate. If normalize was called, then log2 fold change values should be generated from normalized counts. If normalize was not called, log2 fold change values can be calculated from the read counts in filtered_counts.csv by adjusting the count_col_name parameter.
The example code below assues that the previous step, normalize, was performed.
l2fc = compute_l2fc(normalized_counts,
control_type = "negcon",
sig_cols=c('cell_set', 'treatment', 'dose','dose_unit','day'),
count_col_name="normalized_n") # change based on whether you are running compute_l2fc on normalized_counts or filtered_counts
l2fc %>% write.csv("example_project/l2fc.csv", row.names=F, quote=F)
- Collapse replicates
This function collapses the biological replicates of the treatments to get a single value for each unique cell line and treatment combination.
collapsed_values = collapse_counts(l2fc)
collapsed_values %>% write.csv("example_project/collapsed_values.csv", row.names=F, quote=F)
- [Optional] Generate QC images
This function outputs several files and images that can be used to assess the quality of the run.
QC_images(sample_meta,
annotated_counts,
filtered_counts,
cell_set_meta,
out = "example_project/")
The PrismSeq processing tools may alternatively be run through the command line. For the purposes of this tutorial, we will assume that the PrismSeq processing scripts are saved in the tools/scripts/ folder.
Rscript [toolname] --help
will provide information about arguments for specified tools.
The same considerations as above apply to the following tools.
- Generate a read count table from fastq files
Rscript tools/scripts/fastq2readcounts.R --fastq fastq\ --index_1 _I1_ --index_2 _I2_ --barcode_suffix _R1_
- Filter raw read counts
Rscript tools/scripts/filter_counts.R --raw_counts raw_counts.csv --sample_meta sample_meta.csv --id_cols cell_set,treatment,dose,dose_unit,day,bio_rep,tech_rep --cell_line_meta metadata/cell_line_meta.csv --cell_set_meta metadata/cell_set_meta.csv --CB_meta metadata/CB_meta.csv
- [Optional] Normalize filtered counts
Rscript tools/scripts/CBnormalize.R --filtered_counts filtered_counts.csv --CB_meta metadata/CB_meta.csv
- Generate log-fold change values
Rscript tools/scripts/compute_l2fc.R --normalized_counts normalized_counts.csv --control_type negcon --count_col_name normalized_n
- Collapse replicates
Rscript tools/scripts/collapse_replicates.R --lfc l2fc.csv
- [Optional] Generate QC images
Rscript filteredCounts_QC.R --sample_meta sample_meta.csv --annotated_counts annotated_counts.csv --filtered_counts filtered_counts.csv --cell_set_meta metadata/cell_set_meta.csv
When processing data from the Extended Day assay, the EPS conversion modules are also run. The 2 EPS conversion modules (green) perform separate tasks:
-
Quality metric calculations ( EPS_QC.R )
-
Adding or renaming the column headers of the SUSHI output for downstream compatibility ( seq_to_mts.py )
Files generated by the EPS pipeline that are shared with collaborators
-
EPS_QC_table.csv: Contains a table with day 10 count information and quality control metrics for each cell line indicating if the data passed raw count QC.
-
LEVEL3_NORMALIZED_COUNTS.csv: Contains normalized counts in EPS format, generated by running a conversion module to align headers to our downstream EPS pipeline.
-
LEVEL4_LFC.csv: Contains log-fold change data in EPS format, generated by running a conversion module to align headers to our downstream EPS pipeline.
-
LEVEL5_LFC.csv: Contains collapsed log fold change data in EPS format, generated by running a conversion module to align headers to our downstream EPS pipeline.
Intermediate files generated by the EPS pipeline that are for internal use (not shared)
-
normalized_counts.csv: Contains normalized counts, generated by fitting a linear model to a set of 10 control barcodes. Used to generate LEVEL3_NORMALIZED_COUNTS.csv.
-
l2fc.csv: Contains log-fold change values for each biological replicate you submitted, as compared to your annotated negative control. Used to generate LEVEL4_LFC.csv.
-
collapsed_values.csv: Contains median-collapsed log-fold change values.Used to generate LEVEL5_LFC.csv.
-
compound_key.csv: Contains an overview of the perturbations, projects, and doses. Used by EPS conversion modules.
-
inst_info.csv: Contains a comprehensive summary of compound, dose, cell line, and experimental information. Used by EPS conversion modules.
Column header conversions are outlined below for counts and LFC data, QC, and compound key files. Column headers which are present in only select levels of data are specified in the Description column (i.e. LEVEL4_LFC, LEVEL5_LFC, etc). Headers which are renamed during the SUSHI to EPS Data Conversion show the original name in SUSHI Column and the rename in EPS Column. Headers which are generated during the SUSHI to EPS Data Conversion or downstream in the pipeline have no entries for SUSHI Column.
Column names in LEVEL3_NORMALIZED_COUNTS, LEVEL4_LFC, and LEVEL5_LFC:
SUSHI Column | EPS Column | Description |
---|---|---|
bio_rep | replicate | Plate replicate |
cb_intercept | cb_intercept | Intercept of the control barcode normalization |
CCLE_name | ccle_name | Cell line name |
cell_set | cell_set | Cell set information |
control_barcode | control_barcode | Presence of control barcode in a given well |
control_MAD_QC | control_MAD_QC | TRUE if the variance across the biological replicates in the negative control is below threshold of ~1.66 (LEVEL4_LFC) |
control_mad_sqrtN | control_mad_sqrtN | Median Absolute Deviation over the square root of the number of negative controls (LEVEL4_LFC) |
control_median_n | control_median_n | Median collapsed n across the biological replicates of the negative controls (LEVEL4_LFC) |
control_median_normalized_n | control_median_normalized_n | Median normalized n across the biological replicates of the negative control (LEVEL4_LFC) |
counts_flag | counts_flag | This flags entries where the collapsed raw count in the negative control is below a threshold as “negcon<__”. (LEVEL4_LFC) |
DepMap_ID | depmap_id | DepMap cell line ID |
feature_id | Culture concatenated with cell line name | |
flag | flag | This column is filled with “Missing” if the number of reads n is zero. The column is filled with “low counts” if the number of reads n is below the count threshold. This threshold defaults to 40. |
log2_dose | log2_dose | In reference to the control barcode |
log2_n | log2_n | log2(n+pseudocount). A pseudocount (default of 20) is added so that missing cell lines can be carried through the normalization. |
log2_normalized_n | log2_normalized_n | Log normalized count including pseudocount |
mean_n | mean_n | Mean of n across technical reps (LEVEL4_LFC) |
mean_normalized_n | mean_normalized_n | Mean of normalized_n across technical reps (LEVEL4_LFC) |
n | n | Number of Reads |
Name | Name | Control barcode name |
normalized_n | normalized_n | Normalized read count |
num_bio_reps | num_bio_reps | Number of biological replicates of treatment that were collapsed. (LEVEL5_LFC) |
num_ctrl_bio_reps | num_ctrl_bio_reps | Number of biological replicates that were collapsed in the negative controls. (LEVEL4_LFC) |
num_tech_reps | num_tech_reps | Number of technical replicates that were collapsed. (LEVEL4_LFC) |
dose | pert_dose | Perturbation dose (numeric) |
dose_unit | pert_dose_unit | Perturbation dose units |
treatment | pert_id | Uppercase of treatment values. Perturbation Broad ID. |
pert_idose | Perturbation dose with units | |
treatment | pert_iname | Perturbation name (e.g. AZ-628) |
day | pert_itime | Assay length with units |
pert_plate | Sample compound plate | |
day | pert_time | Assay length |
pert_time_unit | Assay length units | |
trt_type | pert_type | Perturbation type (e.g. negative control) ('poscon': 'trt_poscon', 'negcon': 'ctl_vehicle') |
Perturbation vehicle (e.g. DMSO) | ||
pcr_plate | pcr_plate | PCR plate information |
pcr_well | pcr_well | PCR well information |
pcr_well | pert_well | Sample well |
pool_id | Assay pool | |
prism_cell_set | culture | Cell line culture (e.g. PR500) |
profile_id | profile_id | Concatenation of cell set, replicate, compound, and dose information. |
treatment | prc_id | Uppercase of treatment values. Perturbation Broad ID. |
x_project_id | x_project_id | Project name (e.g. Validation Compounds) |
screen | screen | Screen type and iteration |
tech_rep | tech_rep | Technical replicate |
trt_MAD_QC | trt_MAD_QC | TRUE if the variance across the biological replicates in the negative control is below a threshold of ~1.66. (LEVEL5_LFC) |
trt_mad_sqrtN | trt_mad_sqrtN | Median Absolute Ddeviation over the square root of the number of treatment biological replicates. (LEVEL5_LFC) |
trt_median_n | trt_median_n | Median collapsed n across the biological replicates of the treatments. (LEVEL5_LFC) |
trt_median_normalized_n | trt_median_normalized_n | Median normalized n across the biological replicates of the treatments. (LEVEL5_LFC) |
feature_id | Unique feature ID | |
l2fc | LFC | log2 fold-change relative to control vehicle (LEVEL4_LFC) |
median_l2fc | LFC | median log2 fold-change relative to control vehicle (LEVEL5_LFC) |
Column names in EPS QC file:
Column | Description |
---|---|
CCLE_name | Cancer cell line encyclopedia cell line name |
DepMap_ID | DepMap cell line ID |
cell_set | Cell set information |
day | Day at end of experiment (0, 6, 10, etc) |
med_log_raw_counts | Median log raw counts |
med_log_norm_counts | Median log normalized counts |
med_raw_counts | Median raw counts |
pass_raw_count_qc | True/False value indicating whether raw counts passed QC |
pert_plate | Pert plate information |
Column names in compound key file:
Column | Description |
---|---|
pert_iname | Compound |
pert_id | Compound ID |
pert_plate | Pert plate |
x_project_id | Project code |
pert_dose_1 | Number of unique doses per compound per plate |
This document was last update on Feb 25, 2024