The blueprints for development, response to the environment, and cellular function are largely the manifestation of distinct gene expression programs controlled by the spatiotemporal activity of cis-regulatory elements. Although biochemical methods for identifying accessible chromatin, a hallmark of cis-regulatory elements, have been developed, approaches capable of measuring and quantifying cis-regulatory activity are only beginning to be realized. Massively Parallel Reporter Assays coupled to chromatin accessibility profiling present a high-throughput solution for testing the transcription-activating capacity of millions of putatively regulatory DNA sequences. However, clear computational pipelines for analyzing these high-throughput sequencing-based reporter assays are lacking. In this protocol, I lay out and rationalize a computational framework for the processing and analysis of data from Assay for Transposase-Accessible Chromatin profiling followed by Self-Transcribed Active Regulatory Region sequencing (ATAC-STARR-seq), using data from a recent study in Zea mays. The approach described herein can be adapted to other sequencing-based reporter assays and is largely agnostic to the model organism.
BWA MEM (see its install instructions)
SAMtools
BEDtools
SRA-toolkit
fastp
MACS2
UCSC binaries
tabix
IGV
MEME
CrossMap
DeepTools
The computational pipeline uses paired-end sequencing data from an ATAC-STARR-seq experiment performed on maize protoplasts (Ricci et al., 2019). The ATAC-STARR-seq experiment consisted of a DNA input (ATAC-seq library) and an mRNA readout (self-transcribed regulatory regions) to identify genomic regions exhibiting transcription-activating regulatory activity.
Additional details can be found in the paper (Ricci et al., 2019).
- Download data
# create a working directory and download FASTQ files
mkdir FASTQ_files
cd FASTQ_files
fasterq-dump -o B73_maize_DNA_input.fastq SRR10964904
fasterq-dump -o B73_maize_mRNA_output.fastq SRR10964905
# compress fastq files
pigz *.fastq
# NOT RUN
# Tip: gzip can be used as an alternative to pigz (parallel gzip)
# gzip *.fastq
# download reference data
cd ../
mkdir Genome_Reference
cd Genome_Reference
wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0.fa.gz
wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3.gz
# create indices for reference genome FASTA
gunzip Zm-B73-REFERENCE-NAM-5.0.fa.gz
samtools faidx Zm-B73-REFERENCE-NAM-5.0.fa
bwa index Zm-B73-REFERENCE-NAM-5.0.fa
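Before trimming, it can be worth confirming that both mates of each paired-end library contain the same number of reads. The sketch below is a minimal sanity check, not part of the original pipeline. It assumes four-line FASTQ records and a `<prefix>_1.fastq.gz` / `<prefix>_2.fastq.gz` naming pattern; the function names `count_reads` and `check_pair` are hypothetical, and the actual file names in FASTQ_files may differ.

```shell
# Count reads in a gzip-compressed FASTQ file (assumes 4 lines per record).
count_reads() {
  gzip -cd "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Compare mate counts for an assumed <prefix>_1/_2.fastq.gz pair.
# Returns non-zero if the two files disagree.
check_pair() {
  local prefix=$1 n1 n2
  n1=$(count_reads "${prefix}_1.fastq.gz")
  n2=$(count_reads "${prefix}_2.fastq.gz")
  if [ "$n1" = "$n2" ]; then
    echo "OK ${prefix}: ${n1} read pairs"
  else
    echo "MISMATCH ${prefix}: ${n1} vs ${n2}" >&2
    return 1
  fi
}

# Example (prefix is an assumption; adjust to the files on disk):
# check_pair FASTQ_files/B73_maize_DNA_input
```

A mismatch usually points to an interrupted download or compression step, in which case the affected accession should be fetched again.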
- Trim adapters and remove low quality reads
# run step 1
sbatch step01_trim_raw_reads.sh
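The contents of step01_trim_raw_reads.sh are not reproduced here. As a rough illustration of what a paired-end trimming step with fastp can look like, the sketch below wraps a single fastp call in a hypothetical function, `trim_pair`. The file-name pattern, the quality cutoff (`-q 20`), the minimum length (`-l 20`), and the thread count are all assumptions chosen for illustration and may not match the settings used in the original script.

```shell
# Hypothetical wrapper around fastp for one paired-end library.
# Arguments: input prefix, output prefix, threads (default 8).
trim_pair() {
  local inp=$1 out=$2 threads=${3:-8}
  fastp \
    -i "${inp}_1.fastq.gz" -I "${inp}_2.fastq.gz" \
    -o "${out}_1.trim.fastq.gz" -O "${out}_2.trim.fastq.gz" \
    --detect_adapter_for_pe \
    -q 20 -l 20 \
    -w "$threads" \
    -j "${out}.fastp.json" -h "${out}.fastp.html"
}

# Example calls (require fastp on PATH; prefixes are assumptions):
# trim_pair FASTQ_files/B73_maize_DNA_input   FASTQ_files/B73_maize_DNA_input
# trim_pair FASTQ_files/B73_maize_mRNA_output FASTQ_files/B73_maize_mRNA_output
```

The JSON and HTML reports written by fastp summarize how many reads were removed, which is a quick way to spot a library with unusually poor quality.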
- Align and process sequenced reads
# run step 2
sbatch step02_align_STARR_data.sh
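For orientation, an alignment step of this kind typically maps the trimmed reads with BWA MEM, discards low-confidence alignments, and writes a sorted, indexed BAM file. The sketch below shows one such arrangement; it is an assumption about the general shape of the step, not a copy of step02_align_STARR_data.sh. The function name `align_pair`, the `.trim.fastq.gz` suffix, and the filters (properly paired reads, MAPQ of at least 10) are illustrative choices.

```shell
# Hypothetical wrapper: align one trimmed paired-end library with BWA MEM,
# keep properly paired reads (-f 2) with MAPQ >= 10 (-q 10), then sort and index.
# Arguments: reference FASTA, input prefix, output prefix, threads (default 8).
align_pair() {
  local ref=$1 prefix=$2 out=$3 threads=${4:-8}
  bwa mem -t "$threads" "$ref" \
      "${prefix}_1.trim.fastq.gz" "${prefix}_2.trim.fastq.gz" |
    samtools view -b -f 2 -q 10 - |
    samtools sort -@ "$threads" -o "${out}.bam" -
  samtools index "${out}.bam"
}

# Example (requires bwa and samtools on PATH; paths are assumptions):
# align_pair Genome_Reference/Zm-B73-REFERENCE-NAM-5.0.fa \
#            FASTQ_files/B73_maize_DNA_input B73_maize_DNA_input
```

Streaming the alignments through a pipe avoids writing an uncompressed SAM file, which would be very large for a genome the size of maize.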
- Extract fragments
# run step 3
sbatch step03_extract_fragments.sh
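Because both the DNA input and the mRNA output are paired-end libraries, each read pair can be collapsed into a single genomic interval spanning the sequenced fragment. The sketch below illustrates one way to do this from SAM records, using the template length field (TLEN, column 9): the leftmost mate of a pair carries a positive TLEN, so its 1-based position and TLEN give the fragment in 0-based BED coordinates. The function name `sam_to_fragments` and the BAM file name in the example are hypothetical, and the original step03_extract_fragments.sh may use a different approach.

```shell
# Convert SAM records on stdin to a sorted 3-column BED of fragments.
# Only the leftmost mate of each pair (TLEN > 0) is used, so every
# fragment is reported exactly once.
sam_to_fragments() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@/   { next }                                # skip SAM header lines
       $9 > 0 { print $3, $4 - 1, $4 - 1 + $9 }' |    # chrom, start (0-based), end
    sort -k1,1 -k2,2n
}

# Example (requires samtools; the BAM name is a placeholder):
# samtools view -f 2 B73_maize_DNA_input.bam | sam_to_fragments > B73_maize_DNA_input.fragments.bed
```

Reads with a TLEN of zero (unpaired or mates on different chromosomes) are dropped by the `$9 > 0` condition.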
- Call peaks
# run step 4
sbatch step04_call_peaks.sh
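The exact MACS2 settings used in step04_call_peaks.sh are not shown in this protocol. The sketch below is one plausible configuration, offered only as an illustration: it treats the mRNA output fragments as the treatment and the DNA input fragments as the control, reads them as three-column fragment intervals (`-f BEDPE`), and keeps duplicate fragments. The function name `call_peaks`, the output naming, and in particular the effective genome size passed to `-g` are assumptions; the default below is a placeholder that should be replaced with a value appropriate for the reference in use.

```shell
# Hypothetical wrapper around MACS2 peak calling on fragment intervals.
# Arguments: treatment BED, control BED, run name,
#            output directory (default Peak_data),
#            effective genome size (default is a placeholder, not a recommendation).
call_peaks() {
  local treatment=$1 control=$2 name=$3 outdir=${4:-Peak_data} gsize=${5:-1.6e9}
  macs2 callpeak \
    -t "$treatment" -c "$control" \
    -f BEDPE \
    -g "$gsize" \
    --keep-dup all \
    -n "$name" --outdir "$outdir"
}

# Example (requires macs2 on PATH; file names are assumptions):
# call_peaks B73_maize_mRNA_output.fragments.bed B73_maize_DNA_input.fragments.bed STARR
```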
- Estimate enhancer activity
# run step 5
sbatch step05_estimate_enhancer_activity.sh
# estimate enhancer activity
cd BED_files/
Rscript Estimate_Enhancer_Activity.R
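A common way to express regulatory activity in STARR-seq style assays is the log2 ratio of library-size-normalized mRNA output to DNA input at each position. The sketch below illustrates that calculation under stated assumptions; it is not the code in Estimate_Enhancer_Activity.R, which may normalize differently. It assumes a tab-separated table on stdin with columns chrom, start, end, DNA count, and RNA count (for example from `bedtools unionbedg` on two coverage bedGraphs), total fragment counts supplied by the caller, and a pseudocount of 1 on the counts-per-million scale. The function name `enhancer_activity` is hypothetical.

```shell
# Compute log2((RNA_CPM + p) / (DNA_CPM + p)) per interval.
# Arguments: total DNA fragments, total RNA fragments, pseudocount (default 1).
# Input (stdin): chrom <tab> start <tab> end <tab> dna_count <tab> rna_count
enhancer_activity() {
  local dna_total=$1 rna_total=$2 pseudo=${3:-1}
  awk -v dt="$dna_total" -v rt="$rna_total" -v p="$pseudo" '
    BEGIN { FS = OFS = "\t" }
    {
      dna = $4 * 1e6 / dt          # DNA input, counts per million
      rna = $5 * 1e6 / rt          # mRNA output, counts per million
      printf "%s\t%s\t%s\t%.4f\n", $1, $2, $3, log((rna + p) / (dna + p)) / log(2)
    }'
}

# Example (file names and totals are placeholders):
# bedtools unionbedg -i DNA.cov.bdg RNA.cov.bdg |
#   enhancer_activity "$DNA_TOTAL" "$RNA_TOTAL" > B73_maize.enhancer_activity.bdg
```

Positive values indicate that a region is over-represented in the transcribed pool relative to its abundance in the input library; the pseudocount keeps the ratio finite where either count is zero.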
- Filter noisy STARR peaks using empirical FDR
# run step 6
sbatch step06_create_control_regions.sh
# create directory to contain analysis
cd ../
mkdir 01_Peak_Analysis
cd 01_Peak_Analysis
# map maximum enhancer activity to putative regulatory regions (wdups)
bedtools map -a ../Peak_data/STARR_merged_peaks.bed -b ../BED_files/B73_maize.enhancer_activity.bdg -o max -c 4 > STARR_merged_peaks.enhancer_activity.bed
# map maximum enhancer activity to control
bedtools map -a ../Peak_data/STARR_CONTROL.bed -b ../BED_files/B73_maize.enhancer_activity.bdg -o max -c 4 > STARR_CONTROL.enhancer_activity.bed
# run eFDR filter
Rscript eFDR_Filter_STARR_Peaks.R
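The idea behind an empirical FDR filter is to use the control regions as a null distribution: choose an activity threshold that only a small fraction of control regions exceed, then keep the peaks above it. The sketch below is a minimal illustration of that idea and should not be read as the logic of eFDR_Filter_STARR_Peaks.R. It assumes the activity value is the last column of each file (as appended by `bedtools map`), that regions without signal are marked with `.`, and a cutoff of 0.05; the function names `efdr_threshold` and `efdr_filter` are hypothetical.

```shell
# Print the smallest control activity value t such that the fraction of
# control regions with activity above t is at most alpha.
efdr_threshold() {
  local control=$1 alpha=${2:-0.05}
  awk '$NF != "." { print $NF }' "$control" | sort -g |
    awk -v a="$alpha" '
      { v[NR] = $1 }
      END {
        for (i = 1; i <= NR; i++)
          if (NR - i <= a * NR) { print v[i]; exit }
      }'
}

# Keep peaks whose activity (last column) exceeds the control-derived threshold.
efdr_filter() {
  local peaks=$1 control=$2 alpha=${3:-0.05} thr
  thr=$(efdr_threshold "$control" "$alpha") || return 1
  [ -n "$thr" ] || return 1          # no usable control values
  awk -v t="$thr" '$NF != "." && $NF + 0 > t + 0' "$peaks"
}

# Example (output file name is an assumption):
# efdr_filter STARR_merged_peaks.enhancer_activity.bed STARR_CONTROL.enhancer_activity.bed 0.05 > STARR_peaks.eFDR05.bed
```

With tied control values the realized fraction above the threshold can be slightly lower than alpha, which makes the filter marginally more conservative.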
- Plot heatmaps
# run step 7
sbatch step07_plot_enhancer_activity.sh
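Heatmaps of enhancer activity around a set of regions can be drawn with the UCSC and deepTools utilities listed above. The sketch below shows one possible sequence, given as an assumption rather than a description of step07_plot_enhancer_activity.sh: convert the activity bedGraph to bigWig, build a signal matrix centered on each region, and plot it. The function name `plot_activity_heatmap`, the 1,000 bp flank, and the output names are hypothetical. `bedGraphToBigWig` expects a coordinate-sorted bedGraph and a two-column chromosome sizes file, which can be derived from the FASTA index with `cut -f1,2`.

```shell
# Hypothetical wrapper: bedGraph -> bigWig -> deepTools matrix -> heatmap.
# Arguments: sorted bedGraph, chrom sizes file, regions BED,
#            output prefix, flank in bp (default 1000).
plot_activity_heatmap() {
  local bdg=$1 sizes=$2 regions=$3 prefix=$4 flank=${5:-1000}
  bedGraphToBigWig "$bdg" "$sizes" "${prefix}.bw" &&
  computeMatrix reference-point --referencePoint center \
    -S "${prefix}.bw" -R "$regions" \
    -b "$flank" -a "$flank" \
    -o "${prefix}.matrix.gz" &&
  plotHeatmap -m "${prefix}.matrix.gz" -out "${prefix}.heatmap.pdf"
}

# Example (paths are assumptions):
# cut -f1,2 Genome_Reference/Zm-B73-REFERENCE-NAM-5.0.fa.fai > chrom.sizes
# plot_activity_heatmap BED_files/B73_maize.enhancer_activity.bdg chrom.sizes \
#   Peak_data/STARR_merged_peaks.bed STARR_activity
```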
See the paper (Ricci et al., 2019) for downstream analyses and expected results.