Home

Table of Contents

Introduction
Setup / Installation
External Dependencies
MACS 1.4
MACS2
SPP
IDR
SJM
R
MySQL-Python
SAMTools
Globals.conf
Fields
Control Locking
ChIP-seq Pipeline Database Table
Adding Genomes
Running Pipeline
Configuration Files
control.conf
sample.conf
pipeline.py options
Parameters
MACS
MACS2
SPP
PeakSeq
IDR
Cross-Correlation
Results
Description of Result Files
[RunName]_conservative_narrowPeak.bed
[RunName]_optimal_narrowPeak.bed
full_report.txt
rep_stats
idr_results.txt
pbc_results.txt
spp_stats.txt
idr/
RepN/
RepN_PR1/ and RepN_PR2/
Future Directions / Wish List

Introduction

The ChIPSeqPipeline (https://github.com/StanfordBioInformatics/Scoring/wiki/Pipeline-Overview) was created to provide a standardized way to analyze ChIPSeq experiments in a high-throughput environment. It was developed as part of the ENCODE project and follows the standards for that project. A fuller discussion of the ENCODE standards for ChIPSeq can be found in the paper "ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia", S. Landt, G. Marinov, A. Kundaje et al. (2012), Genome Research 22: 1813-31.

The pipeline was designed to run on an HPC cluster using Sun Grid Engine. Other setups may work, but will require some interpretation of the created job files.

Setup / Installation

External Dependencies

Most of the external software dependencies are related to the actual peak callers. The pipeline supports SPP, MACS1.4, MACS2 and PeakSeq out of the box. SPP is recommended for ideal IDR results, but both versions of MACS should work. PeakSeq IDR support is weak because PeakSeq does not report enough noise, so it is included mainly for legacy support. For PeakSeq, we highly recommend using the unfiltered results and applying external filtering (i.e. using a q-value cutoff).

MACS 1.4

Software can be downloaded at http://liulab.dfci.harvard.edu/MACS/Download

MACS2

Software is still under development and is not in a stable state, but can be downloaded at the MACS github: https://github.com/taoliu/MACS/downloads

SPP

We use a modified version of the official 1.10 SPP package, found at http://code.google.com/p/phantompeakqualtools/. This package is required for cross-correlation analysis. (The pipeline will fail without this version of SPP installed.)

IDR

Software can be downloaded at https://sites.google.com/site/anshulkundaje/projects/idr

SJM

sjm (Simple Job Manager) is required to submit jobs to a Sun Grid Engine cluster. If SGE isn't installed, a jobs file can still be created describing the commands that need to be run and their dependencies, but an external script will be required to interpret it.

sjm can be downloaded at http://sourceforge.net/projects/hpcsjm/ (Tested on version 1.2.0)

R

The statistical programming package R is required (it is a dependency of SPP and IDR). The pipeline was developed using R 2.15.1. R is included in most standard distributions, but can also be downloaded at http://www.r-project.org

MySQL-Python

The MySQLdb libraries must be installed in order to use the control locking database. On scg3 they are installed in the standard python/2.7 (use modules add python/2.7). The libraries can be downloaded at http://mysql-python.sourceforge.net

SAMTools

Software can be downloaded at http://samtools.sourceforge.net

Globals.conf

All the global variables are stored in a globals.conf configuration file. The default globals.conf is located in the script directory (currently /srv/gs1/projects/scg/Scoring/pipeline2/). Individual runs can override it by placing a custom globals.conf file in the working directory of a run.
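The override rule above can be sketched in a few lines of Python 3. This is only an illustration of the lookup order (working directory first, then the script directory); the function name find_globals_conf is hypothetical and is not taken from the pipeline's code.

```python
import os


def find_globals_conf(working_dir, script_dir, filename="globals.conf"):
    """Return the path of the globals.conf that would apply to a run.

    Hypothetical helper: a file in the run's working directory wins,
    otherwise the default in the script directory is used.
    """
    for directory in (working_dir, script_dir):
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(
        "no %s found in %s or %s" % (filename, working_dir, script_dir)
    )
```

Calling find_globals_conf(os.getcwd(), script_dir) from a run directory would pick up a custom file if one is present and fall back to the default otherwise.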

Fields

BIN_DIR -- Location of pipeline scripts.
R_BINARY -- Location of Rscript binary
SPP_BINARY -- Command to call SPP
SPP_BINARY_NO_DUPS -- Command to call special version of SPP designed for runs where duplicated reads have been filtered out
MACS_BINARY -- Binary of MACS1.4
MACS_LIBRARY -- Location of MACS python library file
MACS2_BINARY -- Binary of MACS2
MACS2_LIBRARY -- Location of MACS2 python library files
PEAKSEQ_BINARY -- Location of PeakSeq binary
PEAKSEQ_BIN_SIZE -- Number of bins for PeakSeq to use when scoring. 10000 is the standard for human.
ARCHIVE_DIR -- Location on file system for where to store archived results
DOWNLOAD_BASE -- URL prefix for downloading archived results
SJM_NOTIFY -- Python-style list of email addresses to use for SJM's notifications. (When runs succeed and fail)
QUEUE -- SGE queue to use (optional)
SGE_PROJECT -- SGE project to use (optional)
MYSQL_PASSWORD_FILE -- Local file system location of a file which contains the MySQL password for the control locking database
CONTROL_DB_HOST -- Control locking MySQL host
CONTROL_DB_USER -- Control locking MySQL user
CONTROL_DB -- Control locking MySQL database name
CONTROL_DB_PORT -- Control locking MySQL database port
TMP_DIR -- Temp directory to store intermediate mapped read files
SAMTOOLS_BINARY -- Binary of samtools

Control Locking

ChIP-seq Pipeline Database Table

The ChIP-seq pipeline requires a small database for keeping track of control data sets that are currently being processed. The database allows a scoring job to determine if it is the first job to use a particular control, in which case it must be preprocessed, and it prevents subsequent jobs from attempting to preprocess the same control at the same time.

This system is optional and can be disabled by setting the USE_CONTROL_LOCK global variable in the peakcaller modules to False.
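The locking idea can be illustrated with a small Python 3 sketch. It uses an in-memory SQLite database as a stand-in for the MySQL database so that it runs anywhere, and the table name, column names and function names are hypothetical; the pipeline's real schema is not described here.

```python
import sqlite3


def open_lock_db():
    """Create an in-memory stand-in for the control locking database."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per control, keyed by its run_name.
    conn.execute(
        "CREATE TABLE control_lock (name TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def try_claim_control(conn, control_name):
    """Return True if this job is the first to use the control.

    The primary key makes the insert fail for every later job, so only
    the first caller is told to preprocess the control.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO control_lock (name, status) VALUES (?, 'processing')",
                (control_name,),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The first job to call try_claim_control(conn, "HumanControl") gets True and preprocesses the control; any later job gets False and knows the control is already being handled.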

Adding Genomes

Adding additional genomes is controlled in the chr_maps.py file. The following steps must be done in order to add a valid genome to the pipeline.

Add a mapping from genome chromosome FASTA file to chromosome name in chr_maps.py. This is necessary because eland displays the results by the source chromosome reference file.
Add chr mapping to the genomes map in chr_maps.py
Create an IDR binary directory for the genome. The directory should contain all the IDR binaries along with a genome_table.txt tailored for the genome. (See the IDR documentation.) The directory path should be added to the IDR_BIN_DIR mapping. This duplication is necessary because of a limitation of IDR. Manually edit batch-consistency-analysis.r to change the hard-coded source (line 55) and chr.file (line 58) paths.
Select the IDR filtering thresholds in the IDR_THRESHOLDS mapping in chr_maps.py
Specify the whole genome size in macs_genome_size to be used as a parameter for MACS. The default MACS parameters should already be set.
If using PeakSeq, set the location of the pre-generated mappability file in peakseq_mappability_file
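Because a new genome has to be registered in several places, a small completeness check can help catch a skipped step. The Python 3 sketch below assumes, purely for illustration, that each mapping named above is a dictionary keyed by genome name; the actual data structures in chr_maps.py may differ, and every value shown is a placeholder.

```python
def missing_genome_entries(genome, mappings):
    """Return the names of the mappings that have no entry for `genome`."""
    return [name for name, mapping in mappings.items() if genome not in mapping]


# Hypothetical stand-ins for the mappings named in chr_maps.py.
# "toy_genome" and all paths and numbers are placeholders, not real settings.
example_mappings = {
    "genomes": {"toy_genome": {"chr1.fa": "chr1"}},
    "IDR_BIN_DIR": {"toy_genome": "/path/to/idr/toy_genome"},
    "IDR_THRESHOLDS": {},  # step not done yet
    "macs_genome_size": {"toy_genome": 1000000},
    "peakseq_mappability_file": {},  # only needed when using PeakSeq
}

print(missing_genome_entries("toy_genome", example_mappings))
```

An empty result would mean every mapping has an entry for the genome; it does not check that the IDR directory or batch-consistency-analysis.r edits were done.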

Running Pipeline

Running the pipeline outside of SNAP requires two basic steps: creating the configuration files and setting options for the pipeline.py script.

Configuration Files

The locations of the input and output files are contained within two configuration files passed into the pipeline.py script. If control locking is set up, the expectation is that a single control config file will be frequently reused and the processing steps of the control will only occur once.

control.conf

Fields:

control_mapped_reads: Comma-delimited list of locations of alignment files. (eland*, bam or sam)
results_dir: Output directory for results
temporary_dir: Directory to place temporary files created during processing
run_name: Unique identifier for the control
genome: The genome the reads were mapped in. A list of valid genomes can be found in the chr_maps.py script.

Example:

[peakseq]
control_mapped_reads = /path/to/control1.bam,/path/to/control2.bam
results_dir = /scoring/results/HumanControl
temporary_dir = /srv/gs1/projects/scg/Scoring/tmp
run_name = HumanControl
genome = hg19_male

sample.conf

Fields:

run_name: Unique identifier for the sample
results_dir: Output directory for results. Must not already exist.
genome: The genome the reads were mapped in. A list of valid genomes can be found in the chr_maps.py script.
temporary_dir: Directory to place temporary files created during processing
mapped_reads (per replicate): Comma-delimited list of locations of alignment files. (eland*, bam or sam)

Example:

[replicate1]
mapped_reads = /path/to/rep1.bam

[replicate2]
mapped_reads = /path/to/rep2a.bam,/path/to/rep2b.bam
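For reference, the replicate sections in the example above can be read with Python 3's standard configparser module. This is a minimal sketch, not the pipeline's own parser; the function name read_replicates and the rule that replicate sections start with "replicate" are assumptions based on the example.

```python
import configparser


def read_replicates(conf_text):
    """Map each replicate section to its list of alignment files."""
    # interpolation=None keeps any '%' characters in paths literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    replicates = {}
    for section in parser.sections():
        if section.startswith("replicate"):
            raw = parser.get(section, "mapped_reads")
            # mapped_reads is a comma-delimited list of file locations.
            replicates[section] = [p.strip() for p in raw.split(",") if p.strip()]
    return replicates


example = """
[replicate1]
mapped_reads = /path/to/rep1.bam

[replicate2]
mapped_reads = /path/to/rep2a.bam,/path/to/rep2b.bam
"""

for name, files in read_replicates(example).items():
    print(name, files)
```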

pipeline.py options

Runs PeakSeq scoring pipeline for ChipSeq data.

Usage:  pipeline.py [-f] [-p] [-h] [-s] [-a] [-m <email address>] 
[-l <directory>] [-n <run_name>] [-c <peakcaller>] <control_config_file>
 [<sample_config_file>]

Arguments:
	-c, --peakcaller <peakcaller>
	specify the peakcaller to be used.  Current options are peakseq, macs, 
	macs2, spp. Defaults to macs2.
	
	-a, --no_archive
	does not archive the control and sample results.  
	
	-f, --force
	forces running of pipeline, even if results already exist
	
	-p, --print
	prints the job commands, but does not dispatch them to the cluster
	
	-d, --no_duplicates
	runs cross correlation analysis assuming duplicated reads have
	already been filtered out of the mapped reads.  Uncommon, so
	defaults to false.
	
	-h, --help
	displays this usage information and exits
	
	-l <directory>, --log <directory>
	log directory, current working directory if not specified
	
	-n <run_name>, --name <run_name>
	name for the pipeline run
	
	-m <email_address>, --mail <email_address>
	email address to send summary and result location
	
	-s, --snap
	make a call to the SNAP LIMS after completion
	
	--filtchr <chromosome>
	SPP option to ignore a chromosome during analysis.  Used to fix bug that 
	chrs with low read counts causes SPP to fail. 
	
	--rmdups
	Filter out all duplicate reads in sample read files before peakcalling.  Use
	when PCR amplification errors are present.  (i.e., PBC value is low)
	
	<control_config_file>
	(required) configuration file for the experiment's control
	
	<sample_config_file>
	configuration file for the sample replicates in the experiment.  Optional, 
	but in most cases this is specified.
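As an illustration of how these options combine, the Python 3 sketch below assembles a pipeline.py command line using only flags listed above. The helper name build_pipeline_command and the file names in the example are hypothetical; using -p first is a convenient way to inspect the generated job commands without dispatching them.

```python
import shlex

VALID_PEAKCALLERS = ("peakseq", "macs", "macs2", "spp")


def build_pipeline_command(control_conf, sample_conf=None, peakcaller="macs2",
                           run_name=None, mail=None, print_only=False):
    """Assemble a pipeline.py argument list from the documented options."""
    if peakcaller not in VALID_PEAKCALLERS:
        raise ValueError("unknown peakcaller: %s" % peakcaller)
    cmd = ["pipeline.py", "-c", peakcaller]
    if print_only:
        cmd.append("-p")  # print job commands, do not dispatch them
    if run_name:
        cmd += ["-n", run_name]
    if mail:
        cmd += ["-m", mail]
    cmd.append(control_conf)  # required
    if sample_conf:
        cmd.append(sample_conf)  # optional, but usually given
    return cmd


cmd = build_pipeline_command("control.conf", "sample.conf", peakcaller="spp",
                             run_name="MyRun", print_only=True)
print(" ".join(shlex.quote(arg) for arg in cmd))
```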

Parameters

Parameters given to external dependencies. Provided for reference only. Most options will not need to be changed.

MACS

macs14 -t [sample_eland_file] -c [control_eland_file] -n [name] -g [genome_size] -w -p 1e-2 --nomodel --shiftsize=[frag_size]

sample and control eland files are pre-filtered for uniquely mapping reads with no more than 2 mismatches
frag_size is half of the fragment length estimated by the cross-correlation analysis (frag_length / 2)
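The relationship between the cross-correlation estimate and the --shiftsize argument can be sketched as follows in Python 3. Two points are assumptions rather than documented behavior: the top (first) value of a comma-separated estimate is used as the fragment length, and the halving uses integer division. The function names are hypothetical.

```python
def shift_size_from_frag_length(est_frag_len):
    """Return frag_length / 2 for a cross-correlation estimate.

    `est_frag_len` may be a comma-separated list; the first (top)
    value is taken as the fragment length (assumption).
    """
    frag_length = int(str(est_frag_len).split(",")[0])
    return frag_length // 2  # integer division is an assumption


def build_macs14_command(sample, control, name, genome_size, est_frag_len):
    """Assemble the macs14 call shown above as an argument list."""
    shift = shift_size_from_frag_length(est_frag_len)
    return ["macs14", "-t", sample, "-c", control, "-n", name,
            "-g", str(genome_size), "-w", "-p", "1e-2",
            "--nomodel", "--shiftsize=%d" % shift]


print(build_macs14_command("sample.eland", "control.eland", "MyRun",
                           1000000, "200,215,230"))
```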

MACS2

macs2 callpeak -t [sample_eland_file] -c [control_eland_file] -f ELAND -n [name] -g [genome_size] -B -p 0.1 --to-large --nomodel 
--shiftsize=[frag_size]

sample and control eland files are pre-filtered for uniquely mapping reads with no more than 2 mismatches
frag_size is half of the fragment length estimated by the cross-correlation analysis (frag_length / 2)

SPP

Rscript run_spp.R -rf -c=[sample_tagAlign_file] -i=[control_tagAlign_file] -npeak=300000 -odir=[results_directory] -savr -savp 
-x=-500:50 -out=[output_statistics_file]

Optionally, the -filtchr flag can be specified to filter out reads from a given chromosome
sample and control tagAlign files are pre-filtered for uniquely mapping reads with no more than 2 mismatches

PeakSeq

Peak-Seq_v1.02 [sample_eland_file] [control_eland_file] [output_sgr_file] [output_hits_file] [bin_size] [mappability_file]

bin_size defaults to 10000
mappability_file is a precomputed file describing how uniquely mappable each region of the genome is

IDR

Rscript batch-consistency-analysis.r [narrowPeak_rep_a] [narrowPeak_rep_b] -1 [output_file] 0 F [ranking_measure]

ranking_measure is specific to each peakcaller:
- PeakSeq = q.value
- SPP = signal.value
- MACS = p.value
- MACS2 = p.value
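The peakcaller-specific ranking measure lends itself to a simple lookup table. The Python 3 sketch below builds the IDR call shown above as an argument list; the dictionary and function names are hypothetical, while the argument order and fixed values (-1, 0, F) are copied from the command line above.

```python
# Ranking measure passed to IDR for each peakcaller, as listed above.
RANKING_MEASURE = {
    "peakseq": "q.value",
    "spp": "signal.value",
    "macs": "p.value",
    "macs2": "p.value",
}


def build_idr_command(rep_a, rep_b, output_file, peakcaller):
    """Assemble the batch-consistency-analysis.r call for two narrowPeak files."""
    return ["Rscript", "batch-consistency-analysis.r", rep_a, rep_b,
            "-1", output_file, "0", "F", RANKING_MEASURE[peakcaller]]


print(build_idr_command("Rep1.narrowPeak", "Rep2.narrowPeak",
                        "Rep1_VS_Rep2", "spp"))
```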

Cross-Correlation

Rscript run_spp.R -rf -c=[sample_tagAlign_file] -savp -x=-50:40 -out=[output_file]

Optionally, the -filtchr flag can be specified to filter out reads from a given chromosome
sample tagAlign file is pre-filtered for uniquely mapping reads with no more than 2 mismatches

Results

All of the scoring results will be put into the specified results directory. That directory will also be archived and the archive deposited in the specified global archive directory.

Description of Result Files

[RunName]_conservative_narrowPeak.bed

Conservatively IDR thresholded list of peaks in narrowPeak format.

[RunName]_optimal_narrowPeak.bed

IDR thresholded list of peaks in narrowPeak format.

full_report.txt

Human-readable report of peak calling statistics. Used as body for email notification.

rep_stats

Parsable report of peak calling statistics. Fields:

sample_tar_complete: Location of archived sample results
control_tar_complete: Location of archived control results
num_reads=Repi: Number of reads for Rep i
read_files=Repi: Original aligned read file(s) used for repi
[deprecated] total_hits1=RepN_VS_RepM=q_value: Number of hits from RepN found in RepM above the specified q value
[deprecated] total_hits2=RepN_VS_RepM=q_value: Number of hits from RepM found in RepN above the specified q value
[deprecated] rep_overlap=RepN_VS_RepM: Percentage of hits from RepN found in RepM

idr_results.txt

File containing the number of hits passing the IDR threshold for each of the IDR tests. IDR tests include:

Rep vs Rep (Nt)
Self-consistency Reps (PR1_VS_PR2) (Ns)
Pooled Self-consistency (RepAll_PR1_VS_PR2) (Np)

pbc_results.txt

Per-replicate statistics on the PCR Bottlenecking Coefficient (a measure of library complexity). Columns are:

Rep Name
Genomic Locations with exactly one read
Total mapped genomic locations
PBC value (percent of genomic locations mapped exactly once)
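The three numeric columns can be reproduced from a list of mapped read positions. The Python 3 sketch below is an illustration, not the pipeline's implementation: it assumes a genomic location is identified by a (chromosome, start, strand) tuple and returns the PBC as a fraction between 0 and 1.

```python
from collections import Counter


def pbc_stats(read_positions):
    """Return (locations with exactly one read, total locations, PBC).

    `read_positions` is an iterable of hashable location keys, for
    example (chromosome, start, strand) tuples (assumption).
    """
    counts = Counter(read_positions)
    total_locations = len(counts)
    if total_locations == 0:
        raise ValueError("no mapped reads")
    single_read_locations = sum(1 for n in counts.values() if n == 1)
    return (single_read_locations, total_locations,
            single_read_locations / total_locations)


# Toy data: five reads over four distinct locations, one location hit twice.
reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 250, "-"),
         ("chr2", 40, "+"), ("chr2", 90, "-")]
print(pbc_stats(reads))
```

A low value means many locations are covered by more than one read, which is the situation the --rmdups option is meant for.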

spp_stats.txt

Cross-correlation statistics.

COL1: Filename: tagAlign/BAM filename
COL2: numReads: effective sequencing depth i.e. total number of mapped reads in input file
COL3: estFragLen: comma separated strand cross-correlation peak(s) in decreasing order of correlation.
      The top 3 local maxima locations that are within 90% of the maximum cross-correlation value are output.
      In almost all cases, the top (first) value in the list represents the predominant fragment length.
      If you want to keep only the top value, simply run
      sed -r 's/,[^\t]+//g' <outFile> > <newOutFile>
COL4: corr_estFragLen: comma separated strand cross-correlation value(s) in decreasing order (COL3 follows the same order)
COL5: phantomPeak: Read length/phantom peak strand shift
COL6: corr_phantomPeak: Correlation value at phantom peak
COL7: argmin_corr: strand shift at which cross-correlation is lowest
COL8: min_corr: minimum value of cross-correlation
COL9: Normalized strand cross-correlation coefficient (NSC) = COL4 / COL8
COL10: Relative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8)
COL11: QualityTag: Quality tag based on thresholded RSC (codes: -2:veryLow,-1:Low,0:Medium,1:High,2:veryHigh)
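The NSC and RSC definitions above can be checked against a line of spp_stats.txt with a short Python 3 sketch. It assumes the file is tab-separated with the eleven columns in the order listed, and that the first (top) value of COL4 is the one used in the formulas; the field names and the sample line are made up for illustration.

```python
FIELDS = ("filename", "num_reads", "est_frag_len", "corr_est_frag_len",
          "phantom_peak", "corr_phantom_peak", "argmin_corr", "min_corr",
          "nsc", "rsc", "quality_tag")


def parse_spp_stats_line(line):
    """Split one tab-separated line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def recompute_nsc_rsc(record):
    """Recompute NSC = COL4 / COL8 and RSC = (COL4 - COL8) / (COL6 - COL8)."""
    top_corr = float(record["corr_est_frag_len"].split(",")[0])  # top value only
    phantom_corr = float(record["corr_phantom_peak"])
    min_corr = float(record["min_corr"])
    nsc = top_corr / min_corr
    rsc = (top_corr - min_corr) / (phantom_corr - min_corr)
    return nsc, rsc


# Made-up example line, not real pipeline output.
line = "rep1.tagAlign\t10000000\t200,215\t0.30,0.29\t36\t0.25\t1500\t0.20\t1.5\t2.0\t1"
record = parse_spp_stats_line(line)
print(recompute_nsc_rsc(record))
```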

idr/

Files generated during IDR threshold calculations. Typically not useful for the average user. Results described at https://sites.google.com/site/anshulkundaje/projects/idr

RepN/

Directory containing the raw peak calling results from the specified peak caller. These are results produced prior to IDR thresholding so they may contain a lot of noise. Signal maps are also included, although the format of each may change depending on which peak caller was chosen.

RepN_PR1/ and RepN_PR2/

Directory containing the raw peak calling results for the pseudoreplicates. These are used during the IDR calculations and should be ignored by most users. Provided for debugging purposes only.

Future Directions / Wish List

Proper deletions of temporary files
Better recovery from failed control runs (i.e. update the MySQL control locking table)
