somalee edited this page Feb 21, 2013 · 4 revisions

Introduction

The ChIPSeqPipeline (https://github.com/StanfordBioInformatics/Scoring/wiki/Pipeline-Overview) provides a standardized way to analyze ChIPSeq experiments in a high-throughput environment. It was developed as part of the ENCODE project and follows that project's standards. For a fuller discussion of the ENCODE standards for ChIPSeq, see: S. Landt, G. Marinov, A. Kundaje et al. (2012). ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research 22: 1813-31.

The pipeline was designed to run on an HPC cluster using Sun Grid Engine. Other setups may work, but will require a separate script to interpret the generated job files.

Setup / Installation

External Dependencies

Most of the external software dependencies are related to the actual peak callers. The pipeline supports SPP, MACS1.4, MACS2 and PeakSeq out of the box. SPP is recommended for the best IDR results, but both versions of MACS should work. PeakSeq's IDR support is weak because it does not report enough noise, so it is included mainly for legacy support. With PeakSeq, we highly recommend using the unfiltered results and applying external filtering (e.g., a q-value cutoff).

MACS 1.4

Software can be downloaded at http://liulab.dfci.harvard.edu/MACS/Download

MACS2

MACS2 is still under development and is not yet stable, but it can be downloaded from the MACS GitHub page: https://github.com/taoliu/MACS/downloads

SPP

We use a modified version of the official 1.10 SPP package, found at http://code.google.com/p/phantompeakqualtools/. This package is required for cross-correlation analysis. (The pipeline will fail without this version of SPP installed.)

IDR

Software can be downloaded at https://sites.google.com/site/anshulkundaje/projects/idr

SJM

sjm (Simple Job Manager) is required to submit jobs to a Sun Grid Engine cluster. If SGE isn't installed, a jobs file can still be created describing the commands that need to be run and their dependencies, but an external script will be required to interpret it.

sjm can be downloaded at http://sourceforge.net/projects/hpcsjm/ (Tested on version 1.2.0)

R

The statistical programming package R is required. (Note that it is a dependency of both SPP and IDR.) We developed the pipeline using R 2.15.1. R is included in most standard distributions, but can also be downloaded at http://www.r-project.org

MySQL-Python

The MySQLdb libraries must be installed in order to use the control locking database. On scg3 they are installed in the standard python/2.7 (use modules add python/2.7). The libraries can be downloaded at http://mysql-python.sourceforge.net

SAMTools

Software can be downloaded at http://samtools.sourceforge.net

Globals.conf

All the global variables are stored in a globals.conf configuration file. The default globals.conf is located in the script directory (currently /srv/gs1/projects/scg/Scoring/pipeline2/). Individual runs can override it by placing a custom globals.conf file in the working directory of the run.
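
The override rule can be sketched as follows. This is only an illustration, not the pipeline's own code: the helper name resolve_globals_conf is hypothetical, and the sketch assumes the file is always named globals.conf.

```python
import os


def resolve_globals_conf(script_dir, working_dir):
    """Return the globals.conf path a run would use (hypothetical helper)."""
    local_conf = os.path.join(working_dir, "globals.conf")
    # A run-specific file in the working directory takes precedence.
    if os.path.isfile(local_conf):
        return local_conf
    # Otherwise fall back to the default next to the pipeline scripts.
    return os.path.join(script_dir, "globals.conf")
```

Calling resolve_globals_conf with the script directory and the run's working directory returns the run-specific file when it exists and the default otherwise.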

Fields

  • BIN_DIR -- Location of pipeline scripts.
  • R_BINARY -- Location of Rscript binary
  • SPP_BINARY -- Command to call SPP
  • SPP_BINARY_NO_DUPS -- Command to call special version of SPP designed for runs where duplicated reads have been filtered out
  • MACS_BINARY -- Binary of MACS1.4
  • MACS_LIBRARY -- Location of MACS python library file
  • MACS2_BINARY -- Binary of MACS2
  • MACS2_LIBRARY -- Location of MACS2 python library files
  • PEAKSEQ_BINARY -- Location of PeakSeq binary
  • PEAKSEQ_BIN_SIZE -- Number of bins for PeakSeq to use when scoring. 10000 is the standard for human.
  • ARCHIVE_DIR -- Location on file system for where to store archived results
  • DOWNLOAD_BASE -- URL prefix for downloading archived results
  • SJM_NOTIFY -- Python-style list of email addresses to use for SJM's notifications. (When runs succeed and fail)
  • QUEUE -- SGE queue to use (optional)
  • SGE_PROJECT -- SGE project to use (optional)
  • MYSQL_PASSWORD_FILE -- Local file system location of a file which contains the MySQL password for the control locking database
  • CONTROL_DB_HOST -- Control locking MySQL host
  • CONTROL_DB_USER -- Control locking MySQL user
  • CONTROL_DB -- Control locking MySQL database name
  • CONTROL_DB_PORT -- Control locking MySQL database port
  • TMP_DIR -- Temp directory to store intermediate mapped read files
  • SAMTOOLS_BINARY -- Binary of samtools

Control Locking

ChIP-seq Pipeline Database Table

The ChIP-seq pipeline requires a small database for keeping track of control data sets that are currently being processed. The database allows a scoring job to determine if it is the first job to use a particular control, in which case it must be preprocessed, and it prevents subsequent jobs from attempting to preprocess the same control at the same time.

This system is optional and can be disabled by setting the USE_CONTROL_LOCK global variables in the peakcaller modules to False.
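
The idea behind control locking can be illustrated with a small sketch. It is not the pipeline's implementation: it uses Python's built-in sqlite3 module in place of MySQL so that it runs anywhere, and the table name control_lock, its columns, and the function claim_control are all hypothetical.

```python
import sqlite3


def claim_control(conn, control_name):
    """Return True if this job is the first to claim the control.

    Hypothetical schema: one row per control, keyed by name. The primary-key
    constraint makes the claim atomic, so only one job can win it.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS control_lock "
        "(name TEXT PRIMARY KEY, status TEXT)"
    )
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO control_lock (name, status) VALUES (?, 'processing')",
                (control_name,),
            )
        return True   # first job: it must preprocess the control
    except sqlite3.IntegrityError:
        return False  # another job already claimed this control


conn = sqlite3.connect(":memory:")
first = claim_control(conn, "HumanControl")   # first job to use the control
second = claim_control(conn, "HumanControl")  # a later job using the same control
```

Here the first call claims the control and the second call is refused, which is the behaviour that keeps two jobs from preprocessing the same control at the same time.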

Adding Genomes

Genomes are added through the chr_maps.py file. Complete all of the following steps to add a valid genome to the pipeline.

  • Add a mapping from each genome chromosome FASTA file to its chromosome name in chr_maps.py. This is necessary because eland reports results by the source chromosome reference file.
  • Add the chromosome mapping to the genomes map in chr_maps.py
  • Create an IDR binary directory for the genome. The directory should contain all the IDR binaries along with a genome_table.txt tailored for the genome (see the IDR documentation). Add the directory path to the IDR_BIN_DIR mapping. This duplication is necessary because of a limitation of IDR. Manually edit batch-consistency-analysis.r to change the hard-coded source (line 55) and chr.file (line 58) paths.
  • Select the IDR filtering thresholds in the IDR_THRESHOLDS mapping in chr_maps.py
  • Specify the whole genome size in macs_genome_size to be used as a parameter for MACS. The default MACS parameters should already be set.
  • If using PeakSeq, set the location of the pre-generated mappability file in peakseq_mappability_file

Running Pipeline

Running the pipeline outside of SNAP requires two basic steps: creating the configuration files and setting options for the pipeline.py script.

Configuration Files

The locations of the input and output files are contained within two configuration files passed into the pipeline.py script. If control locking is set up, the expectation is that a single control config file will be reused frequently and the processing steps for the control will occur only once.

control.conf

Fields:

  • control_mapped_reads: Comma-delimited list of locations of alignment files. (eland*, bam or sam)
  • results_dir: Output directory for results
  • temporary_dir: Directory to place temporary files created during processing
  • run_name: Unique identifier for the control
  • genome: The genome the reads were mapped to. A list of valid genomes can be found in the chr_maps.py script.
Example:
[peakseq]
control_mapped_reads = /path/to/control1.bam,/path/to/control2.bam
results_dir = /scoring/results/HumanControl
temporary_dir = /srv/gs1/projects/scg/Scoring/tmp
run_name = HumanControl
genome = hg19_male

sample.conf

Fields:

  • run_name: Unique identifier for the sample
  • results_dir: Output directory for results. Must not already exist.
  • genome: The genome the reads were mapped to. A list of valid genomes can be found in the chr_maps.py script.
  • temporary_dir: Directory to place temporary files created during processing
  • mapped_reads (per replicate): Comma-delimited list of locations of alignment files. (eland*, bam or sam)
Example:
[replicate1]
mapped_reads = /path/to/rep1.bam

[replicate2]
mapped_reads = /path/to/rep2a.bam,/path/to/rep2b.bam
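
Because the configuration files use an INI-style layout, they can be inspected with Python's standard configparser module. The sketch below reads the replicate sections from the example above; it is an illustration under assumptions, not the pipeline's own parser. The function name read_replicates is hypothetical, and it assumes replicate sections are always named with the prefix "replicate" as in the example.

```python
import configparser

EXAMPLE = """\
[replicate1]
mapped_reads = /path/to/rep1.bam

[replicate2]
mapped_reads = /path/to/rep2a.bam,/path/to/rep2b.bam
"""


def read_replicates(text):
    """Map each replicate section to its list of alignment files."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    replicates = {}
    for name in parser.sections():
        # Assumption: replicate sections start with "replicate".
        if name.startswith("replicate"):
            paths = parser[name]["mapped_reads"].split(",")
            replicates[name] = [p.strip() for p in paths if p.strip()]
    return replicates


reps = read_replicates(EXAMPLE)
```

The same splitting on commas would apply to the control_mapped_reads field of control.conf.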

pipeline.py options

Runs PeakSeq scoring pipeline for ChipSeq data.

Usage:  pipeline.py [-f] [-p] [-h] [-s] [-a] [-m <email address>] 
[-l <directory>] [-n <run_name>] [-c <peakcaller>] <control_config_file>
 [<sample_config_file>]

Arguments:
	-c, --peakcaller <peakcaller>
	specify the peakcaller to be used.  Current options are peakseq, macs, 
	macs2, spp. Defaults to macs2.
	
	-a, --no_archive
	does not archive the control and sample results.  
	
	-f, --force
	forces running of pipeline, even if results already exist
	
	-p, --print
	prints the job commands, but does not dispatch them to the cluster
	
	-d, --no_duplicates
	runs cross correlation analysis assuming duplicated reads have
	already been filtered out of the mapped reads.  Uncommon, so
	defaults to false.
	
	-h, --help
	displays this usage information and exits
	
	-l <directory>, --log <directory>
	log directory, current working directory if not specified
	
	-n <run_name>, --name <run_name>
	name for the pipeline run
	
	-m <email_address>, --mail <email_address>
	email address to send summary and result location
	
	-s, --snap
	make a call to the SNAP LIMS after completion
	
	--filtchr <chromosome>
	SPP option to ignore a chromosome during analysis.  Works around a bug where 
	chromosomes with low read counts cause SPP to fail. 
	
	--rmdups
	Filter out all duplicate reads in sample read files before peakcalling.  Use
	when PCR amplification errors are present.  (i.e., PBC value is low)
	
	<control_config_file>
	(required) configuration file for the experiment's control
	
	<sample_config_file>
	configuration file for the sample replicates in the experiment.  Optional, 
	but in most cases this is specified.
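
When launching runs from another script, the command line can be assembled programmatically. The helper below is a hypothetical sketch (build_pipeline_command is not part of the pipeline); it uses only flags listed in the usage text above and returns an argument list suitable for subprocess.

```python
def build_pipeline_command(control_conf, sample_conf=None, peakcaller="macs2",
                           print_only=False, run_name=None):
    """Assemble an argv list for pipeline.py (hypothetical helper)."""
    valid = ("peakseq", "macs", "macs2", "spp")
    if peakcaller not in valid:
        raise ValueError("unknown peakcaller: %s" % peakcaller)
    argv = ["pipeline.py", "-c", peakcaller]
    if print_only:
        argv.append("-p")         # print job commands without dispatching them
    if run_name:
        argv += ["-n", run_name]  # name for the pipeline run
    argv.append(control_conf)     # required positional argument
    if sample_conf:
        argv.append(sample_conf)  # optional positional argument
    return argv


cmd = build_pipeline_command("control.conf", "sample.conf",
                             peakcaller="spp", print_only=True, run_name="MyRun")
```

Using -p first is a cheap way to review the generated job commands before dispatching anything to the cluster.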

Parameters

Parameters given to external dependencies. Provided for reference only. Most options will not need to be changed.

MACS

macs14 -t [sample_eland_file] -c [control_eland_file] -n [name] -g [genome_size] -w -p 1e-2 --nomodel --shiftsize=[frag_size] 
  • sample and control eland files are pre-filtered for uniquely mapping reads with no more than 2 mismatches
  • frag_size is half the fragment length estimated by the cross-correlation analysis (frag_length / 2)
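
As a rough illustration of how the --shiftsize value relates to the cross-correlation output, the sketch below takes the comma-separated estFragLen field (COL3 of spp_stats.txt) and halves its first value. The function name macs_shiftsize is hypothetical, and both the use of the first value and the integer division are assumptions; the pipeline may select or round differently.

```python
def macs_shiftsize(est_frag_len):
    """Return frag_length / 2 for the top cross-correlation peak.

    est_frag_len is the comma-separated estFragLen field, e.g. "210,195,230".
    Assumption: the first value is the predominant fragment length.
    """
    top_fragment_length = int(est_frag_len.split(",")[0])
    return top_fragment_length // 2  # integer division is an assumption


shift = macs_shiftsize("210,195,230")  # made-up example values
```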

MACS2

macs2 callpeak -t [sample_eland_file] -c [control_eland_file] -f ELAND -n [name] -g [genome_size] -B -p 0.1 --to-large --nomodel --shiftsize=[frag_size]
  • sample and control eland files are pre-filtered for uniquely mapping reads with no more than 2 mismatches
  • frag_size is half the fragment length estimated by the cross-correlation analysis (frag_length / 2)

SPP

Rscript run_spp.R -rf -c=[sample_tagAlign_file] -i=[control_tagAlign_file] -npeak=300000 -odir=[results_directory] -savr -savp -x=-500:50 -out=[output_statistics_file]
  • Optionally, can specify -filtchr flag to filter out reads from specified chromosome
  • sample and control tagAlign files are pre-filtered for uniquely mapping reads with no more than 2 mismatches 

PeakSeq

Peak-Seq_v1.02 [sample_eland_file] [control_eland_file] [output_sgr_file] [output_hits_file] [bin_size] [mappability_file]
  • bin_size is defaulted to 10000
  • mappability_file is precomputed file describing how uniquely mappable each region of the genome is

IDR

Rscript batch-consistency-analysis.r [narrowPeak_rep_a] [narrowPeak_rep_b] -1 [output_file] 0 F [ranking_measure]
  • ranking_measure is specific to each peakcaller:
    • PeakSeq = q.value
    • SPP = signal.value
    • MACS = p.value
    • MACS2 = p.value
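
The mapping from peak caller to ranking measure can be captured in a small lookup table. The sketch below builds the IDR command shown above as an argument list; the names RANKING_MEASURE and build_idr_command are hypothetical and are not taken from the pipeline source.

```python
# Ranking measure passed to IDR for each supported peak caller (from the list above).
RANKING_MEASURE = {
    "peakseq": "q.value",
    "spp": "signal.value",
    "macs": "p.value",
    "macs2": "p.value",
}


def build_idr_command(rep_a, rep_b, output_file, peakcaller):
    """Assemble the batch-consistency-analysis.r call (hypothetical helper)."""
    return [
        "Rscript", "batch-consistency-analysis.r",
        rep_a, rep_b,
        "-1", output_file, "0", "F",  # fixed arguments as shown above
        RANKING_MEASURE[peakcaller],
    ]


idr_cmd = build_idr_command("rep1.narrowPeak", "rep2.narrowPeak", "rep1_vs_rep2", "spp")
```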

Cross-Correlation

Rscript run_spp.R -rf -c=[sample_tagAlign_file] -savp -x=-50:40 -out=[output_file]
  • Optionally, can specify -filtchr flag to filter out reads from specified chromosome
  • sample tagAlign file is pre-filtered for uniquely mapping reads with no more than 2 mismatches 

Results

All of the scoring results will be put into the specified results directory. That directory will also be archived and the archive deposited in the specified global archive directory.

Description of Result Files

[RunName]_conservative_narrowPeak.bed

Conservatively IDR thresholded list of peaks in narrowPeak format.

[RunName]_optimal_narrowPeak.bed

IDR thresholded list of peaks in narrowPeak format.

full_report.txt

Human-readable report of peak calling statistics. Used as body for email notification.

rep_stats

Parsable report of peak calling statistics. Fields:

  • sample_tar_complete: Location of archived sample results
  • control_tar_complete: Location of archived control results
  • num_reads=Repi: Number of reads for Rep i
  • read_files=Repi: Original aligned read file(s) used for Rep i
  • [deprecated] total_hits1=RepN_VS_RepM=q_value: Number of hits from RepN found in RepM above the specified q value
  • [deprecated] total_hits2=RepN_VS_RepM=q_value: Number of hits from RepM found in RepN above the specified q value
  • [deprecated] rep_overlap=RepN_VS_RepM: Percentage of hits from RepN found in RepM

idr_results.txt

File containing the number of hits passing the IDR threshold for each of the IDR tests. IDR tests include:

  • Rep vs Rep (Nt)
  • Self-consistency Reps (PR1_VS_PR2) (Ns)
  • Pooled Self-consistency (RepAll_PR1_VS_PR2) (Np)

pbc_results.txt

Per-replicate statistics on the PCR Bottlenecking Coefficient (a measure of library complexity). Columns are:

  • Rep Name
  • Genomic Locations with exactly one read
  • Total mapped genomic locations
  • PBC value (percent of genomic locations mapped exactly once)
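
The three numeric columns can be reproduced from a list of mapped read positions, as in the sketch below. This is an illustration of the definition given above, not the pipeline's code: the function name pbc_stats is hypothetical, a location is assumed to be a (chromosome, strand, start) tuple, and the value is returned as a fraction (multiply by 100 for a percentage).

```python
from collections import Counter


def pbc_stats(read_locations):
    """Return (locations with exactly one read, total locations, PBC)."""
    counts = Counter(read_locations)
    total_locations = len(counts)
    if total_locations == 0:
        raise ValueError("no mapped reads")
    single_read_locations = sum(1 for n in counts.values() if n == 1)
    return (single_read_locations, total_locations,
            single_read_locations / total_locations)


# Made-up example: four reads, two of them stacked on the same location.
reads = [("chr1", "+", 100), ("chr1", "+", 100), ("chr1", "-", 250), ("chr2", "+", 40)]
single, total, pbc = pbc_stats(reads)
```

In this example three distinct locations are covered and two of them carry exactly one read, so the PBC is 2/3. A low value suggests many duplicate reads, which is the situation the --rmdups option is meant for.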

spp_stats.txt

Cross-correlation statistics.

  • COL1: Filename: tagAlign/BAM filename
  • COL2: numReads: effective sequencing depth i.e. total number of mapped reads in input file
  • COL3: estFragLen: comma-separated strand cross-correlation peak(s) in decreasing order of correlation.
      The top 3 local maxima locations that are within 90% of the maximum cross-correlation value are output.
      In almost all cases, the top (first) value in the list represents the predominant fragment length.
      If you want to keep only the top value, simply run
      sed -r 's/,[^\t]+//g' <outFile> > <newOutFile>
  • COL4: corr_estFragLen: comma-separated strand cross-correlation value(s) in decreasing order (COL3 follows the same order)
  • COL5: phantomPeak: Read length/phantom peak strand shift
  • COL6: corr_phantomPeak: Correlation value at phantom peak
  • COL7: argmin_corr: strand shift at which cross-correlation is lowest
  • COL8: min_corr: minimum value of cross-correlation
  • COL9: Normalized strand cross-correlation coefficient (NSC) = COL4 / COL8
  • COL10: Relative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8)
  • COL11: QualityTag: Quality tag based on thresholded RSC (codes: -2:veryLow,-1:Low,0:Medium,1:High,2:veryHigh)
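
The column definitions above can be checked with a short parser. The sketch below is illustrative only: parse_spp_stats_line is a hypothetical helper, the input line is made-up data, the file is assumed to be tab-separated, and NSC and RSC are recomputed from the first (top) value of COL4 using the formulas listed above.

```python
def parse_spp_stats_line(line):
    """Parse one tab-separated spp_stats.txt row and recompute NSC and RSC."""
    cols = line.rstrip("\n").split("\t")
    corr_frag = float(cols[3].split(",")[0])  # COL4, top value
    corr_phantom = float(cols[5])             # COL6
    min_corr = float(cols[7])                 # COL8
    return {
        "filename": cols[0],                            # COL1
        "num_reads": int(cols[1]),                      # COL2
        "est_frag_len": int(cols[2].split(",")[0]),     # COL3, top value
        "nsc": corr_frag / min_corr,                    # COL4 / COL8
        "rsc": (corr_frag - min_corr) / (corr_phantom - min_corr),
        "quality_tag": int(cols[10]),                   # COL11
    }


# Made-up row for illustration; real values come from run_spp.R.
line = "rep1.tagAlign\t1000000\t200,215\t0.30,0.28\t40\t0.25\t1500\t0.20\t1.5\t2.0\t2"
stats = parse_spp_stats_line(line)
```

Taking the first entry of COL3 and COL4 mirrors the sed command above, which keeps only the top value.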

idr/

Files generated during IDR threshold calculations. Typically not useful for the average user. Results described at https://sites.google.com/site/anshulkundaje/projects/idr

RepN/

Directory containing the raw peak calling results from the specified peak caller. These are results produced prior to IDR thresholding so they may contain a lot of noise. Signal maps are also included, although the format of each may change depending on which peak caller was chosen.

RepN_PR1/ and RepN_PR2/

Directories containing the raw peak calling results for the pseudoreplicates. These are used during the IDR calculations and should be ignored by most users. Provided for debugging purposes only.

Future Directions / Wish List

  • Proper deletions of temporary files
  • Better recovery from failed control runs (e.g., updating the MySQL control locking table)