Home
The ChIPSeqPipeline (https://github.com/StanfordBioInformatics/Scoring/wiki/Pipeline-Overview) was created to provide a standardized way to analyze ChIPSeq experiments in a high-throughput environment. It was developed as part of the ENCODE project and follows the standards for that project. A fuller discussion of the ENCODE standards for ChIPSeq can be found in the paper ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. S. Landt, G. Marinov, A. Kundaje et al. (2012). Genome Research 22: 1813-31.
The pipeline was designed to run on an HPC cluster using Sun Grid Engine. Other setups may work, but will require some interpretation of the created job files.
Most of the external software dependencies are related to the actual peak callers. The pipeline supports SPP, MACS1.4, MACS2 and PeakSeq out of the box. SPP is recommended for the best IDR results, but both versions of MACS should work. PeakSeq IDR support is weak because PeakSeq does not report enough noise, so it is included mainly for legacy support. If you use PeakSeq, we highly recommend taking the unfiltered PeakSeq results and applying external filtering (e.g. a q-value cutoff).
MACS1.4 can be downloaded at http://liulab.dfci.harvard.edu/MACS/Download
MACS2 is still under development and is not in a stable state, but can be downloaded from the MACS GitHub: https://github.com/taoliu/MACS/downloads
We use a modified version of the official 1.10 SPP package found at http://code.google.com/p/phantompeakqualtools/. This package is required for cross-correlation analysis. (The pipeline will fail without this version of SPP installed.)
IDR can be downloaded at https://sites.google.com/site/anshulkundaje/projects/idr
sjm (Simple Job Manager) is required to submit jobs to a Sun Grid Engine cluster. If SGE isn't installed, a jobs file can still be created describing the commands that need to be run and their dependencies, but an external script will be required to interpret it.
sjm can be downloaded at http://sourceforge.net/projects/hpcsjm/ (Tested on version 1.2.0)
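When SGE is not available, the external script mentioned above mainly has to run the commands in an order that respects their dependencies. The following is a minimal sketch of that idea only: it does not parse the real sjm jobs file, and the job names, commands and the `after` field are hypothetical stand-ins. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: "after" lists the jobs that must finish first.
jobs = {
    "preprocess_control": {"cmd": "echo preprocess control", "after": []},
    "call_peaks_rep1": {"cmd": "echo call peaks rep1", "after": ["preprocess_control"]},
    "call_peaks_rep2": {"cmd": "echo call peaks rep2", "after": ["preprocess_control"]},
    "run_idr": {"cmd": "echo run idr", "after": ["call_peaks_rep1", "call_peaks_rep2"]},
}


def execution_order(jobs):
    """Return job names in an order that satisfies every dependency."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(job["after"]) for name, job in jobs.items()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(jobs)
for name in order:
    # A real interpreter would execute jobs[name]["cmd"] here.
    print(name, "->", jobs[name]["cmd"])
```

A serial runner like this gives up the parallelism that sjm and SGE provide, so it is only suitable for small test runs.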
The statistical programming package R is required (it is a dependency of SPP and IDR). We developed using R 2.15.1. R is included in most standard distributions, but can also be downloaded at http://www.r-project.org
The MySQLdb libraries must be installed in order to use the control locking database. On scg3 they are installed in the standard python/2.7 (use modules add python/2.7). The libraries can be downloaded at http://mysql-python.sourceforge.net
samtools can be downloaded at http://samtools.sourceforge.net
All the global variables are stored in a globals.conf configuration file. The default globals.conf is located in the script directory (currently /srv/gs1/projects/scg/Scoring/pipeline2/). Individual runs can override it by placing a custom globals.conf file in the working directory of the run.
- BIN_DIR -- Location of pipeline scripts.
- R_BINARY -- Location of Rscript binary
- SPP_BINARY -- Command to call SPP
- SPP_BINARY_NO_DUPS -- Command to call special version of SPP designed for runs where duplicated reads have been filtered out
- MACS_BINARY -- Binary of MACS1.4
- MACS_LIBRARY -- Location of MACS python library file
- MACS2_BINARY -- Binary of MACS2
- MACS2_LIBRARY -- Location of MACS2 python library files
- PEAKSEQ_BINARY -- Location of PeakSeq binary
- PEAKSEQ_BIN_SIZE -- Number of bins for PeakSeq to use when scoring. 10000 is the standard for human.
- ARCHIVE_DIR -- File system location in which to store archived results
- DOWNLOAD_BASE -- URL prefix for downloading archived results
- SJM_NOTIFY -- Python-style list of email addresses to use for SJM's notifications. (When runs succeed and fail)
- QUEUE -- SGE queue to use (optional)
- SGE_PROJECT -- SGE project to use (optional)
- MYSQL_PASSWORD_FILE -- Local file system location of a file which contains the MySQL password for the control locking database
- CONTROL_DB_HOST -- Control locking MySQL host
- CONTROL_DB_USER -- Control locking MySQL user
- CONTROL_DB -- Control locking MySQL database name
- CONTROL_DB_PORT -- Control locking MySQL database port
- TMP_DIR -- Temp directory to store intermediate mapped read files
- SAMTOOLS_BINARY -- Binary of samtools
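The override rule for globals.conf and the Python-style list used by SJM_NOTIFY can be sketched as follows. This is an illustration, not the pipeline's own loader: the function names are hypothetical, and nothing is assumed here about the file's internal syntax beyond SJM_NOTIFY holding a Python-style list.

```python
import ast
import os


def find_globals_conf(working_dir, script_dir):
    """Return the globals.conf that applies to a run.

    A copy in the run's working directory takes precedence over the
    default copy in the script directory.
    """
    for directory in (working_dir, script_dir):
        candidate = os.path.join(directory, "globals.conf")
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(
        "no globals.conf in %s or %s" % (working_dir, script_dir)
    )


def parse_notify_list(raw_value):
    """Parse a Python-style list of email addresses (as used by SJM_NOTIFY)."""
    # literal_eval only accepts literals, so arbitrary code is not executed.
    addresses = ast.literal_eval(raw_value)
    if not isinstance(addresses, list):
        raise ValueError("expected a Python-style list of addresses")
    return addresses
```

For example, `parse_notify_list("['a@example.org', 'b@example.org']")` yields a two-element list of strings.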
The ChIP-seq pipeline requires a small database for keeping track of control data sets that are currently being processed. The database allows a scoring job to determine if it is the first job to use a particular control, in which case it must be preprocessed, and it prevents subsequent jobs from attempting to preprocess the same control at the same time.
This system is optional and can be disabled by setting the USE_CONTROL_LOCK global variables in the peakcaller modules to False.
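The locking behaviour described above amounts to "the first job to register a control preprocesses it; later jobs do not". The sketch below illustrates that idea with an in-memory SQLite database standing in for the MySQL control locking database. The table name, columns and function name are assumptions made for the example and are not the pipeline's actual schema.

```python
import sqlite3


def claim_control(conn, control_name):
    """Return True if this caller is the first to claim the control.

    The PRIMARY KEY constraint makes the insert fail for every later
    caller, so only one job ends up preprocessing a given control.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO control_lock (name, status) VALUES (?, 'running')",
                (control_name,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Stand-in for the MySQL database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE control_lock (name TEXT PRIMARY KEY, status TEXT)")

first = claim_control(conn, "HumanControl")   # this job must preprocess
second = claim_control(conn, "HumanControl")  # this job must not
```

Relying on a uniqueness constraint rather than a "check, then insert" sequence avoids the race where two jobs both see the control as unclaimed.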
Additional genomes are configured in the chr_maps.py file. The following steps are required to add a valid genome to the pipeline.
- Add a mapping from genome chromosome FASTA file to chromosome name in chr_maps.py. This is necessary because eland displays the results by the source chromosome reference file.
- Add chr mapping to the genomes map in chr_maps.py
- Create an IDR binary directory for the genome. The directory should contain all the IDR binaries along with a genome_table.txt tailored for the genome (see the IDR documentation). The directory path should be added to the IDR_BIN_DIR mapping. This duplication is necessary because of a limitation of IDR. Manually edit batch-consistency-analysis.r to change the hard-coded source (line 55) and chr.file (line 58) paths.
- Select the IDR filtering thresholds in the IDR_THRESHOLDS mapping in chr_maps.py
- Specify the whole genome size in macs_genome_size to be used as a parameter for MACS. The default MACS parameters should already be set.
- If using PeakSeq, set the location of the pre-generated mappability file in peakseq_mappability_file
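Because a new genome touches several mappings, it can help to check that none was forgotten. The sketch below assumes, purely for illustration, that each mapping named in the steps above is a dictionary keyed by genome name; the real structures in chr_maps.py may differ, and every value shown is a placeholder.

```python
# Hypothetical stand-ins for the mappings in chr_maps.py.
# All values are placeholders, not real settings.
example_mappings = {
    "genomes": {"hg19_male": "<chr mapping>"},
    "IDR_BIN_DIR": {"hg19_male": "<path to IDR binary directory>"},
    "IDR_THRESHOLDS": {"hg19_male": "<thresholds>"},
    "macs_genome_size": {"hg19_male": "<genome size>"},
    "peakseq_mappability_file": {},
}


def missing_entries(genome, mappings, use_peakseq=False):
    """Return the names of the mappings that lack an entry for the genome."""
    required = ["genomes", "IDR_BIN_DIR", "IDR_THRESHOLDS", "macs_genome_size"]
    if use_peakseq:
        # Only PeakSeq needs the pre-generated mappability file.
        required.append("peakseq_mappability_file")
    return [name for name in required if genome not in mappings.get(name, {})]


problems = missing_entries("hg19_male", example_mappings, use_peakseq=True)
```

Here `problems` lists only `peakseq_mappability_file`, since that is the one mapping left empty in the placeholder data.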
Running the pipeline outside of SNAP requires two basic steps: creating the configuration files and setting options for the pipeline.py script.
The locations of the input and output files are contained within two configuration files passed into the pipeline.py script. If control locking is set up, the expectation is that a single control config file will be frequently reused and the processing steps of the control will only occur once.
Fields:
- control_mapped_reads: Comma-delimited list of locations of alignment files. (eland*, bam or sam)
- results_dir: Output directory for results
- temporary_dir: Directory to place temporary files created during processing
- run_name: Unique identifier for the control
- genome: The genome the reads were mapped to. A list of valid genomes can be found in the chr_maps.py script.
[peakseq]
control_mapped_reads = /path/to/control1.bam,/path/to/control2.bam
results_dir = /scoring/results/HumanControl
temporary_dir = /srv/gs1/projects/scg/Scoring/tmp
run_name = HumanControl
genome = hg19_male
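The control configuration uses an INI-style layout, so it can be inspected with Python's standard `configparser`. This is a sketch for checking a config file by hand and is not taken from pipeline.py; the function name is hypothetical.

```python
import configparser

CONTROL_CONFIG = """\
[peakseq]
control_mapped_reads = /path/to/control1.bam,/path/to/control2.bam
results_dir = /scoring/results/HumanControl
temporary_dir = /srv/gs1/projects/scg/Scoring/tmp
run_name = HumanControl
genome = hg19_male
"""


def read_control_config(text, section="peakseq"):
    """Parse a control config and split the comma-delimited read files."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    settings = dict(parser[section])
    settings["control_mapped_reads"] = [
        path.strip()
        for path in settings["control_mapped_reads"].split(",")
        if path.strip()
    ]
    return settings


control = read_control_config(CONTROL_CONFIG)
```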
Fields:
- run_name: Unique identifier for the sample
- results_dir: Output directory for results. Must not already exist.
- genome: The genome the reads were mapped to. A list of valid genomes can be found in the chr_maps.py script.
- temporary_dir: Directory to place temporary files created during processing
- mapped_reads (per replicate): Comma-delimited list of locations of alignment files. (eland*, bam or sam)
[replicate1]
mapped_reads = /path/to/rep1.bam
[replicate2]
mapped_reads = /path/to/rep2a.bam,/path/to/rep2b.bam
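Each replicate gets its own section, and a replicate may list several alignment files. A minimal sketch of collecting the files per replicate is shown below. It only handles the replicate sections from the example above, and selecting sections by the `replicate` prefix is an assumption made for illustration.

```python
import configparser

SAMPLE_CONFIG = """\
[replicate1]
mapped_reads = /path/to/rep1.bam
[replicate2]
mapped_reads = /path/to/rep2a.bam,/path/to/rep2b.bam
"""


def read_replicates(text):
    """Return {section name: [alignment files]} for every replicate section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    replicates = {}
    for section in parser.sections():
        if not section.startswith("replicate"):
            continue  # skip any non-replicate sections
        raw = parser[section]["mapped_reads"]
        replicates[section] = [p.strip() for p in raw.split(",") if p.strip()]
    return replicates


replicates = read_replicates(SAMPLE_CONFIG)
```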
Runs PeakSeq scoring pipeline for ChipSeq data.

Usage: pipeline.py [-f] [-p] [-h] [-s] [-a] [-m <email address>] [-l <directory>] [-n <run_name>] [-c <peakcaller>] <control_config_file> [<sample_config_file>]

Arguments:
- -c, --peakcaller <peakcaller>: specify the peakcaller to be used. Current options are peakseq, macs, macs2, spp. Defaults to macs2.
- -a, --no_archive: does not archive the control and sample results.
- -f, --force: forces running of pipeline, even if results already exist
- -p, --print: prints the job commands, but does not dispatch them to the cluster
- -d, --no_duplicates: runs cross correlation analysis assuming duplicated reads have already been filtered out of the mapped reads. Uncommon, so defaults to false.
- -h, --help: displays this usage information and exits
- -l <directory>, --log <directory>: log directory, current working directory if not specified
- -n <run_name>, --name <run_name>: name for the pipeline run
- -m <email_address>, --mail <email_address>: email address to send summary and result location
- -s, --snap: make a call to the SNAP LIMS after completion
- --filtchr <chromosome>: SPP option to ignore a chromosome during analysis. Used to fix bug that chrs with low read counts causes SPP to fail.
- --rmdups: Filter out all duplicate reads in sample read files before peakcalling. Use when PCR amplification errors are present. (i.e., PBC value is low)
- <control_config_file>: (required) configuration file for the experiment's control
- <sample_config_file>: configuration file for the sample replicates in the experiment. Optional, but in most cases this is specified.
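When launching runs from another script, the command line can be assembled as an argument list. The sketch below uses only options from the usage text above; the helper name and the config file names in the example are hypothetical. It defaults to `-p` so that job commands are printed rather than dispatched.

```python
VALID_PEAKCALLERS = ("peakseq", "macs", "macs2", "spp")


def build_pipeline_command(control_config, sample_config=None,
                           peakcaller="macs2", dry_run=True, no_archive=False):
    """Return an argument list for pipeline.py, suitable for subprocess.run."""
    if peakcaller not in VALID_PEAKCALLERS:
        raise ValueError("unknown peakcaller: %s" % peakcaller)
    command = ["pipeline.py", "-c", peakcaller]
    if dry_run:
        command.append("-p")  # print job commands, do not dispatch
    if no_archive:
        command.append("-a")  # skip archiving of control and sample results
    command.append(control_config)
    if sample_config is not None:
        command.append(sample_config)
    return command


# Hypothetical config file names.
command = build_pipeline_command("control.conf", "sample.conf", peakcaller="spp")
```

Passing the list to `subprocess.run(command, check=True)` avoids shell quoting problems with paths that contain spaces.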
Parameters given to external dependencies. Provided for reference only. Most options will not need to be changed.
macs14 -t [sample_eland_file] -c [control_eland_file] -n [name] -g [genome_size] -w -p 1e-2 --nomodel --shiftsize=[frag_size]
- sample and control eland files are pre-filtered for uniquely mapping reads with no more than 2 mismatches
- frag_size is determined from the cross-correlation analysis (take frag_length / 2 from that)
macs2 callpeak -t [sample_eland_file] -c [control_eland_file] -f ELAND -n [name] -g [genome_size] -B -p 0.1 --to-large --nomodel --shiftsize=[frag_size]
- sample and control eland files are pre-filtered for uniquely mapping reads with no more than 2 mismatches
- frag_size is determined from the cross-correlation analysis (take frag_length / 2 from that)
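The `--shiftsize` value for both MACS versions is half of the fragment length estimated by cross-correlation. A small sketch of that arithmetic follows, assuming the input is the comma-separated estFragLen field of the cross-correlation output and that the first value is the one to use. Rounding down for odd lengths is an assumption, not something the pipeline documents.

```python
def shift_size(est_frag_len_field):
    """Return frag_length / 2 for the top cross-correlation peak.

    est_frag_len_field is the comma-separated estFragLen column; the
    first entry is the predominant fragment length in almost all cases.
    """
    top_fragment_length = int(est_frag_len_field.split(",")[0])
    return top_fragment_length // 2  # assumption: round down


shiftsize_arg = "--shiftsize=%d" % shift_size("210,195,300")
```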
Rscript run_spp.R -rf -c=[sample_tagAlign_file] -i=[control_tagAlign_file] -npeak=300000 -odir=[results_directory] -savr -savp -x=-500:50 -out=[output_statistics_file]
- Optionally, can specify -filtchr flag to filter out reads from specified chromosome
- sample and control tagAlign files are pre-filtered for uniquely mapping reads with no more than 2 mismatches
Peak-Seq_v1.02 [sample_eland_file] [control_eland_file] [output_sgr_file] [output_hits_file] [bin_size] [mappability_file]
- bin_size defaults to 10000
- mappability_file is a precomputed file describing how uniquely mappable each region of the genome is
Rscript batch-consistency-analysis.r [narrowPeak_rep_a] [narrowPeak_rep_b] -1 [output_file] 0 F [ranking_measure]
- ranking_measure is specific to each peakcaller:
- PeakSeq = q.value
- SPP = signal.value
- MACS = p.value
- MACS2 = p.value
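The per-peakcaller ranking measure can be kept in a lookup table and substituted into the IDR command shown above. This sketch only assembles the argument list; the function name and the file names in the example are hypothetical.

```python
RANKING_MEASURE = {
    "peakseq": "q.value",
    "spp": "signal.value",
    "macs": "p.value",
    "macs2": "p.value",
}


def idr_command(peakcaller, rep_a, rep_b, output_file):
    """Build the batch-consistency-analysis.r call for one replicate pair."""
    return [
        "Rscript", "batch-consistency-analysis.r",
        rep_a, rep_b,
        "-1", output_file, "0", "F",
        RANKING_MEASURE[peakcaller],
    ]


# Hypothetical file names.
command = idr_command("spp", "rep1.narrowPeak", "rep2.narrowPeak", "rep1_vs_rep2")
```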
Rscript run_spp.R -rf -c=[sample_tagAlign_file] -savp -x=-50:40 -out=[output_file]
- Optionally, can specify -filtchr flag to filter out reads from specified chromosome
- sample tagAlign file is pre-filtered for uniquely mapping reads with no more than 2 mismatches
All of the scoring results will be put into the specified results directory. That directory will also be archived and the archive deposited in the specified global archive directory.
Conservatively IDR thresholded list of peaks in narrowPeak format.
IDR thresholded list of peaks in narrowPeak format.
Human-readable report of peak calling statistics. Used as body for email notification.
Parsable report of peak calling statistics. Fields:
- sample_tar_complete: Location of archived sample results
- control_tar_complete: Location of archived control results
- num_reads=Repi: Number of reads for Rep i
- read_files=Repi: Original aligned read file(s) used for Rep i
- [deprecated] total_hits1=RepN_VS_RepM=q_value: Number of hits from RepN found in RepM above the specified q value
- [deprecated] total_hits2=RepN_VS_RepM=q_value: Number of hits from RepM found in RepN above the specified q value
- [deprecated] rep_overlap=RepN_VS_RepM: Percentage of hits from RepN found in RepM
File containing the number of hits passing the IDR threshold for each of the IDR tests. IDR tests include:
- Rep vs Rep (Nt)
- Self-consistency Reps (PR1_VS_PR2) (Ns)
- Pooled Self-consistency (RepAll_PR1_VS_PR2) (Np)
Per-replicate statistics on the PCR Bottlenecking Coefficient (a measure of library complexity). Columns are:
- Rep Name
- Genomic Locations with exactly one read
- Total mapped genomic locations
- PBC value (percent of genomic locations mapped exactly once)
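The three numeric columns follow directly from counting reads per genomic location. The sketch below assumes a location is identified by chromosome, start and strand, which is a choice made for the example, and it returns the PBC as a fraction rather than a percentage. The read positions are made up.

```python
from collections import Counter


def pbc_statistics(read_positions):
    """Return (locations with exactly one read, total locations, PBC)."""
    reads_per_location = Counter(read_positions)
    total_locations = len(reads_per_location)
    if total_locations == 0:
        raise ValueError("no mapped reads")
    single_read_locations = sum(
        1 for count in reads_per_location.values() if count == 1
    )
    return (
        single_read_locations,
        total_locations,
        single_read_locations / total_locations,
    )


# Made-up reads: one location is hit twice, two locations once each.
reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 250, "-"), ("chr2", 40, "+")]
single, total, pbc = pbc_statistics(reads)
```

A low value means many locations are covered by duplicate reads, which is the situation the `--rmdups` option is intended for.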
Cross-correlation statistics.
- COL1: Filename: tagAlign/BAM filename
- COL2: numReads: effective sequencing depth i.e. total number of mapped reads in input file
- COL3: estFragLen: comma separated strand cross-correlation peak(s) in decreasing order of correlation.
The top 3 local maxima locations that are within 90% of the maximum cross-correlation value are output. In almost all cases, the top (first) value in the list represents the predominant fragment length. If you want to keep only the top value simply run sed -r 's/,[^\t]+//g' <outFile> > <newOutFile>
- COL4: corr_estFragLen: comma separated strand cross-correlation value(s) in decreasing order (COL3 follows the same order)
- COL5: phantomPeak: Read length/phantom peak strand shift
- COL6: corr_phantomPeak: Correlation value at phantom peak
- COL7: argmin_corr: strand shift at which cross-correlation is lowest
- COL8: min_corr: minimum value of cross-correlation
- COL9: Normalized strand cross-correlation coefficient (NSC) = COL4 / COL8
- COL10: Relative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8)
- COL11: QualityTag: Quality tag based on thresholded RSC (codes: -2:veryLow,-1:Low,0:Medium,1:High,2:veryHigh)
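The NSC and RSC formulas can be checked against the raw columns of a result line. The sketch below assumes a tab-separated line with the eleven columns listed above and uses the first (top) value of COL4 in both formulas. The sample line contains invented numbers for illustration only.

```python
def parse_cross_correlation(line):
    """Parse one cross-correlation result line and recompute NSC and RSC."""
    cols = line.rstrip("\n").split("\t")
    frag_lengths = [int(v) for v in cols[2].split(",")]          # COL3
    frag_correlations = [float(v) for v in cols[3].split(",")]   # COL4
    phantom_corr = float(cols[5])                                # COL6
    min_corr = float(cols[7])                                    # COL8
    top_corr = frag_correlations[0]
    return {
        "filename": cols[0],
        "num_reads": int(cols[1]),
        "est_frag_len": frag_lengths[0],
        "nsc": top_corr / min_corr,                                # COL4 / COL8
        "rsc": (top_corr - min_corr) / (phantom_corr - min_corr),  # (COL4 - COL8) / (COL6 - COL8)
        "quality_tag": int(cols[10]),                              # COL11
    }


# Invented values, not real pipeline output.
sample_line = "sample.tagAlign\t1000000\t200,180\t0.30,0.28\t40\t0.25\t1500\t0.20\t1.5\t2.0\t1\n"
stats = parse_cross_correlation(sample_line)
```

Taking only the first entry of COL3 and COL4 has the same effect as the `sed` command above, which drops everything after the top value.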
Files generated during IDR threshold calculations. Typically not useful for the average user. Results described at https://sites.google.com/site/anshulkundaje/projects/idr
Directory containing the raw peak calling results from the specified peak caller. These are results produced prior to IDR thresholding so they may contain a lot of noise. Signal maps are also included, although the format of each may change depending on which peak caller was chosen.
Directory containing the raw peak calling results for the pseudoreplicates. These are used during the IDR calculations and should be ignored by most users. Provided for debugging purposes only.
- Proper deletions of temporary files
- Better recovery from failed control runs (i.e. update MySQL control locking table)