From 65a7b7519cd0aa53cd1057955164d1fce3e15e46 Mon Sep 17 00:00:00 2001
From: Ning-Yi SHAO
Date: Fri, 5 Sep 2014 11:19:32 -0700
Subject: [PATCH 1/2] SGE support is added.

---
 README.md                            |  67 ++++----
 project/script-sge/config.yaml       |  42 +++++
 project/script-sge/pipeline.py       | 227 +++++++++++++++++++++++++++
 project/script-sge/results_parser.py | 221 ++++++++++++++++++++++++++
 4 files changed, 526 insertions(+), 31 deletions(-)
 create mode 100644 project/script-sge/config.yaml
 create mode 100644 project/script-sge/pipeline.py
 create mode 100644 project/script-sge/results_parser.py

diff --git a/README.md b/README.md
index 5e669dc..a50dbaa 100644
--- a/README.md
+++ b/README.md
@@ -1,15 +1,16 @@
-# Pipeline for ChIP-seq preprocessing
+Pipeline for ChIP-seq preprocessing
+===================================
 
 ### Overview
 
 Here is the pipeline I used for ChIP-seq preprocessing, including:
 
-* align the fastq data to reference genome by bowtie or bowtie2.
-* run FastQC to check the sequencing quality.
-* remove all reads duplications of the aligned data.
-* generate TDF files for browsing in IGV.
-* run PhantomPeak to check the quality of ChIP.
-* run ngs.plot to investigate the enrichment of ChIP-seq data at TSS, TES, and genebody.
+- align the fastq data to the reference genome with bowtie or bowtie2.
+- run FastQC to check the sequencing quality.
+- remove all read duplicates from the aligned data.
+- generate TDF files for browsing in IGV.
+- run PhantomPeak to check the quality of the ChIP.
+- run ngs.plot to investigate the enrichment of ChIP-seq data at the TSS, TES, and gene body.
 
 The pipeline work flow is:
 
@@ -19,14 +20,14 @@ The softwares used in this pipeline are:
 
-* [ruffus](https://code.google.com/p/ruffus/)
-* [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml)
-* [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
-* [samtools](http://samtools.sourceforge.net/)
-* [IGVTools](http://www.broadinstitute.org/igv/igvtools)
-* [PhantomPeak](http://code.google.com/p/phantompeakqualtools/) __In fact, the script **run_spp_nodups.R** is from PhantomPeak, but PhantomPeak still need to be installed in R.__
-* [ngs.plot](https://code.google.com/p/ngsplot/)
-* If cluster supporting needed, [drmaa_for_python](https://pypi.python.org/pypi/drmaa) is needed. Now only LSF is supported, but it is easy to modify it to fit your demands.
+- [ruffus](https://code.google.com/p/ruffus/)
+- [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml)
+- [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+- [samtools](http://samtools.sourceforge.net/)
+- [IGVTools](http://www.broadinstitute.org/igv/igvtools)
+- [PhantomPeak](http://code.google.com/p/phantompeakqualtools/) **In fact, the script `run_spp_nodups.R` is from PhantomPeak, but PhantomPeak itself still needs to be installed in R.**
+- [ngs.plot](https://code.google.com/p/ngsplot/)
+- If cluster support is needed, [drmaa_for_python](https://pypi.python.org/pypi/drmaa) is required. Now LSF and SGE are supported, and it is easy to modify the scripts to fit other schedulers.
 
 Install the above software and make sure it is in $PATH.
 
@@ -40,7 +41,7 @@ Put the scripts in ./bin to a place in $PATH or add ./bin to $PATH.
 
 ```bash
 python pipeline.py config.yaml
 ```
 
-Or on an LSF cluster:
+Or on an LSF or SGE cluster:
 
 ```bash
 nohup python pipeline.py config.yaml &
 ```
 
@@ -56,38 +57,42 @@ python results_parser.py config.yaml
 
 For the organization of projects, I generally follow this paper: [A Quick Guide to Organizing Computational Biology Projects](http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424). Because this is preprocessing, and the real analysis will be peak calling, chromatin segmentation, and differential enrichment detection, I just put the results of the preprocessing in the data folder.
 
-For the configuration yaml file, __project_dir: `~/projects/test_ChIP-seq`__ and __data_dir: "data"__ mean the data folder is `~/projects/test_ChIP-seq/data`, and the results will be put in the same folder. Fastq files should be under `~/projects/test_ChIP-seq/data/fastq` folder. Now *.fastq, *.fq, *.gz (compressed fastq) files are acceptable. `aligner` now could be `bowtie` or `bowtie2`, if not assigned, then default aligner is `bowtie`. For `bowtie2`, the system variable `$BOWTIE2_INDEXES` should be set before running.
+For the configuration yaml file, **project_dir: `~/projects/test_ChIP-seq`** and **data_dir: "data"** mean the data folder is `~/projects/test_ChIP-seq/data`, and the results will be put in the same folder. Fastq files should be under the `~/projects/test_ChIP-seq/data/fastq` folder. Currently *.fastq, *.fq, and *.gz (compressed fastq) files are acceptable. `aligner` can be `bowtie` or `bowtie2`; if it is not assigned, the default aligner is `bowtie`. For `bowtie2`, the environment variable `$BOWTIE2_INDEXES` should be set before running.
 
 The position of pipeline.py, results_parser.py, and config.yaml doesn't matter at all. But I prefer to put them under the project/script/preprocess folder.
 
 **Important:**
 
-+ To make ngs.plot part work, please name the fastq files in this way:
+- To make the ngs.plot step work, please name the fastq files in this way:
+
 ```
-Say condition A, B, each with 2 replicates, and one DNA input per condition. 
-Name the files as A_rep1.fastq, A_rep2.fastq, A_input.fastq, B_rep1.fastq, 
+Say conditions A and B, each with 2 replicates, and one DNA input per condition.
+Name the files A_rep1.fastq, A_rep2.fastq, A_input.fastq, B_rep1.fastq,
 B_rep2.fastq, and B_input.fastq. The key point is that samples from the same
 condition share common letters, and that input samples contain the string
 "input" or "Input".
 ```
+
-+ If use want to only run to some specific step, just modify the function name in `pipeline_run` in pipeline.py.
-+ If the data are pair-end, follow this step:
-  + Modify the `config.yaml` file, change "pair_end" to "yes".
-  + Modify the `config.yaml` file, change "input_files" to "\*R1\*.fastq.gz" or "\*R1\*.fastq".
-  + Make sure the fastq files named as "\*R1\*" and "\*R2\*" pattern.
-+ if you want to use cluster:
-  + Edit '~/.bash_profile' to make sure all paths in $PATH.
-  + Modify `config.yaml` to fit your demands.
-  + `multithread` in `pipeline.py` determines the number of concurrent jobs to be submitted to cluster nodes by ruffus. A default value of 10 is used.
+- If you want to run the pipeline only up to some specific step, just modify the target function name in `pipeline_run` in pipeline.py.
+- If the data are paired-end, follow these steps:
+  - Modify the `config.yaml` file, change "pair_end" to "yes".
+  - Modify the `config.yaml` file, change "input_files" to "*R1*.fastq.gz" or "*R1*.fastq".
+  - Make sure the fastq files are named with the "*R1*" and "*R2*" patterns.
+- If you want to use a cluster:
+  - Edit '~/.bash_profile' to make sure all required paths are in $PATH.
+  - Modify `config.yaml` to fit your demands.
+  - `multithread` in `pipeline.py` determines the number of concurrent jobs ruffus submits to cluster nodes; adjust the value in the `pipeline_run` call to match what your cluster allows.
 
 **Warning:**
 
-`Bowtie2` allows multiple hits reads, and breaks the assumption of `phantomPeak`:
+`Bowtie2` keeps reads with multiple hits, which breaks an assumption of `phantomPeak`:
+
 ```
 It is EXTREMELY important to filter out multi-mapping reads from the
 BAM/tagAlign files. Large number of multimapping reads can severly affect
 the phantom peak coefficient and peak calling results.
 ```
+
 So be careful when interpreting `NSC` and `RSC` in `Bowtie2` alignment results.
 
 ### Notes
 
@@ -98,4 +103,4 @@ In Bowtie2, default parameters are used.
 
 ### ToDos
 
-+ Method to skip some steps if the user doesn't run.
+- Method to skip steps that the user doesn't want to run.
diff --git a/project/script-sge/config.yaml b/project/script-sge/config.yaml
new file mode 100644
index 0000000..4e9286f
--- /dev/null
+++ b/project/script-sge/config.yaml
@@ -0,0 +1,42 @@
+project_name: "test_ChIP-seq"
+project_dir: "~/projects/test_ChIP-seq"
+
+## Aligner: bowtie or bowtie2
+aligner: "bowtie"
+pair_end: "yes"
+
+## If using bowtie, assign the bowtie1 index path here, with "Genome"
+## replaced by the genome name, e.g. "hg19" or "mm9".
+bowtie_index: "~/data/bowtie_index/Genome"
+
+## If using bowtie2, uncomment the next line (and comment out the line
+## above), with "Genome" replaced by the bowtie2 index name, e.g. "hg19"
+## or "mm9". $BOWTIE2_INDEXES should be set in the environment variables.
+# bowtie_index: "Genome"
+
+bam_sort_buff: "2G"
+IGV_genome: "hg19"
+# ngsplot_genome: "reference_genome"
+# ngsplot_fraglen: 150
+
+## The folder under the project folder that contains the fastq folder.
+data_dir: "data"
+
+## Match pattern for the input files. It can be:
+## *.fastq, *.fq, *.gz
+## For paired-end data, use "*R1*.fastq.gz" or "*R1*.fastq".
+input_files: "*R1*.fastq.gz"
+
+## cluster settings
+cores: 4 # number of cores to use for multi-threaded programs.
+queue: "queue_name"
+h_vmem: "32G"
+
+## wall_time for every step, hh:mm
+wall_time:
+  alignFastqByBowtie: "23:59"
+  runFastqc: "4:00"
+  rmdupBam: "20:00"
+  genTDF: "20:00"
+  runPhantomPeak: "20:00"
diff --git a/project/script-sge/pipeline.py b/project/script-sge/pipeline.py
new file mode 100644
index 0000000..dd67f94
--- /dev/null
+++ b/project/script-sge/pipeline.py
@@ -0,0 +1,227 @@
+#! /usr/bin/env python
+
+import os
+import sys
+import yaml
+from ruffus import *
+import glob
+import drmaa
+from ruffus.drmaa_wrapper import run_job, error_drmaa_job
+
+my_drmaa_session = drmaa.Session()
+my_drmaa_session.initialize()
+
+def expandOsPath(path):
+    """
+    Expand a path containing '~' or shell variables.
+    Arguments:
+    - `path`: path string
+    """
+    return os.path.expanduser(os.path.expandvars(path))
+
+def genFilesWithPattern(pathList, Pattern):
+    """
+    Build a wildcard file pattern on the fly from path components.
+    Arguments:
+    - `pathList`: the path components of the files
+    - `Pattern`: pattern like config["input_files"]
+    """
+    pathList.append(Pattern)
+    Files = expandOsPath(os.path.join(*pathList))
+    return Files
+
+def cluster_options(config, task_name, cores, logfile):
+    """
+    Generate a string of cluster options to feed an SGE job.
+    Arguments:
+    - `config`: configuration as an associative array from the YAML file.
+    - `task_name`: the specific task name, such as runPhantomPeak.
+    - `cores`: number of cores to use for this task.
+    - `logfile`: log file name.
+    """
+    ## These are the parameters for SGE.
+    str_options = "-cwd -V -pe shm %d -q %s -j y -o %s" % \
+        (cores, config["queue"], logfile)
+    if "h_vmem" in config:
+        str_options = str_options + " -l h_vmem=%s" % (config["h_vmem"])
+    ## Request the wall time (h_rt) configured for this task; the config
+    ## values are "hh:mm", while SGE expects hh:mm:ss.
+    if "wall_time" in config and task_name in config["wall_time"]:
+        str_options = str_options + " -l h_rt=%s:00" % (config["wall_time"][task_name])
+    return str_options
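+
+## Illustrative example (not part of the pipeline): with cores=4,
+## queue="queue_name", and h_vmem="32G" as in the sample config.yaml,
+## cluster_options(config, "runFastqc", 4, "a.bam.fastqc.log") returns:
+##   -cwd -V -pe shm 4 -q queue_name -j y -o a.bam.fastqc.log -l h_vmem=32G -l h_rt=4:00:00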
+
+
+config_name = sys.argv[1]
+config_f = open(config_name, "r")
+config = yaml.load(config_f)
+config_f.close()
+inputfiles = expandOsPath(os.path.join(config["project_dir"], config["data_dir"], "fastq", config["input_files"]))
+FqFiles = [x for x in glob.glob(inputfiles)]
+fq_name, fq_ext = os.path.splitext(config["input_files"])
+fq_ext_suffix = ".alignment.log"
+Bam_path = expandOsPath(os.path.join(config["project_dir"], config["data_dir"])) + "/"
+FastQC_path = expandOsPath(os.path.join(config["project_dir"], config["data_dir"], "FastQC"))
+rmdup_path = expandOsPath(os.path.join(config["project_dir"], config["data_dir"], "rmdup"))
+
+script_path = os.path.dirname(os.path.realpath(__file__))
+
+@transform(FqFiles, formatter(fq_ext), os.path.join(Bam_path, "{basename[0]}.bam"), config)
+def alignFastqByBowtie(FqFileName, OutputBamFileName, config):
+    """
+    Align '.fastq' files to the genome.
+    Arguments:
+    - `FqFileName`: fastq file to be processed
+    - `OutputBamFileName`: name of the output bam file
+    - `config`: configuration dictionary
+    """
+    if "aligner" in config:
+        if config["aligner"] == "bowtie":
+            cmds = ['fastq2bam_by_bowtie.sh']
+            cmds.append(FqFileName)
+            cmds.append(expandOsPath(config['bowtie_index']))
+        elif config["aligner"] == "bowtie2":
+            cmds = ['fastq2bam_by_bowtie2.sh']
+            cmds.append(FqFileName)
+            cmds.append(config['bowtie_index'])
+        else:
+            raise KeyError("Unsupported aligner: %s" % config["aligner"])
+    else:
+        cmds = ['fastq2bam_by_bowtie.sh']
+        cmds.append(FqFileName)
+        cmds.append(expandOsPath(config['bowtie_index']))
+
+    target = expandOsPath(os.path.join(config["project_dir"], config["data_dir"]))
+    cmds.append(target)
+    cmds.append(config["pair_end"])
+    cores = int(config['cores'])
+    if cores == 0:
+        cores = 1
+    cmds.append(str(cores))
+    logfile = FqFileName + ".alignment.log"
+
+    run_job(" ".join(cmds),
+        job_name = "alignFastqByBowtie_" + os.path.basename(FqFileName),
+        job_other_options = cluster_options(config, "alignFastqByBowtie", cores, logfile),
+        job_script_directory = os.path.dirname(os.path.realpath(__file__)),
+        job_environment={ 'BASH_ENV' : '~/.bashrc' },
+        retain_job_scripts = True, drmaa_session=my_drmaa_session)
+
+    return 0
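+
+## Illustrative example (made-up file name): with the sample config.yaml,
+## the fastq file A_rep1_R1.fastq.gz would be submitted roughly as:
+##   fastq2bam_by_bowtie.sh .../fastq/A_rep1_R1.fastq.gz ~/data/bowtie_index/Genome \
+##       ~/projects/test_ChIP-seq/data yes 4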
+
+@follows(alignFastqByBowtie, mkdir(FastQC_path))
+@transform(alignFastqByBowtie, suffix(".bam"), ".bam.fastqc.log", config)
+def runFastqc(BamFileName, fastqcLog, config):
+    """
+    Run FastQC on an aligned bam file.
+    Arguments:
+    - `BamFileName`: bam file
+    - `config`: config
+    """
+    cmds = ['fastqc']
+    cmds.append("-o")
+    cmds.append(expandOsPath(os.path.join(config["project_dir"], config["data_dir"], "FastQC")))
+    cores = int(config['cores'])
+    if cores == 0:
+        cores = 1
+    cmds.append("-t")
+    cmds.append(str(cores))
+    cmds.append(BamFileName)
+    logfile = BamFileName + ".fastqc.log"
+
+    run_job(" ".join(cmds),
+        job_name = "fastqc_" + os.path.basename(BamFileName),
+        job_other_options = cluster_options(config, "runFastqc", cores, logfile),
+        job_script_directory = os.path.dirname(os.path.realpath(__file__)),
+        job_environment={ 'BASH_ENV' : '~/.bashrc' },
+        retain_job_scripts = True, drmaa_session=my_drmaa_session)
+
+    return 0
+
+@follows(runFastqc, mkdir(rmdup_path))
+@transform(alignFastqByBowtie, formatter(".bam"), os.path.join(rmdup_path, "{basename[0]}_rmdup.bam"), config)
+def rmdupBam(BamFileName, rmdupFile, config):
+    """
+    Remove duplicate reads from a bam file.
+    Arguments:
+    - `BamFileName`: bam file
+    - `config`: config
+    """
+    if config["pair_end"] == "no":
+        cmds = ['rmdup.bam.sh']
+    else:
+        cmds = ['rmdup_PE.bam.sh']
+    cmds.append(BamFileName)
+    cmds.append(rmdup_path)
+    #if "bam_sort_buff" in config:
+    #    cmds.append(config["bam_sort_buff"])
+    logfile = BamFileName + ".rmdup.log"
+
+    cores = 1
+
+    run_job(" ".join(cmds),
+        job_name = "rmdup_" + os.path.basename(BamFileName),
+        job_other_options = cluster_options(config, "rmdupBam", cores, logfile),
+        job_script_directory = os.path.dirname(os.path.realpath(__file__)),
+        job_environment={ 'BASH_ENV' : '~/.bashrc' },
+        retain_job_scripts = True, drmaa_session=my_drmaa_session)
+
+    return 0
+
+@follows(rmdupBam, mkdir(expandOsPath(os.path.join(rmdup_path, "tdf"))))
+@transform(rmdupBam, suffix(".bam"), ".bam.tdf.log", config)
+def genTDF(BamFileName, tdfLog, config):
+    """
+    Generate TDF files for IGV.
+    Arguments:
+    - `BamFileName`: bam file
+    - `config`: config
+    """
+    cmds = ['igvtools']
+    cmds.append("count")
+    cmds.append(BamFileName)
+    TDFPath = expandOsPath(os.path.join(rmdup_path, "tdf"))
+    baseName = os.path.basename(BamFileName)
+    cmds.append(os.path.join(TDFPath, baseName.replace(".bam", ".tdf")))
+    cmds.append(config["IGV_genome"])
+    logfile = BamFileName + ".tdf.log"
+
+    cores = 1
+
+    run_job(" ".join(cmds),
+        job_name = "genTDF_" + os.path.basename(BamFileName),
+        job_other_options = cluster_options(config, "genTDF", cores, logfile),
+        job_script_directory = os.path.dirname(os.path.realpath(__file__)),
+        job_environment={ 'BASH_ENV' : '~/.bashrc' },
+        retain_job_scripts = True, drmaa_session=my_drmaa_session)
+
+    return 0
+
+@follows(genTDF)
+@transform(rmdupBam, suffix(".bam"), ".bam.phantomPeak.log", config)
+def runPhantomPeak(BamFileName, Log, config):
+    """
+    Check the data quality with phantomPeak.
+    Arguments:
+    - `BamFileName`: bam file
+    - `config`: config
+    """
+    cmds = ['runPhantomPeak.sh']
+    cmds.append(BamFileName)
+    cores = int(config['cores'])
+    if cores == 0:
+        cores = 1
+    cmds.append(str(cores))
+    logfile = BamFileName + ".phantomPeak.log"
+
+    run_job(" ".join(cmds),
+        job_name = "runPhantomPeak_" + os.path.basename(BamFileName),
+        job_other_options = cluster_options(config, "runPhantomPeak", cores, logfile),
+        job_script_directory = os.path.dirname(os.path.realpath(__file__)),
+        job_environment={ 'BASH_ENV' : '~/.bashrc' },
+        retain_job_scripts = True, drmaa_session=my_drmaa_session)
+
+    return 0
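+
+## To run only up to an earlier step (illustrative), pass that task to
+## pipeline_run below instead, e.g. pipeline_run([genTDF], multithread=10).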
+
+if __name__ == '__main__':
+    ## Run the pipeline up to the runPhantomPeak step.
+    ## Adjust multithread to match the number of concurrent jobs your
+    ## cluster allows.
+    pipeline_run([runPhantomPeak], multithread=200)
+
+    my_drmaa_session.exit()
diff --git a/project/script-sge/results_parser.py b/project/script-sge/results_parser.py
new file mode 100644
index 0000000..0111507
--- /dev/null
+++ b/project/script-sge/results_parser.py
@@ -0,0 +1,221 @@
+#! /usr/bin/env python
+
+import sys
+import os
+import glob
+import re
+import yaml
+from collections import namedtuple
+
+def expandOsPath(path):
+    """
+    Expand a path containing '~' or shell variables.
+    Arguments:
+    - `path`: path string
+    """
+    return os.path.expanduser(os.path.expandvars(path))
+
+def genFilesWithPattern(pathList, Pattern):
+    """
+    Generate a file list on the fly by globbing a wildcard pattern.
+    Arguments:
+    - `pathList`: the path components of the files
+    - `Pattern`: pattern like config["input_files"]
+    """
+    pathList.append(Pattern)
+    Files = glob.glob(expandOsPath(os.path.join(*pathList)))
+    return Files
+
+def parse_bowtie1_log(s):
+    total_pattern = re.compile(r"""\#\s+reads\s+processed:\s(?P<total_reads>.+)\s*""",
+        re.VERBOSE)
+    unique_mapped_pattern = re.compile(r"""\#\s+reads\s+with\s+at\s+least\s+one\s+reported\s+alignment:\s+(?P<unique_mapped_reads>\S+)\s+\(\S+\)""",
+        re.VERBOSE)
+    multiple_mapped_pattern = re.compile(r"""\#\s+reads\s+with\s+alignments\s+suppressed\s+due\s+to\s+-m:\s+(?P<multiple_mapped_reads>\d+)\s+\(\S+\)""",
+        re.VERBOSE)
+    for line in s:
+        match = total_pattern.match(line)
+        if match:
+            total_reads = match.group("total_reads")
+        match = unique_mapped_pattern.match(line)
+        if match:
+            unique_mapped_reads = match.group("unique_mapped_reads")
+        match = multiple_mapped_pattern.match(line)
+        if match:
+            multiple_mapped_reads = match.group("multiple_mapped_reads")
+    res = namedtuple('res', ['total_reads', 'unique_mapped_reads', 'suppressed_multiple_mapped_reads'])
+    r = res(total_reads=total_reads,
+        unique_mapped_reads=unique_mapped_reads,
+        suppressed_multiple_mapped_reads=multiple_mapped_reads)
+    return r
+
+def parse_bowtie2_log(s):
+    total_pattern = re.compile(r"""(?P<total_reads>\d+)\s+reads;\s+of\s+these:""",
+        re.VERBOSE)
+    unique_mapped_pattern = re.compile(r"""\s*(?P<unique_mapped_reads>\d+)\s+\(\S+\).+exactly\s+1\s+time""",
+        re.VERBOSE)
+    multiple_mapped_pattern = re.compile(r"""\s+(?P<multiple_mapped_reads>\d+)\s+\(\S+\).+aligned\s+>1\s+times""",
+        re.VERBOSE)
+    for line in s:
+        match = total_pattern.match(line)
+        if match:
+            total_reads = match.group("total_reads")
+        match = unique_mapped_pattern.match(line)
+        if match:
+            unique_mapped_reads = match.group("unique_mapped_reads")
+        match = multiple_mapped_pattern.match(line)
+        if match:
+            multiple_mapped_reads = match.group("multiple_mapped_reads")
+    res = namedtuple('res', ['total_reads', 'unique_mapped_reads', 'multiple_mapped_reads'])
+    r = res(total_reads=total_reads,
+        unique_mapped_reads=unique_mapped_reads,
+        multiple_mapped_reads=multiple_mapped_reads)
+    return r
+
+def parse_rmdup_log(s):
+    ## Matches the single-end samtools rmdup line ([bam_rmdupse_core]) and,
+    ## assuming the same format, the paired-end line ([bam_rmdup_core]).
+    pattern = re.compile(r'\[bam_rmdup(se)?_core\]\s+(?P<dup_reads>\d+)\s/\s\d+', re.VERBOSE)
+    for line in s:
+        match = pattern.match(line)
+        if match:
+            dup_reads = match.group("dup_reads")
+    res = namedtuple('res', ['dup_reads'])
+    r = res(dup_reads=dup_reads)
+    return r
+
+def parse_phantomPeak_log(s):
+    ## No trailing context is required after the coefficient, so the whole
+    ## number is always captured.
+    NSC_pattern = re.compile(r'.*\(NSC\)\s*(?P<NSC>\d+\.\d+)', re.VERBOSE)
+    RSC_pattern = re.compile(r'.*\(RSC\)\s*(?P<RSC>\d+\.\d+)', re.VERBOSE)
+    for line in s:
+        match = NSC_pattern.match(line)
+        if match:
+            NSC = match.group("NSC")
+        match = RSC_pattern.match(line)
+        if match:
+            RSC = match.group("RSC")
+    res = namedtuple('res', ['NSC', 'RSC'])
+    r = res(NSC=NSC, RSC=RSC)
+    return r
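+
+## Illustrative examples (made-up numbers) of the log lines the parsers
+## above expect; exact wording can vary between tool versions:
+##   bowtie 1:       # reads processed: 20000000
+##   samtools rmdup: [bam_rmdupse_core] 150000 / 18000000 = 0.0083 in library '-'
+##   phantomPeak:    Normalized strand cross-correlation coefficient (NSC) 1.10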
+ """ + suffixes = ['.fastq.alignment.log', '.fq.alignment.log', '.gz.alignment.log', '.bam.rmdup.log', '_rmdup.bam.phantomPeak.log'] + for suffix in suffixes: + file_basename = file_basename.replace(suffix, '') + return file_basename + +## Search subdirectories under data folder. +search_paths = ["fastq", "rmdup"] + +## Used for final results. +summary_dict = {} + +## Load the same config yaml file of the pipeline. +config_name = sys.argv[1] +config_f = open(config_name, "r") +config = yaml.load(config_f) +config_f.close() + +if config["aligner"] == "bowtie": + ## To be used in debug + # input_files = {".alignment.log":("total_reads", "unique_mapped_reads")} + + ## Summary files used for summarizing. + input_files = { + ".alignment.log":("total_reads", "unique_mapped_reads", "suppressed_multiple_mapped_reads"), + ".rmdup.log":("dup_reads"), + ".phantomPeak.log":("NSC", "RSC") + } + + ## Decide the parser here by a dict. + parser_dict = { + ".alignment.log": parse_bowtie1_log, + ".rmdup.log": parse_rmdup_log, + ".phantomPeak.log": parse_phantomPeak_log + } + + ## Used to assign the output field in output file. + output_header = [ + "sample", + "total_reads", + "unique_mapped_reads", + "suppressed_multiple_mapped_reads", + "dup_reads", + "NSC", + "RSC"] + +elif config["aligner"] == "bowtie2": + ## to be used in debug + # input_files = {".alignment.log":("total_reads", "unique_mapped_reads", "multiple_mapped_reads")} + + ## Summary files used for summarizing. + input_files = { + ".alignment.log":("total_reads", "unique_mapped_reads", "multiple_mapped_reads"), + ".rmdup.log":("dup_reads"), + ".phantomPeak.log":("NSC", "RSC") + } + + ## Decide the parser here by a dict. + parser_dict = { + ".alignment.log": parse_bowtie2_log, + ".rmdup.log": parse_rmdup_log, + ".phantomPeak.log": parse_phantomPeak_log + } + + ## Used to assign the output field in output file. + output_header = [ + "sample", + "total_reads", + "unique_mapped_reads", + "multiple_mapped_reads", + "dup_reads", + "NSC", + "RSC"] + +## Scan the files to summarize the pipeline. +for input_type, summary_types in input_files.items(): + summary_files = getSummaryFiles(input_type, config, search_paths) + if len(summary_files) != 0: + for summary_file in summary_files: + file_id = getFileId(os.path.basename(summary_file)) + if file_id not in summary_dict: + summary_dict[file_id] = {'sample':file_id} + input_file = file(summary_file) + lines = input_file.readlines() + input_file.close() + ## Here the value of the dict is the parser function! + res = parser_dict[input_type](lines) + ## Unpack the results into dict. + for i in range(len(res._fields)): + if res._fields[i] not in output_header: + output_header.append(res._fields[i]) + summary_dict[file_id][res._fields[i]] = res[i] + +## Output to file, and the columns order is decided by output_header. +output_file = file("summary_stats.txt", "w") +header_line = "\t".join(output_header) + "\n" +output_file.write(header_line) +for sample in summary_dict.keys(): + output_list = [] + for stat in output_header: + if stat in summary_dict[sample]: + output_list.append(summary_dict[sample][stat]) + else: + output_list.append("NA") + line = "\t".join(output_list) + "\n" + output_file.write(line) +output_file.close() From 2e1a6ed6e5ee4fc9fa3d40b09ebd49d4464fab8a Mon Sep 17 00:00:00 2001 From: Ning-Yi SHAO Date: Fri, 5 Sep 2014 11:29:22 -0700 Subject: [PATCH 2/2] Update of README. 
From 2e1a6ed6e5ee4fc9fa3d40b09ebd49d4464fab8a Mon Sep 17 00:00:00 2001
From: Ning-Yi SHAO
Date: Fri, 5 Sep 2014 11:29:22 -0700
Subject: [PATCH 2/2] Update of README.

---
 README.md | 65 +++++++++++++++++++++++++------------------------------
 1 file changed, 30 insertions(+), 35 deletions(-)

diff --git a/README.md b/README.md
index a50dbaa..49622f6 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,15 @@
-Pipeline for ChIP-seq preprocessing
-===================================
+# Pipeline for ChIP-seq preprocessing
 
 ### Overview
 
 Here is the pipeline I used for ChIP-seq preprocessing, including:
 
-- align the fastq data to the reference genome with bowtie or bowtie2.
-- run FastQC to check the sequencing quality.
-- remove all read duplicates from the aligned data.
-- generate TDF files for browsing in IGV.
-- run PhantomPeak to check the quality of the ChIP.
-- run ngs.plot to investigate the enrichment of ChIP-seq data at the TSS, TES, and gene body.
+* align the fastq data to the reference genome with bowtie or bowtie2.
+* run FastQC to check the sequencing quality.
+* remove all read duplicates from the aligned data.
+* generate TDF files for browsing in IGV.
+* run PhantomPeak to check the quality of the ChIP.
+* run ngs.plot to investigate the enrichment of ChIP-seq data at the TSS, TES, and gene body.
 
 The pipeline work flow is:
 
@@ -20,14 +19,14 @@ The softwares used in this pipeline are:
 
-- [ruffus](https://code.google.com/p/ruffus/)
-- [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml)
-- [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
-- [samtools](http://samtools.sourceforge.net/)
-- [IGVTools](http://www.broadinstitute.org/igv/igvtools)
-- [PhantomPeak](http://code.google.com/p/phantompeakqualtools/) **In fact, the script `run_spp_nodups.R` is from PhantomPeak, but PhantomPeak itself still needs to be installed in R.**
-- [ngs.plot](https://code.google.com/p/ngsplot/)
-- If cluster support is needed, [drmaa_for_python](https://pypi.python.org/pypi/drmaa) is required. Now LSF and SGE are supported, and it is easy to modify the scripts to fit other schedulers.
+* [ruffus](https://code.google.com/p/ruffus/)
+* [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml)
+* [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+* [samtools](http://samtools.sourceforge.net/)
+* [IGVTools](http://www.broadinstitute.org/igv/igvtools)
+* [PhantomPeak](http://code.google.com/p/phantompeakqualtools/) __In fact, the script **run_spp_nodups.R** is from PhantomPeak, but PhantomPeak itself still needs to be installed in R.__
+* [ngs.plot](https://code.google.com/p/ngsplot/)
+* If cluster support is needed, [drmaa_for_python](https://pypi.python.org/pypi/drmaa) is required. Now LSF and SGE are supported, and it is easy to modify the scripts to fit other schedulers.
 
 Install the above software and make sure it is in $PATH.
 
@@ -57,42 +56,38 @@ python results_parser.py config.yaml
 
 For the organization of projects, I generally follow this paper: [A Quick Guide to Organizing Computational Biology Projects](http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424). Because this is preprocessing, and the real analysis will be peak calling, chromatin segmentation, and differential enrichment detection, I just put the results of the preprocessing in the data folder.
 
-For the configuration yaml file, **project_dir: `~/projects/test_ChIP-seq`** and **data_dir: "data"** mean the data folder is `~/projects/test_ChIP-seq/data`, and the results will be put in the same folder. Fastq files should be under the `~/projects/test_ChIP-seq/data/fastq` folder. Currently *.fastq, *.fq, and *.gz (compressed fastq) files are acceptable. `aligner` can be `bowtie` or `bowtie2`; if it is not assigned, the default aligner is `bowtie`. For `bowtie2`, the environment variable `$BOWTIE2_INDEXES` should be set before running.
+For the configuration yaml file, __project_dir: `~/projects/test_ChIP-seq`__ and __data_dir: "data"__ mean the data folder is `~/projects/test_ChIP-seq/data`, and the results will be put in the same folder. Fastq files should be under the `~/projects/test_ChIP-seq/data/fastq` folder. Currently *.fastq, *.fq, and *.gz (compressed fastq) files are acceptable. `aligner` can be `bowtie` or `bowtie2`; if it is not assigned, the default aligner is `bowtie`. For `bowtie2`, the environment variable `$BOWTIE2_INDEXES` should be set before running.
 
 The position of pipeline.py, results_parser.py, and config.yaml doesn't matter at all. But I prefer to put them under the project/script/preprocess folder.
 
 **Important:**
 
-- To make the ngs.plot step work, please name the fastq files in this way:
-
++ To make the ngs.plot step work, please name the fastq files in this way:
 ```
-Say conditions A and B, each with 2 replicates, and one DNA input per condition.
-Name the files A_rep1.fastq, A_rep2.fastq, A_input.fastq, B_rep1.fastq,
+Say conditions A and B, each with 2 replicates, and one DNA input per condition. 
+Name the files A_rep1.fastq, A_rep2.fastq, A_input.fastq, B_rep1.fastq, 
 B_rep2.fastq, and B_input.fastq. The key point is that samples from the same
 condition share common letters, and that input samples contain the string
 "input" or "Input".
 ```
-
-- If you want to run the pipeline only up to some specific step, just modify the target function name in `pipeline_run` in pipeline.py.
-- If the data are paired-end, follow these steps:
-  - Modify the `config.yaml` file, change "pair_end" to "yes".
-  - Modify the `config.yaml` file, change "input_files" to "*R1*.fastq.gz" or "*R1*.fastq".
-  - Make sure the fastq files are named with the "*R1*" and "*R2*" patterns.
-- If you want to use a cluster:
-  - Edit '~/.bash_profile' to make sure all required paths are in $PATH.
-  - Modify `config.yaml` to fit your demands.
-  - `multithread` in `pipeline.py` determines the number of concurrent jobs ruffus submits to cluster nodes; adjust the value in the `pipeline_run` call to match what your cluster allows.
++ If you want to run the pipeline only up to some specific step, just modify the target function name in `pipeline_run` in pipeline.py.
++ If the data are paired-end, follow these steps (see the example config after this list):
+  + Modify the `config.yaml` file, change "pair_end" to "yes".
+  + Modify the `config.yaml` file, change "input_files" to "\*R1\*.fastq.gz" or "\*R1\*.fastq".
+  + Make sure the fastq files are named with the "\*R1\*" and "\*R2\*" patterns.
++ If you want to use a cluster:
+  + Edit '~/.bash_profile' to make sure all required paths are in $PATH.
+  + Modify `config.yaml` to fit your demands.
+  + `multithread` in `pipeline.py` determines the number of concurrent jobs ruffus submits to cluster nodes; adjust the value in the `pipeline_run` call to match what your cluster allows.
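+
+For example (an illustrative sketch with placeholder values), a paired-end
+run on SGE could use a `config.yaml` containing:
+
+```yaml
+aligner: "bowtie"
+pair_end: "yes"
+input_files: "*R1*.fastq.gz"
+cores: 4
+queue: "queue_name"  # replace with a real SGE queue name
+```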
 
 **Warning:**
 
-`Bowtie2` keeps reads with multiple hits, which breaks an assumption of `phantomPeak`:
-
+`Bowtie2` keeps reads with multiple hits, which breaks an assumption of `phantomPeak`:
 ```
 It is EXTREMELY important to filter out multi-mapping reads from the
 BAM/tagAlign files. Large number of multimapping reads can severly affect
 the phantom peak coefficient and peak calling results.
 ```
-
 So be careful when interpreting `NSC` and `RSC` in `Bowtie2` alignment results.
 
 ### Notes
 
@@ -103,4 +98,4 @@ In Bowtie2, default parameters are used.
 
 ### ToDos
 
-- Method to skip steps that the user doesn't want to run.
++ Method to skip steps that the user doesn't want to run.