From 2e1a6ed6e5ee4fc9fa3d40b09ebd49d4464fab8a Mon Sep 17 00:00:00 2001
From: Ning-Yi SHAO <shaoningyi@gmail.com>
Date: Fri, 5 Sep 2014 11:29:22 -0700
Subject: [PATCH] Update of README.

---
 README.md | 65 +++++++++++++++++++++++++------------------------------
 1 file changed, 30 insertions(+), 35 deletions(-)

diff --git a/README.md b/README.md
index a50dbaa..49622f6 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,15 @@
-Pipeline for ChIP-seq preprocessing
-===================================
+# Pipeline for ChIP-seq preprocessing
 
 ### Overview
 
 Here is the pipeline I used for ChIP-seq preprocessing, including:
 
-- align the fastq data to reference genome by bowtie or bowtie2.
-- run FastQC to check the sequencing quality.
-- remove all reads duplications of the aligned data.
-- generate TDF files for browsing in IGV.
-- run PhantomPeak to check the quality of ChIP.
-- run ngs.plot to investigate the enrichment of ChIP-seq data at TSS, TES, and genebody.
+* align the fastq data to the reference genome with bowtie or bowtie2.
+* run FastQC to check the sequencing quality.
+* remove all duplicate reads from the aligned data.
+* generate TDF files for browsing in IGV.
+* run PhantomPeak to check the quality of ChIP.
+* run ngs.plot to investigate the enrichment of ChIP-seq data at TSS, TES, and gene body.
 
 The pipeline work flow is:
 
@@ -20,14 +19,14 @@ The pipeline work flow is:
 
 The softwares used in this pipeline are:
 
-- [ruffus](https://code.google.com/p/ruffus/)
-- [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml)
-- [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
-- [samtools](http://samtools.sourceforge.net/)
-- [IGVTools](http://www.broadinstitute.org/igv/igvtools)
-- [PhantomPeak](http://code.google.com/p/phantompeakqualtools/) **In fact, the script **run_spp_nodups.R** is from PhantomPeak, but PhantomPeak still need to be installed in R.**
-- [ngs.plot](https://code.google.com/p/ngsplot/)
-- If cluster supporting needed, [drmaa_for_python](https://pypi.python.org/pypi/drmaa) is needed. Now LSF and SGE are supported, but it is easy to modify it to fit your demands.
+* [ruffus](https://code.google.com/p/ruffus/)
+* [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml)
+* [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+* [samtools](http://samtools.sourceforge.net/)
+* [IGVTools](http://www.broadinstitute.org/igv/igvtools)
+* [PhantomPeak](http://code.google.com/p/phantompeakqualtools/) __In fact, the script **run_spp_nodups.R** is from PhantomPeak, but PhantomPeak still needs to be installed in R.__
+* [ngs.plot](https://code.google.com/p/ngsplot/)
+* If cluster support is needed, [drmaa_for_python](https://pypi.python.org/pypi/drmaa) is required. Currently LSF and SGE are supported, but the pipeline is easy to modify to fit your needs.
 
 Install above softwares and make sure they are in $PATH.
 
@@ -57,42 +56,38 @@ python results_parser.py config.yaml
 
 For the organization of projects, I generally follow this paper: [A Quick Guide to Organizing Computational Biology Projects](http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424). Here because it is preprocessing, and real analysis will be peak calling, chromatin segmentation, and differential enrichment detection, so I just put the results of the preprocess in the data folder.
 
-For the configuration yaml file, **project_dir: `~/projects/test_ChIP-seq`** and **data_dir: "data"** mean the data folder is `~/projects/test_ChIP-seq/data`, and the results will be put in the same folder. Fastq files should be under `~/projects/test_ChIP-seq/data/fastq` folder. Now *.fastq, *.fq, *.gz (compressed fastq) files are acceptable. `aligner` now could be `bowtie` or `bowtie2`, if not assigned, then default aligner is `bowtie`. For `bowtie2`, the system variable `$BOWTIE2_INDEXES` should be set before running.
+For the configuration yaml file, __project_dir: `~/projects/test_ChIP-seq`__ and __data_dir: "data"__ mean the data folder is `~/projects/test_ChIP-seq/data`, and the results will be put in the same folder. Fastq files should be under the `~/projects/test_ChIP-seq/data/fastq` folder. Currently `*.fastq`, `*.fq`, and `*.gz` (compressed fastq) files are accepted. `aligner` can be `bowtie` or `bowtie2`; if it is not set, the default aligner is `bowtie`. For `bowtie2`, the environment variable `$BOWTIE2_INDEXES` should be set before running.
 
 The position of pipeline.py, results_parser.py, and config.yaml doesn't matter at all. But I prefer to put them under project/script/preprocess folder.
 
 **Important:**
 
-- To make ngs.plot part work, please name the fastq files in this way:
-
++ To make ngs.plot part work, please name the fastq files in this way:
 ```
-Say condition A, B, each with 2 replicates, and one DNA input per condition.
-Name the files as A_rep1.fastq, A_rep2.fastq, A_input.fastq, B_rep1.fastq,
+Say condition A, B, each with 2 replicates, and one DNA input per condition.
+Name the files as A_rep1.fastq, A_rep2.fastq, A_input.fastq, B_rep1.fastq,
 B_rep2.fastq, and B_input.fastq.The key point is to make the same condition
  samples with common letters and input samples contain "input" or "Input"
  strings.
 ```
-
-- If use want to only run to some specific step, just modify the function name in `pipeline_run` in pipeline.py.
-- If the data are pair-end, follow this step:
-	- Modify the `config.yaml` file, change "pair_end" to "yes".
-	- Modify the `config.yaml` file, change "input_files" to "*R1*.fastq.gz" or "*R1*.fastq".
-	- Make sure the fastq files named as "*R1*" and "*R2*" pattern.
-- if you want to use cluster:
-	- Edit '~/.bash_profile' to make sure all paths in $PATH.
-	- Modify `config.yaml` to fit your demands.
-	- `multithread` in `pipeline.py` determines the number of concurrent jobs to be submitted to cluster nodes by ruffus. A default value of 10 is used.
++ If you want to run only up to a specific step, just modify the function name in `pipeline_run` in pipeline.py.
++ If the data are paired-end, follow these steps:
+	+ Modify the `config.yaml` file, change "pair_end" to "yes".
+	+ Modify the `config.yaml` file, change "input_files" to "\*R1\*.fastq.gz" or "\*R1\*.fastq".
+	+ Make sure the fastq files are named following the "\*R1\*" and "\*R2\*" pattern.
++ If you want to use a cluster:
+	+ Edit `~/.bash_profile` to make sure all required paths are in $PATH.
+	+ Modify `config.yaml` to fit your needs.
+	+ `multithread` in `pipeline.py` determines the number of concurrent jobs to be submitted to cluster nodes by ruffus. A default value of 10 is used.
 
 **Warning:**
 
-`Bowtie2` allows multiple hits reads, and breaks the assumption of `phantomPeak`:
-
+`Bowtie2` allows reads with multiple hits, which breaks an assumption of `phantomPeak`:
 ```
 It is EXTREMELY important to filter out multi-mapping reads from the BAM/tagAlign
  files. Large number of multimapping reads can severly affect the phantom peak
  coefficient and peak calling results.
 ```
-
 So be careful to interpret `NSC` and `RSC` in `Bowtie2` alignment results.
 
 ### Notes
@@ -103,4 +98,4 @@ In Bowtie2, default parameters are used.
 
 ### ToDos
 
-- Method to skip some steps if the user doesn't run.
++ Method to skip some steps if the user doesn't want to run them.