Analysis pipeline for ChIP-seq

Author: 连明

12/24/2017 4:43:25 PM

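1. Preparing the reference genome

The reference genome can be obtained from either of the following two sources:

UCSC: Download -> Mouse genome -> Full datasets -> chromFa.tar.gz

The download is an archive of per-chromosome fasta files. After extraction, merge them into a single fasta file: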

$ tar zxvf chromFa.tar.gz && cat *.fa >mm10.fa

ENSEMBL: Download -> Download data via FTP -> FTP site -> ../Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
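As a sanity check after merging the per-chromosome UCSC files, the number of sequence headers in the merged file should equal the total in the individual files. The following is a minimal sketch on toy data; the file names and sequences are made up for illustration and are not part of the pipeline.

```shell
# Toy per-chromosome files standing in for the contents of chromFa.tar.gz (assumed names)
workdir=$(mktemp -d)
printf '>chr1\nACGT\n' > "$workdir/chr1.fa"
printf '>chr2\nGGCC\n' > "$workdir/chr2.fa"

# Merge them; the output name does not match the chr*.fa glob, so reruns stay safe
cat "$workdir"/chr*.fa > "$workdir/merged.fasta"

# Count header lines before and after merging
n_parts=$(cat "$workdir"/chr*.fa | grep -c '^>')
n_merged=$(grep -c '^>' "$workdir/merged.fasta")
echo "headers in parts: $n_parts, headers in merged file: $n_merged"
```

If the two counts differ, a file was probably missed or included twice during the merge.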

2. Building the reference genome index

$ bowtie2-build ~/Ref/mm10/mm10.fa ~/Ref/mm10/mm10 1>~/Ref/mm10/mm10.bwt_index.log 2>&1

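3. Obtaining ChIP-seq data

The data can be downloaded from either SRA or ENA.

SRA

Data downloaded from SRA come in sra format, a dedicated compressed file format, and must be converted after download.

Downloading the data

To download a data set, first use its GEO id to look up the corresponding SRP id (SRA project id) on NCBI, then download from the FTP site using the SRP id. For example, if the GEO id is GSE42466 and the GEO database at NCBI gives its SRP id as SRP017311, the FTP address is

ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311/SRR620$i/SRR620$i.sra

A loop can download the whole data set: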

for ((i=204;i<=209;i++));
do
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311/SRR620$i/SRR620$i.sra;
done
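Before starting a long download it can help to print the URLs the loop will request. The sketch below only builds the list for runs SRR620204 to SRR620209 from the address given above and downloads nothing; the variable names are illustrative.

```shell
# Base URL copied from the text above
base=ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311

# Build one URL per run accession without downloading anything
urls=$(for i in $(seq 204 209); do
    echo "$base/SRR620$i/SRR620$i.sra"
done)
echo "$urls"
```

Once the list looks right, it could be passed to wget, for example with `echo "$urls" | xargs -n 1 wget -c`.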

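Format conversion

The SRA format compresses the data 6 to 9 fold, considerably more than the 2 to 3 fold achieved by zip. Converting SRA files to fastq requires the fastq-dump command from sratoolkit, developed by NCBI: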

$ fastq-dump --split-3 -O ChIP_seq/ SRR***.sra

The --split-3 option splits the fastq output of a paired-end (PE) sra file into *_1.fastq and *_2.fastq. The example data set is single-end (SE), so no splitting takes place.
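A quick way to check a converted file is to count its reads: a fastq record occupies four lines, so the read count is the line count divided by four. For paired-end data, the *_1.fastq and *_2.fastq files would be expected to give the same number. A minimal sketch on a made-up two-read file:

```shell
workdir=$(mktemp -d)
# Toy fastq with two reads (4 lines per read); contents are illustrative only
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$workdir/toy.fastq"

# A fastq record is 4 lines, so reads = lines / 4
n_lines=$(wc -l < "$workdir/toy.fastq")
n_reads=$((n_lines / 4))
echo "reads: $n_reads"
```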

ENA

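Look up the SRP id of the data set in the GEO database, search for that id directly on EBI, download the text file of sample information, and then batch-download based on that file: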

$ tail -n +2 PRJNA182214.txt| cut -f16 | xargs wget -c -P Rawdata
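The command above relies on the download links being in column 16, which depends on which columns were selected when the table was exported. A slightly more robust sketch looks the column up by its header name instead. It assumes the header is called fastq_ftp; the toy table and its URLs are invented for illustration.

```shell
workdir=$(mktemp -d)
# Toy stand-in for the ENA sample table: tab-separated with a header row.
# The column layout and URLs are illustrative assumptions, not real records.
printf 'run_accession\tfastq_ftp\n'                  >  "$workdir/report.txt"
printf 'SRR0000001\tftp.example.org/a.fastq.gz\n'    >> "$workdir/report.txt"
printf 'SRR0000002\tftp.example.org/b.fastq.gz\n'    >> "$workdir/report.txt"

# Find the index of the fastq_ftp column from the header, then print that column
urls=$(awk -F '\t' -v name=fastq_ftp '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    col     { print $col }
' "$workdir/report.txt")
echo "$urls"
```

The resulting list can be piped to `xargs wget -c -P Rawdata` in the same way as in the original command.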

4. Quality control

Inspect read quality with fastqc:

$ ls *.fastq | xargs fastqc -t 10 -o ChIP_seq/

Trim and filter reads with cutadapt. If cutadapt is not yet installed, install it with:

$ conda install -c bioconda cutadapt

# 1. Single end
$ cutadapt -a ADAPTER -q 20,20 -m 20 -o outfile_QC.fastq infile.fastq
# 2. Paired end
$ cutadapt -j 8 -a ADAPTER -A ADAPTER -q 20,20 -m 20 -o outfile_QC_1.fastq -p outfile_QC_2.fastq infile_1.fastq infile_2.fastq

-j CORES Number of CPU cores to use. Default: 1

-a ADAPTER 3' adapter to be removed

-A ADAPTER 3' adapter to be removed from second read in a pair.

-q [5'CUTOFF,]3'CUTOFF Trim low-quality bases from the 5' and/or 3' end of each read before adapter removal.

-m LENGTH Discard reads shorter than LENGTH. Default: 0

A loop can process all files:

ls ChIP_seq/*.fastq | while read i;
do 
i=`basename $i .fastq`;
cutadapt -q 20,20 -m 20 -o ChIP_seq/${i}_QC.fastq ChIP_seq/${i}.fastq;
done
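After trimming with -m 20, no read in the output should be shorter than 20 nt. The sketch below checks this on a toy file by tracking the shortest sequence line; the file name and reads are made up for illustration.

```shell
workdir=$(mktemp -d)
# Toy trimmed fastq: one 25-nt read and one 22-nt read (made-up sequences)
printf '@r1\nACGTACGTACGTACGTACGTACGTA\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n' >  "$workdir/toy_QC.fastq"
printf '@r2\nACGTACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIIIIIII\n'       >> "$workdir/toy_QC.fastq"

# Sequence lines are every 4th line starting at line 2; keep the shortest length seen
min_len=$(awk '
    NR % 4 == 2 { n = length($0); if (min == "" || n < min) min = n }
    END         { print min }
' "$workdir/toy_QC.fastq")
echo "shortest read: $min_len"
```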

5. Aligning to the reference genome

ls ChIP_seq/Rawdata/*QC.fastq | while read i;
do
i=`basename $i _QC.fastq`;
echo $i;
bowtie2 -p 10 --local -x Ref/mm10/mm10 -U ChIP_seq/Rawdata/${i}_QC.fastq | samtools sort -@ 8 -O bam -o ChIP_seq/Map/${i}.sorted.bam;
done 1>ChIP_seq/Map/map.log 2>&1

Alignment statistics are saved in map.log.
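Because the loop echoes each sample name before running bowtie2, map.log can be condensed into one line per sample. The sketch below assumes that sample names start with SRR and that each bowtie2 summary ends with an "overall alignment rate" line; the log content is a mock-up with invented numbers.

```shell
workdir=$(mktemp -d)
# Mock map.log: sample name from echo, followed by a bowtie2-style summary
cat > "$workdir/map.log" <<'EOF'
SRR620204
1000 reads; of these:
  1000 (100.00%) were unpaired; of these:
    50 (5.00%) aligned 0 times
    900 (90.00%) aligned exactly 1 time
    50 (5.00%) aligned >1 times
95.00% overall alignment rate
SRR620205
2000 reads; of these:
  2000 (100.00%) were unpaired; of these:
    200 (10.00%) aligned 0 times
    1700 (85.00%) aligned exactly 1 time
    100 (5.00%) aligned >1 times
90.00% overall alignment rate
EOF

# Remember the most recent sample name and print it next to its alignment rate
summary=$(awk '
    /^SRR/                    { sample = $1 }
    /overall alignment rate$/ { print sample "\t" $1 }
' "$workdir/map.log")
echo "$summary"
```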

bowtie2 parameters

-p threads

--local local alignment

-x reference genome index

-U Files with unpaired reads

samtools sort parameters

-@ threads

-O Specify output format (SAM, BAM, CRAM)

-o Write final output to FILE rather than standard output

6. Peak calling

Many peak-calling tools are currently available; for details see: http://wodaklab.org/nextgen/data/peakfinders.html

This step uses MACS2, a tool written in Python 2.7. If you have both Python 3.6 and Python 2.7 in use, make sure Python 2.7 is active before running it (add the python2.7/anaconda2 installation directory to the PATH environment variable). Installation:

Download and install anaconda2

$ wget -c -P basic_tool/ https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh
$ sh basic_tool/Anaconda2-5.0.1-Linux-x86_64.sh
$ echo 'export PATH=../anaconda2/bin:$PATH' >>~/.bashrc

Download and install MACS2

# 1. Install from source
$ wget -c -P biosoft/ https://pypi.python.org/packages/9f/99/a8ac96b357f6b0a6f559fe0f5a81bcae12b98579551620ce07c5183aee2c/MACS2-2.1.1.20160309.tar.gz
$ cd biosoft  && tar zxvf MACS2-2.1.1.20160309.tar.gz
$ cd MACS2-2.1.1.20160309 && python setup.py install
$ echo 'export PATH=../MACS2-2.1.1.20160309/bin:$PATH' >>~/.bashrc

# 2. Install with bioconda
$ conda install -c bioconda macs2

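Once installed, MACS2 can be used directly for peak calling:

# MACS first builds a model. Its key parameter is d, the estimated fragment length (the distance between the modes of the forward- and reverse-strand tag distributions); reads are shifted by d/2 (the shiftsize). The band width (--bw) is the window used to scan for paired peaks while building the model. See the section on how peak calling works below.
$ macs2 callpeak -c controlFile.bam -t treatmentFile.bam -m 10 30 -p pvalue -f BAM -g gsize -B -n filename.prefix --outdir ChIP_seq/CallPeak 2>ChIP_seq/CallPeak/filename.macs2.log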

-c Control file

-t Treatment file

-m Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model.

-g Effective genome size. shortcuts:'hs' for human (2.7e9), 'mm' for mouse(1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8).

-p P value cutoff

-f File format

-B Output a file in BEDGRAPH format to visualize the peak profiles in a genome browser. There will be one file for the treatment, and one for the control.

-n Experiment name, which will be used to generate output file names.
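The p-value cutoff given with -p is applied by MACS2 itself, but the resulting narrowPeak file can also be filtered more strictly afterwards, since its eighth column holds -log10(p-value). The sketch below keeps peaks with -log10(p) of at least 10; the threshold, file name and peak records are illustrative assumptions.

```shell
workdir=$(mktemp -d)
peaks="$workdir/toy_peaks.narrowPeak"
# Columns: chrom start end name score strand signalValue -log10(p) -log10(q) summit_offset
printf 'chr1\t100\t300\tpeak_1\t120\t.\t8.5\t12.3\t9.1\t90\n'    >  "$peaks"
printf 'chr1\t900\t1100\tpeak_2\t40\t.\t3.2\t5.4\t2.8\t110\n'    >> "$peaks"
printf 'chr2\t500\t800\tpeak_3\t200\t.\t15.0\t25.7\t21.9\t150\n' >> "$peaks"

# Keep only peaks whose -log10(p-value) in column 8 reaches the threshold
strict=$(awk -F '\t' -v min_log10p=10 '$8 >= min_log10p { print $4 }' "$peaks")
echo "$strict"
```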

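For an exploration of the MACS2 parameters, see: https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part1_peak_calling.md

To process samples in a loop, first prepare a tab-separated text file with the following columns:

Treatment SRA id	Treatment SRA name	Control SRA id	Control SRA name

Command: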

nohup cat ChIP_seq/CallPeak/ChIP_seq.pairs | while read i;
do
treat_id=`echo $i|perl -ane 'chomp;print $F[0]'`;
treat_name=`echo $i|perl -ane 'chomp;print $F[1]'`;
control_id=`echo $i|perl -ane 'chomp;print $F[2]'`;
control_name=`echo $i|perl -ane 'chomp;print $F[3]'`;
macs2 callpeak -c ChIP_seq/Map/${control_id}.sorted.bam -t ChIP_seq/Map/${treat_id}.sorted.bam -m 10 30 -p 1e-5 -f BAM -g mm -B -n $treat_name --outdir ChIP_seq/CallPeak/ 2>ChIP_seq/CallPeak/${treat_name}.macs2.log;
done &
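Before launching the real loop in the background, a dry run that only prints the macs2 commands makes it easy to confirm that each treatment is paired with the right control. This sketch splits the fields with the shell's read builtin instead of perl; the pairs file and sample names are hypothetical.

```shell
workdir=$(mktemp -d)
# Toy pairs file: treatment id, treatment name, control id, control name (tab-separated).
# The sample names are hypothetical.
printf 'SRR620204\tsampleA_ChIP\tSRR620205\tsampleA_input\n' >  "$workdir/ChIP_seq.pairs"
printf 'SRR620206\tsampleB_ChIP\tSRR620207\tsampleB_input\n' >> "$workdir/ChIP_seq.pairs"

# read splits each line into the four fields; echo prints the command instead of running it
cmds=$(while IFS="$(printf '\t')" read -r treat_id treat_name control_id control_name; do
    echo "macs2 callpeak -c ChIP_seq/Map/${control_id}.sorted.bam -t ChIP_seq/Map/${treat_id}.sorted.bam -m 10 30 -p 1e-5 -f BAM -g mm -B -n ${treat_name} --outdir ChIP_seq/CallPeak/"
done < "$workdir/ChIP_seq.pairs")
echo "$cmds"
```

Removing the echo (and its quotes) turns the dry run into the actual peak-calling loop.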

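How peak calling works

For the underlying statistics, see this blog post: https://www.plob.org/article/7227.html

For the principle of peak calling itself, see this article: https://www.plob.org/article/3760.html

The following two figures illustrate the peak-calling process well: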

(1) Building a signal profile

(2) Peak calling

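Seeing how others approached peak calling in its early days may also shed light on the principle: 八年前的ChIP-seq怎么找peak (How ChIP-seq peaks were found eight years ago)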

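7. Visualization

Viewing the peaks in IGV

When -B is passed to macs2 callpeak in the peak-calling step, two bedGraph files are written, holding the signal profiles of the control and the treatment respectively. The peaks can then be visualized in a genome browser. Available genome browsers include:

Integrative Genomics Viewer (IGV): installed and run locally, which avoids transferring data

UCSC genome browser: an online tool; a reference genome for the organism must be available

Visualization with deeptools

Converting bam files to bw (bigWig) files requires deeptools, a tool written in Python: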

ls ChIP_seq/Map/*.bam | while read i;
do
sample=`basename $i .bam`;
samtools rmdup -s ChIP_seq/Map/${sample}.bam ChIP_seq/Map/${sample}.nodup.bam; # Remove the duplicated reads
samtools index -@ 10 ChIP_seq/Map/${sample}.nodup.bam ChIP_seq/Map/${sample}.nodup.bai; # Index the BAM file
bamCoverage -b ChIP_seq/Map/${sample}.nodup.bam --binSize 10 -o ChIP_seq/Map/${sample}.bw;
done

bamCoverage parameters

-b BAM file

--outFileFormat Output file type. Either "bigwig" or "bedgraph" (default: bigwig)

--binSize Bin size

-o Output file name

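The resulting bw files can be visualized in IGV.

Peaks are not distributed randomly across the genome. If they concentrate near transcription start sites (TSS), the signal intensity around the TSS can be plotted. Experimental treatments can shift the distribution of peaks, and the signal intensity shifts with it, so the plots should be interpreted together with the underlying biology.

A TSS profile and heatmap can be drawn for each sample separately, or for all ChIP-seq samples combined. First download the RefSeq annotation for the mm10 genome, which is available from the UCSC Table Browser: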

$ computeMatrix reference-point -p 10 --referencePoint TSS -b 2000 -a 2000 -S ../*bw -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed --skipZeros -o tmp4.mat.gz
$ plotHeatmap -m tmp4.mat.gz -out tmp4.merge.png
$ plotProfile --dpi 720 -m tmp4.mat.gz -out tmp4.profile.pdf --plotFileFormat pdf --perGroup
$ plotHeatmap --dpi 720 -m tmp4.mat.gz -out tmp4.merge.pdf --plotFileFormat pdf

computeMatrix reference-point parameters

--referencePoint {TSS,TES,center}

-b Distance upstream of the reference-point selected

-a Distance downstream of the reference-point selected

-R Reference annotation file

-S bigWig file(s) containing the scores to be plotted

--skipZeros Whether regions with only scores of zero should be included or not

-out File name to save the gzipped matrix file needed by the "plotHeatmap" and "plotProfile" tools

--outFileSortedRegions BED file in which the regions are saved after skipping zeros or applying min/max threshold values
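To get a feel for what --referencePoint TSS with -b 2000 -a 2000 covers, the windows can be approximated by hand from a BED file: the TSS is taken as the start of a gene on the + strand and the end of a gene on the - strand, and 2000 bp are added on each side. This is only a rough sketch of the idea, not deeptools' own implementation, and the gene records are invented.

```shell
workdir=$(mktemp -d)
# Toy BED6 records (chrom, start, end, name, score, strand); coordinates are made up
printf 'chr1\t5000\t9000\tgeneA\t0\t+\n'   >  "$workdir/genes.bed"
printf 'chr1\t20000\t26000\tgeneB\t0\t-\n' >> "$workdir/genes.bed"
printf 'chr2\t1000\t4000\tgeneC\t0\t+\n'   >> "$workdir/genes.bed"

# TSS = start for + strand genes, end for - strand genes; clip the window at 0
windows=$(awk -F '\t' -v OFS='\t' -v flank=2000 '{
    tss = ($6 == "+") ? $2 : $3
    start = tss - flank
    if (start < 0) start = 0
    print $1, start, tss + flank, $4, $5, $6
}' "$workdir/genes.bed")
echo "$windows"
```

With equal upstream and downstream distances the window is symmetric, so strand only decides which end of the gene serves as the reference point.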

plotHeatmap parameters

-m Matrix file from the computeMatrix tool

-out File name to save the image to

plotProfile parameters

--dpi Set the DPI to save the figure. (default: 200)

-m Matrix file from the computeMatrix tool

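-out File name to save the image to. The file ending will be used to determine the image format. The available options are: "png", "eps", "pdf" and "svg", e.g., MyHeatmap.png.

--plotFileFormat Image format type. If given, this option overrides the image format based on the file ending.

8. Peak annotation

After the routine ChIP-seq processing above, the raw sequencer output has been turned into a BED file of peak records. Peak annotation means finding out which part of the genome each peak falls in, and how the peaks are distributed across genomic regions (upstream and downstream of genes, 5' and 3' UTRs, promoters, introns, exons, intergenic regions, microRNA regions). A typical experiment yields close to ten thousand peaks, so annotation has to be done in batch. With reasonable scripting skills, you can download the GFF annotation file of the reference genome and write the annotation script yourself.

Annotating peaks with a genome annotation file (GFF/GTF)

Download the GFF annotation file of the reference genome from: ftp://ftp.ensembl.org/pub/release-91/gff3/mus_musculus/Mus_musculus.GRCm38.91.gff3.gz

Annotate the peaks with BEDTools intersectBed: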

$ bedtools intersect -wa -wb -a peaks.bed -b mm10.gff3 >ChIP_seq/CallPeak/peaks.anno.bed

-wa Write the original entry in A for each overlap

-wb Write the original entry in B for each overlap. Useful for knowing what A overlaps

-a BAM/BED/GFF/VCF file “A”

-b One or more BAM/BED/GFF/VCF file(s) “B”

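Note: check whether the chromosome names in the first column of peaks.bed and mm10.gff3 follow the same convention, i.e. 1, 2, ..., X, Y versus chr1, chr2, ..., chrX, chrY. If they differ, unify them first:

# Convert chr1, chr2, ..., chrX, chrY to 1, 2, ..., X, Y
$ perl -ane '($chr = $F[0]) =~ s/^chr//; print join("\t", $chr, @F[1..4]), "\n"' peaks.bed >peaks.convert.bed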
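Whether the two files agree can be checked by comparing the sets of names in their first columns. The sketch below lists chromosome names that occur in the peak file but not in the annotation; an empty result means the naming is consistent. The toy files are invented for illustration.

```shell
workdir=$(mktemp -d)
# Toy peak file with UCSC-style names and toy GFF with Ensembl-style names
printf 'chr1\t100\t200\nchrX\t300\t400\n' > "$workdir/peaks.bed"
printf '##gff-version 3\n1\ttoy\tgene\t50\t500\t.\t+\t.\tID=g1\nX\ttoy\tgene\t250\t450\t.\t+\t.\tID=g2\n' > "$workdir/toy.gff3"

# Collect the distinct chromosome names of each file (GFF comment lines skipped)
cut -f1 "$workdir/peaks.bed" | sort -u > "$workdir/peaks.names"
grep -v '^#' "$workdir/toy.gff3" | cut -f1 | sort -u > "$workdir/gff.names"

# Names present in the peak file but absent from the annotation
missing=$(comm -23 "$workdir/peaks.names" "$workdir/gff.names")
echo "$missing"
```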

Inspecting the annotation results

$ cut -f8 peaks.anno.bed | sort | uniq -c

# The counts are as follows

#  30459 biological_region
#  2344 CDS
#    10 C_gene_segment
# 70534 chromosome
#  6181 exon
#   291 five_prime_UTR
# 27836 gene
#     2 gene_segment
# 22244 lnc_RNA
#     5 miRNA
# 76036 mRNA
#     5 ncRNA
#  4212 ncRNA_gene
#   491 pseudogene
#   401 pseudogenic_transcript
#     3 rRNA
#     2 snoRNA
#     3 snRNA
#  1546 three_prime_UTR
#   192 transcript
#     6 V_gene_segment
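In counts like these, the chromosome feature spans an entire chromosome and therefore overlaps every peak, which makes it uninformative; biological_region is similarly broad. One possible refinement, shown here as a sketch on invented rows, is to drop those two feature types before counting.

```shell
workdir=$(mktemp -d)
# Helper that writes one toy intersect row: 5 BED columns followed by 9 GFF columns,
# so the GFF feature type lands in column 8. All values are made up.
mk() { printf '1\t100\t200\t%s\t50\t1\ttoy\t%s\t1\t1000\t.\t+\t.\tID=toy\n' "$1" "$2"; }
{
    mk peak_1 chromosome
    mk peak_1 gene
    mk peak_1 exon
    mk peak_2 chromosome
    mk peak_2 gene
} > "$workdir/peaks.anno.bed"

# Count feature types, ignoring the two catch-all categories
counts=$(cut -f8 "$workdir/peaks.anno.bed" \
    | grep -v -x -e chromosome -e biological_region \
    | sort | uniq -c | awk '{ print $2 "\t" $1 }')
echo "$counts"
```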

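Annotating peaks with the R package ChIPpeakAnno

Here we use the Bioconductor package ChIPpeakAnno to annotate ChIP-seq peaks. The example below is the one shipped with the package:

# The package is very easy to use: read the prepared peak files (GSM1278641XuMUTrep1BAF155_MUT.peaks.bed and so on)
# in with toGRanges or import to obtain a GRanges object

# Compare the overlap between two peak files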
library(ChIPpeakAnno)
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE)
## one can also try import from rtracklayer
library(rtracklayer)
gr1.import <- import(bed, format="BED")
identical(start(gr1), start(gr1.import))
gr1[1:2]
gr1.import[1:2] #note the name slot is different from gr1
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(ol)

# Peak annotation
data(TSS.human.GRCh37) ## Annotation relies mainly on this GRanges object; getAnnotation can fetch other GRanges objects for annotation
## featureType: TSS, miRNA, Exon, 5'UTR, 3'UTR, transcript or Exon plus UTR
peaks=MUT_rep1_peaks
macs.anno <- annotatePeakInBatch(peaks, AnnotationData=TSS.human.GRCh37,
	output="overlapping", maxgap=5000L)

To see where the peaks fall relative to gene structure:

require(TxDb.Hsapiens.UCSC.hg19.knownGene)
aCR<-assignChromosomeRegion(peaks, nucleotideLevel=FALSE,
	precedence=c("Promoters", "immediateDownstream",
	"fiveUTRs", "threeUTRs",
	"Exons", "Introns"),
	TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
barplot(aCR$percentage)

The result looks similar to the figure below.

Web-based tools are also available for peak annotation:

ChIPseek

GREAT

9. Motif analysis

For a survey of well-known motif discovery tools, see: https://biologydirect.biomedcentral.com/articles/10.1186/1745-6150-9-4

Motif analysis first requires the sequences underlying the peak regions, which can be extracted with bedtools:

$ bedtools getfasta -fi input.fasta -bed input.bed -fo output.fasta
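When checking extracted sequences it helps to remember that BED coordinates are 0-based with an exclusive end, so an interval 2 to 6 covers bases 3 through 6 in 1-based counting. The awk sketch below reproduces that slicing on a toy sequence; it only illustrates the coordinate convention and is not a replacement for bedtools.

```shell
workdir=$(mktemp -d)
# Toy genome with a single 10-nt sequence and one BED interval (made-up data)
printf '>chr1\nACGTACGTAC\n' > "$workdir/toy.fasta"
printf 'chr1\t2\t6\n'        > "$workdir/toy.bed"

# First file: load sequences by name. Second file: slice with 0-based start, exclusive end.
seq_out=$(awk -F '\t' '
    NR == FNR { if ($0 ~ /^>/) name = substr($0, 2); else genome[name] = genome[name] $0; next }
    { print substr(genome[$1], $2 + 1, $3 - $2) }
' "$workdir/toy.fasta" "$workdir/toy.bed")
echo "$seq_out"
```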

Motifs can then be searched with RSAT (Regulatory Sequence Analysis Tools): NGS ChIP-seq -> peak-motifs. Set suitable parameters and run it online.

