Analysis pipeline for ChIP-seq

Author: 连明

12/24/2017 4:43:25 PM

[Figure: ChIP]

1. Preparing the reference genome

The reference genome can be obtained from either of two sources:

  1. UCSC: Download -> Mouse genome -> Full datasets -> chromFa.tar.gz

This yields a compressed archive of per-chromosome FASTA files. After extracting it, merge them into a single FASTA file:

$ tar zxvf chromFa.tar.gz && cat *.fa >mm10.fa
  2. ENSEMBL: Download -> Download data via FTP -> FTP site -> ../Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
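As a quick sanity check on the merged file from the UCSC route, the sequence names can be listed straight from the FASTA headers. The sketch below is only an illustration: `fasta_names` is a hypothetical helper, and the three-record file is made-up data standing in for a real mm10.fa.

```shell
# List the sequence names in a FASTA file, one per line.
fasta_names() {
    # Header lines start with '>'; keep only the first word after it.
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Tiny stand-in for the merged genome (hypothetical data, not real sequence).
demo=$(mktemp)
printf '>chr1\nACGT\n>chr2\nGGCC\n>chrM\nTTAA\n' > "$demo"

fasta_names "$demo"                              # one name per line
echo "sequences: $(fasta_names "$demo" | wc -l)" # number of records
rm -f "$demo"
```

On the real genome, `fasta_names mm10.fa` should list every chromosome that was in the archive. A missing or duplicated name suggests the merge step went wrong.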

2. Building the reference genome index

$ bowtie2-build ~/Ref/mm10/mm10.fa ~/Ref/mm10/mm10 1>~/Ref/mm10/mm10.bwt_index.log 2>&1

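3. Obtaining ChIP-seq data

The data can be downloaded from either SRA or ENA.

SRA

Data downloaded from SRA comes in the compressed sra file format and must be converted after download.

Downloading the data

To download a dataset, first use its GEO id to look up the corresponding SRP id (SRA project id) on NCBI, then download from the FTP site using that SRP id. For example, for the dataset with GEO id GSE42466, the GEO database at NCBI gives the SRP id SRP017311, so the FTP address is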

ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311/SRR620$i/SRR620$i.sra

A loop can download the whole dataset:

for ((i=204;i<=209;i++));
do
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311/SRR620$i/SRR620$i.sra;
done
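The address pattern can also be captured in a small function, which makes it easy to print and inspect the URLs before fetching anything. This is a sketch under the assumption that the directory layout is exactly the one shown in the example address. `sra_url` is a hypothetical helper name, and whether the server still serves this layout is not checked here.

```shell
# Build the download URL for one run from its study (SRP) and run (SRR) accessions.
sra_url() {
    srp=$1
    srr=$2
    # The directory above the study is the first six characters of the SRP id.
    prefix=$(printf '%s' "$srp" | cut -c1-6)
    printf 'ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/%s/%s/%s/%s.sra\n' \
        "$prefix" "$srp" "$srr" "$srr"
}

# Dry run: print the URLs for SRR620204..SRR620209 without downloading.
for i in $(seq 204 209); do
    sra_url SRP017311 "SRR620$i"
done
```

Piping the printed list into `xargs wget -c` would then perform the actual download.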

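Format conversion

The SRA format achieves roughly 6-9x compression, considerably more than the 2-3x of zip. To convert SRA files to FASTQ, use the fastq-dump command from the sratoolkit developed by NCBI:

$ fastq-dump --split-3 -O ChIP_seq/ SRR***.sra

For paired-end (PE) sra files, the --split-3 option splits the extracted FASTQ into *_1.fastq and *_2.fastq. The example dataset here is single-end (SE), so no splitting takes place.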

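ENA

Look up the SRP id for the dataset in the GEO database, search for that id directly at EBI, download the text file describing the samples, and then use that file to download everything in batch: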

$ tail -n +2 PRJNA182214.txt| cut -f16 | xargs wget -c -P Rawdata 
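Selecting the URL column by position (`-f16`) depends on which columns were ticked when the report was exported. A variant that looks the column up by its header name is sketched below. It assumes the report is tab-separated with a header row and that the URL column is called `fastq_ftp`. The helper name `ena_urls` and the two-row report are hypothetical.

```shell
# Print every URL found in the named column of a tab-separated report.
ena_urls() {
    report=$1
    column=${2:-fastq_ftp}    # assumed column name; pass another as 2nd argument
    awk -F '\t' -v col="$column" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        idx && $idx != "" {
            n = split($idx, urls, ";")    # paired-end runs list two files
            for (j = 1; j <= n; j++) print urls[j]
        }
    ' "$report"
}

# Made-up report: one paired-end run and one single-end run.
report=$(mktemp)
printf 'run_accession\tfastq_ftp\n' > "$report"
printf 'RUN1\tftp.example.org/RUN1_1.fastq.gz;ftp.example.org/RUN1_2.fastq.gz\n' >> "$report"
printf 'RUN2\tftp.example.org/RUN2.fastq.gz\n' >> "$report"

ena_urls "$report"
rm -f "$report"
```

With a real report the output could be fed to the same downloader, for example `ena_urls PRJNA182214.txt | xargs wget -c -P Rawdata`.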

4. Quality control

Inspect read quality with fastqc:

$ ls *.fastq | xargs fastqc -t 10 -o ChIP_seq/

Trim and filter reads with cutadapt. If cutadapt is not installed yet, it can be installed as follows:

$ conda install -c bioconda cutadapt
# 1. Single end
$ cutadapt -a ADAPTER -q 20,20 -m 20 -o outfile_QC.fastq infile.fastq
# 2. Paired end
$ cutadapt -j 8 -a ADAPTER -A ADAPTER -q 20,20 -m 20 -o outfile_QC_1.fastq -p outfile_QC_2.fastq infile_1.fastq infile_2.fastq
  • -j CORES Number of CPU cores to use. Default: 1
  • -a ADAPTER 3' adapter to be removed
  • -A ADAPTER 3' adapter to be removed from second read in a pair.
  • -q [5'CUTOFF,]3'CUTOFF Trim low-quality bases from the 5' and/or 3' end of each read before adapter removal
  • -m LENGTH Discard reads shorter than LENGTH. Default: 0

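A loop can process all files:

ls ChIP_seq/*.fastq | while read -r fq;
do
sample=$(basename "$fq" .fastq);  # strip the directory and the .fastq extension
cutadapt -q 20,20 -m 20 -o "ChIP_seq/${sample}_QC.fastq" "ChIP_seq/${sample}.fastq";
done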
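Because `-m 20` discards reads shorter than 20 bases, comparing read counts before and after trimming shows how much data the filter removed. The sketch below counts records by assuming the usual four-lines-per-read FASTQ layout. `count_reads` is a hypothetical helper and the two small files are made-up stand-ins for a raw and a trimmed file.

```shell
# Count the records in a FASTQ file, assuming four lines per read.
count_reads() {
    awk 'END { printf "%d\n", NR / 4 }' "$1"
}

# Made-up stand-ins: three reads before trimming, two after.
raw=$(mktemp)
trimmed=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n@r3\nGGCC\n+\nIIII\n' > "$raw"
printf '@r1\nACGT\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > "$trimmed"

before=$(count_reads "$raw")
after=$(count_reads "$trimmed")
echo "reads before: $before, after: $after, discarded: $((before - after))"
rm -f "$raw" "$trimmed"
```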

5. Aligning to the reference genome

ls ChIP_seq/Rawdata/*QC.fastq | while read i;
do
i=`basename $i _QC.fastq`;
echo $i;
bowtie2 -p 10 --local -x Ref/mm10/mm10 -U ChIP_seq/Rawdata/${i}_QC.fastq | samtools sort -@ 8 -O bam -o ChIP_seq/Map/${i}.sorted.bam;
done 1>ChIP_seq/Map/map.log 2>&1

The alignment statistics are saved in map.log.
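To get a per-sample overview out of that log, the summary lines can be pulled out with awk. The sketch relies on two assumptions: each sample name appears on a line of its own (written by the `echo` in the loop), and bowtie2 ends each summary with a line of the form `NN.NN% overall alignment rate`. `alignment_rates` is a hypothetical helper and the log contents below are invented numbers.

```shell
# Print "sample<TAB>rate" for every bowtie2 summary found in the log.
alignment_rates() {
    awk '
        NF == 1 { sample = $1 }    # a bare word is a sample name written by echo
        /overall alignment rate/ { print sample "\t" $1 }
    ' "$1"
}

# Invented log for two samples, in the layout the loop would produce.
log=$(mktemp)
cat > "$log" <<'EOF'
SRR620204
10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    596 (5.96%) aligned 0 times
    9404 (94.04%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
94.04% overall alignment rate
SRR620205
20000 reads; of these:
  20000 (100.00%) were unpaired; of these:
    1000 (5.00%) aligned 0 times
    15000 (75.00%) aligned exactly 1 time
    4000 (20.00%) aligned >1 times
95.00% overall alignment rate
EOF

alignment_rates "$log"
rm -f "$log"
```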

bowtie2 parameters

  • -p threads
  • --local local alignment
  • -x reference genome index
  • -U Files with unpaired reads

samtools sort parameters

  • -@ threads
  • -O Specify output format (SAM, BAM, CRAM)
  • -o Write final output to FILE rather than standard output

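6. Peak calling

Many peak calling tools are available; for a list see: http://wodaklab.org/nextgen/data/peakfinders.html

This step uses MACS2, a tool written for Python 2.7. If both Python 3.6 and Python 2.7 are in use on your system, make sure to activate Python 2.7 first (add the python2.7/anaconda2 installation directory to the PATH environment variable). To install:

Download and install anaconda2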

$ wget -c -P basic_tool/ https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh
$ sh basic_tool/Anaconda2-5.0.1-Linux-x86_64.sh
$ echo 'export PATH=../anaconda2/bin:$PATH' >>~/.bashrc

Download and install MACS2

# 1. Install from source
$ wget -c -P biosoft/ https://pypi.python.org/packages/9f/99/a8ac96b357f6b0a6f559fe0f5a81bcae12b98579551620ce07c5183aee2c/MACS2-2.1.1.20160309.tar.gz
$ cd biosoft  && tar zxvf MACS2-2.1.1.20160309.tar.gz
$ cd MACS2-2.1.1.20160309 && python setup.py install
$ echo 'export PATH=../MACS2-2.1.1.20160309/bin:$PATH' >>~/.bashrc

# 2. Install with bioconda
$ conda install -c bioconda macs2

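Once installed, MACS2 can be used directly for peak calling:

# MACS first builds a model of the shift between forward- and reverse-strand reads. Its key parameter is d, the estimated fragment size; bw (band width) sets the window used to scan for model-building regions, and reads are shifted by d/2 (the shiftsize). See the section on how peak calling works later in this document.
$ macs2 callpeak -c controlFile.bam -t treatmentFile.bam -m 10 30 -p pvalue -f BAM -g gsize -B -n filename.prefix --outdir ChIP_seq/CallPeak 2>ChIP_seq/CallPeak/filename.macs2.log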
  • -c Control file
  • -t Treatment file
  • -m Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model.
  • -g Effective genome size. shortcuts:'hs' for human (2.7e9), 'mm' for mouse(1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8).
  • -p P value cutoff
  • -f File format
  • -B Output a file in BEDGRAPH format to visualize the peak profiles in a genome browser. There will be one file for the treatment, and one for the control.
  • -n Experiment name, which will be used to generate output file names.

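For an exploration of MACS2 parameters, see: https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part1_peak_calling.md

To process several samples in a loop, first prepare a text file with one whitespace-separated line per treatment/control pair, in this column order:

Treatment_SRA_id  Treatment_SRA_name  Control_SRA_id  Control_SRA_name

Command: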

nohup cat ChIP_seq/CallPeak/ChIP_seq.pairs | while read -r treat_id treat_name control_id control_name;
do
# read splits each line into its four whitespace-separated columns; control_name is not used below
macs2 callpeak -c ChIP_seq/Map/${control_id}.sorted.bam -t ChIP_seq/Map/${treat_id}.sorted.bam -m 10 30 -p 1e-5 -f BAM -g mm -B -n $treat_name --outdir ChIP_seq/CallPeak/ 2>ChIP_seq/CallPeak/${treat_name}.macs2.log;
done &
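Since the loop runs in the background, a mistyped id in the pairs file only shows up later in a log. A pre-flight check that every listed BAM file exists can catch this early. The sketch below assumes the four-column pairs layout described above and the `<id>.sorted.bam` naming used in the alignment step. `check_pairs` is a hypothetical helper and the demo directory is a throwaway stand-in.

```shell
# Report every treatment/control BAM named in the pairs file that is missing.
# Returns 0 when all files exist, 1 otherwise.
check_pairs() {
    pairs=$1
    map_dir=$2
    missing=0
    while read -r treat_id treat_name control_id control_name; do
        [ -n "$treat_id" ] || continue    # skip blank lines
        for id in "$treat_id" "$control_id"; do
            if [ ! -f "$map_dir/$id.sorted.bam" ]; then
                echo "missing: $map_dir/$id.sorted.bam"
                missing=1
            fi
        done
    done < "$pairs"
    return "$missing"
}

# Throwaway demo: the control BAM for the only pair is absent.
demo=$(mktemp -d)
printf 'SRR620204 treat_rep1 SRR620208 input_rep1\n' > "$demo/ChIP_seq.pairs"
: > "$demo/SRR620204.sorted.bam"

check_pairs "$demo/ChIP_seq.pairs" "$demo" || echo "fix the pairs file before running macs2"
rm -rf "$demo"
```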

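How peak calling works

For the underlying statistics, see this blog post: https://www.plob.org/article/7227.html
For the peak calling procedure itself, see this article: https://www.plob.org/article/3760.html
The following two figures give a good picture of the peak calling process:

(1) Building a signal profile

(2) Peak calling

Looking at how others approached peak calling in its early days may also help in understanding the principle; see the post 八年前的ChIP-seq怎么找peak (How peaks were found in ChIP-seq eight years ago).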

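7. Visualization

Viewing the peaks in IGV

In the peak calling step, adding the -B option to macs2 callpeak produces two bedGraph files holding the peak profiles, one for the control and one for the treatment. These can be used to visualize the peaks in a genome browser. Options include:

  • Integrative Genomics Viewer (IGV): installed and run locally, so no data needs to be transferred
  • UCSC genome browser: an online tool; the reference genome must be available in the browser

Visualization with deeptools

Converting BAM files to bw (bigWig) files requires deeptools, a tool written in Python: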

ls ChIP_seq/Map/*.bam | while read -r i;
do
sample=$(basename "$i" .bam);
samtools rmdup -s ChIP_seq/Map/${sample}.bam ChIP_seq/Map/${sample}.nodup.bam; # Remove duplicate reads (-s: single-end data)
samtools index -@ 10 ChIP_seq/Map/${sample}.nodup.bam ChIP_seq/Map/${sample}.nodup.bai; # Index the BAM file
bamCoverage -b ChIP_seq/Map/${sample}.nodup.bam --binSize 10 -o ChIP_seq/Map/${sample}.bw; # Compute coverage in 10 bp bins
done

bamCoverage parameters

  • -b BAM file
  • --outFileFormat Output file type. Either "bigwig" or "bedgraph" (default: bigwig)
  • --binSize Bin size
  • -o Output file name

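The resulting bw files can be visualized in IGV.

Peaks are not randomly distributed across the genome. If they concentrate near transcription start sites (TSS), plot the signal intensity around the TSS. Experimental treatments can change the distribution of peaks, and the signal intensity changes with it, so read these plots with both the analysis results and the biology in mind.

Profile and heatmap plots around gene TSSs can be drawn for each sample separately, or for all ChIP-seq samples combined. First download the RefSeq annotation for the mm10 genome, which is available from the UCSC Table Browser.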

$ computeMatrix reference-point -p 10 --referencePoint TSS -b 2000 -a 2000 -S ../*bw -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed --skipZeros -o tmp4.mat.gz
$ plotHeatmap -m tmp4.mat.gz -out tmp4.merge.png
$ plotProfile --dpi 720 -m tmp4.mat.gz -out tmp4.profile.pdf --plotFileFormat pdf --perGroup
$ plotHeatmap --dpi 720 -m tmp4.mat.gz -out tmp4.merge.pdf --plotFileFormat pdf

computeMatrix reference-point parameters

  • --referencePoint {TSS,TES,center}
  • -b Distance upstream of the reference-point selected
  • -a Distance downstream of the reference-point selected
  • -R Reference annotation file
  • -S bigWig file(s) containing the scores to be plotted
  • --skipZeros Whether regions with only scores of zero should be included or not
  • -out File name to save the gzipped matrix file needed by the "plotHeatmap" and "plotProfile" tools
  • --outFileSortedRegions File name of the BED file in which the regions are saved after skipping zeros or min/max threshold values

plotHeatmap parameters

  • -m Matrix file from the computeMatrix tool
  • -out File name to save the image to

plotProfile parameters

  • --dpi Set the DPI to save the figure. (default: 200)
  • -m Matrix file from the computeMatrix tool
  • -out File name to save the image to. The file ending will be used to determine the image format. The available options are: "png", "eps", "pdf" and "svg", e.g., MyHeatmap.png.
  • --plotFileFormat Image format type; if given, it overrides the format inferred from the output file ending

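8. Peak annotation

After the routine ChIP-seq processing above, the raw sequencer output has been turned into a BED file of peak records. Peak annotation asks where each peak falls in the genome, that is, how the peaks are distributed over genomic regions (upstream and downstream of genes, 5' and 3' UTRs, promoters, introns, exons, intergenic regions, microRNA regions). A typical experiment yields close to ten thousand peaks, so annotation has to be done in batch. With some scripting skill and the GFF annotation file for the reference genome, you can write the annotation step yourself.

Annotating peaks with a GFF/GTF genome annotation file

Download the GFF annotation file for the reference genome from: ftp://ftp.ensembl.org/pub/release-91/gff3/mus_musculus/Mus_musculus.GRCm38.91.gff3.gz

Annotating peaks with BEDTools' intersectBed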

$ bedtools intersect -wa -wb -a peaks.bed -b mm10.gff3 >ChIP_seq/CallPeak/peaks.anno.bed
  • -wa Write the original entry in A for each overlap
  • -wb Write the original entry in B for each overlap. Useful for knowing what A overlaps
  • -a BAM/BED/GFF/VCF file “A”
  • -b One or more BAM/BED/GFF/VCF file(s) “B”

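Note: check whether the chromosome names in the first column of peaks.bed and mm10.gff3 are written the same way, i.e. as 1, 2, ..., X, Y or as chr1, chr2, ..., chrX, chrY. If they differ, make them consistent first:

# Convert chr1, chr2, ..., chrX, chrY to 1, 2, ..., X, Y (keeps the first five columns)
$ perl -lane '$F[0] =~ s/^chr//; print join("\t", @F[0..4])' peaks.bed >peaks.convert.bed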
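Whether the conversion is needed at all can be checked by looking at the first data line of each file. The sketch below is a simple heuristic, not a full validation: it inspects only the first non-comment line and assumes the chromosome name is the first column. `chrom_style` is a hypothetical helper and the two files are made-up stand-ins for peaks.bed and mm10.gff3.

```shell
# Report whether chromosome names in a BED/GFF file start with "chr".
chrom_style() {
    awk '
        /^#/ || NF == 0 { next }    # skip GFF comments and blank lines
        { if ($1 ~ /^chr/) print "chr"; else print "nochr"; exit }
    ' "$1"
}

# Made-up stand-ins for peaks.bed and the GFF3 annotation.
bed=$(mktemp)
gff=$(mktemp)
printf 'chr1\t100\t200\tpeak1\t50\n' > "$bed"
printf '##gff-version 3\n1\tensembl\tgene\t50\t500\t.\t+\t.\tID=g1\n' > "$gff"

if [ "$(chrom_style "$bed")" = "$(chrom_style "$gff")" ]; then
    echo "chromosome naming matches"
else
    echo "chromosome naming differs: convert one file before intersecting"
fi
rm -f "$bed" "$gff"
```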

Inspect the peak annotation results

$ cut -f8 peaks.anno.bed | sort | uniq -c

# The tally looks like this:

#  30459 biological_region
#  2344 CDS
#    10 C_gene_segment
# 70534 chromosome
#  6181 exon
#   291 five_prime_UTR
# 27836 gene
#     2 gene_segment
# 22244 lnc_RNA
#     5 miRNA
# 76036 mRNA
#     5 ncRNA
#  4212 ncRNA_gene
#   491 pseudogene
#   401 pseudogenic_transcript
#     3 rRNA
#     2 snoRNA
#     3 snRNA
#  1546 three_prime_UTR
#   192 transcript
#     6 V_gene_segment
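In a tally like this, the `chromosome` row is not informative: the Ensembl GFF3 contains one whole-chromosome feature per sequence, so every peak overlaps it. A variant that drops this row and sorts by count is sketched below. It assumes, as the `cut -f8` above does, that peaks.bed has five columns so the GFF feature type ends up in column 8. `feature_counts` is a hypothetical helper and the input rows are invented.

```shell
# Tally GFF feature types (column 8), ignoring whole-chromosome features,
# and print "count<TAB>type" sorted by descending count.
feature_counts() {
    awk -F '\t' '
        $8 != "chromosome" { n[$8]++ }
        END { for (f in n) print n[f] "\t" f }
    ' "$1" | sort -k1,1nr -k2,2
}

# Invented annotation output: 5 BED columns followed by 9 GFF columns.
anno=$(mktemp)
row='1\t100\t200\t%s\t50\t1\tensembl\t%s\t1\t1000\t.\t+\t.\tID=x\n'
printf "$row" p1 chromosome >  "$anno"
printf "$row" p1 gene       >> "$anno"
printf "$row" p1 exon       >> "$anno"
printf "$row" p2 chromosome >> "$anno"
printf "$row" p2 gene       >> "$anno"

feature_counts "$anno"
rm -f "$anno"
```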

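Annotating peaks with the R package ChIPpeakAnno

Here we use the Bioconductor package ChIPpeakAnno to annotate ChIP-seq peaks. The example below ships with the package:

# The package is simple to use: read your peaks files (GSM1278641XuMUTrep1BAF155_MUT.peaks.bed and so on)
# with toGRanges or import to obtain a GRanges object

# Compare the overlap between two peaks files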
library(ChIPpeakAnno)
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE)
## one can also try import from rtracklayer
library(rtracklayer)
gr1.import <- import(bed, format="BED")
identical(start(gr1), start(gr1.import))
gr1[1:2]
gr1.import[1:2] #note the name slot is different from gr1
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(ol)

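# Peak annotation
data(TSS.human.GRCh37) ## The annotation relies on this GRanges object; getAnnotation can fetch other GRanges objects to annotate against
## featureType : TSS, miRNA, Exon, 5'UTR, 3'UTR, transcript or Exon plus UTR
peaks=MUT_rep1_peaks ## MUT_rep1_peaks: a GRanges object of your own peaks, read in as described above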
macs.anno <- annotatePeakInBatch(peaks, AnnotationData=TSS.human.GRCh37,
	output="overlapping", maxgap=5000L)

To see which parts of the gene structure the peaks fall in:

require(TxDb.Hsapiens.UCSC.hg19.knownGene)
aCR<-assignChromosomeRegion(peaks, nucleotideLevel=FALSE,
	precedence=c("Promoters", "immediateDownstream",
	"fiveUTRs", "threeUTRs",
	"Exons", "Introns"),
	TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
barplot(aCR$percentage)

The result resembles the figure below.

Web-based tools can also be used for peak annotation:

9. Motif analysis

For an overview of well-known motif discovery tools, see: https://biologydirect.biomedcentral.com/articles/10.1186/1745-6150-9-4

Motif analysis first requires the sequences underlying the peak regions, which can be extracted with bedtools:

$ bedtools getfasta -fi input.fasta -bed input.bed -fo output.fasta

Then use RSAT (Regulatory Sequence Analysis Tools) to search for motifs: NGS ChIP-seq -> peak-motifs, set suitable parameters, and run it online.
