Author: 连明
12/24/2017 4:43:25 PM
1. Preparing the reference genome
The reference genome can be obtained from either of two sources:
- UCSC:
Download
->Mouse genome
->Full datasets
->chromFa.tar.gz
The download is a compressed archive containing one FASTA file per chromosome. After extraction, merge them into a single FASTA file:
$ tar zxvf chromFa.tar.gz && cat chr*.fa >mm10.fa
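A quick sanity check after merging is to count the records in the combined file. The sketch below is only an illustration: the helper names `count_fasta_records` and `fasta_names` are hypothetical, and the check assumes each extracted chromosome file holds exactly one sequence.

```shell
# Count FASTA records (header lines starting with '>') in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print all sequence names, one per line, without the leading '>'
# and without any description after the first space.
fasta_names() {
    grep '^>' "$1" | sed 's/^>//' | cut -d' ' -f1
}
```

Under that assumption, `count_fasta_records mm10.fa` should equal the number of `chr*.fa` files that were extracted, and `fasta_names mm10.fa` lists the chromosome names the aligner will report.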
- ENSEMBL:
Download
->Download data via FTP
->FTP site
->../Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
2. Building the reference genome index
$ bowtie2-build ~/Ref/mm10/mm10.fa ~/Ref/mm10/mm10 1>~/Ref/mm10/mm10.bwt_index.log 2>&1
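Index building takes a long time, so it can be worth confirming that it finished before starting the alignment. bowtie2-build writes six files per index, with the suffix `.bt2` for a small index or `.bt2l` for a large one. The following is a minimal sketch; the function name `index_complete` is hypothetical.

```shell
# Return 0 if all six bowtie2 index files exist and are non-empty for the
# given prefix. The small-index suffix (.bt2) is checked first, then the
# large-index suffix (.bt2l).
index_complete() {
    prefix=$1
    for ext in bt2 bt2l; do
        missing=0
        for part in 1 2 3 4 rev.1 rev.2; do
            [ -s "$prefix.$part.$ext" ] || missing=1
        done
        [ "$missing" -eq 0 ] && return 0
    done
    return 1
}
```

For example, `index_complete ~/Ref/mm10/mm10 || echo "index incomplete, check mm10.bwt_index.log"`.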
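3. Obtaining ChIP-seq data
Data can be downloaded from either SRA or ENA.
Data downloaded from SRA come in sra, a dedicated compressed file format, and must be converted after download.
To download, first use the dataset's GEO id to look up the corresponding SRP id (SRA project id) at NCBI, then download from the FTP site using the SRP id. For example, if the GEO id is GSE42466 and its SRP id found in NCBI's GEO database is SRP017311, the FTP address is
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311/SRR620$i/SRR620$i.sra
A loop can be used to download this dataset: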
for ((i=204;i<=209;i++));
do
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP017/SRP017311/SRR620$i/SRR620$i.sra;
done
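The directory layout in that FTP address follows a regular pattern, so the URLs can also be generated rather than typed. This is a sketch under the assumption that the layout quoted in this section is still served by NCBI, which may no longer be the case. The function names `sra_url` and `sra_urls_for_range` are hypothetical.

```shell
# Build the FTP URL for one run, following the layout quoted in the text:
#   .../ByStudy/sra/SRP/<first 6 chars of SRP id>/<SRP id>/<SRR id>/<SRR id>.sra
sra_url() {
    srp=$1
    srr=$2
    base="ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP"
    prefix=$(printf '%s' "$srp" | cut -c1-6)
    printf '%s/%s/%s/%s/%s.sra\n' "$base" "$prefix" "$srp" "$srr" "$srr"
}

# Print the URLs for a numeric range of runs that share a common prefix,
# e.g. SRR620 followed by 204..209.
sra_urls_for_range() {
    srp=$1; run_prefix=$2; first=$3; last=$4
    i=$first
    while [ "$i" -le "$last" ]; do
        sra_url "$srp" "${run_prefix}${i}"
        i=$((i + 1))
    done
}
```

Printing the URLs first makes it easy to inspect them before downloading, for example `sra_urls_for_range SRP017311 SRR620 204 209 | xargs -n 1 wget -c`.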
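The SRA format compresses data roughly 6-9 fold, considerably more than the 2-3 fold of zip. Converting SRA to fastq requires the fastq-dump command from sratoolkit, developed by NCBI:
$ fastq-dump --split-3 -O ChIP_seq/ SRR***.sra
For paired-end (PE) data, --split-3 splits the fastq extracted from the sra file into *_1.fastq and *_2.fastq. The example dataset is single-end (SE), so no splitting takes place.
To download from ENA instead, obtain the corresponding SRP id from a GEO search, search for that id directly at EBI, download the text file of sample information, and then batch download according to that file: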
$ tail -n +2 PRJNA182214.txt| cut -f16 | xargs wget -c -P Rawdata
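The command above picks the URL column by position (`-f16`), which breaks if the report is downloaded with a different set of columns. A more defensive variant looks the column up by its header name. This is a sketch only: the function name `ena_column` is hypothetical, and it assumes the report is tab-separated with a header row and that several URLs in one cell are separated by `;`.

```shell
# Print the values of a named column from a tab-separated report that has a
# header row. Multiple entries in one cell (separated by ';') are printed one
# per line. Exits with status 1 if the column name is not in the header.
ena_column() {
    report=$1
    column=$2
    awk -F'\t' -v col="$column" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        $idx != "" {
            n = split($idx, parts, ";")
            for (j = 1; j <= n; j++) print parts[j]
        }
    ' "$report"
}
```

Assuming the URL column is named `fastq_ftp`, a batch download would then read `ena_column PRJNA182214.txt fastq_ftp | xargs wget -c -P Rawdata`.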
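4. Quality control
Check read quality with fastqc:
$ ls *.fastq | xargs fastqc -t 10 -o ChIP_seq/
Run QC (adapter and quality trimming) with cutadapt. If cutadapt is not yet installed, it can be installed as follows: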
$ conda install -c bioconda cutadapt
# 1. Single end
$ cutadapt -a ADAPTER -q 20,20 -m 20 -o outfile_QC.fastq infile.fastq
# 2. Paired end
$ cutadapt -j 8 -a ADAPTER -A ADAPTER -q 20,20 -m 20 -o outfile_QC_1.fastq -p outfile_QC_2.fastq infile_1.fastq infile_2.fastq
- -j CORES Number of CPU cores to use. Default: 1
- -a ADAPTER 3' adapter to be removed
- -A ADAPTER 3' adapter to be removed from second read in a pair.
- -q [5'CUTOFF,]3'CUTOFF Trim low-quality bases from the 5' and/or 3' ends of each read before adapter removal.
- -m LENGTH Discard reads shorter than LENGTH. Default: 0
A loop can be used to process all files:
ls ChIP_seq/*.fastq | while read i;
do
i=$(basename "$i" .fastq);
cutadapt -q 20,20 -m 20 -o ChIP_seq/${i}_QC.fastq ChIP_seq/${i}.fastq;
done
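Because the trimmed files are written into the same directory as the raw files, running the loop a second time would also pick up the `*_QC.fastq` outputs. One way around this is to generate the commands first and skip files that are already trimmed. The helper name `qc_commands` below is hypothetical, and the sketch only prints commands without executing anything.

```shell
# Print one cutadapt command per raw FASTQ file in a directory, skipping
# files that are already trimmed (*_QC.fastq). Nothing is executed.
qc_commands() {
    dir=$1
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue            # directory without any .fastq file
        case $f in *_QC.fastq) continue ;; esac
        sample=$(basename "$f" .fastq)
        printf 'cutadapt -q 20,20 -m 20 -o %s/%s_QC.fastq %s\n' "$dir" "$sample" "$f"
    done
}
```

Review the output of `qc_commands ChIP_seq`, then pipe it to `sh` to run the trimming.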
5. Aligning to the reference genome
ls ChIP_seq/Rawdata/*QC.fastq | while read i;
do
i=`basename $i _QC.fastq`;
echo $i;
bowtie2 -p 10 --local -x Ref/mm10/mm10 -U ChIP_seq/Rawdata/${i}_QC.fastq | samtools sort -@ 8 -O bam -o ChIP_seq/Map/${i}.sorted.bam;
done 1>ChIP_seq/Map/map.log 2>&1
Alignment statistics are saved in map.log.
bowtie2 parameters
- -p threads
- --local local alignment
- -x reference genome index
- -U Files with unpaired reads
samtools parameters
- -@ threads
- -O Specify output format (SAM, BAM, CRAM)
- -o Write final output to FILE rather than standard output
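The log written by the alignment loop interleaves the sample names printed by `echo $i` with the bowtie2 summaries, so a per-sample table of alignment rates can be pulled out with awk. This sketch assumes each sample name is a line consisting of a single word and that bowtie2 ends each summary with a line of the form `NN.NN% overall alignment rate`. The function name `alignment_rates` is hypothetical.

```shell
# Print "sample<TAB>rate" for each sample in the alignment log.
# A line with a single word is taken as the current sample name; the next
# "...% overall alignment rate" line supplies that sample's rate.
alignment_rates() {
    awk '
        /overall alignment rate$/ { print sample "\t" $1; next }
        NF == 1                   { sample = $1 }
    ' "$1"
}
```

Usage: `alignment_rates ChIP_seq/Map/map.log`. Samples with an unexpectedly low rate are worth checking before peak calling.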
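6. Peak calling
Many peak calling tools are currently available; see: http://wodaklab.org/nextgen/data/peakfinders.html
This step uses MACS2, a tool written in Python 2.7. If you use both Python 3.6 and Python 2.7, be sure to activate Python 2.7 before running it (add the python2.7/anaconda2 installation directory to the PATH environment variable). To install:
Download and install anaconda2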
$ wget -c -P basic_tool/ https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh
$ sh basic_tool/Anaconda2-5.0.1-Linux-x86_64.sh
$ echo 'export PATH=../anaconda2/bin:$PATH' >>~/.bashrc
Download and install MACS2
# 1. Install from source
$ wget -c -P biosoft/ https://pypi.python.org/packages/9f/99/a8ac96b357f6b0a6f559fe0f5a81bcae12b98579551620ce07c5183aee2c/MACS2-2.1.1.20160309.tar.gz
$ cd biosoft && tar zxvf MACS2-2.1.1.20160309.tar.gz
$ cd MACS2-2.1.1.20160309 && python setup.py install
$ echo 'export PATH=../MACS2-2.1.1.20160309/bin:$PATH' >>~/.bashrc
# 2. Install with bioconda
$ conda install -c bioconda macs2
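Once installed, MACS2 can be used directly for peak calling:
# MACS first builds a model whose key parameter is d, the estimated fragment size (the distance between the forward- and reverse-strand tag peaks). Half of d is the shift size. The band width bw is a separate setting used when scanning for the regions from which the model is built. See the articles on the underlying principles referenced later.
$ macs2 callpeak -c controlFile.bam -t treatmentFile.bam -m 10 30 -p pvalue -f BAM -g gsize -B -n filename.prefix --outdir ChIP_seq/CallPeak 2>ChIP_seq/CallPeak/filename.macs2.log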
- -c Control file
- -t Treatment file
- -m Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model.
- -g Effective genome size. shortcuts:'hs' for human (2.7e9), 'mm' for mouse(1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8).
- -p P value cutoff
- -f File format
- -B Output a file in BEDGRAPH format to visualize the peak profiles in a genome browser. There will be one file for the treatment, and one for the control.
- -n Experiment name, which will be used to generate output file names.
For an exploration of MACS2 parameters, see: https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part1_peak_calling.md
For batch processing, first prepare a whitespace-delimited text file with the following columns, one treatment/control pair per line:
| Treatment SRA id | Treatment SRA name | Control SRA id | Control SRA name |
|---|---|---|---|
Command:
# read splits each line of ChIP_seq.pairs into its four fields.
# The loop runs in the background; start it inside screen or tmux if it must survive logout.
while read -r treat_id treat_name control_id control_name;
do
macs2 callpeak -c ChIP_seq/Map/${control_id}.sorted.bam -t ChIP_seq/Map/${treat_id}.sorted.bam -m 10 30 -p 1e-5 -f BAM -g mm -B -n "$treat_name" --outdir ChIP_seq/CallPeak/ 2>ChIP_seq/CallPeak/${treat_name}.macs2.log;
done <ChIP_seq/CallPeak/ChIP_seq.pairs &
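A typo in the pairs file only shows up once macs2 fails partway through the batch, so it can help to validate the file beforehand. The sketch below checks that each line has exactly four fields and that both BAM files exist under the naming used in this section (`<id>.sorted.bam`). The function name `check_pairs` is hypothetical.

```shell
# Validate a pairs file: every line needs four whitespace-separated fields,
# and both <id>.sorted.bam files must exist in bam_dir. Problems are reported
# on stderr; the return status is non-zero if any were found.
check_pairs() {
    pairs=$1
    bam_dir=$2
    status=0
    line_no=0
    while read -r treat_id treat_name control_id control_name extra; do
        line_no=$((line_no + 1))
        if [ -z "$control_name" ] || [ -n "$extra" ]; then
            echo "line $line_no: expected 4 fields" >&2
            status=1
            continue
        fi
        for id in "$treat_id" "$control_id"; do
            [ -e "$bam_dir/$id.sorted.bam" ] || {
                echo "line $line_no: missing $bam_dir/$id.sorted.bam" >&2
                status=1
            }
        done
    done < "$pairs"
    return "$status"
}
```

Usage: `check_pairs ChIP_seq/CallPeak/ChIP_seq.pairs ChIP_seq/Map && echo "pairs file OK"`.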
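For the statistical principles, see this blog post: https://www.plob.org/article/7227.html
For how peak calling works, see this article: https://www.plob.org/article/3760.html
The following two figures illustrate the peak calling process well:
(1) Building a signal profile
(2) Peak calling
Seeing how others approached peak calling in its early days may also help in understanding the principle: "How peaks were found in ChIP-seq eight years ago" (八年前的ChIP-seq怎么找peak)
7. Visualization
In the peak calling step, adding -B to macs2 callpeak outputs two bedGraph files holding the peak profiles of the control and the treatment, respectively. The peaks can be visualized in a genome browser. Available genome browsers: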
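- Integrative Genomics Viewer (IGV): installed and run locally, which avoids transferring data
- UCSC genome browser: an online tool; a reference genome must be available for it
Converting bam files to bw (bigWig) files requires deeptools, a tool written in Python: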
ls ChIP_seq/Map/*.sorted.bam | while read i;
do
sample=$(basename "$i" .bam);
samtools rmdup -s ChIP_seq/Map/${sample}.bam ChIP_seq/Map/${sample}.nodup.bam; # Remove duplicate reads (-s: single-end data)
samtools index -@ 10 ChIP_seq/Map/${sample}.nodup.bam ChIP_seq/Map/${sample}.nodup.bai; # Index the BAM file
bamCoverage -b ChIP_seq/Map/${sample}.nodup.bam --binSize 10 -o ChIP_seq/Map/${sample}.bw; # Write the coverage track as bigWig
done
bamCoverage parameters
- -b BAM file
- --outFileFormat Output file type. Either "bigwig" or "bedgraph" (default: bigwig)
- --binSize, -bs Bin size
- -o Output file name
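The resulting bw files can be visualized with IGV.
Peaks are not distributed randomly across the genome. If they concentrate near transcription start sites (TSS), the signal intensity around the TSS can be plotted. Experimental treatments can change the distribution of peaks, and the signal intensity changes with it, so the results should be read together with the biology behind them.
You can plot the profile and heatmap around gene TSSs for each sample separately, or combine all ChIP-seq bam files and plot them together. First download the RefSeq annotation for the mm10 genome, which is available from the UCSC Table Browser: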
$ computeMatrix reference-point -p 10 --referencePoint TSS -b 2000 -a 2000 -S ../*bw -R ~/annotation/CHIPseq/mm10/ucsc.refseq.bed --skipZeros -o tmp4.mat.gz
$ plotHeatmap -m tmp4.mat.gz -out tmp4.merge.png
$ plotProfile --dpi 720 -m tmp4.mat.gz -out tmp4.profile.pdf --plotFileFormat pdf --perGroup
$ plotHeatmap --dpi 720 -m tmp4.mat.gz -out tmp4.merge.pdf --plotFileFormat pdf
computeMatrix reference-point parameters
- --referencePoint {TSS,TES,center}
- -b Distance upstream of the reference-point selected
- -a Distance downstream of the reference-point selected
- -R Reference annotation file
- -S bigWig file(s) containing the scores to be plotted
- --skipZeros Whether regions with only scores of zero should be included or not
- -out File name to save the gzipped matrix file needed by the "plotHeatmap" and "plotProfile" tools
- --outFileSortedRegions File name of the BED file in which the regions are saved after skipping zeros or min/max threshold values
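To make the meaning of `--referencePoint TSS -b 2000 -a 2000` concrete, the windows it scores can be approximated with awk: upstream and downstream are defined relative to the strand of each gene. This is an illustrative sketch, not what computeMatrix runs internally. It assumes a BED6 file with the strand in column 6, treats the TSS as a boundary coordinate (column 2 on the plus strand, column 3 on the minus strand), and uses the hypothetical function name `tss_windows`.

```shell
# Print a BED6 window around each TSS: `up` bp upstream and `down` bp
# downstream, oriented by strand (column 6). Start coordinates are clipped at 0.
tss_windows() {
    bed=$1; up=$2; down=$3
    awk -F'\t' -v OFS='\t' -v up="$up" -v down="$down" '
        {
            if ($6 == "-") { tss = $3; start = tss - down; end = tss + up }
            else           { tss = $2; start = tss - up;   end = tss + down }
            if (start < 0) start = 0
            print $1, start, end, $4, $5, $6
        }
    ' "$bed"
}
```

For a plus-strand gene starting at 5000, `tss_windows genes.bed 2000 2000` yields the window 3000-7000; for a minus-strand gene ending at 3000, it yields 1000-5000.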
plotHeatmap parameters
- -m Matrix file from the computeMatrix tool
- -out File name to save the image to
plotProfile parameters
- --dpi Set the DPI to save the figure. (default: 200)
- -m Matrix file from the computeMatrix tool
- -out File name to save the image to. The file ending will be used to determine the image format. The available options are: "png", "eps", "pdf" and "svg", e.g., MyHeatmap.png.
- --plotFileFormat Image format type. If given, it overrides the format implied by the file ending of -out.
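8. Peak annotation
After the routine processing of ChIP-seq sequencing data described above, we have turned the raw sequencer output into peak records in BED format. Peak annotation asks which genomic segment each peak falls in and how the peaks are distributed across genomic regions (upstream and downstream of genes, 5' and 3' UTRs, promoters, introns, exons, intergenic regions, microRNA regions). Since there are typically close to ten thousand peaks, annotation has to be done in batch. With solid scripting skills, you can download the GFF annotation file for the reference genome and write an annotator yourself.
Download the GFF annotation file for the reference genome from: ftp://ftp.ensembl.org/pub/release-91/gff3/mus_musculus/Mus_musculus.GRCm38.91.gff3.gz
Annotate the peaks with BEDTools' intersectBed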
$ bedtools intersect -wa -wb -a peaks.bed -b mm10.gff3 >ChIP_seq/CallPeak/peaks.anno.bed
- -wa Write the original entry in A for each overlap
- -wb Write the original entry in B for each overlap. Useful for knowing what A overlaps
- -a BAM/BED/GFF/VCF file “A”
- -b One or more BAM/BED/GFF/VCF file(s) “B”
Note: check whether the chromosome names in the first column of peaks.bed and mm10.gff3 follow the same convention: [1-19]|[XY] or chr[1-19]|[XY]? If they differ, unify them first:
# Convert chr[1-19]|[XY] to [1-19]|[XY]
$ perl -ane '($chr = $F[0]) =~ s/^chr//; print join("\t", $chr, @F[1..4]), "\n"' peaks.bed >peaks.convert.bed
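The same conversion can be done in either direction without Perl, which is useful when it is the annotation file rather than the peak file that should be changed. The sketch below uses the hypothetical function names `strip_chr` and `add_chr`. It only handles the `chr` prefix and does not reconcile other naming differences, such as the mitochondrial chromosome (chrM in UCSC, MT in Ensembl).

```shell
# Remove a leading "chr" from column 1 (UCSC style -> Ensembl style).
strip_chr() {
    sed 's/^chr//' "$1"
}

# Add a leading "chr" to column 1 unless it is already present
# (Ensembl style -> UCSC style). Comment lines starting with '#',
# such as GFF3 headers, are passed through unchanged.
add_chr() {
    awk '/^#/ || /^chr/ { print; next } { print "chr" $0 }' "$1"
}
```

Usage: `strip_chr peaks.bed >peaks.convert.bed`, or `add_chr mm10.gff3 >mm10.chr.gff3` to keep the peak file unchanged.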
View the peak annotation results
$ cut -f8 peaks.anno.bed | sort | uniq -c
# Summary of results:
# 30459 biological_region
# 2344 CDS
# 10 C_gene_segment
# 70534 chromosome
# 6181 exon
# 291 five_prime_UTR
# 27836 gene
# 2 gene_segment
# 22244 lnc_RNA
# 5 miRNA
# 76036 mRNA
# 5 ncRNA
# 4212 ncRNA_gene
# 491 pseudogene
# 401 pseudogenic_transcript
# 3 rRNA
# 2 snoRNA
# 3 snRNA
# 1546 three_prime_UTR
# 192 transcript
# 6 V_gene_segment
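These counts are overlap records, not peaks: a peak that overlaps several transcripts of the same gene contributes several `mRNA` lines. To count each peak once per feature type, deduplicate on the peak name first. This sketch assumes the peak name is in column 4 and the GFF feature type in column 8, as in the five-column peak file used above, and that the name column comes before the type column. The function name `unique_peak_counts` is hypothetical.

```shell
# Count distinct peaks per feature type, so that a peak overlapping several
# features of the same type is counted once.
# name_col: column holding the peak name (default 4)
# type_col: column holding the GFF feature type (default 8)
unique_peak_counts() {
    anno=$1
    name_col=${2:-4}
    type_col=${3:-8}
    cut -f"$name_col","$type_col" "$anno" | sort -u | cut -f2 | sort | uniq -c
}
```

Usage: `unique_peak_counts peaks.anno.bed`. Comparing its output with the raw counts shows how much of each category comes from repeated overlaps.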
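Here we use the Bioconductor package ChIPpeakAnno to annotate ChIP-seq peaks. Below is the example that ships with the package:
# The package is very easy to use: take the peaks files we produced (GSM1278641XuMUTrep1BAF155_MUT.peaks.bed and so on)
# and read them in with toGRanges or import to obtain a GRanges object
# Compare the overlap between two peaks files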
library(ChIPpeakAnno)
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE)
## one can also try import from rtracklayer
library(rtracklayer)
gr1.import <- import(bed, format="BED")
identical(start(gr1), start(gr1.import))
gr1[1:2]
gr1.import[1:2] #note the name slot is different from gr1
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(ol)
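# Peak annotation
data(TSS.human.GRCh37) ## Annotation relies mainly on this GRanges object; getAnnotation can also be used to obtain other GRanges objects for annotation
## featureType : TSS, miRNA, Exon, 5'UTR, 3'UTR, transcript or Exon plus UTR
peaks <- MUT_rep1_peaks ## a GRanges object read from your own peaks file with toGRanges or import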
macs.anno <- annotatePeakInBatch(peaks, AnnotationData=TSS.human.GRCh37,
output="overlapping", maxgap=5000L)
You can check where in the gene structure the peaks fall:
require(TxDb.Hsapiens.UCSC.hg19.knownGene)
aCR<-assignChromosomeRegion(peaks, nucleotideLevel=FALSE,
precedence=c("Promoters", "immediateDownstream",
"fiveUTRs", "threeUTRs",
"Exons", "Introns"),
TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
barplot(aCR$percentage)
The result looks similar to the figure below
Web-based tools are also an option for peak annotation:
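9. Motif analysis
For an overview of well-known motif discovery tools, see the paper: https://biologydirect.biomedcentral.com/articles/10.1186/1745-6150-9-4
Motif analysis first requires the sequences underlying the peak regions, which can be extracted with bedtools: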
$ bedtools getfasta -fi input.fasta -bed input.bed -fo output.fasta
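The BED file given to getfasta decides which sequences the motif search sees. One common choice, offered here as an assumption rather than the author's method, is to use fixed-width windows centred on the peak summits instead of the full peaks. The sketch below assumes a five-column summits file (chrom, start, end, name, score), such as the `*_summits.bed` file written by macs2 callpeak. The function name `summit_windows` and the default flank of 50 bp are hypothetical.

```shell
# Turn summit positions (BED: chrom, start, end, name, score) into fixed-width
# windows of `flank` bp on each side of the summit start coordinate.
# Start coordinates are clipped at 0.
summit_windows() {
    summits=$1
    flank=${2:-50}
    awk -F'\t' -v OFS='\t' -v flank="$flank" '
        {
            start = $2 - flank
            if (start < 0) start = 0
            print $1, start, $2 + flank, $4, $5
        }
    ' "$summits"
}
```

The output can be passed to getfasta as the `-bed` file, for example `summit_windows sample_summits.bed 50 >input.bed`, where `sample_summits.bed` stands for your own summits file.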
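Then RSAT (Regulatory Sequence Analysis Tools) can be used to search for motifs: choose NGS ChIP-seq -> peak-motifs, set suitable parameters, and run it online.
References:
(1) Hands-on introduction to ChIP-seq analysis - VIB Training
(2) ChIP-seq实战分析 (ChIP-seq analysis in practice)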