The oligoCLIP pipeline contains 2 stages:
- Stage 1, `snakeCLIP.py`: goes from fastq -> bam.
- Stage 2, `Snake_PeakMain.py`: takes bam files for the IP and the background, performs CLIPper peak calling, normalization, and annotation/motif calling.
- You need to install Snakemake and mamba. See the Snakemake documentation.
- Fill out the config file. See the example.
- `fastq_menifest`: a csv file pointing to 1 or multiple libraries with the SAME barcode set. Look at this example file to see how it is structured.
- `barcode`: a csv file mapping barcodes to RBPs. The RBP names must be unique (example).
- `ADAPTOR_PATH`: a folder containing a "tiling adaptor" fasta file (example). The adaptor sequence is broken into tiling substrings.
- `adaptor`: name of the fasta file without the .fasta suffix.
- `STAR_REP`: STAR indices built from RepBase, used to remove repetitive sequences. STAR indices can be built using `--runMode genomeGenerate`; see the STAR documentation (and the sketch after this list).
- `STAR_DIR`: the index built from genome sequences.
- `TOOL_PATH`: local directory of this repository.
- `umi_pattern`: `X` for barcode, `N` for UMIs.
- `WORKDIR`: where you want all outputs saved.
- `QC_TOOLS_PATH`: paths to python scripts that gather QC outputs. Download here.
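For orientation, here is a minimal sketch of a preprocessing config. The key names come from the list above; every path and the barcode/UMI pattern are hypothetical placeholders, so follow the example config shipped with the repository for real values.

```yaml
# hypothetical values for illustration only; see the example config in the repo
fastq_menifest: /path/to/fastq_manifest.csv  # libraries sharing the SAME barcode set
barcode: /path/to/barcode_to_rbp.csv         # maps barcodes to (unique) RBP names
ADAPTOR_PATH: /path/to/adaptor_dir/          # folder holding the tiling adaptor fasta
adaptor: tiling_adaptor                      # fasta file name without the .fasta suffix
STAR_REP: /path/to/star_index_repbase/       # STAR index built from RepBase
STAR_DIR: /path/to/star_index_genome/        # STAR index built from the genome
TOOL_PATH: /path/to/oligoCLIP/               # local clone of this repository
umi_pattern: NNNNNNNNNNXXXXXX                # hypothetical layout: N = UMI base, X = barcode base
WORKDIR: /path/to/output_dir/
QC_TOOLS_PATH: /path/to/qc_scripts/
```

If you need to build the STAR indices yourself, `--runMode genomeGenerate` is invoked roughly like this (paths and thread count are placeholders; add `--sjdbGTFfile`/`--sjdbOverhang` as appropriate for your annotation and read length):

```bash
STAR --runMode genomeGenerate \
     --genomeDir /path/to/star_index_genome/ \
     --genomeFastaFiles /path/to/genome.fa \
     --runThreadN 8
```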
snakemake -s snakeCLIP.py -j 12 --keep-going --cluster "qsub -l walltime={params.run_time} -l nodes=1:ppn={params.cores} -q home-yeo" --directory /home/hsher/scratch/ABC_reprocess/ --configfile eclipse_multi.yaml -n
- for specific options, see snakemake command line interface
- genome-aligned, deduplicated bams: `{libname}/bams/genome/{RBP}.genome-mappedSoSo.rmDupSo.Aligned.out.bam` and its index `{libname}/bams/genome/{RBP}.genome-mappedSoSo.rmDupSo.Aligned.out.bam.bai`
- repeat-aligned bams: `{libname}/bams/repeat/{RBP}.Aligned.out.bam`
`QC` folder:
- `cutadapt_log*.csv`: csv files with all cutadapt statistics joined together.
- `QC/{libname}/fastQC_*.csv`: fastQC outputs for all RBPs.
- `genome_mapping_stats.csv` and `repeat_mapping_stats.csv`: csv files with all STAR logs joined together.
- `dup_level.csv`: UMI deduplication statistics.

These files are used to diagnose problems when you don't get the expected outputs: chat with your lab's bioinformatician with these files in hand. They will love to see these.
The flowchart can be generated using the following command: `snakemake -s snakeCLIP.py --dag --configfile config/preprocess_config/eclipse_rbfox.yaml | dot -Tsvg > snakeCLIP.svg`. You can also run this with your own config file.
- extract_umi: extract the UMI sequence using UMI-tools and put it into the read header. The barcodes are still in place.
- cutadapt_round_one/two: trim 3' adaptor sequences using tiling adaptor sequences (see config).
- demultuplex_using_barcodes: split reads into one fastq per RBP according to their barcodes.
- remove_barcode_and_reverse_complement: remove barcode sequence and reverse complement the sequence.
- sort_and_gzip_fastq: sort fastq and gzip it.
- align_reads_to_REPEAT: map to the repetitive element database to remove rRNA sequences.
- align_to_GENOME: map to genome.
- sort_bams: sort the genome-aligned bam file by coordinate.
- umitools_dedup: deduplicate based on UMIs.
- index_genome_bams: index the bam.
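To make the rules above concrete, here is a rough, simplified sketch of the underlying tool calls. File names, the UMI/barcode pattern, and most parameters are made up, and the demultiplexing, barcode-removal/reverse-complement and repeat-mapping steps are omitted; see `snakeCLIP.py` for the exact shell directives.

```bash
# extract_umi: move the UMI (N bases) into the read name; the barcode (X bases) stays on the read
umi_tools extract --bc-pattern=NNNNNNNNNNXXXXXX \
    --stdin raw.fastq.gz --stdout umi_extracted.fastq.gz

# cutadapt_round_one/two: trim 3' adaptors using the tiling adaptor fasta
cutadapt -a file:tiling_adaptor.fasta -o trimmed.fastq.gz umi_extracted.fastq.gz

# align_to_GENOME: map to the genome with STAR (repeat mapping is analogous, against STAR_REP)
STAR --genomeDir /path/to/star_index_genome/ \
    --readFilesIn trimmed.fastq.gz --readFilesCommand zcat \
    --outSAMtype BAM Unsorted --outFileNamePrefix RBP.genome-mapped.

# sort_bams, umitools_dedup, index_genome_bams
samtools sort -o RBP.sorted.bam RBP.genome-mapped.Aligned.out.bam
samtools index RBP.sorted.bam
umi_tools dedup -I RBP.sorted.bam -S RBP.rmDup.bam
samtools index RBP.rmDup.bam
```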
These rules help you diagnose problems if you get few output reads.
- gather_timming_stat_round1/2: gather cutadapt output statistics to see how many reads were adaptor/quality-trimmed. If lots of reads are "too short" after trimming, it can mean lots of adaptor dimers or terrible read quality.
- fastQC_post_trim: run fastQC on the "clean" (adaptor-, barcode- and UMI-free) reads. If you see adaptor sequence content here, the trimming went wrong. Other metrics can be helpful too.
- gather_repeat/genome_mapping_stats: gather the STAR .Log.final.out files to see the unique mapping rate. If few reads map, check why they are not mapping, e.g. by blasting a few of the unmapped reads.
- qc_duplication_rate: see how many reads are removed by UMI deduplication. If most reads are gone after deduplication, the library is over-amplified.
The main script `Snake_PeakMain.py` calls several modules:
- `rules/clipper_and_norm.py`: fundamental rules to call CLIPper peaks on a CLIP library and rank them with a chi-square/Fisher test comparing 1 bam file to a background bam file.
- `rules/complementary_control.py`: reuses `rules/clipper_and_norm.py`. Concatenates all other CLIPs in the same library into a single bam file, the complementary control (CC), and uses it as the background bam file.
- `rules/annotate_peak.py`: annotates peaks by region and finds motifs in significant peaks.
- Fill out the config file (example); a sketch follows this list.
- `SCRIPT_PATH`: path to the `scripts/` folder in oligoCLIP.
- `SPECIES`: CLIPper `--species` flag.
- `MANIFEST`: example.
  - `bam_0`: the CLIP bam. This bam will be used to call peaks with CLIPper.
  - `bam_control_0`: the background bam (external normalization). Can be RNA-seq, an IgG library or SM-INPUT. In the paper, we showed RNA-seq is not a good control.
  - All bams must be indexed (ideally outputs from Stage 1).
  - `uid`: the unique ID for each IP-background pair.
  - `lib`: dictates the complementary control behaviour. Specifies which `bam_0` CLIPs are grouped in the same multiplexed library. All other bams in the same `lib` will be used as the complementary control.
- `WORKDIR`: working directory, where you want all files stored.
- `GTF`: *.gtf.db annotation used by annotator.
- `ANNOTATOR_SPECIES`: species name used by annotator.
- `GENOMEFA`: fasta file containing the genome sequence.
- `CLIPper_pvalthes`: p-value threshold used to filter CLIPper peaks for motif calling. Peaks passing this threshold will be used to find motifs.
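As a sketch, a peak-calling config might look like the following; the key names follow the list above, while every path and value is a hypothetical placeholder (use the linked examples for the real format).

```yaml
# hypothetical values for illustration only
SCRIPT_PATH: /path/to/oligoCLIP/scripts/
SPECIES: hg38                     # passed to CLIPper --species
MANIFEST: /path/to/manifest.csv
WORKDIR: /path/to/peak_call_out/
GTF: /path/to/annotation.gtf.db
ANNOTATOR_SPECIES: hg38
GENOMEFA: /path/to/genome.fa
CLIPper_pvalthes: 0.01            # hypothetical threshold
```

The manifest csv carries one row per IP-background pair with the fields described above (the column order here is my assumption):

```
uid,lib,bam_0,bam_control_0
RBFOX2_rep1,multiplex1,/path/to/bams/genome/RBFOX2.rmDup.bam,/path/to/controls/rnaseq.bam
SF3B4_rep1,multiplex1,/path/to/bams/genome/SF3B4.rmDup.bam,/path/to/controls/rnaseq.bam
```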
snakemake -j 40 -s Snake_PeakMain.py --cluster "qsub -l walltime={params.run_time}:00:00 -l nodes=1:ppn={params.cores} -q home-yeo" --configfile config/peak_call_config/snake_CLIPper_downsample.yaml --use-conda -n
- for specific options, see snakemake command line interface
- `output/CLIPper/{uid}.peaks.bed`: CLIPper outputs, the "unnormalized" peaks.
- `output/{uid}.peaks.normed.compressed.bed`: normalized to the external control `bam_control_0`.
- `chisq/{lib}/{uid}.peaks.normed.compressed.bed`: normalized to the complementary control (all other CLIPs in the same library).
- `*annotate.bed`: contains the gene and region for each peak.
- `*.summary`: summary of the region distribution for all significant peaks.
- `*.svg`: HOMER motif figures partitioned by region.
Since either flowchart will be a hairball, let me highlight each step.
- CLIPper: the CLIPper p-value is calculated with a Poisson distribution. The background rate is estimated from +/- 500 b.p. around the called peaks. To understand what CLIPper is doing, please visit CLIPper.
- Ranking/normalizing CLIPper peaks with a "background": `scripts/overlap_peakfi_with_bam.pl` is the script that does the job. Given a CLIP bam, a "background" bam and the CLIPper beds, it counts how many CLIP/background reads fall inside/outside each peak and builds a contingency table (sketched below). If any of the counts is < 5, it uses Fisher's exact test; otherwise it uses the chi-square test.
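Schematically, the contingency table built for each peak looks like this (the counts are made up for illustration):

| | reads in peak | reads outside peak |
|---|---|---|
| CLIP (`bam_0`) | 40 | 960 |
| background bam | 8 | 992 |

If any of the four counts is below 5, the script uses Fisher's exact test; otherwise it applies the chi-square test, and peaks are ranked by the resulting statistics.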
The following flow chart can be generated with:
snakemake -s Snake_PeakMain.py --configfile config/peak_call_config/snake_CLIPper_downsample.chi.yaml --use-conda -n --dag chisq/Dan_multiplex1_K562_rep1_RBFOX2.peaks.normed.compressed.bed | dot -Tsvg > CC.svg
- `concat_all_other_bams`: concatenate all other bams (`bam_0`) in the same library (`lib`) to be used as the complementary control (see the sketch below).
- `count_bam_chi`: count the total number of reads.
- `norm_chisq_bg`: perform prioritization of the CLIP (`bam_0`), with the complementary control as the background.
- `compress*`: merge book-ended peaks. The p-value and fold-change are taken from the strongest (lowest p-value) peak being merged.
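For intuition, `concat_all_other_bams` boils down to merging and indexing the other RBPs' bams from the same `lib`; whether it literally calls `samtools merge` is an assumption on my part, and the file names below are hypothetical.

```bash
# build the complementary control for RBFOX2 from all other CLIPs in the same library
samtools merge CC_for_RBFOX2.bam SF3B4.bam PRPF8.bam LIN28B.bam
samtools index CC_for_RBFOX2.bam
```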
The following flow chart can be generated with:
snakemake -s Snake_PeakMain.py --configfile config/peak_call_config/snake_CLIPper_downsample.chi.yaml --use-conda -n --dag "output/Dan_multiplex1_K562_rep1_RBFOX2.peaks.normed.compressed.bed" | dot -Tsvg > External.svg
Everything is the same as for the complementary control, except that an external bam file (`bam_control_0`) is used as the background.
- `call_count_read_num`: count the total number of reads.
- `norm_chisq_bg`: perform prioritization of the CLIP (`bam_0`), with the external bam file (`bam_control_0`) as the background.
- `compress*`: merge book-ended peaks. The p-value and fold-change are taken from the strongest (lowest p-value) peak being merged.
The following flow chart can be generated with:
snakemake -s Snake_PeakMain.py --configfile config/peak_call_config/snake_CLIPper_downsample.chi.yaml --use-conda -n --dag chisq/motif/Dan_multiplex1_K562_rep1/Dan_multiplex1_K562_rep1_RBFOX2.peaks.normed.compressed.filtered.annotate.svg chisq/Dan_multiplex1_K562_rep1/Dan_multiplex1_K562_rep1_RBFOX2.peaks.summary | dot -Tsvg > anno.svg
I use the complementary control as an example; the method for the external control or the unnormalized peaks is the same. The upper part, up to "compress", is already explained above.
- `filter*`: filter for significant peaks.
- `*motif`: call HOMER motifs for each step (see the sketch below).
- `anno*`: annotate peaks with region and gene name.
- `*summary`: count the region distribution of significant peaks. `scripts/summarize_peak.py` does the job.
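The `*motif` rules wrap HOMER. Conceptually the call has roughly this shape; treating `findMotifsGenome.pl` as the entry point and the paths/options as placeholders is my assumption, so check `rules/annotate_peak.py` for the exact invocation.

```bash
# find motifs in the significant (filtered) peaks against the genome fasta
findMotifsGenome.pl RBFOX2.peaks.normed.compressed.filtered.bed /path/to/genome.fa motif_out/ -rna -size given
```

The full graph combining all of the targets above can be generated with: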
snakemake -s Snake_PeakMain.py --configfile config/peak_call_config/snake_CLIPper_downsample.chi.yaml \
--use-conda -n --dag \
chisq/Dan_multiplex1_K562_rep1/Dan_multiplex1_K562_rep1_RBFOX2.peaks.summary \
chisq/motif/Dan_multiplex1_K562_rep1/Dan_multiplex1_K562_rep1_RBFOX2.peaks.normed.compressed.filtered.annotate.svg \
output/Dan_multiplex1_K562_rep1_RBFOX2.peaks.normed.compressed.annotate.bed \
output/motif/Dan_multiplex1_K562_rep1_RBFOX2.peaks.normed.compressed.filtered.annotate.svg \
output/CLIPper/Dan_multiplex1_K562_rep1_RBFOX2.peaks.filtered.svg \
output/CLIPper/Dan_multiplex1_K562_rep1_RBFOX2.peaks.annotate.bed \
| dot -Tsvg > all.svg