Releases: ENCODE-DCC/chip-seq-pipeline2
v1.5.0
Upgraded WDL to 1.0
- Added metadata to WDL: `meta` for general pipeline metadata (e.g. version, docker image) and `parameter_meta` for input parameters.
- Removed hacky comments for Caper.
Pooling control
- `chip.always_use_pooled_control` is now `true` by default, which means that the pipeline always tries to pool controls if multiple control replicates are defined. Such a pooled control is used for calling peaks on each experiment replicate.
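A minimal input JSON sketch (hypothetical) to revert to the old behavior of per-replicate controls:

```json
{
    "chip.always_use_pooled_control": false
}
```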
Added control mode
- Added `control` to `chip.pipeline_type`.
- `chip.pipeline_type` now has three choices: `tf`, `histone` and `control`.
- For control mode, do not use inputs prefixed with `ctl_*`. Instead, define inputs in non-`ctl_*` variables, e.g. define control FASTQs in `chip.fastqs_rep1_R1` (not in `chip.ctl_fastqs_rep1_R1`).
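A minimal input JSON sketch for control mode might look like the following (FASTQ filenames are hypothetical):

```json
{
    "chip.pipeline_type": "control",
    "chip.fastqs_rep1_R1": ["ctl_rep1.fastq.gz"],
    "chip.fastqs_rep2_R1": ["ctl_rep2.fastq.gz"]
}
```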
Bug fixes
- Clip peak's genome coordinates between 0 and chromSize.
    - Affected files: SPP/MACS2 peaks and IDR/IDR_unthresholded/overlap peaks.
Updated reference genome data
- hg38: v1 -> v3
- mm10: v1 -> v3
- No update for old genome data (mm9, hg19). They are still at v1.
Reference genome dataset v3: ENCODE4 standard for ATAC/ChIP.
- New TSS regions (`tss`) based on GENCODE annotation. hg38: GENCODE v29, mm10: GENCODE vM21.
- Repacked other annotation BED files: no changes in actual contents.
    - Enhancer (`enh`)
    - Promoter (`prom`)
    - DHS regions (`dnase`)
- New blacklist (`blacklist`) for hg38. Keep using the old blacklist for mm10.
v1.4.0.1
IMPORTANT: Update Caper to version >= 0.8.
$ pip install caper --upgrade
IMPORTANT: Conda users must update pipeline's Conda env.
$ bash scripts/update_conda_env.sh
New control subsampling
- Controlled by `chip.ctl_depth_limit` and `chip.exp_ctl_depth_ratio_limit`. There are two limits calculated from these parameters. The pipeline takes the maximum of them, `max(ctl_depth_limit, exp_ctl_depth_ratio_limit * exp_rep_read_depth)`, and if a control is deeper than that limit, it is subsampled down to it.
    - `chip.ctl_depth_limit`: Hard limit on control's read depth. 200M by default.
    - `chip.exp_ctl_depth_ratio_limit`: Factor multiplied by the experiment replicate's read depth. 5.0 by default.
- We still keep control subsampling controlled by the parameter `chip.ctl_subsample_reads`.
    - Both raw and filtered control BAMs have full reads. The filtered (nodup) control BAM is converted into a control TAG-ALIGN, which is then subsampled down to `chip.ctl_subsample_reads` (if it is defined > 0). This parameter modifies the TAG-ALIGN itself, so it affects all downstream analyses like peak calling, including the new automatic control subsampling, which is done in the task `call_peak`.
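A worked example with the defaults (read counts are hypothetical): an experiment replicate with 30M reads gives a limit of `max(200M, 5.0 * 30M) = 200M`, so a 250M-read control is subsampled down to 200M; a replicate with 50M reads gives `max(200M, 250M) = 250M`. The defaults could be overridden in an input JSON like this sketch:

```json
{
    "chip.ctl_depth_limit": 200000000,
    "chip.exp_ctl_depth_ratio_limit": 5.0
}
```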
Cropping FASTQs: Added a parameter `chip.crop_length_tol`, which defines a tolerance to allow shorter reads around `chip.crop_length`. It's 2 by default and only works when `chip.crop_length` is defined (> 0). Trimmomatic's parameters `CROP` and `MINLEN` will be `chip.crop_length` and `chip.crop_length - abs(chip.crop_length_tol)`, respectively. The output (cropped FASTQ) filename will be `PREFIX.crop_${CROP}-${TOLERANCE}bp.fastq.gz` where `TOLERANCE = CROP - MINLEN`.
- All reads longer (>) than `chip.crop_length` will be cropped.
- All reads shorter (<) than `chip.crop_length - abs(chip.crop_length_tol)` will be removed.
- All reads not shorter (>=) than `chip.crop_length - abs(chip.crop_length_tol)` and not longer (<=) than `chip.crop_length` will be kept.
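For instance, a hypothetical input JSON cropping reads to 50 bp with the default tolerance of 2, so Trimmomatic runs with `CROP=50` and `MINLEN=48`, and the output is named `PREFIX.crop_50-2bp.fastq.gz`:

```json
{
    "chip.crop_length": 50,
    "chip.crop_length_tol": 2
}
```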
Java heap
- For tasks with a Java app running inside: if the following parameters are not explicitly defined by a user, each Java app in a task uses 90% of the corresponding task's memory, so that it does not go over the physical memory of a cloud instance. For example, if a user didn't define `chip.filter_picard_java_heap`, then the pipeline will use 90% of `chip.filter_mem_mb` for the Java heap `-Xmx` (for Picard tools in the filter task).
    - `chip.align_trimmomatic_java_heap`
    - `chip.filter_picard_java_heap`
    - `chip.gc_bias_picard_java_heap`
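A sketch of pinning the Picard heap explicitly instead of relying on the dynamic 90% rule (values are hypothetical; without the explicit heap, 90% of the 20000 MB task memory, i.e. about 18G, would be used):

```json
{
    "chip.filter_mem_mb": 20000,
    "chip.filter_picard_java_heap": "18G"
}
```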
Bug fixes
- Subsampling TAG-ALIGN (for PE datasets only)
    - The PE subsampling task actually subsampled 2 x `chip.subsample_reads` reads.
    - Default settings of the pipeline are not affected by this bug.
    - Affected cases:
        - `chip.subsample_reads > 0` (0 by default) and `chip.paired_end == True` and the actual number of reads in a replicate is > `chip.subsample_reads`.
        - `chip.ctl_subsample_reads > 0` (0 by default) and `chip.ctl_paired_end == True` and the actual number of reads in a control is > `chip.ctl_subsample_reads`.
        - Users starting from types other than FASTQ (e.g. BAM, NODUP_BAM, TA) with `chip.paired_end == True` and the actual number of reads in a replicate > `chip.xcor_subsample_reads` (15M by default).
- Fixed `grep` error on OSX.
- Swapped lines in `chip.croo.v4.json`.
- Cannot start from BAMs on DNAnexus (using the Web UI).
- JSD didn't work without a blacklist.
- Pooled TAG-ALIGN had a fixed prefix "basename_prefix".
- Croo task graph got complicated due to a diamond dependency problem with the task `choose_ctl`.
v1.3.6
Conda users should re-install pipeline's environment.
$ bash scripts/uninstall_conda_env.sh
$ bash scripts/install_conda_env.sh
DNAnexus web-interface users should use workflows suffixed with `-dockerhub`, i.e. v1.3.6-dockerhub.
- New parameters (in an input JSON)
    - `chip.crop_length`: Crop FASTQs with Trimmomatic. Cropping is disabled by default (set as 0). Check your FASTQs' read length first. Any reads SHORTER than this length will be excluded while cropping, hence not included in output BAMs and all downstream analyses.
    - `chip.fdr_thresh`: FDR threshold for the SPP peak caller. It's 0.01 by default. Use a more relaxed value if you see the following `File is empty` error in the SPP task `call-peak`. Possible fix for issue #119.
Traceback (most recent call last):
File "/root/miniconda3/envs/encode-chip-seq-pipeline/bin/encode_task_spp.py", line 103, in <module>
main()
File "/root/miniconda3/envs/encode-chip-seq-pipeline/bin/encode_task_spp.py", line 94, in main
assert_file_not_empty(rpeak)
File "/root/miniconda3/envs/encode-chip-seq-pipeline/bin/encode_lib_common.py", line 212, in assert_file_not_empty
raise Exception('File is empty ({}). Help: {}'.format(f, help))
Exception: File is empty (rep2-R1.subsampled.50.merged.nodup.pr2_x_ctl_for_rep2.300K.regionPeak.gz). Help:
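A hypothetical input JSON relaxing the SPP FDR threshold (the 0.05 value is only an example; choose one appropriate for your data):

```json
{
    "chip.fdr_thresh": 0.05
}
```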
- Changes in parameters
    - `chip.xcor_pe_trim_bp` -> `chip.xcor_trim_bp`: `_pe_` was misleading since it's also applied to SE FASTQs.
- Bug fixes
    - Ungzipped single FASTQ input
    - DNAnexus: Failure at `read_genome_tsv` due to errors while retrieving the image from `quay.io`. `dockerhub` is used instead.
v1.3.5.1
Conda users need to re-install pipeline's Conda env.
$ bash scripts/uninstall_conda_env.sh
$ bash scripts/install_conda_env.sh
Output file name change
- Pooled TAG-ALIGN file will have a fixed prefix of `rep.pooled` instead of using rep1's prefix.
Change in default parameters
- `chip.filter_picard_java_heap`: `4G` to dynamic (`chip.filter_mem_mb`)
- `chip.gc_bias_picard_java_heap`: `6G` to `10G`
Troubleshooting for failed pipelines
- Added some help texts for stringent IDR threshold
Added important GNU apps to Conda env
- `tar`: to sort by filename (`tar --sort`)
- `grep`: to use Perl-style regular expressions (`grep -P`)
Misc.
- Downgraded Java version 11 -> 8 in docker/singularity images.
- Output definition JSON file for Croo: v3 released.
- bowtie2 log is directly printed to STDOUT instead of printing to `.align.log`.
- Removed wrong arrow (`FASTQ R2` -> `chip.xcor`) in the task graph in Croo's HTML report.
v1.3.4
IMPORTANT: Update Caper and Croo for the task graph in a Croo HTML report. Old Croo will not work with the new pipeline's `metadata.json`.
$ pip install --upgrade caper croo
IMPORTANT: Conda users must update their environment.
$ bash scripts/update_conda_env.sh
Task graph on Croo report
- Updated the output definition JSON file on pipeline's side.
Default parameter changes
- `chip.macs2_signal_track_disks`: 200 GB -> 400 GB
    - to prevent possible `PAPI error 10` errors.
WDL
- Preemption is now allowed on GCP for the task `macs2_signal_track`.
Bug fixes
- Updated documentation for Conda installation for OSX users.
    - OSX users need to install GNU `grep`.
v1.3.3
IMPORTANT: Conda users must re-install Conda env.
$ bash scripts/uninstall_conda_env.sh
$ bash scripts/install_conda_env.sh
New parameters to control Java max heap (`java -Xmx`)
- Should be helpful for issue #88
- Added the following 2 parameters:
    - `chip.filter_picard_java_heap`: 4G by default (for Picard MarkDuplicates)
    - `chip.gc_bias_picard_java_heap`: 6G by default (for Picard CollectGcBiasMetrics)
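A sketch of raising both heaps in an input JSON (the 8G values are hypothetical; the string format follows the documented defaults):

```json
{
    "chip.filter_picard_java_heap": "8G",
    "chip.gc_bias_picard_java_heap": "8G"
}
```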
Bug fixes
- Merge two blacklists with different numbers of columns (3 and 6).
- Presigned URLs for organized outputs
    - They are PUBLIC. Use this at your own risk.
- Added UCSC browser tracks.
    - bigWig: MACS2 signal tracks (p-val and fold-enrichment).
    - bigBed: optimal/conservative IDR/overlap peaks.
Change of default parameters
- `chip.align_disks`: 200 GB -> 400 GB
Removed old method
- Completely removed the old method of running pipelines.
- Users must use Caper to run pipelines.
v1.3.2
IMPORTANT: Conda users must update pipeline's Conda environment (not a re-installation). This will just update pipeline's python task wrappers.
$ bash scripts/update_conda_env.sh
New feature
- Removed the parameter `chip.keep_irregular_chr` from the pipeline.
- Added a genome-specific parameter `regex_bfilt_peak_chr_name` instead (`chr[\dXY]+` by default, which matches chr1, chr2, ..., chrX and chrY). You can define this either in a genome TSV (e.g. `regex_bfilt_peak_chr_name[TAB]chr[\dXY]+`) or in your input JSON (e.g. `"chip.regex_bfilt_peak_chr_name": "chr[\dXY]+"`).
    - This parameter defines chromosomes to keep in the final (with `.bfilt.` suffix) peaks file. This filter is applied even without a blacklist.
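For example, a hypothetical input JSON that also keeps chrM in the `.bfilt.` peaks (note that the backslash must be escaped in strict JSON):

```json
{
    "chip.regex_bfilt_peak_chr_name": "chr[\\dXYM]+"
}
```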
Bug fixes
- Pipeline can catch non-zero error codes correctly at failed tasks (tasks `align` and `call_peak`).
- Pipeline can run without a blacklist.
Dependencies
- Added `wget` and `curl` to the Conda environment.
Genome data
- Genome database builder generates TAR balls with identical md5sums.
- Can use gzipped Bowtie2/BWA indices (`.tar.gz`) with arbitrary filenames.
    - Files in a TAR ball don't need to be prefixed with the TAR ball's filename prefix.
v1.3.1
IMPORTANT: Conda users must re-install Conda environment.
$ bash scripts/uninstall_conda_env.sh
$ bash scripts/install_conda_env.sh
Added latest python3 MACS2 2.2.4 to Conda env
- Removed the py2 MACS2 from the py2 Conda env
- This upgrade slightly changes the output of MACS2, so the next version will be 1.4.0.
Added missing deps to Conda env
- `ghostscript`: to fix a `gs` error on the Stanford Sherlock cluster
- `caper` and `croo`: for users' convenience when pipeline's Conda env (py3) is activated. We have the `PYTHONNOUSERSITE` env var set in pipeline's Conda env, so users' locally pip-installed caper and croo are ignored when it's activated.
Fix for issue #91
- Removed all file-linking (soft/hard) from the pipeline
v1.3.0
- Update for Conda users
    - IMPORTANT: Conda users must uninstall the old pipeline's Conda environment (`scripts/uninstall_conda_env.sh`) and re-install it (`scripts/install_conda_env.sh`). Pipeline now supports old (< 4.7) and new (>= 4.7) Conda versions.
    - Pipeline now supports Conda >= 4.7. Please follow (carefully) the Conda installation instructions in the README.
    - Pipeline's base Conda environment is now based on python3 (instead of python2), so users must re-install pipeline's Conda environment. Please follow the above instructions.
- Update for Google Cloud Platform (GCP) users
    - We will keep the old naming `google` for GCP, but it's recommended to use `gcp` instead. For example, use `hg38_gcp.tsv` instead of `hg38_google.tsv` for the genome TSV file.
- Moved files for the old method to `dev/`
    - Will deprecate this soon. Please use Caper. The old method has known unfixed issues.
- Blacklist filtering in JSD (Jensen-Shannon Distance) calculation
    - deeptools' native blacklist filtering turns out to be very slow for a blacklist BED with >= 1000 lines. See this for details.
    - We make a temporary blacklist-filtered BAM using `bedtools intersect` and use it for `deeptools plotFingerprint`.
- Upgraded genomic software in Conda env/docker container:
    - Conda environment is now based on python3, with an additional python2 environment for packages that are still in py2 (MACS2, metaseq)
    - Updated software versions:
        - python 2.7 -> 3.6.6
        - samtools 1.2 -> 1.9 (command lines are both backward and forward compatible)
        - phantompeakqualtools 1.2 -> 1.2.1 (to remove negative peaks)
        - deeptools 2.5.4 -> 3.3.1 (to print out synthetic JSD for samples without controls)
        - picard 2.10.6 -> 2.20.7
        - r 3.3.2 -> 3.2.2 (had to downgrade to support Conda >= 4.7 because free channels with a higher R version and python 3.6.6 are not allowed). We keep R at 3.4.4 (with r-spp 1.15) in the docker container, though; it was not possible to match R versions between the Conda env and the docker container.
        - bowtie2 2.2.6 -> 2.3.4.3
        - bwa 0.7.13 -> 0.7.17
        - bedtools 2.26.0 -> 2.29.0
- Change of parameters
    - `chip.regex_filter_reads` (`String`) is replaced with `chip.filter_chrs` (`Array[String]`)
        - e.g. to remove the mito chromosome `MT`, the input JSON should have `"chip.filter_chrs": ["MT"]`
        - `chip.filter_chrs` is `[]` by default, i.e. for ChIP-seq we keep mito chromosomes in a filtered BAM by default.
    - Resource parameter name changes: `chip.call_peak_*` are shared by both peak callers (MACS2 and SPP) and `chip.align_*` are shared by both aligners (bowtie2 and bwa). See the sketch after this list.
        - `chip.bowtie2_mem_mb` and `chip.bwa_mem_mb` -> `chip.align_mem_mb`
        - `chip.bowtie2_cpu` and `chip.bwa_cpu` -> `chip.align_cpu`
        - `chip.bowtie2_disks` and `chip.bwa_disks` -> `chip.align_disks`
        - `chip.macs2_mem_mb` and `chip.spp_mem_mb` -> `chip.call_peak_mem_mb`
        - `chip.macs2_disks` and `chip.spp_disks` -> `chip.call_peak_disks`
        - `chip.spp_cpu` -> `chip.call_peak_cpu`
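A sketch of the renamed resource parameters in an input JSON (values are hypothetical):

```json
{
    "chip.align_cpu": 4,
    "chip.align_mem_mb": 20000,
    "chip.call_peak_cpu": 2,
    "chip.call_peak_mem_mb": 16000
}
```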
- Change in default parameters
    - `mapq_thresh`: default is 30 for both the `bwa` and `bowtie2` aligners. You can still define any `mapq_thresh`.
- Added `bowtie2` as the new DEFAULT aligner
    - Users can still use `bwa` instead of `bowtie2` by setting a param in an input JSON file: `"chip.aligner": "bwa"`.
- Use `SAMstats` instead of `samtools flagstat`
    - Better read counting for raw/filtered BAMs
- Can use a custom aligner/peak caller
    - Specify your custom aligner code `custom_align_py` and index TAR file `custom_aligner_idx_tar`.
    - Specify your custom peak caller `custom_call_peak_py`.
    - For custom genomes, it's recommended to use the genome data builder to build `custom_aligner_idx_tar` and `custom_aligner_mito_idx_tar`.
- Added QC for GC bias
    - A GC bias plot is added to the `align` section of the HTML report. You can disable it by setting the flag `chip.enable_gc_bias` to `false` in your input JSON.
- Multiple blacklists (see the sketch after this item)
    - Added `blacklist2`. Users can define `chip.blacklist2` in an input JSON file or add a new row (`blacklist2[TAB][YOUR_2ND_BLACKLIST]`) in a genome TSV file.
    - Multiple blacklists will be merged with the `zcat` command.
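A hypothetical input JSON defining a second blacklist (filenames are illustrative):

```json
{
    "chip.blacklist": "hg38.blacklist.bed.gz",
    "chip.blacklist2": "my_second_blacklist.bed.gz"
}
```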
- Better readability in QC report/JSON
    - Organized outputs into big categories: `align`, `lib_complexity`, `replication`, `align_enrich`, `peak_enrich` and `etc`.
    - In all docs, HTML reports and QC JSON files, replaced confusing abbreviations with comprehensible ones.
        - `pprY` -> `pooled-prY`
        - `ppr` -> `pooled-pr1_vs_pooled_pr2`
        - `repX-pr` -> `repX-pr1_vs_repX-pr2`
- Fixed bugs
    - Removed python `multiprocessing` from all wrappers: to minimize memory errors for SLURM + Singularity
    - Replaced all `sambamba` with `samtools` in command lines, since `sambamba` has some seg-fault issues.
    - Do not make a read length log for R2 (for PE); the GCP backend sometimes picks the wrong read length file.
    - matplotlib X server error
    - numpy conflict in metaseq
    - SAMstats multiprocessing error (downgraded SAMstats from 0.2.2 to 0.2.1)
    - Index TAR file unpacking issue in docker/singularity containers (ownership problem)
    - MACS2 10th column == -1 issue
    - py2 -> py3 formatting issue in a subsampled filename (15.0M -> 15M)
    - Upgraded MACS2 to 2.1.3.3 to remove spurious spikes in peaks.
    - Hard-linking problems in the reproducibility step.
v1.2.2
WARNING: Conda users must update their Conda environments.
$ bash conda/update_conda_env.sh
- Mixed endedness per replicate. For example, for three replicates with mixed endedness (similarly for controls, `ctl_paired_ends`):

    { "chip.paired_ends": [false, true, false] }

- Mixed data types per replicate. For example, for three replicates with mixed data types (rep1: BAM, rep2: NODUP_BAM, rep3: TAG-ALIGN); similarly for controls (`ctl_bams`, ...):

    {
        "chip.bams": ["rep1.bam", null, null],
        "chip.nodup_bams": [null, "rep2.nodup.bam", null],
        "chip.tas": [null, null, "rep3.tagAlign.gz"]
    }
- No auto-installation of croo/caper inside the py3 Conda env
- Removed resumer support
    - Instead, Caper (with Cromwell's native call-caching) is recommended
- Bug fixes
    - qc_report fails due to type coercion (File -> File?) of outputs from idr/overlap
    - Fix for unreplicated pipeline failing on DNAnexus