Skip to content

Latest commit

 

History

History
184 lines (126 loc) · 14.5 KB

File metadata and controls

184 lines (126 loc) · 14.5 KB

Calculate Targeted Coverage

GitHub release

Overview

This pipeline extracts read depth calculations from a BAM file and generates outputs that are useful to the interpretation and downstream variant calling of a targeted sequencing experiment. Relevant datasets include targeted gene panels and whole exome sequencing (WXS) experiments. For in-depth downstream coverage QC, the pipeline can output per-base read depth at all targeted loci specified by a target BED file and read depth at genome-wide "off-target" well characterized polymorphic sites known to dbSNP. For a more general overview of targeted sequencing quality, the pipeline can output QC metrics produced by picard CollectHsMetrics. As a direct contribution to a DNA processing workflow, the pipeline can output a coordinate BED file containing target intervals merged with intervals encompassing off-target dbSNP sites enriched in coverage (as determined by a user-defined read-depth threshold). This new coordinate file can be used to indicate base quality recalibration and variant calling intervals to pipeline-recalibrate-BAM and gatk HaplotypeCallerin pipeline-call-gSNP directly or through metapipeline-DNA.

calculates per-base read depth in a BAM file at "target" intervals specified by a target BED file and at "off-target" well characterized polymorphic loci This pipeline performs coverage calculations from a BAM file at intervals specified by a target bed file and reports some basic coverage metrics. The SAMtools depth tool is used to calculate per-base coverage in specified regions. This intermediate output is converted into bed format using an awk script. Then the BEDtools merge tool is used to collapse consecutive coordinates into intervals, with a final output reporting a comma-separated list of per-base read depths for each coordinate in an interval. Picard's CollectHsMetrics is used to report various interval related metrics on the input BAM.


How To Run

  1. Update the params section of the .config file

  2. Update the input yaml

  3. See the submission script, here, to submit your pipeline

Requirements

Currently supported Nextflow versions: v23.04.2


Flow Diagram

A directed acyclic graph of your pipeline.

pipeline-calculate-targeted-coverage graph


Pipeline Steps

1. Depth Calculation

Per-base depth is calculated from the input BAM file at coordinates specified by the input target BED file using samtools depth. If off_target_depth is set to true, per-base read depth is also calculated genome-wide at dbSNP loci with a dbSNP reference VCF used as the coordinate file to samtools depth.

2. BED Formatting

TSV output from samtools depth is converted into BED format using awk with read depth reported in the fourth column. Per-base read depth across multiple-base-pair target intervals is collapsed into a comma-separated list of read depth values, one for each base pair encompassed by the interval (bedtools merge).

3. dbSNP off-target site filtering

dbSNP coordinates are filtered to keep only off-target regions. This is done by excluding coordinates specified in the target BED file from the dbSNP read depth BED using bedtools intersect. Near-target regions (+/- 500bp by default) are also excluded by first adding near-target buffers to the specified target intervals using bedtools slop.

4. dbSNP enriched read depth filtering

dbSNP coordinates from step 2 are filtered to keep sites exceeding a minimum read depth threshold (30x by default) using awk.

5. dbSNP coverage-enriched interval expansion

Filtered dbSNP coordinates from step 4 are expanded to include nearby basepairs, so that sites that are close together can be subsequently be merged into one interval (bedtools slop).

6. On-target and enriched off-target interval merging

Coverage enriched dbSNP intervals are merged with the original target intervals into one BED file using a series of bash commands that concatenate and sort the two files, then merge with bedtools.

7. Metrics Reporting

Target BED file and optional bait file are converted to INTERVAL_LIST format using picard BedToIntervalList then used to report metrics on input BAM with picard CollectHsMetrics.


Inputs and Configuration

Input and Input Parameter/Flag Required Type Description
input.BAM yes path BAM file for which to calculate coverage, path provided in input yaml.
target_BED yes path BED file specifying target intervals (defines regions for target and off-target coverage operations).
save_intermediate_files yes boolean Whether to save intermediate files.
reference_dict yes path Human genome reference dictionary file for use in BED to INTERVAL_LIST conversion. Required if collecting metrics.
reference_dbSNP yes path dbSNP reference VCF file, with proper chromosome encoding and compression. See discussion. Required if performing off-target read depth calculation.
genome_sizes yes path Reference file consisting of chromosomes and their lengths used by bedtools slop. Required for off-target read depth workflows. .fai files accepted.
target_depth no bool Whether to calculate per-base read depth in targeted regions. Default false.
off_target_depth no bool Whether to perform off-target read depth calculation at dbSNP loci. Default true.
output_enriched_target_file no bool Whether to output a new target file containing coverage-enriched off-target dbSNP loci. Default true.
min_read_depth no bool Minimum read depth threshold for an off-target locus to be considered enriched and be included in the new target file. Default 30.
min_base_quality no integer Minimum base quality for a read to be counted in depth calculation by samtools depth. Applies to both off- and on-target calculations. Defaults to 20.
min_mapping_quality no integer Minimum mapping quality for a read to be counted in depth calculation by samtools depth. Applies to both off- and on-target calculations. Defaults to 20.
collect_metrics no bool Whether to run CollectHsMetrics. Default true.
target_interval_list no path Interval list file specifying target intervals used to calculate coverage by collecHsMetrics. If not provided, the target BED will be used to calculate the intervals.
bait_BED no path BED file with bait locations that can be used to generate a bait interval list used by CollecHsMetrics. If not provided, the target BED will be used.
bait_interval_list no path Interval list file specifying bait intervals used by CollectHsMetrics. If not provided, the bait BED will be used to calculate the intervals.
save_interval_list yes boolean Whether to save a copy of any generated interval lists. Saves to the output_dir.
save_all_dbSNP no boolean Whether to save a copy of the read depth BED file for all dbSNP loci generated by the off-target workflows. Default false.
save_raw_target_bed no boolean Whether to save a copy of the per-base, target read depth BED with uncollapsed intervals. Default false.
off_target_slop no integer Number of base pairs to add to either side of target file coordinates so that they may be excluded from off-target read depth calculation. Default is 500.
dbSNP_slop no integer Number of base pairs to add to either side of off-target dbSNP loci to generate off-target intervals. The purpose is to merge adjacent dbSNP loci into single intervals prior to mergeing with target intervals. Default is 150.
coverage_cap no integer COVERAGE_CAP parameter for CollectHsMetrics, determines the coverage threshold at which to stop calculating coverage.
near_distance no integer NEAR_DISTANCE parameter for CollectHsMetrics, determines the maximum distance in bp of a read from the nearest probe (bait) for it to be counted as "near probe" in metrics calculations. Default 250.
samtools_depth_extra_args no string Extra arguments for samtools depth.
picard_CollectHsMetrics_extra_args no string Extra arguments for picard CollectHsMetrics.
merge_operation no string Operation performed on read depth column values when intervals are collapsed during bedtools merge. Defaults to 'collapse'. See bedtools documentation for other options.
work_dir no path Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With ucla_cds, the default is /scratch and should only be changed for testing/development. Changing this directory to /hot or /tmp can lead to high server latency and potential disk space limitations, respectively.

Outputs

Output and Output Parameter/Flag Description
output_dir Location where generated output should be saved.
*target-with-enriched-off-target-intervals.bed New target file including original target intervals and intervals encompassing coverage-enriched off-target dbSNP sites.
*target-with-enriched-off-target-intervals.bed.gz New compressed target file including original target intervals and intervals encompassing coverage-enriched off-target dbSNP sites.
*off-target-dbSNP-depth-per-base.bed Per-base read depth at dbSNP loci outside of targeted regions.
*collapsed_coverage.bed Per-base read depth at specified target intervals, collapsed by interval. (OPTIONAL) Set target_depth in config file.
*target-depth-per-base.bed Per-base read depth at target intervals (not collapsed). (OPTIONAL) set save_raw_target_bed in config file.
*genome-wide-dbSNP-depth-per-base.bed Per-base read depth at all dbSNP loci. (OPTIONAL) Set save_all_dbSNP in config file.
*HsMetrics.txt QC output from CollectHsMetrics()
.tsv,.bed Intermediate outputs of unformatted and unmerged depth files. (OPTIONAL) Set save_intermediate_files in config file.
.interval_list Intermediate output of target bed file converted to picard's interval list format. (OPTIONAL) Set save_interval_list in config file.
report.html, timeline.html and trace.txt A Nextflowreport, timeline and trace files
log.command.* Process specific logging files created by nextflow.

Performance Validation

Testing was performed in the Boutros Lab SLURM Development cluster. Pipeline version used here is v1.0.0-rc.1

Targeted Panels

General estimates, with wide variations, are that smaller gene panel experiments require 16 cpus and 32GB of memory to run all processes efficiently in parallel. However each individual process requires much fewer resources, and 1CPU and 1GB is frequently sufficient for most component tools. Larger numbers of targets may increase memory requirements, particularly for interval merging steps.

Whole Exomes

General estimates, with wide variations, are that whole exome experiments require 16 CPUs and 32GB of memory to run all processes efficiently in parallel. However each individual process requires much fewer resources, and 1CPU and 1GB is frequently sufficient for most component tools.


References

  1. SAMtools
  2. Picard
  3. BEDtools

Discussions


Contributors

Please see list of Contributors at GitHub.


License

pipeline-calculate-targeted-coverage is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.

pipeline-calculate-targeted-coverage performs read-depth related calculations on BAMs from targeted sequencing experiments.

Copyright (C) 2022-2024 University of California Los Angeles ("Boutros Lab") All rights reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.