Subdivide FilterBatch and add SV count plots to enable IQR cutoff selection (#220)
epiercehoffman authored Aug 30, 2021
1 parent 20684c7 commit b5e954c
Showing 30 changed files with 1,096 additions and 1,007 deletions.
7 changes: 5 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -329,7 +329,10 @@ Generates variant metrics for filtering.
## <a name="filter-batch">FilterBatch</a>
*Formerly Module03*

Filters poor quality variants and filters outlier samples.
Filters poor quality variants and filters outlier samples. This workflow can be run all at once with the WDL at `wdl/FilterBatch.wdl`, or it can be run in three steps to enable tuning of outlier filtration cutoffs. The three subworkflows are:
1. FilterBatchSites: Per-batch variant filtration
2. PlotSVCountsPerSample: Visualize SV counts per sample per type to help choose an IQR cutoff for outlier filtering, and preview outlier samples for a given cutoff
3. FilterBatchSamples: Per-batch outlier sample filtration; provide an appropriate `outlier_cutoff_nIQR` based on the SV count plots and outlier previews from step 2.
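
To make the role of the IQR cutoff concrete, here is a minimal Python sketch of IQR-based outlier flagging for a single SV type. It assumes a sample counts as an outlier when its SV count falls below Q1 - n*IQR or above Q3 + n*IQR; the exact rule used inside the workflow is not shown here and may differ. The function name `flag_outliers` and the sample counts are hypothetical.

```python
import statistics


def flag_outliers(counts, n_iqr):
    """Return sample IDs whose SV count lies outside [Q1 - n*IQR, Q3 + n*IQR].

    `counts` maps sample ID -> SV count for a single SV type.
    Assumed rule for illustration; not taken from the pipeline source.
    """
    q1, _, q3 = statistics.quantiles(counts.values(), n=4)
    iqr = q3 - q1
    low, high = q1 - n_iqr * iqr, q3 + n_iqr * iqr
    return sorted(s for s, c in counts.items() if c < low or c > high)


# Hypothetical deletion counts for nine samples; s9 is an obvious outlier.
del_counts = {
    "s1": 98, "s2": 99, "s3": 100, "s4": 101, "s5": 102,
    "s6": 103, "s7": 104, "s8": 105, "s9": 400,
}
flagged_at_6 = flag_outliers(del_counts, n_iqr=6)
flagged_at_10000 = flag_outliers(del_counts, n_iqr=10000)
print(flagged_at_6, flagged_at_10000)
```

With these made-up counts, only the sample with 400 calls is flagged at a cutoff of 6, and a cutoff of 10000 flags nothing.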

#### Prerequisites:
* [GenerateBatchMetrics](#generate-batch-metrics)
@@ -441,7 +444,7 @@ gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2
```

* BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches
* FilterOutlierSamples - remove outlier samples with unusually high or low number of SVs
* FilterOutlierSamplesPostMinGQ - remove outlier samples with unusually high or low number of SVs
* FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation

## <a name="annotate-vcf">AnnotateVcf</a> (in development)
Original file line number Diff line number Diff line change
@@ -55,14 +55,19 @@ The following workflows are included in this workspace, to be executed in this o
4. `04-GatherBatchEvidence`: Per-batch copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
5. `05-ClusterBatch`: Per-batch variant clustering
6. `06-GenerateBatchMetrics`: Per-batch metric generation for variant filtering
7. `07-FilterBatch`: Per-batch variant filtering; outlier exclusion
8. (Skip for a single batch) `08-MergeBatchSites`: Site merging of SVs discovered across batches, run on a cohort-level `sample_set_set`
9. `09-GenotypeBatch`: Per-batch genotyping of all sites in the cohort. Use `09-GenotypeBatch_SingleBatch` if you only have one batch.
10. `10-RegenotypeCNVs`: Cohort-level genotype refinement of some depth calls. Use `10-RegenotypeCNVs_SingleBatch` if you only have one batch.
11. `11-MakeCohortVcf`: Cohort-level cross-batch integration; complex variant resolution and re-genotyping; VCF cleanup. Use `11-MakeCohortVcf_SingleBatch` if you only have one batch.
12. `12-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets. Use `12-AnnotateVcf_SingleBatch` if you only have one batch.
7. `07a-FilterBatchSites`: Per-batch variant filtering
8. `07b-PlotSVCountsPerSample`: Plot SV counts per sample per SV type to enable choice of IQR cutoff for outlier filtration in `07c-FilterBatchSamples`
9. `07c-FilterBatchSamples`: Per-batch outlier sample filtration
10. (Skip for a single batch) `08-MergeBatchSites`: Site merging of SVs discovered across batches, run on a cohort-level `sample_set_set`
11. `09-GenotypeBatch`: Per-batch genotyping of all sites in the cohort. Use `09-GenotypeBatch_SingleBatch` if you only have one batch.
12. `10-RegenotypeCNVs`: Cohort-level genotype refinement of some depth calls. Use `10-RegenotypeCNVs_SingleBatch` if you only have one batch.
13. `11-MakeCohortVcf`: Cohort-level cross-batch integration; complex variant resolution and re-genotyping; VCF cleanup. Use `11-MakeCohortVcf_SingleBatch` if you only have one batch.
14. `12-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets. Use `12-AnnotateVcf_SingleBatch` if you only have one batch.

Additional modules, such as those for filtering and visualization, are under development. They are not included in this workspace at this time, but the source code can be found in the [GATK-SV GitHub repository](https://github.com/broadinstitute/gatk-sv).
Additional downstream modules, such as those for filtering and visualization, are under development. They are not included in this workspace at this time, but the source code can be found in the [GATK-SV GitHub repository](https://github.com/broadinstitute/gatk-sv). See **Downstream steps** towards the bottom of this page for more information.

Extra workflows (Not part of canonical pipeline, but included for your convenience. May require manual configuration):
* `FilterOutlierSamples`: Filter outlier samples (in terms of SV counts) from a single VCF. Recommended to run `07b-PlotSVCountsPerSample` beforehand (reconfigured with the single VCF you want to filter) to enable IQR cutoff choice.

For detailed instructions on running the pipeline in Terra, see **Step-by-step instructions** below.

@@ -178,24 +183,26 @@ Read the full documentation for these modules [here](https://github.com/broadins
* Use the same `sample_set` definitions you used for `03-TrainGCNV` and `04-GatherBatchEvidence`.


#### 07-FilterBatch
#### 07a-FilterBatchSites, 07b-PlotSVCountsPerSample, 07c-FilterBatchSamples

Read the full FilterBatch documentation [here](https://github.com/broadinstitute/gatk-sv#filter-batch).
These three workflows make up FilterBatch; they are subdivided in this workspace to enable tuning of outlier filtration cutoffs. Read the full FilterBatch documentation [here](https://github.com/broadinstitute/gatk-sv#filter-batch).
* Use the same `sample_set` definitions you used for `03-TrainGCNV` through `06-GenerateBatchMetrics`.
* The default value for `outlier_cutoff_nIQR`, which is used to filter samples that have an abnormal number of SV calls, is 10000. This essentially means that no samples are filtered. You should adjust this value depending on your scientific needs.
* `07a-FilterBatchSites` does not require user intervention
* `07b-PlotSVCountsPerSample` produces SV count plots and files, as well as a preview of the outlier samples to be filtered, but it does not perform any filtering of the VCFs. The input `N_IQR_cutoff` is used to visualize filtration thresholds on the SV count plots and preview the samples to be filtered; the default value is set to 6. You can adjust this value depending on your needs, and you can re-run the workflow with new `N_IQR_cutoff` values until the plots and outlier sample lists suit the purposes of your study. Once you have chosen an IQR cutoff, provide it to the `N_IQR_cutoff` input in `07c-FilterBatchSamples` to filter the VCFs using the chosen cutoff.
* `07c-FilterBatchSamples` performs outlier sample filtration, removing samples with an abnormal number of SV calls of at least one SV type. To tune the filtering threshold to your needs, edit the `N_IQR_cutoff` input value based on the plots and outlier sample preview lists from `07b-PlotSVCountsPerSample`. The default value for `N_IQR_cutoff` in this step is 10000, which essentially means that no samples are filtered.
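
The tuning loop described in these bullets can be previewed offline with a short Python sketch. It sweeps several candidate cutoffs and lists the samples that would be outliers for at least one SV type. The IQR rule (outside Q1 - n*IQR to Q3 + n*IQR) is an assumption, and the function name, SV counts, and sample IDs are hypothetical; the workflow's own plots and preview lists remain the reference.

```python
import statistics


def outliers_any_type(counts_by_type, n_iqr):
    """Samples that are outliers for at least one SV type (assumed IQR rule)."""
    flagged = set()
    for counts in counts_by_type.values():
        q1, _, q3 = statistics.quantiles(counts.values(), n=4)
        margin = n_iqr * (q3 - q1)
        flagged |= {s for s, c in counts.items()
                    if c < q1 - margin or c > q3 + margin}
    return sorted(flagged)


# Hypothetical per-type SV counts for nine samples labelled a..i.
counts_by_type = {
    "DEL": dict(zip("abcdefghi", [98, 99, 100, 101, 102, 103, 104, 105, 400])),
    "DUP": dict(zip("abcdefghi", [30, 50, 51, 52, 53, 54, 55, 56, 57])),
}

# Preview how the outlier list changes as the cutoff is relaxed.
preview = {cutoff: outliers_any_type(counts_by_type, cutoff)
           for cutoff in (2, 6, 10000)}
for cutoff, samples in preview.items():
    print(cutoff, samples)
```

In this toy data, a strict cutoff of 2 flags two samples, 6 flags one, and 10000 flags none, which mirrors the idea of re-running the plotting step until the preview list looks reasonable before filtering.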

#### 08-MergeBatchSites

Read the full MergeBatchSites documentation [here](https://github.com/broadinstitute/gatk-sv#merge-batch-sites).
* If you only have one batch, skip this workflow.
* For a multi-batch cohort, `08-MergeBatchSites` is a cohort-level workflow, so it is run on a `sample_set_set` containing all of the batches in the cohort. You can create this `sample_set_set` while you are launching the `08-MergeBatchSites` workflow: click "Select Data", choose "Create new sample_set_set [...]", check all the batches to include (all of the ones used in `03-TrainGCNV` through `07-FilterBatch`), and give it a name that follows the **Sample ID requirements**.
* For a multi-batch cohort, `08-MergeBatchSites` is a cohort-level workflow, so it is run on a `sample_set_set` containing all of the batches in the cohort. You can create this `sample_set_set` while you are launching the `08-MergeBatchSites` workflow: click "Select Data", choose "Create new sample_set_set [...]", check all the batches to include (all of the ones used in `03-TrainGCNV` through `07c-FilterBatchSamples`), and give it a name that follows the **Sample ID requirements**.

<img alt="creating a cohort sample_set_set" title="How to create a cohort sample_set_set" src="https://i.imgur.com/zKEtSbe.png" width="500">

#### 09-GenotypeBatch

Read the full GenotypeBatch documentation [here](https://github.com/broadinstitute/gatk-sv#genotype-batch).
* Use the same `sample_set` definitions you used for `03-TrainGCNV` through `07-FilterBatch`.
* Use the same `sample_set` definitions you used for `03-TrainGCNV` through `07c-FilterBatchSamples`.
* If you only have one batch, use the `09-GenotypeBatch_SingleBatch` version of the workflow.

#### 10-RegenotypeCNVs, 11-MakeCohortVcf, and 12-AnnotateVcf

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
{
"FilterBatchSamples.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
"FilterBatchSamples.sv_base_mini_docker": "${workspace.sv_base_mini_docker}",
"FilterBatchSamples.linux_docker" : "${workspace.linux_docker}",

"FilterBatchSamples.N_IQR_cutoff": "10000",

"FilterBatchSamples.batch": "${this.sample_set_id}",
"FilterBatchSamples.vcfs" : "${this.sites_filtered_vcfs}",
"FilterBatchSamples.sv_counts": "${this.sv_counts}"
}
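
As a hedged illustration of tuning this configuration outside the Terra UI, the Python sketch below overrides `FilterBatchSamples.N_IQR_cutoff` in an abbreviated copy of the JSON above. The helper `with_cutoff` is hypothetical, and keeping the value as a string simply follows the quoting used in the file.

```python
import json

# Abbreviated copy of the input configuration shown above.
config_text = """
{
  "FilterBatchSamples.N_IQR_cutoff": "10000",
  "FilterBatchSamples.batch": "${this.sample_set_id}"
}
"""


def with_cutoff(config, cutoff):
    """Return a copy of the config with the chosen IQR cutoff, kept as a string."""
    updated = dict(config)
    updated["FilterBatchSamples.N_IQR_cutoff"] = str(cutoff)
    return updated


original = json.loads(config_text)
tuned = with_cutoff(original, 6)
print(json.dumps(tuned, indent=2))
```

Returning a copy leaves the default (10000, which effectively disables filtering) untouched, so the two settings can be compared side by side.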
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
{
"FilterBatchSites.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",

"FilterBatchSites.batch": "${this.sample_set_id}",
"FilterBatchSites.depth_vcf" : "${this.clustered_depth_vcf}",
"FilterBatchSites.manta_vcf" : "${this.clustered_manta_vcf}",
"FilterBatchSites.wham_vcf" : "${this.clustered_wham_vcf}",
"FilterBatchSites.melt_vcf" : "${this.clustered_melt_vcf}",
"FilterBatchSites.evidence_metrics": "${this.metrics}",
"FilterBatchSites.evidence_metrics_common": "${this.metrics_common}"
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
{
"FilterOutlierSamples.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
"FilterOutlierSamples.sv_base_mini_docker": "${workspace.sv_base_mini_docker}",
"FilterOutlierSamples.linux_docker" : "${workspace.linux_docker}",

"FilterOutlierSamples.N_IQR_cutoff": "6",

"FilterOutlierSamples.name": "${this.sample_set_set_id}",
"FilterOutlierSamples.vcf" : "${this.output_vcf}"
}
Original file line number Diff line number Diff line change
@@ -16,14 +16,14 @@

"GenotypeBatch.batch": "${this.sample_set_id}",
"GenotypeBatch.rf_cutoffs": "${this.cutoffs}",
"GenotypeBatch.batch_depth_vcf": "${this.filtered_depth_vcf}",
"GenotypeBatch.batch_pesr_vcf": "${this.filtered_pesr_vcf}",
"GenotypeBatch.batch_depth_vcf": "${this.outlier_filtered_depth_vcf}",
"GenotypeBatch.batch_pesr_vcf": "${this.outlier_filtered_pesr_vcf}",
"GenotypeBatch.ped_file": "${workspace.cohort_ped_file}",
"GenotypeBatch.bin_exclude": "${workspace.bin_exclude}",
"GenotypeBatch.discfile": "${this.merged_PE}",
"GenotypeBatch.coveragefile": "${this.merged_bincov}",
"GenotypeBatch.splitfile": "${this.merged_SR}",
"GenotypeBatch.medianfile": "${this.median_cov}",
"GenotypeBatch.cohort_depth_vcf": "${this.filtered_depth_vcf}",
"GenotypeBatch.cohort_pesr_vcf": "${this.filtered_pesr_vcf}"
"GenotypeBatch.cohort_depth_vcf": "${this.outlier_filtered_depth_vcf}",
"GenotypeBatch.cohort_pesr_vcf": "${this.outlier_filtered_pesr_vcf}"
}
Original file line number Diff line number Diff line change
@@ -16,8 +16,8 @@

"GenotypeBatch.batch": "${this.sample_set_id}",
"GenotypeBatch.rf_cutoffs": "${this.cutoffs}",
"GenotypeBatch.batch_depth_vcf": "${this.filtered_depth_vcf}",
"GenotypeBatch.batch_pesr_vcf": "${this.filtered_pesr_vcf}",
"GenotypeBatch.batch_depth_vcf": "${this.outlier_filtered_depth_vcf}",
"GenotypeBatch.batch_pesr_vcf": "${this.outlier_filtered_pesr_vcf}",
"GenotypeBatch.ped_file": "${workspace.cohort_ped_file}",
"GenotypeBatch.bin_exclude": "${workspace.bin_exclude}",
"GenotypeBatch.discfile": "${this.merged_PE}",
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
{
"MergeBatchSites.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
"MergeBatchSites.cohort": "${this.sample_set_set_id}",
"MergeBatchSites.pesr_vcfs": "${this.sample_sets.filtered_pesr_vcf}",
"MergeBatchSites.depth_vcfs": "${this.sample_sets.filtered_depth_vcf}"
"MergeBatchSites.pesr_vcfs": "${this.sample_sets.outlier_filtered_pesr_vcf}",
"MergeBatchSites.depth_vcfs": "${this.sample_sets.outlier_filtered_depth_vcf}"
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
{
"PlotSVCountsPerSample.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",

"PlotSVCountsPerSample.N_IQR_cutoff": "6",

"PlotSVCountsPerSample.prefix": "${this.sample_set_id}",
"PlotSVCountsPerSample.vcfs" : "${this.sites_filtered_vcfs}",
"PlotSVCountsPerSample.vcf_identifiers" : "${this.algorithms_filtersites}"
}
Original file line number Diff line number Diff line change
@@ -12,10 +12,10 @@

"RegenotypeCNVs.RD_depth_sepcutoffs": "${this.trained_genotype_depth_depth_sepcutoff}",

"RegenotypeCNVs.cohort_depth_vcf": "${this.filtered_depth_vcf}",
"RegenotypeCNVs.cohort_depth_vcf": "${this.outlier_filtered_depth_vcf}",

"RegenotypeCNVs.ped_file": "${workspace.cohort_ped_file}",
"RegenotypeCNVs.batch_depth_vcfs": "${this.filtered_depth_vcf}",
"RegenotypeCNVs.batch_depth_vcfs": "${this.outlier_filtered_depth_vcf}",

"RegenotypeCNVs.depth_vcfs": "${this.genotyped_depth_vcf}",
"RegenotypeCNVs.coveragefiles": "${this.merged_bincov}",
Original file line number Diff line number Diff line change
@@ -15,7 +15,7 @@
"RegenotypeCNVs.cohort_depth_vcf": "${workspace.cohort_depth_vcf}",

"RegenotypeCNVs.ped_file": "${workspace.cohort_ped_file}",
"RegenotypeCNVs.batch_depth_vcfs": "${this.sample_sets.filtered_depth_vcf}",
"RegenotypeCNVs.batch_depth_vcfs": "${this.sample_sets.outlier_filtered_depth_vcf}",

"RegenotypeCNVs.depth_vcfs": "${this.sample_sets.genotyped_depth_vcf}",
"RegenotypeCNVs.coveragefiles": "${this.sample_sets.merged_bincov}",
8 changes: 8 additions & 0 deletions input_values/test_batch_large.json
Original file line number Diff line number Diff line change
@@ -1840,6 +1840,14 @@
"TCGA-W9-A837-10A-01D-A706-36"
],
"samples_post_filtering_file" : "gs://gatk-sv-resources/test/module03/large/output/test_large.post03_outliers_excluded.samples.list",
"sites_filtered_svcounts_depth": "gs://gatk-sv-resources/test/module03/large/SVCounts/test_large.depth.svcounts.txt",
"sites_filtered_svcounts_manta": "gs://gatk-sv-resources/test/module03/large/SVCounts/test_large.manta.svcounts.txt",
"sites_filtered_svcounts_melt": "gs://gatk-sv-resources/test/module03/large/SVCounts/test_large.melt.svcounts.txt",
"sites_filtered_svcounts_wham": "gs://gatk-sv-resources/test/module03/large/SVCounts/test_large.wham.svcounts.txt",
"sites_filtered_depth_vcf": "gs://gatk-sv-resources/test/module03/large/FilterBatchSites/test_large.depth.with_evidence.vcf.gz",
"sites_filtered_manta_vcf": "gs://gatk-sv-resources/test/module03/large/FilterBatchSites/test_large.manta.with_evidence.vcf.gz",
"sites_filtered_melt_vcf": "gs://gatk-sv-resources/test/module03/large/FilterBatchSites/test_large.melt.with_evidence.vcf.gz",
"sites_filtered_wham_vcf": "gs://gatk-sv-resources/test/module03/large/FilterBatchSites/test_large.wham.with_evidence.vcf.gz",
"snp_vcfs" : [
"gs://gatk-sv-resources/test/module00a/large/inputs/vcf/test_large.0.vcf.gz",
"gs://gatk-sv-resources/test/module00a/large/inputs/vcf/test_large.1.vcf.gz",
2 changes: 1 addition & 1 deletion scripts/test/terra_validation.py
Original file line number Diff line number Diff line change
@@ -103,7 +103,7 @@ def main():
parser.add_argument("-j", "--womtool-jar", help="Path to womtool jar", required=True)
parser.add_argument("-n", "--num-input-jsons",
help="Number of Terra input JSONs expected",
required=False, default=16, type=int)
required=False, default=19, type=int)
parser.add_argument("--log-level",
help="Specify level of logging information, ie. info, warning, error (not case-sensitive)",
required=False, default="INFO")
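
The hunk above only changes the default for `--num-input-jsons`; how the script uses that value is not shown. As a sketch under that caveat, one plausible use is to count the JSON files under a directory and compare against the expected number. The helpers `count_input_jsons` and `check_count`, and the temporary directory layout, are hypothetical.

```python
import argparse
import tempfile
from pathlib import Path


def count_input_jsons(directory):
    """Count *.json files anywhere under `directory`."""
    return sum(1 for _ in Path(directory).rglob("*.json"))


def check_count(directory, expected):
    """Raise if the number of JSONs found differs from the expected number."""
    found = count_input_jsons(directory)
    if found != expected:
        raise ValueError(f"Expected {expected} input JSONs, found {found}")
    return found


parser = argparse.ArgumentParser()
parser.add_argument("-n", "--num-input-jsons", type=int, default=19)
# Parse an explicit argument list so the sketch runs without a command line.
args = parser.parse_args(["-n", "2"])

with tempfile.TemporaryDirectory() as tmp:
    for name in ("FilterBatchSites.json", "FilterBatchSamples.json"):
        Path(tmp, name).write_text("{}")
    result = check_count(tmp, args.num_input_jsons)
    print(result)
```

A hard-coded expected count like this has to be bumped whenever input JSONs are added or removed, which is consistent with the default moving from 16 to 19 in this commit.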
24 changes: 24 additions & 0 deletions test_input_templates/FilterBatch/FilterBatchSamples.json.tmpl
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
{
"FilterBatchSamples.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},
"FilterBatchSamples.sv_base_mini_docker":{{ dockers.sv_base_mini_docker | tojson }},
"FilterBatchSamples.linux_docker" : {{ dockers.linux_docker | tojson }},

"FilterBatchSamples.N_IQR_cutoff": "10000",
"FilterBatchSamples.outlier_cutoff_table" : {{ test_batch.outlier_cutoff_table | tojson }},

"FilterBatchSamples.batch": {{ test_batch.batch_name | tojson }},
"FilterBatchSamples.vcfs" : [
{{ test_batch.sites_filtered_manta_vcf | tojson }},
null,
{{ test_batch.sites_filtered_wham_vcf | tojson }},
{{ test_batch.sites_filtered_melt_vcf | tojson }},
{{ test_batch.sites_filtered_depth_vcf | tojson }}
],
"FilterBatchSamples.sv_counts": [
{{ test_batch.sites_filtered_svcounts_manta | tojson }},
null,
{{ test_batch.sites_filtered_svcounts_wham | tojson }},
{{ test_batch.sites_filtered_svcounts_melt | tojson }},
{{ test_batch.sites_filtered_svcounts_depth | tojson }}
]
}
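
In this template, `vcfs` and `sv_counts` have the same length and carry `null` in the same slot. Assuming the two arrays are meant to be index-aligned (an assumption, not something the template states), a small Python check can pair them and skip empty slots. The function `pair_present` and the file names are hypothetical.

```python
def pair_present(vcfs, sv_counts):
    """Pair each VCF with its SV counts file, skipping slots where both are null."""
    if len(vcfs) != len(sv_counts):
        raise ValueError("vcfs and sv_counts must have the same length")
    pairs = []
    for vcf, counts in zip(vcfs, sv_counts):
        # A slot with only one of the two values set is treated as a config error.
        if (vcf is None) != (counts is None):
            raise ValueError("a VCF and its counts file must both be set or both be null")
        if vcf is not None:
            pairs.append((vcf, counts))
    return pairs


# Hypothetical file names; None stands in for the JSON null placeholder.
vcfs = ["batch.manta.vcf.gz", None, "batch.depth.vcf.gz"]
sv_counts = ["batch.manta.svcounts.txt", None, "batch.depth.svcounts.txt"]
pairs = pair_present(vcfs, sv_counts)
print(pairs)
```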
11 changes: 11 additions & 0 deletions test_input_templates/FilterBatch/FilterBatchSites.json.tmpl
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
{
"FilterBatchSites.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},

"FilterBatchSites.batch": {{ test_batch.batch_name | tojson }},
"FilterBatchSites.depth_vcf" : {{ test_batch.merged_depth_vcf | tojson }},
"FilterBatchSites.manta_vcf" : {{ test_batch.merged_manta_vcf | tojson }},
"FilterBatchSites.wham_vcf" : {{ test_batch.merged_wham_vcf | tojson }},
"FilterBatchSites.melt_vcf" : {{ test_batch.merged_melt_vcf | tojson }},
"FilterBatchSites.evidence_metrics": {{ test_batch.evidence_metrics | tojson }},
"FilterBatchSites.evidence_metrics_common": {{ test_batch.evidence_metrics_common | tojson }}
}
14 changes: 14 additions & 0 deletions test_input_templates/FilterBatch/PlotSVCountsPerSample.json.tmpl
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
{
"PlotSVCountsPerSample.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},

"PlotSVCountsPerSample.N_IQR_cutoff": "6",

"PlotSVCountsPerSample.prefix": {{ test_batch.batch_name | tojson }},
"PlotSVCountsPerSample.vcfs" : [
{{ test_batch.sites_filtered_manta_vcf | tojson }},
{{ test_batch.sites_filtered_wham_vcf | tojson }},
{{ test_batch.sites_filtered_melt_vcf | tojson }},
{{ test_batch.sites_filtered_depth_vcf | tojson }}
],
"PlotSVCountsPerSample.vcf_identifiers" : ["manta", "wham", "melt", "depth"]
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
{
"FilterOutlierSamples.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},
"FilterOutlierSamples.sv_base_mini_docker": {{ dockers.sv_base_mini_docker | tojson }},
"FilterOutlierSamples.linux_docker" : {{ dockers.linux_docker | tojson }},

"FilterOutlierSamples.N_IQR_cutoff": "6",

"FilterOutlierSamples.name": "test_large",
"FilterOutlierSamples.vcf_identifier": "cohort_outlier_filtered",
"FilterOutlierSamples.vcf" : {{ test_batch.baseline_final_vcf | tojson }}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
{
"PlotSVCountsPerSample.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},

"PlotSVCountsPerSample.N_IQR_cutoff": "6",

"PlotSVCountsPerSample.prefix": {{ test_batch.batch_name | tojson }},
"PlotSVCountsPerSample.vcfs" : [
{{ test_batch.baseline_final_vcf | tojson }}
],
"PlotSVCountsPerSample.vcf_identifiers" : ["cohort_outlier_filtered"]
}