Add XY filtration workflow #191

Open
wants to merge 67 commits into base: main

Changes from 44 commits
Commits
73e1e03
add patient_sex to template
Faizal-Eeman Nov 26, 2024
7cd06a6
add XY filter script from project-method-AlgorithmEvaluation-BNCH-000…
Faizal-Eeman Nov 26, 2024
cc5ba70
add par_bed parameter to template.config
Faizal-Eeman Dec 7, 2024
13fb844
add genome build to template
Faizal-Eeman Dec 7, 2024
43fd101
add user input sample id
Faizal-Eeman Dec 19, 2024
cfb4198
add sample sex; remove redundant code; exremove het calls for XY
Faizal-Eeman Dec 19, 2024
6d0ed9e
fix variables
Faizal-Eeman Dec 19, 2024
7abfcf7
extract autosomes
Faizal-Eeman Dec 19, 2024
9417465
merge autosomes and XY filtered calls
Faizal-Eeman Dec 19, 2024
f8417d5
change vcf header extraction location
Faizal-Eeman Dec 19, 2024
055ab3e
Add workflow steps to script note
Faizal-Eeman Dec 19, 2024
6c31337
clean up script
Faizal-Eeman Dec 19, 2024
08213c9
add skeleton code for vcf header temp file
Faizal-Eeman Dec 20, 2024
4be2064
write VCF source to temp file and parameterize it
Faizal-Eeman Dec 21, 2024
ac2d36c
add arg for variant caller
Faizal-Eeman Dec 21, 2024
4ff9575
set variant caller source
Faizal-Eeman Dec 21, 2024
7bdd8cc
improve documentation
Faizal-Eeman Dec 21, 2024
1eadfbb
add hail v0.2.133 and docker image to default config
Faizal-Eeman Dec 21, 2024
24cb06d
add filter-xy NF script
Faizal-Eeman Dec 21, 2024
27b39da
add NF skeleton
Faizal-Eeman Dec 21, 2024
73c7f1e
add script dir var in main
Faizal-Eeman Dec 21, 2024
570677b
add xy filtration command
Faizal-Eeman Dec 21, 2024
d247577
rename XY script
Faizal-Eeman Dec 21, 2024
71c294b
change arg variant_caller to vcf_source_file
Faizal-Eeman Dec 28, 2024
6cb1d9c
add VCF source extraction code to script section in nextflow module
Faizal-Eeman Dec 28, 2024
8b8948d
revert script output to bgz
Faizal-Eeman Dec 28, 2024
6e22838
set publishDir
Faizal-Eeman Dec 28, 2024
f41aab6
add channel for xy filter and call process
Faizal-Eeman Dec 28, 2024
28698a2
include filter XY module in main
Faizal-Eeman Dec 28, 2024
e572c84
simplify xy filter channel
Faizal-Eeman Dec 30, 2024
f43e8f9
add script dir ch
Faizal-Eeman Dec 30, 2024
08e7387
add script dir input to NF module
Faizal-Eeman Dec 30, 2024
c0af897
fix docker tag
Faizal-Eeman Dec 30, 2024
4a0687e
update script command to take vcf file source
Faizal-Eeman Jan 7, 2025
0e552e6
update sample sex parameter in template
Faizal-Eeman Jan 7, 2025
6d78944
update template config
Faizal-Eeman Jan 7, 2025
a99e3bb
add parameters to schema
Faizal-Eeman Jan 7, 2025
a4150b5
add genome build arg to script command
Faizal-Eeman Jan 7, 2025
9bf2c4d
parameterize genome build in script
Faizal-Eeman Jan 7, 2025
5408c83
temporarily add hail dev tag
Faizal-Eeman Jan 7, 2025
b3123f0
add params.par_bed as input
Faizal-Eeman Jan 7, 2025
d026262
add par_bed as process input
Faizal-Eeman Jan 7, 2025
3e0047f
fix output vcf dataset at export in script
Faizal-Eeman Jan 7, 2025
365c331
fix pylint
Faizal-Eeman Jan 8, 2025
f70361b
update docker hail version
Faizal-Eeman Jan 8, 2025
7928c5e
remove cat command
Faizal-Eeman Jan 8, 2025
7ab2a6e
fix log output dir
Faizal-Eeman Jan 8, 2025
29de6e0
standardize process name with tool name at the end
Faizal-Eeman Jan 8, 2025
f779ca5
standardize process name with tool name at the end
Faizal-Eeman Jan 8, 2025
c67cc5c
update sample_sex comment in template
Faizal-Eeman Jan 8, 2025
96768ff
add default and choices for genome_build
Faizal-Eeman Jan 8, 2025
b2cf9e8
add choices for sample_sex in schema
Faizal-Eeman Jan 8, 2025
1e51960
add resource allocation to filter_XY_Hail
Faizal-Eeman Jan 8, 2025
c51d3c3
emit xy filtered output
Faizal-Eeman Jan 8, 2025
a69f46c
generate checksum for xy filtered vqsr VCF
Faizal-Eeman Jan 8, 2025
bd9e781
add system command to VCF header
Faizal-Eeman Jan 9, 2025
002ffc3
Add XY filtration step to README
Faizal-Eeman Jan 9, 2025
f783530
fix process name
Faizal-Eeman Jan 9, 2025
f96c632
add XY filteration params to README
Faizal-Eeman Jan 9, 2025
9b6e036
Update script description
Faizal-Eeman Jan 9, 2025
6e571bc
add xy_filtration_workflow.md
Faizal-Eeman Jan 9, 2025
d9a1eae
add GRCh38 PAR to README
Faizal-Eeman Jan 9, 2025
bd568b6
fix pylint
Faizal-Eeman Jan 9, 2025
8f8326d
fix publishDir rules
Faizal-Eeman Jan 9, 2025
9b06490
Update outputs in README
Faizal-Eeman Jan 9, 2025
ca0e4f4
Upddate CHANGELOG
Faizal-Eeman Jan 9, 2025
e8d1d5d
update output filename
Faizal-Eeman Jan 10, 2025
4 changes: 3 additions & 1 deletion config/default.config
@@ -20,10 +20,12 @@ params {
picard_version = "2.26.10"
pipeval_version = "4.0.0-rc.2"
gatkfilter_version = "v1.0.0"
hail_version = "branch-mmootor-fix-spark-permission"
docker_image_gatk = "broadinstitute/gatk:${params.gatk_version}"
docker_image_picard = "${-> params.docker_container_registry}/picard:${params.picard_version}"
docker_image_pipeval = "${-> params.docker_container_registry}/pipeval:${params.pipeval_version}"
docker_image_gatkfilter = "${-> params.docker_container_registry}/gatk:${params.gatkfilter_version}"
docker_image_hail = "${-> params.docker_container_registry}/hail:${params.hail_version}"

emit_all_confident_sites = false
}
@@ -36,7 +38,7 @@ process {
cache = true

executor = 'local'

// Other directives or options that should apply for every process

// total amount of resources available to the pipeline
13 changes: 13 additions & 0 deletions config/schema.yaml
@@ -3,10 +3,18 @@ patient_id:
    type: 'String'
    required: true
    help: 'Patient ID'
sample_sex:
    type: 'String'
    required: true
    help: 'Sample Sex'
dataset_id:
    type: 'String'
    required: true
    help: 'Dataset ID'
genome_build:
    type: 'String'
    required: true
    help: 'Genome build, GRCh37 or GRCh38'
output_dir:
    type: 'Path'
    mode: 'w'
@@ -62,6 +70,11 @@ bundle_phase1_1000g_snps_high_conf_vcf_gz:
    mode: 'r'
    required: true
    help: 'Absolute path to high-confidence 1000g SNPs VCF'
par_bed:
    type: 'Path'
    mode: 'r'
    required: true
    help: 'Absolute path to Pseudo-autosomal Region (PAR) BED'
base_resource_update:
    type: 'ResourceUpdateNamespace'
    required: false
8 changes: 8 additions & 0 deletions config/template.config
@@ -11,6 +11,11 @@ params {
dataset_id = ''
blcds_registered_dataset = false // if you want the output to be registered

genome_build = "GRCh38"

// Sample sex: 'XY' (male) or 'XX' (female), the values script/filter_xy_call.py expects
sample_sex = ''

output_dir = '/path/to/output/directory'

// Set to false to disable the publish rule and delete intermediate files as they're no longer needed
@@ -43,6 +48,9 @@ params {
bundle_omni_1000g_2p5_vcf_gz = "/hot/resource/tool-specific-input/GATK/GRCh38/1000G_omni2.5.hg38.vcf.gz"
bundle_phase1_1000g_snps_high_conf_vcf_gz = "/hot/resource/tool-specific-input/GATK/GRCh38/1000G_phase1.snps.high_confidence.hg38.vcf.gz"

// Specify BED file path for Pseudoautosomal Region (PAR)
par_bed = ""

Will this be a standardized reference in /hot/resource/ ?

Author

I'll defer this to @yashpatel6 as I do not have permission to create a dir in /hot/resource/

Here's the GRCh38 version of the PAR BED. You can remove the commented lines from the file when you copy it into /hot/resource/: /hot/project/method/AlgorithmEvaluation/BNCH-000122-GIABSexChrGermlineFilter/GIAB/AshkenazimTrio/germline-small-variant/filter_XY/pseudoautosomal_regions_hg38.bed
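
For readers unfamiliar with the PAR BED discussed in this thread, here is a minimal pure-Python sketch of how such a file could be parsed and queried. The coordinates in `EXAMPLE_PAR_BED` are the commonly cited GRCh38 PAR boundaries and are an assumption here, not the contents of the file linked above; `parse_par_bed` and `in_par` are hypothetical helper names, not part of this PR.

```python
from typing import Dict, List, Tuple

# Assumed GRCh38 PAR coordinates in BED form (0-based start, end-exclusive).
# Illustrative only: check them against the real PAR BED before relying on them.
EXAMPLE_PAR_BED = """\
# comment lines like this one are skipped
chrX\t10000\t2781479\tPAR1
chrX\t155701382\t156030895\tPAR2
chrY\t10000\t2781479\tPAR1
chrY\t56887902\t57217415\tPAR2
"""

Intervals = Dict[str, List[Tuple[int, int]]]


def parse_par_bed(text: str) -> Intervals:
    """Parse BED text into {contig: [(start, end), ...]}, skipping comments."""
    intervals: Intervals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        intervals.setdefault(contig, []).append((int(start), int(end)))
    return intervals


def in_par(intervals: Intervals, contig: str, pos: int) -> bool:
    """Return True if a 1-based VCF position lies inside any PAR interval."""
    zero_based = pos - 1  # BED starts are 0-based, VCF positions are 1-based
    return any(start <= zero_based < end for start, end in intervals.get(contig, []))


par = parse_par_bed(EXAMPLE_PAR_BED)
print(in_par(par, "chrX", 2781479))  # last base of the assumed PAR1 interval
print(in_par(par, "chrX", 2781480))  # first base after it
```

The off-by-one conversion between BED and VCF coordinates is the main thing to watch when checking boundary variants by hand.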


// Base resource allocation updater
// See README for adding parameters to update the base resource allocations
}
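
The filter script in this PR compares `sample_sex` against `'XY'` and `'XX'`. If upstream metadata records sex as `male`/`female` instead, a small mapping step could bridge the two. The sketch below is an assumption about how that might look; `normalize_sample_sex` is a hypothetical helper and is not part of this PR.

```python
# Hypothetical mapping from free-text sex labels to the karyotype strings
# that filter_xy_call.py compares against ('XY' and 'XX').
_SEX_TO_KARYOTYPE = {"male": "XY", "female": "XX", "xy": "XY", "xx": "XX"}


def normalize_sample_sex(value: str) -> str:
    """Return 'XY' or 'XX', raising ValueError for empty or unknown input."""
    key = value.strip().lower()
    if key not in _SEX_TO_KARYOTYPE:
        raise ValueError(f"Unsupported sample_sex: {value!r}")
    return _SEX_TO_KARYOTYPE[key]


print(normalize_sample_sex("Male"))
print(normalize_sample_sex("XX"))
```

Rejecting an empty value up front avoids passing a bare `--sample_sex` flag to the script, which argparse would refuse.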
21 changes: 21 additions & 0 deletions main.nf
@@ -68,6 +68,7 @@ include {
} from './module/merge-vcf.nf'
include { recalibrate_variants } from './module/workflow-recalibrate-variants.nf'
include { filter_gSNP_GATK } from './module/filter-gsnp.nf'
include { filter_XY } from './module/filter-xy.nf'
include { calculate_sha512 } from './module/checksum.nf'

// Returns the index file for the given bam or vcf
@@ -104,6 +105,12 @@ workflow {
}
.set{ input_ch_collected_files }

script_dir_ch = Channel.fromPath(
"$projectDir/script",
checkIfExists: true
)
.collect()

/**
* Input validation
*/
@@ -248,6 +255,20 @@ workflow {
recalibrate_variants.out.output_ch_recalibrated_variants
)

filter_xy_ch = recalibrate_variants.out.output_ch_recalibrated_variants
.map { it -> [it[0], it[1], it[2]] }

script_dir_ch = Channel.fromPath(
"$projectDir/script",
checkIfExists: true
)
.collect()

filter_XY(
filter_xy_ch,
params.par_bed,
script_dir_ch
)
/**
* Calculate checksums for output files
*/
69 changes: 69 additions & 0 deletions module/filter-xy.nf
@@ -0,0 +1,69 @@
include { generate_standard_filename; sanitize_string } from '../external/pipeline-Nextflow-module/modules/common/generate_standardized_filename/main.nf'

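/*
    Nextflow module for filtering chrX and chrY variant calls based on sample sex

    input:
        sample_id: identifier for sample
        recalibrated_vcf: path to VCF to filter
        recalibrated_vcf_tbi: path to index of VCF to filter
        par_bed: path to Pseudo-Autosomal Region (PAR) BED
        script_dir: path to the directory containing filter_xy_call.py

    params:
        params.output_dir_base: string(path)
        params.log_output_dir: string(path)
        params.docker_image_hail: string
        params.hail_version: string
        params.gatk_version: string
        params.dataset_id: string
        params.sample_sex: string
        params.genome_build: string
        params.par_bed: string(path)
*/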

process filter_XY {
container params.docker_image_hail

publishDir path: "${params.output_dir_base}/output",
mode: "copy",
pattern: '*.vcf.bgz*',
saveAs: {
"${output_filename}_${sanitize_string(file(it).getName().replace("${sample_id}_", ""))}"
}

publishDir path: "${params.log_output_dir}/process-log",
pattern: ".command.*",
mode: "copy",
saveAs: {
"${task.process.replace(':', '/')}/${task.process.split(':')[-1]}-${sample_id}/log${file(it).getName()}"
}

input:
tuple val(sample_id), path(recalibrated_vcf), path(recalibrated_vcf_tbi)
path(par_bed)
path(script_dir)

output:
path(".command.*")
path("${output_filename}_XY_filtered.vcf.bgz")
path("${output_filename}_XY_filtered.vcf.bgz.tbi")

script:
output_filename = generate_standard_filename(
"Hail-${params.hail_version}",
params.dataset_id,
sample_id,
[additional_tools:["GATK-${params.gatk_version}"]]
)
"""
set -euo pipefail

zgrep "##source=" ${recalibrated_vcf} > ./vcf_source.txt

cat ./vcf_source.txt

python ${script_dir}/filter_xy_call.py \
--sample_name ${output_filename} \
--input_vcf ${recalibrated_vcf} \
--vcf_source_file ./vcf_source.txt \
--sample_sex ${params.sample_sex} \
--par_bed ${par_bed} \
--genome_build ${params.genome_build} \
--output_dir .
"""
}
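
The `zgrep "##source="` step in the process above writes the variant caller's source line to a file that the script later appends to the output header. As a rough Python equivalent, the sketch below reads the same lines with the standard `gzip` module. `extract_source_lines` is a hypothetical helper, not part of this PR, and it differs from `zgrep` in two ways: it only looks at header lines that start with `##source=`, and it returns an empty list when nothing matches, whereas `zgrep` exits non-zero and would fail the task under `set -euo pipefail`.

```python
import gzip
import os
import tempfile
from typing import List


def extract_source_lines(vcf_path: str) -> List[str]:
    """Return all '##source=' header lines from a gzip/bgzip-compressed VCF."""
    sources = []
    # bgzip output is valid multi-member gzip, so the gzip module can read it.
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # the header ends at the first variant record
            if line.startswith("##source="):
                sources.append(line.rstrip("\n"))
    return sources


# Demo on a tiny throwaway VCF (contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_vcf = os.path.join(tmp_dir, "demo.vcf.gz")
    with gzip.open(demo_vcf, "wt") as out:
        out.write("##fileformat=VCFv4.2\n")
        out.write("##source=HaplotypeCaller\n")
        out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        out.write("chrX\t100\t.\tA\tG\t50\tPASS\t.\n")
    sources = extract_source_lines(demo_vcf)

print(sources)
```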
156 changes: 156 additions & 0 deletions script/filter_xy_call.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
#!/usr/bin/env python3
"""
Filter XY calls from call-gSNP single sample VCF file

Steps:
- Extract autosomes and chrX/Y variants from input VCF
- Filter chrX/Y variants
- Merge autosomal and filtered chrX/Y variants

Filter criteria:
- Extract XY calls
- Extract XY calls overlapping with Pseudo-Autosomal Regions (PARs)
- For non-PAR
    - Male sample:
        - Filter out heterozygous GT calls in chrX and chrY
        - Transform homozygous GT=1/1 to hemizygous GT=1
    - Female sample: Filter out chrY calls

Dependencies:
- Python 3
- HAIL python library (pip install hail)

Note:
- Do not export VCF to a path that is being read from in the same pipeline,\
based on HAIL recommendation
"""

import os
import argparse
import hail as hl

script_dir = os.getcwd()

parser = argparse.ArgumentParser()
parser.add_argument(
    '--sample_name',
    dest='sample_name',
    help='Sample name',
    required=True
)
parser.add_argument(
    '--input_vcf',
    dest='input_vcf',
    help='Input single sample VCF file path',
    required=True
)
parser.add_argument(
    '--vcf_source_file',
    dest='vcf_source_file',
    help='A TXT file containing variant caller source details (eg. ##source=HaplotypeCaller)',
    required=True
)
parser.add_argument(
    '--sample_sex',
    dest='sample_sex',
    help='Sample sex, XY or XX',
    required=True
)
parser.add_argument(
    '--par_bed',
    dest='par_bed',
    help='Input BED file path for Pseudo-Autosomal Regions (PAR)',
    required=True
)
parser.add_argument(
    '--genome_build',
    dest='genome_build',
    help='Genome build of input VCF, GRCh37 or GRCh38',
    required=True
)
parser.add_argument(
    '--output_dir',
    dest='output_dir',
    help='Output path where filtered XY variant VCF will be written',
    required=True
)

args = parser.parse_args()

sample_name = args.sample_name
sample_sex = args.sample_sex
vcf_file = args.input_vcf
vcf_source_file = args.vcf_source_file
par_bed = args.par_bed
genome_build = args.genome_build
output_dir = args.output_dir

#Import PAR BED file
par = hl.import_bed(
    path = par_bed,
    reference_genome = genome_build,
    skip_invalid_intervals = True
)

#Extract VCF file header
vcf_header = hl.get_vcf_metadata(vcf_file)

#Import VCF file into a hail MatrixTable
vcf_matrix = hl.import_vcf(
    path = vcf_file,
    reference_genome = genome_build,
    force_bgz = True
)

#Filter XY calls
##Extract XY calls
X_contig = vcf_matrix.locus.contig.startswith('chrX') | vcf_matrix.locus.contig.startswith('X')

I don't think we would ever encounter this in DNA-metapipeline, but just FYI I have seen X/Y encoded as chr23 and chr24 in some genetic data.

Author

Yeah, I'm aware. However, to stay consistent with variant calls in the DNA-metapipeline, chrX and chrY notation should be fine in pipeline outputs.
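
If the chr23/chr24 encoding mentioned in this thread ever did need handling, one option would be to normalize contig names before classifying them. The sketch below is only an illustration under that assumption; `sex_chromosome` and its alias table are hypothetical and not part of this PR, which checks the `chrX`/`X` and `chrY`/`Y` prefixes only.

```python
from typing import Optional

# Hypothetical alias table covering the numeric encoding raised in review.
_SEX_CONTIG_ALIASES = {"X": "X", "23": "X", "Y": "Y", "24": "Y"}


def sex_chromosome(contig: str) -> Optional[str]:
    """Return 'X' or 'Y' for a sex-chromosome contig name, else None."""
    # Strip an optional 'chr' prefix, then look up the remaining name exactly.
    name = contig[3:] if contig.lower().startswith("chr") else contig
    return _SEX_CONTIG_ALIASES.get(name.upper())


for contig in ("chrX", "Y", "chr23", "24", "chr2", "chrX_KI270880v1_alt"):
    print(contig, sex_chromosome(contig))
```

Note that this sketch uses an exact match, so alternate contigs such as `chrX_KI270880v1_alt` are not treated as chrX, whereas a `startswith('chrX')` check would include them.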

Y_contig = vcf_matrix.locus.contig.startswith('chrY') | vcf_matrix.locus.contig.startswith('Y')
extract_condition = (X_contig) | (Y_contig)
vcf_XY = vcf_matrix.filter_rows(extract_condition)
print('chrX/Y variants before XY filtration:', vcf_XY.count())

##Extract autosomes
vcf_autosomes = vcf_matrix.filter_rows(~extract_condition)

##Extract PAR and non-PAR regions
par_variants = vcf_XY.filter_rows(hl.is_defined(par[vcf_XY.locus]))
non_par_variants = vcf_XY.filter_rows(hl.is_missing(par[vcf_XY.locus]))

if sample_sex == 'XY':
    #If MALE (XY), keep only diploid homozygous-variant non-PAR chrX/chrY calls
    #(heterozygous calls are removed) and convert them to hemizygous GT=1
    non_par_filtered_variants = non_par_variants.filter_rows(
        hl.agg.all(
            non_par_variants.GT.is_diploid() & non_par_variants.GT.is_hom_var()
        )
    )
    non_par_filtered_variants = non_par_filtered_variants.annotate_entries(
        GT = hl.call(non_par_filtered_variants.GT[0])
    )

elif sample_sex == 'XX':
    #If Female (XX), remove non-PAR chrY calls
    non_par_filtered_variants = non_par_variants.filter_rows(
        non_par_variants.locus.contig.startswith('chrX') | \
        non_par_variants.locus.contig.startswith('X')
    )

else:
    #Fail early; otherwise non_par_filtered_variants is undefined below
    raise ValueError(f"sample_sex must be 'XY' or 'XX', got '{sample_sex}'")

#Combine PAR and filtered non-PAR regions
par_non_par = [par_variants, non_par_filtered_variants]
filterXY = hl.MatrixTable.union_rows(*par_non_par)
print('chrX/Y variant counts after XY filtration:', filterXY.count())

#Combine filtered X/Y + autosomal variants
autosomes_XYfiltered = [vcf_autosomes, filterXY]
output_vcf = hl.MatrixTable.union_rows(*autosomes_XYfiltered)

#Export MatrixTable to VCF
output_file = os.path.join(output_dir, sample_name + '_XY_filtered.vcf.bgz')

hl.export_vcf(
    dataset = output_vcf,
    output = output_file,
    tabix = True,
    metadata = vcf_header,
    append_to_header = vcf_source_file
)
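
To make the non-PAR rules in the script easier to check by hand, here is a minimal pure-Python sketch of the same decision applied to a single call. It does not use Hail; `filter_non_par_call` and the tuple genotype representation are assumptions made for illustration. PAR calls are outside its scope, since the script passes them through unchanged.

```python
from typing import Optional, Tuple

Genotype = Tuple[int, ...]  # allele indices, e.g. (1, 1) for GT=1/1


def filter_non_par_call(contig: str, sample_sex: str, gt: Genotype) -> Optional[Genotype]:
    """Apply the non-PAR rules to one chrX/chrY call.

    Returns the (possibly transformed) genotype, or None if the call is dropped.
    """
    if sample_sex == "XY":
        # Keep only diploid homozygous-variant calls, then make them haploid.
        is_hom_var = len(gt) == 2 and gt[0] == gt[1] and gt[0] > 0
        return (gt[0],) if is_hom_var else None
    if sample_sex == "XX":
        # Keep chrX calls unchanged; drop chrY calls.
        return gt if contig in ("chrX", "X") else None
    raise ValueError(f"sample_sex must be 'XY' or 'XX', got {sample_sex!r}")


print(filter_non_par_call("chrX", "XY", (1, 1)))  # hom-var becomes hemizygous
print(filter_non_par_call("chrX", "XY", (0, 1)))  # het call is dropped
print(filter_non_par_call("chrY", "XX", (1, 1)))  # chrY call is dropped
```

One consequence worth noting: because the male branch requires a diploid genotype, a call that is already haploid in the input would also be dropped under this rule.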