Genome segmentation, indispensable to parallelize the compute-intensive GenomicsDB-driven joint variant calling on 1,000+ samples, has been attempted using the GATK4 bundle interval file gatk-bundle/wgs_calling_regions.hg38.interval_list... But this file is primarily designed to mask (more accurately, white-list) the regions of the genome on which it is reasonable to compute joint variant calling, and is far from being suited for breaking down the genome into segments over which joint variant calling can efficiently be parallelized.
See the attached quick'n dirty plot: the 1st half of the 356 shorter intervals covers a mere 0.5% of the genome, while the 10 longest segments cover 1/3 of the genome.
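The imbalance described above can be checked directly from the interval file. The following is a minimal sketch, not the script behind the plot: `interval_length_summary` is a hypothetical helper name, and it assumes a Picard-style interval_list (header lines starting with `@`, then tab-separated contig, 1-based inclusive start and end).

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise how unevenly an interval_list covers the genome.
# usage: interval_length_summary <interval_list> [n_longest]
interval_length_summary() {
  local interval_list="$1" n_longest="${2:-10}"
  # drop the @HD/@SQ header, compute 1-based inclusive lengths, sort ascending
  grep -v '^@' "$interval_list" \
    | awk 'BEGIN{FS="\t"} {print $3 - $2 + 1}' \
    | sort -n \
    | awk -v k="$n_longest" '
        { len[NR] = $1; total += $1 }
        END {
          if (NR == 0) exit 1
          half = int(NR / 2)
          # bases covered by the shorter half of the intervals
          for (i = 1; i <= half; i++) short_sum += len[i]
          # bases covered by the k longest intervals
          for (i = NR; i > NR - k && i >= 1; i--) long_sum += len[i]
          printf "intervals\t%d\n", NR
          printf "shortest_half_fraction\t%.4f\n", short_sum / total
          printf "longest_%d_fraction\t%.4f\n", k, long_sum / total
        }'
}
```

It could be called as `interval_length_summary wgs_calling_regions.hg38.interval_list 10`. The fractions are relative to the bases covered by the interval list, not to the full genome length.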
According to the following GATK forum post, joint-variant calling segmentation can actually be quite arbitrary.
The approach described in this post is to segment the genome using a 20k-interval file, hg38.even.handcurated.20k.intervals, which can then be combined via “DynamicallyCombineIntervals” to bring the number of segments in line with the number of samples.
Alternatively, we could leverage the SG10K_Health_mini freebayes-based VCFs (2,000 samples, very roughly balanced between Indian, Chinese and Malay ethnic backgrounds and between male and female) to derive genome segments containing equivalent numbers of variant sites.
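To make the idea concrete, here is a toy sketch of equal-variant-count segmentation. It is a simplification, not the script used in this report: `equal_variant_segments` is a hypothetical name, the input is assumed to be a tab-separated list of (contig, 1-based position) sorted by contig then position, and the contig lengths are assumed to come from a two-column (contig, length) file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tile each contig with segments holding at most
# <variants_per_segment> variant sites (1-based inclusive coordinates).
# usage: equal_variant_segments <sites.tsv> <contig_lengths.tsv> <variants_per_segment>
equal_variant_segments() {
  local sites="$1" lengths="$2" per_seg="$3"
  awk -v n="$per_seg" 'BEGIN{FS=OFS="\t"}
    NR == FNR { len[$1] = $2; next }   # first file: contig lengths
    $1 != chr {                        # new contig: close the previous one at its end
      if (chr != "") print chr, start, len[chr]
      chr = $1; start = 1; count = 0
    }
    count == n {                       # segment is full: cut just before this site
      print chr, start, $2 - 1
      start = $2; count = 0
    }
    { count++ }
    END { if (chr != "") print chr, start, len[chr] }
  ' "$lengths" "$sites"
}
```

Each contig is tiled from position 1 to its full length, and the counter is reset at every contig boundary so that no segment spans two contigs.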
I wrote a quick'n dirty script to gather the SG10K_Health_mini variant density over the genome and guide the design of such a genome segmentation:
#!bash
SCRATCHDIR="/home/users/astar/gis/bertinn/scratch/"$WORKDIR
mkdir -p $SCRATCHDIR
ln -s $SCRATCHDIR/SG10K_Health_mini.multiinter.bed.gz
module load bedtools
nohup sh -c "
bedtools multiinter -header -i freebayes_vcfs-hg38/*.fb.vcf.gz \
| sed -e 's/chrom/#chrom/' \
| bedtools intersect -sorted -header -wa -a stdin -b wgs_calling_regions.hg38.interval_list.bed \
| perl -lane 'if(\$F[0] eq qq{#chrom}){\$F[4] = qq{all.fb.vcf}; print join(qq{\t}, @F)} else { @S[0..3]=@F[0..3]; \$S[4]=++\$j; for \$i (5..scalar @F-1){ \$S[\$i]+=\$F[\$i] } ; print join(qq{\t}, @S)}' \
| gzip \
> $SCRATCHDIR/SG10K_Health_mini.multiinter.bed.gz " &
zcat SG10K_Health_mini.multiinter.bed.gz | tail -n 1 > SG10K_Health_mini.multiinter.lastline.bed
# chrY 56887852 56887853 2 98170636 5564852 5578317 5461235 5580398 5712465...
# let's stream SG10K_Health_mini.multiinter.bed.gz and trace the location at every 10,000 variants
# generating ~10,000 segments containing the same number of variants
# keep in mind: reset counter when new chr
# need for penultimate intersect-bed with wgs_calling_regions.hg38.interval_list.bed
zcat SG10K_Health_mini.multiinter.bed.gz \
| perl -lane 'BEGIN{
$nseg=10000; $tvar=98170636; $lnfile="freebayes_vcfs-hg38.contig_lengths";
open(LN, $lnfile);
%ln = map {chomp; split /\t/} (<LN>);
close(LN);
$nsplit = int($tvar/$nseg); $chr=""; $start=1; $nvar=1;
}
next if ($F[0] =~ /^#/);
$chr = $F[0] unless ($chr);
# handle a chromosome change first, so that a segment never spans two contigs
if ($F[0] ne $chr) { print join("\t", $chr, $start, $ln{$chr}); $nvar=$F[4]; $start=1; $chr=$F[0]; }
if ($F[4]-$nvar > $nsplit){ print join("\t", $chr, $start, $F[1]-1); $nvar=$F[4]; $start=$F[1]; }
END{
print join("\t", $chr, $start, $ln{$chr});
}
' \
> SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed
#
# intersect with wgs_calling_regions.hg38.interval_list
# prepend the wgs_calling_regions.hg38.interval_list `interval list` header and append the strand and the feature name `SG10K_Health_mini-freebayes intersection ACGTmer`
#
grep '^@[HS][QD]' wgs_calling_regions.hg38.interval_list > wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_10k.interval_list
grep -v '^@' wgs_calling_regions.hg38.interval_list \
| bedtools intersect -b stdin -a SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed \
| awk 'BEGIN{FS=OFS="\t"}{print $0,"+", "SG10K_Health_mini-freebayes intersection ACGTmer"}' \
>> wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_10k.interval_list
#
# derive a ...2k_segments from `SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed`
#
cat SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed \
| perl -ane '
BEGIN{
$lnfile="freebayes_vcfs-hg38.contig_lengths";
open(LN, $lnfile);
%ln = map {chomp; split /\t/} (<LN>);
close(LN);
}
if ($chr ne $F[0]){ print "\t", $ln{$chr}, "\n" if(($chr)and($ct)) ;
$chr=$F[0]; $ct=0; }
if ($ct==4) { print "\t", $F[2], "\n"; $ct=-1;}
if ($ct==0) { print $F[0], "\t", $F[1]; }
$ct++;
END{
# close the last segment only if one is still open
print "\t", $F[2], "\n" if ($ct);
}' \
> SG10K_Health_mini.freebayes_vcfs-hg38.2k_segments.bed
grep '^@[HS][QD]' wgs_calling_regions.hg38.interval_list > wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_2k.interval_list
grep -v '^@' wgs_calling_regions.hg38.interval_list \
| bedtools intersect -b stdin -a SG10K_Health_mini.freebayes_vcfs-hg38.2k_segments.bed \
| awk 'BEGIN{FS=OFS="\t"}{print $0,"+", "SG10K_Health_mini-freebayes intersection ACGTmer"}' \
>> wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_2k.interval_list
Attached are the two resulting interval files (~10k intervals and ~2k intervals), which could be used to scatter joint variant calling.
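As one possible way to use such an interval list in a scatter step, the sketch below distributes its intervals round-robin over a fixed number of shard files, each carrying a copy of the original header. This is only an illustration: `split_interval_list` and the `shard_NNNN.interval_list` naming are assumptions, not part of GATK or of the files attached here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split an interval_list into <n_shards> smaller
# interval_list files (round-robin), each with the original @ header.
# usage: split_interval_list <interval_list> <outdir> <n_shards>
split_interval_list() {
  local interval_list="$1" outdir="$2" n_shards="$3"
  mkdir -p "$outdir"
  # note: awk keeps one output file open per shard; with thousands of
  # shards some awk implementations may hit the open-file limit
  awk -v n="$n_shards" -v dir="$outdir" '
    /^@/ { header = header $0 "\n"; next }   # collect header lines
    {
      shard = sprintf("%s/shard_%04d.interval_list", dir, (count++ % n) + 1)
      if (!(shard in seen)) { printf "%s", header > shard; seen[shard] = 1 }
      print > shard
    }
  ' "$interval_list"
}
```

Each shard could then be passed to one joint-calling job via its intervals argument. Round-robin assignment is just one choice; contiguous blocks would keep neighbouring intervals in the same job.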
Original report by Nicolas Bertin (GIS) (Bitbucket: nicolas-bertin, GitHub: nicolas-bertin).