joint-variant calling intervals #26

andreas-wilm · 2018-09-14T01:51:30Z

Original report by Nicolas Bertin (GIS) (Bitbucket: nicolas-bertin, GitHub: nicolas-bertin).

Genome segmentation, indispensable to parallelize the compute-intensive GenomicsDB-driven joint-variant calling on 1,000+ samples, has been attempted using the GATK4 bundle interval file: gatk-bundle/wgs_calling_regions.hg38.interval_list... But this file

is primarily designed to mask (or, more accurately, whitelist) the regions of the genome on which it is reasonable to compute joint-variant calling;
is far from being suited for breaking the genome down into segments over which joint-variant calling can be efficiently parallelized.
See the attached quick'n dirty plot: the first half of the intervals (the 356 shortest) covers a mere 0.5% of the genome, while the 10 longest segments cover 1/3 of the genome.
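
The imbalance described above can be checked directly from the interval list. This is a minimal sketch, assuming the usual five-column interval_list layout with 1-based inclusive coordinates. The function name `interval_coverage_summary` is hypothetical, and the fractions it reports are relative to the bases covered by the list, not to the whole genome.

```bash
# Hypothetical helper (not part of the original report): summarize how unevenly
# an interval_list is split into intervals.
# Usage: interval_coverage_summary wgs_calling_regions.hg38.interval_list
interval_coverage_summary() {
    grep -v '^@' "$1" \
      | LC_ALL=C awk 'BEGIN{FS="\t"} {print $3 - $2 + 1}' \
      | LC_ALL=C sort -n \
      | LC_ALL=C awk '
          { len[NR] = $1; total += $1 }      # lengths arrive sorted, shortest first
          END {
              if (total == 0) { print "no intervals"; exit 1 }
              half = int(NR / 2)
              for (i = 1; i <= half; i++) short_sum += len[i]               # shorter half
              for (i = NR; i > NR - 10 && i >= 1; i--) long_sum += len[i]   # 10 longest
              printf "intervals=%d shorter_half=%.4f top10=%.4f\n", NR, short_sum / total, long_sum / total
          }'
}
```

A `shorter_half` value close to 0 together with a large `top10` value indicates the kind of imbalance shown in the plot.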

According to the following GATK forum post, joint-variant calling segmentation can actually be quite arbitrary.
The approach described in this post is to segment the genome using a 20k hg38.even.handcurated.20k.intervals file, eventually combined via “DynamicallyCombineIntervals” to bring about a number of segments corresponding to the number of samples.

Alternatively, we could leverage the SG10K_Health_mini freebayes-based VCFs (2,000 samples, very roughly balanced across Indian, Chinese and Malay ethnic backgrounds and across male and female) to derive genome segments containing equivalent numbers of variant sites.
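
The idea of equal-variant segments can be illustrated on a toy input. This is a minimal sketch, assuming a sorted two-column stream of `chrom<TAB>pos` rather than real VCFs. The function name `equal_count_segments` is hypothetical. Unlike a real segmentation, the last segment of each contig ends at the last variant seen, because this toy version does not know the contig lengths.

```bash
# Hypothetical helper (illustration only): cut each contig into consecutive
# segments that each contain n variant positions.
# stdin: tab-separated "chrom<TAB>pos", sorted by chrom then pos
# $1:    number of variants per segment
equal_count_segments() {
    LC_ALL=C awk -v n="$1" 'BEGIN{FS=OFS="\t"}
        $1 != chr {                                   # new contig: flush the open segment, reset
            if (count > 0) print chr, start, last
            chr = $1; start = 1; count = 0
        }
        { count++; last = $2 }                        # count this variant
        count == n {                                  # segment is full: emit and start the next one
            print chr, start, last
            start = last + 1; count = 0
        }
        END { if (count > 0) print chr, start, last } # flush the final partial segment
    '
}
```

For example, `printf 'chr1\t5\nchr1\t10\nchr1\t20\n' | equal_count_segments 2` emits one full segment ending at position 10 and one partial segment ending at position 20.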

Wrote a quick'n dirty script to gather the SG10K_Health_mini variant density over the genome and guide the design of such a genome segmentation:

#!bash

SCRATCHDIR="/home/users/astar/gis/bertinn/scratch/"$WORKDIR
mkdir -p $SCRATCHDIR
ln -s $SCRATCHDIR/SG10K_Health_mini.multiinter.bed.gz
module load bedtools

nohup sh -c "
bedtools multiinter -header -i freebayes_vcfs-hg38/*.fb.vcf.gz \
   | sed -e 's/chrom/#chrom/' \
   | bedtools intersect -sorted -header -wa -a stdin -b wgs_calling_regions.hg38.interval_list.bed \
   | perl -lane 'if(\$F[0] eq qq{#chrom}){\$F[4] = qq{all.fb.vcf}; print join(qq{\t}, @F)} else { @S[0..3]=@F[0..3]; \$S[4]=++\$j; for \$i (5..scalar @F-1){ \$S[\$i]+=\$F[\$i] } ; print join(qq{\t}, @S)}' \
   | gzip \
   > $SCRATCHDIR/SG10K_Health_mini.multiinter.bed.gz " &

# run this only once the background job above has finished writing its output
zcat SG10K_Health_mini.multiinter.bed.gz | tail -n 1 > SG10K_Health_mini.multiinter.lastline.bed
# chrY    56887852    56887853    2    98170636    5564852    5578317    5461235    5580398    5712465...

# let's stream SG10K_Health_mini.multiinter.bed.gz and trace the location at every 10,000 variants
#    generating ~10,000 segments containing the same number of variants
#    keep in mind: reset counter when new chr
#                  need for penultimate intersect-bed with wgs_calling_regions.hg38.interval_list.bed

zcat SG10K_Health_mini.multiinter.bed.gz \
 | perl -lane 'BEGIN{
                       $nseg=10000; $tvar=98170636; $lnfile="freebayes_vcfs-hg38.contig_lengths";
                       open(LN, $lnfile);
                       %ln = map {chomp; split /\t/} (<LN>);
                       close(LN);
                       $nsplit = int($tvar/$nseg); $chr=""; $start=1; $nvar=1;
                     }
               next if ($F[0] =~ /^#/);
               $chr = $F[0] unless ($chr);
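               # handle a chromosome change first, so that a segment never mixes coordinates from two chromosomes
               if ($F[0] ne $chr)            { print join("\t", $chr, $start, $ln{$chr}); $nvar=$F[4]; $start=1;     $chr=$F[0]; }
               elsif ($F[4]-$nvar > $nsplit) { print join("\t", $chr, $start, $F[1]-1);   $nvar=$F[4]; $start=$F[1];             }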
               END{
                     print join("\t", $chr, $start, $ln{$chr});
                  }
              ' \
  > SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed

#
# intersect with wgs_calling_regions.hg38.interval_list
# prepend wgs_calling_regions.hg38.interval_list `interval list` header and append feature name as `SG10K_Health_mini-freebayes intersection ACGTmer`
#

grep    '^@[HS][QD]' wgs_calling_regions.hg38.interval_list > wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_10k.interval_list
grep -v '^@'         wgs_calling_regions.hg38.interval_list \
  | bedtools intersect -b stdin -a SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed  \
  | awk 'BEGIN{FS=OFS="\t"}{print $0,"+", "SG10K_Health_mini-freebayes intersection ACGTmer"}' \
  >> wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_10k.interval_list


#
# derive a ...2k_segments from `SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed`
#

cat SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed \
 | perl -ane '
               BEGIN{
                       $lnfile="freebayes_vcfs-hg38.contig_lengths";
                       open(LN, $lnfile);
                       %ln = map {chomp; split /\t/} (<LN>);
                       close(LN);
                    }
              if ($chr ne $F[0]){ print "\t", $ln{$chr}, "\n" if(($chr)and($ct)) ;
                                  $chr=$F[0];        $ct=0; }
              if ($ct==4) { print "\t", $F[2], "\n"; $ct=-1;}
              if ($ct==0) { print $F[0], "\t", $F[1]; }
              $ct++;
              END{
                     # close the last group only if one is still open
                     print "\t", $F[2], "\n" if ($ct);
                 }' \
 > SG10K_Health_mini.freebayes_vcfs-hg38.2k_segments.bed

grep    '^@[HS][QD]' wgs_calling_regions.hg38.interval_list > wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_2k.interval_list
grep -v '^@'         wgs_calling_regions.hg38.interval_list \
  | bedtools intersect -b stdin -a SG10K_Health_mini.freebayes_vcfs-hg38.2k_segments.bed  \
  | awk 'BEGIN{FS=OFS="\t"}{print $0,"+", "SG10K_Health_mini-freebayes intersection ACGTmer"}' \
  >> wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_2k.interval_list

Attached are the two resulting interval files (~10k intervals and ~2k intervals) that could be used:

wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_10k.interval_list
wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_2k.interval_list
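
Before using either file, a quick sanity check may help. This is a minimal sketch, assuming each non-header line should carry five tab-separated columns (contig, start, end, strand, name) with start no greater than end. The function name `check_interval_list` is hypothetical.

```bash
# Hypothetical helper (not part of the original report): count intervals and
# covered bases, and flag malformed lines in an interval_list.
# Usage: check_interval_list wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_10k.interval_list
check_interval_list() {
    grep -v '^@' "$1" | LC_ALL=C awk 'BEGIN{FS="\t"}
        NF != 5 || $2 > $3 { bad++ }                  # wrong column count or inverted coordinates
        { n++; bases += $3 - $2 + 1 }                 # 1-based inclusive interval length
        END {
            printf "intervals=%d bases=%.0f malformed=%d\n", n, bases, bad
            exit (bad > 0)                            # non-zero exit status if anything looks wrong
        }'
}
```

Running it on both attached files should report roughly 10k and 2k intervals respectively, with the same number of covered bases if both derive from the same calling regions.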

andreas-wilm added major enhancement New feature or request labels Jan 9, 2019
