joint-variant calling intervals #26

Open
andreas-wilm opened this issue Sep 14, 2018 · 0 comments
Labels
enhancement New feature or request major

Comments

@andreas-wilm
Contributor

Original report by Nicolas Bertin (GIS) (Bitbucket: nicolas-bertin, GitHub: nicolas-bertin).


Genome segmentation is indispensable for parallelizing the compute-intensive GenomicsDB-driven joint-variant calling on 1,000+ samples. It has been attempted using the GATK4 bundle interval file (intervals: gatk-bundle/wgs_calling_regions.hg38.interval_list...), but this file

  • is primarily designed to mask (or, more accurately, whitelist) the regions of the genome on which it is reasonable to compute joint-variant calling.
  • is far from suited to breaking the genome into segments over which joint-variant calling can be efficiently parallelized
  • see the attached quick'n dirty plot: the 1st half of the 356 shorter intervals covers a mere 0.5% of the genome, while the 10 longest segments cover 1/3 of the genome
    [Figure: interval length distribution]
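
The skew described in the last bullet can be checked directly from an interval list. The following is a minimal sketch, not the script behind the attached plot: the function name `interval_length_summary` is hypothetical, and it assumes a Picard-style interval_list (`@` header lines followed by tab-separated, 1-based inclusive chrom/start/end columns).

```shell
# Hypothetical helper (not part of the original report): summarize how
# unevenly an interval_list covers the genome.
# Assumption: '@' header lines, then tab-separated chrom, start, end, ...
# with 1-based inclusive coordinates, so length = end - start + 1.
interval_length_summary() {
  grep -v '^@' "$1" \
    | awk 'BEGIN{FS="\t"} {print $3 - $2 + 1}' \
    | sort -n \
    | awk '{ len[NR] = $1; total += $1 }
           END {
             if (NR == 0) exit 1
             # bases covered by the shorter half of the intervals
             half = int(NR / 2)
             for (i = 1; i <= half; i++) short_bp += len[i]
             # bases covered by the 10 longest intervals (or all, if fewer)
             first_long = (NR > 10) ? NR - 9 : 1
             for (i = first_long; i <= NR; i++) long_bp += len[i]
             printf "intervals\t%d\n", NR
             printf "total_bp\t%.0f\n", total
             printf "shorter_half_fraction\t%.4f\n", short_bp / total
             printf "longest10_fraction\t%.4f\n", long_bp / total
           }'
}
```

Calling `interval_length_summary wgs_calling_regions.hg38.interval_list` would print the interval count, the total number of bases, and the two coverage fractions as tab-separated key/value lines.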

According to the following GATK forum post, joint-variant calling segmentation can actually be quite arbitrary.
The approach described in this post is to segment the genome using a 20k-interval file (hg38.even.handcurated.20k.intervals), optionally combined via “DynamicallyCombineIntervals” so that the number of segments corresponds to the number of samples.

Alternatively, we could leverage the SG10K_Health_mini freebayes-based VCFs (2,000 samples, very roughly balanced across Indian, Chinese and Malay ethnic backgrounds and between male and female) to derive genome segments containing equivalent numbers of variant sites.

  • Wrote a quick'n dirty script to gather SG10K_Health_mini variant density over the genome to guide the design of such a genome segmentation:
#!bash

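# $WORKDIR is expected to be set in the calling environment
SCRATCHDIR="/home/users/astar/gis/bertinn/scratch/$WORKDIR"
mkdir -p "$SCRATCHDIR"
# symlink the (future) output file into the current directory
ln -s "$SCRATCHDIR/SG10K_Health_mini.multiinter.bed.gz"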
module load bedtools

nohup sh -c "
bedtools multiinter -header -i freebayes_vcfs-hg38/*.fb.vcf.gz \
   | sed -e 's/chrom/#chrom/' \
   | bedtools intersect -sorted -header -wa -a stdin -b wgs_calling_regions.hg38.interval_list.bed \
   | perl -lane 'if(\$F[0] eq qq{#chrom}){\$F[4] = qq{all.fb.vcf}; print join(qq{\t}, @F)} else { @S[0..3]=@F[0..3]; \$S[4]=++\$j; for \$i (5..scalar @F-1){ \$S[\$i]+=\$F[\$i] } ; print join(qq{\t}, @S)}' \
   | gzip \
   > $SCRATCHDIR/SG10K_Health_mini.multiinter.bed.gz " &

# run this only once the background job above has finished writing its output
zcat SG10K_Health_mini.multiinter.bed.gz | tail -n 1 > SG10K_Health_mini.multiinter.lastline.bed
# chrY    56887852    56887853    2    98170636    5564852    5578317    5461235    5580398    5712465...

# let's stream SG10K_Health_mini.multiinter.bed.gz and trace the location at every 10,000 variants
#    generating ~10,000 segments containing the same number of variants
#    keep in mind: reset counter when new chr
#                  need for penultimate intersect-bed with wgs_calling_regions.hg38.interval_list.bed

zcat SG10K_Health_mini.multiinter.bed.gz \
 | perl -lane 'BEGIN{
                       $nseg=10000; $tvar=98170636; $lnfile="freebayes_vcfs-hg38.contig_lengths";
                       open(LN, $lnfile) or die "cannot open $lnfile: $!";
                       %ln = map {chomp; split /\t/} (<LN>);
                       close(LN);
                       $nsplit = int($tvar/$nseg); $chr=""; $start=1; $nvar=1;
                     }
               next if ($F[0] =~ /^#/);
               $chr = $F[0] unless ($chr);
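               # handle a chromosome change first, so that a segment never mixes coordinates from two chromosomes
               if ($F[0] ne $chr)            { print join("\t", $chr, $start, $ln{$chr}); $nvar=$F[4]; $start=1;     $chr=$F[0]; }
               elsif ($F[4]-$nvar > $nsplit) { print join("\t", $chr, $start, $F[1]-1);   $nvar=$F[4]; $start=$F[1];             }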
               END{
                     print join("\t", $chr, $start, $ln{$chr});
                  }
              ' \
  > SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed

#
# intersect with wgs_calling_regions.hg38.interval_list
# prepend the wgs_calling_regions.hg38.interval_list `interval list` header and append the feature name `SG10K_Health_mini-freebayes intersection ACGTmer`
#

grep    '^@[HS][QD]' wgs_calling_regions.hg38.interval_list > wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_10k.interval_list
grep -v '^@'         wgs_calling_regions.hg38.interval_list \
  | bedtools intersect -b stdin -a SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed  \
  | awk 'BEGIN{FS=OFS="\t"}{print $0,"+", "SG10K_Health_mini-freebayes intersection ACGTmer"}' \
  >> wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_10k.interval_list


#
# derive a ...2k_segments from `SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed`
#

cat SG10K_Health_mini.freebayes_vcfs-hg38.10k_segments.bed \
 | perl -ane '
               BEGIN{
                       $lnfile="freebayes_vcfs-hg38.contig_lengths";
                       open(LN, $lnfile) or die "cannot open $lnfile: $!";
                       %ln = map {chomp; split /\t/} (<LN>);
                       close(LN);
                    }
              if ($chr ne $F[0]){ print "\t", $ln{$chr}, "\n" if(($chr)and($ct)) ;
                                  $chr=$F[0];        $ct=0; }
              if ($ct==4) { print "\t", $F[2], "\n"; $ct=-1;}
              if ($ct==0) { print $F[0], "\t", $F[1]; }
              $ct++;
              END{
                     # close the last segment only if one is still open (avoids a stray end-only line)
                     print "\t", $F[2], "\n" if ($ct);
                 }' \
 > SG10K_Health_mini.freebayes_vcfs-hg38.2k_segments.bed

grep    '^@[HS][QD]' wgs_calling_regions.hg38.interval_list > wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_2k.interval_list
grep -v '^@'         wgs_calling_regions.hg38.interval_list \
  | bedtools intersect -b stdin -a SG10K_Health_mini.freebayes_vcfs-hg38.2k_segments.bed  \
  | awk 'BEGIN{FS=OFS="\t"}{print $0,"+", "SG10K_Health_mini-freebayes intersection ACGTmer"}' \
  >> wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_2k.interval_list

Attached are the two resulting interval files (~10k intervals and ~2k intervals) that could be used:

  • wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_10k.interval_list
  • wgs_jointcalling_regions.hg38_SG10K_Health_mini_freebayes_2k.interval_list
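
One way to sanity-check that the derived segments hold comparable numbers of variant sites is to count sites per segment. The sketch below is only an illustration under stated assumptions: the function name `count_sites_per_segment` is hypothetical, both inputs are assumed to be tab-separated BED-like files (chrom, start, end) sorted in the same chromosome and position order, and coordinates are treated as 0-based half-open. Counts taken after intersecting with the calling regions will differ from those on the raw segments.

```shell
# Hypothetical check (not part of the original report): count variant sites
# falling in each segment.
#   $1: segments file (chrom, start, end), sorted by chrom then start
#   $2: sites file    (chrom, start, ...), sorted the same way
# Assumption: half-open coordinates, i.e. a site at position p belongs to
# the segment with start <= p < end.
count_sites_per_segment() {
  awk 'BEGIN { FS = OFS = "\t" }
       # first file: load the segments and remember where each chromosome starts
       NR == FNR {
         n++; chr[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; cnt[n] = 0
         if (!($1 in first)) first[$1] = n
         next
       }
       /^#/ { next }
       {
         # on a new chromosome, jump to its first segment (0 if it has none)
         if ($1 != cur) { cur = $1; i = ($1 in first) ? first[$1] : 0 }
         if (i == 0) next
         # advance past segments that end at or before this site
         while (i <= n && chr[i] == cur && e[i] <= $2 + 0) i++
         if (i <= n && chr[i] == cur && s[i] <= $2 + 0) cnt[i]++
       }
       END { for (k = 1; k <= n; k++) print chr[k], s[k], e[k], cnt[k] }' "$1" "$2"
}
```

The output repeats each segment with its site count in a fourth column, so piping it through `cut -f4 | sort -n` gives a quick view of how even the split is.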
@andreas-wilm andreas-wilm added major enhancement New feature or request labels Jan 9, 2019