Variant Calling


Standard SNP callers fail to detect repetitive portions of fungal genomes, which results in the reporting of false variants between non-allelic sequences. Repeat-masking programs also miss large numbers of repeated sequences. To address these problems, we built a BLAST-based SNP caller that pre-masks all repeats in the reference and query genomes before alignment, and then performs a second round of masking after alignment to screen out "cryptic" repeats that are not detected using genome self-comparison. Our program, iSNPcaller, works in an incremental fashion: newly added genomes are first compared with one another, and then with those that have been previously analyzed. iSNPcaller performs the following operations:

a) Rewrites sequence header lines using a standard format (>genomeID_contig1, >genomeID_contig2, etc.)
b) Creates a repeat-masked version of every genome assembly
c) BLASTs each genome against all others in pairwise fashion
d) Determines the number of uniquely aligned nucleotide positions for each genome x genome comparison
e) Performs SNP calling and reports: i) total number of SNPs; ii) total number of uniquely aligned nucleotide positions; and iii) SNPs/Mb of uniquely aligned sequence
f) Moves analyzed sequences/results into a "PROCESSED" directory, allowing new genomes to be analyzed in an incremental fashion
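
The repeat-masking idea in step b) can be illustrated with a minimal Python sketch. It is not the iSNPcaller code. It assumes the criterion given in the B71 section of this document (a position is repetitive if it falls inside more than one alignment when the genome is BLASTed against itself), that the self-hit of each contig covers every position once, and that masked bases are written as "N". The function name and the (start, end) input format are hypothetical.

```python
def mask_repeats(seq, alignments):
    """Mask positions of `seq` covered by more than one self-alignment.

    `alignments` holds (start, end) pairs in BLAST-style 1-based,
    inclusive coordinates. The masking character "N" is an assumption.
    """
    coverage = [0] * len(seq)
    for start, end in alignments:
        lo, hi = sorted((start, end))  # tolerate reversed (minus-strand) coordinates
        for i in range(lo - 1, hi):
            coverage[i] += 1
    # Depth 1 is the contig aligning to itself; depth > 1 implies a repeat.
    return "".join("N" if depth > 1 else base
                   for base, depth in zip(seq, coverage))


# The full-length self-hit plus one extra hit over positions 3-5.
print(mask_repeats("ACGTACGTAC", [(1, 10), (3, 5)]))
```

Counting coverage per position keeps the sketch short; for whole genomes an interval-based approach would use far less memory.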

SNP calling based on pairwise alignments between all genomes

  1. iSNPcaller (multi-threaded version) was used to create a project directory, with subfolders for holding intermediate analyses and final outputs:
perl WheatBlast
  1. Genomes were copied into the newly-created GENOMES directory:
cp RAW_GENOMES/*fasta WheatBlast/GENOMES/
  1. iSNPcaller was then run in multi-threaded mode on a High Performance Computing Cluster using the SLURM script:
sbatch $scripts/ WheatBlast

SNP calling against the B71 reference genome:

  1. Each genome assembly was run through a custom script that masks all nucleotide positions that occur in multiple alignments when the genome is BLASTed against itself:
  1. The B71 reference genome was then BLASTed against each of the masked genome assemblies:
mkdir B71v5_BLAST
for f in *masked.fasta; do blastn -query B71v5_nh_masked.fasta -subject "$f" -evalue 1e-20 -max_target_seqs 2000 -outfmt '6 qseqid sseqid qstart qend sstart send btop' > "../B71v5_BLAST/B71v5.$f.BLAST"; done
  1. SNPs were then called using the SNPcalling module of iSNPcaller:
cd ..
perl B71v5_BLAST B71v5_SNPs

SNP calling against the B71 reference genome using GATK:

SNPs were called with a standard Bowtie2/GATK pipeline using the SLURM script. The variant call format (VCF) file was then filtered to remove: i) sites that occurred in repeat regions of the reference genome (to ensure that all calls were between allelic loci); ii) heterozygous calls (alt:ref ratio < 20; to avoid calling variants between non-allelic loci, due to repeat regions in the query genome); and iii) variant calls with low coverage (DP < 10; usually false calls caused by poor sequence quality in homopolymer tracts).

  1. The B71 reference genome was indexed using bowtie2-build:
bowtie2-build B71.fasta B71_index/B71
  1. Sequence reads were aligned using bowtie2 and genotyping was performed using GATK version using the SLURM script.
for f in `ls FASTQ_DIRECTORY/*_1.fastq.gz | awk -F '/|_' '{print $3}'`; do sbatch B71.fasta FASTQ_DIRECTORY $f; done

Filtering to remove false SNP calls

  1. The "snps-only" VCF files were copied into a new directory and illegal SNP calls were then filtered out using the script:
for f in `ls VCF_FILES/*vcf`; do B71_ALIGN_STRINGs/B71.B71_alignments $f 20 10; done   # alt:ref ratio >= 20; read coverage >= 10
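
The two numeric thresholds passed above (20 and 10) correspond to the heterozygosity and coverage filters described earlier. The Python sketch below shows how those two checks could be applied to one VCF record. It is not the filtering script used here: it assumes a single-sample VCF whose FORMAT column carries the standard AD (allelic depths) and DP (read depth) keys, the function name is hypothetical, and the repeat-region filter is left out.

```python
def passes_filters(vcf_line, min_ratio=20.0, min_depth=10):
    """Apply the alt:ref ratio and coverage filters to one VCF data line.

    Assumes a single-sample VCF with AD and DP in the FORMAT column.
    Records lacking usable AD/DP values are rejected.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    try:
        ref_count, alt_count = (int(x) for x in sample["AD"].split(",")[:2])
        depth = int(sample["DP"])
    except (KeyError, ValueError):
        return False
    if depth < min_depth:
        return False
    # alt:ref >= min_ratio, written without division so ref_count == 0 is safe.
    return alt_count > 0 and alt_count >= min_ratio * ref_count


record = "chr1\t100\t.\tA\tG\t500\t.\t.\tGT:AD:DP\t1/1:0,25:25"
print(passes_filters(record))
```

Comparing `alt_count` against `min_ratio * ref_count` avoids a division by zero when no reference reads are present, which is the expected case for a clean homozygous call in a haploid genome.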

Manual filtering based on comparison between iSNPcaller and Bowtie2/GATK variant datasets

  1. The SmartSNPs-filtered data were summarized using the script. This produces a convenient output format allowing manual inspection of the data to identify possible problems (especially calls in repeated regions that escape repeat detection):
perl CHR1CHR2CHR5_FINAL > Chr1Chr2Chr5_sites.txt

(Note: inspection of the resulting output file revealed no obvious problems with the dataset.)

  1. Next, we used the script to compare variant calls made using the "genome assembly x reference genome" strategy versus the "reads x reference genome" approach.
perl samples.txt Chr1Chr2Chr5_sites.txt B71v5_SNPs > Chr1Chr2Chr5_GATKviSNPs.txt
  1. Any differences (in variant positions and/or in which samples possessed a given variant) were investigated to identify the reason for the discrepancy. Confirmed "problem" sites were recorded in a "disallowed-sites" file (for false calls) or in an "add-back" file (for legitimate calls filtered out by the SmartSNPs script). The SNP call dataset was then updated using the script to generate the final dataset Chr1Chr2Chr5_final.txt:
perl CHR1CHR2CHR5_VCFs > Chr1Chr2Chr5_final.txt
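
The comparison and update steps above amount to set operations on variant calls, which the Python sketch below illustrates. It does not reproduce the scripts used here. It assumes each call is a (chromosome, position, sample) tuple, that disallowed sites are keyed by (chromosome, position), and that add-back entries are full calls. All names and the example data are hypothetical.

```python
def compare_calls(assembly_calls, read_calls):
    """Report calls found by only one of the two strategies.

    Each argument is a set of (chrom, pos, sample) tuples.
    """
    return {
        "assembly_only": assembly_calls - read_calls,
        "reads_only": read_calls - assembly_calls,
    }


def finalize(calls, disallowed_sites, add_back):
    """Drop calls at disallowed (chrom, pos) sites, then restore add-backs."""
    kept = {c for c in calls if (c[0], c[1]) not in disallowed_sites}
    return kept | add_back


# Made-up example data, for illustration only.
assembly = {("Chr1", 100, "S1"), ("Chr1", 250, "S2")}
reads = {("Chr1", 100, "S1"), ("Chr2", 40, "S1")}

print(compare_calls(assembly, reads))
print(finalize(reads, {("Chr2", 40)}, {("Chr1", 250, "S2")}))
```

Each entry in the two "only" sets is a discrepancy to inspect by hand before it is assigned to the disallowed-sites or add-back list.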