# Genotype calling and likelihood estimation

## Background

Now it is time to genotype! We use ANGSD for both genotyping and genotype likelihood estimation. A lot of work relies exclusively on genotype likelihoods for analyses - a great tutorial on that approach is here.

We generate both raw genotypes and likelihoods simultaneously. The reasoning is that we've found very high similarity between allele frequencies estimated from imputed genotypes and from genotype likelihoods, and working with called genotypes is often just logistically easier because they work in catch-all analysis programs like plink and vcftools (which is now deprecated but still great). Here is an example from Atlantic Salmon (figure: comparison of genotype- and likelihood-based allele frequencies).

We will calculate genotype likelihoods, which are uncertainty-weighted estimates of an individual's genotype at a locus. We carry this out on each chromosome separately, because it is somewhat memory and time intensive. Notice that we are now exporting a couple of different variables to SLURM - we specify the chromosome rather than the set, plus some run-specific parameters that we keep in another params file. The parameters to pass to ANGSD look like this:

## Parameters

```
bamfile=<bamlist>.tsv
runname=<your project name>
minInd=<number of individuals required to observe a read - I aim for 80% but this is up to you>
minDepth=<number of reads required to call a site - I aim for the equivalent of 1.5-2x coverage based on the number of individuals>
```
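Inside the job script, these values can be pulled in before the ANGSD call. A minimal sketch, assuming the param files really are plain key=value lines as shown above (if yours are tab-delimited key/value pairs, swap the `source` calls for a small parsing loop); `$paramfile`, `$angsdparam`, and `$chrom` are the variables exported at submission time in the job submission step below:

```
# load project-level and run-specific parameters exported via sbatch --export
source "$paramfile"
source "$angsdparam"
echo "Run $runname: genotyping chromosome $chrom using bamlist $bamfile"
```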

Chromosome names can come from anywhere, but they must match the genome the reads were aligned to. An easy option is to take them from the first column of the genome's fasta.fai file generated by samtools faidx. I will usually just do

```
cut -f1 <your reference genome.fasta.fai> > chroms
```
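If you want to double-check that these names really match your alignments, you can compare them against the @SQ lines of a BAM header. This is just a quick sanity check of my own - it assumes samtools is on your path, and `<one of your bams>.bam` stands in for any BAM from your bamlist:

```
# chromosome names in the reference index
cut -f1 <your reference genome.fasta.fai> | sort > ref_chroms
# chromosome names the reads were aligned to (SN: tags of the @SQ header lines)
samtools view -H <one of your bams>.bam | awk -F'\t' '/^@SQ/ {for (i=1;i<=NF;i++) if ($i ~ /^SN:/) print substr($i,4)}' | sort > bam_chroms
# any output here means the names do not match
diff ref_chroms bam_chroms
```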

## Job submission

```
while read chrom; do
  sbatch --export=ALL,chrom=$chrom,paramfile=WGSparams_aeip.tsv,angsdparam=refs_angsdparam.tsv 09_angsd_bcf_beag_maf.sh
done < Ssal_v3.1_genomic.chroms
```

We are doing this first for a set of individuals sequenced at high coverage. The ANGSD command looks like this (the inline comments are for explanation only - a comment after a line-continuation backslash will break the command, so strip them before actually running it):

```
cd $projdir/angsd_in

$angsd \
  -nThreads 8 \ #multithread - later versions auto-cap this at 8
  -bam $bamfile \ #list of individuals
  -out $projdir/angsd_out/$species.$projname.$runname.$chrom. \
  -dobcf 1 \ #make bcf file
  -gl 1 \ #samtools genotype likelihood model
  -dopost 1 \ #estimate the posterior genotype probability using the allele frequency as a prior
  -dogeno 5 \ #print genotypes as major and minor (1) and print base calls (4) - flag is additive
  -doGlf 2 \ #output likelihoods as beagle files
  -domajorminor 1 \ #infer major and minor alleles from GLs
  -domaf 1 \ #calculate allele frequency from the "known" major and minor inferred from likelihoods
  -docounts 1 \ #count total depth per locus/per allele
  -dumpCounts 2 \ #output individual allele depths
  -doQsDist 1 \ #base quality score distribution
  -minMapQ 30 \ #only use reads with mapping quality 30 or greater
  -minQ 30 \ #only use bases with basecall quality 30 or greater
  -minInd $minInd \ #minimum number of individuals required to keep a site
  -SNP_pval 2e-6 \ #only output probable SNPs
  -uniqueOnly 1 \ #drop multi-mapping reads
  -minMaf 0.05 \ #maf cutoff 0.05
  -setMinDepth $minDepth \ #minimum total depth per locus
  -r $chrom \ #restrict analysis to this chromosome
  -remove_bads 1 \ #remove reads with failed mapping/pairing flags
  -only_proper_pairs 1 #only keep reads in proper pairs
```
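For orientation, each per-chromosome run should leave a set of files under the -out prefix in angsd_out. Roughly, based on my reading of the flags above (exact suffixes can vary a little between ANGSD versions):

```
ls $projdir/angsd_out/
# <prefix>.arg                         log of the arguments used
# <prefix>.bcf                         called genotypes     (-dobcf 1)
# <prefix>.beagle.gz                   genotype likelihoods (-doGlf 2)
# <prefix>.mafs.gz                     allele frequencies   (-domaf 1)
# <prefix>.pos.gz, <prefix>.counts.gz  per-site depths      (-dumpCounts 2)
# <prefix>.qs                          base quality score distribution (-doQsDist 1)
```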

Most of this command won't change for the low-coverage samples, but we can use the sites genotyped at high coverage for imputation or to keep allele calls consistent across samples. To get these sites after the first genotyping run, we do the following:

```
zcat *mafs.gz | sort | uniq | cut -f1,2,3,4 | sed '$d' > All_sites.tsv
```

This reads all the allele frequency estimates from each chromosome, sorts them and collapses duplicate lines (including the repeated headers), keeps the chromosome, position, major, and minor allele columns, and then drops the remaining header line, which gets sorted to the bottom of the file. Then we index these sites using ANGSD:

```
conda activate align
angsd sites index All_sites.tsv
conda deactivate
```
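If the indexing worked, you should see companion index files next to the sites list (to the best of my knowledge ANGSD writes a .bin and a .idx file):

```
ls All_sites.tsv*
# All_sites.tsv  All_sites.tsv.bin  All_sites.tsv.idx
```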

We can add this information to the parameter file for our lcWGS samples:

```
sites=All_sites.tsv
```

And then also add the -sites $sites option to our ANGSD call for these samples. Because we are specifying another option for ANGSD, we run a new script that includes the sites option:

```
while read chrom; do
  sbatch --export=ALL,chrom=$chrom,paramfile=WGSparams_aeip.tsv,angsdparam=lcwgs_angsdparam.tsv 10_refsites_angsd_bcf_beag_maf.sh
done < Ssal_v3.1_genomic.chroms
```
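Once these jobs finish, a quick way to confirm the restriction took effect is to check that every position in a lcWGS output is present in the high-coverage site list. This is an optional sanity check of my own; `<lcwgs prefix>` stands in for the -out prefix of one of the lcWGS runs:

```
# positions called in the lcWGS run (drop the mafs header first)
zcat <lcwgs prefix>.mafs.gz | tail -n +2 | cut -f1,2 | sort > called_sites
# positions in the high-coverage site list
cut -f1,2 All_sites.tsv | sort > ref_sites
# should print nothing - any output is a site called outside the list
comm -23 called_sites ref_sites
```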

Now we have a bunch of kinda useless bcf files. They're smaller than vcfs, but they aren't human readable, so we need to convert them to vcfs. We do this per bcf file, so we again specify the chromosome and file set.

```
while read chrom; do
  sbatch --export=ALL,chrom=$chrom,paramfile=WGSparams_<project name>,angsdparam=<project_name>_angsdparam.tsv 11_bcf_to_vcf.sh
done < chroms
```
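The conversion itself can be a one-liner with bcftools. This is just a minimal sketch of what such a conversion looks like, not necessarily the contents of 11_bcf_to_vcf.sh; file names are illustrative, and indexing the compressed VCF is optional but usually worth doing:

```
# convert one per-chromosome BCF to a bgzipped VCF and index it
# <prefix> stands in for the -out prefix used in the ANGSD runs above
bcftools view -O z -o <prefix>.vcf.gz <prefix>.bcf
bcftools index -t <prefix>.vcf.gz   # writes a .tbi index
```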

And we're done! ANGSD has handled much of the filtering and format conversion here for us, so we can now do different population genomic analyses.