A. Genotype Likelihood estimation - brief background

With the read-mapping output for every sample (*.bam files) in hand, we can proceed to estimate variants or genotype likelihoods (GLs). In population genetics, there are a number of differences between working with variants and working with GLs, summarised in the table below:

| GLs | Variants |
| --- | --- |
| GLs represent the probability of observing different genotypes at a specific genomic position given the sequencing data and a model of sequencing error | Variants refer to specific differences or mutations in DNA sequences compared to a reference genome |
| Instead of working with a specific genotype assigned to each individual, GLs provide a probabilistic framework that incorporates uncertainty about the true genotype | Variants are often single nucleotide polymorphisms (SNPs), insertions, deletions, or other types of genomic variation |
| GLs are often estimated using statistical models that take into account factors such as base quality scores, read mapping quality, and sequencing depth | Variants are typically characterized by their genomic position, the reference allele (the allele found in the reference genome), and the alternative allele (the observed variant allele) |
| GLs are especially useful in situations where sequencing data may be noisy or ambiguous, such as in low-coverage or ancient DNA sequencing experiments | Variants are commonly represented in variant call format (VCF) files, which provide information about variant type, genomic coordinates, allele frequencies, and other annotations |

In general, given the nature of most datasets (i.e., low sequencing coverage for most individuals amongst populations), it is best to work with probabilistic frameworks such as GLs which are inferred for a collection of individuals (as opposed to inferences on variants/GLs conducted on single individuals). As such, in this workshop, we will work with GLs from which we will then derive analyses that can reveal relatedness and ancestry of individuals sampled from different species.
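As a rough illustration of what a GL is, the following is a simplified sketch of a GATK-style model. It assumes that reads are independent and that the per-base error probability comes only from the base quality score; the symbols $D$, $G$, $b_i$ and $\epsilon_i$ are introduced here for illustration and are not taken from the workshop material. For a diploid genotype $G = \{A_1, A_2\}$ and $n$ reads covering the site, with observed bases $b_1, \dots, b_n$:

$$
P(D \mid G) = \prod_{i=1}^{n} \left[ \tfrac{1}{2}\, p(b_i \mid A_1) + \tfrac{1}{2}\, p(b_i \mid A_2) \right],
\qquad
p(b_i \mid A) =
\begin{cases}
1 - \epsilon_i & \text{if } b_i = A \\
\epsilon_i / 3 & \text{otherwise}
\end{cases}
$$

where $\epsilon_i = 10^{-Q_i/10}$ and $Q_i$ is the Phred base quality of read $i$. For example, with two reads that both show an A at $Q = 20$ ($\epsilon = 0.01$):

$$
P(D \mid AA) = 0.99^2 \approx 0.980, \qquad
P(D \mid AG) = \left(0.495 + \tfrac{0.01}{6}\right)^2 \approx 0.247, \qquad
P(D \mid GG) = \left(\tfrac{0.01}{3}\right)^2 \approx 1.1 \times 10^{-5}
$$

With only two reads the heterozygote AG remains fairly plausible, which is the uncertainty a hard genotype call would discard.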

Note

GLs (or variants) can be calculated from any set of individuals and at any sequencing depth. The same can be said of almost any population genomics metric (e.g., heterozygosity (Hz), fixation index (Fst), etc.). How biased your estimates are depends on how well sampled your populations are. Here, good sampling is defined by how well the genetic diversity of your populations is represented, rather than by how deeply each individual is sequenced. In other words, sequencing many individuals across the spectrum of a population at low depth is better than sequencing few individuals at high depth (see Figure 1 below).

Figure 1: Bias in the estimation of segregating sites and expected Hz under different sampling levels. Notice how much less biased the estimates are when many individuals are sampled at low sequencing depth (1x) (taken from Fumagalli, 2013).

B. Genotype Likelihood estimation - analysis

We will use angsd to infer GLs from the BAM files produced by paleomix. angsd (short for Analysis of Next Generation Sequencing Data) is a versatile program widely used for analyzing genomic data generated by high-throughput sequencing (HTS) technologies. It can estimate GLs and compute different metrics, including Fst, the Site Frequency Spectrum (SFS), and Tajima's D, amongst others.

To compute GLs, angsd requires as input a plain text file listing the paths of the *.bam files to include in the analysis, one per line. Here is an example of a command we could use to compute GLs from our BAM files.

angsd -GL 2 -out genolike -nThreads 20 -doGlf 2 -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam bam.filelist

where:

-GL 2: This parameter specifies the genotype likelihood model to use. In this case, 2 indicates that ANGSD should use the GATK model for genotype likelihood estimation. There are four models to choose from (1: SAMtools, 2: GATK, 3: SOAPsnp, 4: SYK), with -GL 2 being a common choice for low-coverage sequence data.

-out genolike: This parameter specifies the prefix for the output files generated by ANGSD (here, our output files will be then named genolike.arg, genolike.mafs.gz, etc).

-nThreads 20: This parameter specifies the number of threads (or CPU cores) to use for parallel processing. Adjust it to the cores available on your machine, and increase it when working with large genomes and many individuals.

-doGlf 2: This parameter specifies the type of genotype likelihood output file. The value used here indicates that ANGSD should output genotype likelihood files in beagle format (*.beagle.gz).

-doMajorMinor 1: This parameter specifies whether to infer major and minor alleles at each site. The value 1 indicates that ANGSD should infer major and minor alleles based on genotype likelihoods.

-SNP_pval 1e-6: This parameter specifies the p-value threshold for calling SNPs. Each site is tested for being polymorphic, and only sites with a p-value below 1e-6 are kept. This threshold helps filter out potential false positives.

-doMaf 1: This parameter specifies how allele frequencies are estimated. The value 1 indicates that ANGSD should estimate the minor allele frequency assuming that the major and minor alleles are known (they can either be provided by the user or, as here with -doMajorMinor 1, inferred from the GLs).

-bam bam.filelist: This parameter specifies the input BAM file(s) containing the aligned sequencing reads. The file bam.filelist should contain a list of paths to the BAM files to be analyzed by ANGSD.

This is what the bam.filelist should look like:

SRR106852_NC013991cp.NC013991cp.realigned.bam
SRR121596_NC013991cp.NC013991cp.realigned.bam
SRR121604_NC013991cp.NC013991cp.realigned.bam
SRR121607_NC013991cp.NC013991cp.realigned.bam
SRR121612_NC013991cp.NC013991cp.realigned.bam
SRR5120109_NC013991cp.NC013991cp.realigned.bam
SRR5120110_NC013991cp.NC013991cp.realigned.bam
SRR5120111_NC013991cp.NC013991cp.realigned.bam
SRR5120112_NC013991cp.NC013991cp.realigned.bam
SRR5120113_NC013991cp.NC013991cp.realigned.bam
SRR5120114_NC013991cp.NC013991cp.realigned.bam
SRR5120115_NC013991cp.NC013991cp.realigned.bam
SRR6439415_NC013991cp.NC013991cp.realigned.bam
...
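The file list can be built and sanity-checked before angsd is run. The sketch below is one possible way to do it; `make_bam_list` and `check_bam_list` are hypothetical helper names introduced here, not part of angsd, and the sketch assumes all BAM files sit directly inside one directory.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of angsd) to build and check a BAM file list.

# make_bam_list DIR OUT: write the paths of all *.bam files found directly
# inside DIR to OUT, one per line, sorted so that sample order is reproducible.
make_bam_list() {
    local bam_dir="$1" out="$2"
    find "$bam_dir" -maxdepth 1 -name '*.bam' | sort > "$out"
}

# check_bam_list LIST: report every listed path that is missing or empty.
# Returns 0 only if all listed files exist and are non-empty.
check_bam_list() {
    local list="$1" missing=0 bam
    while IFS= read -r bam; do
        [ -z "$bam" ] && continue          # skip blank lines
        if [ ! -s "$bam" ]; then
            echo "missing or empty: $bam" >&2
            missing=$((missing + 1))
        fi
    done < "$list"
    [ "$missing" -eq 0 ]
}
```

A typical call would be `make_bam_list . bam.filelist && check_bam_list bam.filelist`. Sorting matters because the order of the paths in the list determines the order of individuals in the output.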

The command provided above will produce three output files as follows:

 genolike.mafs.gz # this file contains the major and minor alleles per position and their frequencies
 genolike.beagle.gz # this file contains the actual GLs
 genolike.arg # this file contains the arguments used to estimate GLs

Example output files from the estimation of GLs for a similar set of samples are provided in the directory /home/ontasia*/Documents/ONT-workshop-March-2024//CP_SRA_26796/PCA_32s_v2/. The *.beagle.gz file is required as input to conduct PCA and Admixture analyses.
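The gzipped outputs can be inspected directly from the shell. The following is a minimal sketch; `count_sites` and `count_individuals` are hypothetical helper names introduced here. It assumes that each file starts with a single header line and that the beagle file has three leading columns (marker and the two alleles) followed by three likelihood columns per individual.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of angsd) for a quick look at the outputs.

# count_sites FILE.gz: number of data rows (sites), excluding the header line.
count_sites() {
    gzip -dc "$1" | tail -n +2 | wc -l | tr -d ' '
}

# count_individuals FILE.beagle.gz: number of individuals, assuming
# 3 leading columns and 3 genotype likelihood columns per individual.
count_individuals() {
    gzip -dc "$1" | head -n 1 | awk '{ print (NF - 3) / 3 }'
}
```

For example, `count_sites genolike.mafs.gz` and `count_sites genolike.beagle.gz` should normally agree, and `count_individuals genolike.beagle.gz` should match the number of lines in bam.filelist.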

Important

angsd always assumes that the individuals included in a panel for GL estimation are diploid, regardless of the model chosen to compute the GLs. This is a limitation if you are working with polyploid genomes.

One advantage of angsd is the flexibility it offers for filtering input data. We can discard bases below a given base quality (-minQ) and reads below a given mapping quality (-minMapQ). In addition, by counting the bases observed per position across the panel of individuals (flag -doCounts 1), we can exclude sites whose total sequencing depth falls below a customised minimum (-setMinDepth) or above a customised maximum (-setMaxDepth).
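The filters above can be combined with the GL command used earlier. The sketch below only prints the resulting command so that it can be reviewed before running; `build_angsd_cmd` is a hypothetical helper name, and the thresholds (base quality 20, mapping quality 30, and the depth limits passed as arguments) are illustrative assumptions, not recommendations from this workshop. Note that the depth limits refer to the depth summed over all individuals, so they should scale with the size of your panel.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a filtered angsd command without executing it.
# Usage: build_angsd_cmd BAMLIST OUT_PREFIX MIN_DEPTH MAX_DEPTH
build_angsd_cmd() {
    local bamlist="$1" out="$2" min_depth="$3" max_depth="$4"
    # -doCounts 1 is needed for the depth filters; the -minQ and -minMapQ
    # values are illustrative assumptions only.
    echo angsd -GL 2 -doGlf 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 \
        -doCounts 1 -minQ 20 -minMapQ 30 \
        -setMinDepth "$min_depth" -setMaxDepth "$max_depth" \
        -bam "$bamlist" -out "$out"
}

# Print the command for inspection (depth limits here are placeholders).
build_angsd_cmd bam.filelist genolike_filtered 20 400
```

Once the printed command looks right, it can be run by piping it to `bash` or by copying it to the terminal.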

$\color{orange}{\textsf{C. ACTIVITY}}$

  1. Using the command provided above, please compute GLs from the set of BAM files available in the folder /home/ontasia*/Documents/ONT-workshop-March-2024/BAM_CP/. NB: you need to first produce the bam.filelist (hint: use ls *.bam > bam.filelist).

  2. How many sites were input and how many were kept by angsd for GL estimation?

Selected references

  1. Fumagalli M. 2013. Assessing the Effect of Sequencing Depth and Sample Size in Population Genetics Inferences. PLoS ONE 8:e79667