Merge pull request #6 from thapasz/main

Updated vocabulary for HMMs (response to call #3).
merenlab · Mar 9, 2022 · dbd7afb
2 parents c6df109 + 532e567
commit dbd7afb
Showing 1 changed file with 10 additions and 0 deletions.
diff --git a/vocabulary/index.md b/vocabulary/index.md
@@ -229,7 +229,17 @@ Commonly used SCGs can be identified across a set of genomes through sequence ho
 
 The number of SCGs will decrease with decreasing resolutions of taxonomy. For instance, the number of SCGs across a set of genomes that belong to the same phylum will typically be much smaller than the number of SCGs across a set of genomes that belong to the same genus within that phylum, and so on. At the domain level there exists a small set of ribosomal proteins that are both core and single-copy across a very large number of genomes that span eukarya, archaea, and bacteria and lead to [comprehensive analyses](https://www.nature.com/articles/nmicrobiol201648) of the tree of life through phylogenomics.
 
+### Hidden Markov Models (HMMs)
+
+A [Markov model](https://web.stanford.edu/~jurafsky/slp3/A.pdf) describes a sequence of states in which the probability of the next state depends only on the [current state](https://web.stanford.edu/~jurafsky/slp3/A.pdf) **(observation)**. Earlier states add no further information once the current state is known. In other words, the state of the system at time point [“t+1”](https://reader.elsevier.com/reader/sd/pii/S000437029800023X?token=79509CC161F6A21DD71D5B2C02D3E7A3C6D2AC8EBB6D10B37EC4E063A4F21931F2C8F3204F4EFF5E89610ED5280FAF64&originRegion=us-east-1&originCreation=20220306193035) depends only on its state at time point “t”. [Hidden Markov Models](https://en.wikipedia.org/wiki/Hidden_Markov_model) (HMMs) are Markov models in which the **[states are hidden, i.e., not directly observable](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2766791/pdf/CG-10-402.pdf)**; only the symbols emitted by those states can be observed.
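
As a minimal sketch of the idea above, the Python snippet below builds a toy two-state HMM. The state names ("AT-rich", "GC-rich") and all probabilities are made up for illustration and do not come from any real profile. The next-state distribution is computed from the current one alone, and sampling yields a hidden state path alongside the nucleotide string that an observer would actually see.

```python
import numpy as np

# Hypothetical toy model: two hidden states emitting nucleotides.
states = ["AT-rich", "GC-rich"]
symbols = ["A", "C", "G", "T"]

# transition[i, j] = P(state at t+1 is j | state at t is i); rows sum to 1.
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])

# emission[i, k] = P(observing symbols[k] | hidden state is i); rows sum to 1.
emission = np.array([[0.4, 0.1, 0.1, 0.4],
                     [0.1, 0.4, 0.4, 0.1]])


def step(distribution):
    """State distribution at t+1, computed from the distribution at t only."""
    return distribution @ transition


def sample(length, rng):
    """Sample a hidden state path and the observed symbols it emits."""
    state = 0  # start in the first state for simplicity
    hidden_path, observed = [], []
    for _ in range(length):
        hidden_path.append(states[state])
        observed.append(symbols[rng.choice(4, p=emission[state])])
        state = rng.choice(2, p=transition[state])
    return hidden_path, "".join(observed)


after_one_step = step(np.array([1.0, 0.0]))
after_two_steps = step(after_one_step)
hidden_path, observed = sample(10, np.random.default_rng(0))
print(after_one_step, after_two_steps)
print(observed)     # what we can see
print(hidden_path)  # what an HMM has to infer
```

In a real application only `observed` is available, and the task is to reason about `hidden_path` from it.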
 
+HMMs have been widely used in bioinformatics for [sequence analysis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2766791/pdf/CG-10-402.pdf), including database searches, gene prediction, and pairwise and multiple sequence alignment. Many problems in biological sequence analysis share the same pattern: given a sequence of symbols (nucleotides or amino acids), predict which protein or phylogenetic family it belongs to. This task is known as “sequence-based homology detection”.
+
+HMMs built from closely related species can show improved sensitivity in sequence searches for [remote homology](https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1002195&type=printable). In contrast to BLAST-based sequence alignment, i) HMMs can use position-specific gap penalties, which better capture the changes occurring in [conserved vs variable regions](https://reader.elsevier.com/reader/sd/pii/S0022283684711041?token=DB01FA515414FC42BC5DCA4555C53A6D84434346ED3F29CC8A2FDE4DD008FFCAE2A9783792E15F30C072F5CB6BE63457&originRegion=us-east-1&originCreation=20220308161623). Moreover, ii) the overall alignment score reflects not just the single best-scoring alignment but a consensus over all possible alignments. This assists in the [effective prediction](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2447419/pdf/CFG-04-250.pdf) of true homologs.
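
To make point ii) concrete, the sketch below contrasts two standard HMM computations on a toy model: the Viterbi score, which keeps only the single most probable state path, and the forward score, which sums the probability of every possible path. The model parameters are invented for illustration, and this is not how any particular search tool is implemented. A brute-force enumeration over all paths is included only as a check on the two dynamic-programming routines.

```python
from itertools import product

import numpy as np

# Hypothetical two-state model over the alphabet A, C, G, T (indices 0..3).
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
emit = np.array([[0.4, 0.1, 0.1, 0.4],
                 [0.1, 0.4, 0.4, 0.1]])
obs = [0, 1, 2, 2, 3]  # the sequence "ACGGT" encoded as indices


def forward_likelihood(obs, start, trans, emit):
    """P(obs) summed over all hidden state paths (forward algorithm)."""
    alpha = start * emit[:, obs[0]]
    for symbol in obs[1:]:
        alpha = (alpha @ trans) * emit[:, symbol]
    return alpha.sum()


def viterbi_likelihood(obs, start, trans, emit):
    """Joint probability of obs and the single best hidden state path."""
    delta = start * emit[:, obs[0]]
    for symbol in obs[1:]:
        delta = (delta[:, None] * trans).max(axis=0) * emit[:, symbol]
    return delta.max()


def path_probability(path, obs, start, trans, emit):
    """Joint probability of one specific hidden path and the observations."""
    p = start[path[0]] * emit[path[0], obs[0]]
    for prev, cur, symbol in zip(path, path[1:], obs[1:]):
        p *= trans[prev, cur] * emit[cur, symbol]
    return p


total = forward_likelihood(obs, start, trans, emit)
best = viterbi_likelihood(obs, start, trans, emit)

# Brute-force check: enumerate every path explicitly (feasible only for toys).
every_path = [path_probability(p, obs, start, trans, emit)
              for p in product(range(2), repeat=len(obs))]
all_paths_total = sum(every_path)
all_paths_best = max(every_path)

print(total, best)
```

The forward score is never smaller than the Viterbi score, because the best path is one of the terms in the sum.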
+
+Within the anvi’o environment, running ***“anvi-run-hmms”*** on your contigs database allows you to identify the single-copy core genes (SCGs) present in it. Since SCGs are phylogenetically conserved, they are good candidates for estimating the [completeness of genomes](https://pubmed.ncbi.nlm.nih.gov/26500826/). Anvi’o comes with HMMs for bacterial, archaeal, and ribosomal RNA genes.
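
The reasoning behind using SCGs for completeness can be sketched in a few lines of Python. The function below is a simplified illustration and not anvi’o's own calculation: it assumes a hypothetical list of SCG names expected in a genome and a list of SCG names that an HMM search reported, then derives the percentage of expected SCGs seen at least once (completion) and the percentage seen more than once (redundancy).

```python
from collections import Counter


def completion_and_redundancy(expected_scgs, hits):
    """Return (completion %, redundancy %) for a set of expected SCGs.

    expected_scgs: names of SCGs expected exactly once per genome.
    hits: SCG names reported by an HMM search, one entry per hit.
    """
    counts = Counter(hit for hit in hits if hit in expected_scgs)
    found_once_or_more = sum(1 for gene in expected_scgs if counts[gene] >= 1)
    found_more_than_once = sum(1 for gene in expected_scgs if counts[gene] > 1)
    n = len(expected_scgs)
    return 100 * found_once_or_more / n, 100 * found_more_than_once / n


# Toy example with four expected genes; names and hits are illustrative only.
expected = {"Ribosomal_L1", "Ribosomal_L2", "Ribosomal_S3", "Ribosomal_S7"}
hits = ["Ribosomal_L1", "Ribosomal_L2", "Ribosomal_L2", "Ribosomal_S7"]

completion, redundancy = completion_and_redundancy(expected, hits)
print(completion, redundancy)
```

A missing SCG lowers completion, while an SCG that appears more than once raises redundancy, which can hint at contamination in a genome bin.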
+
+ ### Completion
 {:data-tags="completion,completeness"}
 ### Completion