Understanding RepeatMasker and repeat annotation

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked.

From: http://www.repeatmasker.org/

RepeatMasker screens sequences against a library of repeat families; new families can be identified de novo with RepeatModeler.

The full RepeatMasker track can be downloaded, e.g. via `wget "https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/rmsk.txt.gz"`. The fields should be labelled as follows:

| Field | Meaning |
| --- | --- |
| chrom | Genomic sequence name |
| chromStart | Start in genomic sequence |
| chromEnd | End in genomic sequence |
| name | Name of repeat |
| score | always 0 place holder |
| strand | Relative orientation + or - |
| swScore | Smith Waterman alignment score |
| milliDiv | Base mismatches in parts per thousand |
| milliDel | Bases deleted in parts per thousand |
| milliIns | Bases inserted in parts per thousand |
| genoLeft | -#bases after match in genomic sequence |
| repClass | Class of repeat |
| repFamily | Family of repeat |
| repStart | Start (if strand is +) or -#bases after match (if strand is -) in repeat sequence |
| repEnd | End in repeat sequence |
| repLeft | -#bases after match (if strand is +) or start (if strand is -) in repeat sequence |

Based on info from http://genomewiki.ucsc.edu/index.php/RepeatMasker
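The field definitions above can be turned into a small parser. The following is a minimal sketch, not an official reader: the raw column order in `RAW_COLUMNS` is an assumption based on the UCSC `rmsk` table schema (17 tab-separated columns starting with `bin`) and should be checked against the table definition that ships with the download. The function names and the two sample records are made up for illustration.

```python
import io

# ASSUMPTION: column order of the raw rmsk.txt dump (UCSC `rmsk` table schema).
# Verify this against the table definition before relying on it.
RAW_COLUMNS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
    "repName", "repClass", "repFamily", "repStart", "repEnd", "repLeft", "id",
]

# Map raw column names onto the labels used in the field table above.
RENAME = {"genoName": "chrom", "genoStart": "chromStart",
          "genoEnd": "chromEnd", "repName": "name"}

INT_FIELDS = ("chromStart", "chromEnd", "swScore", "milliDiv", "milliDel",
              "milliIns", "genoLeft", "repStart", "repEnd", "repLeft")


def parse_rmsk(handle):
    """Yield one dict per record, keyed by the labels from the field table."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        values = line.split("\t")
        if len(values) != len(RAW_COLUMNS):
            raise ValueError(f"expected {len(RAW_COLUMNS)} columns, got {len(values)}")
        raw = dict(zip(RAW_COLUMNS, values))
        rec = {RENAME.get(k, k): v for k, v in raw.items() if k not in ("bin", "id")}
        rec["score"] = 0  # placeholder field, always 0
        for key in INT_FIELDS:
            rec[key] = int(rec[key])
        yield rec


def consensus_coords(rec):
    """Return (start, end, bases_left) in the repeat sequence, strand-aware.

    On "+", repStart is the start and repLeft is -#bases after the match.
    On "-", the two fields swap roles.
    """
    if rec["strand"] == "+":
        return rec["repStart"], rec["repEnd"], -rec["repLeft"]
    return rec["repLeft"], rec["repEnd"], -rec["repStart"]


def percent_divergence(rec):
    """Convert milliDiv (parts per thousand) to percent."""
    return rec["milliDiv"] / 10.0


# Two made-up records (not taken from the real file), one per strand.
SAMPLE = (
    "585\t1000\t105\t9\t10\tchr1\t100\t300\t-5000\t-\tL1_example\tLINE\tL1\t-50\t400\t201\t1\n"
    "585\t500\t20\t0\t0\tchr1\t400\t450\t-4850\t+\tB1_example\tSINE\tAlu\t1\t50\t-100\t2\n"
)

for rec in parse_rmsk(io.StringIO(SAMPLE)):
    print(rec["name"], rec["repClass"], consensus_coords(rec), percent_divergence(rec))
```

For the downloaded file, the same function should accept a handle opened with `gzip.open("rmsk.txt.gz", "rt")`.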

  • Repeats are grouped into up to ten different classes:
    • Short interspersed nuclear elements (SINE), which include ALUs
    • Long interspersed nuclear elements (LINE)
    • Long terminal repeat elements (LTR), which include retroposons
    • DNA repeat elements (DNA)
    • Simple repeats (micro-satellites)
    • Low complexity repeats
    • Satellite repeats
    • RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA)
    • Other repeats, which includes class RC (Rolling Circle)
    • Unknown

"A "?" at the end of the "Family" or "Class" (for example, DNA?) signifies that the curator was unsure of the classification. At some point in the future, either the "?" will be removed or the classification will be changed."

From the UCSC Genome Browser track description
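When summarising a track by class, the trailing "?" described above is easy to handle explicitly. A minimal sketch, assuming the class labels have already been read into a list of strings; the function names `split_uncertain` and `class_summary` are hypothetical helpers, and the example labels are invented:

```python
from collections import Counter


def split_uncertain(label):
    """Return (label without a trailing '?', True if the curator was unsure)."""
    if label.endswith("?"):
        return label[:-1], True
    return label, False


def class_summary(rep_classes):
    """Count records per class, and separately count the uncertain ones."""
    counts = Counter()
    uncertain = Counter()
    for label in rep_classes:
        base, unsure = split_uncertain(label)
        counts[base] += 1
        if unsure:
            uncertain[base] += 1
    return counts, uncertain


# Invented labels for illustration only.
counts, uncertain = class_summary(["SINE", "LINE", "DNA?", "DNA", "LTR", "SINE?"])
for name in sorted(counts):
    print(name, counts[name], "uncertain:", uncertain[name])
```

Whether uncertain labels should be merged with their base class or kept apart depends on the analysis; this sketch merges them but keeps a separate tally.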

Families, classes and so on

The most elementary level of classification of TEs is the family, which designates interspersed genomic copies derived from the amplification of an ancestral progenitor sequence (10). Each TE family can be represented by a consensus sequence approximating that of the ancestral progenitor.

From Flynn et al. (2020)
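To make the idea of a family consensus concrete, here is a toy sketch that takes already-aligned copies of one family and picks the most common base per column. This is a simplification for illustration, not the method used by RepeatModeler or described by Flynn et al.; the function name `majority_consensus` and the example sequences are made up.

```python
from collections import Counter


def majority_consensus(aligned_copies, gap="-"):
    """Majority-rule consensus of equal-length, pre-aligned sequences.

    Gap characters are ignored when voting; columns that contain only
    gaps are dropped from the consensus.
    """
    if not aligned_copies:
        return ""
    length = len(aligned_copies[0])
    if any(len(seq) != length for seq in aligned_copies):
        raise ValueError("all aligned copies must have the same length")
    consensus = []
    for column in zip(*aligned_copies):
        bases = [base for base in column if base != gap]
        if not bases:
            continue
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


# Invented copies of one family, each carrying a different mutation.
copies = ["ACGTAC", "ACGTTC", "ACCTAC", "A-GTAC"]
print(majority_consensus(copies))
```

Each copy differs from the others at one position, yet the column-wise vote recovers a single sequence that approximates their shared progenitor.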

RepeatModeler contains a basic homology-based classification module (RepeatClassifier) which compares the TE families generated by the various de novo tools to both the RepeatMasker Repeat Protein Database (DB) and to the RepeatMasker libraries (e.g., Dfam and/or RepBase). The Repeat Protein DB is a set of TE-derived coding sequences that covers a wide range of TE classes and organisms. As is often the case with a search against all known TE consensus sequences, there will be a high number of false positive or partial matches. RepeatClassifier uses a combination of score and overlap filters to produce a reduced set of high-confidence results. If there is a concordance in classification among the filtered results, RepeatClassifier will label the family using the RepeatMasker/Dfam classification system and adjust the orientation (if necessary). Remaining families are labeled “Unknown” if a call cannot be made. Classification is the only step that requires a database, and can be completed with only open-source Dfam if Repbase is not available.
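The filter-then-concordance logic in the passage above can be sketched as follows. This is an illustration of the idea only, not RepeatClassifier's actual code: the `Hit` structure, the two thresholds, and the example classifications are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    classification: str  # label of the matched database entry, e.g. "LINE/L1"
    score: float         # alignment score of the match
    overlap: float       # fraction of the query family covered by the match


# HYPOTHETICAL thresholds, chosen only for this example.
MIN_SCORE = 200.0
MIN_OVERLAP = 0.5


def classify_family(hits, min_score=MIN_SCORE, min_overlap=MIN_OVERLAP):
    """Label a family only if all hits that pass the filters agree."""
    kept = [h for h in hits if h.score >= min_score and h.overlap >= min_overlap]
    labels = {h.classification for h in kept}
    if len(labels) == 1:
        return labels.pop()
    # No surviving hits, or the surviving hits disagree.
    return "Unknown"


concordant = [Hit("LINE/L1", 850.0, 0.9), Hit("LINE/L1", 400.0, 0.7),
              Hit("DNA/hAT", 120.0, 0.2)]  # weak hit is filtered out
discordant = [Hit("LINE/L1", 850.0, 0.9), Hit("LTR/ERVK", 600.0, 0.8)]

print(classify_family(concordant))
print(classify_family(discordant))
```

The weak partial match in the first example is discarded by the filters, so the remaining hits agree and a label is assigned; in the second example two strong hits disagree, so the family falls back to "Unknown".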