hgvs - manipulate biological sequence variants according to Human Genome Variation Society recommendations

The hgvs package provides a Python library to parse, format, validate, normalize, and map sequence variants according to Variation Nomenclature (aka Human Genome Variation Society) recommendations.

Features

Parsing is based on a formal grammar.
An easy-to-use object model that represents most variant types (SNVs, indels, dups, inversions, etc.) and concepts (intronic offsets, uncertain positions, intervals)
A variant normalizer that rewrites variants in canonical forms and substitutes reference sequences (if reference and transcript sequences differ)
Formatters that generate HGVS strings from internal representations
Tools to map variants between genome, transcript, and protein sequences
Reliable handling of regions with genome-transcript discrepancies
Pluggable data providers support alternative sources of transcript mapping data
Extensive automated tests, including those for all variant types and "problematic" transcripts
Easily installed using remote data sources. Installation with local data sources is straightforward and completely obviates network access

Important Notes

You are encouraged to browse issues. All known issues are listed there. Please report any issues you find.
Use a pip package specification to stay within minor releases. For example, hgvs>=1.1,<1.2. hgvs uses Semantic Versioning.

Examples

Installation

By default, hgvs uses remote data sources, which makes installation easy.

$ mkvirtualenv hgvs-test
(hgvs-test)$ pip install --upgrade setuptools
(hgvs-test)$ pip install hgvs
(hgvs-test)$ python

See Installation instructions for details, including instructions for installing Universal Transcript Archive (UTA) and SeqRepo locally.

Parsing and Formatting

hgvs parses HGVS variants (as strings) into an object model, and can format object models back into HGVS strings.

>>> import hgvs.parser

# start with these variants as strings
>>> hgvs_g = 'NC_000007.13:g.36561662C>T'
>>> hgvs_c = 'NM_001637.3:c.1582G>A'

# parse the genomic variant into a Python structure
>>> hp = hgvs.parser.Parser()
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> var_g
SequenceVariant(ac=NC_000007.13, type=g, posedit=36561662C>T)

# SequenceVariants are composed of structured objects, e.g.,
>>> var_g.posedit.pos.start
SimplePosition(base=36561662, uncertain=False)

# format by stringification
>>> str(var_g)
'NC_000007.13:g.36561662C>T'
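
To make the parse/format round trip concrete without depending on the library, here is a minimal sketch for the simplest case, a single-nucleotide substitution. It is not the hgvs grammar or object model: the regular expression, the `ToyVariant` class, and the `parse_simple_substitution` function are hypothetical names invented for this illustration, and they accept only strings of the form `accession:type.positionREF>ALT`.

```python
import re
from dataclasses import dataclass

# Toy pattern for substitutions such as 'NC_000007.13:g.36561662C>T'.
# Assumption: only simple DNA substitutions are covered; the real hgvs
# grammar handles far more (intervals, offsets, indels, uncertainty, ...).
_SUBSTITUTION = re.compile(
    r"^(?P<ac>[A-Za-z0-9_]+\.\d+)"      # versioned accession
    r":(?P<type>[cgmn])"                # sequence type
    r"\.(?P<pos>\d+)"                   # 1-based position
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"  # reference and alternate base
)


@dataclass
class ToyVariant:
    """Hypothetical stand-in for a parsed variant (not hgvs.SequenceVariant)."""
    ac: str
    type: str
    pos: int
    ref: str
    alt: str

    def __str__(self):
        # Formatting is the inverse of parsing: reassemble the fields.
        return f"{self.ac}:{self.type}.{self.pos}{self.ref}>{self.alt}"


def parse_simple_substitution(text):
    """Parse a simple substitution string into a ToyVariant."""
    m = _SUBSTITUTION.match(text)
    if m is None:
        raise ValueError(f"not a simple substitution: {text!r}")
    return ToyVariant(m["ac"], m["type"], int(m["pos"]), m["ref"], m["alt"])


variant = parse_simple_substitution("NC_000007.13:g.36561662C>T")
print(variant.pos)   # structured access to the position
print(str(variant))  # stringification reproduces the input
```

Keeping the parsed fields typed (for example, `pos` as an integer) is what makes later steps such as coordinate arithmetic possible; the string form is only produced on output.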

Projecting ("Mapping") variants between aligned genome and transcript sequences

hgvs provides tools to project variants between genome, transcript, and protein sequences. Non-coding and intronic variants are supported. Alignment data come from the Universal Transcript Archive (UTA).

>>> import hgvs.dataproviders.uta
>>> import hgvs.assemblymapper

# initialize the mapper for GRCh37 with splign-based alignments
>>> hdp = hgvs.dataproviders.uta.connect()
>>> am = hgvs.assemblymapper.AssemblyMapper(hdp,
...          assembly_name='GRCh37', alt_aln_method='splign',
...          replace_reference=True)

# identify transcripts that overlap this genomic variant
>>> transcripts = am.relevant_transcripts(var_g)
>>> sorted(transcripts)
['NM_001177506.1', 'NM_001177507.1', 'NM_001637.3']

# map genomic variant to one of these transcripts
>>> var_c = am.g_to_c(var_g, 'NM_001637.3')
>>> var_c
SequenceVariant(ac=NM_001637.3, type=c, posedit=1582G>A)
>>> str(var_c)
'NM_001637.3:c.1582G>A'

# CDS coordinates use BaseOffsetPosition to support intronic offsets
>>> var_c.posedit.pos.start
BaseOffsetPosition(base=1582, offset=0, datum=Datum.CDS_START, uncertain=False)
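
The role of the base/offset pair is easiest to see in how such positions are written. The sketch below is independent of hgvs: `ToyCdsPosition` and its `datum` strings are hypothetical stand-ins chosen to mirror the fields shown above. It follows the HGVS conventions that c. positions are counted from the start codon, that positions 3' of the stop codon carry a `*` prefix, and that a non-zero offset denotes an intronic base relative to the nearest exonic base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCdsPosition:
    """Hypothetical illustration of a base-plus-offset coding position."""
    base: int                 # exonic anchor, counted relative to the datum
    offset: int = 0           # 0 for exonic bases; +n or -n for intronic bases
    datum: str = "CDS_START"  # "CDS_END" marks positions after the stop codon

    def __str__(self):
        prefix = "*" if self.datum == "CDS_END" else ""
        # The "+d" format forces an explicit sign, so offsets print as +1 or -2.
        suffix = f"{self.offset:+d}" if self.offset else ""
        return f"{prefix}{self.base}{suffix}"


print(ToyCdsPosition(1582))                   # exonic base in the CDS
print(ToyCdsPosition(88, offset=1))           # first intronic base after c.88
print(ToyCdsPosition(89, offset=-2))          # second intronic base before c.89
print(ToyCdsPosition(-12))                    # 5' UTR, upstream of the start codon
print(ToyCdsPosition(5, datum="CDS_END"))     # 3' UTR, downstream of the stop codon
```

An exonic position is simply the special case `offset=0`, which is why the mapped variant above reports `offset=0`.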

Translating coding variants to protein sequences

Coding variants may be translated to their protein consequences. hgvs uses the same pairing of transcript and protein accessions as seen in NCBI and Ensembl.

# translate var_c to its protein consequence
# The object structure of protein variants is nearly identical to
# that of nucleic acid variants and is converted to a string form
# by stringification. Per HGVS recommendations, inferred consequences
# must have parentheses to indicate uncertainty.
>>> var_p = am.c_to_p(var_c)
>>> var_p
SequenceVariant(ac=NP_001628.1, type=p, posedit=(Gly528Arg))
>>> str(var_p)
'NP_001628.1:p.(Gly528Arg)'

# setting uncertain to False removes the parentheses on the
# stringified form
>>> var_p.posedit.uncertain = False
>>> str(var_p)
'NP_001628.1:p.Gly528Arg'

# formatting can be customized, e.g., use 1 letter amino acids to
# format a specific variant
>>> var_p.format(conf={"p_3_letter": False})
'NP_001628.1:p.G528R'

# configuration may also be set globally
>>> hgvs.global_config.formatting.p_3_letter = False
>>> str(var_p)
'NP_001628.1:p.G528R'
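
Two small pieces of arithmetic sit behind the output above: which residue a coding position falls in, and how three-letter residue codes relate to one-letter codes. The following is a minimal sketch of both using only the standard library. The function names `codon_number` and `to_one_letter` are hypothetical, and this is not how hgvs computes protein consequences (that requires the transcript sequence and a full translation).

```python
import re

# Standard three-letter to one-letter amino acid codes, plus Ter for stop.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}


def codon_number(c_pos):
    """Return the 1-based codon (residue) number containing coding position c_pos."""
    if c_pos < 1:
        raise ValueError("only positions inside the CDS (c_pos >= 1) map to a codon")
    return (c_pos - 1) // 3 + 1


def to_one_letter(text):
    """Replace three-letter residue codes in a protein variant string."""
    # A capital followed by two lowercase letters is treated as a candidate
    # code; anything not in the table (none expected here) is left untouched.
    return re.sub(
        r"[A-Z][a-z]{2}",
        lambda m: THREE_TO_ONE.get(m.group(0), m.group(0)),
        text,
    )


print(codon_number(1582))                          # residue affected by c.1582
print(to_one_letter("NP_001628.1:p.(Gly528Arg)"))  # one-letter form
```

With `c_pos = 1582`, `(1582 - 1) // 3 + 1` evaluates to 528, which matches the residue number in `p.(Gly528Arg)`.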

Normalizing variants

Some variants have multiple representations due to intrinsic biological ambiguity (e.g., inserting a G in a poly-G run) or due to misunderstanding HGVS recommendations. Normalization rewrites certain variants into a single representation.

# rewrite ins as dup (depends on sequence context)
>>> import hgvs.normalizer
>>> hn = hgvs.normalizer.Normalizer(hdp)
>>> hn.normalize(hp.parse_hgvs_variant('NM_001166478.1:c.35_36insT'))
SequenceVariant(ac=NM_001166478.1, type=c, posedit=35dup)

# during mapping, variants are normalized (by default)
>>> c1 = hp.parse_hgvs_variant('NM_001166478.1:c.31del')
>>> c1
SequenceVariant(ac=NM_001166478.1, type=c, posedit=31del)
>>> c1n = hn.normalize(c1)
>>> c1n
SequenceVariant(ac=NM_001166478.1, type=c, posedit=35del)
>>> g = am.c_to_g(c1)
>>> g
SequenceVariant(ac=NC_000006.11, type=g, posedit=49917127del)
>>> c2 = am.g_to_c(g, c1.ac)
>>> c2
SequenceVariant(ac=NM_001166478.1, type=c, posedit=35del)
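
The c.31del to c.35del rewrite above is what one would expect if bases 31 through 35 of the transcript are identical, since HGVS recommends describing such variants at the most 3' position. The sketch below illustrates that 3' shifting idea on a made-up sequence. It is a simplified illustration, not the hgvs normalizer: `shift_deletion_3prime` is a hypothetical function, it handles deletions only, and it uses 0-based, end-exclusive coordinates.

```python
def shift_deletion_3prime(seq, start, end):
    """Shift the deletion of seq[start:end] as far 3' as possible.

    Coordinates are 0-based and end-exclusive. Returns the shifted
    (start, end); deleting either interval yields the same sequence.
    """
    if not 0 <= start < end <= len(seq):
        raise ValueError("deletion interval must lie within the sequence")
    # The interval can slide one base to the right whenever the base that
    # enters on the right equals the base that leaves on the left.
    while end < len(seq) and seq[start] == seq[end]:
        start += 1
        end += 1
    return start, end


# Made-up sequence with a run of five T's at 1-based positions 4..8.
seq = "GCATTTTTGA"
start, end = shift_deletion_3prime(seq, 3, 4)  # delete the first T (position 4)
print(start + 1)  # 1-based position of the deleted base after shifting

# Both descriptions produce the same deleted sequence.
assert seq[:3] + seq[4:] == seq[:start] + seq[end:]
```

The loop condition is exactly the test for equivalence: deleting `seq[start:end]` and deleting `seq[start+1:end+1]` give the same result if and only if `seq[start] == seq[end]`.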

There are more examples in the documentation.

Citing hgvs (the package)

A Python Package for Parsing, Validating, Mapping, and Formatting Sequence Variants Using HGVS Nomenclature.

Hart RK, Rico R, Hare E, Garcia J, Westbrook J, Fusaro VA.

Bioinformatics. 2014 Sep 30. PubMed | Open Access PDF

Contributing

The hgvs package is intended to be a community project. Please see Contributing to get started in submitting source code, tests, or documentation. Thanks for getting involved!
