GST - Gene Symbol Transformer

A Transformer model for gene symbol assignment of protein coding gene sequences.

The classification model is built for assigning gene symbols and names to protein coding gene sequences from the Ensembl Genebuild annotation.

The network has been trained on a dataset constructed from existing protein coding gene sequences with assigned symbols in the Ensembl main release, where the symbols were assigned either by manual annotation or by an HMM homology method.

network architecture

The training pipeline uses a Transformer to learn higher-dimensional features of the raw sequences and performs multiclass classification over the ~30,500 gene symbols that occur most frequently in the dataset.
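As an illustration of restricting the classification targets to the most frequent symbols, the following is a minimal sketch and not the repository's code. The function name `select_most_frequent_symbols` and the toy labels are hypothetical, and it assumes frequency is measured as a plain count of how often each symbol appears in the dataset.

```python
from collections import Counter


def select_most_frequent_symbols(symbols, num_symbols):
    """Return the num_symbols most frequent symbols and a symbol -> class index map.

    Hypothetical helper: assumes frequency is a simple occurrence count.
    """
    counts = Counter(symbols)
    # most_common sorts by count, highest first
    selected = [symbol for symbol, _ in counts.most_common(num_symbols)]
    symbol_to_index = {symbol: index for index, symbol in enumerate(selected)}
    return selected, symbol_to_index


# toy labels for illustration only; the real dataset has ~30,500 classes
example_symbols = ["BRCA2", "TP53", "BRCA2", "SOX2", "TP53", "BRCA2"]
selected, symbol_to_index = select_most_frequent_symbols(example_symbols, num_symbols=2)
```

Sequences whose symbol falls outside the selected set would have no class index in this sketch; how the actual pipeline treats them is not stated here.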

performance

The Transformer network exhibits high accuracy in assigning gene symbols. To take the network's confidence into account, the probability of the highest-scoring gene symbol prediction is compared against a cutoff threshold, and only predictions that meet the threshold are included in the final assignments. A probability threshold of 0.9 is used for assigning gene symbols in production.
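The thresholding step can be sketched as follows. This is an illustration under stated assumptions, not the production code: the names `softmax` and `assign_symbol` are hypothetical, the network output is assumed to be one raw score (logit) per symbol, and whether the comparison at exactly 0.9 is inclusive is also an assumption.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def assign_symbol(logits, symbols, threshold=0.9):
    """Return (symbol, probability), with symbol None when below the threshold.

    Hypothetical helper; the inclusive comparison (>=) is an assumption.
    """
    probabilities = softmax(logits)
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    best_probability = probabilities[best_index]
    if best_probability >= threshold:
        return symbols[best_index], best_probability
    return None, best_probability


symbols = ["BRCA2", "TP53", "SOX2"]
confident = assign_symbol([5.0, 0.0, 0.0], symbols)   # one score dominates
uncertain = assign_symbol([1.0, 0.9, 0.8], symbols)   # scores are close together
```

In the second call the three scores are nearly equal, so no probability reaches 0.9 and no symbol is assigned.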

run GST with Docker

Download a pretrained Transformer model and use it to assign gene symbols to gene sequences saved in a FASTA file.

docker run --read-only \
    --volume="<checkpoints directory path>":/app/checkpoints \
    --volume="<sequences file directory path>":/app/data \
    ensemblorg/gene_symbol_classifier:0.12.1 \
    --checkpoint "/app/checkpoints/<checkpoint filename>" \
    --sequences_fasta "/app/data/<sequences fasta filename>" \
    --scientific_name "<species scientific name>"
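When the command is launched from a script, the placeholders above can be filled in programmatically. The sketch below only assembles the argument list from the flags shown in the command. The function name `build_gst_docker_command` and the example paths are hypothetical, and host paths are assumed to be absolute POSIX paths.

```python
from pathlib import PurePosixPath


def build_gst_docker_command(
    checkpoint_path,
    sequences_fasta_path,
    scientific_name,
    image="ensemblorg/gene_symbol_classifier:0.12.1",
):
    """Assemble the docker run argument list for the command shown above.

    Hypothetical helper: assumes absolute POSIX host paths.
    """
    checkpoint = PurePosixPath(checkpoint_path)
    fasta = PurePosixPath(sequences_fasta_path)
    return [
        "docker", "run", "--read-only",
        # mount the directories that hold the checkpoint and the FASTA file
        f"--volume={checkpoint.parent}:/app/checkpoints",
        f"--volume={fasta.parent}:/app/data",
        image,
        "--checkpoint", f"/app/checkpoints/{checkpoint.name}",
        "--sequences_fasta", f"/app/data/{fasta.name}",
        "--scientific_name", scientific_name,
    ]


# example paths and species are placeholders for illustration only
command = build_gst_docker_command(
    "/data/checkpoints/model.ckpt",
    "/data/sequences/proteins.fa",
    "Homo sapiens",
)
```

The resulting list can be passed to `subprocess.run(command, check=True)`; passing a list rather than a shell string avoids quoting problems with the space in the scientific name.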

License

Apache License 2.0
