pLM-BLAST

pLM-BLAST is a sensitive remote homology detection tool based on the comparison of residue embeddings obtained from the protein language model ProtT5 (ProtTrans). It is available as a standalone package and as an easy-to-use web server within the MPI Bioinformatics Toolkit, where several precomputed databases (e.g., ECOD, InterPro, and PDB) can be searched.

Note: the method is under active development; new features such as speed improvements and global alignment are expected soon.

Installation

For local use, install the dependencies with pip from requirements.txt, or create a new conda environment from environment.yml.

pip

pip install -r requirements.txt

conda

conda env create -f environment.yml

Alternatively, the packages listed below can be installed manually:

python==3.9
pandas
pytorch
scikit-learn
biopython 
tqdm
numba
transformers
sentencepiece 
matplotlib
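
When installing the packages manually, keep in mind that some of them are imported under a different name than the one used for installation (for example, pytorch is imported as torch). The following is a minimal sketch, not part of pLM-BLAST, that reports which of the packages listed above cannot be found in the current environment. The helper name find_missing is hypothetical.

```python
import importlib.util

# Install name -> import name. Three entries differ:
# pytorch -> torch, scikit-learn -> sklearn, biopython -> Bio.
REQUIRED_MODULES = {
    "pandas": "pandas",
    "pytorch": "torch",
    "scikit-learn": "sklearn",
    "biopython": "Bio",
    "tqdm": "tqdm",
    "numba": "numba",
    "transformers": "transformers",
    "sentencepiece": "sentencepiece",
    "matplotlib": "matplotlib",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the install names whose import module cannot be located."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = find_missing()
    print("missing packages:", ", ".join(missing) if missing else "none")
```

The check only confirms that each module can be located; it does not verify versions or GPU support.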

Usage

Databases

Pre-calculated databases can be downloaded from http://ftp.tuebingen.mpg.de/pub/protevo/toolkit/databases/plmblast_dbs. To create a custom database, use the embeddings.py script:

embeddings.py -embedder pt -cname column_name database.csv database.pt_emb.p --gpu

database.csv is an index file defining sequences and their descriptions. For example, the first lines of the ECOD database index are:

,id,description,sequence
0,ECOD_000151743_e4aybQ1,"ECOD_000151743_e4aybQ1 | 4146.1.1.2 | 4AYB Q:33-82 | A: alpha bundles, X: NO_X_NAME, H: NO_H_NAME, T: YqgQ-like, F: RNA_pol_Rpo13 | Protein: DNA-DIRECTED RNA POLYMERASE",FPKLSIQDIELLMKNTEIWDNLLNGKISVDEAKRLFEDNYKDYEKRDSRR
1,ECOD_000399743_e3nmdE1,"ECOD_000399743_e3nmdE1 | 5027.1.1.3 | 3NMD E:3-53 | A: extended segments, X: NO_X_NAME, H: NO_H_NAME, T: Preprotein translocase SecE subunit, F: DD_cGKI-beta | Protein: cGMP Dependent PRotein Kinase",LRDLQYALQEKIEELRQRDALIDELELELDQKDELIQMLQNELDKYRSVI
2,ECOD_002164660_e6atuF1,"ECOD_002164660_e6atuF1 | 927.1.1.1 | 6ATU F:8-57 | A: few secondary structure elements, X: NO_X_NAME, H: NO_H_NAME, T: Elafin-like, F: WAP | Protein: Elafin",PVSTKPGSCPIILIRCAMLNPPNRCLKDTDCPGIKKCCEGSCGMACFVPQ

The index can be generated from a FASTA file using scripts/makeindex.py.

Use -cname to specify which column of database.csv holds the sequences.
The resulting embeddings are stored in database.pt_emb.p.
Using --gpu is highly recommended (CPU calculations are orders of magnitude slower).
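
To make the index layout shown above concrete, here is a minimal sketch that writes a CSV with the same columns (an unnamed row number, id, description, sequence) from FASTA-formatted text. It does not reproduce scripts/makeindex.py. The function names parse_fasta and write_index are hypothetical, and taking the first whitespace-delimited token of the header as the id is an assumption based on the ECOD example.

```python
import csv
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_index(fasta_handle, csv_handle):
    """Write rows of (row number, id, description, sequence); return the row count."""
    writer = csv.writer(csv_handle, lineterminator="\n")
    writer.writerow(["", "id", "description", "sequence"])
    n_written = 0
    for header, seq in parse_fasta(fasta_handle):
        # assumption: the id is the first whitespace-delimited token of the header
        seq_id = (header.split() or [""])[0]
        writer.writerow([n_written, seq_id, header, seq])
        n_written += 1
    return n_written


if __name__ == "__main__":
    fasta = io.StringIO(">seqA | toy entry\nMKT\nLLV\n>seqB\nGGS\n")
    out = io.StringIO()
    write_index(fasta, out)
    print(out.getvalue())
```

With real files, the two handles would be opened with open("database.fasta") and open("database.csv", "w", newline=""); both file names are placeholders.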

Searching a database

To search a pre-calculated or custom database, follow the example in scripts/example.sh.

Use in Python

import torch
import alntools as aln
from alntools.base import Extractor
import pandas as pd

# load the embedding; for the embedding calculation, refer to `scripts/example.sh`
emb_file = './scripts/output/A9A4Y8.pt_emb.p'
embs = torch.load(emb_file)

# a self-comparison will be performed
seq1_emb, seq2_emb = embs[0], embs[0]

# calculate the embedding similarity, which plays the role of a substitution matrix
densitymap = aln.density.embedding_similarity(seq1_emb, seq2_emb)
# convert to numpy array
densitymap = densitymap.cpu().numpy()
# find all possible alignment paths (traceback from the borders)
paths = aln.alignment.gather_all_paths(densitymap)
# score those paths
results = aln.prepare.search_paths(densitymap, paths=paths, as_df=True)
# remove redundant hits
results = aln.postprocess.filter_result_dataframe(results)
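
To illustrate what a residue-by-residue similarity matrix of the kind computed in the first step might look like, the sketch below builds one with plain cosine similarity on toy data. This is an assumption made for illustration: aln.density.embedding_similarity may use a different measure or normalization, and the function name cosine_similarity_matrix is hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(emb1, emb2, eps=1e-8):
    """Return an (L1, L2) matrix of cosine similarities between residue embeddings.

    emb1 has shape (L1, D) and emb2 has shape (L2, D), one row per residue.
    """
    emb1 = np.asarray(emb1, dtype=np.float64)
    emb2 = np.asarray(emb2, dtype=np.float64)
    # scale every residue vector to unit length; eps guards against zero vectors
    unit1 = emb1 / np.maximum(np.linalg.norm(emb1, axis=1, keepdims=True), eps)
    unit2 = emb2 / np.maximum(np.linalg.norm(emb2, axis=1, keepdims=True), eps)
    return unit1 @ unit2.T


# toy stand-in for a real embedding: 5 "residues" with 16 dimensions each
rng = np.random.default_rng(0)
toy_emb = rng.normal(size=(5, 16))

# self-comparison, mirroring the example above
similarity = cosine_similarity_matrix(toy_emb, toy_emb)
print(similarity.shape)
```

In a self-comparison the diagonal of this matrix is 1 and the matrix is symmetric, which makes it a convenient sanity check before path tracing.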

Remarks

How to cite?

If you find pLM-BLAST useful, please cite the preprint:

"pLM-BLAST – distant homology detection based on direct comparison of sequence representations from protein language models"
Kamil Kaminski, Jan Ludwiczak, Vikram Alva, and Stanislaw Dunin-Horkawicz
bioRxiv https://www.biorxiv.org/content/10.1101/2022.11.24.517862v1

Contact

If you have any questions, problems or suggestions, please contact us.

Funding

This work was supported by the First TEAM program of the Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund.
