Store for finding similar amino acid sequences using Locality Sensitive Hashing and Approximate Nearest Neighbors.
Python, with dependencies:
- Annoy
- BioPython
Install the Python dependencies with:
$ pip install -r requirements.txt
Redis is also required.
- Load proteins into database
$ ./load-proteins < data/proteins.fasta
- Query database
$ ./query-proteins < data/proteins.fasta # returns JSON for each record
The Python API is very basic. I plan to add more configuration options.
from Bio import SeqIO
import nearproteins

store = nearproteins.SimilarStringStore()
store.engine.clean_all_buckets()  # clear any previously stored buckets

# add every record in the FASTA file to the store
with open('data/proteins.fasta') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        store.add(str(record.seq), record.id)

# query with the last record loaded
# returns array of vectors, match IDs, similarities
results = store.get(str(record.seq))
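For intuition about the locality-sensitive hashing mentioned at the top, here is a minimal, standard-library-only sketch of one common scheme: sequences become k-mer count vectors, and random hyperplanes turn each vector into a short bit string, so that similar vectors tend to share bits. This is an illustration only, not the code inside nearproteins; the names `kmer_vector`, `make_hyperplanes`, and `lsh_key`, the k-mer length, and the number of bits are all assumptions made for the example.

```python
import random
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = 2  # k-mer length, chosen arbitrarily for this sketch
KMERS = ["".join(p) for p in product(AMINO_ACIDS, repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_vector(sequence, k=K):
    """Count overlapping k-mers; k-mers with non-standard letters are skipped."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vector = [0.0] * len(KMERS)
    for kmer, n in counts.items():
        if kmer in KMER_INDEX:
            vector[KMER_INDEX[kmer]] = float(n)
    return vector


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplanes (seeded, so keys are reproducible)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return "".join(
        "1" if sum(v * h for v, h in zip(vector, plane)) >= 0 else "0"
        for plane in hyperplanes
    )


planes = make_hyperplanes(n_bits=8, dim=len(KMERS))
key = lsh_key(kmer_vector("MKVLAAGIVG"), planes)
```

Sequences whose keys match land in the same bucket, so a query only has to be compared against that bucket's members rather than the whole database.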
You can query and add records to the database using simple sockets.
$ ./server # start the server, listens on port 1234
In another window...
$ nc 127.0.0.1 1234 # connect
SET 1 AUSTIN
SET 2 BOSTON
GET AUSTIN
{"1": 0.0}
GET BOSTON
{"2": 0.0}
You can use this to build a simple client in another language, such as Ruby:
require 'dna'
require 'socket'
require 'json'

HOSTNAME = '127.0.0.1'
PORT = '1234'

socket = TCPSocket.open HOSTNAME, PORT

File.open('proteins.fasta') do |handle|
  records = Dna.new handle, :format => :fasta
  records.each do |record|
    # send one query per record and read the JSON reply line
    socket.puts "GET #{record.sequence}"
    resp = JSON.parse(socket.gets)
    p resp
  end
end

socket.close
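The same protocol can be driven from Python using only the standard library. The sketch below assumes what the nc session and the Ruby client suggest: commands are newline-terminated text, and each GET is answered with a single line of JSON. Whether SET sends any reply is not shown above, so the sketch does not read one. The helper names (`format_set`, `format_get`, `parse_reply`, `query`) are made up for this example.

```python
import json
import socket


def format_set(record_id, sequence):
    """Build a SET command line, e.g. b'SET 1 AUSTIN\\n'."""
    return f"SET {record_id} {sequence}\n".encode("ascii")


def format_get(sequence):
    """Build a GET command line, e.g. b'GET AUSTIN\\n'."""
    return f"GET {sequence}\n".encode("ascii")


def parse_reply(line):
    """Decode one JSON reply line into a dict of match ID -> similarity."""
    return json.loads(line)


def query(sequence, host="127.0.0.1", port=1234):
    """Send one GET and return the parsed reply (assumes one JSON line per GET)."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(format_get(sequence))
        with conn.makefile("r", encoding="utf-8") as reader:
            return parse_reply(reader.readline())
```

With the server running, `query("AUSTIN")` would send `GET AUSTIN` and return the decoded JSON reply.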