NCBI blast tutorial

Short introduction to using NCBI blast tools from the command line

Using Blast from the command line

Sometimes, you may have to use blast on your own computer to query thousands of sequences against a custom database of hundreds of thousands of sequences. To do that, you will need to install Blast on your computer, format the database, and then blast the sequences.

Here is a short tutorial on how to do this.

Installing Blast+ tools

Get the compiled executables from this URL:

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Decompress the archive. For example:

tar xvfz ncbi-blast-2.9.0+-x64-linux.tar.gz

Add the bin folder from the extracted archive to your path. For example, add the following line to your ~/.bashrc file:

export PATH="/PATH/TO/ncbi-blast-2.9.0+/bin:$PATH"

And change the /PATH/TO part to the path where you have put the extracted archive.
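To confirm that the shell can find the executables after this change, a quick check along the following lines may help. This is a minimal sketch: the location `$HOME/ncbi-blast-2.9.0+` is only an assumption standing in for wherever you extracted the archive.

```shell
# Assumed install location; adjust to where you extracted the archive.
BLAST_BIN="$HOME/ncbi-blast-2.9.0+/bin"
export PATH="$BLAST_BIN:$PATH"

# Report the installed version, or explain what to check if blastn is not found.
if command -v blastn >/dev/null 2>&1; then
    blastn -version
else
    echo "blastn not found; check that $BLAST_BIN exists and is on your PATH" >&2
fi
```

Open a new terminal (or run `source ~/.bashrc`) before testing, since changes to ~/.bashrc only apply to shells started afterwards.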

Example sequences to use with the tutorial

To test blast, you need query sequences and reference sequences in fasta format. Use the following files that come with the tutorial:

  • sequences.fasta
  • reference.fasta
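Before going further, it can be useful to check that both files are present and to see how many sequences each one contains. The sketch below relies only on the fact that every fasta record starts with a `>` header line. The helper name `count_fasta_records` is made up for this example.

```shell
# Count fasta records: every record starts with a '>' header line.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Report the number of records in each tutorial file, if present.
for f in sequences.fasta reference.fasta; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_fasta_records "$f")"
    else
        echo "missing: $f" >&2
    fi
done
```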

Create blast database

The different blast tools require a formatted database to search against. In order to create the database, we use the makeblastdb tool:

makeblastdb -in reference.fasta -title reference -dbtype nucl -out databases/reference

This will create a set of files in the databases folder, all named after the prefix given with -out (here, reference). Together, these files make up the blast database.
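To check that the database was written, you can list the files that share the database prefix and, if blastdbcmd is on your path, ask it for a summary. This is a sketch rather than a required step. For a nucleotide database the files typically include extensions such as .nhr, .nin and .nsq, but the exact set depends on the blast+ version. The helper name `db_files_exist` is made up for this example.

```shell
# Prefix passed to makeblastdb with -out.
DB_PREFIX="databases/reference"

# Succeed if at least one file starting with the given prefix exists.
db_files_exist() {
    ls "$1".* >/dev/null 2>&1
}

if db_files_exist "$DB_PREFIX"; then
    ls -lh "$DB_PREFIX".*
    # Print a summary of the database if blastdbcmd is available.
    if command -v blastdbcmd >/dev/null 2>&1; then
        blastdbcmd -db "$DB_PREFIX" -info
    fi
else
    echo "no database files found for prefix $DB_PREFIX" >&2
fi
```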

Blast

We can now blast our sequences against the database. In this case, both our query sequences and database sequences are DNA sequences, so we use the blastn tool:

blastn -db databases/reference -query sequences.fasta -evalue 1e-3 -word_size 11 -outfmt 0 > sequences.reference

You can use different output formats with the outfmt option:

 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1)
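The tabular format (6) is usually the easiest one to process with other command line tools. By default, its columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The sketch below runs the same search with tabular output and keeps only the stronger hits. The output name sequences.reference.tsv, the helper name `filter_hits` and the thresholds (90% identity over 100 positions) are arbitrary choices for illustration.

```shell
# Default columns of -outfmt 6:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

# Keep hits with at least $2 percent identity (column 3)
# over an alignment of at least $3 positions (column 4).
filter_hits() {
    awk -F '\t' -v min_ident="$2" -v min_len="$3" \
        '$3 >= min_ident && $4 >= min_len' "$1"
}

# Only run the search when blastn, the query file and the database are available.
if command -v blastn >/dev/null 2>&1 && [ -f sequences.fasta ] &&
        ls databases/reference.* >/dev/null 2>&1; then
    blastn -db databases/reference -query sequences.fasta \
        -evalue 1e-3 -word_size 11 -outfmt 6 > sequences.reference.tsv
    filter_hits sequences.reference.tsv 90 100
fi
```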

Blast with parallel

If you need to run your blasts faster (and who doesn't?), you can maximise CPU usage with GNU parallel. You will find it at this link.

Download the archive, extract it (with tar xvjf parallel-latest.tar.bz2) and install it with the following commands:

./configure
make
sudo make install

We can now use parallel to speed up blast:

time cat sequences.fasta | parallel -k --block 1k --recstart '>' --pipe 'blastn -db databases/reference -query - -evalue 1e-3 -word_size 11 -outfmt 0' > sequences.reference
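In this command, --pipe splits the standard input into chunks and feeds each chunk to its own blastn process, which reads it through -query -. The --recstart '>' option makes every chunk begin at the start of a fasta record, --block 1k sets the approximate chunk size, and -k keeps the output in the same order as the input. One way to convince yourself that no sequence is lost or cut in half is to replace blastn with a command that simply counts records. The sketch below does this with grep. The helper name `chunk_total` is made up for this example.

```shell
# Sum the per-chunk record counts; grep stands in for blastn here.
chunk_total() {
    parallel -k --block 1k --recstart '>' --pipe "grep -c '^>'" < "$1" |
        awk '{ total += $1 } END { print total + 0 }'
}

# Compare the counts only if GNU parallel and the query file are available.
if parallel --version 2>/dev/null | grep -q 'GNU parallel' && [ -f sequences.fasta ]; then
    echo "records in file:       $(grep -c '^>' sequences.fasta)"
    echo "records across chunks: $(chunk_total sequences.fasta)"
fi
```

The two numbers should match. A larger --block value gives fewer, bigger chunks, which reduces the overhead of starting blastn many times.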

More options and getting help

To see the options and parameters accepted by blastn and the other blast+ utilities, use the -help option (with a single dash) and pipe the output into less, for example:

blastn -help | less

NCBI blast tools cover more cases than DNA against DNA searches. For example, you can search a protein database with either DNA or protein sequences. Here is an exhaustive list of the programs that come with the blast+ distribution:

blastdb_aliastool
blastdbcheck
blastdbcmd
blast_formatter
blastn
blastp
blastx
convert2blastmask
deltablast
dustmasker
legacy_blast.pl
makeblastdb
makembindex
makeprofiledb
psiblast
rpsblast
rpstblastn
segmasker
tblastn
tblastx
update_blastdb.pl
windowmasker
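Among these, the five main search programs differ in the type of query and database they expect: blastn compares nucleotide to nucleotide, blastp protein to protein, blastx a translated nucleotide query to a protein database, tblastn a protein query to a translated nucleotide database, and tblastx translated nucleotide to translated nucleotide. As a rough guide, this choice can be sketched as a small shell function. The function `pick_blast_program` is a made-up helper for this tutorial, not part of blast+.

```shell
# Print the blast+ search program for a query type and a database type
# ("nucl" or "prot"). Pass "translated" as a third argument to compare
# nucleotide sequences at the protein level with tblastx.
pick_blast_program() {
    case "$1:$2:${3:-}" in
        nucl:nucl:translated) echo tblastx ;;
        nucl:nucl:*)          echo blastn ;;
        prot:prot:*)          echo blastp ;;
        nucl:prot:*)          echo blastx ;;
        prot:nucl:*)          echo tblastn ;;
        *) echo "unknown combination: $1 vs $2" >&2; return 1 ;;
    esac
}

pick_blast_program nucl prot   # prints blastx
pick_blast_program prot nucl   # prints tblastn
```

Remember that the database type given to makeblastdb with -dbtype (nucl or prot) must match what the chosen program expects.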

References

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

Licence

Creative Commons Licence
NCBI blast tutorial by Eric Normandeau is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://github.com/enormandeau/ncbi_blast_tutorial.