read_genbank

Biopiece: read_genbank

Description

read_genbank reads entries from Genbank files. A Genbank entry consists of three main parts:

Generic info (such as accession number, species, references, taxonomy, version, etc.)
Feature table (containing info on features within each Genbank entry)
Sequence

By default, read_genbank reads all of this information, but it is possible to specify which parts of the Generic info, Feature table, and Sequence are read, which results in great speed improvements.
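The three-part layout described above can be illustrated with a minimal Python sketch. This is not how read_genbank itself is implemented; it only assumes the standard flat-file markers (a FEATURES header line, an ORIGIN line, and a closing //). The function name split_genbank_entry and the toy entry are hypothetical and made up for illustration.

```python
def split_genbank_entry(entry: str) -> dict:
    """Split one Genbank flat-file entry into its three main parts."""
    generic, features, sequence = [], [], []
    part = generic  # everything before FEATURES is Generic info
    for line in entry.splitlines():
        if line.startswith("FEATURES"):
            part = features  # skip the header line, collect the Feature table
            continue
        if line.startswith("ORIGIN"):
            part = sequence  # skip the ORIGIN line, collect sequence lines
            continue
        if line.startswith("//"):
            break  # end of entry
        part.append(line)
    # Sequence lines carry a position number and spaces; keep letters only.
    seq = "".join(ch for line in sequence for ch in line if ch.isalpha())
    return {"generic": generic, "features": features, "sequence": seq}


# Hypothetical toy entry, not a real Genbank record.
SAMPLE = """\
LOCUS       TOY0001                   20 bp    DNA
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     source          1..20
                     /organism="Example organism"
     CDS             complement(3..11)
                     /gene="toyA"
ORIGIN
        1 acgtacgtac gtacgtacgt
//
"""

parts = split_genbank_entry(SAMPLE)
print(parts["generic"])
print(len(parts["features"]))
print(parts["sequence"])
```

Because the Feature table and Sequence are separate blocks, a parser that is only asked for some of them can skip the rest, which is the source of the speed improvement mentioned above.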

A Biopiece record is output per feature from the Feature table. The sequence for each feature is included.

Based on the Location of each feature, the S_BEG, S_END, and STRAND keys are added to the Biopiece record.
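As a rough illustration of how such keys could be derived from a Location string, here is a minimal Python sketch. It is not read_genbank's actual code and rests on several assumptions: coordinates are kept 1-based as written in the Genbank file, a multi-part location such as join(...) is collapsed to its outermost positions, and STRAND is reported as "+" or "-". The function name location_to_keys is hypothetical.

```python
import re


def location_to_keys(location: str) -> dict:
    """Derive S_BEG, S_END and STRAND from a Genbank location string.

    Assumptions (not confirmed by read_genbank): positions stay 1-based,
    joined spans collapse to min/max, and strand is "+" or "-".
    Limitation: remote references that embed an accession number
    (digits before a colon) are not handled by this sketch.
    """
    # Every integer in a simple location is a coordinate.
    positions = [int(p) for p in re.findall(r"\d+", location)]
    if not positions:
        raise ValueError(f"no coordinates in location: {location!r}")
    strand = "-" if "complement" in location else "+"
    return {"S_BEG": min(positions), "S_END": max(positions), "STRAND": strand}


print(location_to_keys("687..3158"))
print(location_to_keys("complement(300..437)"))
print(location_to_keys("join(1..50,100..200)"))
```

Collapsing a join to one outer span loses the individual exon boundaries, so a real parser may well keep more detail than this sketch does.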

For each feature, the qualifiers are separated by a semicolon per qualifier.

The Genbank format is notoriously evil to parse, and read_genbank makes a couple of compromises in order to focus on parsing information from the Feature table. For example, the parsing of references from the Generic info section is crude.

Usage

read_genbank [options] -i <Genbank file(s)>

Options

[-?          | --help]               #  Print full usage description.
[-i <files!> | --data_in=<files!>]   #  Comma separated list of files or glob expression to read.
[-n <uint>   | --num=<uint>]         #  Limit number of records to read.
[-k <list>   | --keys=<list>]        #  Match a subset of record keys only.
[-f <list>   | --features=<list>]    #  Match a subset of features only.
[-q <list>   | --qualifiers=<list>]  #  Match a subset of qualifiers only.
[-I <file!>  | --stream_in=<file!>]  #  Read input stream from file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output stream to file  -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

Consider the following sample record:

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

If we have the sample record in the file test.gb, reading in the entire entry with read_genbank yields:

read_genbank -i test.gb

<6 records output - one for each feature>

To limit parsing of the Generic info section specify which keys you are interested in with the -k switch (using the first two letters in upper case):

read_genbank -i test.gb -k AC

<6 records output - one for each feature>

To limit parsing of the Feature table specify which features you are interested in with the -f switch:

read_genbank -i test.gb -k AC -f CDS

<3 records output - one for each CDS feature>

To limit parsing of the qualifiers specify which to parse using the -q switch:

read_genbank -i test.gb -k AC -f CDS -q translation

<3 records output - one for each CDS feature>

Author

[email protected]

December 2010

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

read_genbank is part of the Biopieces framework.

http://www.biopieces.org
