read_genbank
read_genbank reads in entries from Genbank files. A Genbank entry consists of three main parts:
- Generic info (such as accession number, species, references, taxonomy, version, etc.)
- Feature table (containing info on features within each Genbank entry)
- Sequence
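The three parts listed above can be told apart by the FEATURES and ORIGIN header lines of the Genbank flat file format, with `//` closing the entry. The following is a minimal Python sketch of that split, not the code read_genbank itself uses; the function name `split_genbank_entry` and the toy entry are made up for illustration.

```python
def split_genbank_entry(entry: str) -> dict:
    """Split one Genbank entry into generic info, feature table, and sequence lines."""
    parts = {"info": [], "features": [], "sequence": []}
    section = "info"  # everything before FEATURES is generic info
    for line in entry.splitlines():
        if line.startswith("FEATURES"):
            section = "features"  # header line of the feature table
            continue
        if line.startswith("ORIGIN"):
            section = "sequence"  # header line of the sequence block
            continue
        if line.startswith("//"):
            break  # end-of-entry marker
        parts[section].append(line)
    return parts


# Hypothetical toy entry, not the NCBI sample record.
entry = """\
LOCUS       TEST                  12 bp    DNA     linear   UNA 01-JAN-2000
ACCESSION   X00001
FEATURES             Location/Qualifiers
     CDS             1..12
                     /gene="abc"
ORIGIN
        1 atgaaataat ga
//
"""
parts = split_genbank_entry(entry)
print(len(parts["info"]), len(parts["features"]), len(parts["sequence"]))
```

Skipping a whole section in a loop like this is cheap, which is one plausible reason why restricting what is parsed can speed things up.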
By default read_genbank reads all of this information, but it is possible to specify which parts of the Generic info, Feature table, and Sequence are read, which greatly improves speed.
A Biopiece record is output per feature from the Feature table. The sequence for each feature is included.
Based on the Location of each feature, S_BEG, S_END, and STRAND keys are added to the Biopiece record.
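To illustrate how a feature Location could map to these keys, here is a rough Python sketch. It is an assumption-laden illustration, not the logic read_genbank implements: the function name `location_to_keys` is hypothetical, the Genbank 1-based coordinates are kept unchanged (the real tool may use a different convention, for example 0-based), and locations with remote accession references or mixed strands are not handled.

```python
import re


def location_to_keys(location: str) -> dict:
    """Derive S_BEG, S_END and STRAND from a simple Genbank location string."""
    # Assumption: any use of complement() puts the whole feature on the minus strand.
    strand = "-" if "complement(" in location else "+"
    # Collect every coordinate; the partial markers '<' and '>' are non-digits and drop out.
    positions = [int(p) for p in re.findall(r"\d+", location)]
    return {"S_BEG": min(positions), "S_END": max(positions), "STRAND": strand}


# Illustrative location strings in Genbank syntax.
for loc in ["687..3158", "complement(3300..4037)", "join(12..78,134..202)", "<1..206"]:
    print(loc, location_to_keys(loc))
```

For a `join(...)` location the sketch reports the outermost coordinates only, so the individual segments are lost.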
For each feature, the values of each qualifier are separated by semicolons.
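One reading of this rule is that a qualifier occurring several times within a feature ends up as a single key whose values are joined with semicolons. The Python sketch below shows that interpretation only; it is an assumption, not read_genbank's actual code. The function name `collect_qualifiers` and the sample qualifier lines are hypothetical, and multi-line values and flag qualifiers without `=` are ignored.

```python
import re
from collections import defaultdict


def collect_qualifiers(lines):
    """Join repeated qualifier values with ';' so each qualifier becomes one key."""
    values = defaultdict(list)
    for line in lines:
        match = re.match(r'\s*/(\w+)=(.*)', line)
        if match:
            key, value = match.groups()
            values[key].append(value.strip('"'))  # drop the surrounding quotes
    return {key: ";".join(vals) for key, vals in values.items()}


# Hypothetical qualifier lines from one feature.
feature_lines = [
    '/gene="abc"',
    '/db_xref="GI:1"',
    '/db_xref="GeneID:2"',
]
record = collect_qualifiers(feature_lines)
print(record)
```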
The Genbank format is notoriously evil to parse and read_genbank uses a couple of compromises in order to focus on parsing information from the Feature table. E.g. the parsing of references from the Generic info section is crude.
read_genbank [options] -i <Genbank file(s)>
[-? | --help] # Print full usage description.
[-i <files!> | --data_in=<files!>] # Comma separated list of files or glob expression to read.
[-n <uint> | --num=<uint>] # Limit number of records to read.
[-k <list> | --keys=<list>] # Match a subset of record keys only.
[-f <list> | --features=<list>] # Match a subset of features only.
[-q <list> | --qualifiers=<list>] # Match a subset of qualifiers only.
[-I <file!> | --stream_in=<file!>] # Read input stream from file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output stream to file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following sample record:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
If we have the sample record in the file test.gb, reading in the entire entry with read_genbank yields:
read_genbank -i test.gb
<6 records output - one for each feature>
To limit parsing of the Generic info section, specify which keys you are interested in with the -k switch (using the first two letters in upper case):
read_genbank -i test.gb -k AC
<6 records output - one for each feature>
To limit parsing of the Feature table, specify which features you are interested in with the -f switch:
read_genbank -i test.gb -k AC -f CDS
<3 records output - one for each CDS feature>
To limit parsing of the qualifiers, specify which to parse using the -q switch:
read_genbank -i test.gb -k AC -f CDS -q translation
<3 records output - one for each CDS feature>
Martin Asser Hansen - Copyright (C) - All rights reserved.
December 2010
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
read_genbank is part of the Biopieces framework.