format_genome
format_genome uses sequences in the stream to create genome indexes for specified tools.
The indexes are kept on disk in the genome directory ~/BP_DATA/genomes/.
The resulting directory tree of format_genome is:
BP_DATA/
+-- genomes/
    +-- ce4/
    |   +-- blast/
    |   +-- bowtie/
    |   +-- bwa/
    |   +-- fasta/
    +-- dm3/
    |   +-- blast/
    |   +-- bowtie/
    |   +-- fasta/
    |   +-- phastcons/
    |   +-- vmatch/
    +-- hg18/
    |   +-- blast/
    |   +-- fasta/
    |   +-- phastcons/
    |   +-- vmatch/
    +-- mm9/
        +-- blast/
        +-- fasta/
        +-- vmatch/
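To see which indexes already exist, the tree can be walked with a few lines of shell. This is a minimal sketch, not part of Biopieces: the function name list_genome_indexes is hypothetical, and it assumes the <genomes_dir>/<genome>/<format>/ layout shown above.

```shell
#!/bin/sh
# Hypothetical helper (not a Biopieces tool): print one line per genome
# with the index format directories found for it.
# Assumes the layout <genomes_dir>/<genome>/<format>/.
list_genome_indexes() {
    genomes_dir=$1
    for genome_path in "$genomes_dir"/*/; do
        [ -d "$genome_path" ] || continue      # glob matched nothing
        genome=$(basename "$genome_path")
        formats=""
        for format_path in "$genome_path"*/; do
            [ -d "$format_path" ] || continue  # genome has no format directories
            formats="$formats $(basename "$format_path")"
        done
        printf '%s:%s\n' "$genome" "$formats"
    done
}
```

Called as list_genome_indexes "$BP_DATA/genomes", it would print a line such as "ce4: blast bowtie bwa fasta" for the ce4 entry in the tree above.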
For large genomes, prebuilt indexes can be downloaded from the bowtie website, which saves the time needed to build them:
ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/
To install a bowtie index from the FTP site, e.g. for the human genome, download the index, create the target directory, unzip the files into it, and finally rename the files so they carry the genome name:
wget ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/h_sapiens.ebwt.zip
mkdir -p $BP_DATA/genomes/hg18/bowtie
unzip h_sapiens.ebwt.zip -d $BP_DATA/genomes/hg18/bowtie
cd $BP_DATA/genomes/hg18/bowtie
perl -e 'map { $old = $_; $new = $old; $new =~ s/h_sapiens/hg18/; rename( $old, $new ) } @ARGV' h_sapiens.*
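The rename step can also be done without Perl. The following is a sketch in plain shell, assuming the unzipped files all begin with the old prefix followed by a dot (bowtie index files are named like h_sapiens.1.ebwt and h_sapiens.rev.1.ebwt). The function name rename_index_prefix is hypothetical. Unlike the Perl one-liner, which replaces the first occurrence of h_sapiens anywhere in the name, this only touches a leading prefix.

```shell
#!/bin/sh
# Hypothetical helper (not a Biopieces tool): rename every file in a directory
# whose name starts with "<old>." so that it starts with "<new>." instead,
# e.g. h_sapiens.1.ebwt -> hg18.1.ebwt.
rename_index_prefix() {
    dir=$1 old=$2 new=$3
    for path in "$dir/$old".*; do
        [ -e "$path" ] || continue             # glob matched nothing
        file=$(basename "$path")
        # ${file#"$old"} strips the old prefix and keeps the rest of the name
        mv -- "$path" "$dir/$new${file#"$old"}"
    done
}
```

For the download above, the call would be rename_index_prefix "$BP_DATA/genomes/hg18/bowtie" h_sapiens hg18.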
... | format_genome [options] <-g <genome>>
[-? | --help] # Print full usage description.
[-x | --no_stream] # Do not emit records.
[-d <dir> | --dir=<dir>] # Biopiece data directory - Default=$BP_DATA
[-g <genome> | --genome=<genome>] # Name of genome.
[-f <string> | --formats=<string>] # List of formats to create (fasta,blast,vmatch,bowtie,bwa,phastcons).
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
To format a genome and create a FASTA index for fast lookups with get_genome_seq:
read_2bit -i human.2bit | format_genome -g human -f fasta -x
And to create a nucleotide BLAST index for use with blast_seq:
read_fasta -i human.fna | format_genome -g human -f blast -x
And to create the Vmatch index for use with vmatch_seq:
read_2bit -i human.2bit | format_genome -g human -f vmatch -x
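After runs like the ones above, it may be useful to confirm that the expected index directories exist. This sketch assumes that each format is written to a directory of the same name under genomes/<genome>/, as the directory tree suggests, and that formats are given as a comma-separated list. The function name missing_formats is hypothetical, and the check only looks for directories, not for valid index contents.

```shell
#!/bin/sh
# Hypothetical helper (not a Biopieces tool): print each format from a
# comma-separated list that has no directory under <genomes_dir>/<genome>/.
# Returns 1 if at least one format is missing, 0 otherwise.
missing_formats() {
    genomes_dir=$1 genome=$2 list=$3
    missing=0
    old_ifs=$IFS
    IFS=,                                      # split the list on commas
    for format in $list; do
        if [ ! -d "$genomes_dir/$genome/$format" ]; then
            printf '%s\n' "$format"
            missing=1
        fi
    done
    IFS=$old_ifs
    return "$missing"
}
```

For example, missing_formats "$BP_DATA/genomes" human fasta,blast,vmatch prints nothing and returns 0 when all three directories are present.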
See also: read_2bit
Martin Asser Hansen - Copyright (C) - All rights reserved.
August 2007
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
format_genome is part of the Biopieces framework.