Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: format_genome

Description

format_genome uses sequences in the stream to create genome indexes for specified tools. The indexes are kept on disk in the genome directory ~/BP_DATA/genomes/.

format_genome produces a directory tree such as:

BP_DATA/+-- genomes/
        +-- ce4/
        |   +-- blast/
        |   +-- bowtie/
        |   +-- bwa/
        |   +-- fasta/
        +-- dm3/
        |   +-- blast/
        |   +-- bowtie/
        |   +-- fasta/
        |   +-- phastcons/
        |   +-- vmatch/
        +-- hg18/
        |   +-- blast/
        |   +-- fasta/
        |   +-- phastcons/
        |   +-- vmatch/
        +-- mm9/
            +-- blast/
            +-- fasta/
            +-- vmatch/
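
As an illustration of this layout, the following is a minimal shell sketch that reports which index formats exist for each genome. It assumes the data_dir/genomes/genome/format/ structure shown above; the function name list_indexes is hypothetical and is not part of Biopieces (the list_genomes Biopiece is the supported way to query installed genomes).

```shell
# Print one line per genome with the index formats found on disk.
# Assumes the <data_dir>/genomes/<genome>/<format>/ layout described above.
# list_indexes is a hypothetical helper, not a Biopieces command.
list_indexes() {
    data_dir=$1
    for genome_dir in "$data_dir"/genomes/*/; do
        [ -d "$genome_dir" ] || continue          # glob matched nothing
        genome=$(basename "$genome_dir")
        formats=""
        for format_dir in "$genome_dir"*/; do
            [ -d "$format_dir" ] || continue      # genome has no indexes
            formats="$formats $(basename "$format_dir")"
        done
        printf '%s:%s\n' "$genome" "$formats"
    done
}
```

Called as list_indexes "$BP_DATA" on the tree above, the ce4 entry would be printed as "ce4: blast bowtie bwa fasta".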

For large genomes, prebuilt indexes can be downloaded from the bowtie website, which saves the time of building them:

ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/

To install a bowtie index from the FTP site, e.g. for the human genome, download the index, create the target directory, unzip the files into it, and finally rename them to match the genome name:

wget ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/h_sapiens.ebwt.zip

mkdir -p $BP_DATA/genomes/hg18/bowtie

unzip h_sapiens.ebwt.zip -d $BP_DATA/genomes/hg18/bowtie

cd $BP_DATA/genomes/hg18/bowtie

perl -e 'map { $old = $_; $new = $old; $new =~ s/h_sapiens/hg18/; rename( $old, $new ) } @ARGV' h_sapiens.*
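
The rename step can also be done without Perl. The sketch below is an assumed equivalent written in plain shell: it replaces a leading file name prefix for every matching file in a directory. The function name rename_prefix is hypothetical and only serves this example.

```shell
# Rename every file named <old>.* in a directory to <new>.*.
# rename_prefix is a hypothetical helper, not a Biopieces command.
rename_prefix() {
    dir=$1
    old=$2
    new=$3
    for path in "$dir"/"$old".*; do
        [ -e "$path" ] || continue            # glob matched nothing
        file=${path##*/}                      # strip the directory part
        mv -- "$path" "$dir/$new${file#"$old"}"
    done
}
```

For the human index above, the call would be rename_prefix "$BP_DATA/genomes/hg18/bowtie" h_sapiens hg18.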

Usage

... | format_genome [options] <-g <genome>>

Options

[-?          | --help]               #  Print full usage description.
[-x          | --no_stream]          #  Do not emit records.
[-d <dir>    | --dir=<dir>]          #  Biopiece data directory      -  Default=$BP_DATA
[-g <genome> | --genome=<genome>]    #  Name of genome.
[-f <string> | --formats=<string>]   #  List of formats to create (fasta,blast,vmatch,bowtie,bwa,phastcons).
[-I <file!>  | --stream_in=<file!>]  #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

To format a genome and create a FASTA index for fast lookups with get_genome_seq:

read_2bit -i human.2bit | format_genome -g human -f fasta -x

And to create a nucleotide BLAST index for use with blast_seq:

read_fasta -i human.fna | format_genome -g human -f blast -x

And to create the Vmatch index for use with vmatch_seq:

read_2bit -i human.2bit | format_genome -g human -f vmatch -x

See also

read_fasta

read_2bit

list_genomes

get_genome_seq

blast_seq

vmatch_seq

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

August 2007

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

format_genome is part of the Biopieces framework.

http://www.biopieces.org
