Adam Price edited this page Jan 17, 2017 · 12 revisions

Simulome

October 21, 2016
Version: 1.0.1
Author: Adam Price
Maintainer: Adam Price
Contact: [email protected]
Copyright: Adam Price, 2016
License: MIT

Simulome is a powerful, easy-to-use tool for creating pseudo-genomes and mutated variants for prokaryotes. Simulome makes it possible to create genomes based on any prokaryotic species while controlling for a variety of factors. It provides a range of options that can be combined to create mutated variants of the simulated genome, which allows for controlled testing of specific genomic conditions. Simulome can be used with reads generated by next-generation sequencing (NGS) platforms or, alternatively, with NGS read simulation packages.

Simulome takes an existing genome and the corresponding annotation information for that genome and samples a subset of the genes to use as a simulated genome. Sampling is based on gene length, and genes are selected to approximate a normal distribution of gene lengths. That is, the mean and standard deviation of the lengths of all genes in the provided reference genome are calculated, and genes are then sampled such that the mean and standard deviation of the simulated genome approximate those of the provided genome. An initial simulation is created by combining these sampled genes with non-duplicating intergenic regions, or with intergenic regions randomly sampled from the provided reference genome. Once the initial genome is simulated, a variant genome can be simulated to meet desired specifications. Alternatively, users can skip the pseudo-genome step and apply Simulome's variant tools directly to the provided reference genome to create a mutated genome. Four run modes are available and can be used in any combination to produce variants containing SNPs, synonymous/nonsynonymous mutations, indels, and/or duplicate regions. Additional optional arguments allow direct control over selection criteria and genomic structure. Each resulting simulation is provided as a FASTA nucleotide file, a GTF/GFF3 annotation file, and a variant metadata file.
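The length-based sampling described above can be illustrated with a minimal sketch. This is not Simulome's actual implementation: the function name `sample_genes_by_length`, the `gene_lengths` mapping, and the nearest-length selection strategy are all assumptions made for illustration. The idea is to draw target lengths from a normal distribution fitted to the reference gene lengths and to pick, for each target, the unused gene whose length is closest.

```python
import random
import statistics


def sample_genes_by_length(gene_lengths, n_genes, seed=None):
    """Pick n_genes gene IDs whose lengths roughly follow the reference
    mean and standard deviation (hypothetical helper, not Simulome's API).

    gene_lengths: dict mapping gene ID -> gene length in bases.
    """
    rng = random.Random(seed)
    lengths = list(gene_lengths.values())
    mean = statistics.mean(lengths)
    sd = statistics.pstdev(lengths)

    remaining = dict(gene_lengths)
    chosen = []
    for _ in range(min(n_genes, len(remaining))):
        # Draw a target length from the fitted normal distribution.
        target = rng.gauss(mean, sd)
        # Take the unused gene closest to the target; ties break on gene ID.
        gene_id = min(remaining, key=lambda g: (abs(remaining[g] - target), g))
        chosen.append(gene_id)
        del remaining[gene_id]  # sample without replacement
    return chosen
```

Calling `sample_genes_by_length(lengths, 10, seed=1)` returns ten distinct gene IDs; fixing `seed` makes the selection reproducible. In practice the gene lengths would come from the annotation file supplied with the reference genome.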

Each run mode can be configured either to introduce an exact number of mutations in each gene or to draw the number of mutations per gene from a Gaussian distribution with a user-defined mean and standard deviation. Intergenic regions can be produced in two ways: users can request randomly generated intergenic regions, or they can use actual intergenic sequence data from the provided reference genome. Random sequences are generated such that each of the four bases has a 25% chance of being selected at any given position. When actual intergenic regions are used, all intergenic regions of the provided reference genome are extracted and segments are randomly sampled from them as needed. In both cases, users can specify the intergenic length or allow randomly sized intergenic regions to be simulated.
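The two mechanisms described above, uniform random intergenic sequence and Gaussian-distributed mutation counts, can be sketched as follows. The function names and parameters are hypothetical and do not reflect Simulome's internal code or command-line options. The sketch also assumes that a negative or oversized draw from the Gaussian is clamped to the range from 0 to the gene length, which the source does not specify.

```python
import random

BASES = "ACGT"


def random_intergenic(length, rng):
    """Random sequence in which each base has a 25% chance per position."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutation_count(mean, sd, gene_length, rng):
    """Draw a per-gene mutation count from a Gaussian distribution.

    Assumption: the draw is rounded and clamped to [0, gene_length].
    """
    n = int(round(rng.gauss(mean, sd)))
    return max(0, min(n, gene_length))


def apply_snps(seq, n_snps, rng):
    """Substitute n_snps distinct positions with a different base."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_snps):
        # Exclude the current base so every chosen position really changes.
        bases[pos] = rng.choice([b for b in BASES if b != bases[pos]])
    return "".join(bases)
```

A typical use would create one `random.Random(seed)` instance, pass it to all three functions so a run is reproducible, and call `apply_snps(gene, mutation_count(mean, sd, len(gene), rng), rng)` for each gene.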

Simulome can be used in combination with read simulators such as ART to create completely controlled simulations.

For example, the effect of SNPs on the alignment of reads of various lengths can be simulated, as shown in the above plot. The data in the plot were simulated using Simulome and ART, and show how read alignment performs against a correct (Native) or mutated (Heterologous) genome.

Dependencies

Simulome was developed in a Linux/Unix environment and requires the following software to function properly.
• Python 2.7.2
• Biopython 1.6.1+
• BLAST 2.3.0+

For detailed usage instructions, please see the Simulome manual.
