3.7. Outputting the Simulation Results
The sampling schedule lists one or more samplers, each of which extracts a particular type of information from the simulation and writes it to a file.
An example of a sampling schedule that defines three samplers:
<samplingSchedule>
    <!-- Sample 100 sequences from the population, every 1000 generations,
         in NEXUS format. -->
    <sampler>
        <atFrequency>1000</atFrequency>
        <fileName>alignment_%r.nex</fileName>
        <alignment>
            <sampleSize>100</sampleSize>
            <format>NEXUS</format>
            <label>seq_%g_%s</label>
        </alignment>
    </sampler>
    <!-- Sample the frequency of all 20 states at two amino acid sites,
         every 1000 generations. -->
    <sampler>
        <atFrequency>1000</atFrequency>
        <fileName>aa_frequencies.csv</fileName>
        <alleleFrequency>
            <feature>protein AB</feature>
            <sites>1,2</sites>
        </alleleFrequency>
    </sampler>
    <!-- Sample population statistics, every generation -->
    <sampler>
        <atFrequency>1</atFrequency>
        <fileName>stats.csv</fileName>
        <statistics />
    </sampler>
</samplingSchedule>
The following properties are common to every sampler:
- <atFrequency>: run the sampler at a fixed frequency, once every given number of generations
- <atGeneration>: run the sampler once, at a specific generation
- <fileName>: the file to which the sampler writes its results. The special string '%r' is replaced with the current replicate, so that replicates do not all write to the same file, each one erasing the results of the previous run.
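As an illustration of these common properties, the fragment below sketches a sampler that runs once, using <atGeneration> instead of <atFrequency>, and writes one file per replicate through '%r'. The generation number and the file name are placeholder values chosen for this example, not values taken from the manual.

```xml
<!-- Sketch only: 5000 and final_stats_%r.csv are assumed example values. -->
<sampler>
    <!-- Run once, at generation 5000, rather than repeatedly. -->
    <atGeneration>5000</atGeneration>
    <!-- %r expands to the replicate, so each replicate gets its own file. -->
    <fileName>final_stats_%r.csv</fileName>
    <statistics />
</sampler>
```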
The alignment sampler samples whole genome alignments from the population at a given generation. It has the following properties:
- <sampleSize>: the number of genomes to be sampled
- <format>: the format in which the alignments are stored:
  - NEXUS: NEXUS format
  - FASTA: FASTA format
  - XML: a custom XML format
- <label>: the label for each sequence. The special strings '%r', '%g', and '%s' are replaced with the replicate index, the generation number, and the index of the sequence within the sample, respectively.
- <consensus> (true or false): whether a consensus sequence should be synthesized and stored, rather than the full alignment (false by default).
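The properties above might be combined as in the following sketch, which stores a consensus sequence in FASTA format instead of the full alignment. The frequency, sample size, file name, and label pattern are assumed example values.

```xml
<!-- Sketch only: all numeric values and names below are assumed examples. -->
<sampler>
    <atFrequency>500</atFrequency>
    <fileName>consensus_%r.fasta</fileName>
    <alignment>
        <!-- Number of genomes drawn from the population for each sample. -->
        <sampleSize>50</sampleSize>
        <format>FASTA</format>
        <!-- %g is the generation number, %s the index within the sample. -->
        <label>cons_%g_%s</label>
        <!-- Store a synthesized consensus rather than the full alignment. -->
        <consensus>true</consensus>
    </alignment>
</sampler>
```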
The statistics sampler dumps some common population genetic statistics:
- mean_diversity: mean nucleotide sequence diversity (estimated from a sample of 10 random sequences)
- max_diversity: maximum nucleotide sequence diversity (estimated from a sample of 10 random sequences)
- min_fitness: fitness of individual with lowest fitness
- mean_fitness: mean fitness of population
- max_fitness: fitness of individual with highest fitness
- max_frequency: frequency of most common genome in the population
- mean_distance: mean sequence distance of population from initial population (ignoring mutation saturation, thus an overestimate)
The statistics sampler does not require any configuration.
The allele frequency sampler (<alleleFrequency>) outputs the frequency of each possible state at each given site in a nucleotide or amino acid feature.
By default, the feature is the genome feature (nucleotides of the entire genome), and the sites are all sites in the feature.
This may be overridden by:
- <feature>: the name of one of the features defined in the <genome> description. If omitted, genome is assumed.
- <sites>: a comma-separated list of single sites or site ranges within the feature. Note that if the feature is an amino acid feature, this refers to amino acid sites, while if the feature is a nucleotide feature, this refers to nucleotide sites.
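To show the default in action, the sketch below omits <feature>, so the sites refer to nucleotide positions in the genome feature. The frequency, file name, and site numbers are assumed example values. Only single sites are listed, since the exact syntax for site ranges is not spelled out here.

```xml
<!-- Sketch only: frequency, file name and site numbers are assumed examples. -->
<sampler>
    <atFrequency>1000</atFrequency>
    <fileName>nt_frequencies_%r.csv</fileName>
    <alleleFrequency>
        <!-- No <feature> element: the genome (nucleotide) feature is assumed. -->
        <!-- These are therefore nucleotide sites, not amino acid sites. -->
        <sites>1,2,3</sites>
    </alleleFrequency>
</sampler>
```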
In addition to sampling sequences, the genealogies that gave rise to those samples may also be sampled. For example, the configuration below produces a NEXUS format ancestral tree for 10 random viruses selected from the population every 100 generations.
As with the <alignment> sampler, a random subset of viruses is selected from the population each time the sampler is run. It is possible to get a complete picture of the branching process if the tree sample size matches the population size. The resulting trees may be viewed with FigTree or other tree visualization software.
<sampler>
    <atFrequency>100</atFrequency>
    <fileName>santa_out.trees</fileName>
    <tree>
        <sampleSize>10</sampleSize>
        <format>NEXUS</format>
        <label>sequence_%s</label>
    </tree>
</sampler>
The <tree> sampler has the following properties:
- <sampleSize>: the number of leaves in the sampled trees.
- <format>: the format of the genealogy trees to be produced:
  - NEXUS: NEXUS format
  - NEWICK: NEWICK format
- <label>: the labels associated with the leaves of each tree.
As with <label> elements in alignment samplers, the strings '%r', '%g', and '%s' are replaced with the replicate index, the generation number, and the index of the sequence within the sample, respectively. The label given here must produce names that are unique across all sampled taxa but consistent across samples. For example, a value of name would not work because all taxa would have the same name. sequence_%s_%g also would not work because the taxon names would change across samples. A value of sequence_%s satisfies all requirements and works well. An incorrect value here results in a Java exception thrown from deep in the jebl.jar library.
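A variant of the tree sampler that writes NEWICK output, with a label that meets the uniqueness and consistency requirements described above, could look like the sketch below. The frequency, sample size, and file name are assumed example values.

```xml
<!-- Sketch only: frequency, sample size and file name are assumed examples. -->
<sampler>
    <atFrequency>200</atFrequency>
    <fileName>santa_out_%r.nwk</fileName>
    <tree>
        <sampleSize>25</sampleSize>
        <format>NEWICK</format>
        <!-- %s alone: unique per taxon and unchanged from sample to sample. -->
        <label>sequence_%s</label>
    </tree>
</sampler>
```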