
# The gstudio Library


The **gstudio** package was created to make it easy to include marker-based population genetic data in the R workflow.  An underlying motivation for this package is to provide a link between spatial analysis and graphing packages so that the user can quickly and easily manipulate data in exploratory ways that aid biological inference.


## Installing the package

This package requires several other packages for installation.  By default, the install should be easily accomplished using the built-in functionality in R.

```{r, eval=FALSE,echo=TRUE}
install.packages("gstudio")
```

Occasionally, you should check for updates to the package by doing the following (this will update all packages you have installed):

```{r, eval=FALSE}
update.packages( ask=FALSE )
```


If you want the most recent version of this package, I make the development builds available on Github (http://github.com/dyerlab/).  You can install directly from within `R` as:

```{r,eval=FALSE}
install.packages("devtools")  # the package name must be a quoted string
library(devtools)
install_github("dyerlab/gstudio")
```

I recommend using the latest version; it has many of the newer features, and I do not check it into GitHub until it has been tested.  I only post to CRAN when major versions change.


### Loading the Package

Any time you need to use the package, just pull it into your session.

```{r,message=FALSE}
library(gstudio)
```

This should get you everything you need. The **gstudio** package contains a lot of built-in documentation, including many examples.  All the functions and the examples associated with them can be found in the built-in documents available from any of the following (n.b., the vignette is a simple placeholder with links to the website).

```{r, eval=FALSE}
help(package="gstudio")
vignette('gstudio')
browseURL("http://dyerlab.github.io/gstudio/")  # opens the default browser on any platform
```

At any point, if you have questions about the values or options for a particular function in **gstudio** or any other package, you can use the `help.search("func")` or `?` functionality. This file is also kept in sync with the development of the **gstudio** package (it is in the source package that you downloaded from CRAN) and will serve as a tutorial for your use of this package.  If you have any questions regarding this package, feel free to contact [Rodney Dyer](mailto:rjdyer@vcu.edu) and I will get back to you as soon as possible.


## Genetic Data 

The overriding philosophy behind the **gstudio** package is to make it as easy as possible to create, load, use, and integrate genetic marker data into your analysis workflow.  As such, we typically use *data.frame* objects to hold our data, and the addition of the *locus* class as a fundamental data type allows us to continue to do so.

```{r}
x <- locus( c(1,2) )
class(x)
x
```


You can think of a *locus* object as a vector of alleles.  There are several options you can use when constructing a *locus* object based upon what kind of marker data you are using.  These options are passed through the *type* option to the `locus()` function.  Here are the current options.

`type` Option | Function
--- | ---
(none)        | This is the default value (e.g., nothing passed). It will treat the values passed to `locus()` as alleles at a single locus.
`snp`         | Alleles are '0', '1', or '2' indicating the number of minor alleles.
`zyme`        | Genotypes are encoded as "12" like zymes (e.g., "1" & "2" alleles together).
`separated`   | Alleles are already separated by ":" character (for putting in polyploid data).
`column`      | Alleles for a single locus are in two separate columns.

Here are some examples.

```{r}
loc0 <- locus( )
loc1 <- locus( 1 )
loc2 <- locus( c("C","A"), phased=TRUE )
loc3 <- locus( c("C","A") )
loc4 <- locus( 12, type="zyme" )
loc5 <- locus( 1:4 )
loc6 <- locus( "A:C:C:G", type="separated" )
loci <- c(loc0, loc1, loc2, loc3, loc4, loc5, loc6 )
loci
```

Notice how the printing of each *locus* object uses the colon character to separate alleles.  Also, since the *locus* object is a basic data type, it can be used in other data structures.

```{r}
df <- data.frame( ID=0:6, Loci=loci )
df
```

And you can use normal vector operations on the locus vector, just as you would with other R data.

```{r}
is.na( df$Loci )
is_heterozygote( df$Loci )
```


### Importing Data

Importing data is not that difficult.  People tend to keep their data either in spreadsheets or text files, which are easily accessible via R, or in some arcane program format (wave to Arlequin everyone), which may be less accessible.  As of version 1.1, **gstudio** handles data either in raw format (e.g., as they appear in your spreadsheet) or in GENEPOP format.  Both formats are read in **gstudio** using the function `read_population()`, which is a mix between the traditional `read.table()` function and the `locus()` function.

#### Import from GENEPOP files

The GENEPOP file format, as interpreted by **gstudio** has the following format.  

1. It is a text file, preferably space delimited.
2. The first line has some info for you; it is mostly ignored but should include a bit about what the data are.
3. The next $K$ lines of the file list the names of the loci to be used, one per line.
4. The rest of the file contains populations.  Each population starts with "Pop" alone on its own line.  All individuals are assumed to be in the same population until the next "Pop" is reached.
5. Each individual starts with an identification value, followed by a comma and then the genotypes.
6. All genotypes are encoded using 3 digits for each allele (e.g., a 3:5 genotype is 003005).  Missing data are all zeros ('000000'), and a haploid genotype is just three digits.
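
To make the three-digit encoding concrete, here is a minimal base R sketch that decodes GENEPOP-style genotype strings into the colon-separated form used when *locus* objects are printed.  The helper `decode_genepop()` is hypothetical and not part of **gstudio**; in practice `read_population()` does this work for you.

```{r}
# Hypothetical helper (not part of gstudio): decode genotypes that use
# three digits per allele into colon-separated alleles.
decode_genepop <- function(x) {
  vapply(x, function(g) {
    # cut the string into consecutive 3-character alleles
    starts <- seq(1, nchar(g), by = 3)
    alleles <- as.integer(substring(g, starts, starts + 2))
    # a genotype of all zeros is missing data
    if (all(alleles == 0)) return(NA_character_)
    paste(alleles, collapse = ":")
  }, character(1), USE.NAMES = FALSE)
}

# a diploid 3:5 genotype, a missing genotype, and a haploid genotype
decode_genepop(c("003005", "000000", "012"))
```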

The import from `read_population()` names loci, adds "ID" for the identification column, and adds to each population a "Pop-X" designation.  Other than that, it is identical to what is below.  However, if you have your data in a spreadsheet, there is no need to shove it into GENEPOP format to import into R.

#### Text File Import

Options for `read_population()` include:

Option        | Function
--- | ---
path          | The full path to the text file
type          | The locus *type* (see above).
locus.columns | A vector of the column numbers to be treated as *locus* objects
sep           | The character used to separate columns (',' is default)
header        | Columns have header names (e.g., locus names, etc.)

Here are some examples of data files with different kinds of genetic data, each of which exercises the `read_population()` function in a different way.  Hopefully this covers the main types of data being imported; if not, drop me an email ([Rodney Dyer](mailto:rjdyer@vcu.edu)).  Missing genotypes should be left blank or encoded as *NA*.  There is no reason to use negative numbers or other conventions.  

Columns of genotypes are indicated by the required parameter *locus.columns* so that `read_population()` knows which columns to treat as *locus* objects and which to leave as normal data for R.  Without this parameter, the data will be read in as *character* or *numeric*.

There are some example data files included in the package for you to look at. Depending upon how your computer is set up, they may be placed in different locations.  Here is a quick way to find out where the **gstudio** package is installed and the location of the 'extdata' folder within it.

```{r}
system.file("extdata",package="gstudio")
```


#### Two Column Data

Here is an example of data where the genotypes are encoded as two columns of data in a csv file.  

```{r}
file <- system.file("extdata","data_2_column.csv",package="gstudio")
data <- read_population(file,type="column", locus.columns=4:7)
data
```


#### Phased Data

There are times when the gametic phase of the genotypes is important.  By default, **gstudio** will keep alleles sorted in alpha/numeric order.  If you need to keep this from happening, pass the optional *phased* option to `read_population()`.  Notice the differences between this and the previous genotypes.

```{r}
file <- system.file("extdata","data_2_column.csv",package="gstudio")
data <- read_population(file,type="column", locus.columns=4:7, phased=TRUE)
data
```


#### AFLP-like data

Genotypes that are 'aflp'-like are encoded as binary characters (e.g., 0/1) indicating the presence or absence of a particular band.  

```{r}
file <- system.file("extdata","data_aflp.csv",package="gstudio")
data <- read_population(file,type="aflp", locus.columns=c(4,5))
data
```


#### SNP Minor Allele Data

At times, SNP data is encoded in relation to the number of minor alleles.  You can import these data using the *type="snp"* option and it will encode them as 'AA', 'AB', or 'BB' with the 'B' allele as the minor one.

```{r}
file <- system.file("extdata","data_snp.csv",package="gstudio")
data <- read_population(file,type="snp", locus.columns=4:7)
data
```

#### Zyme-Like Data

Some data is encoded as allozyme genotypes (e.g., 33, 35, 55 for diploid individuals with alleles '3' and '5').

```{r}
file <- system.file("extdata","data_zymelike.csv",package="gstudio")
data <- read_population(file,type="zyme", locus.columns=4:7)
data
```


#### Pre-Separated Data For Higher Ploidy

```{r}
file <- system.file("extdata","data_separated.csv",package="gstudio")
data <- read_population(file,type="separated", locus.columns=c(4,5))
data
```


### Saving Data 

There are several ways to export your data to file.  

#### Raw R Objects

Saving data once it is in R is trivial and you do it as you would for any other R object.  The R object system knows how to serialize its own data using the `load()` and `save()` functions.

```{r,eval=FALSE}
save(df, file="MyData.rda")
```

To load the objects back into the work space, you just do:

```{r,eval=FALSE}
load("MyData.rda")
```

And you can verify that you have data in your work space by listing it.

```{r}
ls()
```


#### Saving as Text

By default, the function `write_population()` will write your data file as a comma separated text file with the loci encoded as colon separated (see `type="separated"` above).

```{r,eval=FALSE}
write_population(df, file="~/Desktop/MyData.csv")
```


#### Saving as GENEPOP

Raw data can be saved in GENEPOP format by passing the optional argument `mode="genepop"` to the `write_population()` function.

```{r,eval=FALSE}
write_population(df, file="~/Desktop/MyData.txt",mode="genepop")
```


#### Saving for STRUCTURE

Raw data can also be exported for analysis using the program STRUCTURE.  Here the optional argument is `mode="structure"`.

```{r,eval=FALSE}
write_population(df, file="~/Desktop/MyData.str", mode="structure")
```


## The arapat Data Set

The main genetic data set included with the package is from the Sonoran desert bark beetle, *Araptus attenuatus*, from the Dyer laboratory.  You can load it into your work space by:

```{r}
data(arapat)
```

Which looks like the following:

```{r echo=FALSE}
DT::datatable(arapat, options = list(scrollX = TRUE))
```

You can see several things as you scroll through the data.  First, the locus data are displayed by genotype counts including NA values where there was missing data.  A column of type *locus* is just like any other kind of variable and can be used as such.  This opens up a lot of functionality for you to be able to treat marker data just like everything else in R.

### Convenience Functions

Dealing easily with parts of your data is a critical skill and a huge benefit in using a grammar like R to do your analyses.  In R, *data.frame* objects are almost like little databases and you can do some really creative manipulations with them. The **gstudio**  package provides a few things that may help you work more efficiently with your data.


### Data Classes

Often it is important to know which columns of a data set are actually of a particular data type.  Here is a simple function that tells you either the name or index of the columns in a data.frame that are of a specific data type.

```{r}
column_class( arapat, class="locus")
column_class( arapat, class="locus", mode="index" )
```


### Partitioning

The `partition()` function takes a data frame and returns a list of data frames, partitioned by the stratum you pass to it.  This is really nice if you are doing a nested analysis of sorts and want to work with subsets of your data that are defined by a categorical *factor* variable.

```{r}
names(arapat)
clades <- partition( arapat, stratum="Species")
names( clades )
```

This kind of partitioning is very common in the analysis of spatial genetic structure and as such should be as simple as possible to provide the most flexibility to you, the analyst.  One of the common analysis patterns that you will come back to over and over again is to partition the entire data set and perform operations on each of the subgroups.  In R this is a pretty easy process if you look into the `lapply()` function (and its relatives).  This is such an important component that I'm going to spend a little time here to make sure you understand what I am doing.  Once you get it, it will make your life tremendously more awesome!

The basic form of the various 'apply' functions is that you pass them some data and a function, and they apply that function to each part of the data.  For lists (the *l* part in `lapply()`), each entry in the list is passed to the function in turn.  The function itself can be one that is already available (like `length()` or `is.na()` or something) or it can be something you specify directly, on the fly.  Here is an example looking at the number of samples in each 'Species' as partitioned above.

```{r}
lapply( clades, dim)
```

Here the `dim()` function returns the dimensions of the data.frame for each clade; all have the same number of columns but differ in the number of individuals (the rows).  While this is a stupid example (you could get the same row counts from `table(arapat$Species)`), it shows the general pattern.  
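
To show the 'on the fly' form of the function without relying on the beetle data, here is a small base R sketch.  The data frame `toy` and its columns are made up for illustration, and `split()` stands in for `partition()`.

```{r}
# Made-up data: six individuals in two groups, one missing value
toy <- data.frame(Group = c("a", "a", "a", "b", "b", "b"),
                  Value = c(1, 2, NA, 4, 5, 6))

# split() returns a named list of data frames, one per group
parts <- split(toy, toy$Group)

# an anonymous function applied to each piece: count the non-missing values
n.obs <- lapply(parts, function(x) sum(!is.na(x$Value)))
n.obs

# sapply() works the same way but simplifies the result to a named vector
grp.means <- sapply(parts, function(x) mean(x$Value, na.rm = TRUE))
grp.means
```

The anonymous function receives one list element at a time as `x`, so anything you can do to a single data frame you can do to every partition.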


### Plotting Populations

One of the key benefits to using an analysis environment such as R is that you can mash together functionality that you just can't get from a monolithic program.  An example of this is plotting populations.  If your data include spatial coordinates, then you can use them to plot the location of your sites on a Google Maps tile.  By default, you *must* have your coordinates in decimal degree format, with west and south as **negative** decimals.  This is the default for the Google Maps API.  Moreover, if you name the columns of data "Longitude" and "Latitude", much of the spatial functionality in **gstudio** will be more transparent (if not, you have to specify the Longitude and Latitude names each time you use a function that needs them).

```{r, warning=FALSE,message=FALSE}
library(ggplot2)
library(ggrepel)
library(ggmap)
coords <- strata_coordinates( arapat )
map <- population_map(coords)
ggmap(map) + geom_point(aes(x=Longitude,y=Latitude), data=coords, size=2) + geom_label_repel(aes(x=Longitude,y=Latitude,label=Stratum),data=coords) + xlab("Longitude") + ylab("Latitude")
```
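
If your coordinates were recorded as degrees, minutes, and seconds, they need to be converted to signed decimal degrees first.  Here is a minimal sketch of that conversion; the helper `dms_to_decimal()` is hypothetical and not part of **gstudio**.

```{r}
# Hypothetical helper: degrees/minutes/seconds to signed decimal degrees
dms_to_decimal <- function(deg, min = 0, sec = 0, hemisphere = "N") {
  dd <- abs(deg) + min / 60 + sec / 3600
  # west and south become negative, as required above
  sgn <- ifelse(hemisphere %in% c("W", "S"), -1, 1)
  sgn * dd
}

dms_to_decimal(110, 30, 0, "W")   # -110.5
dms_to_decimal(24, 15, 36, "N")   # 24.26
```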


## Allele Frequencies

Allele frequencies are another kind of object made easily in **gstudio**.  They are just like every other kind of data and can be extracted from a `data.frame` containing *locus* objects using the function `frequencies()`.  

### Single Locus

Grabbing allele frequencies is a fundamental task for any population genetic analysis and should be as easy as possible.  Here are some examples of various ways to get allele frequency information using the *Araptus attenuatus* data set.


```{r}
freq.EF <- frequencies( arapat$EF )
class( freq.EF)
freq.EF
```
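
If you want to see what is being counted, here is a base R sketch of the same idea using made-up genotypes written as colon-separated strings.  It is only an illustration of the arithmetic; I am not claiming this is how `frequencies()` is implemented.

```{r}
# Made-up genotypes in the colon-separated form that locus objects print
genos <- c("A:A", "A:B", "B:B", "A:B", NA, "A:A")

# drop missing genotypes, then split each genotype into its alleles
alleles <- unlist(strsplit(genos[!is.na(genos)], ":", fixed = TRUE))

# relative frequency of each allele
freq <- prop.table(table(alleles))
data.frame(Allele = names(freq), Frequency = as.numeric(freq))
```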


### Multilocus

The conversion of loci to a *data.frame* expands beyond the single locus. If you do not specify which locus to use, it will use all *locus* objects and add an additional column to the data frame (n.b., I only print out the first 10 rows to give you the idea).

```{r}
freqs.loci <- frequencies( arapat )
freqs.loci[1:10,]
```


### Substrata and Allele Frequencies

To complete the symmetry here, adding a stratum to the analysis provides yet another categorical variable upon which allele frequencies may be estimated.  Here is an example looking at the "Cluster" strata in the data set and a partial printout of the results.

```{r}
freqs.strata <- frequencies( arapat, stratum="Cluster" )
freqs.strata[1:10,]
```


### Plotting Allele Frequencies

There are several ways you may want to graphically view the locus data and for convenience, the **gstudio**  package provides some interfaces for nice plots using the **ggplot2** package.  

Plotting a vector of loci will by default estimate the frequencies of each allele for graphical output.  There are two different output modes for this (n.b., a pie chart by its nature can lead to inaccurate interpretations and most statisticians hate them).

```{r}
plot( arapat$MP20 )
plot( arapat$MP20, mode="pie")
```

You can also use the **ggplot2** routine `geom_locus()` to plot the frequencies:

```{r}
ggplot() + geom_locus( aes(x=MP20, fill=Cluster), data=arapat)
```


The frequencies across a collection of loci can be plotted just as easily (internally, this simple plot just turns the object into a data frame and then plots it). At times, examination of allele spectra can reveal blatant differences among the substrata of your data.  For example, consider the following spectra for the loci MP20 and AML.


```{r}
f <- freqs.strata[ freqs.strata$Locus %in% c("MP20","AML"), ]
summary(f)
ggplot(f) + geom_frequencies(f) + facet_grid(Stratum~.) + theme(legend.position="none")
```


### Frequency Gradients

When you have many strata or you are conducting landscape-level analyses, it is often helpful to look at how allele frequencies change in relation to some variable other than stratum.  


```{r}
baja <- arapat[ arapat$Species != "Mainland",]
```

The *EN* locus has a few different alleles but if we look at the frequencies of each, the first two dominate.

```{r}
plot( baja$EN )
```

Using just the first allele *01*, it is pretty easy to plot the strata frequency as a function of latitude using normal R approaches.  To do this, one needs to:

1. Extract the *01* allele frequencies by population.
```{r}
freqs <- frequencies( baja, stratum="Population", loci="EN")
freq.01 <- freqs[ freqs$Allele == "01",]
```
2. Merge this *data.frame* with one containing the coordinates of the populations.
```{r}
coords <- strata_coordinates( baja )
df <- merge( freq.01, coords)
df[1:10,]
```

Now you can plot the frequencies as a line plot (below you will see how to plot these along environmental gradients).

```{r}
ggplot( df, aes(x=Latitude,y=Frequency)) + geom_line(linetype=2) + geom_point(size=4) 
```

This is interesting.  Now, just to 'kick it up a notch', I'm going to look at the *Cluster* variable.  This is from mtDNA and shows putatively cryptic species.  I'm going to remake the plot above but color the points to indicate the presence of the 'SCBP-A' clade (perhaps another species).  Below I add a new column of data to *df* and set it all to 'Baja'.  Then I figure out which populations have 'SCBP-A' individuals in them and label those 'Cape'.

```{r}
df$Species <- "Baja"
pops.with.scbp <- as.character(unique(baja$Population[ baja$Cluster=="SCBP-A"]))
df$Species[ df$Stratum %in% pops.with.scbp ] <- "Cape"
```

Then plot it.

```{r}
ggplot( df, aes(x=Latitude,y=Frequency)) + geom_line(linetype=2) + geom_point(size=5, aes(color=Species) ) 
```

I think this leads to some interesting questions about the relationship between potential species differences, where species are gauged by mtDNA, and nuclear allelic diversity.


### Spatial Frequency Plots

It is also possible to plot the data in a spatial context.  Here is an example of how to mix `ggplot()` and `ggmap()` data and I'll plot the locations as proportional in size to the allele frequency.

```{r, warning=FALSE,message=FALSE, error=FALSE}
map <- population_map(baja)
ggmap( map ) + geom_point( aes(x=Longitude, y=Latitude, size=Frequency), data=df)
```

There is also the option to make use of some pie charts.  I know, pie charts suck and any statistician will tell you that they should probably not be used because they can be misleading, but here they are.  For exploratory data analysis, they can be very insightful at times.  Here is the frequency of alleles at the *enolase* locus in *Araptus*.  Any spatial structuring catch your eye?


```{r,warning=FALSE,message=FALSE,fig.width=7,fig.height=7, eval=FALSE}
pies_on_map( arapat, stratum="Population", locus="EN")
```

Which will open a new browser window and produce a graph like the one below.

<iframe width="640" height="480" src="media/pies_on_map.html" frameborder="0" allowfullscreen></iframe>

Note the messages about the approximation.  This is because the Google Maps API has an integer for zoom factor and at times it is not able to get all the points into the field of view using an integer zoom.  If this happens to you, you can manually specify the *zoom* as an optional argument to either function, `pies_on_map()` or `population_map()`.  You also need to be careful with the `pies_on_map()` function because the way it works is that the background tile is plotted and then I plot the pies on top of it.  If you reshape your plot window outside equal x- and y- coordinates (e.g., make it a non-square figure), the spatial location of the pie charts will move!  This is a very frustrating thing but it has to do with the way viewports are overlain onto graphical objects in R and I have absolutely **no** control over it.  So, the best option is to make the plot square, export it as PNG or TIFF or whatever, then crop as necessary.


## Multivariate Analogs for Loci

Genotype data is inherently multivariate.  In fact, it is multinomial multivariate *sensu stricto* but we generally ignore that. That being said, we can easily translate raw genotypes into raw multivariate encodings for other statistical analyses.  Here is a quick example using a few individuals and the *WNT* locus.

```{r}
to_mv( arapat$WNT[1:10]  )
to_mv( arapat$WNT[1:10], drop.allele=TRUE)
```
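
To give a feel for what a multivariate encoding of genotypes looks like, here is a base R sketch that turns made-up genotypes into a matrix of allele counts.  This is an assumption-laden illustration: I am not showing the exact scaling `to_mv()` uses, only the general idea of one column per allele.

```{r}
# Made-up diploid genotypes at one locus
genos <- c("A:A", "A:B", "B:C")
allele.list <- strsplit(genos, ":", fixed = TRUE)
all.alleles <- sort(unique(unlist(allele.list)))

# one row per individual, one column per allele; cells hold allele counts
X <- t(sapply(allele.list, function(a) table(factor(a, levels = all.alleles))))
rownames(X) <- genos
X
```

Every row sums to the ploidy, so one column is redundant given the others.  That redundancy is presumably why an option like *drop.allele* is useful before handing the matrix to methods that dislike collinear columns.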


For multiple loci, we can use the same approach.  Here is an example of a PCA analysis done on raw genotypes.

```{r}
x <- to_mv( arapat, drop.allele=TRUE)
fit.pca <- princomp(x, cor=TRUE)
summary(fit.pca)
```

Interesting.  It takes several eigenvectors to explain these data sufficiently.  Here is a simple plot of part of the model, given by computing the predicted values for each sample.  I then use `ggplot()` to make a scatter plot with species dictating the shape of the symbol and clade providing the color (each symbol is an individual, and clade was determined by mtDNA, not these data).

```{r}
pred <- predict( fit.pca)
df <- data.frame( PC1=pred[,1], PC2=pred[,2], Species=arapat$Species, Clade=arapat$Cluster, Pop=arapat$Population)
ggplot( df ) + geom_point( aes(x=PC1,y=PC2,shape=Species,color=Clade), size=3, alpha=0.75)
```

Looks like there are three main groups divided by clade and within the more dense clade, there is some sub-structuring.  I'll take the data that is in the main clade and do a quick hierarchical clustering analysis.

```{r,fig.width=12,warning=FALSE}
baja <- pred[df$Species=="Peninsula",]
h <- hclust( dist( baja ), method="single")
plot(h,main="Main Baja California Clade", xlab="")
```


## Measures of Genetic Diversity

Genetic diversity is estimated by several different means.  It can be estimated at several different levels; at individuals, at groups, at populations, etc.  It can also be estimated by several different parameters.  This section covers some of the more common parameters used for quantifying genetic diversity.


### Allelic Diversity

At the most basic level, the number of alleles within a group of individuals is a base measure of diversity.  However, there are some caveats to be made about the way in which we count alleles.  Rare alleles may or may not be as informative.  The three ways commonly used to look at allelic diversity are

1. The total number of alleles ($A$).
2. The effective number of alleles ($A_e$).
3. The number of alleles with at least 5% frequency ($A_{95}$).

These parameters are estimated from your data using the `genetic_diversity()` function.  The argument *mode* takes either "A" (the default), "Ae", or "A95" to differentiate.

```{r}
AA <- locus( c("A","A") )
AB <- locus( c("A","B") )
loci <- c(AA,AB,AA,AA,AA,AA,AA,AA,AA,AA,AA)
loci
genetic_diversity(loci)
genetic_diversity(loci,mode="A")
genetic_diversity(loci,mode="A95")
```
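
For reference, here is a base R sketch of the three quantities computed by hand on the same toy sample (ten AA and one AB), written as plain strings.  It uses the usual definition $A_e = 1/\sum_i p_i^2$ with no sample-size correction; whether `genetic_diversity()` applies any correction is not something I am asserting here.

```{r}
# Same toy sample as above, as plain strings: ten AA and one AB
genos <- c("A:A", "A:B", rep("A:A", 9))
alleles <- unlist(strsplit(genos, ":", fixed = TRUE))
p <- prop.table(table(alleles))     # allele frequencies: 21/22 and 1/22

A   <- length(p)                    # total number of alleles
Ae  <- 1 / sum(p^2)                 # effective number of alleles
A95 <- sum(p >= 0.05)               # alleles with at least 5% frequency
c(A = A, Ae = Ae, A95 = A95)
```

The rare *B* allele counts fully toward $A$, barely moves $A_e$, and is excluded from $A_{95}$, which is the caveat about rare alleles in a nutshell.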


### Rarefaction

Rarefaction is a technique used to measure diversity in different populations.  It is particularly important for situations where you have different sample sizes.  Is there more diversity in the larger population because you sampled more or is it a truly more diverse population?  I'll use the data from the beetle to show how diversity changes with sample sizes and highlight how you can use the `rarefaction()` function.  

In the mainland populations, there are only 36 samples and the allelic diversity is relatively low at the WNT locus.

```{r}
loci.son <- arapat$WNT[ arapat$Species == "Mainland" ] 
length( loci.son )
genetic_diversity( loci.son, mode="Ae")
```

The larger clade on the peninsula has many more individuals and is more diverse.  

```{r}
loci.peninsula <- arapat$WNT[ arapat$Species == "Peninsula" ]
length( loci.peninsula )
genetic_diversity( loci.peninsula, mode="Ae")
```

So is this difference a consequence of the sample sizes or is peninsular Baja California more genetically diverse?  To answer this, rarefaction randomly sub-samples the `loci.peninsula` data and estimates the value of $\hat{A}_e$ for samples of size 36 (for our case) to see if the observed diversity differences are due to sampling alone.  To visualize the distribution, I throw it into a *data.frame* and use the **ggplot2** functions to make the pretty colored histogram.

```{r }
Ae.peninsula  <- rarefaction(loci.peninsula, mode="Ae", size=36)
df <- data.frame( Ae.peninsula )
ggplot( df, aes(x=Ae.peninsula) ) + geom_histogram(aes(fill=after_stat(count)), binwidth=0.05) + scale_fill_gradient("count",low="#cccccc",high="#a60000") + theme_bw()
```
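
The logic of rarefaction is simple enough to sketch in base R.  Everything below is made up for illustration (the simulated genotypes, the helper `ae()`, and the choice of 999 sub-samples); it shows the resampling idea rather than the internals of `rarefaction()`.

```{r}
set.seed(42)  # reproducible sub-samples

# Hypothetical helper: effective number of alleles from "a:b" genotype strings
ae <- function(genos) {
  alleles <- unlist(strsplit(genos, ":", fixed = TRUE))
  1 / sum(prop.table(table(alleles))^2)
}

# Made-up 'large' sample of 200 genotypes drawn from four alleles
pool  <- c("1", "2", "3", "4")
prob  <- c(0.5, 0.3, 0.15, 0.05)
large <- paste(sample(pool, 200, replace = TRUE, prob = prob),
               sample(pool, 200, replace = TRUE, prob = prob),
               sep = ":")

# Rarefy: draw 36 genotypes without replacement, many times over
rarefied <- replicate(999, ae(sample(large, size = 36)))

ae(large)                             # estimate from the full sample
quantile(rarefied, c(0.025, 0.975))   # spread expected at n = 36
```

If the estimate from the smaller population falls well outside the rarefied distribution, sample size alone is an unlikely explanation for the difference.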


### Heterozygosity

At a base level, heterozygosity is a form of diversity (see Nei).  Heterozygosity can be measured at many different strata and in two forms.  These are accessed through the functions `Ho()` (for observed heterozygosity)  

```{r}
Ho( arapat$EF )
```

and `He()` (for expected heterozygosity given Hardy-Weinberg Equilibrium).

```{r}
He( arapat$EF )
```

Both $H_E$ and $H_O$ can be used with full data sets as well. When you pass a full *data.frame* to these functions, it will return a *data.frame* with loci by row.

```{r}
He( arapat )
```


Given the broadness of these functions, it is easy to integrate them into broader analyses. Here is an example of expected heterozygosity ($H_E$ or genetic diversity in Nei's terms) as a function of latitude for the peninsular populations.  The output is displayed as a plot.

```{r}
baja <- arapat[ arapat$Species != "Mainland",]
coords <- strata_coordinates(baja)
pops <- partition(baja,stratum="Population")
he <- lapply( pops, function(x) return(He(x$EN )) ) 
data <- merge( coords, data.frame( Stratum=names(he), He=unlist(he)))
data <- data[ order( data$Latitude), ]
ggplot( data, aes(x=Latitude, y=He)) + geom_line(linetype=2) + geom_point(color="red",size=4)
```


## Inbreeding

Inbreeding is a consequence of mating patterns and/or demographic population size.  The consequences of inbreeding are related to how alleles are put into genotypes.  One approach to looking at inbreeding is to estimate the expected frequency of heterozygotes (e.g., the $2pq$ part of the classic Hardy-Weinberg equation) and compare it to the observed frequency of heterozygotes. This is the classic $F$ statistic and is estimated as:

\[
F_{IS} = 1 - \frac{H_{O}}{H_{E}}
\]
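
As a quick worked example of the formula, here is a base R sketch on made-up genotypes.  It uses the uncorrected estimators ($H_O$ as the proportion of heterozygous individuals and $H_E = 1 - \sum_i p_i^2$); the **gstudio** functions may apply sample-size corrections that this sketch ignores.

```{r}
# Made-up diploid genotypes at one locus
genos <- c("A:A", "A:A", "A:B", "B:B", "B:B", "A:A", "B:B", "A:B")
allele.mat <- do.call(rbind, strsplit(genos, ":", fixed = TRUE))

# observed heterozygosity: proportion of individuals with two different alleles
Ho.hat <- mean(allele.mat[, 1] != allele.mat[, 2])

# expected heterozygosity under Hardy-Weinberg: 1 - sum(p_i^2)
p <- prop.table(table(allele.mat))
He.hat <- 1 - sum(p^2)

Fis.hat <- 1 - Ho.hat / He.hat
c(Ho = Ho.hat, He = He.hat, Fis = Fis.hat)
```

Here both alleles are at frequency 0.5, so half the individuals should be heterozygous, but only a quarter are, giving $F_{IS} = 0.5$.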

Using the beetle data again, from the maps above we see three mainland populations and there is a good reason to believe that these are a separate species (see Garrick _et al._ 2013).  These populations are small and isolated and as such may experience inbreeding.  

```{r}
sonora <- arapat[ arapat$Species == "Mainland",]
fis.sonora <- Fis( sonora )
fis.sonora
```

There are various ways to get a confidence interval on this kind of analysis.  What follows is an implicit test of $H_0: F_{IS} = 0$ using permutation.  If that null hypothesis is true, then random permutations of alleles (combined into genotypes) sampled from this population should frequently produce estimates of $F_{IS}$ as large as that observed.  This kind of permutation is handled by the function `permute_ci()` (though it can be applied to more complicated analyses as shown below).  Here is an example of how to use it to create the null distribution of $F_{IS}$ values given these data for the loci $EN$, $AML$, and $ATPS$. 

```{r,warning=FALSE,message=FALSE}
# keep only the non-missing genotypes at each locus
locus.en <- sonora$EN[ !is.na(sonora$EN) ]
locus.aml <- sonora$AML[ !is.na(sonora$AML) ]
locus.atps <- sonora$ATPS[ !is.na(sonora$ATPS) ]
fis.en <- permute_ci( locus.en , FUN=Fis, nperm=99 )
fis.aml <- permute_ci( locus.aml , FUN=Fis, nperm=99 )
fis.atps <- permute_ci( locus.atps , FUN=Fis, nperm=99 )
```

Now we can plot these as histograms to look at their distributions.

```{r}
df <- data.frame( Locus=rep(c("EN","AML","ATPS"), each=length(fis.en)), Fis=c(fis.en,fis.aml,fis.atps))
ggplot( df, aes(x=Fis)) + geom_histogram(binwidth=0.025) + facet_grid( Locus~. ) + theme_bw()
```
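
The permutation idea itself can be sketched in a few lines of base R.  The data, the helper `fis()`, and the shuffling scheme (pool all alleles, shuffle, re-pair into genotypes) are assumptions made for illustration; I am not claiming this is exactly what `permute_ci()` does internally.

```{r}
set.seed(1)

# Hypothetical helper: uncorrected Fis from a two-column matrix of alleles
fis <- function(a) {
  Ho <- mean(a[, 1] != a[, 2])
  He <- 1 - sum(prop.table(table(a))^2)
  1 - Ho / He
}

# Made-up sample with a deficit of heterozygotes: 12 AA, 12 BB, 6 AB
obs <- rbind(matrix("A", 12, 2), matrix("B", 12, 2),
             cbind(rep("A", 6), rep("B", 6)))
fis.obs <- fis(obs)

# Null distribution: shuffle all alleles, then re-pair them into genotypes
fis.null <- replicate(999, fis(matrix(sample(obs), ncol = 2)))

# one-sided p-value: how often is a permuted value at least as large as observed?
p.value <- (sum(fis.null >= fis.obs) + 1) / (length(fis.null) + 1)
c(Fis = fis.obs, p = p.value)
```

Shuffling keeps the allele frequencies fixed while breaking any association between the two alleles within an individual, which is what the null hypothesis of no inbreeding describes.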


Estimating a confidence interval around a point estimate is a bit different.  In the example above, we could ask if $F_{IS,EN}=$ `r fis.sonora$Fis[3]` was different from $F_{IS,ATPS}=$ `r fis.sonora$Fis[7]`, which is an entirely different question than the one addressing $H_0: F_{IS} = 0$.  That being said, it is not too difficult to do this given the tools you have in R.

## Measures of Genetic Distance


There are several genetic distances available within the **gstudio** package, and the ability to use multivariate analogs of genotypes opens up essentially all distance metrics to the end user.  What follows are the distances that are implemented internally in the package, along with a brief overview.

Since all the genetic distance approaches take the same general data, **gstudio** provides a general interface for all distance metrics in the `genetic_distance()` function.  This function takes the data as either a data frame with *locus* objects or a vector of *locus* objects, along with the genetic distance metric to be estimated, and returns the appropriate response.  The general form of this function is as follows and only differs in the sense that individual genetic distances do not require a *stratum* whereas strata distances do. However, as with all functions in **gstudio** that need a stratum, if the name of that column is "Population" you do not need to provide it.

```{r,eval=FALSE}
amova.dist <- genetic_distance(data,mode="AMOVA")
nei.dist <- genetic_distance( data, stratum="Population", mode="Nei")
```

What follows is a more in-depth overview of each of the genetic distance metrics.  

### Individual Distances

At the base level, you can estimate distances among individuals resulting in an $N \times N$ matrix of pair-wise distances.  This is internally how the AMOVA analysis is conducted and is a nice heuristic for conceptual understanding of variance decomposition.  

In the following examples, I will use a made-up locus consisting of four alleles, with one individual for each possible genotype, to show how these distances work.  Here are the data:

```{r}
AA <- locus( c("A","A") )
AB <- locus( c("A","B") )
BB <- locus( c("B","B") )
AC <- locus( c("A","C") )
AD <- locus( c("A","D") )
BC <- locus( c("B","C") )
BD <- locus( c("B","D") )
CC <- locus( c("C","C") )
CD <- locus( c("C","D") )
DD <- locus( c("D","D") )
loci <- c(AA,AB,AC,AD,BB,BC,BD,CC,CD,DD) 
```


#### AMOVA Distance

The AMOVA distance metric was first introduced in Excoffier _et al._ (1992) using restriction fragment encodings. However, a more elegant description was given in Smouse & Peakall (1999) using a geometric interpretation.  Essentially, the coding of alleles at a locus can be depicted by a vector $\vec{p}$ whose length is equal to the number of alleles at the locus.  The presence of an allele increments the appropriate element of that vector.  For two individuals, the squared vector distance between them is:

\[
\delta_{ij}^2 = 2(p_i-p_j)^2
\]

Across loci, these are additive (though see Smouse & Peakall for additional weighting schemes by locus or allele and the relative power of adopting such approaches).  Across a set of individuals the squared genetic distance between all individuals can be represented by a square symmetric matrix with a zero diagonal, *D*.  The AMOVA analysis itself is conducted by taking the "Sums of Squared Genetic Distances" for all individuals (SSGD(Total)), within groups (SSW), and among groups (SSA), and variance is decomposed following the standard approach for a random effects model.  There is essentially no mystery in this approach, though it has been shrouded in obscurity.  Dyer _et al._ (2004) showed how this is just a multivariate linear model amenable to a much broader range of experimental designs than just 1-way and nested designs.


```{r}
D <- dist_amova( loci )
rownames(D) <- colnames(D) <- as.character(loci)
D
```
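As a cross-check on the formula above, here is a minimal base R sketch that does not use **gstudio** at all.  It assumes each genotype is coded as a vector in which every allele copy adds 0.5 to its element; `allele_vector()` and `amova_d2()` are hypothetical helper names made up for this illustration, and the scaling need not match what `dist_amova()` reports.

```{r}
# Hypothetical helpers for illustration only; not part of gstudio.
# Code a diploid genotype as a frequency vector over the allele set.
allele_vector <- function(genotype, allele_set) {
  sapply(allele_set, function(a) sum(genotype == a) / 2)
}

# Squared distance from the formula above: 2 * sum((p_i - p_j)^2)
amova_d2 <- function(g1, g2, allele_set) {
  2 * sum((allele_vector(g1, allele_set) - allele_vector(g2, allele_set))^2)
}

allele_set <- c("A", "B", "C", "D")
amova_d2(c("A", "A"), c("A", "B"), allele_set)  # homozygote vs. heterozygote sharing one allele
amova_d2(c("A", "A"), c("B", "B"), allele_set)  # alternate homozygotes
amova_d2(c("A", "B"), c("C", "D"), allele_set)  # heterozygotes sharing no alleles
```

Under this coding the squared distance between two diploid genotypes can only take the values 0, 1, 2, 3, or 4.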


#### Bray Curtis Individual Distance


Bray-Curtis Distance (Bray \& Curtis 1957) has been primarily used to quantify differences in species composition.  It is defined as the total number of species that are unique to either of the two sites standardized by the number of species in both sites.  

\[
BC_\delta = \frac{S_i + S_j - 2S_{ij}}{S_i+S_j}
\]

where $S_x$ is the species count and $S_{ij}$ is the sum of minimum abundances.  Lately, this has seen considerable use within individual-based landscape genetic studies.  Missing genotypes are set to average allele frequencies, that is to say that every missing genotype is considered to have all the alleles present in the entire population, but with probability equal to their global frequencies.  Essentially, this removes the `NA` problem like in the `mode="Jaccard"` situation and does so by taking the non-missing genotype's genetic distance from the global genetic centroid (it's cosmic man!).  Here is the estimation using the example locus from above.

```{r}
D <- dist_bray( loci )
rownames(D) <- colnames(D) <- as.character(loci)
D
```
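To see the arithmetic behind the Bray-Curtis formula, here is a minimal base R sketch.  It assumes genotypes are coded as allele counts, that $S_x$ is the total count for individual $x$, and that $S_{ij}$ is the sum of the element-wise minimum counts; `bray_curtis()` is a hypothetical helper and it does not handle missing genotypes.

```{r}
# Hypothetical helper for illustration only; not part of gstudio.
# Bray-Curtis dissimilarity between two vectors of allele counts.
bray_curtis <- function(a, b) {
  (sum(a) + sum(b) - 2 * sum(pmin(a, b))) / (sum(a) + sum(b))
}

# Allele counts over alleles A, B, C, D
count_AA <- c(A = 2, B = 0, C = 0, D = 0)
count_AB <- c(A = 1, B = 1, C = 0, D = 0)
count_CD <- c(A = 0, B = 0, C = 1, D = 1)

bray_curtis(count_AA, count_AB)  # one of the two allele copies is shared
bray_curtis(count_AB, count_CD)  # nothing shared
bray_curtis(count_AB, count_AB)  # identical genotypes
```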


### Strata Distances


Genetic distances can also be estimated among partitioned groups of individuals.   Here, I will use the data from the mainland Sonoran beetle data set using locus $ATPS$ and the $Population$ stratum as it is small enough to easily display all the results.

```{r}
data <- arapat[arapat$Species == "Mainland",c(3,13)]
```

It should be noted that for some of these metrics, we need to make assumptions about how to represent them as true 'multilocus' estimators.  Where assumptions are being made, a warning message will be displayed to remind the user that there is an assumption being made in the estimation.

#### Euclidean Distance


Euclidean distance is the most straight-forward distance metric available as it is essentially straight-line distance based upon the allele frequencies in each population.  It is given by:

\[
  d_{eucl} = \sqrt{ \sum_{j=1}^L(p_{ij} - p_{kj})^2 }
\]

where $p_{ij}$ and $p_{kj}$ are the frequencies of the $j^{th}$ allele in the $i^{th}$ and $k^{th}$ populations.  In this and the following distance examples, the resulting distance matrix among all pairs of populations can also be put into a Neighbor joining tree (via the `nj()` function from the **ape** package), as it may be easier to see differences in topologies than in matrices.

It is perhaps easiest to think of Euclidean distance in x,y coordinate space.  This distance can be estimated with `dist_euclidean()`, as below, or through `genetic_distance()` using *mode="Euclidean"*.

```{r}
dist_euclidean( data )
```
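The same straight-line idea can be sketched in base R on a table of allele frequencies.  The frequencies below are made up for illustration (they are not taken from the beetle data), and `freq_eucl` is a hypothetical name.

```{r}
# Hypothetical allele frequencies (rows are populations, columns are alleles)
freq_eucl <- rbind(
  PopA = c(0.75, 0.25, 0.00),
  PopB = c(0.25, 0.75, 0.00),
  PopC = c(0.00, 0.50, 0.50)
)
# Base R dist() defaults to Euclidean distance between rows
dist(freq_eucl)
```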


#### Cavalli-Sforza Distance

Another distance approach that is commonly used for microsatellite loci is Cavalli-Sforza distance, $D_C$ (Cavalli-Sforza and Edwards, 1967).  Here population allele frequencies are plotted on the surface of a sphere (radius=1) using the square root of the allele frequencies.  The genetic distance, $D_C$, is measured as the chord distance between two populations whose positions on that sphere are separated by the angle $\theta$:

\[
  D_C = \frac{2}{\pi}\sqrt{2-2\cos\theta}
\]

Because $\cos\theta$ is the sum across alleles of the products of the square-root frequencies, the distance between populations $m$ and $n$ can be written directly in terms of allele frequencies as:

\[
D_{CS,m,n} = \frac{2}{\pi}\sqrt{ 2 \left( 1-\sum_{i=1}^{\ell} \sqrt{p_{m,i}\,p_{n,i}} \right) }
\]


```{r}
dist_cavalli( data )
```
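A minimal base R sketch of the chord distance on the unit sphere described above follows.  It assumes a single locus and complete allele frequency vectors for both populations; `chord_distance()` is a hypothetical helper, not the **gstudio** implementation.

```{r}
# Hypothetical helper for illustration only; not part of gstudio.
chord_distance <- function(p, q) {
  cos_theta <- sum(sqrt(p * q))    # cosine of the angle between sqrt-frequency vectors
  cos_theta <- min(cos_theta, 1)   # guard against rounding slightly above 1
  (2 / pi) * sqrt(2 - 2 * cos_theta)
}

chord_distance(c(0.5, 0.5), c(0.5, 0.5))  # identical frequencies
chord_distance(c(0.5, 0.5), c(1, 0))      # partial overlap
chord_distance(c(1, 0), c(0, 1))          # no shared alleles, the maximum
```

The maximum occurs when the two populations share no alleles, where $\cos\theta = 0$ and the distance is $2\sqrt{2}/\pi$.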


#### The $D_{PS}$ Distance

The Bray Curtis distance can also be estimated on groups of individuals.  However, when it is done it is often represented as $D_{PS} = (1 - P_S)$ (where $P_S$ is the proportion of shared alleles), which I will follow here for simplicity such that the individual distances and the strata distances are not confused.  The $D_{PS}$ distance metric is directly related to Jaccard's distance as:

\[
D_{PS} = \frac{-J_{\delta(A,B)}}{J_{\delta(A,B)}-2}
\]


```{r}
dist_bray( data )
```

It should be noted here that the function `dist_bray()` is used for both individual distance AND strata distance depending upon what you pass to it.  An individual distance is found by passing it a vector of loci whereas a stratum distance is returned by passing it a *data.frame* object (that has a *stratum* column).  In the `genetic_distance()` function these are also differentiated by *mode="Bray"* for the individual one and *mode="Dps"* for the stratum.


#### Jaccard Distance

Jaccard distance is a set-theoretic distance quantifying dissimilarity.  Assuming that loci are sets of alleles, the Jaccard dissimilarity between genotypes $A$ and $B$ is given by:

\[
J_{\delta(A,B)} = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}
\]

```{r}
dist_jaccard( data )
```

The reverse relationship between $D_{PS}$ and $J_{\delta(A,B)}$ is given as:

\[
J_{\delta(A,B)} = \frac{2D_{PS}}{1 + D_{PS}}
\]
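The set definition of Jaccard distance and its link to $D_{PS}$ can be checked numerically with a small base R sketch.  The helpers below are hypothetical names for this illustration; `jaccard_to_dps()` uses the expression for $D_{PS}$ given earlier, and `dps_to_jaccard()` is that same expression solved for $J$, which gives $J = 2D_{PS}/(1+D_{PS})$.

```{r}
# Hypothetical helpers for illustration only; not part of gstudio.
jaccard_sets <- function(a, b) {
  (length(union(a, b)) - length(intersect(a, b))) / length(union(a, b))
}
jaccard_to_dps <- function(j) -j / (j - 2)
dps_to_jaccard <- function(d) 2 * d / (1 + d)

j_example <- jaccard_sets(c("A", "B"), c("A", "C"))  # union of 3 alleles, intersection of 1
j_example
d_example <- jaccard_to_dps(j_example)
d_example
all.equal(dps_to_jaccard(d_example), j_example)      # the round trip recovers J
```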


#### Conditional Genetic Distance (cGD; Graph Distance)

Conditional genetic distance, *cGD*, is a measure based upon conditional genetic covariance and is distinctly different from these other measures as it is not a pair-wise distance metric.  Rather it is the distance through a *Population Graph* topology whose construction is determined by the totality of the data.  To estimate *cGD* from your data, you can use the `genetic_distance()` function as before (with *mode="cGD"*) and it will do the right thing and return you a matrix.  However, you should probably look into what a *Population Graph* really is prior to using it.  You will find more information below as well as in the documents for the **popgraph** package itself.  For consistency, the function is shown below BUT this data set with 3 populations is too small to be of interest for a network analysis (how many ways can you connect 3 things...)

```{r,message=FALSE,warning=FALSE}
dist_cgd( data )
```


#### Nei's Genetic Distance

A very common metric of genetic distance, if you think the data you have are due to drift/mutation balance, is that of Nei.  The implementation in **gstudio** of Nei's distance is based upon the sample size correction from 1978, and is calculated as:

\[
I = \frac{(2N-1)\sum_{l=1}^L\sum_{m=1}^M p_{Alm}p_{Blm}}{\sqrt{ \sum_{l=1}^L(2N\sum_{m=1}^{M_A}p_{Alm}^2-1)(2N\sum_{m=1}^{M_B}p_{Blm}^2 -1)}}
\]

for the "Genetic Identity" (where $p_A$ is the allele frequencies at one population and $p_B$ are the corresponding frequencies for the other, $L$ is across loci and $M$ is across alleles at the $l^{th}$ locus).  

Nei's (1978) genetic distance, $D_N$, is:

\[
D_N = -ln(I)
\]

```{r}
dist_nei( data )
```
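For intuition about how the identity and the log transform behave, here is a base R sketch of the simpler, uncorrected form of Nei's identity for a single locus.  Note that this is the form without the sample-size correction, not the corrected estimator that **gstudio** implements, and `nei_standard()` is a hypothetical helper name.

```{r}
# Hypothetical helper for illustration only; not part of gstudio.
# Uncorrected identity and distance for a single locus.
nei_standard <- function(p_a, p_b) {
  identity <- sum(p_a * p_b) / sqrt(sum(p_a^2) * sum(p_b^2))
  -log(identity)
}

nei_standard(c(0.6, 0.4, 0), c(0.6, 0.4, 0))    # identical frequencies
nei_standard(c(0.6, 0.4, 0), c(0.2, 0.3, 0.5))  # partial overlap
nei_standard(c(1, 0), c(0, 1))                  # no shared alleles
```

When two populations share no alleles the identity is zero, so the distance is infinite.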


#### Comparing strata distance metrics 

There are several more genetic distance metrics available and each may have its own set of assumptions.  However, it is also true that across these metrics, there is some similarity between them.  Just for illustrative purposes, we'll look at the various strata distances and plot them against each other to look at the correlative structure between alternative measures using the $ATPS$ locus and the entire beetle data set.

```{r}
x <- arapat[ , c(3,13,14) ]
summary(x)
```

Now, we'll grab the distance metrics

```{r warning=FALSE,message=FALSE}
dist.euc <- genetic_distance( x, mode="Euclidean")
dist.cgd <- genetic_distance( x, mode="cGD")
dist.nei <- genetic_distance( x, mode="Nei")
dist.dps <- genetic_distance( x, mode="Dps")
dist.jac <- genetic_distance( x, mode="Jaccard")
```

and then take the upper triangle of each and put them into a *data.frame*.

```{r}
df <- data.frame( Euclidean= dist.euc[ upper.tri(dist.euc) ] )
df$cGD <- dist.cgd[ upper.tri(dist.cgd) ]
df$Nei <- dist.nei[ upper.tri(dist.nei) ]
df$Dps <- dist.dps[ upper.tri(dist.dps) ]
df$Jaccard <- dist.jac[ upper.tri(dist.jac) ]
```

Before we plot them, there is a bit of cleaning up to do in these data.  For populations that have no alleles in common, Nei's genetic distance will be `Inf` (e.g., $-\ln(0)$).  Also, with cGD, sets of populations that are independent will have an infinite distance (e.g., they are not connected so it is impossible to go through the graph from one population to the other).  So with these data, we first remove those rows and then plot the remaining distances against each other and look at their correlations.

```{r message=FALSE, warning=FALSE}
df <- df[ is.finite(df$Nei), ]
df <- df[ is.finite(df$cGD), ]
library(GGally)
ggpairs( df )
```

As you can see, there is a great deal of correlation between these parameters.


## Genetic Structure

The estimation of structure from genetic data is a common (and commonly misused) endeavor.  The **gstudio** package makes a distinction between structural parameters and statistical differences.  Structural parameters are the various $X_{ST}$ statistics that crop up all too quickly.  These are simply parameters of the data and are not *sensu stricto* measures of differentiation.  To quote Sewall Wright (1978), "..."  All of these parameters are based upon assumptions related to population genetic processes.  Statistical differences are those analyses we can do on genetic data that test a specific hypothesis that is not based upon population genetic understanding, but rather on the simple properties of multinomial multivariate data.

### Structural Parameters

Since first introduced by Sewall Wright as F-statistics, there has been a continued development of related parameters that are used to characterize 'population structure' in one way or another.  The parameters that **gstudio** provides are sufficient for most needs.  These include:

- $G_{ST}$: This is Nei's parameter.
- $G_{ST}^\prime$: This is the modification of Nei's $G_{ST}$ as proposed by Hedrick.
- $D_{est}$: This is the parameter of Jost.

Both $G_{ST}^\prime$ and $D_{est}$ are parameters derived for loci with lots of alleles.  There was a heated debate in the literature between Hedrick & Jost about issues related to Nei's $G_{ST}$ for loci with many alleles (say >6 to be conservative) and these are two options available for your use.  I would recommend looking over the debate to decide which may be more appropriate for you (using them both is a lame option and you will be mocked by your reviewers if you take that approach).  

Here are some examples of how to estimate these parameters using the beetle data.  In all of these approaches, you pass both the stratum and the locus and they return a data frame.  

As in the case for genetic distance measures, structure parameters can also be estimated using either the individual structure functions OR the generalized function `genetic_structure()` with the appropriate options.  The benefit of using `genetic_structure()` is that it allows you to do `pairwise` analyses (it returns a matrix of pairwise structure or a list of pairwise matrices, one for each locus).


#### Nei's $G_{ST}$ Parameter

```{r}
Gst(  arapat$LTRS, arapat$Population, nperm=99 )
```

If you pass several loci in a *data.frame*, it will return the single locus estimates as well as the multilocus estimate (based upon summing the heterozygosity and then estimating it as in Nei, not in just averaging the $G_{ST}$ values as in Berg & Hamrick (XXXX), see XXXX for more on the differences).

```{r}
Gst(arapat[,c(3,7:14)], nperm=99)
```


#### Hedrick's $G_{ST}^\prime$ Parameter

A correction to Nei's $G_{ST}$ was suggested for loci with a lot of alleles.  This is because the maximum value for expected heterozygosity is determined by the number of alleles and as such $G_{ST}$ for high allelic loci is not free to span the interval $[0,1]$ but is maxed out below 1.0.  As a consequence, Hedrick proposed standardizing $G_{ST}$ by the maximum value it could attain given the observed within-population heterozygosity.


```{r}
sort(unique(matrix( alleles(arapat$MP20), ncol=1)))
Gst_prime(arapat$MP20, arapat$Population,  nperm=99 )
```

In a similar fashion, the multilocus analog can be found by passing a *data.frame* to the function (again I am skipping the *stratum* variable to this function as the data has the strata in a column named 'Population').

```{r}
Gst_prime(arapat[,c(3,7:14)], nperm=99)
```


#### Jost's $D_{est}$ Parameter

Following a discussion back-and-forth between Hedrick & Jost, Jost proposed an alternative measure $D_{est}$.  The estimation of this parameter is found in a similar way as the other structure parameters.

```{r}
Dest( arapat$MP20, arapat$Population, nperm=99 )
```

```{r}
Dest( arapat[,c(3,7:14)], nperm=99)
```
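To make the three parameters above concrete, here is a base R sketch that computes them directly from a table of allele frequencies.  It assumes the commonly cited population-level forms with no sample-size corrections and equally weighted populations, so its values are not expected to match the **gstudio** estimators exactly; `structure_sketch()` is a hypothetical helper name.

```{r}
# Hypothetical helper for illustration only; not part of gstudio.
# Rows of `p` are populations, columns are alleles; no sample-size corrections.
structure_sketch <- function(p) {
  k <- nrow(p)                   # number of populations
  hs <- mean(1 - rowSums(p^2))   # mean within-population heterozygosity
  ht <- 1 - sum(colMeans(p)^2)   # total heterozygosity
  gst <- 1 - hs / ht
  c(Gst = gst,
    Gst_prime = gst * (k - 1 + hs) / ((k - 1) * (1 - hs)),
    D = ((ht - hs) / (1 - hs)) * (k / (k - 1)))
}

# Two alleles, moderately different frequencies
structure_sketch(rbind(c(0.75, 0.25), c(0.25, 0.75)))

# Ten alleles, none shared between the two populations
structure_sketch(rbind(c(rep(0.2, 5), rep(0, 5)), c(rep(0, 5), rep(0.2, 5))))
```

The second case is the situation that motivated the corrections: the populations share no alleles at all, yet the uncorrected $G_{ST}$ stays far below 1 while the other two parameters reach their maximum.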


#### Similarities in Parameters

As in genetic distance metrics, there is some similarity in output from these structure parameters.  Here is a paired plot of the three parameters as above.

```{r}
gst <- Gst( arapat )$Gst
gstp <- Gst_prime( arapat )$Gst 
dest <- Dest( arapat )$Dest
df <- data.frame( Gst=gst, Gst_prime=gstp, Dest=dest )
library(GGally)
ggpairs( df )
```

You can tell from these plots which locus has a lot of alleles and which ones do not...


### Statistical Differences

There are two statistical tests of differentiation you can conduct in R using **gstudio** and both of them are based upon the same approach.  Originally shown by Weir & Cockerham (1984), the parameter $\theta$ can be thought of as the proportion of among-strata variance in a random effects Analysis of Variance based upon single loci.  Nearly a decade later, Excoffier _et al._ (1992) showed that the same kind of parameter can be estimated using multilocus data.  Unfortunately, they called the parameter $\Phi_{ST}$ and this has caused some confusion among analysts who confuse this parameter with the structural parameters above.  I should not throw stones, as I also followed in the obfuscation of these parameters in Smouse _et al._ (2001) when we showed how to use this approach to analyze pollen pools and called it $\Phi_{FT}$ (the 'F' for father as we are analyzing paternal pollen pools; we should have also called it PAMOVA for "Pollen AMOVA" but you cannot change history).  Later, we (Dyer _et al._ 2004) showed that both $\theta$ and $\Phi$ are simply multivariate general linear models based upon $N \times N$ covariance matrices rather than $p \times p$ ones (though the broader use of more powerful statistical models has yet to be fully embraced).  There is a direct link between $\theta$ and $\Phi$ in that they are identical, both statistically and numerically, when you only use a single locus.  What $\Phi$ does is extend the analysis to multiple loci.  

Here is an example with the larger Clade using the function `adonis()` from the **vegan** package. 

```{r, message=FALSE}
data <- arapat[ arapat$Species == "Cape",]
# dist_amova() already returns pairwise distances, so convert the matrix rather than recompute
D <- as.dist( dist_amova( data ) )
Pop <- factor( as.character( data$Population ) )
library(vegan)
adonis(D ~ Pop )
```


## Parent-Offspring Data

There are several functions in the **gstudio** package that relate to working with parent/offspring data, due in large part to my own research program.  Here is an example of how one can work with these kinds of data.

### Data Formatting 

So when we have both adult and offspring data in the same data file, we need to come up with a way to make sure we can differentiate between adults and offspring.  In **gstudio** I do this using two identification columns.

1. All adults should have a unique identification column.  The default for this is "ID" in the functions and it is pretty easy to remember that one.  
2. Offspring have an *ID* column as well and it must be the same as the maternal individual.  Again, I'm coming from a plant perspective and think of this as seeds on a tree.
3. A second identification column (default="OffID") allows the differentiation between the maternal individual and their offspring.  By convention, maternal individuals have *OffID="0"* and all the offspring from that mother have a value different than *"0"*.  It should be noted that identification columns are just like stratum columns and are commonly encoded as *factor* objects.

With this out of the way, working with parent/offspring data can be rather easy.  Here are a few things that can be done using the following small data set.  I'm going to make this randomly using the `mate()` function (see below in the section on Simulations for more on this) and then order the data set by mother and offspring.


```{r}
pgm <- c( locus(1:2), locus(c(1,1)), locus( c(2,2) ), locus(1:2), locus(c(1,1)) )
tpi <- c( locus(3:4), locus(c(2:3)), locus( c(4,4) ), locus(c(3,5)), locus(c(3,4)) )
fe <- c( locus(c(5,7)), locus(c(5,7)), locus( c(5,5) ), locus(c(5,5)), locus(c(5,7)) )
ID <- factor( paste("Ind",1:5,sep="-"))
data <- data.frame( ID=ID, OffID=0, PGM=pgm, TPI=tpi, FE=fe )
data
offs <- mate( data[1,], data[2,],N=10)
data <- rbind( data, offs )
data <- data[ order(data$ID,data$OffID),]
data
```

### Paternal Contributions

From each offspring it is possible to remove the contribution of the maternal individual, leaving the male haplotype (e.g., a pollen genotype in a plant context).

```{r}
minus_mom(data)
```

Here you can see that some of the offspring can have their genotypes reduced whereas others cannot.  If mother and offspring are both identical heterozygotes, then it makes it difficult to know if the dad gave the first allele and the mother gave the second or vice-versa. You can work with these reduced genotypes in the same way as you would regular populations (this is essentially what we did in the 2Gener analysis).
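The logic of removing the maternal contribution at a single locus can be sketched in base R as follows.  This is only an illustration of the reasoning above, with genotypes held as plain character vectors; `paternal_allele()` is a hypothetical helper and is not how `minus_mom()` is implemented.

```{r}
# Hypothetical helper for illustration only; not the gstudio implementation.
# Returns the paternal allele, both alleles if ambiguous, or NA on a mismatch.
paternal_allele <- function(offspring, mother) {
  from_mom <- offspring %in% mother
  if (!any(from_mom)) return(NA_character_)                # mother/offspring mismatch
  if (offspring[1] == offspring[2]) return(offspring[1])   # homozygous offspring
  if (all(from_mom)) return(offspring)                     # cannot tell which allele is maternal
  offspring[!from_mom]                                     # the allele the mother lacks
}

paternal_allele(c("A", "B"), c("A", "A"))  # mother gave A, so the father gave B
paternal_allele(c("A", "A"), c("A", "B"))  # homozygous offspring
paternal_allele(c("A", "B"), c("A", "B"))  # identical heterozygotes, ambiguous
```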

#### Paternity Exclusion

Another common parent/offspring analysis consists of trying to identify paternity of individuals.  If you know the mother and have genotyped mother, offspring, and potential fathers, you can estimate paternity.  With many loci, you can get very specific.  However, that is not always the case in plants for two reasons:

1. We often have MANY potential fathers, not just the 'accused' one.
2. We may not have the level of exclusion required to have only 1 potential father.

So here are some convenience functions. First we will look at the exclusion probability.  This is the probability that you can exclude a randomly selected individual from paternity based upon the allele frequencies in the population. 

```{r}
freqs <- frequencies( data[ data$OffID==0,] )
freqs
pexcl <-exclusion_probability( freqs )
pexcl
```

So with these adult allele frequencies (e.g., $p_A = 0.6;\;p_B=0.4$ at one locus), the single-locus exclusion probabilities range from 16-39%.  The `exclusion_probability()` function also provides the theoretical maximum exclusion one could get with a locus where all alleles are at equal frequency, as well as how close your observed loci come to that theoretical maximum.

The multilocus exclusion probability can be found by compounding the individual locus ones as:

\[
P_{excl,multilocus} = 1 - \prod_{i=1}^\ell(1 - P_{excl,i})
\]

```{r}
p.tot <- 1 - prod( 1-pexcl$Pexcl )
p.tot
p.max <- 1 - prod( 1-pexcl$PexclMax )
p.max
p.tot/p.max
```

This shows that we capture a reasonable amount of the theoretical maximum exclusion we could expect.  However, it is still not enough to have a great power of exclusion (e.g., we should at least expect $1/p.tot =$`r 1/p.tot` dads per offspring).
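To see how quickly exclusion power accumulates as loci are added, here is a short base R sketch of the compounding formula.  The single-locus probabilities are made-up values chosen for illustration, not output from `exclusion_probability()`, and `pexcl_sketch` is a hypothetical name.

```{r}
# Hypothetical single-locus exclusion probabilities, for illustration only
pexcl_sketch <- c(0.16, 0.25, 0.39)
# Running multilocus exclusion as loci are added one at a time
1 - cumprod(1 - pexcl_sketch)
```

Each additional locus can only increase the multilocus value, since every term in the product is at most 1.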


### Fractional Paternity

One approach found in plant gene flow studies is that of fractional paternity analysis.  Here, even if individuals cannot be assigned unambiguously, their relative likelihood of paternity can be estimated.  If all dads are known with certainty, then their posterior likelihood will be $F_{IJ}=1.0$ but if more than one dad may be implicated as a potential father, then their relative likelihood is provided.

In the example below I separate out the offspring, mother, and potential dads and then pass all three to the `paternity()` function. 

```{r}
offs <- data[data$OffID!="0",]
dads <- data[data$OffID=="0",]
mom <- data[1,]
p <- paternity( offs, mom, dads)
p
```

While these offspring are randomly created by the `mate()` function (and that means every time I make this document they are different), you can see some variability in how many individuals can actually be assigned paternity.  It should be noted though that *Ind-2* better be indicated as a potential father in all those offspring ('cuz he was the real father).

This is only a starting point and it is intended to provide some quick familiarity to the use of paternity analyses.  As shown below in the section on simulations, it would be easy to simulate such things as cryptic gene flow, etc.
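The idea of a relative likelihood can be illustrated with a small base R sketch.  It makes a strong simplifying assumption, namely that the paternal haplotype of the offspring is already known at every locus, and then normalizes each candidate's Mendelian transmission probability; `transmit_prob()`, `fractional_sketch()`, and the candidate genotypes are all hypothetical and this is not how `paternity()` is implemented.

```{r}
# Hypothetical helpers for illustration only; not the gstudio implementation.
# Chance that a diploid father transmits `allele` at one locus: 0, 0.5, or 1.
transmit_prob <- function(genotype, allele) mean(genotype == allele)

# Relative likelihood of each candidate, given a known paternal haplotype.
fractional_sketch <- function(candidates, haplotype) {
  lik <- sapply(candidates, function(dad) {
    prod(mapply(transmit_prob, dad, haplotype))
  })
  if (sum(lik) == 0) return(lik)   # every candidate is excluded
  lik / sum(lik)
}

candidates <- list(
  Dad1 = list(PGM = c("1", "2"), TPI = c("3", "4")),
  Dad2 = list(PGM = c("1", "1"), TPI = c("2", "3")),
  Dad3 = list(PGM = c("2", "2"), TPI = c("4", "4"))
)
fractional_sketch(candidates, c(PGM = "1", TPI = "3"))
```

Here the third candidate is excluded outright, and the remaining two split paternity in proportion to how likely each was to produce the observed haplotype.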


## Population Graphs

A *Population Graph* is a model-free network created using conditional genetic covariance.  Connectivity within a population graph is based solely upon the multilocus genetic covariance, which is dominated by neutral processes such as gene flow and historical demography.  The routines to create population graphs were previously nestled within this package but due to the expansion of the approach outside genetic analyses, I opted to pull those routines out and put them in their own package.  I will show briefly the use of population graphs here, just enough to show how it is done.  There is a complete write-up on it (like this one) on my website at <http://dyerlab.org/docs/popgraph.html> that goes into it in much greater detail and shows you how to do various analyses on the data.

```{r error=FALSE, message=FALSE}
library(popgraph)
```

By default, the input to `popgraph()` is a multivariate data set and a factor to be used to identify the nodes.  For this example, we'll use the entire beetle data set.

```{r}
data <- to_mv( arapat )
pops <- arapat$Population
graph <- popgraph(x=data,groups=pops)
```

Now you have the graph and can plot and work with it as you find useful.  Here I'll make a color vector to indicate the species collected in each population.  You can get an idea of this by using the `table()` function as follows:

```{r}
t <- table( arapat$Species, arapat$Population)
t
```

And from there create the node colors.  I'll color the mainland populations in dark gray, the Cape-only populations in light gray, the large (Peninsula) clade in red, and the populations mixed with both peninsular species in salmon.

```{r message=FALSE, warning=FALSE}
library(igraph)
nodes <- V(graph)$name
color <- rep( "#f4a582", length(nodes) )
# the A clade
color[ t[1,]!= 0] <- "#bababa"
# the B clade
color[ t[2,] != 0] <- "#404040"
# the C clade
color[ t[3,] != 0] <- "#ca0020"
# the mixed populations
color[ t[1,] != 0 & t[3,] != 0] <- "#f4a582"
color
```

And now we can plot it.  If the graph has a *color* attribute for the vertices (the nodes), then it will automagically use it.  Here I omit the population name for brevity.

```{r,cache=TRUE,fig.width=7,fig.height=7}
V(graph)$color <- color
plot(graph, vertex.label="")
```

For a lot more on how to use and analyze population graph topologies see the documentation associated with it on my website or from [here](http://dyerlab.github.io/popgraph/).


## Simulations

The **gstudio** package is starting to include some functions that aid in basic simulations of genetic processes.  These functions are generally concerned with combining data through a mating process and/or making random populations that follow some particular format.  

- `'+'` operator for locus objects.  You can take two locus objects and add them together (genetically) and you will produce an offspring genotype.  
- `mate()` This function takes two entire individuals accumulating genetic data across loci.  By convention the first is the maternal individual and as such the meta data associated with her will be transferred to the offspring (e.g., I am assuming a basic plant model where the offspring is spatially and otherwise associated with the maternal individual).
- `make_population()` This function takes a *data.frame* containing allele frequencies.  This *data.frame* should mimic what is returned by the `frequencies()` function.  This means that you can make populations that have a single locus, many loci in a population, or several strata with several loci.  However, it does check to see that the format of the *data.frame* conforms to the format as returned by `frequencies()`.  
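The genetic 'addition' of two genotypes described in the list above amounts to drawing one allele at random from each parent.  Here is a base R sketch of that idea for a single locus, with genotypes held as character vectors; `mate_sketch()` is a hypothetical helper and is not the **gstudio** operator.

```{r}
# Hypothetical helper for illustration only; not the gstudio implementation.
# Each parent passes on one randomly chosen allele at the locus.
mate_sketch <- function(mother, father) {
  sort(c(sample(mother, size = 1), sample(father, size = 1)))
}

mate_sketch(c("A", "A"), c("B", "B"))  # always an A:B heterozygote

# Crossing two heterozygotes should give roughly 1:2:1 genotype proportions
cross <- replicate(1000, paste(mate_sketch(c("A", "B"), c("A", "B")), collapse = ":"))
table(cross)
```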

### A Migration/Drift Example

To illustrate a basic example of how these functions work, we'll do a simple migration model as follows:

1. Create two populations, with different allele frequencies.
2. Estimate *Gst* (a simple estimate of structure for a low diversity locus).
3. Migrate some of the individuals (say 10% of the adults) between the populations.  I'm going to just switch them directly such that migration is symmetric.
4. Mate the individuals to create a new set of populations.
5. Go to #2 above and repeat for 50 generations, recording the estimate of *Gst* each time.
6. Plot the result.

To do this I'll create a few functions for the migration and mating parts so that you can get an idea of how you can use this yourself (and perhaps enjoy a little cut-n-paste action if you like).  


#### Making Populations

But first, let's create the populations.

```{r}
freqs <- data.frame( Stratum=rep(c("Pop-1","Pop-2"),each=2), 
                     Locus=rep("TPI",4), 
                     Allele=rep(c("A","B"),times=2), 
                     Frequency=c(.75,.25, .25, .75) )
freqs
pops <- make_population(freqs,N=100)
summary(pops)
```

You can check the creation of the individuals by looking at the resulting allele frequencies.  However, it should be noted that you cannot often *exactly* mimic the allele frequencies due to randomness in selecting alleles (e.g., a fair coin does not give 50/50 with only a few tosses).  The following shows they are pretty close.


```{r}
frequencies( pops, stratum="Population" )
```


#### Isolate Breaking: A Cautionary Tale

As a side note, this is a great place to point out the effects of isolate breaking (e.g., the issue of combining populations when we should not).  Here each population is pretty close to no inbreeding (e.g., close to HWE).

```{r}
Fis( pops[ pops$Population=="Pop-1",])
Fis( pops[ pops$Population=="Pop-2",])
```

But if we combine them we see a totally different thing.

```{r}
Fis( pops )
```

This is a rather poorly appreciated aspect highlighting the case where we do not quite understand the underlying biology yet still forge ahead blindly throwing the data at an analysis.
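The size of this effect can be worked out from first principles.  The base R sketch below assumes two equally sized populations, each in Hardy-Weinberg proportions at a two-allele locus, and asks what inbreeding coefficient you would estimate if you pooled them; `wahlund_fis()` is a hypothetical helper and works on expected values rather than sampled genotypes.

```{r}
# Hypothetical helper for illustration only; not part of gstudio.
# Two equally sized populations, each in HWE, with allele-A frequencies p1 and p2.
wahlund_fis <- function(p1, p2) {
  ho <- mean(c(2 * p1 * (1 - p1), 2 * p2 * (1 - p2)))  # heterozygosity observed when pooled
  p_bar <- mean(c(p1, p2))
  he <- 2 * p_bar * (1 - p_bar)                         # expected from the pooled frequencies
  1 - ho / he
}

wahlund_fis(0.75, 0.25)  # the frequencies used to build the two populations
wahlund_fis(0.50, 0.50)  # no frequency difference, so no apparent inbreeding
```

The apparent inbreeding comes entirely from the difference in allele frequencies between the populations, not from any non-random mating within them.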

#### Mating Entire Populations

The `mate()` function takes two individuals and produces *N* offspring.  However, we want to take random sets of individuals and make a new population that is the same size.  This will take a few lines of code but can easily be accomplished.  I'll wrap it into a function with some comments to illustrate.

```{r}
mate_population <- function( pop ){
  N <- dim(pop)[1] # how many to make
  ret <- pop  # Replace loci in place
  for( i in 1:N){
    # grab 2 random parent indices (no selfing)
    parents <- sample( 1:N, size=2, replace=FALSE)
    # mate them to make 1 offspring
    off <- mate( pop[parents[1],], pop[parents[2],], N=1)
    # add the genotype back to the population.
    ret$TPI[i] <- off$TPI[1]
  }
  return( ret )
}
```

We can check to see if this works by iterating across generations and looking at allele frequencies.  These should bounce around due to drift but should not be all over the place if N is reasonable.  Here is a check on that.

```{r}
library(ggplot2)
n_gen <- 50  # number of generations (named to avoid masking T, the shorthand for TRUE)
df <- data.frame( Time=1:n_gen, Freq_A=rep(NA,n_gen))
test_freq <- data.frame(Locus="TPI", Allele=c("A","B"), Frequency=c(0.5,0.5) )
pop <- make_population( test_freq ,N=100)
for( i in 1:n_gen ){
  # record the frequency of the first allele, then make the next generation
  f <- frequencies( pop )
  df$Freq_A[i] <- f$Frequency[1]
  pop <- mate_population( pop )
}
ggplot(df, aes(x=Time, y=Freq_A)) + geom_line(size=1.5,color="darkred") + theme_bw() + ylim(c(0,1))
```


#### Migrating Individuals

This migration model will be rather simple.  We will simply replace the genotypes of the first 10 percent of the adults ($m=0.10$ which is rather large). Since each generation is made randomly, the position in the *data.frame* is somewhat arbitrary.  Moreover, this function is useful only for the 1-locus data set we have created but simple examples can be simplistic and still provide some information.

```{r}
migrate <- function( pop ) {
  t <- pop$TPI[1:10]
  pop$TPI[1:10] <- pop$TPI[101:110]
  pop$TPI[101:110] <- t
  return(pop)
}
```


#### Plotting Structure in Drift/Migration Simulations

OK, let's put it together and plot the results.


```{r}
df <- data.frame(Time=1:50, Gst=rep(0,50) )
for( t in 1:50 ){
  
  # find structure
  x <- Gst( pops, stratum="Population" )
  df$Gst[t] <- x$Gst[1]
  
  #migrate
  pops <- migrate( pops )
  
  #mate 
  pops[1:100,] <- mate_population( pops[1:100,]) 
  pops[101:200,] <- mate_population( pops[101:200,])
  
}
ggplot(df,aes(x=Time,y=Gst)) + geom_line(size=1.5,color="darkred") + theme_bw() + ylim(c(-.1,0.4))
```

So with just a few lines of code, we can create some simulations that show the change in structure as a function of both drift (small $N$ in these populations) and migration ($m=0.1$).  The utility of R is that it opens up a wide range of potential simulation work for you and with **gstudio** you can work with genetic data just like any other kind of data.