# The gstudio Library
The **gstudio** package was created to make it easy to include marker-based population genetic data in the R workflow. An underlying motivation for this package is to provide a link between spatial analysis and graphing packages so that the user can quickly and easily manipulate data in exploratory ways that aid in gaining biological inferences.
## Installing the package
This package requires several other packages for installation. By default, the install should be easily accomplished using the built-in functionality in R.
```{r, eval=FALSE,echo=TRUE}
install.packages("gstudio")
```
Occasionally, you should check whether there are updates to the package by doing the following (this will update all packages you have installed):
```{r, eval=FALSE}
update.packages( ask=FALSE )
```
If you want the most recent version of this package, I make the development builds available on GitHub (http://github.com/dyerlab/). You can install directly from within `R` as:
```{r,eval=FALSE}
install.packages("devtools")
library(devtools)
install_github("dyerlab/gstudio")
```
I recommend using the latest version. It has a lot of the newer features and I do not check it into GitHub until it has been tested. I only post to CRAN when major versions change.
### Loading the Package
Any time you need to use the package, just load it into your session.
```{r,message=FALSE}
library(gstudio)
```
This should get you everything you need. The **gstudio** package contains a lot of built-in documentation, including many examples. All the functions and the examples associated with them can be found in the built-in documents available from any of the following (n.b., the vignette is a simple placeholder with links to the website).
```{r, eval=FALSE}
help(package="gstudio")
vignette('gstudio')
browseURL( "http://dyerlab.github.io/gstudio/" )
```
At any point if you have any questions about the values or options for a particular function in **gstudio** or any other package, you can use the `help.search(func)` or `?` functionality. This file is also kept in sync with the development of the **gstudio** package (it is in the source package that you downloaded from CRAN) and will serve as a tutorial for your use of this package. If there are any questions that you may have regarding this package, feel free to contact [Rodney Dyer](mailto://[email protected]) and I will get back to you as soon as possible.
## Genetic Data
The overriding philosophy behind the **gstudio** package is to make it as easy as possible to create, load, use, and integrate genetic marker data into your analysis workflow. As such, we typically use *data.frame* objects to hold our data, and the addition of the *locus* class as a fundamental data type allows us to continue to do so.
```{r}
x <- locus( c(1,2) )
class(x)
x
```
You can think of a *locus* object as a vector of alleles. There are several options you can use when constructing a *locus* object based upon what kind of marker data you are using. These options are passed through the *type* option to the `locus()` function. Here are the current options.
`type` Option | Function
--------------|---------
*(none)* | This is the default value (e.g., nothing passed). It will treat the values passed to `locus()` as alleles at a single locus.
`snp` | Alleles are '0', '1', or '2' indicating the number of minor alleles.
`zyme` | Genotypes are encoded as "12" like zymes (e.g., "1" & "2" alleles together).
`separated` | Alleles are already separated by ":" character (for putting in polyploid data).
`column` | Alleles for a single locus are in two separate columns.
Here are some examples.
```{r}
loc0 <- locus( )
loc1 <- locus( 1 )
loc2 <- locus( c("C","A"), phased=TRUE )
loc3 <- locus( c("C","A") )
loc4 <- locus( 12, type="zyme" )
loc5 <- locus( 1:4 )
loc6 <- locus( "A:C:C:G", type="separated" )
loci <- c(loc0, loc1, loc2, loc3, loc4, loc5, loc6 )
loci
```
Notice how the printing of each *locus* object uses the colon character to separate alleles. Also, since the *locus* object is a basic data type, it can be used in other data structures.
```{r}
df <- data.frame( ID=0:6, Loci=loci )
df
```
You can also apply ordinary R vector operations to a vector of *locus* objects.
```{r}
is.na( df$Loci )
is_heterozygote( df$Loci )
```
### Importing Data
Importing data is not that difficult. People tend to keep their data in either spreadsheets or text files, which are easily accessible via R, or in some arcane program format (wave to Arlequin everyone), which may be less accessible. As of version 1.1, **gstudio** handles data either in raw format (e.g., as they appear in your spreadsheet) or in GENEPOP format. All of these formats are read in **gstudio** using the function `read_population()`, which is a mix between the traditional `read.table()` function and the `locus()` function.
#### Import from GENEPOP files
The GENEPOP file format, as interpreted by **gstudio**, has the following structure.
1. It is a text file, preferably space delimited.
2. The first line has some info for you. It is mostly ignored but should include a bit about what the data are.
3. The next $K$ lines of the file list the names of the loci to be used, one per line.
4. The rest of the file contains populations. Each population starts with "Pop" alone on its own line. All individuals are assumed to be in the same population until the next "Pop" is reached.
5. For each individual, an identification value comes first, followed by a comma and then the genotypes.
6. All genotypes are encoded using 3 digits for each allele (e.g., a 3:5 genotype is 003005). Missing data are all zeros ('000000') and haploid genotypes are just three digits.
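The three-digit allele encoding described above can be unpacked with a few lines of base R. This is only a sketch to illustrate the encoding, not how `read_population()` does it internally; the helper name `decode_genepop()` and its `digits` argument are hypothetical.

```{r, eval=FALSE}
# Hypothetical helper: split a GENEPOP genotype string into its alleles.
# Assumes a fixed number of digits per allele (3 by default).
decode_genepop <- function(genotype, digits = 3) {
  # A string of all zeros denotes a missing genotype
  if (grepl("^0+$", genotype)) return(NA_character_)
  starts  <- seq(1, nchar(genotype), by = digits)
  alleles <- substring(genotype, starts, starts + digits - 1)
  # Strip the zero padding and join alleles with ':' as locus objects print
  paste(as.integer(alleles), collapse = ":")
}

decode_genepop("003005")   # diploid genotype with alleles 3 and 5
decode_genepop("000000")   # missing data
decode_genepop("004")      # haploid genotype
```

The diploid, missing, and haploid cases correspond to the last item in the list above.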
The import from `read_population()` names the loci, adds "ID" for the identification column, and adds a "Pop-X" designation to each population. Other than that, it is identical to what is below. However, if you have your data in a spreadsheet, there is no need to shove it into GENEPOP format to import it into R.
#### Text File Import
Options for this function include:
Option | Function
-------|---------
path | The full path to the text file.
type | The locus *type* (see above).
locus.columns | A vector of the column numbers to be treated as *locus* objects.
sep | The character used to separate columns (',' is default).
header | Columns have header names (e.g., locus names, etc.).
Here are some examples of data files with different kinds of genetic data, each of which exercises the `read_population()` function in a different way. Hopefully this covers the main types of data being imported. If not, drop me an email [Rodney Dyer]([email protected]). Missing genotypes should be left blank or encoded as *NA*. There is no reason to use negative numbers or other conventions.
Columns of genotypes are indicated by the required parameter *locus.columns* so that `read_population()` knows which columns to treat as *locus* objects and which to leave as normal data for R. Without this parameter, the data will be read in as *character* or *numeric*.
There are some example data files included in the package for you to look at. Depending upon how your computer is set up, they may be placed in different locations. Here is a quick way to find where the **gstudio** package is installed and the location of the 'extdata' folder within it.
```{r}
system.file("extdata",package="gstudio")
```
#### Two Column Data
Here is an example of data where the genotypes are encoded as two columns of data in a csv file.
```{r}
file <- system.file("extdata","data_2_column.csv",package="gstudio")
data <- read_population(file,type="column", locus.columns=4:7)
data
```
#### Phased Data
There are times when the gametic phase of the genotypes is important. By default, **gstudio** will keep alleles sorted in alpha/numeric order. If you need to keep this from happening, pass the optional *phased* option to `read_population()`. Notice the differences between this and the previous genotypes.
```{r}
file <- system.file("extdata","data_2_column.csv",package="gstudio")
data <- read_population(file,type="column", locus.columns=4:7, phased=TRUE)
data
```
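The effect of sorting can be illustrated without any genetic data. The sketch below uses plain character vectors as a stand-in for a genotype (an assumption for illustration only; *locus* objects handle this internally).

```{r, eval=FALSE}
# Two alleles as they were typed in, e.g. maternal first, paternal second
alleles <- c("C", "A")

# Unphased: alleles are sorted, so "C","A" and "A","C" give the same genotype
unphased <- paste(sort(alleles), collapse = ":")

# Phased: the original order is preserved
phased <- paste(alleles, collapse = ":")

c(unphased = unphased, phased = phased)
```

The two genotypes differ only in allele order, which is the information the *phased* option retains.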
#### AFLP-like data
Genotypes that are 'aflp'-like are encoded as binary characters (e.g., 0/1) indicating the presence or absence of a particular band.
```{r}
file <- system.file("extdata","data_aflp.csv",package="gstudio")
data <- read_population(file,type="aflp", locus.columns=c(4,5))
data
```
#### SNP Minor Allele Data
At times, SNP data is encoded in relation to the number of minor alleles. You can import these data using the *type="snp"* option and it will encode them as 'AA', 'AB', or 'BB' with the 'B' allele as the minor one.
```{r}
file <- system.file("extdata","data_snp.csv",package="gstudio")
data <- read_population(file,type="snp", locus.columns=4:7)
data
```
#### Zyme-Like Data
Some data is encoded as allozyme genotypes (e.g., 33, 35, 55 for diploid individuals with alleles '3' and '5').
```{r}
file <- system.file("extdata","data_zymelike.csv",package="gstudio")
data <- read_population(file,type="zyme", locus.columns=4:7)
data
```
#### Pre-Separated Data For Higher Ploidy
```{r}
file <- system.file("extdata","data_separated.csv",package="gstudio")
data <- read_population(file,type="separated", locus.columns=c(4,5))
data
```
### Saving Data
There are several ways to export your data to file.
#### Raw R Objects
Saving data once it is in R is trivial and you do it as you would for any other R object. The R object system knows how to serialize its own data using the `load()` and `save()` functions.
```{r,eval=FALSE}
save(df, file="MyData.rda")
```
To load the objects back into the work space, you just do:
```{r,eval=FALSE}
load("MyData.rda")
```
And you can verify that you have data in your work space by listing it.
```{r}
ls()
```
#### Saving as Text
By default, the function `write_population()` will write your data as a comma-separated text file with the loci encoded as colon separated (see `type="separated"` above).
```{r,eval=FALSE}
write_population(df, file="~/Desktop/MyData.csv")
```
#### Saving as GENEPOP
Raw data can be saved in GENEPOP format by passing the optional argument `mode="genepop"` to the `write_population()` function.
```{r,eval=FALSE}
write_population(df, file="~/Desktop/MyData.txt",mode="genepop")
```
#### Saving for STRUCTURE
Raw data can also be exported for analysis using the program STRUCTURE. Here the optional argument is `mode="structure"`.
```{r,eval=FALSE}
write_population(df, file="~/Desktop/MyData.str", mode="structure")
```
## The arapat Data Set
The main genetic data set included with the package is from the Sonoran Desert bark beetle, *Araptus attenuatus*, from the Dyer laboratory. You can load it into your work space by:
```{r}
data(arapat)
```
Which looks like the following:
```{r echo=FALSE}
DT::datatable(arapat, options = list(scrollX = TRUE))
```
You can see several things as you scroll through the data. First, the locus data are displayed by genotype counts including NA values where there was missing data. A column of type *locus* is just like any other kind of variable and can be used as such. This opens up a lot of functionality for you to be able to treat marker data just like everything else in R.
### Convenience Functions
Dealing easily with parts of your data is a critical skill and a huge benefit in using a grammar like R to do your analyses. In R, *data.frame* objects are almost like little databases and you can do some really creative manipulations with them. The **gstudio** package provides a few things that may help you work more efficiently with your data.
### Data Classes
Often it is important to know which columns of a data set are of a particular data type. Here is a simple function that tells you either the name or the index of the columns in a data.frame that are of a specific data type.
```{r}
column_class( arapat, class="locus")
column_class( arapat, class="locus", mode="index" )
```
## Partitioning
The `partition()` function takes a data frame and returns a list of data frames, partitioned by the stratum you pass to it. This is really nice if you are doing a nested analysis of sorts and want to work with subsets of your data that are defined by a categorical *factor* variable.
```{r}
names(arapat)
clades <- partition( arapat, stratum="Species")
names( clades )
```
This kind of partitioning is very common in the analysis of spatial genetic structure and as such should be as simple as possible to provide the most flexibility to you, the analyst. One of the common analysis patterns that you will come back to over and over again is to partition the entire data set and perform operations on each of the subgroups. In R this is a pretty easy process if you look into the `lapply()` function (and its relatives). This is such an important component that I'm going to spend a little time here to make sure you understand what I am doing. Once you get it, it will make your life tremendously more awesome!
The basic form of the various 'apply' functions is that you pass them some data and a function, and they apply that function to each part of the data. For lists (the *l* part in `lapply()`), each entry in the list is passed to the function in turn. The function itself can be one that is already available (like `length()` or `is.na()`) or it can be something you specify directly, on the fly. Here is an example looking at the number of samples in each 'Species' as partitioned above.
```{r}
lapply( clades, dim)
```
Here the `dim()` function returns the dimensions of the data.frame in each clade. All have the same number of columns but differ in the number of individuals (the rows). This is a trivial example (you could get the same per-group counts from `table(arapat$Species)`), but it shows the general pattern.
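The same pattern works with a function you define on the fly. Here is a minimal sketch on a made-up data frame, using the base R function `split()` as a stand-in for `partition()`; the data frame `toy` and its columns are hypothetical.

```{r, eval=FALSE}
# A made-up data set with a grouping column and a measurement
toy <- data.frame( Group = c("a","a","b","b","b"),
                   Value = c(1, 3, 2, 4, 6) )

# split() returns a list of data frames, one per level of Group
parts <- split( toy, toy$Group )

# Apply an anonymous function to each piece; the result is a list
means <- lapply( parts, function(d) mean(d$Value) )
unlist( means )

# sapply() is a relative of lapply() that simplifies the result to a vector
sapply( parts, nrow )
```

Whatever you can compute on one data frame you can compute on every partition this way, which is why the pattern shows up repeatedly in the rest of this chapter.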
### Plotting Populations
One of the key benefits to using an analysis environment such as R is that you can mash together functionality that you just can't get from a monolithic program. An example of this is plotting populations. If your data include spatial coordinates, you can use them to plot the location of your sites on a Google Maps tile. You *must* have your coordinates in decimal degree format, with west and south as **negative** decimals. This is the default for the Google Maps API. Moreover, if you name the columns of data "Longitude" and "Latitude", much of the spatial functionality in **gstudio** will be more transparent (if not, you have to specify the Longitude and Latitude names each time you use a function that needs them).
```{r, warning=FALSE,message=FALSE}
library(ggplot2)
library(ggrepel)
library(ggmap)
coords <- strata_coordinates( arapat )
map <- population_map(coords)
ggmap(map) + geom_point(aes(x=Longitude,y=Latitude), data=coords, size=2) + geom_label_repel(aes(x=Longitude,y=Latitude,label=Stratum),data=coords) + xlab("Longitude") + ylab("Latitude")
```
## Allele Frequencies
Objects holding allele frequencies are also easy to make in **gstudio**. Allele frequencies are just like every other kind of data and can be extracted from a `data.frame` containing *locus* objects using the function `frequencies()`.
### Single Locus
Grabbing allele frequencies is a fundamental task for any population genetic analysis and should be as easy as possible. Here are some examples of various ways to get allele frequency information using the *Araptus attenuatus* data set.
```{r}
freq.EF <- frequencies( arapat$EF )
class( freq.EF)
freq.EF
```
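To make clear what is being estimated, here is a sketch of the same calculation done by hand in base R. It assumes genotypes are written as colon-separated strings (the way *locus* objects print) and is not the implementation used by `frequencies()`; the genotype values are made up.

```{r, eval=FALSE}
# Made-up diploid genotypes, one of them missing
genotypes <- c("1:1", "1:2", "1:1", "2:2", NA)

# Drop missing genotypes and split the rest into individual allele copies
alleles <- unlist( strsplit( genotypes[ !is.na(genotypes) ], ":" ) )

# Count each allele and divide by the total number of allele copies
counts <- table( alleles )
freqs  <- data.frame( Allele    = names(counts),
                      Frequency = as.numeric(counts) / sum(counts) )
freqs
```

Missing genotypes are removed before counting, so the frequencies sum to one over the alleles that were actually observed.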
### Multilocus
The conversion of loci to a *data.frame* expands beyond the single locus. If you do not specify which locus to use, it will use all *locus* objects and add an additional column to the data frame (n.b., I only print out the first 10 rows to give you the idea).
```{r}
freqs.loci <- frequencies( arapat )
freqs.loci[1:10,]
```
### Substrata and Allele Frequencies
To complete the symmetry here, adding a stratum to the analysis provides yet another categorical variable upon which allele frequencies may be estimated. Here is an example looking at the "Cluster" strata in the data set and a partial printout of the results.
```{r}
freqs.strata <- frequencies( arapat, stratum="Cluster" )
freqs.strata[1:10,]
```
### Plotting Allele Frequencies
There are several ways you may want to graphically view the locus data and for convenience, the **gstudio** package provides some interfaces for nice plots using the **ggplot2** package.
Plotting a vector of loci will by default estimate the frequencies of each allele for graphical output. There are two different output modes for this (n.b., a pie chart by its nature can lead to inaccurate interpretations and most statisticians hate them).
```{r}
plot( arapat$MP20 )
plot( arapat$MP20, mode="pie")
```
You can also use the **ggplot2** routine `geom_locus()` to plot the frequencies:
```{r}
ggplot() + geom_locus( aes(x=MP20, fill=Cluster), data=arapat)
```
The frequencies across a collection of loci can be plotted just as easily (internally, this simple plot just turns the object into a data frame and then plots it). At times, examination of allele spectra can reveal blatant differences among the substrata of your data. For example, consider the following spectra for the loci MP20 and AML.
```{r}
f <- freqs.strata[ freqs.strata$Locus %in% c("MP20","AML"), ]
summary(f)
ggplot(f) + geom_frequencies(f) + facet_grid(Stratum~.) + theme(legend.position="none")
```
### Frequency Gradients
When you have many strata or you are conducting landscape-level analyses, it is often helpful to look at how allele frequencies change in relation to some variable other than stratum.
```{r}
baja <- arapat[ arapat$Species != "Mainland",]
```
The *EN* locus has a few different alleles but if we look at the frequencies of each, the first two dominate.
```{r}
plot( baja$EN )
```
Using just the first allele *01*, it is pretty easy to plot the strata frequency as a function of latitude using normal R approaches. To do this, one needs to:
1. Extract the *01* allele frequencies by population.
```{r}
freqs <- frequencies( baja, stratum="Population", loci="EN")
freq.01 <- freqs[ freqs$Allele == "01",]
```
2. Merge this *data.frame* with one containing the coordinates of the populations.
```{r}
coords <- strata_coordinates( baja )
df <- merge( freq.01, coords)
df[1:10,]
```
Now you can plot the frequencies as a linear plot (below you will see how to plot these along environmental gradients).
```{r}
ggplot( df, aes(x=Latitude,y=Frequency)) + geom_line(linetype=2) + geom_point(size=4)
```
This is interesting. Now, just to 'kick it up a notch' I'm going to look at the *Cluster* variable. This is from mtDNA and shows putatively cryptic species. I'm going to remake the plot above but color the points to indicate the presence of the 'SCBP-A' clade (perhaps another species). Below I add a new column of data to *df* and set it all to 'Baja'. Then I figure out which populations have 'SCBP-A' individuals in them and mark those as 'Cape'.
```{r}
df$Species <- "Baja"
pops.with.scbp <- as.character(unique(baja$Population[ baja$Cluster=="SCBP-A"]))
df$Species[ df$Stratum %in% pops.with.scbp ] <- "Cape"
```
Then plot it.
```{r}
ggplot( df, aes(x=Latitude,y=Frequency)) + geom_line(linetype=2) + geom_point(size=5, aes(color=Species) )
```
I think this leads to some interesting questions about the relationship between potential species differences, where species are gauged by mtDNA, in nuclear allelic diversity.
### Spatial Frequency Plots
It is also possible to plot the data in a spatial context. Here is an example of how to mix `ggplot()` and `ggmap()` data and I'll plot the locations as proportional in size to the allele frequency.
```{r, warning=FALSE,message=FALSE, error=FALSE}
map <- population_map(baja)
ggmap( map ) + geom_point( aes(x=Longitude, y=Latitude, size=Frequency), data=df)
```
There is also the option to make use of some pie charts. I know, pie charts suck and any statistician will tell you that they should probably not be used because they can be misleading, but here they are. For exploratory data analysis, they can be very insightful at times. Here is the frequency of alleles at the *enolase* locus in *Araptus*. Any spatial structuring catch your eye?
```{r,warning=FALSE,message=FALSE,fig.width=7,fig.height=7, eval=FALSE}
pies_on_map( arapat, stratum="Population", locus="EN")
```
Which will open a new browser window and produce a graph like the one below.
<iframe width="640" height="480" src="media/pies_on_map.html" frameborder="0" allowfullscreen></iframe>
Note the messages about the approximation. This is because the Google Maps API uses an integer zoom factor and at times it is not able to get all the points into the field of view using an integer zoom. If this happens to you, you can manually specify the *zoom* as an optional argument to either `pies_on_map()` or `population_map()`. You also need to be careful with the `pies_on_map()` function because the background tile is plotted first and then the pies are plotted on top of it. If you reshape your plot window away from equal x- and y-dimensions (e.g., make it a non-square figure), the spatial location of the pie charts will move! This is very frustrating, but it has to do with the way viewports are overlain onto graphical objects in R and I have absolutely **no** control over it. So, the best option is to make the plot square, export it as PNG or TIFF or whatever, then crop as necessary.
## Multivariate Analogs for Loci
Genotype data is inherently multivariate. In fact, it is multinomial multivariate *sensu stricto* but we generally ignore that. That being said, we can easily translate raw genotypes into raw multivariate encodings for other statistical analyses. Here is a quick example using a few individuals and the *WNT* locus.
```{r}
to_mv( arapat$WNT[1:10] )
to_mv( arapat$WNT[1:10], drop.allele=TRUE)
```
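One way to think about such an encoding is as a matrix with one row per individual and one column per allele, holding the number of copies of each allele. The base R sketch below builds that matrix from colon-separated genotype strings. It is an illustration under that assumption; the exact scaling and column handling in `to_mv()` may differ, and the genotypes are made up.

```{r, eval=FALSE}
# Made-up diploid genotypes, the last one missing
genotypes <- c("1:1", "1:2", "2:3", NA)

# The set of alleles observed across all non-missing genotypes
alleles <- sort( unique( unlist( strsplit( genotypes[ !is.na(genotypes) ], ":" ) ) ) )

# For each individual, count the copies of every allele (NA if missing)
mv <- t( vapply( genotypes, function(g) {
  if( is.na(g) ) return( rep( NA_real_, length(alleles) ) )
  parts <- strsplit( g, ":" )[[1]]
  vapply( alleles, function(a) as.numeric( sum( parts == a ) ), numeric(1),
          USE.NAMES = FALSE )
}, numeric( length(alleles) ), USE.NAMES = FALSE ) )
colnames(mv) <- alleles
mv

# Each complete row sums to the ploidy, so one column is redundant and can be dropped
mv[ , -ncol(mv), drop = FALSE ]
```

Dropping a column is the idea behind the `drop.allele=TRUE` call above: because the columns are linearly dependent, removing one avoids a singular covariance matrix in downstream analyses.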
For multiple loci, we can use the same approach. Here is an example of a PCA analysis done on raw genotypes.
```{r}
x <- to_mv( arapat, drop.allele=TRUE)
fit.pca <- princomp(x, cor=TRUE)
summary(fit.pca)
```
Interesting. It takes several eigenvectors to explain these data sufficiently. Here is a simple plot of part of the model, given by computing the predicted values for each sample. I then use `ggplot()` to make a scatter plot with species dictating the shape of the symbol and clade providing the color (each symbol is an individual, and clade was determined by mtDNA, not these data).
```{r}
pred <- predict( fit.pca)
df <- data.frame( PC1=pred[,1], PC2=pred[,2], Species=arapat$Species, Clade=arapat$Cluster, Pop=arapat$Population)
ggplot( df ) + geom_point( aes(x=PC1,y=PC2,shape=Species,color=Clade), size=3, alpha=0.75)
```
Looks like there are three main groups divided by clade and within the more dense clade, there is some sub-structuring. I'll take the data that is in the main clade and do a quick hierarchical clustering analysis.
```{r,fig.width=12,warning=FALSE}
baja <- pred[df$Species=="Peninsula",]
h <- hclust( dist( baja ), method="single")
plot(h,main="Main Baja California Clade", xlab="")
```
## Measures of Genetic Diversity
Genetic diversity is estimated by several different means. It can be estimated at several different levels; at individuals, at groups, at populations, etc. It can also be estimated by several different parameters. This section covers some of the more common parameters used for quantifying genetic diversity.
### Allelic Diversity
At the most basic level, the number of alleles within a group of individuals is a base measure of diversity. However, there are some caveats to be made about the way in which we count alleles. Rare alleles may or may not be as informative. The three ways commonly used to look at allelic diversity are
1. The total number of alleles ($A$).
2. The effective number of alleles ($A_e$).
3. The number of alleles with at least 5% frequency ($A_{95}$).
These parameters are estimated from your data using the `genetic_diversity()` function. The argument *mode* takes either "A" (the default), "Ae", or "A95" to differentiate.
```{r}
AA <- locus( c("A","A") )
AB <- locus( c("A","B") )
loci <- c(AA,AB,AA,AA,AA,AA,AA,AA,AA,AA,AA)
loci
genetic_diversity(loci)
genetic_diversity(loci,mode="A")
genetic_diversity(loci,mode="A95")
```
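The three measures are simple functions of the allele frequencies. The sketch below recomputes them by hand for the same eleven genotypes, written as plain strings, using the standard definition $A_e = 1 / \sum_i p_i^2$. It is meant to show the arithmetic and is not the code inside `genetic_diversity()`.

```{r, eval=FALSE}
# Ten A:A homozygotes and one A:B heterozygote, as plain strings
genotypes <- c( rep("A:A", 10), "A:B" )
alleles   <- unlist( strsplit( genotypes, ":" ) )

# Allele frequencies: A = 21/22, B = 1/22
p <- table( alleles ) / length( alleles )

A   <- length( p )        # total number of alleles
Ae  <- 1 / sum( p^2 )     # effective number of alleles
A95 <- sum( p >= 0.05 )   # alleles with frequency of at least 5%
c( A = A, Ae = Ae, A95 = A95 )
```

The rare *B* allele counts fully toward $A$, contributes very little to $A_e$, and is excluded from $A_{95}$ because its frequency (1/22) is below 5%.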
### Rarefaction
Rarefaction is a technique used to measure diversity in different populations. It is particularly important for situations where you have different sample sizes. Is there more diversity in the larger population because you sampled more or is it a truly more diverse population? I'll use the data from the beetle to show how diversity changes with sample sizes and highlight how you can use the `rarefaction()` function.
In the mainland populations, there are only 36 samples and the allelic diversity is relatively low at the WNT locus.
```{r}
loci.son <- arapat$WNT[ arapat$Species == "Mainland" ]
length( loci.son )
genetic_diversity( loci.son, mode="Ae")
```
The larger clade on the peninsula has many more individuals and is more diverse.
```{r}
loci.peninsula <- arapat$WNT[ arapat$Species == "Peninsula" ]
length( loci.peninsula )
genetic_diversity( loci.peninsula, mode="Ae")
```
So is this difference a consequence of the sample sizes or is peninsular Baja California more genetically diverse? To answer this, rarefaction randomly sub-samples the `loci.peninsula` data and estimates the value of $\hat{A}_e$ for samples of size 36 (for our case) to see if the observed diversity differences are due to sampling alone. To visualize the distribution, I throw it into a *data.frame* and use the **ggplot2** functions to make the pretty colored histogram.
```{r }
Ae.peninsula <- rarefaction(loci.peninsula, mode="Ae", size=36)
df <- data.frame( Ae.peninsula )
ggplot( df, aes(x=Ae.peninsula) ) + geom_histogram(aes(fill=..count..), binwidth=0.05) + scale_fill_gradient("count",low="#cccccc",high="#a60000") + theme_bw()
```
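The logic of rarefaction is easy to sketch in base R. The example below is an illustration only: it sub-samples allele copies rather than genotypes (a simplification), uses a simulated allele pool, and is not the implementation behind `rarefaction()`. The helper `effective_alleles()` and the allele labels are hypothetical.

```{r, eval=FALSE}
set.seed(42)   # make the random draws reproducible

# Effective number of alleles from a vector of allele copies
effective_alleles <- function( a ) {
  p <- table( a ) / length( a )
  1 / sum( p^2 )
}

# Simulated pool of 200 allele copies from a large sample
pool <- sample( c("01","02","03","04"), size = 200, replace = TRUE,
                prob = c(0.5, 0.3, 0.15, 0.05) )

# Repeatedly draw 72 allele copies (36 diploid individuals) without
# replacement and recompute the statistic on each sub-sample
rarefied <- replicate( 999, effective_alleles( sample( pool, size = 72 ) ) )

# Where does the bulk of the rarefied distribution fall?
quantile( rarefied, c(0.025, 0.5, 0.975) )
```

If the diversity observed in the smaller sample falls inside the bulk of the rarefied distribution, the difference between the two groups can be explained by sample size alone.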
### Heterozygosity
At a base level, heterozygosity is a form of diversity (see Nei). Heterozygosity can be measured at many different strata and in two forms. Both forms can be accessed through the functions `Ho()` (for observed heterozygosity)
```{r}
Ho( arapat$EF )
```
and `He()` (for expected heterozygosity given Hardy Weinberg Equilibrium).
```{r}
He( arapat$EF )
```
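Both quantities can be checked by hand. The sketch below assumes diploid genotypes written as colon-separated strings and uses the uncorrected estimator $H_E = 1 - \sum_i p_i^2$; the functions in **gstudio** may apply additional options (such as small-sample corrections) that are not shown here. The genotypes are made up.

```{r, eval=FALSE}
# Made-up diploid genotypes at a single locus
genotypes <- c("1:1", "1:2", "2:2", "1:2", "1:1", "1:1")
pairs     <- strsplit( genotypes, ":" )

# Observed heterozygosity: the fraction of individuals with two different alleles
Ho_hand <- mean( sapply( pairs, function(g) g[1] != g[2] ) )

# Expected heterozygosity under Hardy-Weinberg: one minus expected homozygosity
alleles <- unlist( pairs )
p       <- table( alleles ) / length( alleles )
He_hand <- 1 - sum( p^2 )

c( Ho = Ho_hand, He = He_hand )
```

Here two of six individuals are heterozygous, while the allele frequencies (2/3 and 1/3) predict a somewhat higher proportion under random mating.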
Both $H_E$ and $H_O$ can be used with full data sets as well. When you pass a full *data.frame* to these functions, it will return a *data.frame* with loci by row.
```{r}
He( arapat )
```
Given the broadness of these functions, it is easy to integrate them into broader analyses. Here is an example of expected heterozygosity ($H_E$ or genetic diversity in Nei's terms) as a function of latitude for the peninsular populations. The output is displayed as a plot.
```{r}
baja <- arapat[ arapat$Species != "Mainland",]
coords <- strata_coordinates(baja)
pops <- partition(baja,stratum="Population")
he <- lapply( pops, function(x) return(He(x$EN )) )
data <- merge( coords, data.frame( Stratum=names(he), He=unlist(he)))
data <- data[ order( data$Latitude), ]
ggplot( data, aes(x=Latitude, y=He)) + geom_line(linetype=2) + geom_point(color="red",size=4)
```
## Inbreeding
Inbreeding is a consequence of mating patterns and/or demographic population size. The consequences of inbreeding are related to how alleles are put into genotypes. One approach to looking at inbreeding is to estimate the expected frequency of heterozygotes (e.g., the $2pq$ part of the classic Hardy-Weinberg equation) and compare it to the observed frequency of heterozygotes. This is the classic $F$ statistic and is estimated as:
\[
F_{IS} = 1 - \frac{H_{O}}{H_{E}}
\]
Using the beetle data again, from the maps above we see three mainland populations and there is a good reason to believe that these are a separate species (see Garrick _et al._ 2013). These populations are small and isolated and as such may experience inbreeding.
```{r}
sonora <- arapat[ arapat$Species == "Mainland",]
fis.sonora <- Fis( sonora )
fis.sonora
```
There are various ways to get a confidence interval on these kinds of analyses. What follows is an implicit test of $H_0: F_{IS} = 0$ using permutation. If that null hypothesis is true, then random permutations of alleles (combined into genotypes) sampled from this population should often produce estimates of $F_{IS}$ at least as large as that observed. This kind of permutation is handled by the function `permute_ci()` (though it can be applied to more complicated analyses as shown below). Here is an example of how to use it to create the null distribution of $F_{IS}$ values given these data for the loci $EN$, $AML$, and $ATPS$.
```{r,warning=FALSE,message=FALSE}
# Drop missing genotypes from each locus before permuting
locus.en <- sonora$EN[ !is.na(sonora$EN) ]
locus.aml <- sonora$AML[ !is.na(sonora$AML) ]
locus.atps <- sonora$ATPS[ !is.na(sonora$ATPS) ]
fis.en <- permute_ci( locus.en , FUN=Fis, nperm=99 )
fis.aml <- permute_ci( locus.aml , FUN=Fis, nperm=99 )
fis.atps <- permute_ci( locus.atps , FUN=Fis, nperm=99 )
```
Now we can plot these as histograms to look at their distributions.
```{r}
df <- data.frame( Locus=rep(c("EN","AML","ATPS"), each=length(fis.en)), Fis=c(fis.en,fis.aml,fis.atps))
ggplot( df, aes(x=Fis)) + geom_histogram(binwidth=0.025) + facet_grid( Locus~. ) + theme_bw()
```
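The logic of the permutation can be sketched in base R. This is only an illustration of the idea and not the code behind `permute_ci()`: the genotypes are made up, the names `toy_a`, `toy_fis()`, and `toy_null` are hypothetical, and $H_E$ is the uncorrected $1-\sum p_i^2$.

```{r}
set.seed(42)   # make the shuffles reproducible

# Made-up sample: 20 individuals x 2 allele copies, with an excess of homozygotes
toy_a <- matrix(c(rep(c("A", "A"), 9), rep(c("B", "B"), 7), rep(c("A", "B"), 4)),
                ncol = 2, byrow = TRUE)

# F_IS = 1 - Ho / He for an individuals-by-2 matrix of alleles
toy_fis <- function(a) {
  ho <- mean(a[, 1] != a[, 2])
  p  <- table(as.vector(a)) / length(a)
  1 - ho / (1 - sum(p^2))
}

# Null distribution: shuffle all allele copies and re-pair them at random,
# which keeps allele frequencies fixed but breaks up the observed genotypes
toy_null <- replicate(999, toy_fis(matrix(sample(toy_a), ncol = 2)))

toy_obs <- toy_fis(toy_a)
# One-sided permutation p-value for an excess of homozygotes
toy_pval <- (sum(toy_null >= toy_obs) + 1) / (length(toy_null) + 1)
c(Fis = toy_obs, p = toy_pval)
```

Adding one to both the numerator and the denominator counts the observed value as one of the permutations, so the p-value can never be exactly zero.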
Estimating a confidence interval around a point estimate is a bit different. In the example above, we could ask if $F_{IS,EN}=$ `r fis.sonora$Fis[3]` was different from $F_{IS,ATPS}=$ `r fis.sonora$Fis[7]`, which is an entirely different question than the one addressing $H_0: F_{IS} = 0$. That being said, it is not too difficult to do this given the tools you have in R.
## Measures of Genetic Distance
There are several genetic distances available within the **gstudio** package, and the ability to use multivariate analogs of genotypes opens up essentially all distance metrics to the end user. What follows are the distances that are internally implemented in the package, along with a brief overview.
Since all the genetic distance approaches take the same general data, **gstudio** provides a general interface for all distance metrics in the `genetic_distance()` function. This function takes the data, as either a data frame with *locus* objects or a vector of *locus* objects, along with the genetic distance metric to be estimated, and returns the appropriate response. The general form of this function is as follows; usage differs only in that individual genetic distances do not require a *stratum* whereas strata distances do. However, as with all functions in **gstudio** that need a stratum, if the name of that column is "Population" you do not need to provide it.
```{r,eval=FALSE}
amova.dist <- genetic_distance(data,mode="AMOVA")
nei.dist <- genetic_distance( data, stratum="Population", mode="Nei")
```
What follows is a more in-depth overview of each of the genetic distance metrics.
### Individual Distances
At the base level, you can estimate distances among individuals, resulting in an $N \times N$ matrix of pair-wise distances. This is internally how the AMOVA analysis is conducted and is a nice heuristic for conceptual understanding of variance decomposition.
In the following examples, I will use a made-up locus with four alleles and one individual of each possible genotype to show how these distances work. Here are the data:
```{r}
AA <- locus( c("A","A") )
AB <- locus( c("A","B") )
BB <- locus( c("B","B") )
AC <- locus( c("A","C") )
AD <- locus( c("A","D") )
BC <- locus( c("B","C") )
BD <- locus( c("B","D") )
CC <- locus( c("C","C") )
CD <- locus( c("C","D") )
DD <- locus( c("D","D") )
loci <- c(AA,AB,AC,AD,BB,BC,BD,CC,CD,DD)
```
#### AMOVA Distance
The AMOVA distance metric was first introduced in Excoffier _et al._ (1992) using restriction fragment encodings. However, a more elegant description of it was given in Smouse & Peakall (1999) using a geometric interpretation. Essentially, the coding of alleles at a locus can be depicted by a vector $\vec{p}$ whose length is equal to the number of alleles at the locus. The presence of an allele increments the appropriate element of that vector. For two individuals, the squared vector distance between them is:
\[
\delta_{ij}^2 = 2(p_i-p_j)^2
\]
Across loci, these are additive (though see Smouse & Peakall for additional weighting schemes by locus or allele and the relative power of adopting such approaches). Across a set of individuals, the squared genetic distance between all individuals can be represented by a square symmetric matrix with a zero diagonal, *D*. The AMOVA analysis itself is conducted by taking the "Sums of Squared Genetic Distances" for all individuals (SSGD(Total)), within groups (SSW), and among groups (SSA), and variance is decomposed following the standard approach for a random effects model. There is essentially no mystery in this approach, though it has been shrouded in obscurity. Dyer _et al._ (2004) showed how this is just a multivariate linear model amenable to a much broader range of experimental designs than just 1-way and nested designs.
```{r}
D <- dist_amova( loci )
rownames(D) <- colnames(D) <- as.character(loci)
D
```
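As a hedged illustration of the formula above, the sketch below encodes genotypes as vectors of within-individual allele frequencies (0, 0.5, or 1) and applies $2(p_i-p_j)^2$ summed over alleles. Treating $\vec{p}$ as frequencies rather than raw allele counts is an assumption made here, and the names `toy_allele_names`, `toy_freq()`, and `toy_amova()` are hypothetical; none of this is taken from the internals of `dist_amova()`.

```{r}
toy_allele_names <- c("A", "B", "C", "D")

# Turn a two-allele genotype into a frequency vector over all four alleles
toy_freq <- function(genotype) {
  as.vector(table(factor(genotype, levels = toy_allele_names))) / 2
}

# Squared distance between two genotypes: 2 * sum over alleles of (p_i - p_j)^2
toy_amova <- function(g1, g2) 2 * sum((toy_freq(g1) - toy_freq(g2))^2)

toy_amova(c("A", "A"), c("A", "B"))   # share one allele copy
toy_amova(c("A", "B"), c("C", "D"))   # two heterozygotes, nothing shared
toy_amova(c("A", "A"), c("B", "B"))   # two different homozygotes
```

Under this encoding two different homozygotes are the most distant pair, while two heterozygotes with no alleles in common sit halfway between identical and maximally different.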
#### Bray-Curtis Individual Distance
Bray-Curtis Distance (Bray & Curtis 1957) has been primarily used to quantify differences in species composition. It is defined as the total number of species that are unique to either of the two sites standardized by the number of species in both sites.
\[
BC_\delta = \frac{S_i + S_j - 2S_{ij}}{S_i+S_j}
\]
where $S_i$ and $S_j$ are the total counts at each of the two sites and $S_{ij}$ is the sum of the minimum abundances of the species they share. Lately, this has seen considerable use within individual-based landscape genetic studies. Missing genotypes are set to average allele frequencies, that is to say that every missing genotype is considered to have all the alleles present in the entire population, but with probability equal to their global frequencies. Essentially, this removes the `NA` problem, as in the `mode="Jaccard"` situation, and does so by taking the non-missing genotype's genetic distance from the global genetic centroid (it's cosmic man!). Here is the estimation using the example genotypes defined above.
```{r}
D <- dist_bray( loci )
rownames(D) <- colnames(D) <- as.character(loci)
D
```
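The formula can be worked through by hand in base R by treating each allele as a "species" and its copy number within an individual as the abundance. That mapping is an assumption made for illustration, the names `toy_bc_alleles`, `toy_counts()`, and `toy_bray()` are hypothetical, and the sketch does not handle missing genotypes.

```{r}
toy_bc_alleles <- c("A", "B", "C", "D")

# Allele counts for one diploid genotype (each individual sums to 2)
toy_counts <- function(genotype) {
  as.vector(table(factor(genotype, levels = toy_bc_alleles)))
}

# Bray-Curtis: (S_i + S_j - 2 * S_ij) / (S_i + S_j), S_ij = sum of minimum counts
toy_bray <- function(g1, g2) {
  ci <- toy_counts(g1)
  cj <- toy_counts(g2)
  (sum(ci) + sum(cj) - 2 * sum(pmin(ci, cj))) / (sum(ci) + sum(cj))
}

toy_bray(c("A", "B"), c("A", "B"))   # identical genotypes
toy_bray(c("A", "A"), c("A", "B"))   # one shared allele copy
toy_bray(c("A", "A"), c("B", "B"))   # nothing shared
```

The three calls span the range of the metric: zero for identical genotypes, one half when a single allele copy is shared, and one when nothing is shared.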
### Strata Distances
Genetic distances can also be estimated among partitioned groups of individuals. Here, I will use the data from the mainland Sonoran beetle data set using locus $ATPS$ and the $Population$ stratum as it is small enough to easily display all the results.
```{r}
data <- arapat[arapat$Species == "Mainland",c(3,13)]
```
It should be noted that for some of these metrics, we need to make assumptions about how to represent them as true 'multilocus' estimators. Where such an assumption is made, a warning message is displayed to remind the user.
#### Euclidean Distance
Euclidean distance is the most straightforward distance metric available as it is essentially straight-line distance based upon the allele frequencies in each population. It is given by:
\[
d_{eucl} = \sqrt{ \sum_{j=1}^L(p_{ij} - p_{kj})^2 }
\]
where $p_{ij}$ and $p_{kj}$ are the frequencies of the $j^{th}$ allele in the $i^{th}$ and $k^{th}$ populations, and the sum runs over all $L$ alleles. In this and the following distance examples, I am going to take the resulting distance matrix among all pairs of populations and put them into a Neighbor joining tree (via the `nj()` function from the **ape** package) as it may be easier to see differences in topologies rather than matrices.
It is perhaps easiest to think of Euclidean distance in x,y coordinate space. This distance can be estimated with `dist_euclidean()`, or through `genetic_distance()` using *mode="Euclidean"*, and it returns the pairwise distances among strata.
```{r}
dist_euclidean( data )
```
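For intuition, the same calculation can be done on a small table of made-up allele frequencies with base R's `dist()`. The population labels and frequencies below are hypothetical, and this sketch covers a single locus only.

```{r}
# Made-up allele frequencies at one locus (rows = populations, columns = alleles)
toy_pop_freq <- rbind(PopA = c(0.7, 0.2, 0.1),
                      PopB = c(0.5, 0.5, 0.0),
                      PopC = c(0.0, 0.1, 0.9))

# Base R's dist() computes sqrt(sum((p_i - p_k)^2)) for each pair of rows
toy_euc <- dist(toy_pop_freq, method = "euclidean")
toy_euc

# The same value for one pair, written out explicitly
sqrt(sum((toy_pop_freq["PopA", ] - toy_pop_freq["PopB", ])^2))
```

`PopA` and `PopB` share their common alleles and are close together, whereas `PopC` is dominated by a different allele and is far from both.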
#### Cavalli-Sforza Distance
Another distance approach that is commonly used for microsatellite loci is Cavalli-Sforza distance, $D_C$ (Cavalli-Sforza and Edwards, 1967). Here population allele frequencies are plotted on the surface of a sphere (radius=1) using the square root of the allele frequencies.
\[
D_C = \frac{2}{\pi}\sqrt{(2-2\cos\theta)}
\]
The genetic distance, $D_C$, is measured as the chord distance as indicated in Figure. The resulting Neighbor joining tree from this distance is shown in Figure \ref{fig:cavalli_dist}. Because $\cos\theta = \sum_{i=1}^{\ell} \sqrt{p_{m,i}p_{n,i}}$ for populations $m$ and $n$ with $\ell$ alleles, this can be written in terms of allele frequencies as:
\[
D_{C,m,n} = \frac{2}{\pi}\sqrt{ 2 \left( 1-\sum_{i=1}^{\ell} \sqrt{p_{m,i}p_{n,i}} \right) }
\]
```{r}
dist_cavalli( data )
```
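A minimal base R sketch of the chord distance follows, taking $\cos\theta$ as the sum over alleles of the square root of the product of the two populations' frequencies. The frequencies and the names `toy_pm`, `toy_pn`, and `toy_chord()` are made up for illustration, and the sketch handles a single locus only.

```{r}
# Made-up allele frequencies for two populations at one locus
toy_pm <- c(0.5, 0.25, 0.25)
toy_pn <- c(0.25, 0.25, 0.5)

# Chord distance scaled by 2 / pi; max() guards against tiny negative
# values caused by floating-point rounding when the populations are identical
toy_chord <- function(p, q) {
  (2 / pi) * sqrt(2 * max(0, 1 - sum(sqrt(p * q))))
}

toy_chord(toy_pm, toy_pn)
toy_chord(toy_pm, toy_pm)      # identical populations
toy_chord(c(1, 0), c(0, 1))    # no shared alleles: the maximum, 2 * sqrt(2) / pi
```

Note that even populations with no alleles in common are less than one unit apart, because the chord is scaled by $2/\pi$.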
#### The $D_{PS}$ Distance
The Bray-Curtis distance can also be estimated on groups of individuals. However, when it is done it is often represented as $D_{PS} = (1 - P_S)$ (where $P_S$ is the proportion of shared alleles), which I will follow here for simplicity such that the individual distances and the strata distances are not confused. The $D_{PS}$ distance metric is directly related to Jaccard's distance as:
\[
D_{PS} = \frac{-J_{\delta(A,B)}}{J_{\delta(A,B)}-2}
\]
```{r}
dist_bray( data )
```
It should be noted here that the function `dist_bray()` is used for both individual distance AND strata distance depending upon what you pass to it. An individual distance is found by passing it a vector of loci whereas a stratum distance is returned by passing it a *data.frame* object (that has a *stratum* column). In the `genetic_distance()` function these are also differentiated by *mode="Bray"* for the individual one and *mode="Dps"* for the stratum.
#### Jaccard Distance
Jaccard distance is a set-theoretic distance quantifying dissimilarity. Assuming that loci are sets of alleles, the Jaccard dissimilarity between genotypes $A$ and $B$ is given by:
\[
J_{\delta(A,B)} = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}
\]
```{r}
dist_jaccard( data )
```
The reverse relationship between $D_{PS}$ and $J_{\delta(A,B)}$ is given as:
\[
J_{\delta(A,B)} = \frac{2D_{PS}}{1 + D_{PS}}
\]
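Solving $D_{PS} = J_{\delta}/(2 - J_{\delta})$ for $J_{\delta}$ gives $J_{\delta} = 2D_{PS}/(1 + D_{PS})$, and a quick numerical round trip in base R confirms that the two conversions undo each other. The function names `toy_j_to_dps()` and `toy_dps_to_j()` are hypothetical helpers and are not part of **gstudio**.

```{r}
# Conversions implied by D_PS = J / (2 - J); solving for J gives J = 2 D / (1 + D)
toy_j_to_dps <- function(j) j / (2 - j)
toy_dps_to_j <- function(d) 2 * d / (1 + d)

toy_j <- c(0, 0.25, 0.5, 0.75, 1)
toy_d <- toy_j_to_dps(toy_j)
toy_d

# Round trip: converting back should recover the original Jaccard values
all.equal(toy_dps_to_j(toy_d), toy_j)
```

Both metrics agree at 0 and 1, but in between $D_{PS}$ is always the smaller of the two.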
#### Conditional Genetic Distance (cGD; Graph Distance)
Conditional genetic distance, *cGD*, is a measure based upon conditional genetic covariance and is distinctly different from these other measures as it is not a pair-wise distance metric. Rather, it is the distance through a *Population Graph* topology whose construction is determined by the totality of the data. To estimate *cGD* from your data, you can use the `genetic_distance()` function as before and it will do the right thing and return you a matrix. However, you should probably look into what a *Population Graph* really is prior to using it. You will find more information below as well as in the documents for the **popgraph** package itself. For consistency, the function is shown below BUT this data set with 3 populations is too small to be of interest for a network analysis (how many ways can you connect 3 things...)
```{r,message=FALSE,warning=FALSE}
dist_cgd( data )
```
#### Nei's Genetic Distance
A very common metric of genetic distance, if you think the data you have are due to drift/mutation balance, is that of Nei. The implementation in **gstudio** of Nei's distance is based upon the sample size correction from 1978, and is calculated as:
\[
I = \frac{(2N-1)\sum_{l=1}^L\sum_{m=1}^M p_{Alm}p_{Blm}}{\sqrt{ \sum_{l=1}^L(2N\sum_{m=1}^{M_A}p_{Alm}^2-1)(2N\sum_{m=1}^{M_B}p_{Blm}^2 -1)}}
\]
for the "Genetic Identity" (where $p_A$ are the allele frequencies in one population and $p_B$ are the corresponding frequencies in the other, $L$ is across loci and $M$ is across alleles at the $l^{th}$ locus).
Nei's (1978) genetic distance, $D_N$, is:
\[
D_N = -\ln(I)
\]
```{r}
dist_nei( data )
```
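For a single locus the identity above reduces to a ratio that is easy to evaluate by hand. The sketch below is a base R illustration under the assumption that both populations have the same sample size $N$ (diploid individuals); the function name `toy_nei()` and the frequencies are hypothetical, and it is not the code used by `dist_nei()`.

```{r}
# Nei's (1978) distance for a single locus, assuming the same sample size N
# (number of diploid individuals) in both populations
toy_nei <- function(pa, pb, N) {
  num <- (2 * N - 1) * sum(pa * pb)
  den <- sqrt((2 * N * sum(pa^2) - 1) * (2 * N * sum(pb^2) - 1))
  -log(num / den)
}

toy_nei(c(0.6, 0.4, 0.0), c(0.3, 0.5, 0.2), N = 25)   # some alleles shared
toy_nei(c(1, 0), c(0, 1), N = 25)                     # nothing shared: Inf
```

The second call shows why populations with no alleles in common end up infinitely far apart: the identity is zero, and the logarithm of zero is unbounded.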
#### Comparing strata distance metrics
There are several more genetic distance metrics available and each may have its own set of assumptions. However, it is also true that across these metrics, there is some similarity between them. Just for illustrative purposes, we'll look at the various strata distances and plot them against each other to look at the correlative structure between alternative measures using the $ATPS$ and $MP20$ loci (columns 13 and 14) from the entire beetle data set.
```{r}
x <- arapat[ , c(3,13,14) ]
summary(x)
```
Now, we'll grab the distance metrics
```{r warning=FALSE,message=FALSE}
dist.euc <- genetic_distance( x, mode="Euclidean")
dist.cgd <- genetic_distance( x, mode="cGD")
dist.nei <- genetic_distance( x, mode="Nei")
dist.dps <- genetic_distance( x, mode="Dps")
dist.jac <- genetic_distance( x, mode="Jaccard")
```
and then take the upper triangle of each and put them into a *data.frame*.
```{r}
df <- data.frame( Euclidean= dist.euc[ upper.tri(dist.euc) ] )
df$cGD <- dist.cgd[ upper.tri(dist.cgd) ]
df$Nei <- dist.nei[ upper.tri(dist.nei) ]
df$Dps <- dist.dps[ upper.tri(dist.dps) ]
df$Jaccard <- dist.jac[ upper.tri(dist.jac) ]
```
Before we plot them, there is a bit of cleaning up to do in these data. For populations that have no alleles in common, Nei's genetic distance will be $Inf$ (i.e., $-\ln(0)$). Also, with cGD, sets of populations that are independent will have an infinite distance (i.e., they are not connected, so it is impossible to go through the graph from one population to the other). So with these data, we should first remove the infinite values, and then we can plot the distances against each other and look at their correlations.
```{r message=FALSE, warning=FALSE}
df <- df[ is.finite(df$Nei), ]
df <- df[ is.finite(df$cGD), ]
library(GGally)
ggpairs( df )
```
As you can see, there is a great deal of correlation between these parameters.
## Genetic Structure
The estimation of structure from genetic data is a common (and commonly misused) endeavor. The **gstudio** package makes a distinction between structural parameters and statistical differences. Structural parameters are the various $X_{ST}$ statistics that crop up all too quickly. These are simply parameters of the data and are not *sensu stricto* measures of differentiation. To quote Sewall Wright (1978), "..." All of these parameters are based upon assumptions related to population genetic processes. Statistical differences are those analyses we can do on genetic data that test a specific hypothesis that is not based upon population genetic understanding, but rather on the simple properties of multinomial multivariate data.
### Structural Parameters
Since first introduced by Sewall Wright as F-statistics, there has been a continued development of related parameters that are used to characterize 'population structure' in one way or another. The parameters that **gstudio** provides are sufficient for most needs. These include:
- $G_{ST}$: This is Nei's parameter.
- $G_{ST}^\prime$: This is the modification of Nei's $G_{ST}$ as proposed by Hedrick.
- $D_{est}$: This is the parameter of Jost.
Both $G_{ST}^\prime$ and $D_{est}$ are parameters derived for loci with lots of alleles. There was a heated debate in the literature between Hedrick & Jost about issues related to Nei's $G_{ST}$ for loci with many alleles (say >6 to be conservative) and these are two options available for your use. I would recommend looking over the debate to decide which may be more appropriate for you (using them both is a lame option and you will be mocked by your reviewers if you take that approach).
Here are some examples of how to estimate these parameters using the beetle data. In all of these approaches, you pass both the stratum and the locus and they return a data frame.
As in the case for genetic distance measures, structure parameters can also be estimated using either the individual structure functions OR the generalized function `genetic_structure()` with the appropriate options. The benefit of using `genetic_structure()` is that it allows you to do `pairwise` analyses (it returns a matrix of pairwise structure or a list of pairwise matrices, one for each locus).
#### Nei's $G_{ST}$ Parameter
```{r}
Gst( arapat$LTRS, arapat$Population, nperm=99 )
```
If you pass several loci in a *data.frame*, it will return the single-locus estimates as well as the multilocus estimate (based upon summing the heterozygosity and then estimating it as in Nei, not by just averaging the $G_{ST}$ values as in Berg & Hamrick (XXXX), see XXXX for more on the differences).
```{r}
Gst(arapat[,c(3,7:14)], nperm=99)
```
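To show what sits behind the number, here is a minimal base R sketch of $G_{ST} = (H_T - H_S)/H_T$ for a single locus, where $H_S$ is the mean expected heterozygosity within strata and $H_T$ is the expected heterozygosity of the pooled allele frequencies. The frequencies and the names `toy_strata`, `toy_hs`, `toy_ht`, and `toy_gst` are made up; the sketch weights strata equally and applies no sample-size correction, so it is not a substitute for `Gst()`.

```{r}
# Made-up allele frequencies at one locus (rows = strata, columns = alleles)
toy_strata <- rbind(S1 = c(0.9, 0.1),
                    S2 = c(0.5, 0.5),
                    S3 = c(0.1, 0.9))

toy_hs <- mean(1 - rowSums(toy_strata^2))     # average within-stratum He
toy_ht <- 1 - sum(colMeans(toy_strata)^2)     # He of the pooled frequencies
toy_gst <- (toy_ht - toy_hs) / toy_ht
c(Hs = toy_hs, Ht = toy_ht, Gst = toy_gst)
```

Because the two outer strata are nearly fixed for opposite alleles, much of the total diversity lies among strata rather than within them.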
#### Hedrick's $G_{ST}^\prime$ Parameter
A correction to Nei's $G_{ST}$ was suggested for loci with a lot of alleles. This is because the maximum value for expected heterozygosity is determined by the number of alleles and as such $G_{ST}$ for high allelic loci is not bound on the interval $[0,1]$ but is maxed out below 1.0. As a consequence, Hedrick proposed standardizing $G_{ST}$ by the maximum value it can attain given the observed within-stratum heterozygosity.
```{r}
sort(unique(matrix( alleles(arapat$MP20), ncol=1)))
Gst_prime(arapat$MP20, arapat$Population, nperm=99 )
```
In a similar fashion, the multilocus analog can be found by passing a *data.frame* to the function (again I am skipping the *stratum* variable to this function as the data has the strata in a column named 'Population').
```{r}
Gst_prime(arapat[,c(3,7:14)], nperm=99)
```
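The ceiling on $G_{ST}$ can be made explicit with a short base R sketch. It uses the bound $G_{ST(max)} = (k-1)(1-H_S)/(k-1+H_S)$ for $k$ strata, which follows from the case where no alleles are shared among strata. The function names `toy_gst_max()` and `toy_gst_prime()` and the input values are hypothetical, and the sketch ignores any sampling corrections that `Gst_prime()` may apply.

```{r}
# Upper bound of G_ST for k strata with mean within-stratum heterozygosity hs
toy_gst_max <- function(hs, k) (k - 1) * (1 - hs) / (k - 1 + hs)

# The bound shrinks as within-stratum diversity grows (here k = 10 strata)
toy_gst_max(hs = c(0.2, 0.5, 0.8, 0.95), k = 10)

# Standardized value: the observed G_ST as a fraction of its upper bound
toy_gst_prime <- function(gst, hs, k) gst / toy_gst_max(hs, k)
toy_gst_prime(gst = 0.10, hs = 0.8, k = 10)
```

With highly variable loci, a seemingly modest $G_{ST}$ can therefore represent a large share of the differentiation that was possible.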
#### Jost's $D_{est}$ Parameter
Following a back-and-forth discussion between Hedrick & Jost, Jost proposed an alternative measure, $D_{est}$. This parameter is estimated in the same way as the other structure parameters.
```{r}
Dest( arapat$MP20, arapat$Population, nperm=99 )
```
```{r}
Dest( arapat[,c(3,7:14)], nperm=99)
```
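For comparison with $G_{ST}$, the underlying parameter can be written as $D = \frac{H_T - H_S}{1 - H_S}\cdot\frac{k}{k-1}$ for $k$ strata. The base R sketch below evaluates that expression on made-up heterozygosities; the function name `toy_d_param()` is hypothetical, and the sketch omits the bias corrections that distinguish the estimator $D_{est}$ from the raw parameter.

```{r}
# D from total (ht) and mean within-stratum (hs) heterozygosity, k strata
toy_d_param <- function(ht, hs, k) ((ht - hs) / (1 - hs)) * (k / (k - 1))

# Two strata fixed for different alleles: complete differentiation
toy_d_param(ht = 0.5, hs = 0, k = 2)

# Two strata that share no alleles, each with five equally frequent alleles
toy_d_param(ht = 0.9, hs = 0.8, k = 2)
(0.9 - 0.8) / 0.9    # G_ST for the same case
```

In the second case the strata share nothing, yet $G_{ST}$ stays close to 0.11 while $D$ reaches its maximum, which is the behavior at the center of the debate.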
#### Similarities in Parameters
As in genetic distance metrics, there is some similarity in output from these structure parameters. Here is a paired plot of the three parameters as above.
```{r}
gst <- Gst( arapat )$Gst
gstp <- Gst_prime( arapat )$Gst
dest <- Dest( arapat )$Dest
df <- data.frame( Gst=gst, Gst_prime=gstp, Dest=dest )
library(GGally)
ggpairs( df )
```
You can tell from these plots which locus has a lot of alleles and which ones do not...