---
title: "Exploring OpenSNP"
author: "Jeremy Leipzig"
date: "3/23/2016"
output:
  html_document:
    toc: true
    toc_float: true
---
<a href="https://github.com/leipzig/INFO812_opensnp"><img style="position: absolute; top: 0; right: 0; border: 0;" src="https://camo.githubusercontent.com/652c5b9acfaddf3a9c326fa6bde407b87f7be0f4/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f6769746875622f726962626f6e732f666f726b6d655f72696768745f6f72616e67655f6666373630302e706e67" alt="Fork me on GitHub" data-canonical-src="https://s3.amazonaws.com/github/ribbons/forkme_right_orange_ff7600.png"></a>
# Introduction
Genome-wide association studies (GWAS) are designed to look for genetic markers, typically single nucleotide polymorphisms (SNPs), in microarray or sequencing data that are associated with phenotypes – diseases or other physical characteristics. OpenSNP (https://opensnp.org) is a community “crowdsourced” project in which ordinary people can submit genotypes obtained from commercial direct-to-consumer genetic testing providers such as 23andMe, along with whatever phenotypic descriptors – both physical (eye color, height) and behavioral (disposition, preferences) – they choose to reveal.
Can we validate published genome-wide association studies of common human phenotypes using the relatively small volunteered data publicly available in OpenSNP?
Because this report is for an introductory statistics class, I will use a different method for each of the following analyses of three physical traits (height, lactose intolerance, and eye color):
| trait | design | method |
|:--------|--------|--------:|
| Height by sex | One categorical independent variable, one continuous dependent variable | t-test |
| Height by gene+sex | Two categorical independent variables, one continuous dependent variable | Two factor factorial ANOVA |
| Lactose intolerance | Two categorical independent variables, one binary categorical dependent variable | Chi-square |
| Eye color | 6 ordinal covariates, one categorical dependent variable | Logistic regression |
```{r libs, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
library(foreign)
library(ggplot2)
library(ggfortify)
library(stringr)
library(forecast)
library(datamart)
library(lawstat)
library(nortest)
library(genetics)
library(ggthemes)
library(car)
library(gridExtra)
library(dplyr)
```
# Height
## Are men taller than women?
This question has puzzled mankind for centuries. But now we can use our sample sets of 153 females and 227 males with height information to see if this is the case.
The t-test has the following assumptions:
* Each of the two populations being compared should follow a normal distribution
* The two populations being compared should have the same variance
## Data cleaning
OpenSNP data uses free-form text fields for phenotypic entry. The `sex` field is pretty tame, but the `height` entries use a disorderly mix of English and metric units and some fixed ranges. Here I use regexes to attempt to convert everything to centimeters.
```{r munge, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
scrubSex<-function(x){
  if(str_detect(x,"[Ff]emale|[Ww]oman")){
    return("Female")
  }
  if(str_detect(x,"[Mm]ale|[Mm]an")){
    return("Male")
  }
  return(NA_character_)
}
convertHeight<-function(x){
  if(is.na(x)){
    return(NA_integer_)
  }
  if(str_detect(x,"Average \\( 165cm < x < 180cm \\)")){
    set.seed(2)
    return(round(((165+180)+runif(1, -8, 8))/2,2))
  }
  is_cm<-str_match(x,"([0-9.]+)\\s?cm")
  if(!is.na(is_cm[1])){
    return(as.numeric(unlist(is_cm[2])[1]))
  }
  x<-str_replace(x,"''",'"')
  #feet inches
  if(str_detect(x,"[\"']")[1]){
    x<-str_replace(x,'1/2','.5')
    x<-str_replace(x,'3/4','.75')
    x<-str_replace_all(x,' ','')
    x<-str_replace(x,"[^0-9]+$","")
    inches<-sapply(strsplit(as.character(x),"'|\"|`"),function(y){12*as.numeric(y[1]) + as.numeric(y[2])})
    cm<-uconv(inches, "in", "cm", uset = "Length")
    return(as.numeric(cm))
  }
  return(NA_integer_)
}
#this breaks up some of clustering of heights created by the form checkboxes
wiggle<-function(metric){
  if(is.na(metric)){
    return(NA_integer_)
  }
  if(table(phenotypes$MetricHeight)[as.character(metric)]>1){
    if(metric<160){
      #some female subjects highly clustered around 5'1"
      return(round(metric+runif(1,-4,-1),2))
    }
    return(round(metric+runif(1,-3,3),2))
  }
  return(metric)
}
```
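
To make the feet-and-inches branch of the conversion concrete, here is a minimal base-R sketch. The helper `feet_inches_to_cm` is hypothetical (it is not the `convertHeight` function above and handles far fewer input formats); it only illustrates the arithmetic of 12 inches per foot and 2.54 cm per inch.

```r
# Hypothetical helper for illustration only: handles strings such as 5'10" or 6'
feet_inches_to_cm <- function(x) {
  # capture the feet and the (optional) inches
  m <- regmatches(x, regexec("^[[:space:]]*([0-9]+)'[[:space:]]*([0-9.]*)", x))[[1]]
  if (length(m) == 0) {
    return(NA_real_)  # no feet marker found
  }
  feet <- as.numeric(m[2])
  inches <- if (nzchar(m[3])) as.numeric(m[3]) else 0
  round((12 * feet + inches) * 2.54, 2)
}

feet_inches_to_cm("5'10\"")  # 70 inches
feet_inches_to_cm("6'")      # 72 inches
feet_inches_to_cm("tall")    # not parseable, gives NA
```

Free-text entries that match neither pattern are returned as `NA` so that they drop out of later analyses instead of being guessed at.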
## Descriptive statistics
```{r height, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
read.csv("phenotypes_201602180000.csv",sep = ';') %>% distinct -> phenotypes
phenotypes$SexScrubbed<-as.factor(unlist(sapply(phenotypes$Sex,scrubSex)))
phenotypes$MetricHeight<-sapply(phenotypes$Height,convertHeight)
phenotypes$MetricHeight<-sapply(phenotypes$MetricHeight,wiggle)
phenotypes %>% filter(SexScrubbed=='Male' | SexScrubbed=='Female') %>% filter(!is.na(MetricHeight)) %>% dplyr::select(MetricHeight,SexScrubbed) -> heights
heights %>% group_by(SexScrubbed) %>% dplyr::summarize(subjects=n(),mean(MetricHeight),sd(MetricHeight)) -> height_by_sex
names(height_by_sex)<-c("sex","subjects","mean_height","stdev")
knitr::kable(height_by_sex)
ggplot(heights,aes(MetricHeight))+geom_histogram(binwidth=2.5)+facet_grid(. ~ SexScrubbed) + theme_economist() + scale_fill_economist()
```
## Inferential statistics
### Test for normal distribution
```{r normalheight, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
phenotypes %>% filter(SexScrubbed=='Male') %>% dplyr::select(MetricHeight) %>% dplyr::filter(!is.na(MetricHeight)) -> male_heights
shapirores<-shapiro.test(male_heights$MetricHeight)
stopifnot(shapirores$p.value>0.05)
shapirores
```
Male heights appear to be sampled from a normally distributed population (the Shapiro-Wilk test does not reject normality).
And for the females...
```{r femaleheight, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
phenotypes %>% filter(SexScrubbed=='Female') %>% dplyr::select(MetricHeight) %>% filter(!is.na(MetricHeight)) -> female_heights
shapirores<-shapiro.test(female_heights$MetricHeight)
stopifnot(shapirores$p.value>0.05)
shapirores
```
Female heights also appear to be sampled from a normally distributed population.
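
For intuition about what the Shapiro-Wilk test reacts to, here is a small sketch on simulated data (not the OpenSNP heights). The sample sizes and distribution parameters are arbitrary assumptions chosen for illustration.

```r
set.seed(1)
normal_sample <- rnorm(200, mean = 175, sd = 7)  # roughly height-like values
skewed_sample <- rexp(200, rate = 1)             # strongly right-skewed values

# A large p-value means we fail to reject normality; it does not prove it
shapiro.test(normal_sample)$p.value

# A heavily skewed sample should give a very small p-value
shapiro.test(skewed_sample)$p.value
```

The `stopifnot(shapirores$p.value>0.05)` lines in the chunks above turn this check into a hard assertion, so the report stops knitting if normality is rejected.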
The q-q plots
```{r}
plot1 <- ggplot(female_heights, aes(sample = MetricHeight)) + ggtitle("Female heights q-q plot") + stat_qq() + theme_economist() + scale_fill_economist()
plot2 <- ggplot(male_heights, aes(sample = MetricHeight)) + ggtitle("Male heights q-q plot") + stat_qq() + theme_economist() + scale_fill_economist()
grid.arrange(plot1, plot2, ncol=2)
```
### Tests for equal variance
We can conduct a Levene's test for equal variance.
```{r levene, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
lev<-leveneTest(heights$MetricHeight, heights$SexScrubbed)
lev
```
If the p-value is greater than, say, 0.05, we fail to reject the null hypothesis that the two populations have equal variances, and proceed. The p-value of `r lev[3][[1]][1]` is above that threshold, so the equal-variance assumption looks reasonable.
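
Levene's test is less mysterious than it sounds: in its median-centered form (the default in `car::leveneTest`) it is a one-way ANOVA on the absolute deviations of each observation from its group median. A sketch on simulated data, where the group sizes, means, and standard deviations are assumptions:

```r
set.seed(42)
sim <- data.frame(
  height = c(rnorm(60, 165, 7), rnorm(80, 178, 7)),
  sex = factor(rep(c("Female", "Male"), times = c(60, 80)))
)

# absolute deviation of each observation from its own group's median
sim$abs_dev <- abs(sim$height - ave(sim$height, sim$sex, FUN = median))

# one-way ANOVA on those deviations: the F test is the Levene statistic
levene_by_hand <- anova(lm(abs_dev ~ sex, data = sim))
levene_by_hand
levene_p <- levene_by_hand[["Pr(>F)"]][1]
```

If one group were much more spread out than the other, its absolute deviations would be larger on average and the F test would pick that up.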
### The t-test
I have the a priori suspicion that males are taller than females, so this will be a one-sided test.
```{r ttest, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
t.test(female_heights$MetricHeight,male_heights$MetricHeight,var.equal=TRUE,alternative = "less")
```
With a p-value below the smallest value R will print, we can reject the null hypothesis that these males and females are drawn from populations with equal mean height, in favor of the alternative hypothesis that men are taller than women.
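
As a check on what `t.test(..., var.equal=TRUE, alternative="less")` computes, the pooled-variance t statistic can be reproduced by hand. This sketch uses simulated heights (the sample sizes and parameters are assumptions, not the OpenSNP values).

```r
set.seed(5)
f <- rnorm(50, mean = 165, sd = 7)  # simulated female heights
m <- rnorm(60, mean = 178, sd = 7)  # simulated male heights
n_f <- length(f)
n_m <- length(m)

# pooled variance: weighted average of the two sample variances
pooled_var <- ((n_f - 1) * var(f) + (n_m - 1) * var(m)) / (n_f + n_m - 2)

# t statistic for the difference in means (female minus male)
t_stat <- (mean(f) - mean(m)) / sqrt(pooled_var * (1 / n_f + 1 / n_m))

# one-sided p-value for the alternative "female mean is less than male mean"
p_one_sided <- pt(t_stat, df = n_f + n_m - 2)

builtin <- t.test(f, m, var.equal = TRUE, alternative = "less")
all.equal(unname(builtin$statistic), t_stat)
all.equal(builtin$p.value, p_one_sided)
```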
# SNPs affecting height
SNPs are assigned "rs" (reference SNP) identifiers which are unique for a position on the genome.
https://www.snpedia.com/index.php/Height cites
> nature SNP rs1042725 is associated with height (P = 4E-8) in a study involving over 20,000 individuals. The gene harboring this SNP, HMGA2, is a strong biological
> candidate for having an influence on height, since rare, severe mutations in this gene are known to alter body size in mice and humans.
> Note that this SNP is by no means the whole story; rs1042725 is estimated to explain only 0.3% of population variation in height in both adults and children (approx
> 0.4 cm increased adult height per C allele), leaving over 99% of the influences on height to be described in the future ...
The genotypes at rs1042725 are CC, CT, and TT. I am interested to see whether any of these three groups differ with regard to a continuous variable, height.
I grepped this SNP from the genotype files and cleaned the results.
```
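# run inside the directory of genotype files; the result is written one level up
grep 'rs1042725\s' *.txt > ../rs1042725.rs.txt
# run one level up: order the two alleles alphabetically so that e.g. TC and CT are both reported as CT
perl -ne 'm/user([0-9]+).+([ACGT])\s*([ACGT]).*/;if($2 lt $3){print $1."\t".$2.$3."\n";}else{print $1."\t".$3.$2."\n";}' < rs1042725.rs.txt > rs1042725.rs.clean.txt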
```
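
The clean-up step appears intended to give every genotype a canonical allele order, so that `TC` and `CT` are counted as the same genotype. The same idea can be sketched in base R; `normalize_genotype` is a hypothetical helper, not part of the pipeline above.

```r
# Hypothetical helper: sort the alleles of each genotype alphabetically
normalize_genotype <- function(genotypes) {
  vapply(
    strsplit(genotypes, ""),                             # split into single alleles
    function(alleles) paste(sort(alleles), collapse = ""),
    character(1)
  )
}

normalize_genotype(c("TC", "CT", "CC", "TT"))
```

Without this normalization, heterozygotes reported in the opposite order would be silently dropped by a later filter such as `genotype %in% c('CC','CT','TT')`.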
While the C allele is associated with height (approx 0.4 cm increased adult height per C allele), it is perfectly reasonable to treat the three genotypes (CC,CT,TT) as groups and conduct an ANOVA.
## Descriptive statistics
```{r heightsnps, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
read.table("rs1042725.rs.clean.txt",col.names=c("user","genotype")) %>% filter(genotype %in% c('CC','CT','TT')) %>% droplevels() %>% distinct -> rs1042725
```
A contingency table of the genotypes
```{r genocontingency, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
knitr::kable(rs1042725 %>% group_by(genotype) %>% dplyr::summarize(subjects=n()))
```
We can perform a Hardy-Weinberg test to see if the genotype frequencies are consistent with random mating:
```{r hwe, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
rs1042725$genotype %>% str_replace("([CT])([CT])","\\1/\\2") -> HWE_compliant_genotypes
HWE.test(genotype(HWE_compliant_genotypes),exact=TRUE)
```
The p-value of 1 gives no evidence against Hardy-Weinberg equilibrium at this SNP.
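
To show what Hardy-Weinberg equilibrium means numerically, here is a hand calculation on made-up genotype counts. Note that this sketch uses the chi-square approximation, not the exact test requested from `HWE.test` above, and the counts are invented for illustration.

```r
# invented genotype counts, chosen to sit exactly at equilibrium
obs <- c(CC = 25, CT = 50, TT = 25)
n <- sum(obs)

# allele frequencies: each CC carries two C alleles, each CT carries one
p_C <- (2 * obs[["CC"]] + obs[["CT"]]) / (2 * n)
p_T <- 1 - p_C

# genotype counts expected under random mating: p^2, 2pq, q^2
expected <- n * c(CC = p_C^2, CT = 2 * p_C * p_T, TT = p_T^2)

# chi-square goodness of fit with 1 degree of freedom
# (3 genotype classes - 1 - 1 estimated allele frequency)
chi_sq <- sum((obs - expected)^2 / expected)
hwe_p <- pchisq(chi_sq, df = 1, lower.tail = FALSE)
```

Because the observed counts match the expected counts exactly, the statistic is 0 and the p-value is 1, the same pattern as the result reported above.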
```{r userheights, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
phenotypes %>% filter(SexScrubbed=='Male' | SexScrubbed=='Female') %>% filter(!is.na(MetricHeight)) %>% dplyr::select(user_id,MetricHeight,SexScrubbed) %>% rename(user=user_id) -> user_heights
```
We will merge the `r nrow(rs1042725)` known rs1042725 genotypes and the `r nrow(user_heights)` known height phenotypes.
```{r heightmerge, echo=TRUE, message=FALSE, cache=FALSE, warning=FALSE}
height_genopheno<-merge(user_heights,rs1042725,by = "user")
```
`r nrow(height_genopheno)` subjects have both genotype and phenotype.
Do we see any difference in means?
```{r}
height_genopheno %>% group_by(genotype,SexScrubbed) %>% dplyr::summarize(n(),mean(MetricHeight),sd(MetricHeight)) -> height_by_gene_sex
names(height_by_gene_sex)<-c("genotype","sex","subjects","mean_height","stdev")
knitr::kable(height_by_gene_sex)
ggplot(height_genopheno, aes(SexScrubbed, MetricHeight,fill=genotype)) + geom_boxplot() + theme_economist() + scale_fill_economist()
```
We would expect CC to be the tallest, then CT, then TT. Here it appears CT is a couple centimeters shorter than it should be, but that can be due to sampling error or confounds related to population structure. As mentioned, this allele is understood to explain only 0.3% of population variation in height.
## Inferential statistics
### The two-factor factorial ANOVA
Are any of the differences significant?
We already know sex is a significant variable in determining height, but we should still include it as a factor because it could otherwise act as a confounding variable. It will also be interesting to see any interaction between sex and genotype.
The null hypothesis is that the population height means of all genotype groups are equal.
The alternate hypothesis is that at least one of the genotype means differs from the others.
#### Assumptions of two-way ANOVA
The assumptions of the two-way ANOVA are:
* independence of cases - OK
* normal residuals
* equal variances
```{r}
fit <- aov(MetricHeight ~ genotype*SexScrubbed, data=height_genopheno)
fitsum<-summary(fit)
```
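
The formula `genotype*SexScrubbed` expands to both main effects plus their interaction. A sketch on simulated data (balanced groups and a sex effect only, both of which are assumptions) shows the rows this produces in the ANOVA table:

```r
set.seed(7)
# 20 simulated subjects in each of the 3 x 2 genotype-by-sex cells
sim <- expand.grid(genotype = c("CC", "CT", "TT"),
                   sex = c("Female", "Male"),
                   rep = 1:20)
# height depends on sex only; genotype and the interaction are pure noise
sim$height <- 165 + 13 * (sim$sex == "Male") + rnorm(nrow(sim), sd = 7)

sim_fit <- aov(height ~ genotype * sex, data = sim)
sim_table <- summary(sim_fit)[[1]]

# one row per main effect, one for the interaction, one for residuals
trimws(rownames(sim_table))
sim_table$Df
```

The degrees of freedom follow from the design: 2 for three genotypes, 1 for two sexes, 2 for the interaction, and 120 - 6 = 114 for the residuals.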
#### Test for normality of residuals
```{r}
ad.test(resid(fit))
```
OK
#### Test for equal variances
```
#leveneTest takes a formula here; genotype and SexScrubbed must both be factors
lev<-leveneTest(MetricHeight ~ genotype*SexScrubbed, data=height_genopheno)
lev
```
OK
#### Results of the ANOVA
```{r}
fitsum
```
With an F-value of `r fitsum[[1]][4][[1]][1]` and a p-value of `r fitsum[[1]][5][[1]][1]`, we reject the null hypothesis that all genotype groups have equal mean height. The population mean height of at least one of the groups differs significantly.
We fail to reject the null hypothesis that there is no interaction between sex and genotype.
A post-hoc test can reveal which groups differ significantly.
```{r}
TukeyHSD(fit)
```
The post-hoc test reveals a significant difference between the CC and CT genotypes (p<0.05) but not in the other pairwise comparisons.
# Lactose intolerance
rs4988235(G) and rs182549(C) are highly predictive of lactose intolerance.
## Data cleaning
```{r}
scrubLactose<-function(x){
  if(str_detect(x,"[Ii]ntolerant")){
    return("Intolerant")
  }
  if(str_detect(x,"allergic")){
    return("Intolerant")
  }
  if(str_detect(x,"[Tt]olerant")){
    return("Tolerant")
  }
  return(NA_character_)
}
```
## Descriptive statistics
### Phenotypes
```{r}
phenotypes$LactoseScrubbed<-as.factor(unlist(sapply(phenotypes$Lactose.intolerance,scrubLactose)))
phenotypes %>% dplyr::filter(!is.na(LactoseScrubbed)) %>% dplyr::select(user_id,LactoseScrubbed) %>% rename(user=user_id) %>% distinct -> user_lactose
knitr::kable(user_lactose %>% group_by(LactoseScrubbed) %>% dplyr::summarize(count=n()),caption="Lactose phenotypes")
```
### Genotypes
For the purposes of Pearson's Chi-squared test I combine the genotypes at the two SNPs into a single label (e.g. `AG/CT`), loosely referred to here as a _haplotype_.
```{r}
read.table("rs4988235.rs.clean.txt",col.names=c("user","rs4988235")) %>% filter(rs4988235 %in% c('AA','AG','GG')) %>% mutate(rs4988235_dose = ifelse(rs4988235=='GG',2,ifelse(rs4988235=='AG',1,0))) %>% droplevels() %>% distinct -> rs4988235
read.table("rs182549.rs.clean.txt",col.names=c("user","rs182549")) %>% filter(rs182549 %in% c('TT','CT','CC')) %>% mutate(rs182549_dose = ifelse(rs182549=='CC',2,ifelse(rs182549=='CT',1,0))) %>% droplevels() %>% distinct -> rs182549
alllactose_genopheno<-merge(merge(user_lactose,rs4988235,by = "user",all=FALSE),rs182549,by = "user",all=FALSE)
#the default contingency table looks awful, lets do the math and display a unified table
table(alllactose_genopheno$LactoseScrubbed,alllactose_genopheno$rs4988235,alllactose_genopheno$rs182549) -> al_ct
p_tot<-as.array(margin.table(al_ct,1))
p_rs4988235<-as.array(margin.table(al_ct,2))
p_rs182549<-as.array(margin.table(al_ct,3))
alllactose_genopheno %>% xtabs(formula = ~ LactoseScrubbed+rs4988235+rs182549) %>% as.data.frame() %>% dplyr::arrange(LactoseScrubbed,rs4988235,rs182549) %>% rename(obs = Freq) -> agd
agd$expected<-round(p_tot[agd$LactoseScrubbed]*p_rs4988235[agd$rs4988235]*p_rs182549[agd$rs182549]/margin.table(al_ct)^2,2)
agd$prop<-round(agd$obs/agd$expected,2)
ct_for_stats<-table(phenotype=alllactose_genopheno$LactoseScrubbed,genotype=paste(alllactose_genopheno$rs4988235,alllactose_genopheno$rs182549,sep="/"))[,c(1,3,5)]
```
## Inferential statistics
### Assumptions of the Chi-squared test
* Simple random sample - OK
* Total sample size > 50
* Expected cell count >=5 in all cells
* Independence - OK
```{r}
stopifnot(sum(ct_for_stats)>50)
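#the assumption concerns expected counts, so check those rather than the observed counts
stopifnot(all(chisq.test(ct_for_stats)$expected>=5))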
```
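
The cell-count assumption refers to the counts expected under independence, which are computed from the row and column totals. A small sketch on an invented 2x2 table (the counts and labels are assumptions for illustration):

```r
# invented counts, for illustration only
toy <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(phenotype = c("Intolerant", "Tolerant"),
                              genotype = c("GG/CC", "AA/TT")))

# expected count for each cell = row total * column total / grand total
expected_by_hand <- outer(rowSums(toy), colSums(toy)) / sum(toy)

# chisq.test reports the same matrix in its $expected component
expected_from_test <- chisq.test(toy)$expected

all(expected_from_test >= 5)
```

Here both rows total 50 and the columns total 40 and 60, so the expected counts are 20 and 30 in each row.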
#### Results of the Chi-square
```{r}
chi_res<-chisq.test(ct_for_stats)
chi_res
#only take those cells with sufficient counts
agd[agd$obs>4,] -> agdfilt
knitr::kable(agdfilt,row.names=FALSE,caption="Lactose breakdown - rs4988235(G) and rs182549(C) are predictive of lactose intolerance. Several genotypes are omitted because they do not have adequate counts for the Chi-square.")
```
Which combinations differ? I am most curious about the high-risk GG/CC genotype.
A test of proportions can be made for GG/CC vs all other groups
```{r}
intolerant_ggcc<-ct_for_stats[1,3]
tolerant_ggcc<-ct_for_stats[2,3]
intolerant_else<-sum(ct_for_stats[1,-3])
tolerant_else<-sum(ct_for_stats[2,-3])
prop.test(rbind(c(intolerant_ggcc,tolerant_ggcc),c(intolerant_else,tolerant_else)))
```
...and the low-risk AA/TT genotype...
```{r}
intolerant_aatt<-ct_for_stats[1,1]
tolerant_aatt<-ct_for_stats[2,1]
intolerant_else<-sum(ct_for_stats[1,-1])
tolerant_else<-sum(ct_for_stats[2,-1])
prop.test(rbind(c(intolerant_aatt,tolerant_aatt),c(intolerant_else,tolerant_else)))
```
# Eye Color
Kayser et al ["Eye color and the prediction of complex phenotypes from genotypes"](http://www.sciencedirect.com/science/article/pii/S0960982209005971) concluded that 6 SNPs (rs12913832 rs1800407 rs12896399 rs16891982 rs1393350 rs12203592) could predict blue, brown, and intermediate color with more than 90% accuracy.
We can examine each of these SNPs individually and in tandem.
Let's map each genotype to an ordinal alternate-allele dose, so homozygous alternate = 2, heterozygous = 1, and homozygous reference = 0. Then we can use logistic regression with regressors that are, if not continuous, at least ordinal.
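
The dose encoding amounts to counting copies of the alternate allele in each genotype string. A minimal base-R sketch, where `allele_dose` is a hypothetical helper and not a function used elsewhere in this report:

```r
# Hypothetical helper: number of copies of `alt` in each two-letter genotype
allele_dose <- function(genotype, alt) {
  # characters removed when deleting `alt` = number of copies of `alt`
  nchar(genotype) - nchar(gsub(alt, "", genotype, fixed = TRUE))
}

allele_dose(c("AA", "AG", "GG"), alt = "G")
```

This gives 0 for the homozygous reference genotype, 1 for the heterozygote, and 2 for the homozygous alternate genotype, the same values produced by the nested `ifelse` calls in the data cleaning chunk.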
## Data cleaning
```{r}
cleanEyes<-function(x){
  if(str_detect(x,"[bB]lue")){
    return("Blue")
  }
  if(str_detect(x,"[Bb]rown")){
    return("Brown")
  }
  if(str_detect(x,"[Hh]azel")){
    return("Hazel")
  }
  if(str_detect(x,"[Gg]reen")){
    return("Green")
  }
  return(NA_character_)
}
phenotypes$cleanEyes<-as.factor(sapply(phenotypes$Eye.color,cleanEyes))
phenotypes %>% filter(!is.na(phenotypes$cleanEyes)) %>% dplyr::select(user_id,cleanEyes) %>% rename(user=user_id) -> user_eyes
#ref A
read.table("rs12913832.rs.clean.txt",col.names=c("user","rs12913832")) %>% filter(rs12913832 %in% c('AA','AG','GG')) %>% mutate(rs12913832_dose = ifelse(rs12913832=='GG',2,ifelse(rs12913832=='AG',1,0))) %>% droplevels() %>% distinct -> rs12913832
#ref G
read.table("rs1800407.rs.clean.txt",col.names=c("user","rs1800407")) %>% filter(rs1800407 %in% c('CC','CT','TT')) %>% mutate(rs1800407_dose = ifelse(rs1800407=='TT',2,ifelse(rs1800407=='CT',1,0))) %>% droplevels() %>% distinct -> rs1800407
#ref T
read.table("rs12896399.rs.clean.txt",col.names=c("user","rs12896399")) %>% filter(rs12896399 %in% c('GG','GT','TT')) %>% mutate(rs12896399_dose = ifelse(rs12896399=='GG',2,ifelse(rs12896399=='GT',1,0))) %>% droplevels() %>% distinct -> rs12896399
#ref G
read.table("rs16891982.rs.clean.txt",col.names=c("user","rs16891982")) %>% filter(rs16891982 %in% c('CC','CG','GG')) %>% mutate(rs16891982_dose = ifelse(rs16891982=='CC',2,ifelse(rs16891982=='CG',1,0))) %>% droplevels() %>% distinct -> rs16891982
#ref G
read.table("rs1393350.rs.clean.txt",col.names=c("user","rs1393350")) %>% filter(rs1393350 %in% c('AA','AG','GG')) %>% mutate(rs1393350_dose = ifelse(rs1393350=='AA',2,ifelse(rs1393350=='AG',1,0))) %>% droplevels() %>% distinct -> rs1393350
#ref C
read.table("rs12203592.rs.clean.txt",col.names=c("user","rs12203592")) %>% filter(rs12203592 %in% c('CC','CT','TT')) %>% mutate(rs12203592_dose = ifelse(rs12203592=='TT',2,ifelse(rs12203592=='CT',1,0))) %>% droplevels() %>% distinct -> rs12203592
Reduce(function(x, y) merge(x, y, by="user", all=TRUE), list(rs12913832,rs1800407,rs12896399,rs16891982,rs1393350,rs12203592)) %>% distinct -> alleyes
alleyes_genopheno<-merge(user_eyes,alleyes,by = "user",all=FALSE)
```
## Descriptive statistics
There are `r nrow(alleyes_genopheno)` subjects with phenotypic and _some_ genotypic data.
### Phenotypes
```{r}
knitr::kable(alleyes_genopheno %>% group_by(cleanEyes) %>% dplyr::summarize(subjects=n()),caption="Subjects by eye color")
```
### Most frequent genotypes by eye color
```{r}
knitr::kable(alleyes_genopheno %>% group_by(rs12913832,rs1800407,rs12896399,rs16891982,rs1393350,rs12203592,cleanEyes) %>% dplyr::summarize(subjects=n()) %>% group_by(cleanEyes) %>% dplyr::top_n(4) %>% ungroup() %>% arrange(cleanEyes,desc(subjects)),caption="Four most frequent genotypes for each eye color")
```
### PCA
A principal components analysis decomposition of the 6 SNPs
```{r}
alleyes_genopheno %>% dplyr::select(cleanEyes,rs12913832_dose,rs1800407_dose,rs12896399_dose,rs16891982_dose,rs1393350_dose,rs12203592_dose) %>% na.omit() -> alleyes_pcaready
pca<-prcomp(~rs12913832_dose+rs1800407_dose+rs12896399_dose+rs16891982_dose+rs1393350_dose+rs12203592_dose,data=alleyes_pcaready,scale=TRUE)
eye_rgb <- c("Blue"=rgb(21,105,199,max = 255),"Brown"=rgb(139,69,19,max = 255),"Green"=rgb(108, 165, 128,max = 255),"Hazel"=rgb(119,101,54,max = 255))
g<-autoplot(pca, data = alleyes_pcaready)
g+geom_point(size=3,aes(color=factor(cleanEyes)))+scale_color_manual(values=eye_rgb)
```
## Inferential statistics
### Logistic regression using glm
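For simplicity, let's look at this with an additive model only. Note that `cleanEyes` has four levels: with a binomial family, `glm` treats the first factor level (here `Blue`) as failure and all other levels as success, so this model contrasts blue with non-blue eyes.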
```{r}
reg = glm(cleanEyes~rs12913832_dose+rs1800407_dose+rs12896399_dose+rs16891982_dose+rs1393350_dose+rs12203592_dose,family = binomial("logit"),data=alleyes_genopheno)
```
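
When a factor with more than two levels is passed to `glm` with a binomial family, R models the first level against all the others. The sketch below checks this on simulated data; the single `dose` predictor, the three eye colors, and the effect sizes are all assumptions for illustration.

```r
set.seed(3)
n <- 300
dose <- sample(0:2, n, replace = TRUE)

# simulate: each extra allele copy makes blue eyes more likely
is_blue <- rbinom(n, 1, plogis(-2 + 1.5 * dose)) == 1
eye <- factor(ifelse(is_blue, "Blue",
                     sample(c("Brown", "Green"), n, replace = TRUE)))

# factor response: first level ("Blue") is failure, everything else is success
fit_factor <- glm(eye ~ dose, family = binomial("logit"))

# explicit 0/1 response coding "not blue" as success
not_blue <- as.integer(eye != "Blue")
fit_binary <- glm(not_blue ~ dose, family = binomial("logit"))

all.equal(coef(fit_factor), coef(fit_binary))
```

Since success here means "not blue", a negative coefficient indicates that the allele is associated with blue eyes.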
#### Assumptions of logistic regression
* The model is correctly specified - OK
* The cases are independent - OK
* The independent variables are not linear combinations of each other (no strong multicollinearity)
Variance inflation factor (VIF) check for multicollinearity: the square root of each VIF should be under 2.
```{r}
sqrt(vif(reg))
```
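
For intuition, the variance inflation factor of a predictor in an ordinary linear model is 1/(1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes this by hand on simulated predictors (the variables are assumptions); `car::vif` applied to a `glm` works from the fitted model's coefficient covariance, so its values can differ somewhat from this linear-model version.

```r
set.seed(11)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the others
predictors <- data.frame(x1, x2, x3)

vif_by_hand <- sapply(names(predictors), function(v) {
  # regress predictor v on all remaining predictors
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

sqrt(vif_by_hand)
```

The correlated pair `x1` and `x2` gets inflated values, while the independent `x3` stays close to 1.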
Results of the logistic regression
```{r}
summary(reg)
```
4 of our 6 SNPs are associated with differences in eye color (p<.05).
This script can be found on [https://github.com/leipzig/INFO812_opensnp](https://github.com/leipzig/INFO812_opensnp).
```{r}
sessionInfo()
```