-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathfertility_documentation_RStudio.Rpres
115 lines (87 loc) · 4.16 KB
/
fertility_documentation_RStudio.Rpres
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
Predicting male fertility from personal risk factors
========================================================
author: iNLyze
date: `r Sys.Date()`
```{r includes, warning=FALSE, echo=FALSE, results='hide'}
if (!require(plyr))
install.packages("plyr")
if (!require(caret))
install.packages("caret")
if (!require(randomForest))
install.packages("randomForest")
if (!require(dplyr))
install.packages("dplyr")
```
```{r loadData, echo=FALSE}
destfile <- "fertility.csv"
destpath <- paste(getwd(), destfile, sep = "/")
fertility <- read.csv(destpath, header = F, sep = ",", dec = ".")
names(fertility) <- c("season", "age", "diseases", "trauma", "surgery", "fevers", "alcohol", "smoking", "sitting", "Outcome")
```
```{r partitionData, echo=FALSE}
set.seed(2545784)
inTrain <- createDataPartition(y=fertility$Outcome, p = 0.7, list = FALSE)
training <- fertility[inTrain, ]
testing <- fertility[-inTrain, ]
```
```{r train_rf, echo=FALSE}
# Build random forest model
mtryGrid <- expand.grid(mtry = 9)
set.seed(239584)
fitControl <- trainControl(method = 'boot632',
number = 100,
repeats = 100
)
set.seed(40958)
rf_Fit1 <- train(Outcome ~ ., data = training,
method = "rf",
importance = TRUE,
ntree = 20,
tuneGrid = mtryGrid,
trControl = fitControl
)
# predict and compare
prediction.rf <- predict(rf_Fit1, testing)
```
<img src="father-and-son-silhouette.png" style="background-color:transparent; border:0px; box-shadow:none; width:200px"></img>
<style>
.small-code pre code {
font-size: 0.8em;
}
</style>
Significance of fertility
========================================================
class: small-code
"Infertility [...] is also a significant clinical problem today, which affects 8–12% of couples worldwide. Of all infertility cases, approximately 40–50% is due to “male factor” infertility and as many as 2% of all men will exhibit suboptimal sperm parameters."
[Kumar et al](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4691969/)
*J Hum Reprod Sci.* 2015 Oct-Dec; 8(4): 191–196.
* Direct analysis of sperm quality requires expert knowledge and specialized equipment
* Sperm quality is often affected by a number of risk factors
* Predicting male fertility from such factors may give a first indication
* Could save cost and time by focusing more in-depth analysis on persons who are more likely whose fertility might be altered.
Data and model building
========================================================
## How the data was obtained:
* Data was originally created as part of [Gil et al](https://www.researchgate.net/profile/Joaquin_De_Juan/publication/230868076_Predicting_seminal_quality_with_artificial_intelligence_methods/links/09e415058f10cc3081000000.pdf)
* A subset of the original data (100 x 10) is available at [UCI](https://archive.ics.uci.edu/ml/datasets/Fertility)
## How the model was built:
* The data set was split into training and a test set (70%/30%, respectively)
* A random forest model was built using 632 Bootstrapping with 100 iterations and 100 repeats
* This model achieved an accuracy of 93%
* About comparable to performance achieved with best model in original publication
Model performance
========================================================
class: small-code
* Data set imbalanced, classes N, O have `r table(fertility$Outcome)` occurences
* Small data set, only 100 samples
* 10 dimensions, original publication used 34 dimensions
```{r confusion_metric, echo=FALSE}
confusionMatrix(prediction.rf, testing$Outcome, positive = "O")
```
How to use the application
====
* Type in require information and press submit
* Data is normalized to match format of data set
* To save time model is loaded from file, only prediction is made
* Output has been chosen to be as neutral as possible due to the very personal nature of the matter.
* Note: Age is confined to inputs only between 18 - 36 years due to nature of study (and relavent age when fatherhood is desired by large proportion of men). Predicting outside this age interval is potentially unreliable.