---
title: "CARET"
author: "Joseph Rickert"
date: "Wednesday, September 24, 2014"
output: html_document
---
## INTRODUCTION TO THE CARET PACKAGE
caret is arguably the most feature-rich package for doing data mining in R. It collects machine learning algorithms from multiple R packages in one place, provides a uniform interface to these algorithms, and includes many functions that facilitate the model building process. This script explores caret's capabilities using data included in the package, which is described in the paper: Hill et al., "Impact of image segmentation on high-content screening data quality for SK-BR-3 cells," BMC Bioinformatics (2007) vol. 8 (1) pp. 340.
The analysis presented here is based on examples presented by Max Kuhn, caret's author, at useR! 2012.
### Background
"Well-segmented"" cells are cells for which location and size may be accurrately detremined through optical measurements. Cells that are not Well-segmented (WS) are said to be "Poorly-segmented"" (PS).
### Problem
Given a set of optical measurements, can we predict which cells will be PS? This is a classic classification problem.
```{r}
library(caret)
library(rpart) # CART algorithm for decision trees
library(partykit) # Plotting trees
library(gbm) # Boosting algorithms
library(doParallel) # parallel processing
library(pROC) # plot the ROC curve
library(corrplot) # plot correlations
```
### Get the Data
Load the data and construct indices to divide it into training and test data sets.
```{r}
data(segmentationData) # Load the segmentation data set
dim(segmentationData)
head(segmentationData)
#
trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE)
trainData <- segmentationData[trainIndex,]
dim(trainData)
#
testData <- segmentationData[-trainIndex,]
dim(testData)
```
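createDataPartition does stratified sampling within the levels of the variable it is given (here, Case), so the two halves should end up with a similar class balance. A quick sanity check (an addition, not in the original script):
```{r}
# Confirm that the PS/WS class balance is similar in both halves
prop.table(table(trainData$Class))
prop.table(table(testData$Class))
```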
## rpart Tree Model
We build a basic tree model with rpart.
```{r}
tree.mod <- rpart(Class ~ .,data=trainData,control=rpart.control(maxdepth=2))
tree.mod
```
Visualize the tree
```{r}
tree.mod.p <- as.party(tree.mod) # make the tree.mod object into a party object
plot(tree.mod.p)
```
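For reference, the shallow tree can be scored on the held-out data the same way the tuned models are evaluated later on; a minimal sketch, not part of the original analysis:
```{r}
# Predict classes on the test set and tabulate the tree's errors
tree.pred <- predict(tree.mod, newdata = testData, type = "class")
confusionMatrix(tree.pred, testData$Class)
```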
## Generalized Boosted Regression Model
We build a gbm model. Note that with the "bernoulli" distribution the gbm function expects a numeric 0/1 response rather than a factor, so Class must be recoded.
```{r}
gbmTrain <- trainData
gbmTrain$Class <- ifelse(gbmTrain$Class=="PS",1,0)
gbm.mod <- gbm(formula = Class ~ .,          # use all variables
               distribution = "bernoulli",   # for a binary classification problem
               data = gbmTrain,
               n.trees = 2000,               # 2000 boosting iterations
               interaction.depth = 7,        # 7 splits for each tree
               shrinkage = 0.01,             # the learning rate parameter
               verbose = FALSE)              # do not print the details
summary(gbm.mod) # Plot the relative influence of the variables in the model
```
This is an interesting model, but how do you select the best values for the three tuning parameters?
* n.trees
* interaction.depth
* shrinkage
Algorithm for training the model (a hand-rolled sketch of one pass follows this list):
* for each resampled data set do
    * hold out some samples
    * for each combination of the three tuning parameters do
        * fit the model on the resampled data set
        * predict the values of Class on the hold-out samples
    * end
    * calculate the AUC (area under the ROC curve) on the hold-out samples
* select the combination of tuning parameters that yields the best AUC
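To make the pseudocode concrete, here is an illustrative hand-rolled version of a single resample of that loop. The objects `resamp`, `holdout`, and `tuneGrid` are made up for this sketch; caret's train function below automates all of this, with proper repeated resampling:
```{r}
# Illustrative only: one resample of the tuning loop that train() automates.
# gbmTrain (with the numeric 0/1 Class) is reused from the chunk above.
inTrain <- createDataPartition(trainData$Class, p = 0.8, list = FALSE)
resamp  <- gbmTrain[inTrain, ]    # resampled training set
holdout <- gbmTrain[-inTrain, ]   # hold-out samples
tuneGrid <- expand.grid(n.trees = c(50, 100),
                        interaction.depth = c(1, 3),
                        shrinkage = c(0.01, 0.1))
tuneGrid$AUC <- apply(tuneGrid, 1, function(p) {
  fit <- gbm(Class ~ ., data = resamp, distribution = "bernoulli",
             n.trees = p["n.trees"],
             interaction.depth = p["interaction.depth"],
             shrinkage = p["shrinkage"], verbose = FALSE)
  probs <- predict(fit, holdout, n.trees = p["n.trees"], type = "response")
  as.numeric(auc(roc(holdout$Class, probs)))  # AUC on the hold-out samples
})
tuneGrid[which.max(tuneGrid$AUC), ]  # best combination in this one resample
```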
caret provides the train function to do all of this.
The trainControl function sets the training method.
Note that by default train picks the best model using accuracy and Cohen's Kappa.
```{r}
ctrl <- trainControl(method = "repeatedcv",             # use repeated 10-fold cross-validation
                     repeats = 5,                       # do 5 repetitions of 10-fold cv
                     summaryFunction = twoClassSummary, # use AUC to pick the best model
                     classProbs = TRUE)
```
Use the expand.grid to specify the search space
Note that the default search grid selects 3 values of each tuning parameter
```{r}
# A larger search space (commented out to keep run time down):
#grid <- expand.grid(.interaction.depth = seq(1,7,by=2),  # tree depths 1, 3, 5, 7
#                    .n.trees = seq(100,1000,by=50),      # iterations from 100 to 1,000
#                    .shrinkage = c(0.01,0.1))            # two values of the learning rate
grid <- expand.grid(.interaction.depth = seq(1,4,by=2),   # tree depths 1 and 3
                    .n.trees = seq(10,100,by=10),         # iterations from 10 to 100
                    .shrinkage = c(0.01,0.1))             # two values of the learning rate
#
set.seed(1)
names(trainData)
trainX <-trainData[,4:61]
registerDoParallel(4) # Register a parallel backend for train
getDoParWorkers()
system.time(gbm.tune <- train(x = trainX, y = trainData$Class,
                              method = "gbm",
                              metric = "ROC",
                              trControl = ctrl,
                              tuneGrid = grid,
                              verbose = FALSE))
```
### Tuning Results
ROC was the performance criterion used to select the optimal model. With the larger, commented-out tuning grid above, the final values selected for the model were:
* interaction.depth = 7
* n.trees = 500
* shrinkage = 0.01
With the smaller grid actually run here, the selected values will differ; see gbm.tune$bestTune below.
```{r}
gbm.tune$bestTune
plot(gbm.tune) # Plot the performance of the training models
res <- gbm.tune$results
names(res) <- c("depth", "trees", "shrinkage", "ROC", "Sens", "Spec", "sdROC", "sdSens", "sdSpec")
res
```
### GBM Model Predictions and Performance
Make predictions using the test data set
```{r}
testX <- testData[,4:61]
gbm.pred <- predict(gbm.tune,testX)
head(gbm.pred)
```
Look at the confusion matrix
```{r}
confusionMatrix(gbm.pred,testData$Class)
```
Draw the ROC curve
```{r}
gbm.probs <- predict(gbm.tune,testX,type="prob")
head(gbm.probs)
gbm.ROC <- roc(predictor = gbm.probs$PS,
               response = testData$Class,
               levels = rev(levels(testData$Class)))
gbm.ROC
plot(gbm.ROC)
```
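pROC can also attach a confidence interval to the test-set AUC; a one-line addition, not in the original:
```{r}
ci.auc(gbm.ROC)  # DeLong 95% confidence interval for the AUC
```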
Plot the probability of poor segmentation.
```{r}
histogram(~gbm.probs$PS|testData$Class,xlab="Probability of Poor Segmentation")
```
## Support Vector Machine Model
We follow steps similar to those above to build an SVM model.
```{r}
set.seed(1)
registerDoParallel(4,cores=4)
getDoParWorkers()
system.time(
  svm.tune <- train(x = trainX,
                    y = trainData$Class,
                    method = "svmRadial",
                    tuneLength = 9,                 # evaluate 9 values of the cost parameter C
                    preProc = c("center","scale"),
                    metric = "ROC",
                    trControl = ctrl)               # same training controls as for gbm above
)
svm.tune
# Plot the SVM results
plot(svm.tune,
     metric = "ROC",
     scales = list(x = list(log = 2)))
#---------------------------------------------------
# SVM Predictions
svm.pred <- predict(svm.tune,testX)
head(svm.pred)
confusionMatrix(svm.pred,testData$Class)
```
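To compare with the gbm model on equal footing, the SVM's test-set ROC curve can be drawn the same way; a sketch mirroring the gbm steps above:
```{r}
# Class probabilities are available because classProbs=TRUE was set in ctrl
svm.probs <- predict(svm.tune, testX, type = "prob")
svm.ROC <- roc(predictor = svm.probs$PS,
               response = testData$Class,
               levels = rev(levels(testData$Class)))
svm.ROC
plot(svm.ROC)
```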
## Comparing Models
Having set the seed to 1 before running gbm.tune and svm.tune, we have generated paired samples (see Hothorn et al., "The design and analysis of benchmark experiments," Journal of Computational and Graphical Statistics (2005) vol. 14 (3) pp. 675-699) and are in a position to compare models using a resampling technique.
The resamples function in caret collates the resampling results from the two models.
```{r}
rValues <- resamples(list(svm=svm.tune,gbm=gbm.tune))
rValues$values
summary(rValues)
xyplot(rValues,metric="ROC") # scatter plot
bwplot(rValues,metric="ROC") # boxplot
parallelplot(rValues,metric="ROC") # parallel plot
dotplot(rValues,metric="ROC") # dotplot
splom(rValues,metric="ROC") # scatterplot matrix
```
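Because the resamples are paired, caret's diff method can test whether the difference in ROC between the two models is statistically significant:
```{r}
difValues <- diff(rValues)  # paired differences between the two models
summary(difValues)          # paired t-tests on the resampled metrics
dotplot(difValues)          # visualize the differences with confidence intervals
```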