---
output: github_document
---
# Surrogate Assisted Feature Extraction in R <img src="man/figures/logo.png" align="right" width="150"/>
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/rSAFE)](https://cran.r-project.org/package=rSAFE)
[![Build Status](https://travis-ci.org/ModelOriented/rSAFE.svg?branch=master)](https://travis-ci.org/ModelOriented/rSAFE)
[![Coverage Status](https://codecov.io/gh/ModelOriented/rSAFE/branch/master/graph/badge.svg)](https://codecov.io/gh/ModelOriented/rSAFE)
## Overview
The `rSAFE` package is a model-agnostic tool for making an interpretable white-box model more accurate with the help of a black-box surrogate model. New features are extracted from the complex model, such as a neural network or a random forest, and then used to fit a simpler, interpretable model, improving its overall performance.
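The end-to-end workflow, demonstrated step by step in the Demo section below, can be sketched as follows (variable names here are illustrative; the functions are the ones used later in this document):

```{r, eval=FALSE}
library(rSAFE)
library(DALEX)
library(randomForest)

# 1. Fit a complex black-box model and wrap it in a DALEX explainer
model_black_box <- randomForest(m2.price ~ ., data = apartments)
explainer <- explain(model_black_box,
                     data = apartmentsTest[, 2:6],
                     y = apartmentsTest[, 1])

# 2. Extract interpretable feature transformations from the surrogate
safe_extractor <- safe_extraction(explainer, penalty = 25, verbose = FALSE)

# 3. Transform the data using the proposed transformations
data_transformed <- safely_transform_data(safe_extractor, apartmentsTest,
                                          verbose = FALSE)

# 4. Fit a simple white-box model on the transformed features
model_white_box <- lm(m2.price ~ ., data = data_transformed)
```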
## Installation
The package can be installed from GitHub using the code below:
```{r, eval=FALSE}
install.packages("devtools")
devtools::install_github("ModelOriented/rSAFE")
```
## Demo
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
warning = FALSE,
message = FALSE
)
```
In this vignette we present an example of applying the `rSAFE` package to a regression problem. It is based on the `apartments` and `apartmentsTest` datasets, which come from the `DALEX` package but are also available in the `rSAFE` package. We will use these artificial datasets to predict the price per square meter of an apartment based on features such as construction year, surface, floor, number of rooms and district. Four of these variables are continuous, while the fifth is categorical.
```{r}
library(rSAFE)
head(apartments)
```
## Building a black-box model
First we fit a random forest model to the original `apartments` dataset - this is the complex model that will serve as our surrogate.
```{r}
library(randomForest)
set.seed(111)
model_rf1 <- randomForest(m2.price ~ construction.year + surface + floor + no.rooms + district, data = apartments)
```
## Creating an explainer
We also create an `explainer` object that will be used later to create new variables and, at the end, to compare model performance.
```{r}
library(DALEX)
explainer_rf1 <- explain(model_rf1, data = apartmentsTest[1:3000,2:6], y = apartmentsTest[1:3000,1], label = "rf1", verbose = FALSE)
explainer_rf1
```
## Creating a safe_extractor
Now we create a `safe_extractor` object using the `rSAFE` package and our surrogate model. Setting `verbose = FALSE` suppresses the progress bar.
```{r}
safe_extractor <- safe_extraction(explainer_rf1, penalty = 25, verbose = FALSE)
```
Now, let's print a summary of the new object we have just created.
```{r}
print(safe_extractor)
```
We can see transformation propositions for all variables in our dataset.
In the plot below we can see which points have been chosen to be the breakpoints for a particular variable:
```{r, fig.width=7}
plot(safe_extractor, variable = "construction.year")
```
For factor variables we can observe the order in which levels have been merged and what the optimal clustering is:
```{r, fig.width=7}
plot(safe_extractor, variable = "district")
```
## Transforming data
Now we can use our `safe_extractor` object to create new categorical features in the given dataset.
```{r}
data1 <- safely_transform_data(safe_extractor, apartmentsTest[3001:6000,], verbose = FALSE)
```
```{r, echo = FALSE}
knitr::kable(head(data1))
```
We can also perform feature selection if we wish. For each original feature, exactly one of its forms is kept - either the original or the transformed one.
```{r, fig.width=6}
vars <- safely_select_variables(safe_extractor, data1, which_y = "m2.price", verbose = FALSE)
data1 <- data1[,c("m2.price", vars)]
print(vars)
```
It can be observed that for some features the original form was preferred, while for others the transformed one was.
Here are the first few rows for our data after feature selection:
```{r, echo = FALSE}
knitr::kable(head(data1))
```
Now we perform the same transformations on another portion of the data, which will be used later in the explainers:
```{r, fig.width=6}
data2 <- safely_transform_data(safe_extractor, apartmentsTest[6001:9000,], verbose = FALSE)[,c("m2.price", vars)]
```
## Creating white-box models on original and transformed datasets
Let's fit models to the data containing the newly created columns. We use a linear model as the white-box model.
```{r}
model_lm2 <- lm(m2.price ~ ., data = data1)
explainer_lm2 <- explain(model_lm2, data = data2, y = apartmentsTest[6001:9000,1], label = "lm2", verbose = FALSE)
set.seed(111)
model_rf2 <- randomForest(m2.price ~ ., data = data1)
explainer_rf2 <- explain(model_rf2, data2, apartmentsTest[6001:9000,1], label = "rf2", verbose = FALSE)
```
Moreover, we create a linear model based on the original `apartments` dataset, together with its corresponding explainer, in order to check whether our methodology improves results.
```{r}
model_lm1 <- lm(m2.price ~ ., data = apartments)
explainer_lm1 <- explain(model_lm1, data = apartmentsTest[1:3000,2:6], y = apartmentsTest[1:3000,1], label = "lm1", verbose = FALSE)
```
## Comparing models performance
The final step is to compare all the models we have created.
```{r}
mp_lm1 <- model_performance(explainer_lm1)
mp_rf1 <- model_performance(explainer_rf1)
mp_lm2 <- model_performance(explainer_lm2)
mp_rf2 <- model_performance(explainer_rf2)
```
```{r, fig.width=7, fig.height=6}
plot(mp_lm1, mp_rf1, mp_lm2, mp_rf2, geom = "boxplot")
```
In the plot above we can see that the linear model based on transformed features generally gives more accurate predictions than the one fitted to the original dataset.
## References
* [Python version of SAFE package](https://github.com/ModelOriented/SAFE)
* [SAFE article](https://arxiv.org/abs/1902.11035) - the article about the SAFE algorithm, including benchmark results obtained using the Python version of the SAFE package
The package was created by Anna Gierlak as part of a master's thesis at the Faculty of Mathematics and Information Science, Warsaw University of Technology.