---
date: "`r Sys.Date()`"
output: github_document
title: "manymodelr: Build and Tune Several Models"
vignette: >
  %\VignetteIndexEntry{A Gentle Introduction to manymodelr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/manymodelr)](https://CRAN.R-project.org/package=manymodelr)
[![Codecov test coverage](https://codecov.io/gh/Nelson-Gon/manymodelr/branch/develop/graph/badge.svg)](https://app.codecov.io/gh/Nelson-Gon/manymodelr?branch=develop)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3891106.svg)](https://zenodo.org/record/5211272)
[![R CMD check](https://github.com/Nelson-Gon/manymodelr/workflows/R-CMD-check-devel/badge.svg)](https://github.com/Nelson-Gon/manymodelr)
![test-coverage](https://github.com/Nelson-Gon/manymodelr/workflows/test-coverage/badge.svg)
[![license](https://img.shields.io/badge/license-GPL--2-blue.svg)](https://www.gnu.org/licenses/gpl-3.0.en.html)
[![CRAN\_Release\_Badge](https://www.r-pkg.org/badges/version-ago/manymodelr)](https://CRAN.R-project.org/package=manymodelr)
[![Downloads](https://cranlogs.r-pkg.org/badges/manymodelr)](https://CRAN.R-project.org/package=manymodelr)
[![TotalDownloads](https://cranlogs.r-pkg.org/badges/grand-total/manymodelr?color=yellow)](https://CRAN.R-project.org/package=manymodelr)
[![lifecycle](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/Nelson-Gon/manymodelr/graphs/commit-activity)
[![Project Status](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/)
[![GitHub last commit](https://img.shields.io/github/last-commit/Nelson-Gon/manymodelr.svg)](https://github.com/Nelson-Gon/manymodelr/commits/master)
[![GitHub issues](https://img.shields.io/github/issues/Nelson-Gon/manymodelr.svg)](https://GitHub.com/Nelson-Gon/manymodelr/issues/)
[![GitHub issues-closed](https://img.shields.io/github/issues-closed/Nelson-Gon/manymodelr.svg)](https://GitHub.com/Nelson-Gon/manymodelr/issues?q=is%3Aissue+is%3Aclosed)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](https://makeapullrequest.com)
In this vignette, we take a look at how we can simplify many machine learning tasks using `manymodelr`.
## Installation
```{r, eval=FALSE}
install.packages("manymodelr")
```
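If one prefers the development version, it can typically be installed from GitHub. This is a minimal sketch, assuming the `devtools` package is available:
```{r, eval=FALSE}
# install the development version from GitHub (requires devtools)
devtools::install_github("Nelson-Gon/manymodelr")
```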
Once the package has been installed, we can load it and explore some of its key functions.
**Loading the package**
```{r}
library(manymodelr)
# data for examples
data("yields", package="manymodelr")
```
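To get a feel for the data used in the remaining examples, we can preview the first few rows of `yields`:
```{r}
# preview the example data
head(yields)
```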
## Modeling
First, a word of caution: the examples in this section are meant only to illustrate what the functions do, not to identify the best model. For a specific use case, please perform the necessary model checks and post-hoc analyses, and choose predictor variables and model types based on domain knowledge.
With this in mind, let us look at how we can perform modeling tasks using `manymodelr`.
- **`multi_model_1`**
This is one of the core functions of the package. `multi_model_1` aims to allow
model fitting, prediction, and reporting with a single function. The `multi`
part of the function's name reflects the fact that we can fit several model
types with one function. An example follows next.
For the purposes of this example, we split the `yields` data into training and validation sets.
```{r}
set.seed(520)
train_set <- createDataPartition(yields$normal, p = 0.6, list = FALSE)
valid_set <- yields[-train_set, ]
train_set <- yields[train_set, ]
ctrl <- trainControl(method = "cv", number = 5)
m <- multi_model_1(train_set, "normal", ".", c("knn", "rpart"),
                   "Accuracy", ctrl, new_data = valid_set)
```
The above returns a list containing metrics, predictions, and a model summary. These can be extracted as shown below.
```{r}
m$metric
```
```{r}
head(m$predictions)
```
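The exact elements returned may vary by package version, so it can be useful to inspect the full structure of the returned list before extracting components:
```{r}
# list everything returned by multi_model_1
names(m)
```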
- **`multi_model_2`**
This is similar to `multi_model_1` with one difference: it does not compute metrics such as RMSE, accuracy, and the like. This function is useful if one would like to fit and predict with "simpler" models such as linear or generalized linear models. Let's take a look:
```{r}
# fit a linear model and get predictions
lin_model <- multi_model_2(mtcars[1:16, ], mtcars[17:32, ], "mpg", "wt", "lm")
lin_model[c("predicted", "mpg")]
```
From the above, we see that `wt` alone may not be a great predictor of `mpg`. We can fit a multiple linear regression model with additional predictors. Suppose `disp` and `drat` are also important; we then add them to the model.
```{r}
multi_lin <- multi_model_2(mtcars[1:16, ], mtcars[17:32, ], "mpg", "wt + disp + drat", "lm")
multi_lin[, c("predicted", "mpg")]
```
- **`fit_model`**
This function allows us to fit any kind of model without necessarily returning predictions.
```{r}
lm_model <- fit_model(mtcars, "mpg", "wt", "lm")
lm_model
```
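Since the result prints like an ordinary model fit, base R generics should also apply. A minimal sketch, assuming `fit_model` returns a plain `lm` object here:
```{r, eval=FALSE}
# base R generics on the fitted model (assumes a plain lm object is returned)
coef(lm_model)
summary(lm_model)
```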
- **`fit_models`**
This is similar to `fit_model`, but with the ability to fit several models (multiple response variables and/or model types) at once. For instance, fitting a generalized linear model for each of two response variables:
```{r}
models <- fit_models(df = yields, yname = c("height", "weight"), xname = "yield",
                     modeltype = "glm")
```
One can then use these models as needed. For example, to add model residuals and predictions to the data:
```{r}
res_residuals <- lapply(models[[1]], add_model_residuals, yields)
res_predictions <- lapply(models[[1]], add_model_predictions, yields, yields)
# Get height predictions for the model height ~ yield
head(res_predictions[[1]])
```
If one would like to drop non-numeric columns from the analysis, one can set `drop_non_numeric` to `TRUE` as follows. The same can be done for `fit_model` above:
```{r}
(models <- fit_models(df = yields, yname = c("height", "weight"),
                      xname = ".", modeltype = c("lm", "glm"), drop_non_numeric = TRUE))
```
## Generating a (simple) model report
One can generate a very simple model report using `report_model` as follows:
```{r}
report_model(models[[2]][[1]])
```
## Extraction of Model Information
To extract information about a given model, we can use `extract_model_info` as follows.
```{r}
extract_model_info(lm_model, "r2")
```
To extract the adjusted R squared:
```{r}
extract_model_info(lm_model, "adj_r2")
```
For the p value:
```{r}
extract_model_info(lm_model, "p_value")
```
To extract multiple attributes:
```{r}
extract_model_info(lm_model, c("p_value", "response", "call", "predictors"))
```
This is not restricted to linear models; it works for most model types. See `help(extract_model_info)` for the currently supported model types.
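As a quick illustration on a non-`lm` fit, the same call can be tried on one of the fits from `fit_models` above (the object used with `report_model` earlier). This is a sketch; consult `help(extract_model_info)` for the attributes supported by each model type:
```{r, eval=FALSE}
# extract information from another model type (sketch; supported attributes may vary)
extract_model_info(models[[2]][[1]], c("response", "predictors"))
```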
## Correlations
- `get_var_corr`
As can (hopefully) be guessed from the name, this provides a convenient way to get variable correlations: it returns the correlations between one variable and all other variables in the data set.
**Previously, one would set `get_all` to `TRUE` if they wanted to get correlations between all variables. This argument has been dropped in favor of simply supplying an optional `other_vars` vector if one does not want to get all correlations.**
Sample usage:
```{r}
# get all correlations
# the default method is pearson
head(corrs <- get_var_corr(mtcars, comparison_var = "mpg"))
```
**Previously, one would also set `drop_columns` to `TRUE` if they wanted to drop factor columns.** Now, a user simply provides a character vector specifying which column types (classes) should be dropped. It defaults to `c("character", "factor")`.
```{r}
# purely demonstrative
get_var_corr(yields, "height", other_vars = "weight",
             drop_columns = c("factor", "character"), method = "spearman",
             exact = FALSE)
```
Similarly, `get_var_corr_` (note the underscore at the end) provides a convenient way to get combination-wise correlations.
```{r}
head(get_var_corr_(yields), 6)
```
To use only a subset of the data, we can provide a list of columns to `subset_cols`. The first vector in the list is mapped to `comparison_var` and the second to `other_vars`; the list is therefore of length 2.
```{r}
head(get_var_corr_(mtcars, subset_cols = list(c("mpg", "vs"), c("disp", "wt")),
                   method = "spearman", exact = FALSE))
```
- `plot_corr`
Obtaining correlations will most likely benefit from some form of visualization, and `plot_corr` aims to achieve just that. There are currently two plot styles, `squares` and `circles`; `circles` has a `shape` argument that allows for more flexibility. Note that the correlation matrix supplied to this function is an object produced by `get_var_corr_`.
To modify the plot a bit, we can choose to switch the x and y values as shown below.
```{r}
plot_corr(mtcars, show_which = "corr",
          round_which = "correlation", decimals = 2,
          x = "other_var", y = "comparison_var", plot_style = "squares",
          width = 1.1, custom_cols = c("green", "blue", "red"),
          colour_by = "correlation")
```
To show the significance of the results instead of the correlations themselves, we can set `show_which` to `"signif"` as shown below. By default, the significance cutoff is 0.05; you can override this by supplying a different `signif_cutoff`.
```{r}
# colour by p value
# change colours by supplying custom_cols
# the significance cutoff is left at its default
set.seed(233)
plot_corr(mtcars, x = "other_var", y = "comparison_var", plot_style = "circles",
          show_which = "signif", colour_by = "p.value",
          custom_cols = sample(colours(), 3))
```
To explore more options, please take a look at the documentation.
## Extra Functions
- **`agg_by_group`**
As can be guessed from the name, this function provides an easy way to aggregate grouped data. We can, for instance, count the observations per group in the `yields` data set. The formula takes the form `x ~ y`, where `y` is the grouping variable (in this case `normal`), as shown next.
```{r}
head(agg_by_group(yields, . ~ normal, length))
```
```{r}
head(agg_by_group(mtcars, cyl ~ hp + vs, sum))
```
- `rowdiff`
This is useful when trying to find differences between rows. The `direction` argument specifies how the subtractions are made, while the `exclude` argument specifies column classes to drop before the calculation. Using `direction = "reverse"` subtracts each row from the row below it, akin to `x - (x - 1)` where `x` is the row number.
```{r}
head(rowdiff(yields, exclude = "factor", direction = "reverse"))
```
- `na_replace`
This allows the user to conveniently replace missing values. Current options are `ffill`, which replaces an `NA` with the next non-missing value; `samples`, which samples the observed data to perform the replacement; and `value`, which fills `NA`s with a specific value. Other common mathematical methods such as `min`, `max`, `get_mode`, `sd`, etc. are no longer supported; they are now available, with more flexibility, in the standalone [mde](https://github.com/Nelson-Gon/mde) package.
```{r}
head(na_replace(airquality, how = "value", value = "Missing"), 8)
```
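The other `how` options follow the same pattern. A sketch of an `ffill`-style replacement, behaving as described above, might look like this:
```{r, eval=FALSE}
# replace NAs with the next non-missing value (sketch)
head(na_replace(airquality, how = "ffill"), 8)
```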
- `na_replace_grouped`
This provides a convenient way to replace values by group.
```{r}
test_df <- data.frame(A = c(NA, 1, 2, 3), B = c(1, 5, 6, NA), groups = c("A", "A", "B", "B"))
# replace NAs by group, using the next non-NA value within each group
na_replace_grouped(df = test_df, group_by_cols = "groups", how = "ffill")
```
The use of `mean`, `sd`, etc. is no longer supported. Use [mde](https://github.com/Nelson-Gon/mde) instead, which focuses on handling missing data.
---
**Exploring Further**
This vignette is short and therefore not exhaustive. The best way to explore this (or any) package is to practise. For more examples, please use `?function_name` to see further uses of a given function.
**Reporting Issues**
If you would like to contribute, report issues, or improve any of these functions, please open an issue or raise a pull request at [manymodelr](https://github.com/Nelson-Gon/manymodelr).
> "Programs must be written for people to read, and only incidentally for machines to execute." - Harold Abelson ([Reference](https://www.goodreads.com/quotes/tag/programming))
**Thank You**