---
title: "Regression: Ames Model"
author: "Axel R"
date: "2024-01-21"
output: 
  rmdformats::robobook:
    self_contained: true
    thumbnails: true
    lightbox: true
    gallery: false
    highlight: tango
---

**Remark**: This material is based on an activity assigned by Renan Escalante, PhD.

```{r eval=FALSE, message=FALSE, warning=FALSE, include=FALSE}
library(rmdformats)
```


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidymodels)
library(dotwhisker)
library(vip)
```

-   `knitr::opts_chunk$set(echo = TRUE)` sets `echo = TRUE` as the default for every chunk, so the R code is shown alongside its output in the rendered report. `knitr` is the package that R Markdown uses for dynamic report generation.
-   The `dotwhisker` package is used for creating dot-and-whisker plots, which are useful for visualizing model coefficients and their uncertainty intervals.
-   The `vip` (Variable Importance Plots) package provides functions for calculating and visualizing variable importance in predictive models. Variable importance helps to understand which features have the most impact on a model's predictions.

# Fitting data

```{r}
tidymodels_prefer() 

data(ames)

set.seed(123)
```

-   `tidymodels_prefer()` resolves function-name conflicts between the tidymodels packages and other loaded packages. When two packages export a function with the same name, the tidymodels version is preferred, so the calls in this document behave the same regardless of which other packages are attached.
-   The `ames` dataset contains information about housing sales in Ames, Iowa, and is often used for regression modeling tasks. It ships with the `modeldata` package, which is loaded as part of tidymodels.
-   `set.seed(123)` sets the seed of R's random number generator to 123. With a fixed seed, the sequence of random numbers is reproducible: running the code again gives the same train/test split and the same cross-validation folds.
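
The effect of the seed can be checked with a minimal base R sketch. The vector names below are illustrative and are not part of the Ames analysis.

```r
# Same seed, same draws: re-seeding restarts the random number stream.
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE

# Without re-seeding, the stream simply continues from where it stopped.
third_draw <- sample(1:100, 5)
```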

Next, we prepare the dataset (`ames`) for a regression model by transforming the target variable (`Sale_Price`) with the base-10 logarithm (`log10`). We then perform a train/test split with the `initial_split` function and extract the training (`ames_train`) and testing (`ames_test`) sets.

```{r}
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

# If you do one train/test split 
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75) #Create Train/Test set
ames_train <- training(data_split) # Fit model to this
ames_test  <- testing(data_split) # Don't use until evaluating final model
```

-   `ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))` uses the `%>%` (pipe) operator, which comes from `magrittr` and is re-exported by `dplyr`, to modify the `ames` dataset. It overwrites `Sale_Price` with its base-10 logarithm. This transformation is often applied to make the distribution of the target variable more symmetric and to stabilize the variance in regression modeling.
-   `data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)` creates a train/test split of the dataset using the `initial_split` function from the `rsample` package (part of the tidymodels ecosystem). The `prop` argument sets the proportion of the data assigned to the training set (0.75 means 75% in the training set and 25% in the testing set). The `strata` argument names the variable used to stratify the split: because `Sale_Price` is numeric, it is binned (into quartiles by default) and the sampling is done within each bin, so the training and testing sets have a similar distribution of the target variable.
-   `ames_train <- training(data_split)` extracts the training set, which is the only data used to fit the model.
-   `ames_test <- testing(data_split)` extracts the testing set from the split using the `testing` function. The `ames_test` dataset is kept separate and is not used until the final model is trained and needs to be evaluated.
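
To make the idea of stratification concrete, here is a minimal base R sketch of a stratified 75/25 split. It uses a simulated data frame (`toy`, a hypothetical stand-in, not the `ames` data) and is only an illustration of the principle, not the `rsample` implementation.

```r
set.seed(123)

# Hypothetical stand-in for a skewed price column.
toy <- data.frame(id = 1:200, price = exp(rnorm(200, mean = 12, sd = 0.4)))
toy$log_price <- log10(toy$price)

# Bin the outcome into quartiles; sampling within bins is what stratification means.
toy$bin <- cut(toy$log_price,
               breaks = quantile(toy$log_price, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)

# Draw 75% of the row indices within each bin.
rows_by_bin <- split(seq_len(nrow(toy)), toy$bin)
train_idx <- unlist(
  lapply(rows_by_bin, function(idx) idx[sample.int(length(idx), floor(0.75 * length(idx)))]),
  use.names = FALSE
)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Share of each quartile bin in the two sets.
prop.table(table(toy_train$bin))
prop.table(table(toy_test$bin))
```

Because the sampling happens inside each bin, no price range can end up over-represented in one of the two sets by chance.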

```{r}
# Model Specification:
lm_spec <- 
  linear_reg() %>%  # Specify Model and Engine
  set_engine( engine = 'lm') %>%
  set_mode('regression') 
# Recipe Specification: formula plus pre-processing steps
lm_rec <- recipe(Sale_Price ~ Lot_Area + Year_Built + House_Style + Gr_Liv_Area + Fireplaces, data = ames_train) %>%
  step_lincomb(all_numeric_predictors()) %>% # Drop numeric predictors that are exact linear combinations of others
  step_zv(all_numeric_predictors()) %>%
  step_mutate(Gr_Liv_Area = Gr_Liv_Area/100, Lot_Area = Lot_Area/100) %>%
  step_mutate(Fireplaces = Fireplaces > 0) %>%
  step_cut(Year_Built, breaks = c(0, 1950, 1990, 2020)) %>%
  step_other(House_Style,threshold = .1) %>%
  step_dummy(all_nominal_predictors())
# Pre-processing Training Data
train_prep <- lm_rec %>% 
  prep() %>%
  bake(new_data = ames_train) # Pre-process Training Data
```

-   `linear_reg()`: Specifies that a linear regression model will be used.
-   `set_engine(engine = 'lm')`: Sets the engine for the model to 'lm', so the model is fitted with base R's `stats::lm()` function.
-   `set_mode('regression')`: Sets the mode of the model to regression.
-   `recipe(...)`: Defines a recipe for pre-processing data in a machine learning workflow.
-   `step_lincomb(all_numeric_predictors())`: Removes numeric predictors that are exact linear combinations of other numeric predictors.
-   `step_zv(all_numeric_predictors())`: Removes numeric predictors with zero variance, that is, columns that contain a single unique value.
-   `step_mutate(...)`: Performs variable transformations, such as rescaling (`Gr_Liv_Area` and `Lot_Area` are divided by 100) and creating binary indicators (`Fireplaces` becomes `TRUE` if the original value is greater than 0).
-   `step_cut(...)`: Cuts a numeric predictor into intervals. Here `Year_Built` is binned at 1950 and 1990, which gives three periods.
-   `step_other(...)`: Pools infrequent levels of a categorical predictor into a single `other` level. With `threshold = .1`, levels of `House_Style` that occur in less than 10% of the training rows are pooled.
-   `step_dummy(all_nominal_predictors())`: Converts categorical predictors into 0/1 dummy (indicator) variables.
-   `prep()`: Prepares the recipe, estimating necessary parameters.
-   `bake(new_data = ames_train)`: Applies the pre-processing steps to the training data (`ames_train`), producing a new pre-processed dataset (`train_prep`).
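
The pooling and dummy-coding steps can be sketched in base R as follows. The `style` vector is hypothetical, and the code only mimics the idea behind `step_other()` and `step_dummy()`; it is not how `recipes` implements them.

```r
# Hypothetical categorical predictor (not the ames data).
style <- factor(c(rep("One_Story", 12), rep("Two_Story", 6), "Split", "Other_Rare"))

# Pool levels whose relative frequency is below 10% into a single "other" level.
freq <- prop.table(table(style))
rare <- names(freq)[freq < 0.10]
style_lumped <- as.character(style)
style_lumped[style_lumped %in% rare] <- "other"
style_lumped <- factor(style_lumped)

# Dummy-code the pooled factor: one 0/1 column per level except the reference level.
dummies <- model.matrix(~ style_lumped)[, -1, drop = FALSE]
head(dummies)
```

Three levels remain after pooling, so two indicator columns are enough; the omitted level acts as the reference category in the regression.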

```{r}
# Create Workflow
ames_wf <- workflow() %>% # Create Workflow (Recipe + Model Spec)
  add_recipe(lm_rec) %>%
  add_model(lm_spec)  
# Fit Model to Training Data
lm_fit_train <- ames_wf %>%
  fit(data = ames_train)  # Fit Model to Training Data
# Calculate Training Metrics
train_prep %>%
  select(Sale_Price) %>%
  bind_cols( predict(lm_fit_train, ames_train) ) %>% 
  metrics(estimate = .pred, truth = Sale_Price)  # Calculate Training metrics
# Extract Model Coefficients
lm_fit_train %>%
  tidy() # Model Coefficients from Trained Model
```

-   `workflow()`: Initializes an empty workflow object.
-   `add_recipe(lm_rec)`: Adds the pre-processing recipe (lm_rec) to the workflow.
-   `add_model(lm_spec)`: Adds the linear regression model specification (lm_spec) to the workflow.
-   `fit(data = ames_train)`: Fits the workflow to the training data (ames_train). This step applies the pre-processing steps defined in the recipe to the training data and then fits the linear regression model.
-   `train_prep %>% select(Sale_Price)`: Selects the target variable (`Sale_Price`) from the pre-processed training data.
-   `bind_cols(predict(lm_fit_train, ames_train))`: Predicts the target variable using the fitted workflow (`lm_fit_train`) on the training data (`ames_train`) and binds the predicted values (column `.pred`) to the selected target variable.
-   `metrics(estimate = .pred, truth = Sale_Price)`: Calculates metrics for model evaluation by comparing the predicted values (`.pred`) with the actual values (`Sale_Price`). For a numeric outcome, `metrics()` returns `rmse`, `rsq`, and `mae`. Because they are computed on the same data used to fit the model, these training metrics tend to be optimistic.
-   `tidy()`: Extracts the model coefficients from the fitted workflow (`lm_fit_train`). The `tidy` function is part of the `broom` package and returns a tidy data frame with one row per coefficient.
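
For reference, the three metrics can be computed by hand in base R. The values below are made up for illustration, and R-squared is computed here as the squared correlation between observed and predicted values, which is, to my understanding, how `yardstick::rsq()` defines it.

```r
# Hypothetical observed and predicted values on the log10 scale.
truth    <- c(5.10, 5.25, 5.40, 5.05, 5.60)
estimate <- c(5.15, 5.20, 5.35, 5.10, 5.50)

rmse_val <- sqrt(mean((truth - estimate)^2))  # root mean squared error
mae_val  <- mean(abs(truth - estimate))       # mean absolute error
rsq_val  <- cor(truth, estimate)^2            # squared correlation

c(rmse = rmse_val, mae = mae_val, rsq = rsq_val)
```

Since `Sale_Price` is on the log10 scale, errors are multiplicative on the original scale: an error of 0.06 in log10 units corresponds to a factor of about `10^0.06`, roughly 15%.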

```{r}
library(dotwhisker)

tidy(lm_fit_train) %>%  # Viz of Trained Model Coef
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, color = "grey50", linetype = 2))
```

-   `tidy` is used to extract the coefficients and associated statistics from the trained linear regression model (`lm_fit_train`). It converts the model information into a tidy data frame.

-   `dwplot` is used to create a dot-and-whisker plot to visualize the model coefficients.

-   `dot_args = list(size = 2, color = "black")`: This argument specifies the appearance of the dots representing the coefficients. Here, dots are set to be of size 2 and black in color.

-   `whisker_args = list(color = "black")`: This argument specifies the appearance of the whiskers. Whiskers are the lines extending from the dots, indicating the uncertainty or confidence interval around the coefficients. Here, whiskers are set to be black.

-   `vline = geom_vline(xintercept = 0, color = "grey50", linetype = 2)`: This argument adds a dashed vertical line at x = 0, colored `grey50`. It shows at a glance which confidence intervals exclude zero: a whisker that does not cross the line indicates a coefficient that differs significantly from zero at the plotted confidence level.
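
What the whiskers represent can be sketched in base R from the estimates and standard errors. The example assumes a normal-approximation 95% interval and uses the built-in `mtcars` data rather than the Ames workflow; `dwplot` may compute its intervals differently.

```r
# Illustrative model on the built-in mtcars data (not the ames workflow).
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- as.data.frame(summary(toy_fit)$coefficients)

# Normal-approximation 95% interval: estimate +/- about 1.96 standard errors.
z <- qnorm(0.975)
coefs$lower <- coefs$Estimate - z * coefs$`Std. Error`
coefs$upper <- coefs$Estimate + z * coefs$`Std. Error`

# An interval that excludes 0 corresponds to a whisker that does not cross the dashed line.
coefs$excludes_zero <- coefs$lower > 0 | coefs$upper < 0
coefs[, c("Estimate", "lower", "upper", "excludes_zero")]
```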

```{r}
# Create Cross-Validation Folds
ames_cv <- vfold_cv(ames_train, v = 10, strata = Sale_Price) # Create 10 Folds of Training Data for CV
# Fit Model to Cross-Validation Folds and Evaluate Metrics
lm_fit_cv <- fit_resamples(ames_wf, # Fit Model to 10 Folds of Training Data
              resamples = ames_cv,
              metrics = metric_set(rmse, mae, rsq))
```

-   `vfold_cv(...)`: Creates a set of cross-validation folds using the vfold_cv function. This function is part of the rsample package.
-   `ames_train`: The dataset used for training the model.
-   `v = 10`: Specifies the number of folds for cross-validation. In this case, it creates 10 folds.
-   `strata = Sale_Price`: Indicates that the cross-validation should be stratified based on the values of the Sale_Price variable. Stratified cross-validation ensures that each fold has a similar distribution of the target variable.
-   `fit_resamples(...)`: Fits the workflow (`ames_wf`) to each cross-validation fold and evaluates performance metrics on the held-out part of that fold. The `fit_resamples` function comes from the `tune` package; the metric functions come from `yardstick`.
-   `ames_wf`: The workflow object that includes both the data pre-processing recipe and the linear regression model.
-   `resamples = ames_cv`: The cross-validation folds generated earlier are provided as the resampling plan.
-   `metrics = metric_set(rmse, mae, rsq)`: Specifies the metrics to be calculated during cross-validation. In this case, root mean squared error (rmse), mean absolute error (mae), and R-squared (rsq) are used.
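
The mechanics of v-fold cross-validation can be sketched by hand in base R. This is a simplified, unstratified illustration on the built-in `mtcars` data with 5 folds; it is not the `rsample`/`tune` implementation, and the object names are illustrative.

```r
set.seed(123)
k <- 5

# Randomly assign each row of mtcars to one of k folds.
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

fold_rmse <- sapply(seq_len(k), function(i) {
  analysis   <- mtcars[fold_id != i, ]  # rows used to fit the model
  assessment <- mtcars[fold_id == i, ]  # held-out rows used to score it
  fit  <- lm(mpg ~ wt + hp, data = analysis)
  pred <- predict(fit, newdata = assessment)
  sqrt(mean((assessment$mpg - pred)^2))
})

fold_rmse        # one RMSE per fold
mean(fold_rmse)  # cross-validated RMSE estimate
```

Each row is held out exactly once, so the averaged metric estimates performance on unseen data without touching the test set.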

```{r}
lm_fit_cv %>% collect_metrics() # Evaluate Trained Model using CV
```

-   We want $R^2$ to be larger (that is, we want the model to explain more of the variability); for `rmse` and `mae`, smaller values are better.

```{r}
# Apply Model to Test Data
lm_fit_test <- last_fit(ames_wf,
         split = data_split) 
# Collect Evaluation Metrics on Test Data
lm_fit_test %>%
  collect_metrics() #Evaluation on Test Data
```

-   `last_fit(...)`: Fits the workflow one final time on the full training set and then evaluates it on the test set. The `last_fit` function comes from the `tune` package.
-   `ames_wf`: The workflow object containing both the pre-processing recipe and the linear regression model.
-   `split = data_split`: The split object created earlier with `initial_split`, which defines both the training rows and the test rows.
-   `%>%`: The pipe operator passes `lm_fit_test` to the next function.
-   `collect_metrics()`: Extracts the evaluation metrics computed on the test data from the `lm_fit_test` object (by default `rmse` and `rsq` for regression).

```{r}
library(vip)
# Resolve Namespace Conflict
conflicted::conflict_prefer("vi", "vip")
# Extract Fit Engine Information
mod <- lm_fit_train %>% extract_fit_engine() 
```

-   `conflicted::conflict_prefer("vi", "vip")`: Uses the `conflicted` package to declare that, whenever more than one loaded package exports a function named `vi`, the version from the `vip` package should be used.
-   `extract_fit_engine()` extracts the underlying model object created by the engine from the fitted workflow. Here that is the plain `lm` object, which functions outside tidymodels (such as `vi()` and `check_model()`) can work with directly.

```{r}
vi(mod, method = 'permute', target = 'Sale_Price', metric = 'rmse', train = train_prep, pred_wrapper = predict)
```

-   `mod`: The fitted `lm` object extracted above, for which variable importance is assessed.

-   `method = 'permute'`: Selects permutation-based variable importance.

-   `target = 'Sale_Price'`: The name of the outcome column in the `train` data.

-   `metric = 'rmse'`: The metric used to measure the model's performance before and after permutation, here the root mean squared error (rmse).

-   `train = train_prep`: The pre-processed training data. Because `mod` is the bare `lm` object and not the workflow, it expects the columns produced by the recipe, so the baked data must be supplied.

-   `pred_wrapper = predict`: The function used to generate predictions from the model on the permuted data.

With these arguments, `vi()` computes permutation importance: for each feature, it shuffles that column, recomputes the predictions, and measures how much the rmse on `Sale_Price` degrades compared with the unshuffled data. The larger the degradation, the more the model relies on that feature.
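
The permutation idea can be sketched by hand in base R. This is a simplified single-shuffle illustration on the built-in `mtcars` data, not the `vip` implementation; the helper `rmse()` and the object names are defined here for the example only.

```r
set.seed(123)

# Illustrative model on the built-in mtcars data (not the ames fit).
toy_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Helper defined for this sketch.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
baseline <- rmse(mtcars$mpg, predict(toy_fit, newdata = mtcars))

features <- c("wt", "hp", "qsec")
importance <- sapply(features, function(feature) {
  shuffled <- mtcars
  shuffled[[feature]] <- sample(shuffled[[feature]])  # break the link to the outcome
  rmse(shuffled$mpg, predict(toy_fit, newdata = shuffled)) - baseline
})

# Increase in RMSE after shuffling each feature, largest first.
sort(importance, decreasing = TRUE)
```

A single shuffle is noisy; repeating the permutation several times and averaging the results gives a more stable ranking.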

```{r}
vip(mod, method = 'permute', target = 'Sale_Price', metric = 'rmse', train = train_prep, pred_wrapper = predict) + theme_classic()
```

# Check performance

We check the model assumptions with the `performance` package from the easystats ecosystem.

```{r, fig.height=12, fig.width=8}
#mod <- lm_fit_train %>% extract_fit_engine() 
library(performance)
mod %>% check_model()
```

-   `check_model()`: This function from the **`performance`** package is used for model diagnostics. It produces a panel of diagnostic plots for checking model assumptions, such as linearity, homogeneity of variance, collinearity, influential observations, and normality of residuals.
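
Some of these checks also have simple numeric counterparts in base R. The sketch below uses the built-in `mtcars` data and an illustrative model (`toy_fit`), and it only approximates a few of the diagnostics; it is not what `check_model()` runs internally.

```r
# Illustrative model on the built-in mtcars data (not the ames fit).
toy_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Normality of residuals: Shapiro-Wilk test
# (a large p-value gives no evidence against normality).
normality <- shapiro.test(residuals(toy_fit))
normality$p.value

# Collinearity: variance inflation factor of `wt`,
# obtained by regressing it on the other predictors.
r2_wt  <- summary(lm(wt ~ hp + qsec, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt

# Influential observations: largest Cook's distance.
max(cooks.distance(toy_fit))
```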