Q. What is linear regression, and how does it work?
Answer
Linear regression is a statistical model that assumes the regression function $E(Y|X)$ is linear in the inputs $X_1, \ldots, X_p$.
It takes the following form:

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$$

Note that here the inputs $X_j$ can come from different sources:
- Quantitative inputs or transformations of them
- Basis expansions, such as $X_{2} = X_{1}^2$, $X_{3} = X_{1}^3$, leading to a polynomial representation
- Encoded categorical values
- Interactions between variables, like $X_{3} = X_{1} \cdot X_{2}$

It uses least squares as an estimation method to calculate the values of the coefficients.
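As a minimal illustration, a sketch using hypothetical data and scikit-learn's `LinearRegression`, which fits the coefficients by ordinary least squares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: y depends linearly on two inputs plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression()   # ordinary least squares fit
model.fit(X, y)

print("Intercept (beta_0):", model.intercept_)
print("Coefficients (beta_1, beta_2):", model.coef_)
```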
Q. How to determine the coefficients of a simple linear regression model?
Answer
Suppose we have a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$ and the model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. Least squares chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the residual sum of squares (RSS):

$$RSS = \sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$$

In order to minimize the above expression, we differentiate with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ and set the derivatives to zero.
Assuming that the $x_i$ are not all identical, so that $\sum_{i}(x_i - \bar{x})^2 > 0$, we obtain the unique solution:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the sample means.
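A quick numerical check of these closed-form expressions, using hypothetical data and plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 4.0 + 1.8 * x + rng.normal(scale=1.0, size=50)   # hypothetical linear data

# Closed-form least squares estimates for simple linear regression
x_bar, y_bar = x.mean(), y.mean()
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

print(beta_0, beta_1)   # should be close to 4.0 and 1.8
```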
Q. In which scenarios can a linear model outperform fancier non-linear models?
Answer
It may happen in the following cases:
- Low signal to noise ratio
- Near perfect linearity between predictors and the target variable
- Sparse data
- Small number of training instances
Q. Suppose a model takes the form of a polynomial, $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \ldots + \beta_k X^k + \epsilon$. Is it still a linear model?
Answer
Yes, the model is still linear in nature. This is a polynomial representation of a linear model.
We can write the given form in its linear mode:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \epsilon$$

where $X_j = X^j$.
No matter the source of the transformed inputs, the model remains linear in its parameters $\beta_j$, which is what "linear" refers to in linear regression.
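A short sketch of fitting a polynomial as a linear model, using hypothetical data; scikit-learn's `PolynomialFeatures` builds the transformed inputs and `LinearRegression` stays linear in the parameters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

# Degree-2 basis expansion: columns X and X^2, then an ordinary linear fit
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(X, y)
print(model.named_steps["linearregression"].coef_)   # approx [2.0, -0.5]
```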
Q. What are the assumptions of linear regression?
Answer
The main assumptions of linear models are the following:
- Linear relationship between predictors and response - if not, the model may underfit and give biased predictions
- Predictors should be independent of each other (no collinearity) - otherwise interpretation of the output becomes messy and the model is unnecessarily complicated
- Homoscedasticity: constant variance in the error terms (residuals) - the standard errors, confidence intervals, and hypothesis tests rely on this assumption
- Uncorrelated error terms (residuals) - if residuals are correlated, we may have false confidence in our model
- Data should not have outliers - heavy outliers can distort the fit and the predictions

Q. Explain the difference between simple linear regression and multiple linear regression.
Answer
The key difference between simple linear regression and multiple linear regression lies in the number of independent variables used to predict the dependent variable.
Differences:
- Number of Independent Variables:
  - Simple Linear Regression: One independent variable.
  - Multiple Linear Regression: Two or more independent variables.
- Complexity:
  - Simple Linear Regression: Simpler and easier to interpret since it involves only one predictor.
  - Multiple Linear Regression: More complex due to the involvement of multiple predictors, and it requires more sophisticated techniques for interpretation and model validation.
- Equation Form:
  - Simple Linear Regression: $Y = \beta_0 + \beta_1 X + \epsilon$
  - Multiple Linear Regression: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \epsilon$
Q. What is Residual Standard Error(RSE) and how to interpret it?
Answer
The RSE is an estimate of the standard deviation of the error term $\epsilon$ (the residuals).
It is computed using the formula:

$$RSE = \sqrt{\frac{RSS}{n - p - 1}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - p - 1}}$$

where $n$ is the number of observations and $p$ is the number of predictors (for simple linear regression, $p = 1$, giving $\sqrt{RSS/(n-2)}$).
It is considered a measure of the lack of fit of the model to the data. Lower values indicate that the model fits the data well.
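A small sketch of computing RSE from fitted values (hypothetical arrays; `n` and `p` as defined above):

```python
import numpy as np

def residual_standard_error(y_true, y_pred, p):
    """RSE = sqrt(RSS / (n - p - 1)), where p is the number of predictors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rss = np.sum((y_true - y_pred) ** 2)
    n = len(y_true)
    return np.sqrt(rss / (n - p - 1))

# Example with hypothetical values
print(residual_standard_error([3.1, 4.9, 7.2, 9.0], [3.0, 5.0, 7.0, 9.1], p=1))
```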
Q. What is the purpose of the coefficient of determination (R-squared) in linear regression?
Answer
It takes the form of a proportion (the proportion of variance explained), so it always takes values between 0 and 1, and it is not dependent on the scale of $Y$.
To calculate $R^2$, we use the formula:

$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$

Here, $TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares.
And, $RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is the residual sum of squares.
Statistically, it measures the proportion of variability in $Y$ that can be explained using $X$.
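A minimal sketch of the $R^2$ computation from the formula above (hypothetical arrays):

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - rss / tss

print(r_squared([3.1, 4.9, 7.2, 9.0], [3.0, 5.0, 7.0, 9.1]))
```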
Q. How to interpret the values of $R^2$?
Answer
A number near 0 indicates the regression does not explain the variability in the response, whereas 1 indicates a large proportion of the variability in the response is explained by the regression.
Q. How do you interpret the coefficients in a linear regression model?
Answer
Suppose we have a model of the form:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \epsilon$$

Here's how to interpret the coefficients:
- Intercept ($\beta_0$):
  - It is the expected value of $Y$ when all predictors are zero, i.e. the point where the regression line crosses the y-axis.
- Slope Coefficient ($\beta_i$):
  - For each independent variable, the slope coefficient ($\beta_i$) indicates the expected change in the dependent variable for a one-unit increase in that independent variable, holding all other variables constant.
  - A positive/negative value means an increase in the independent variable leads to an increase/decrease in the response.
- Statistical Significance:
- The p-value associated with each coefficient helps determine if the relationship is statistically significant.
- Magnitude:
- The magnitude of the coefficient shows the strength of the relationship between the independent and dependent variables.
Q. What is the difference between correlation and regression?
Answer
- Correlation quantifies the degree to which two variables are related, without distinguishing between dependent and independent variables.
- Regression models the dependence of a variable on one or more other variables, providing a predictive equation and allowing for an analysis of the effect of each predictor.
Q. What are the methods to assess the goodness of fit of a linear regression model?
Answer
There are several methods to measure goodness of fit with some pros and cons:
- R-squared ($R^2$)
- Adjusted R-squared
- Residual Standard Error (RSE) or Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
We can use combinations of the above statistics to evaluate model performance.
Q. What is the purpose of the F-statistic in linear regression?
Answer
The F-statistic is mainly used for hypothesis testing, where we want to assess whether at least one of the predictors $X_1, X_2, \ldots, X_p$ is useful in predicting the response.
For example, the null hypothesis: $H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$
and the alternative hypothesis: $H_a:$ at least one $\beta_j$ is non-zero.
The hypothesis test is performed by computing the F-statistic:

$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$$

If the linear model assumptions are true, $E\left[RSS/(n - p - 1)\right] = \sigma^2$,
and provided $H_0$ is true, $E\left[(TSS - RSS)/p\right] = \sigma^2$.
So, when there is no relationship between the predictors and the response, the F-statistic is expected to be near 1, and if $H_a$ is true, we expect $F$ to be greater than 1.
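A sketch of the F-statistic computation from RSS and TSS (hypothetical fitted values; `scipy.stats.f` gives the p-value under $H_0$):

```python
import numpy as np
from scipy import stats

def f_statistic(y_true, y_pred, p):
    """Overall F-test for a linear model with p predictors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    f = ((tss - rss) / p) / (rss / (n - p - 1))
    p_value = stats.f.sf(f, p, n - p - 1)   # upper-tail probability
    return f, p_value
```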
Q. What are the potential problems in linear regression analysis, and how can you address them?
Answer
A linear regression model may mainly suffer from the following issues:
- Non-linearity: Transform variables or use polynomial regression.
- Multicollinearity: Remove or combine correlated predictors, use regularization.
- Heteroscedasticity: Use robust standard errors, transform the dependent variable.
- Outliers: Identify and handle outliers using diagnostic plots or robust regression.
- Overfitting: Use cross-validation, simplify the model, or apply regularization.
- Non-normality of Residuals: Transform variables or use non-parametric methods.
Q. What are some regularization techniques used in linear regression, and when are they applicable?
Answer
The common regularization techniques for linear regression are Ridge regression (L2 penalty), Lasso regression (L1 penalty), and Elastic Net (a combination of both). They are applicable when the model overfits, when predictors are highly correlated, or when $p$ is large relative to $n$; these methods are discussed in detail in the shrinkage-method questions below.
Q. Can you explain the concept of bias-variance trade-off in the context of linear regression?
Answer
The bias-variance trade-off describes how test error decomposes into bias (error from overly simple assumptions) and variance (error from sensitivity to the particular training sample). Ordinary least squares is a relatively low-bias, high-variance procedure when the number of predictors is large; techniques such as subset selection and shrinkage deliberately introduce a small amount of bias in exchange for a larger reduction in variance, which can lower the overall test error.
Q. What do you mean by subset selection and how it is useful in linear regression models?
Answer
This approach involves identifying a subset of the $p$ predictors that we believe to be related to the response. We then fit a model on the reduced set of variables or predictors.
This is useful because it:
- Improves model interpretability: Reduces complexity by using fewer predictors.
- Enhances prediction accuracy: Removes irrelevant or redundant predictors that may add noise.
- Prevents overfitting: Reduces the risk of the model fitting to the noise in the training data, leading to better generalization to new data.
Q. What are some methods for selecting a subset of predictors?
Answer
There are mainly two methods for subset selection:
- Best Subset Selection
- Stepwise Selection
- Forward Stepwise Selection
- Backward Stepwise Selection
Q. Explain the Best Subset Selection method.
Answer
Suppose we have $p$ predictors and want to find the best subset of them.
Here is the stepwise algorithm:
- Let $M_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
- For $k = 1, 2, \ldots, p$:
  - Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
  - Pick the best among these $\binom{p}{k}$ models and call it $M_k$, on the basis of smallest RSS or largest $R^2$.
- Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$, $BIC$, or adjusted $R^2$.
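A brute-force sketch of best subset selection on hypothetical data: the best model of each size is picked by training $R^2$, and the sizes are then compared by cross-validation, as the algorithm suggests.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 5 hypothetical predictors
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=100)

best_per_size = {}
for k in range(1, X.shape[1] + 1):
    best_score, best_combo = -np.inf, None
    # Step 2: fit all C(p, k) models with exactly k predictors, keep the best by R^2
    for combo in combinations(range(X.shape[1]), k):
        cols = list(combo)
        score = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        if score > best_score:
            best_score, best_combo = score, combo
    best_per_size[k] = best_combo

# Step 3: compare the best model of each size using cross-validation
for k, combo in best_per_size.items():
    cv = cross_val_score(LinearRegression(), X[:, list(combo)], y, cv=5).mean()
    print(k, combo, round(cv, 3))
```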
Q. What is the drawback of selecting the best subset of features on the basis of the Residual Sum of Squares (RSS) or $R^2$?
Answer
As we include more features in the model, RSS monotonically decreases and $R^2$ monotonically increases, so these statistics always favor the model with the most predictors. They measure training error, which can be a poor proxy for test error, so selecting purely on RSS or $R^2$ tends to pick an overfit model.
Figure: Number of Predictors vs. $R^2$ and RSS
Q. What are the limitations of Best Subset Selection method?
Answer
This method is simple and easy to understand, but it suffers from computational limitations. As we increase the number of predictors $p$, the number of candidate models grows exponentially.
In general, there are $2^p$ models that involve subsets of $p$ predictors, so for $p = 20$ there are already over one million models to fit.
Q. Given a use case that necessitates building a predictive model with a large number of features/predictors, which feature selection method would be most appropriate?
Answer
Stepwise selection (forward or backward stepwise), since best subset selection becomes computationally infeasible when the number of predictors is large.
Q. Why is the Forward Stepwise Selection method better than Best Subset Selection?
Answer
Forward stepwise selection is a computationally efficient alternative to best subset selection. While the best subset selection procedure considers all $2^p$ possible models containing subsets of the $p$ predictors, forward stepwise considers a much smaller set of models: it fits roughly $1 + p(p+1)/2$ models in total.
Q. How does the forward stepwise selection method work?
Answer
Forward stepwise selection steps:
- Let $M_0$ denote the null model, which contains no predictors.
- For $k = 0, \ldots, p-1$:
  - Consider all $p-k$ models that augment the predictors in $M_k$ with one additional predictor.
  - Choose the best among these $p-k$ models and call it $M_{k+1}$, on the basis of smallest RSS or largest $R^2$.
- Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$, $BIC$, or adjusted $R^2$.
Example:
Suppose you have a dataset with five predictors, $X_1, X_2, X_3, X_4, X_5$, and a response $Y$.
Steps:
- Initialization:
  - Start with an empty model (no predictors).
- Iteration:
  - Iteration 1:
    - Fit five simple linear regression models, each with one predictor:
      - Model 1: $Y \sim X_1$
      - Model 2: $Y \sim X_2$
      - Model 3: $Y \sim X_3$
      - Model 4: $Y \sim X_4$
      - Model 5: $Y \sim X_5$
    - Choose the model that has the best performance according to a criterion (e.g., lowest AIC). Suppose $X_3$ provides the best improvement.
    - Add $X_3$ to the model.
  - Iteration 2:
    - Fit four models, each adding one more predictor to the model with $X_3$:
      - Model 1: $Y \sim X_3 + X_1$
      - Model 2: $Y \sim X_3 + X_2$
      - Model 3: $Y \sim X_3 + X_4$
      - Model 4: $Y \sim X_3 + X_5$
    - Choose the model that has the best performance. Suppose $X_2$ provides the best improvement.
    - Add $X_2$ to the model.
  - Iteration 3:
    - Fit three models, each adding one more predictor to the model with $X_3$ and $X_2$:
      - Model 1: $Y \sim X_3 + X_2 + X_1$
      - Model 2: $Y \sim X_3 + X_2 + X_4$
      - Model 3: $Y \sim X_3 + X_2 + X_5$
    - Choose the model that has the best performance. Suppose $X_1$ provides the best improvement.
    - Add $X_1$ to the model.
- Stopping Criterion:
  - Stop adding predictors when the model improvement is no longer significant according to the chosen criterion.
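A sketch of forward stepwise selection using scikit-learn's `SequentialFeatureSelector` on hypothetical data with five predictors; the selector greedily adds one feature at a time based on cross-validated score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))                       # X_1 ... X_5 (hypothetical)
y = 4 * X[:, 2] + 2 * X[:, 1] + rng.normal(size=200)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,   # stop after three predictors have been added
    direction="forward",
    cv=5,
)
selector.fit(X, y)
print("Selected predictors:", np.flatnonzero(selector.get_support()))
```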
Q. What is the major issue with stepwise selection?
Answer
Stepwise selection methods like forward/backward stepwise selection are essentially greedy algorithms and hence can produce a sub-optimal subset of features. They are not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the $p$ predictors.
Q. Imagine you have a dataset with more predictors than observations ($n < p$). Which subset selection method can you still apply?
Answer
For cases where $n < p$, forward stepwise selection can still be applied (although it can only construct the submodels $M_0, \ldots, M_{n-1}$), whereas best subset selection and backward stepwise selection require $n > p$ so that the full model can be fit.
Q. Explain the Backward Stepwise Selection method.
Answer
Backward stepwise selection steps:
- Let $M_p$ denote the full model, which contains all $p$ predictors.
- For $k = p, p-1, \ldots, 1$:
  - Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k-1$ predictors each.
  - Choose the best among these $k$ models and call it $M_{k-1}$, on the basis of smallest RSS or largest $R^2$.
- Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$, $BIC$, or adjusted $R^2$.
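The same `SequentialFeatureSelector` sketch works for backward elimination by starting from the full model and dropping one predictor per step (hypothetical data again):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 2] + 2 * X[:, 1] + rng.normal(size=200)

backward = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="backward",   # start from the full model, remove one predictor per step
    cv=5,
)
backward.fit(X, y)
print("Retained predictors:", np.flatnonzero(backward.get_support()))
```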
Q. What is one issue with the Backward Stepwise Selection method?
Answer
It is not suitable for cases where $n < p$: the full model containing all $p$ predictors cannot be fit when the number of observations is smaller than the number of predictors, so backward selection requires $n > p$.
Q. Is it good idea to combine forward and backward stepwise subset selection technique as a single method?
Answer
Yes, we can adopt a hybrid approach in which predictors are added sequentially, as in forward selection, but after adding each predictor we may also remove any predictor that no longer improves the model fit. This mimics best subset selection while retaining the computational advantages of forward and backward stepwise selection.
Q. What are shrinkage methods?
Answer
Shrinkage methods are statistical techniques used to improve the accuracy and interpretability of regression models, especially when dealing with high-dimensional data. These methods work by introducing a penalty on the size of the regression coefficients, effectively shrinking them towards zero. This can help to reduce overfitting and improve the generalizability of the model.
Q. What are the benefits of using shrinkage methods over subset selection methods?
Answer
Subset selection produces a model that is interpretable and possibly has lower prediction error than the full model. However, because it is a discrete process in which variables are either retained or discarded, it often exhibits high variance and so may not reduce the prediction error of the full model. Shrinkage methods are more continuous and do not suffer as much from high variability. Also, with shrinkage methods we can fit a single model containing all $p$ predictors, shrinking the coefficient estimates rather than discarding variables outright.
Q. Name some shrinkage methods?
Answer
- Ridge Regression(L2 Regularization)
- Lasso Regression(L1 Regularization)
- Elastic Net
Q. What's the main purpose of L1 and L2 regularization in linear regression?
Answer
The ordinary least squares (OLS) method suffers from the following issues:
- Poor prediction accuracy (overfitting)
  - If $n \approx p$, i.e. the number of observations ($n$) is not much larger than the number of predictors ($p$), there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor prediction on the test set.
  - If $p > n$, then there is no longer a unique least squares coefficient estimate and the variance is infinite, so the method cannot be used at all.
- Lack of model interpretability
  - It is often the case that some or many of the variables used in OLS are not associated with the response. They unnecessarily complicate the model and make it hard to interpret the output.

L1 (Lasso) and L2 (Ridge) regularization techniques help address these shortcomings of OLS. They shrink the regression coefficients to zero or nearly zero, which can help with feature selection. Effectively, they address the overfitting issue in OLS by reducing the variance of the model.
Q. How do we estimate coefficients in Ridge regression?
Answer
The ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_{j}^2 = RSS + \lambda\sum_{j=1}^{p}\beta_{j}^2$$

where $\lambda \geq 0$ is a tuning parameter, to be determined separately (e.g., via cross-validation).
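A minimal sketch of ridge regression with scikit-learn on hypothetical data; `alpha` plays the role of $\lambda$:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0)   # alpha corresponds to the penalty strength lambda
ridge.fit(X, y)
print(ridge.coef_)         # all coefficients are shrunk, but none are exactly zero
```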
Q. Explain the effect of the tuning parameter $\lambda$ in ridge regression.
Answer
It essentially balances the trade-off between fitting the data well (minimizing RSS) and keeping the model coefficients small (minimizing the penalty term).
When $\lambda = 0$, the penalty term has no effect and ridge regression produces the ordinary least squares estimates. As $\lambda \to \infty$, the impact of the shrinkage penalty grows and the ridge coefficient estimates approach zero.
Q. How can we determine the optimal value of $\lambda$?
Answer
Using cross-validation: compute the cross-validation error for a grid of $\lambda$ values and select the value that yields the smallest error.
Q. Suppose you fit a ordinary linear regression model over your data and you find it is under-fitting. Is it good idea to use Ridge or Lasso regression here?
Answer
No. Ridge and Lasso regression address the variance issue of the OLS technique, but here the model is suffering from high bias, so they cannot provide any help. We should instead use polynomial regression or some other more flexible model.
Q. How do L1 and L2 regularization affect the model's coefficients?
Answer
L1 and L2 regularization are shrinkage methods that shrink the coefficient estimates toward zero: L1 (lasso) can set some coefficients exactly to zero, while L2 (ridge) shrinks them toward zero without making them exactly zero.
Q. Can we use ridge regression for variable selection purpose?
Answer
No, we can't use it as a feature selection technique. The penalty term $\lambda\sum_{j=1}^{p}\beta_j^2$ shrinks all of the coefficients toward zero, but it does not set any of them exactly to zero (unless $\lambda = \infty$), so the final model always includes all $p$ predictors.
Q. Write the loss function involved in lasso regression.
Answer
The loss function minimized in lasso regression is:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_{j}| = RSS + \lambda\sum_{j=1}^{p}|\beta_{j}|$$
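A minimal lasso sketch with scikit-learn on hypothetical data, showing the sparsity discussed in the next questions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1)   # alpha plays the role of lambda
lasso.fit(X, y)
print(lasso.coef_)                                  # many coefficients exactly 0
print("Non-zero predictors:", np.flatnonzero(lasso.coef_))
```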
Q. Which regression technique leads to sparse models? Lasso or ridge regression?
Answer
The lasso regression method, since the L1 penalty can force some coefficients to be exactly zero, producing a sparse model.
Q. Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?
Answer
The lasso and ridge optimization problems can be re-written in their constrained forms:

Lasso: minimize RSS subject to $\sum_{j=1}^{p}|\beta_j| \leq s$

Ridge: minimize RSS subject to $\sum_{j=1}^{p}\beta_j^2 \leq s$

For $p = 2$, the lasso constraint region is a diamond $|\beta_1| + |\beta_2| \leq s$, while the ridge constraint region is a disk $\beta_1^2 + \beta_2^2 \leq s$.
The ellipses that are centered around the least squares estimate $\hat{\beta}$ represent contours of constant RSS; the regularized coefficient estimates are given by the first point at which such an ellipse touches the constraint region.
Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero. However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero.
Q. List some advantages of using lasso regression over ridge regression?
Answer
- It produces simpler and more interpretable model
- Can perform feature selection
Q. What are the hyper-parameters associated with L1 and L2 regularization?
Answer
The main hyper-parameter in both cases is the regularization strength $\lambda$ (called `alpha` in many software implementations), which controls how strongly the coefficients are penalized. Elastic Net additionally has a mixing parameter that balances the L1 and L2 penalties.
Q. When would you choose L1 regularization over L2, and vice versa?
Answer
L1 regularization performs better in a setting where a relatively small number of predictors have substantial coefficients and the remaining predictors have coefficients that are very small or near zero. L1 also performs variable selection by forcing some coefficient estimates to zero, and hence yields models that are easier to interpret.
L2 regularization is effective when the response is a function of many predictors, all with coefficients of roughly equal size; in that setting the sparse model produced by the L1 technique would discard predictors that actually carry signal.
Q. What is Elastic Net regularization, and how does it relate to L1 and L2 regularization?
Answer
Elastic Net combines the L1 and L2 penalties: its objective adds both $\lambda_1\sum_j|\beta_j|$ and $\lambda_2\sum_j\beta_j^2$ (often parameterized by an overall strength and a mixing ratio) to the RSS. It inherits the sparsity of the lasso while keeping the stability of ridge, which is useful when predictors are highly correlated or when $p$ is large.
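A brief sketch with scikit-learn's `ElasticNet` on hypothetical data; `l1_ratio` is the mixing parameter between the L1 and L2 penalties:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 10))
y = 2 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # 0.5 = equal mix of L1 and L2
enet.fit(X, y)
print(enet.coef_)
```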
Q. How do you choose the optimal regularization strength (alpha) for L1 and L2 regularization?
Answer
Cross-validation can be used to tune alpha:
- Choose a grid of $\alpha$ values and compute the cross-validation error for each value of $\alpha$.
- Select the $\alpha$ that yields the smallest cross-validation error.
- Refit the model with all the available variables and the selected $\alpha$ value.
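A sketch using scikit-learn's built-in cross-validated estimators, `RidgeCV` and `LassoCV`, which implement exactly this grid search (hypothetical data and grid):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(scale=0.5, size=200)

alphas = np.logspace(-3, 3, 50)               # grid of candidate alpha values

ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("Best ridge alpha:", ridge_cv.alpha_)
print("Best lasso alpha:", lasso_cv.alpha_)
```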
Q. What are the consequences of multi-collinearity?
Answer
Multicollinearity can pose problems in the regression context:
- It can be difficult to separate out the individual effects of collinear variables on the response.
- It reduces the accuracy of the estimates of the regression coefficients.
- It reduces the power of hypothesis tests, such as the probability of correctly detecting a non-zero coefficient.
Q. How can you detect collinearity in a regression model?
Answer
A simple way to detect collinearity is to look at the correlation matrix of the predictors. A large absolute value in that matrix indicates a pair of highly correlated variables.
Q. How can you detect multi-collinearity in a regression model?
Answer
We can detect multi-collinearity using variance inflation factor(VIF). VIF measures how much the variance of a regression coefficient is inflated due to multi-collinearity.
The VIF is the ratio of the variance of $\hat{\beta}_j$ when fitting the full model divided by the variance of $\hat{\beta}_j$ if fit on its own. It can be computed as:

$$VIF(\hat{\beta}_j) = \frac{1}{1 - R^{2}_{X_{j}|X_{-j}}}$$

where $R^{2}_{X_{j}|X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors.
The smallest possible value of VIF is 1, which indicates a complete absence of collinearity. In practice there is typically a small amount of collinearity among the predictors, so as a rule of thumb a VIF value exceeding 5 or 10 indicates a problematic amount of collinearity.
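A sketch of computing VIFs with statsmodels' `variance_inflation_factor` on hypothetical data; a constant column is added because the function expects the full design matrix:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(9)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)          # nearly collinear with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

design = sm.add_constant(X)                        # include the intercept column
vifs = {col: variance_inflation_factor(design.values, i)
        for i, col in enumerate(design.columns)}
print(vifs)   # x1 and x2 should show very large VIFs
```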
Q. How can you address multi-collinearity?
Answer
There are several methods to address multi-collinearity:
- Remove Highly Correlated Predictors
- Combine Correlated Predictors
- Principal Component Analysis (PCA)
- Use ridge or lasso regression, or a combination of both (elastic net regression), for modeling
Q. Can you have perfect multi-collinearity?
Answer
Yes. It may occur when one predictor variable in a regression model is an exact linear combination of one or more other predictor variables; in other words, the correlation between the variables is exactly 1 (or -1).
Q. How can we determine whether a dataset is high-dimensional or low-dimensional?
Answer
We can determine whether a dataset is high-dimensional or low-dimensional by comparing the number of features ($p$) with the number of observations ($n$):
- High dimensional: $p \gg n$
- Low dimensional: $n \gg p$
Q. What is the issue of using least squares regression in high dimensional setting?
Answer
When $p > n$ (or $p$ is close to $n$), least squares cannot produce a unique, reliable coefficient estimate: the fit is perfect on the training data regardless of whether the features are truly related to the response, the variance of the estimates is extremely high (or infinite), and the resulting model badly overfits and generalizes poorly to test data.
Q. What happens to the Train MSE and Test MSE in a linear regression model if we add features that are completely unrelated to the response?
Answer
As we add more irrelevant features to a linear regression model, the Train MSE will typically decrease because the model can fit the training data better with the additional flexibility. However, the Test MSE is likely to increase because including the additional predictors leads to a vast increase in the variance of the coefficient estimates (overfitting).
Q. What is curse of dimensionality?
Answer
The curse of dimensionality refers to the phenomenon where the test error typically increases as the number of features or predictors in a dataset grows, unless these additional features have a genuine relationship with the response variable. This occurs because as dimensionality rises, the data becomes sparser, making it harder for models to generalize well and increasing the risk of overfitting.
Q. If I add a feature to my linear regression model, how will it affect the Train MSE and Test MSE?
Answer
When you add a feature to your linear regression model, the Train MSE will generally decrease because the model becomes better at fitting the training data. However, the Test MSE may increase if the new feature does not contribute meaningful information and instead leads to overfitting. This happens because the additional feature can cause the model to capture noise rather than the underlying pattern, which harms its performance on new, unseen data.
Q. Can we use lasso or ridge regression in high dimensional setup?
Answer
Yes. They make linear regression less flexible in the high-dimensional setting and help prevent overfitting of the training data.
Q. [True/False] In the high-dimensional setting, the multicollinearity problem is extreme?
Answer
True. When $p > n$, any predictor can be written as a linear combination of the other predictors, so collinearity is unavoidable.
Q. Why should one be cautious when reporting errors and measures of model fit in high-dimensional settings?
Answer
In high-dimensional settings (where $p$ is close to or larger than $n$), it is easy to obtain a model with zero residuals on the training data, so training-set measures such as $R^2$, RSE, and p-values can look excellent even when the model is useless for prediction. One should therefore report performance on an independent test set or use cross-validation error, rather than training-set statistics, as evidence of model fit.