#[fit] Ai 1
#[fit] Regression
- What is $$x$$, $$f$$, $$y$$, and that damned hat?
- The simplest models and evaluating them
- Frequentist Statistics
- Noise and Sampling
- Bootstrap
- Prediction is more uncertain than the mean
- how many dollars will you spend?
- what is your creditworthiness?
- how many people will vote for Bernie $$t$$ days before the election?
- used to predict probabilities for classification
- causal modeling in econometrics
##[fit] 1. What are
##[fit] damned hats
We will assume that the measured response variable $$y$$ is generated as $$y = f(x) + \epsilon$$: the value of a true function $$f$$ at $$x$$, plus noise.
Here, $$\epsilon$$ is the noise: measurement error or irreducible error.
In real life we never know the true generating model $$f$$.
A statistical model is any algorithm that estimates $$f$$ from data. We call this estimate $$\hat{f}$$.
There are two reasons why this estimate will differ from the truth:
- We have no idea about the true generating process. So the function we use (we'll sometimes call this $$g$$) may have no relation to the real function that generated the data.
- We only have access to a sample, with reality denying us access to the population.
We shall use the notation $$\hat{f}$$ (that damned hat) for any such estimate made from a sample.
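A minimal sketch of this origin story; the "true" $$f$$, the noise level, and the sample size below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):                                  # the "God given" function, unknown in real life
    return 1.0 + 2.0 * x

x = rng.uniform(0, 5, size=30)             # a finite sample of covariates
y = f(x) + rng.normal(0, 0.7, size=30)     # measured responses: f(x) plus noise epsilon

# Our estimate f-hat: a straight line fit to the one sample we happen to have
slope_hat, intercept_hat = np.polyfit(x, y, 1)
print(intercept_hat, slope_hat)            # near (1.0, 2.0), but not equal to them
```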
##[fit] 2. The Simplest Models
##[fit] Fitting and evaluating them
How? Use the Mean Squared Error: $$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{f}(x_i)\right)^2$$
Minimize this with respect to the parameters (here, the intercept and slope).
Gradient descent: basically, go in the direction opposite to the derivative.
Consider the objective function: $$ J(x) = x^2-6x+5 $$
gradient = fprime(old_x)   # derivative of J at the current point
move = gradient * step     # scale by the learning rate (step)
current_x = old_x - move   # move opposite the gradient
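A runnable sketch of the full descent loop for this $$J(x)$$; the starting point, step size, and stopping tolerance are illustrative choices, not values from the slides:

```python
# Gradient descent on J(x) = x^2 - 6x + 5; the true minimum is at x = 3, J(3) = -4.
def J(x):
    return x**2 - 6*x + 5

def fprime(x):
    return 2*x - 6           # derivative of J

step = 0.1                   # learning rate (illustrative)
old_x, current_x = 0.0, 5.0  # arbitrary starting point, away from the minimum
while abs(current_x - old_x) > 1e-8:   # stop when the updates become tiny
    old_x = current_x
    gradient = fprime(old_x)
    current_x = old_x - gradient * step

print(current_x, J(current_x))   # approaches x = 3.0, J = -4.0
```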
CONVEX: $$J(x)$$ has a single global minimum, so the descent cannot get stuck in a local minimum.
The value of the MSE depends on the units of $$y$$.
So eliminate the dependence on the units of $$y$$ by comparing against the spread of $$y$$ itself: $$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{f}(x_i)\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}$$
- If our model is only as good as the mean value $$\bar{y}$$, then $$R^2 = 0$$.
- If our model is perfect, then $$R^2 = 1$$.
- $$R^2$$ can be negative if the model is worse than the average. This can happen when we evaluate the model on the test set (checked numerically below).
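A quick numerical check of these three facts; the arrays below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])

print(r2_score(y_true, np.full(4, y_true.mean())))       # predicting the mean: R^2 = 0
print(r2_score(y_true, y_true))                           # perfect predictions: R^2 = 1
print(r2_score(y_true, np.array([9.0, 3.0, 9.0, 3.0])))   # worse than the mean: negative R^2
```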
##[fit] 3. Frequentist Statistics
What is Data?
"data is a sample from an existing population"
- data is stochastic, variable
- model the sample. The model may have parameters
- find parameters for our sample. The parameters are considered FIXED.
How likely is it to observe the values $$y_1, y_2, \ldots, y_N$$ given the parameters of our model?
In other words: how likely are the observations if the model is true?
We can then write the likelihood (assuming independent observations, each drawn from a gaussian about the line): $$\mathcal{L}(a, b, \sigma) = \prod_{i=1}^{N} p(y_i \vert x_i) = \prod_{i=1}^{N} N(y_i;\, a + b x_i,\, \sigma^2)$$
The log likelihood turns this product into a sum: $$\ell = \sum_{i=1}^{N} \log p(y_i \vert x_i) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - a - b x_i\right)^2$$
where we stack the rows to get the matrix form $$\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{\epsilon}$$, one observation per row.
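A small numerical sketch of this likelihood for a line fit; the data are made up, and scipy's gaussian density is just one way to evaluate the terms:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)    # made-up data

def log_likelihood(a, b, sigma):
    # sum of log gaussian densities of each y_i about the line a + b*x_i
    return norm.logpdf(y, loc=a + b * x, scale=sigma).sum()

# Least squares gives the MLE of the slope and intercept for this gaussian likelihood
b_mle, a_mle = np.polyfit(x, y, 1)
print(log_likelihood(a_mle, b_mle, 0.5))          # larger than for any nearby line...
print(log_likelihood(a_mle + 0.5, b_mle, 0.5))    # ...such as one with a shifted intercept
```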
`Dem_Perc(t) ~ Dem_Perc(t-2) + I`
- done in statsmodels (a sketch of such a fit follows)
- From Gelman and Hwang
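A minimal sketch of such a fit with statsmodels' formula API. The DataFrame, its column names (a lagged vote share and an incumbency indicator), and the numbers are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical election data: Democratic vote share, the share two cycles earlier,
# and an incumbency indicator I (here named Incumbent).
polls = pd.DataFrame({
    "Dem_Perc":      [44.7, 50.3, 51.1, 48.2, 52.0, 46.6],
    "Dem_Perc_lag2": [46.0, 48.5, 50.0, 49.1, 50.8, 47.3],
    "Incumbent":     [1, 0, 1, 0, 1, 0],
})

fit = smf.ols("Dem_Perc ~ Dem_Perc_lag2 + Incumbent", data=polls).fit()
print(fit.params)   # intercept, coefficient on the lagged share, incumbency effect
print(fit.bse)      # standard errors, which come from the sampling ideas below
```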
##[fit] 4. Noise and Sampling
- lack of knowledge of the true generating process
- sampling
- measurement error or irreducible error $$\epsilon$$
- lack of knowledge of $$x$$
We will first address the variability that comes from sampling.
Predictions made with the "true" function $$f$$ give us the mean value $$f(x)$$ at each $$x$$.
Because of the noise $$\epsilon$$, measured values of $$y$$ scatter about this mean.
But in real life we only measure one sample, and never see the population.
Now, imagine that God gives you some M data sets drawn from the population. This is a hallucination, a magic realism...
...and you can now find the regression on each such dataset (because you fit for the slope and intercept). So we'd have M estimates of the slope and intercept.
As we let $$M \to \infty$$, the histograms of these M estimates become the sampling distributions of the slope and the intercept.
We can use these sampling distributions to get confidence intervals on the parameters.
The standard deviation of these distributions is called the standard error.
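A simulation sketch of this magic realism; the population (intercept 1, slope 2, noise scale 0.5) and M are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
M, N = 1000, 30
slopes = np.empty(M)
for m in range(M):
    x = rng.uniform(0, 5, size=N)                      # a fresh dataset from the population
    y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=N)
    slopes[m], _ = np.polyfit(x, y, 1)                 # fit a line, keep the slope estimate

# The histogram of `slopes` is the sampling distribution of the slope;
# its standard deviation is the standard error.
print(slopes.mean(), slopes.std())
```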
Here is an example:
But we don't have M samples. What to do?
##[fit] 5. Bootstrap
- If we knew the true parameters of the population, we could generate M fake datasets.
- we don't, so we use our existing data to generate the datasets
- this is called the Non-Parametric Bootstrap
Sample with replacement the data points $$(x_i, y_i)$$ from our existing dataset to build each fake dataset.
- We create fake datasets by resampling with replacement.
- Sampling with replacement allows more representative points in the data to show up more often
- we fit each such dataset with a line
- and do this M times where M is large
- these many regressions induce sampling distributions on the parameters
- Each line is a rendering of $$\mu = a+bx$$, the mean value of the MLE Gaussian at each point $$x$$
- Thus the sampling distributions on the slope and intercept induce a sampling distribution on the lines
- And then the estimated $$\hat{f}$$ is taken to be the line with the mean parameters (this procedure is sketched in code below)
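A runnable sketch of this non-parametric bootstrap for a straight-line fit; the observed sample and M are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=x.size)    # our single "observed" sample

M = 1000                                               # number of bootstrap replications
slopes, intercepts = [], []
for _ in range(M):
    idx = rng.integers(0, x.size, size=x.size)         # resample (x, y) pairs with replacement
    b, a = np.polyfit(x[idx], y[idx], 1)               # slope and intercept of this fake dataset
    slopes.append(b)
    intercepts.append(a)

# Standard errors = standard deviations of the bootstrap sampling distributions;
# the mean parameters give the estimated line f-hat.
print("slope:    ", np.mean(slopes), "+/-", np.std(slopes))
print("intercept:", np.mean(intercepts), "+/-", np.std(intercepts))
```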
Which parameters are important?
You don't want the parameters in a regression to be 0.
So, in a sense, you want parameters to have their sampling distributions as far away from 0 as possible.
Is this enough? It's certainly evocative.
But we must consider the "Null Hypothesis": a given parameter has no effect. We can test this by re-permuting just that column of the data and refitting, which destroys any relationship between that covariate and the response.
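A sketch of this permutation idea for a one-covariate regression; the data and the number of permutations are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=40)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=40)   # made-up data with a real slope

observed_slope, _ = np.polyfit(x, y, 1)

# Null hypothesis: x has no effect. Permuting x destroys any x-y relationship.
null_slopes = []
for _ in range(2000):
    slope, _ = np.polyfit(rng.permutation(x), y, 1)
    null_slopes.append(slope)

# Fraction of null slopes at least as extreme as the observed one (a p-value)
p = np.mean(np.abs(null_slopes) >= abs(observed_slope))
print(observed_slope, p)
```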
##[fit] 6. Prediction is more uncertain .. ##[fit] than the mean
- In machine learning we do not care too much about the functional form of our prediction $$\hat{y} = \hat{f}(x)$$, as long as we predict "well"
- Remember however our origin story for the data: the measured $$y$$ is assumed to have been a draw from a gaussian distribution at each $$x$$: this means that our prediction at an as yet unmeasured $$x$$ should also be a draw from such a gaussian
- Still, we use the mean value of the gaussian as the value of the "prediction", but note that we can have many "predicted" data sets, all consistent with the original data we have
- The band on the previous graph is the sampling distribution of the regression line, or a representation of the sampling distribution of the $$\mathbf{w}$$.
- $$p(y \vert \mathbf{x}, \mu_{MLE}, \sigma^2_{MLE})$$ is a probability distribution
- Thought of as $$p(y^{*} \vert \mathbf{x}^{*}, \{\mathbf{x}_i, y_i\}, \mu_{MLE}, \sigma^2_{MLE})$$, it is a predictive distribution for as yet unseen data $$y^{*}$$ at $$\mathbf{x}^{*}$$, or the sampling distribution for data, or the data-generating distribution, at the new covariates $$\mathbf{x}^{*}$$. This is a wider band (a numerical sketch follows).
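A numerical sketch of why the predictive band is wider than the band on the mean, combining bootstrap lines with gaussian draws; all the numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 5, size=40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.8, size=40)      # made-up training data

x_star = 2.5                                          # a new, as-yet unmeasured point
means, draws = [], []
for _ in range(2000):
    idx = rng.integers(0, x.size, size=x.size)        # bootstrap resample
    b, a = np.polyfit(x[idx], y[idx], 1)
    mu = a + b * x_star                               # mean of the gaussian at x_star
    sigma = (y[idx] - (a + b * x[idx])).std()         # noise scale estimated on this fit
    means.append(mu)
    draws.append(rng.normal(mu, sigma))               # a draw of y* from that gaussian

print("spread of the mean line at x*:", np.std(means))
print("spread of predicted y* at x*: ", np.std(draws))   # the wider band
```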
- We see only one sample, and never know the "God Given" model. Thus we always make Hat estimates
- The simplest models are the flat line and the sloped line. To fit these we use linear algebra or gradient descent.
- We use the $$R^2$$ to evaluate the models.
- Maximum Likelihood estimation or minimum loss estimation is used to find the best fit model
- Gradient Descent is one way to minimize the loss
- Noise in regression models comes from model misspecification, measurement noise, and sampling
- sampling can be used to replicate most of these noise sources, and thus we can use a "magic realism" to study the impacts of noise
- in the absence of real samples we can construct fake ones using the bootstrap
- these samples can be used to understand the sampling variance of parameters and thus regression lines
- predictions are even more variable, since our likelihood indicates a generative process involving sampling from a gaussian
- We'll look at more complex models, such as local regression using K-nearest-neighbors, and Polynomial Regression
- We'll use Polynomial Regression to understand the concepts of model complexity and overfitting
- We'll also see the use of validation sets to figure out the value of hyperparameters
- We'll learn how to fit models using `sklearn` (previewed below).
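As a preview, a minimal sklearn fit of the straight-line model from this lecture; the data are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50).reshape(-1, 1)    # sklearn expects a 2-D feature matrix
y = 1.0 + 2.0 * x.ravel() + rng.normal(0, 0.5, size=50)

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)             # hat estimates of intercept and slope
print(model.score(x, y))                         # R^2 on the data used for fitting
```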