- Writer: Minho Park
- Title: (cs229) Lecture 3 : Weighted Least Squares. Logistic regression. Newton's Method
- Link: http://cs229.stanford.edu/notes2020fall/notes2020fall/cs229-notes1.pdf
- Keywords: Parametric / Non-parametric Learning, Locally Weighted Regression, Probability Interpretation, Logistic Regression, Newton's Method
- Parametric Learning: Fit a fixed set of parameters ($\theta_i$) to data ($\mathcal{D}$).
- Non-parametric Learning: The amount of data/parameters you need to keep grows (linearly) with the size of the data. (This definition seems too biased toward Locally Weighted Regression.)
- Non-parametric Learning: Machine Learning algorithms other than Parametric Learning algorithms. (In our study!)
- Parametric Learning: Suppose the model can fully describe $\mathcal{D}$ as $\theta$.
- Non-parametric Learning: No, it can't.
  Therefore, in the parametric case, we can erase the training data once $\theta$ has been fitted.
  However, Non-parametric Learning uses (part of) the training data itself at prediction time, so it has to be kept.
- Motivation (Why?): The underfitting / overfitting problem (e.g. Polynomial Regression).
- Locally Weighted Linear Regression Algorithm:
  - Fit $\theta$ to minimize $\sum_i w^{(i)}(y^{(i)} - \theta^T x^{(i)})^2$, where $w^{(i)} = \exp \left( - {(x^{(i)} - x)^2 \over 2\tau^2}\right)$. (Weighted Least Squares)
  - Output $\theta^T x$.
- Overfitting / Underfitting:
  - If $\tau$ (the bandwidth: the distance at which $w^{(i)} = e^{-1/2}$) is too small, the model is likely to overfit.
  - Conversely, if $\tau$ is too big, the model is likely to underfit.
  - $\tau \rightarrow \infty$: recovers the original Linear Regression.
- There are many variants of LWR.
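Below is a minimal NumPy sketch of LWR (my own illustration, not code from the notes): the helper name `lwr_predict`, the toy data, and the choice to solve the weighted least-squares fit via the weighted normal equations $(X^T W X)\theta = X^T W y$ are all assumptions made for the example.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    """Predict at a single query point with locally weighted linear regression.

    X: (m, n) design matrix (intercept column included), y: (m,) targets.
    Fits theta by minimizing sum_i w_i (y_i - theta^T x_i)^2, here via the
    weighted normal equations (X^T W X) theta = X^T W y, then outputs theta^T x_query.
    """
    # Gaussian weights: points near the query get weight ~1, far points ~0.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    XtW = X.T * w                      # same as X.T @ diag(w)
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return x_query @ theta

# Toy usage: noisy sine data that a single global line cannot fit well.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
X = np.column_stack([np.ones_like(x), x])          # add intercept term
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
query = np.array([1.0, 5.0])                       # [intercept, x = 5]
print(lwr_predict(X, y, query, tau=0.5))           # close to sin(5) ~ -0.96
```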
Why Least Squares?
- There is an error term which is additive: $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$.
- The $\epsilon^{(i)}$ are distributed i.i.d. (Independently and Identically Distributed).
- The $\epsilon^{(i)}$ are distributed according to a Gaussian distribution: $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$.
  I.e. the density of $\epsilon^{(i)}$ is $p(\epsilon^{(i)}) = {1 \over \sqrt{2\pi}\sigma} \exp\left(-{(\epsilon^{(i)})^2 \over 2\sigma^2}\right)$.
  This implies that $p(y^{(i)} | x^{(i)};\theta) = {1 \over \sqrt{2\pi}\sigma} \exp\left(-{(y^{(i)} - \theta^T x^{(i)})^2 \over 2\sigma^2}\right)$, i.e. $y^{(i)} \,|\, x^{(i)};\theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$.
  It looks like an AWGN Channel.
- Frequentist Statistics
  $p(y^{(i)} | x^{(i)};\theta)$: Probability of $y^{(i)}$ given $x^{(i)}$ and parameterized by $\theta$.
  This notation implies the parameters are deterministic. ($\theta$ is not a Random Variable.)
- Bayesian Statistics
  $p(y^{(i)} | x^{(i)},\theta)$: Probability of $y^{(i)}$ given $x^{(i)}$ and $\theta$.
  This notation implies the parameters are stochastic. ($\theta$ is a Random Variable.)
- Definition of the Likelihood: the distribution of $\vec{y}$ (the $y^{(i)}$'s), given $X$ (the matrix of $x^{(i)}$'s) and $\theta$.
  I.e. $L(\theta) = L(\theta;X,\vec{y}) = p(\vec{y}|X;\theta)$.
- Likelihood vs. Probability:
  - Likelihood of Parameters: a function of $\theta$, with $X, \vec{y}$ held fixed.
  - Probability of Data: a probability (function) of $\vec{y} \mid X$, with $\theta$ held fixed.
- Maximum Likelihood: Choose $\theta$ to maximize $L(\theta)$. (Make the data as high probability as possible.)
  - MAP vs. MLE (will be added)
- Log Likelihood $l(\theta) = \log L(\theta)$: the derivations become a bit simpler.
  Under the Gaussian noise model above, $l(\theta) = m \log {1 \over \sqrt{2\pi}\sigma} - {1 \over \sigma^2} \cdot {1 \over 2} \sum_{i=1}^m (y^{(i)}-\theta^Tx^{(i)})^2$, so maximizing the log likelihood can be written as
$$ \begin{aligned} \operatorname*{argmax}_\theta l(\theta) &= \operatorname*{argmin}_\theta {1 \over 2} \sum_{i=1}^m (y^{(i)}-\theta^Tx^{(i)})^2 \\ &= \operatorname*{argmin}_\theta J(\theta) \quad \text{(Least-squares cost function)} \end{aligned} $$
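As a quick numerical sanity check of this equivalence (my own illustration, not part of the notes), the snippet below draws data from $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$ with Gaussian noise, fits $\theta$ by least squares via the normal equations, and confirms that this $\hat\theta$ also gives a higher Gaussian log likelihood than a perturbed $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 200, 0.5
theta_true = np.array([1.0, 2.0])

X = np.column_stack([np.ones(m), rng.uniform(-3.0, 3.0, m)])  # intercept + one feature
y = X @ theta_true + sigma * rng.standard_normal(m)           # y = theta^T x + Gaussian noise

# Least-squares fit via the normal equations (minimizes J(theta)).
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def log_likelihood(theta):
    """Gaussian log likelihood l(theta) = sum_i log N(y_i; theta^T x_i, sigma^2)."""
    r = y - X @ theta
    return -m * np.log(np.sqrt(2.0 * np.pi) * sigma) - (r @ r) / (2.0 * sigma ** 2)

print(theta_hat)                                          # close to theta_true
print(log_likelihood(theta_hat))                          # maximal over theta
print(log_likelihood(theta_hat + np.array([0.1, -0.1])))  # strictly smaller
```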
Logistic Regression
- Motivation (Why?): To design a predictive model for categorical values.
- Classification Problem: When we apply Linear Regression to a binary classification problem, the model does not represent the data well.
  Simply put, we want $h_\theta(x) \in [0, 1]$.
  There is a function whose range is $(0, 1)$: the logistic (sigmoid) function $g(z) = {1 \over 1 + e^{-z}}$.
  (Its graph is the familiar S-shaped curve; a small numeric sketch of it follows below.)
  Then, $h_\theta(x) = g(\theta^T x) = {1 \over 1 + e^{-\theta^T x}}$, which is what we want.
  There are lots of other functions that smoothly increase from 0 to 1 that could also be used, but for a couple of reasons (GLMs, generative learning algorithms), the choice of the logistic function is a fairly natural one.
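A small numeric sketch of the logistic function (illustrative code, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}); values lie strictly in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# g(z) -> 0 as z -> -inf, g(0) = 0.5, g(z) -> 1 as z -> +inf, monotonically increasing.
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
```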
- Assumption: $y \in \{0, 1\}$.
- Likelihood function: $$ \begin{aligned} P(y=1|x;\theta) &= h_\theta (x) \\ P(y=0|x;\theta) &= 1 - h_\theta (x) \end{aligned} $$
  This can be written more compactly as $p(y|x;\theta) = (h_\theta (x))^y (1 - h_\theta (x))^{1-y}$.
- Maximum Likelihood Estimation.
  - Use the Gradient Ascent Rule: $\theta := \theta + \alpha \nabla_\theta l(\theta)$.
  - Note that we already know the equations below.
$$ \begin{aligned} l^{(i)}(\theta) &:= y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_\theta(x^{(i)})) \\ h_\theta(x^{(i)}) &:= g(\theta^T x^{(i)}) = \sigma(\theta^T x^{(i)}) = 1/{(1+e^{-\theta^T x^{(i)}})} \\ \sigma'(z) &= \sigma(z)(1-\sigma(z)) \end{aligned} $$
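The last identity, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, is easy to verify numerically (my own quick check, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)  # central finite difference
analytic = sigmoid(z) * (1.0 - sigmoid(z))                     # sigma'(z) = sigma(z)(1 - sigma(z))
print(numeric, analytic)  # both ~0.2217, agreeing to many decimal places
```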
- Then, differentiating $l^{(i)}(\theta)$:
$$ \begin{aligned} {\partial \over \partial\theta_j} l^{(i)}(\theta) &= \left( {y^{(i)} \over h_\theta(x^{(i)})} - {(1-y^{(i)}) \over (1-h_\theta(x^{(i)}))} \right) {\partial \over \partial\theta_j} h_\theta(x^{(i)}) \\ &= \left( {y^{(i)} \over \sigma(\theta^T x^{(i)})} - {(1-y^{(i)}) \over (1-\sigma(\theta^T x^{(i)}))} \right) {\partial \over \partial\theta_j} \sigma(\theta^T x^{(i)}) \\ &= \left( {y^{(i)} \over \sigma(\theta^T x^{(i)})} - {(1-y^{(i)}) \over (1-\sigma(\theta^T x^{(i)}))} \right) \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) {\partial (\theta^T x^{(i)}) \over \partial\theta_j} \\ &= \left(y^{(i)}(1-\sigma(\theta^T x^{(i)})) - (1-y^{(i)})\sigma(\theta^T x^{(i)})\right)x^{(i)}_j \\ &= (y^{(i)} - \sigma(\theta^T x^{(i)})) x^{(i)}_j \\ &= (y^{(i)} - h_\theta(x^{(i)})) x^{(i)}_j \end{aligned} $$
- Therefore, the result is similar to the result of Linear Regression (is this just luck?), but it is not the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear function of $\theta^T x^{(i)}$.
$$ \begin{cases} \theta_j := \theta_j + \alpha (y^{(i)} - h_\theta(x^{(i)}))x^{(i)}_j & \text{Stochastic} \\ \theta_j := \theta_j + \alpha \sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}))x^{(i)}_j & \text{Batch} \end{cases} $$
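Below is a vectorized sketch of these update rules (an illustration with my own naming, toy data, and hyperparameters, not code from the notes); the batch rule is used in the loop, and the stochastic rule would simply use one randomly chosen example per step instead of the sum:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient ascent on l(theta):
    theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij  (vectorized)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)            # h_theta(x^{(i)}) for every example
        theta += alpha * (X.T @ (y - h))  # batch update; fixed alpha, no stopping rule
    return theta

# Toy (non-separable) 1-D problem with an intercept column.
X = np.column_stack([np.ones(6), np.array([-3.0, -1.5, -0.5, 0.5, 1.5, 3.0])])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = logistic_gradient_ascent(X, y)
print(theta)                        # positive slope, intercept near zero
print(sigmoid(X @ theta).round(2))  # predicted P(y=1 | x) for each example
```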
Newton's Method
- What is this?: Suppose we have some function $f : \mathbb{R} \to \mathbb{R}$ and we want to find a value of $\theta$ so that $f(\theta) = 0$.
- Algorithm: $$ \theta := \theta - {f(\theta) \over f'(\theta)} $$
- We have to find the $\theta$ that maximizes $l(\theta)$.
  The maxima of $l$ are points where its first derivative vanishes, i.e. $l'(\theta) = 0$.
  Let $f(\theta) = l'(\theta)$; applying the same update gives $$ \theta := \theta - {l'(\theta) \over l''(\theta)} $$
  This method approximates our function by a second-order (quadratic) function and jumps to the local minimum/maximum of that quadratic. However, our function is usually not quadratic, so we have to iterate this algorithm.
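As a tiny one-dimensional illustration (my own example function, not from the lecture), the snippet below maximizes $l(\theta) = \log\theta - \theta$, whose maximum is at $\theta = 1$:

```python
# l(theta) = log(theta) - theta, so l'(theta) = 1/theta - 1 and l''(theta) = -1/theta**2.
theta = 0.5
for step in range(5):
    grad = 1.0 / theta - 1.0
    hess = -1.0 / theta ** 2
    theta -= grad / hess   # Newton update: theta := theta - l'(theta) / l''(theta)
    print(step, theta)     # 0.75, 0.9375, 0.996..., 0.99998..., ~1.0 (quadratic convergence)
```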
- Hessian: an $(n + 1) \times (n + 1)$ matrix (including the intercept term). $$ H_{ij} = {{\partial^2 l(\theta)} \over {\partial\theta_i\partial\theta_j}} $$
  In the multidimensional case the update becomes $\theta := \theta - H^{-1} \nabla_\theta l(\theta)$.
- Graphically: each Newton step fits the tangent line to $f$ at the current $\theta$ and jumps to the point where that tangent crosses zero.
- Pros: Newton's Method needs far fewer iterations to converge (quadratic convergence).
- Cons: Each iteration requires computing $H^{-1}$ (or solving a linear system with $H$), which is a lot of computation if $H$ is big, i.e. when the number of features is large.
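Putting the pieces together for logistic regression: the gradient is $\nabla_\theta l(\theta) = X^T(\vec{y} - \vec{h})$ (the vectorized form of the derivative above), and one can check that the Hessian is $H = -X^T S X$ with $S = \operatorname{diag}\big(h_\theta(x^{(i)})(1 - h_\theta(x^{(i)}))\big)$. The sketch below is my own minimal illustration (the names and toy data are assumptions, and it solves a linear system rather than forming $H^{-1}$ explicitly):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iters=10):
    """Newton's method for maximizing the logistic log likelihood.

    gradient: X^T (y - h)
    Hessian:  H = -X^T diag(h * (1 - h)) X
    update:   theta := theta - H^{-1} gradient  (computed via a linear solve)
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        H = -(X.T * (h * (1.0 - h))) @ X   # -X^T S X without building diag(S)
        theta -= np.linalg.solve(H, grad)
    return theta

# Same toy (non-separable) data as before; Newton converges in a handful of iterations.
X = np.column_stack([np.ones(6), np.array([-3.0, -1.5, -0.5, 0.5, 1.5, 3.0])])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
print(logistic_newton(X, y))
```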