This markdown file contains my notes on Chapter 3 of the book Neural Networks and Deep Learning. This chapter introduces several techniques to improve network learning, including:
- Cross Entropy cost function
- Softmax on the output layer
- Four regularisation methods:
  - L1 / L2 regularisation
  - dropout
  - artificial expansion of the training data
- Weight Initialisation
- Heuristics for tuning Hyperparameters
- Variations of Gradient Descent
- Other activation functions: tanh, ReLU
Motivation: As seen in Chap 2, a neuron learns very slowly if the activation function is saturated (i.e. $\sigma'(z) \approx 0$, which happens when the output is close to 0 or 1), because the gradient of the quadratic cost contains a $\sigma'(z)$ factor.
Cross-Entropy loss resolves this by removing the dependency on the gradient of the sigmoid activation function. The CE loss function is:
$C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln (1 - a) \right]$
To find the gradient with respect to a weight:
$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j (\sigma(z) - y)$
i.e. the gradient is proportional to the error. This makes much more sense, because now the network will learn quicker when the error is larger (which is much more human-like as well). (Note: vector-matrix derivative and Kronecker product, though the last line is dubious.)
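The slowdown and its fix can be checked numerically. A minimal toy sketch (my own example, not from the book) comparing the two gradients for a single saturated sigmoid neuron:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single sigmoid neuron, one input x = 1, target y = 0.
# Gradients w.r.t. the weight for the two cost functions:
#   quadratic:      dC/dw = (a - y) * sigmoid'(z) * x
#   cross-entropy:  dC/dw = (a - y) * x
def grad_quadratic(w, b, x=1.0, y=0.0):
    z = w * x + b
    a = sigmoid(z)
    return (a - y) * a * (1 - a) * x  # sigmoid'(z) = a(1 - a)

def grad_cross_entropy(w, b, x=1.0, y=0.0):
    z = w * x + b
    a = sigmoid(z)
    return (a - y) * x

# A badly-initialised, saturated neuron: z = 10, so a ~ 1 while y = 0.
print(grad_quadratic(10.0, 0.0))      # tiny (~5e-5): learning crawls
print(grad_cross_entropy(10.0, 0.0))  # ~1: learns fast despite saturation
```

With the quadratic cost, the $\sigma'(z)$ factor crushes the gradient exactly when the neuron is most wrong; the cross-entropy gradient stays proportional to the error.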
The right form of cross-entropy
Consider the incorrect form:
Regression problems
In regression problems, we can use the quadratic cost with linear neurons in the output layer (i.e. $a_j^L = z_j^L$, with no sigmoid applied). Then we have:
$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j (a - y)$
This resolves the saturation issue in the final layer, however the issue still remains in the previous layers.
Working backwards, we can derive the CE loss by supposing that we want to satisfy this differential equation for the bias:
$\frac{\partial C}{\partial b} = a - y$
Another factor that may inhibit learning is the presence of the $x_j$ term in Equation (61). Because of this term, when an input $x_j$ is near to zero, the corresponding weight $w_j$ will learn slowly.
- In the final layer of a multi-layer NN, $x_j = a_j^{L-1}$ (i.e. the activation of the neurons in the previous layer).
- Equation 61 refers to: $\frac{\partial C}{\partial w^L} = \delta^L {a^{L-1}}^T$
It is not possible to eliminate this term through a clever choice of cost function, because of the weighted input equation:
$z = \sum_j w_j x_j + b$
Instead of using sigmoid as the activation function in the output layer, we can use softmax:
$a_j^L = \frac{\exp(z_j^L)}{\sum_k \exp(z_k^L)}$
- This can be interpreted as a probability distribution because it obeys the two laws:
  - $a_j^L > 0$ because $\exp(z) > 0$
  - $\sum_j a_j^L = \sum_j \frac{\exp(z_j^L)}{\sum_k \exp(z_k^L)} = \frac{\sum_j \exp(z_j^L)}{\sum_k \exp(z_k^L)} = 1$
Note: It is obvious that the output of a sigmoid layer will not form a probability distribution, because the outputs aren't rescaled.
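A minimal softmax sketch (my own; it uses the standard max-subtraction trick for numerical stability, which the book doesn't discuss but which leaves the output unchanged):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting max(z) cancels in the
    ratio but prevents overflow in exp for large weighted inputs."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

a = softmax([1.0, 2.0, 3.0])
print(a, a.sum())  # three positive entries summing to 1
```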
The associated cost function is the log-likelihood function:
$C = -\ln a_y^L$
where $y$ is the index of the correct class.
Monotonicity of softmax
Consider the derivative of the softmax function wrt an input:
$\frac{\partial a_j^L}{\partial z_k^L} = a_j^L (\delta_{jk} - a_k^L)$
Hence, increasing $z_j^L$ increases $a_j^L$ (positive derivative when $j = k$) and decreases every other output activation (negative derivative when $j \neq k$).
Non-locality of softmax
A consequence of the denominator of the softmax function is that each output activation depends on all the weighted inputs (unlike the sigmoid, where each output depends only on its own weighted input).
Inverting the softmax layer
To find the weighted input, take logs of the softmax equation:
$z_j^L = \ln a_j^L + C$
where $C$ is a constant independent of $j$ (the log of the normalising denominator).
Avoiding learning slowdown
Consider the derivative of the cost function wrt the weights and biases:
$\frac{\partial C}{\partial w_{jk}^L} = a_k^{L-1} (a_j^L - y_j), \qquad \frac{\partial C}{\partial b_j^L} = a_j^L - y_j$
As with the cross-entropy, there is no $\sigma'$ factor, so learning doesn't slow down when the output neurons saturate.
Where does the "softmax" name come from?
Consider the softmax with a sharpening constant $c$:
$a_j^L = \frac{\exp(c z_j^L)}{\sum_k \exp(c z_k^L)}$
In the limit $c \to \infty$, this is equivalent to the argmax function (if there are no ties among the $z_j^L$), hence softmax is a "softened" maximum.
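A small sketch (my own) showing that softmax of $c \cdot z$ approaches a one-hot argmax vector as the constant $c$ grows:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 3.0, 2.0])
for c in (1, 10, 100):
    print(c, softmax(c * z))
# As c grows, the output concentrates on index 1 -- the argmax of z.
```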
Backpropagation with softmax and the log-likelihood cost
To find an expression for the output error:
$\delta_j^L = \frac{\partial C}{\partial z_j^L} = a_j^L - y_j$
Slightly confused by the textbook.
- early stopping is used to stop training when the classification accuracy on the validation set plateaus.
- the hold-out method is when a validation set is used to tune the hyperparameters of the model.
- Use more training data to prevent overfitting
The L2-regularized cost adds a penalty on the weights:
$C = C_0 + \frac{\lambda}{2n} \sum_w w^2$
where $C_0$ is the original, unregularized cost, $\lambda > 0$ is the regularization parameter, and $n$ is the size of the training set.
Hence, on each update, the weight is rescaled to be smaller (by the factor $1 - \frac{\eta \lambda}{n}$), unless this is a detriment to the unregularized cost function. Note in SGD, the weight decay has the same factor of $1 - \frac{\eta \lambda}{n}$, with the gradient term estimated over a mini-batch of size $m$.
- Note that the update equation for the biases doesn't change, because the regularization term doesn't include the biases! Whether or not to include biases is dependent on the network. Generally it doesn't affect the accuracy so much; however, large biases can lead to saturation, which may be desirable in some cases.
- Note that when the number of training examples increases, the regularization parameter must be increased as well to keep $\frac{\eta \lambda}{n}$ the same.
- Without regularization, SGD can easily get stuck in local minima. Intuitively, this is because when the weight vectors are large, small changes to them will still point them in a similar direction to before, i.e. not all directions in the cost landscape will be explored. Regularization resolves this by keeping the weight vectors small, so that small changes cause larger changes in direction, and SGD gets stuck in local minima less often.
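One regularized SGD update can be sketched as follows (a toy sketch of the weight-decay step; the default constants are illustrative, not canonical):

```python
def l2_sgd_step(w, grad_C0, eta=0.5, lam=5.0, n=50000):
    """One SGD weight update with L2 regularization: the weight is
    first rescaled by (1 - eta*lam/n) ("weight decay"), then takes the
    usual gradient step on the unregularized cost C0 (here grad_C0 is
    the mini-batch estimate of dC0/dw)."""
    return (1 - eta * lam / n) * w - eta * grad_C0

# With zero gradient the weight simply decays geometrically:
w = 1.0
for _ in range(3):
    w = l2_sgd_step(w, grad_C0=0.0)
print(w)  # (1 - 0.00005)**3
```

The decay factor only wins out when the unregularized gradient is small, which is exactly the "unless this is a detriment to $C_0$" behaviour described above.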
L1 regularization instead adds the term $\frac{\lambda}{n} \sum_w |w|$ to the cost. This gives the update equation:
$w \to w - \frac{\eta \lambda}{n} \,\mathrm{sgn}(w) - \eta \frac{\partial C_0}{\partial w}$
Comparing this to L2 regularization, the differences are:
- When the weights are large, L2 will penalise more
- When the weights are small, L1 will drive them to 0
This gives the effect that L1 regularization will tend to concentrate the network in a relatively small number of high-importance connections, while the other weights are driven to 0.
During the training procedure, we add extra steps:
- randomly deactivate half of the neurons in the hidden layer
- perform fwd and backprop
- reintroduce the neurons and repeat
After training, all the neurons are kept active, but the outgoing weights from the hidden neurons are halved (because previously the network was learning with only half as many neurons active).
Heuristically, this works because dropout is analogous to ensembling multiple NNs. Another explanation is:
"This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons."
This graph shows that as the training set size increases, the accuracy increases. However, gathering more data can be expensive. Instead, artificially increasing the training set size can make vast improvements:
- Rotating the images by some amount (if arbitrarily large rotations are allowed, then a 6 may look like a 9!)
- "elastic distortions" aim to mimic the variance added by real human handwriting
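A minimal sketch of artificial expansion by translation (my own helper; real pipelines would add small rotations and elastic distortions as well):

```python
import numpy as np

def shift_image(img, dy, dx):
    """Translate a 2-D image by (dy, dx) pixels, zero-filling the edge.
    A slightly shifted digit is still the same digit, so every shift of
    a labelled example yields a 'new' labelled training example."""
    out = np.roll(img, (dy, dx), axis=(0, 1))
    # zero the rows/columns that wrapped around
    if dy > 0: out[:dy, :] = 0
    elif dy < 0: out[dy:, :] = 0
    if dx > 0: out[:, :dx] = 0
    elif dx < 0: out[:, dx:] = 0
    return out

img = np.zeros((5, 5)); img[2, 2] = 1.0
augmented = [shift_image(img, dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
print(len(augmented))  # 9 variants generated from a single example
```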
Note, the converse is also true: reducing the noise in the training data can be more beneficial in e.g. signal processing tasks such as speech recognition.
Key takeaway: Improving training data is just as important as improving the algorithms.
How do our machine learning algorithms perform in the limit of very large data sets?
Tbh, I'm not sure what the correct extrapolation is. Currently, LLMs continue to improve with more data. At some point, the data will be large enough that it encompasses all possibilities, hence the accuracy should reach 100% asymptotically (perhaps only asymptotically, because one could always construct an example that catches out the model?)
Motivation: Consider initialising the weights and biases with independent standard Gaussians, $w, b \sim \mathcal{N}(0, 1)$. For a neuron with $n_{\text{in}}$ inputs, the weighted input $z = \sum_j w_j x_j + b$ then has standard deviation of order $\sqrt{n_{\text{in}}}$, so $|z|$ is likely to be large and the neuron starts out saturated.
A better initialisation scheme is to set the variance of the weights to $1/n_{\text{in}}$, i.e. sample $w \sim \mathcal{N}(0, 1/n_{\text{in}})$, so that $z$ has standard deviation of order 1.
Interestingly, the initialisation of the bias doesn't matter so much (as long as it doesn't cause early saturation).
Note: weight initialisation improves the rate of convergence to an optimal set of weights and biases; however, it doesn't necessarily improve the accuracy of the model. (In Chap 4, better initialisations do actually enable other methods to improve the performance.)
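The saturation argument can be checked numerically; a sketch (my own) comparing the spread of $z$ under the two initialisation schemes:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in = 1000
x = np.ones(n_in)  # worst case: all inputs "on"

# Old scheme: w ~ N(0, 1)  ->  z = w.x has std ~ sqrt(n_in),
# so a sigmoid neuron almost always starts out saturated.
w_old = rng.normal(0.0, 1.0, size=(2000, n_in))
std_old = np.std(w_old @ x)

# Improved scheme: w ~ N(0, 1/n_in)  ->  z has std ~ 1.
w_new = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(2000, n_in))
std_new = np.std(w_new @ x)

print(std_old, std_new)  # roughly sqrt(1000) ~ 31.6 vs roughly 1
```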
Connecting regularization and the improved method of weight initialization
Recall: the weight update in regularized stochastic gradient descent is
$w \to \left(1 - \frac{\eta \lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$
Suppose we are using the original approach to weight initialization. A hand-wavy argument can be formulated to show that L2 regularization leads to behaviour similar to the improved initialization scheme:
- In the first few epochs, $\frac{\partial C}{\partial w}$ will be small because the neurons are saturated. Hence, assuming $\lambda$ is large enough, the weight update equation simplifies to $w' = \left(1 - \frac{\eta \lambda}{n}\right) w$.
- For $\eta \lambda / n \ll 1$, applying the $n/m$ mini-batch updates in one epoch gives a weight decay per epoch of $w' \approx w \exp(-\eta \lambda / m)$.
- (?? Not sure about this step) Supposing $\lambda$ is not too large, the weight decay will tail off when $|w| \sim 1/\sqrt{n}$.
It can be overwhelming to choose hyperparameters, because you don't know where to start.
As a baseline, aim to get any non-trivial learning by performing better than chance. The idea is to meaningfully reduce the problem so that you get feedback quickly enough to iterate.
- Reduce the training set classes (e.g. 0s and 1s only)
- Speeds up training significantly
- This can be useful for debugging as well
- Note that changing the size of the training set will require modification to the relevant hyper-parameters
- Reduce validation set size to speed up evaluation
- Start off with no hidden layer
- Pick the learning rate $\eta$ by monitoring the training cost.
- Use early stopping to determine the number of epochs
- However, in the early stages, don't use early stopping, because increased epochs can help monitor regularization performance
- A good rule of thumb is the no-improvement-in-ten rule, i.e. stop if there is no improvement after 10 epochs. This can be made more lenient as other parameters become well adjusted.
- Other methods exist, which compromise achieving high validation accuracies for not training too long.
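The no-improvement-in-ten rule is easy to state in code; a hedged sketch (the helper name and API are my own):

```python
def should_stop(val_accuracies, patience=10):
    """No-improvement-in-n rule: stop once the best validation accuracy
    was achieved more than `patience` epochs ago."""
    if len(val_accuracies) <= patience:
        return False
    best_epoch = max(range(len(val_accuracies)), key=val_accuracies.__getitem__)
    return len(val_accuracies) - 1 - best_epoch >= patience

accs = [0.90, 0.92, 0.93] + [0.93] * 10  # accuracy flat for 10 epochs
print(should_stop(accs))  # True: time to stop (or loosen the rule)
```

Making the rule "more lenient" is just raising `patience`.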
- A learning rate schedule is used to vary the learning rate during training
- One scheme is to hold the learning rate constant until the validation accuracy starts to worsen, in which case the learning rate should be decreased by a factor of 2 or 10.
- Termination can be done when the learning rate has decreased by a factor of 100 or 1000.
- To determine the regularization parameter $\lambda$, it is worth setting it to 0 and choosing the learning rate first. Then set $\lambda = 1$ and experiment.
- Use the validation accuracy to pick the regularization hyperparameter, mini-batch size, and network parameters.
It's tempting to use gradient descent to try to learn good values for the hyper-parameters themselves.
Potentially this isn't the best idea, because the hyper-parameter landscape isn't broadly convex, and gradient descent is probably too slow for this. Alternative automated methods are:
- grid search
- Bayesian optimisation
Taylor expansion gives a second-order approximation of the cost:
$C(w + \Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w$
Minimising the right-hand side over $\Delta w$ suggests the update $\Delta w = -H^{-1} \nabla C$.
Advantages:
- converges faster than 1st order approximation
- versions of backprop exist to calculate the Hessian
Disadvantage: too expensive in compute and memory (the Hessian has on the order of $n^2$ entries for $n$ weights).
The Hessian adds information about the way the gradient is changing. Momentum-based methods aim to capture this by augmenting the update with a velocity term:
$v \to v' = \mu v - \eta \nabla C, \qquad w \to w' = w + v'$
Note: for $\mu = 1$ there is no "friction" and the velocity accumulates fully, while for $\mu = 0$ the update reduces to plain gradient descent.
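A minimal momentum sketch (my own toy quadratic, not the book's network code):

```python
def momentum_step(w, v, grad, eta=0.1, mu=0.9):
    """Momentum update: v' = mu*v - eta*grad, then w' = w + v'.
    mu = 0 recovers plain gradient descent; mu near 1 means little
    'friction', so the velocity builds up along persistent gradients."""
    v = mu * v - eta * grad
    return w + v, v

# Minimise C(w) = w^2 (gradient 2w), starting from w = 5.
w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_step(w, v, 2.0 * w)
print(w)  # close to the minimum at 0
```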
- Conjugate gradient descent
- BFGS / L-BFGS (the latter being the limited-memory implementation)
- What about Barzilai-Borwein?
- Promising: Nesterov's accelerated gradient technique (though it may be outdated now)
Recall $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$, and note that $\sigma(z) = \frac{1 + \tanh(z/2)}{2}$. Hence, tanh is simply a rescaled and shifted sigmoid, i.e. related by a linear transformation.
Note that the output of tanh is in the range $(-1, 1)$ rather than $(0, 1)$, so activations (and hence the weight updates they feed) can take either sign.
The rectified linear unit is defined as $\mathrm{ReLU}(z) = \max(0, z)$.
ReLU doesn't suffer from the saturation problem for large positive inputs. On the flip side, when $z < 0$ the gradient is exactly zero, so the neuron stops learning entirely (it "dies").
Note: ReLU requires a different initialisation scheme (e.g. He initialisation, with weight variance $2/n_{\text{in}}$). This article is a very helpful introduction.
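A small sketch of ReLU and its gradient (my own), illustrating the no-saturation and zero-gradient regions:

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: 1 for z > 0, exactly 0 for z < 0 (so a neuron
    whose weighted input stays negative receives no weight updates)."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z))       # 0, 0, 0.5, 2.0
print(relu_grad(z))  # 0, 0, 1, 1 -- no shrinking gradient for large z
```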