- Deep neural networks typically have tens of thousands of parameters, sometimes even millions.
- With so many parameters, the network has an incredible amount of freedom and can fit a huge variety of complex datasets.
- But this great flexibility also means that it is prone to overfitting the training set.
- Regularization is a technique that reduces overfitting.
- One of the best regularization techniques is early stopping (see the sketch below).
- Even though Batch Normalization was designed to solve the vanishing/exploding gradients problems, it also acts like a pretty good regularizer.
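A minimal sketch of early stopping in Keras (assuming a compiled model named model and a held-out validation set; the variable names and the patience value are placeholders):

from tensorflow import keras

# Stop training when the validation loss has not improved for 10 epochs,
# and roll the weights back to the best epoch seen so far.
early_stopping_cb = keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  patience=10,
                                                  restore_best_weights=True)

history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[early_stopping_cb])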
- Other popular regularization techniques for neural networks:
- ℓ1 and ℓ2 regularization (ℓ1 corresponds to Lasso Regression, ℓ2 to Ridge Regression); both are based on norms of the weights.
- Dropout
- Max-norm regularization.
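Dropout and max-norm are both one-liners in Keras; a minimal sketch (the dropout rate of 0.2 and the max-norm value of 1.0 are arbitrary choices, not recommendations):

from tensorflow import keras

# Dropout: randomly drops 20% of the previous layer's outputs at each training step.
dropout_layer = keras.layers.Dropout(rate=0.2)

# Max-norm: rescales each neuron's incoming weight vector whenever its ℓ2 norm exceeds 1.0.
constrained_layer = keras.layers.Dense(100, activation="elu",
                                       kernel_initializer="he_normal",
                                       kernel_constraint=keras.constraints.max_norm(1.))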
- We can use ℓ1 and ℓ2 regularization to constrain a neural network’s connection weights (but typically not its biases).
from tensorflow import keras

layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))
- The l2() function returns a regularizer that will be called to compute the regularization loss at each step during training.
- This regularization loss is then added to the final loss.
- You can use keras.regularizers.l1() if you want ℓ1 regularization, and if you want both ℓ1 and ℓ2 regularization, use keras.regularizers.l1_l2() (specifying both regularization factors).
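A one-line sketch of the combined version (the 0.01 factors are arbitrary placeholders):

layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l1_l2(l1=0.01, l2=0.01))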
Since you will typically want to apply the same regularizer to all layers in your network, as well as the same activation function and the same initialization strategy in all hidden layers, you may find yourself repeating the same arguments over and over. This is ugly and error-prone. To avoid it, you can try refactoring your code to use loops. Another option is to use Python's functools.partial() function: it lets you create a thin wrapper for any callable, with some default argument values. For example:

from functools import partial
from tensorflow import keras

RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax",
                     kernel_initializer="glorot_uniform")
])
- The main objective of training a model is to make sure it fits the training data properly while reducing the loss.
- Sometimes a model fits the training data well but performs poorly on unseen data (the test data); this is overfitting, and regularization was introduced to overcome it.
- Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function.
- Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether.
- So this works well for feature selection when we have a huge number of features.
- Methods like cross-validation and stepwise regression can also handle overfitting and perform feature selection, but they work well only with a small set of features.
- Regularization techniques such as ℓ1 and ℓ2, by contrast, remain practical when we are dealing with a large set of features.
- Along with shrinking coefficients, the lasso performs feature selection as well (remember the "selection" in the lasso full form?): some coefficients become exactly zero, which is equivalent to excluding those features from the model.
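A small sketch of this effect with scikit-learn (the synthetic data and the alpha value of 1.0 are just for illustration):

import numpy as np
from sklearn.linear_model import Lasso

# Toy data: only the first feature actually matters.
rng = np.random.RandomState(42)
X = rng.randn(200, 5)
y = 3 * X[:, 0] + 0.1 * rng.randn(200)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
print(lasso.coef_)   # coefficients of the irrelevant features are driven to exactly 0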
- Ridge regression adds the "squared magnitude of the coefficients" as a penalty term to the loss function; this squared-sum term is the ℓ2 regularization element.
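Written out (a common formulation, assuming MSE as the base loss and α as the regularization strength):

$$J_{\text{lasso}}(\theta) = \text{MSE}(\theta) + \alpha \sum_{i=1}^{n} |\theta_i|, \qquad J_{\text{ridge}}(\theta) = \text{MSE}(\theta) + \alpha \sum_{i=1}^{n} \theta_i^2$$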
- R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
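Concretely, it is computed as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $\hat{y}_i$ are the fitted values and $\bar{y}$ is the mean of the observed targets.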
- Data augmentation artificially increases the size of the training set by generating many realistic variants of each training instance.
- This reduces overfitting, making this a regularization technique.
- The generated instances should be as realistic as possible:
- ideally, given an image from the augmented training set, a human should not be able to tell whether it was augmented or not.
- Simply adding white noise will not help; the modifications should be learnable (white noise is not).
- For example, you can slightly shift, rotate and resize every picture in the training set by various amounts and add the resulting pictures to the training set.
- This forces the model to be more tolerant to variations in the position, orientation and size of the objects in the pictures.
- For a model that’s more tolerant of different lighting conditions, you can similarly generate many images with various contrasts.
- In general, you can also flip the pictures horizontally (except for text and other asymmetrical objects).
- By combining these transformations, you can greatly increase the size of your training set.
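A minimal sketch of these transformations as Keras preprocessing layers (assuming TensorFlow 2.6+, where they live directly under keras.layers; the shift/rotation/zoom/contrast factors are arbitrary):

from tensorflow import keras

# Random, label-preserving image transformations applied on the fly during training.
data_augmentation = keras.Sequential([
    keras.layers.RandomTranslation(0.1, 0.1),  # shift up to 10% vertically/horizontally
    keras.layers.RandomRotation(0.05),         # rotate by up to ±5% of a full turn
    keras.layers.RandomZoom(0.1),              # zoom in/out by up to 10%
    keras.layers.RandomFlip("horizontal"),     # skip for text and other asymmetrical objects
    keras.layers.RandomContrast(0.2),          # vary contrast for lighting robustness
])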
- Text Augmentation
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron
- Mathematics for Machine Learning, by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong