Improving Deep Neural Networks - Week 1
Week 1
Settings up your Machine Learning application
Train / Dev / Test sets
- NN require a lot of decisions:
- How many layers?
- How many hidden units?
- What learning rate?
- What activation functions?
- ML is a highly iterative process: start with idea and keep iterating.
- Usually take data and split into 3 sets:
- Training
- Hold-out, cross val, dev set.
- Test set.
- Traditional ML: 60%/20%/20% split when dealing with 100, 1000 or 10000 items in the set..
- Neural nets tend to have much larger datasets (1m+), so split may be more like: 98%/%1/1%, with 10000 items in the dev set.
- Want a dev set that let's you quickly iterate through algorithms to figure out quickly which ones are better.
- May be training on a mismatched train/test set distribution.
- Dev/test sets may have pictures uploaded by users.
- Train may have pictures downloaded from the internet.
- Should ensure dev/test sets are from the same distribution.
- May not need a test set (only dev set), but won't give you an unbias estimate of performance (maybe don't need one?).
Bias / Variance
- High bias: underfitting.
- Doesn't perform well on training set, generally won't perform well on dev set.
- High variance: overfitting.
- Performs well on training set but poorly on the dev set.
- Accuracy of model should be compared to human accuracy: optimal (bayes) error.
- If human's achieve 15% error, then a model that achieve similar would be considered low bias, low variance.
Basic recipe for Machine Learning
-
Basic recipe:
- After training model, ask "does the model have high bias?"
- Try bigger network.
- Train longer.
- Try a different NN architecture (maybe).
- Do you have high variance?
- More data.
- Regularisation.
- Different NN architecture.
- "Trade off" of bias / variance is talked about less in the deep learning era, because there are things you can do that reduce one without increasing the other.
- Increasing data improves variance without bias.
- Bigger networks let you decease bias without hurting variance.
- After training model, ask "does the model have high bias?"
Regularising your neural network
Regularisation
- L2 regularisation =
- Added to cost function definition in linear regression as follows:
- Where controls the amount of regularisation in the model.
- A value of 0 = no regularisation.
- A value of ∞ = only regularisation.
- Generally only regularise the weights, not the bias term.
- Added to cost function definition in linear regression as follows:
- L1 norm =
- L1 norm will generally make the weights sparse, allowing you to compress the model, though doesn't always work well in practise.
- L2 norm in NNs, refers to the sum of the weight matrix or norm of the matrix:
- "Squared norm is the sum of the i sum of j, of each of the elements of that matrix, squared."
- The L2 norm of a matrix is actually called the "Frobenius norm".
- "Squared norm is the sum of the i sum of j, of each of the elements of that matrix, squared."
- The only change to the backprop step is adding the weight as follows:
Dropout Regularization / Understanding Dropout
- Define a prob that a hidden unit will be kept. Eg keep_prob = 80%:
- 20% of nodes will be set to 0.
- Don't use drop out at test time.
- Can't rely on a single feature so they tend to spread out weights.
- Generally shrinks weights.
Other regularisation methods
- Data augmentation.
- Flip images, random zoom, crop, etc.
- Not as good as new data.
- Early stopping.
- As you train longer, the training error tends to decrease but the dev tends to increase. Stop early before dev error increases.
- Can make compartmentalisation of each tasks hard: want to optimise, then reduce overfitting etc.
- Early stopping couples optimisation and overfitting.
- Better to just use L2 regularisation.
Setting up your optimisation problem
Normalising inputs
- Normalising inputs corresponds to 2 steps:
- Subtract out the mean, so you training set has a mean of 0:
- Normalise variance: _ refers to element-wise squaring. _ Then, take each example and divide it by sigma squared:
- Use the calculated mu and sigma squared to normalise test set.
- Unnormalised inputs will require a smaller learning rate.
- Normalised inputs will usually make cost function faster to optimise.
Vanishing / Exploding gradients
- Derivates can get very big (exploding) or very small (vanishing).
- If you initialise your weights to 1.5 * identity matrix (or some value that would cause it to grow exponentially), the value of would explode.
- The inverse is also true: very small weights, could cause to vanish.
- This can cause gradient descent to take a long time to run.
Weight initialization for Deep Networks
- Larger n is (eg the larger the number of inputs is), the small you want each of the corresponding weights to be.
- One approach: set variance of to be 1/n:
- When using relu, generally 2/n works better.
- tanh activation, prefers .
- Also, Xavier init method:
- Can be set as follows: * Where the n for each layer refers to the size of the previous layer's output.
- Doesn't solve, but helps reduce the vanishing/exploding problem.
Numerical approximation of gradients
- Gradient checking: tool that can help determine if an implementation of backprop is correct.
-
Numerical approximation of gradients:
- Start with function:
- Instead of just nudging it in one direction to get an estimate of the gradient, can do it in two: and
-
Make use of both when calculating the derivative, diving by the width of the slope and it should be close to :
-
By setting and and , it works as follows:
(1.01 ** 3 - (0.99)**3) / 2 * (0.001) >> 3.0001000000000057e-05
-
- Then you get an approx error of 0.0001.
- Using this gradient method is twice as slow and generally will only be used in debug not in training.
- Summary: 2-sided difference formula is much more accurate than the one-sided.
Gradient checking
- Take all params and and reshape into a vector then concatenate into a vector .
- Do the same for all derivatives and into vector .
- The result should be close to the derivative (take the eclidean distance of both vectors to confirm):
- If the euclidean distance is off, then it's probably correct. At , it's probably incorrect.
Gradient checking implementation notes
- Don't use in training, only while debugging.
- Train a few iterations with grad check to see if correct then turn off.
- If an algorithm fails grad check, want to look at individual values to try to figure out what's up.
- Remember to include regularization term.
- Won't work with dropout.