1 Practical aspects of Deep Learning

Train / Dev / Test sets

Applied deep learning is a very iterative process.

• In the previous era of machine learning : a 70/30 train/test split if you don’t have an explicit dev set, or a 60/20/20 train/dev/test split
• In the modern big data era : with 1,000,000 examples, a 98/1/1 or even a 99.5/0.25/0.25 split
• Make sure that the dev and test sets come from the same distribution
• It might be okay to not have a test set. The goal of the test set is to give you an unbiased estimate of the performance of the final network you selected. If you don’t need that unbiased estimate, you can work with just a train set and a dev set.
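The split ratios above can be turned into index sets with a few lines of code. This is only a minimal sketch: the function name `split_indices` and the `seed` parameter are made up for illustration. It shuffles once and carves the dev and test sets out of the same shuffled pool, which is one simple way to keep them on the same distribution.

```python
import random


def split_indices(num_examples, dev_fraction, test_fraction, seed=0):
    """Shuffle example indices and split them into train/dev/test lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable

    n_dev = round(num_examples * dev_fraction)
    n_test = round(num_examples * test_fraction)

    dev = indices[:n_dev]
    test = indices[n_dev:n_dev + n_test]
    train = indices[n_dev + n_test:]  # everything left over is training data
    return train, dev, test


# A 98/1/1 split of 1,000,000 examples.
train, dev, test = split_indices(1_000_000, 0.01, 0.01)
print(len(train), len(dev), len(test))
```

With 1,000,000 examples and a 98/1/1 split, the dev and test sets hold 10,000 examples each.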

Bias / Variance

• High Bias : the model does not fit the data very well. We say that it is underfitting the data.
• High Variance : the model is overfitting the training data and does not generalize well.
• The optimal error is sometimes called the Bayes error. Training set error should be judged relative to it.

How to analyze bias and variance when no classifier can do very well :

• Get a sense of how well you are fitting by looking at your training set error
• Then go to the dev set and look at how bad the variance problem is, by comparing the dev set error with the training set error
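The two checks above can be sketched as a small helper. The function name `diagnose` and the `tolerance` threshold are assumptions made for illustration. What counts as a "large" gap depends on the problem, so treat the value of 0.02 as a placeholder, not a recommendation.

```python
def diagnose(train_error, dev_error, bayes_error=0.0, tolerance=0.02):
    """Label a model from its error rates (fractions between 0 and 1).

    `tolerance` is a hypothetical cutoff for what counts as a large gap.
    """
    bias_gap = train_error - bayes_error    # training error vs. optimal error
    variance_gap = dev_error - train_error  # dev error vs. training error

    high_bias = bias_gap > tolerance
    high_variance = variance_gap > tolerance

    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"


print(diagnose(0.01, 0.11))  # small training error, much larger dev error
print(diagnose(0.15, 0.16))  # large training error, similar dev error
print(diagnose(0.15, 0.16, bayes_error=0.15))  # no classifier can do much better
```

The last call shows why the optimal error matters: a 15% training error is not a bias problem if no classifier can do much better than 15%.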

Basic Recipe for Machine Learning

1. Does your algorithm have high bias? To evaluate this, look at the training set performance. If the model does not even fit the training set well, some things you could try are a bigger network or training longer.
2. Getting a bigger network almost always helps. Training longer doesn’t always help, but it certainly never hurts. Other changes may or may not work. Keep trying these things until you get rid of the bias problem, that is, until you can fit the training set pretty well.
3. Once you have reduced bias to an acceptable amount, ask : do you have a variance problem?
4. If you have high variance, the best way to solve it is to get more data. Sometimes you can’t get more data, in which case you could try regularization.

Repeat until hopefully you find something with both low bias and low variance.

Notes :

• If you have a high bias problem, getting more training data is not going to help.
• Getting a bigger network almost always just reduces your bias without necessarily hurting your variance, so long as you regularize appropriately. And getting more data pretty much always reduces your variance and doesn’t hurt your bias much.

Training a bigger network almost never hurts. And the main cost of training a neural network that’s too big is just computational time, so long as you’re regularizing.

Regularization

High Variance Problem :

• try regularization first
• get more training data

Regularization will often help to prevent overfitting, or to reduce the errors in your network.

To add regularization to logistic regression, you add a penalty term to the cost function. The penalty is scaled by $$\lambda$$, which is called the regularization parameter.

• L2 regularization (the most common type of regularization) $$J(w,b) = \frac {1}{m} \sum _{i=1}^{m} L(\hat y ^{(i)}, y ^{(i)}) + \frac {\lambda}{2m} \left \| w \right \| ^{2}_{2}$$
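• L1 regularization : the penalty term becomes $$\frac {\lambda}{2m} \left \| w \right \| _{1} = \frac {\lambda}{2m} \sum _{j} \left | w_j \right |$$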

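Frobenius norm : the matrix counterpart of the L2 norm, used when regularizing the weight matrices of a neural network. The squared Frobenius norm is just the sum of the squares of the elements of a matrix.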

L2 regularization is sometimes also called weight decay.
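A minimal sketch of why the name fits, assuming plain gradient descent on the L2-regularized cost above. The function names and the `learning_rate` parameter are made up for illustration. The gradient of the penalty $$\frac {\lambda}{2m} \left \| w \right \| ^{2}_{2}$$ is $$\frac {\lambda}{m} w$$, so each update first shrinks w by a factor slightly below 1.

```python
def l2_regularized_cost(losses, w, lam):
    """Average loss over m examples plus the L2 penalty (lam / 2m) * ||w||^2."""
    m = len(losses)
    data_cost = sum(losses) / m
    l2_penalty = (lam / (2 * m)) * sum(wj * wj for wj in w)
    return data_cost + l2_penalty


def weight_decay_step(w, dw, lam, m, learning_rate):
    """One gradient descent step on the regularized cost.

    `dw` is the gradient of the unregularized cost. The update
    w - lr * (dw + (lam / m) * w) is rewritten as decay * w - lr * dw.
    """
    decay = 1 - learning_rate * lam / m  # a factor slightly below 1
    return [decay * wj - learning_rate * dwj for wj, dwj in zip(w, dw)]


print(l2_regularized_cost([0.5, 1.5], [1.0, 2.0], lam=4.0))
print(weight_decay_step([1.0, -2.0], [0.0, 0.0], lam=1.0, m=10, learning_rate=0.1))
```

Even with a zero data gradient, the second call moves every weight toward zero, which is the decay.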

Why does regularization reduce overfitting?

One piece of intuition is that if you crank the regularization parameter lambda up to be really big, the optimizer is strongly incentivized to set the weight matrices W reasonably close to zero. The weights for a lot of hidden units may then be so close to zero that the impact of these hidden units is basically zeroed out.

Dropout Regularization

With dropout, we go through each of the layers of the network and set some probability of eliminating each node. You end up with a much smaller, much diminished network, and then you do back propagation training on it.

By far the most common implementation of dropout today is inverted dropout.
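A minimal sketch of inverted dropout for a single layer, using plain Python lists instead of an array library. The function name `inverted_dropout` and the `rng` parameter are assumptions for illustration. The "inverted" part is the division by `keep_prob`, which keeps the expected value of each activation unchanged.

```python
import random


def inverted_dropout(activations, keep_prob, rng=random):
    """Apply inverted dropout to one layer's activations (a list of floats).

    Returns the new activations and the mask, which back propagation
    needs so that it drops the same units.
    """
    # Keep each unit independently with probability keep_prob.
    mask = [rng.random() < keep_prob for _ in activations]
    # Zero the dropped units and scale the survivors by 1 / keep_prob.
    dropped = [a / keep_prob if keep else 0.0
               for a, keep in zip(activations, mask)]
    return dropped, mask


rng = random.Random(0)  # seeded only to make the demo repeatable
out, mask = inverted_dropout([2.0] * 10, keep_prob=0.8, rng=rng)
print(out)
print(mask)
```

Because of the scaling, nothing extra has to be done at test time: you simply run the network without dropout.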

Understanding Dropout

• On every iteration you are working with a smaller neural network, and using a smaller neural network seems like it should have a regularizing effect.
• Similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights, which provides some regularization and helps prevent overfitting.

Notice that a keep_prob of 1.0 means that you’re keeping every unit, so no dropout is applied to that layer.

If you’re more worried about some layers overfitting than others, you can set a lower keep_prob for some layers than others. The downside is that this gives you even more hyperparameters to search for using cross-validation.

One other alternative might be to have some layers where you apply dropout and some layers where you don’t, and then just have one hyperparameter, which is the keep_prob for the layers where you do apply dropout.

In computer vision, the input size is so big, with all those pixels as inputs, that you almost never have enough data. So you’re almost always overfitting, and dropout is very frequently used in computer vision.

One big downside of dropout is that the cost function J is no longer well-defined. On every iteration, you are randomly killing off a bunch of nodes, so it is harder to double check the performance of gradient descent, that is, to verify that a well-defined cost function J is going downhill on every iteration.

Other regularization methods

• data augmentation : flipping images horizontally, random rotations and distortions
• early stopping :
1. The advantage of early stopping is that by running the gradient descent process just once, you get to try out small, mid-size, and large values of w, without needing to try a lot of values of the L2 regularization hyperparameter lambda.
2. The problem is that by stopping gradient descent earlier, you are no longer doing a great job of reducing the cost function J, while at the same time you are trying not to overfit.
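One common way to decide when to stop is to watch the dev set error after each epoch and keep the weights from the epoch where it was lowest. The sketch below assumes you have recorded one dev error per epoch. The function name `early_stopping_epoch` and the `patience` parameter are made up for illustration and do not come from these notes.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Return the index of the epoch whose weights early stopping would keep.

    Training is considered stopped once the dev error has failed to
    improve for `patience` consecutive epochs (a hypothetical setting).
    """
    best_epoch = 0
    best_error = float("inf")
    for epoch, error in enumerate(dev_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop here
    return best_epoch


history = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.25, 0.16]
print(early_stopping_epoch(history))
```

In this made-up history, training stops before the last epoch is ever reached, which shows the trade-off in point 2: stopping early means you may never see what further optimization would have done to J.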

Normalizing inputs

Normalizing your inputs is one of the techniques that will speed up the training of a neural network.

Normalizing your inputs corresponds to two steps :

1. subtract the mean, so that each feature has zero mean
2. normalize the variances, so that each feature has unit variance

If your features came in on similar scales, then this step is less important. Performing this type of normalization pretty much never does any harm, though, so it is often worth doing anyway if you’re not sure whether it will help speed up training.
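The two normalization steps can be sketched as follows, with the data stored as a list of rows, one row per example. The function names are assumptions for illustration. The mean and standard deviation are computed on the training set only and then reused for any other data, so that every set goes through the same transformation.

```python
import math


def fit_normalizer(rows):
    """Compute the per-feature mean and standard deviation of the training rows."""
    m = len(rows)
    n = len(rows[0])
    means = [sum(row[j] for row in rows) / m for j in range(n)]
    variances = [sum((row[j] - means[j]) ** 2 for row in rows) / m
                 for j in range(n)]
    # Guard against constant features, whose variance is zero.
    stds = [math.sqrt(v) if v > 0 else 1.0 for v in variances]
    return means, stds


def normalize(rows, means, stds):
    """Step 1: subtract the mean. Step 2: divide by the standard deviation."""
    return [[(x - mu) / s for x, mu, s in zip(row, means, stds)]
            for row in rows]


train = [[1.0, 100.0], [3.0, 300.0]]  # two features on very different scales
means, stds = fit_normalizer(train)
print(normalize(train, means, stds))
print(normalize([[2.0, 400.0]], means, stds))  # new data reuses the training statistics
```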

When you’re training a very deep network, your derivatives or your slopes can sometimes get either very very big or very very small, maybe even exponentially small, and this makes training difficult.

Use careful choices of the random weight initialization to significantly reduce this problem.
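To see why depth makes this exponential, consider a toy case: a deep network with linear activations where every layer multiplies its input by the same scalar weight. This simplification and the function name `forward_scale` are assumptions made for illustration only.

```python
def forward_scale(weight, num_layers, x=1.0):
    """Pass x through num_layers linear layers that each multiply by `weight`."""
    a = x
    for _ in range(num_layers):
        a = weight * a
    return a


# Slightly above 1 explodes, slightly below 1 vanishes, over 50 layers.
print(forward_scale(1.5, 50))
print(forward_scale(0.5, 50))
```

The output scales like `weight ** num_layers`, and the derivatives computed in back propagation are affected in the same way.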

Weight Initialization for Deep Networks

A partial solution is a more careful choice of the random initialization for your neural network.

Some formulas give a default scale for the initial weight matrices of layer l. Each expression below is the standard deviation of the initial weights, so the variance is its square :

• tanh : Xavier initialization $$\sqrt{\frac{1}{n^{[l-1]}}}$$, or $$\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}$$
• ReLU : $$\sqrt{\frac{2}{n^{[l-1]}}}$$
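A minimal sketch of how these scale factors might be used: draw each weight from a standard normal distribution and multiply it by the factor for the layer. The function names and the activation labels (`"relu"`, `"tanh"`, `"tanh_avg"` for the second tanh formula) are made up for illustration.

```python
import math
import random


def init_scale(n_prev, n_curr, activation):
    """Standard deviation for the weights of a layer with n_prev inputs."""
    if activation == "relu":
        return math.sqrt(2.0 / n_prev)
    if activation == "tanh":
        return math.sqrt(1.0 / n_prev)
    if activation == "tanh_avg":  # hypothetical label for the second tanh formula
        return math.sqrt(2.0 / (n_prev + n_curr))
    raise ValueError("unknown activation: " + activation)


def init_weights(n_prev, n_curr, activation, rng=random):
    """Return an n_curr x n_prev weight matrix as a list of rows."""
    scale = init_scale(n_prev, n_curr, activation)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_prev)]
            for _ in range(n_curr)]


W = init_weights(n_prev=8, n_curr=4, activation="relu", rng=random.Random(0))
print(len(W), len(W[0]))          # rows = units in layer l, columns = inputs
print(init_scale(8, 4, "relu"))   # sqrt(2 / 8)
```

The more inputs a unit has, the smaller its initial weights, which keeps the weighted sum of its inputs from starting out too large or too small.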

When you implement back propagation, a test called gradient checking can really help you make sure that your implementation is correct. Sometimes you write all these equations and you’re just not 100% sure that you’ve got all the details right. To build up to gradient checking, let’s first talk about how to numerically approximate gradients.

How to numerically approximate computations of gradients

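The formal definition of a derivative, written with a two-sided difference : $$f'(\theta) = \lim_{\varepsilon \to 0} \frac {f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon }$$

For a small nonzero $$\varepsilon$$, the fraction on the right is an approximation of the derivative with an error on the order of $$\varepsilon^2$$.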
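A small sketch comparing the two-sided difference with the one-sided one, using $$f(\theta) = \theta^3$$ as an example function. The function names are made up for illustration.

```python
def two_sided_difference(f, theta, eps):
    """Approximate f'(theta) with (f(theta + eps) - f(theta - eps)) / (2 * eps)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)


def one_sided_difference(f, theta, eps):
    """Approximate f'(theta) with (f(theta + eps) - f(theta)) / eps."""
    return (f(theta + eps) - f(theta)) / eps


def cube(theta):
    return theta ** 3  # the exact derivative at theta = 1 is 3


print(two_sided_difference(cube, 1.0, 0.01))
print(one_sided_difference(cube, 1.0, 0.01))
```

With $$\varepsilon = 0.01$$ the two-sided estimate is about 3.0001, while the one-sided estimate is about 3.0301, so the two-sided version is much closer to the exact value of 3.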

You can use this approximation to debug, or to verify that your implementation of back prop is correct.

$$\mathrm{d} \theta _{approx}[i] = \frac {J(\theta_1, \theta_2, \cdots, \theta_i + \varepsilon, \cdots) - J(\theta_1, \theta_2, \cdots, \theta_i - \varepsilon, \cdots)}{2\varepsilon }$$

$$\frac{\left \| \mathrm{d} \theta _{approx} - \mathrm{d} \theta \right \|_2}{\left \| \mathrm{d} \theta _{approx} \right \|_2 + \left \| \mathrm{d} \theta \right \|_2} \left\{\begin{matrix} < 10^{-7} & , \text{that is great}\\ > 10^{-5} & , \text{maybe have a bug somewhere} \end{matrix}\right.$$

Here the norms are taken over the whole gradient vectors, and $$\varepsilon$$ is typically around $$10^{-7}$$.
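The check can be sketched in a few lines, assuming the parameters have been flattened into a single list `theta` and that `J` and `grad_J` are functions you supply. All names here are made up for illustration, and the example cost is a toy function, not a neural network.

```python
import math


def numerical_gradient(J, theta, eps=1e-7):
    """Two-sided difference approximation of dJ/dtheta[i] for every i."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((J(plus) - J(minus)) / (2 * eps))
    return grad


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def gradient_check(J, grad_J, theta, eps=1e-7):
    """Relative difference between the numerical and the analytic gradient."""
    approx = numerical_gradient(J, theta, eps)
    exact = grad_J(theta)
    numerator = l2_norm([a - e for a, e in zip(approx, exact)])
    denominator = l2_norm(approx) + l2_norm(exact)
    return numerator / denominator if denominator > 0 else 0.0


def cost(theta):  # toy cost: theta0^2 + 3 * theta0 * theta1
    return theta[0] ** 2 + 3 * theta[0] * theta[1]


def correct_grad(theta):
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]


def buggy_grad(theta):  # deliberately wrong second component
    return [2 * theta[0] + 3 * theta[1], 2 * theta[0]]


print(gradient_check(cost, correct_grad, [1.0, 2.0]))  # tiny: implementation looks right
print(gradient_check(cost, buggy_grad, [1.0, 2.0]))    # large: there is a bug
```

Gradient checking is slow, because it evaluates J twice per parameter, so it is something to run while debugging and not on every training iteration.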