2 Optimization algorithms

Mini-batch gradient descent

Mini-batch gradient descent, in contrast, refers to the algorithm in which you process a single mini-batch \(X^{\{t\}}\), \(Y^{\{t\}}\) at a time, rather than processing your entire training set \(X\), \(Y\) at the same time.

Mini-batch gradient descent runs much faster than batch gradient descent, and it is pretty much what everyone in deep learning uses when training on a large data set.

Understanding mini-batch gradient descent

  • If the mini-batch size = m, you just end up with batch gradient descent.
  • If the mini-batch size = 1, you get an algorithm called Stochastic Gradient Descent.

Disadvantages of stochastic gradient descent:

  • You lose almost all your speed-up from vectorization, because you are processing a single training example at a time.
  • It never exactly converges; it ends up oscillating in a small region around the minimum. If that’s an issue, you can slowly reduce the learning rate.

Guidelines : 

  • If you have a small training set (maybe fewer than 2000 examples), just use batch gradient descent.
  • If you have a bigger training set, typical mini-batch sizes range from 64 up to maybe 512.
  • Because of the way computer memory is laid out and accessed, your code sometimes runs faster if the mini-batch size is a power of 2 (see the sketch after this list).
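As a concrete illustration, here is a minimal sketch of how a training set might be partitioned into shuffled mini-batches. The function name and variable names are illustrative assumptions, not from the lecture:

import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    # X: (n_x, m) inputs, Y: (1, m) labels; columns are examples
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)            # shuffle the examples
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    mini_batches = []
    for t in range(0, m, mini_batch_size):     # slice off one mini-batch at a time
        X_t = X_shuf[:, t:t + mini_batch_size]
        Y_t = Y_shuf[:, t:t + mini_batch_size]
        mini_batches.append((X_t, Y_t))        # the last batch may be smaller
    return mini_batches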

Exponentially weighted averages

Exponentially weighted averages are also called exponentially weighted moving averages in statistics.

Understanding exponentially weighted averages

The key equation for implementing exponentially weighted averages : \(v_t = \beta v_{t-1} + (1- \beta) \theta _t\)

It takes very little memory.
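A minimal sketch of the update loop, with hypothetical temperature data standing in for the observations \(\theta_t\):

beta = 0.9                               # averages over roughly 1/(1-beta) = 10 values
temperatures = [49, 53, 56, 57, 60]      # hypothetical daily readings
v = 0.0
for theta in temperatures:
    v = beta * v + (1 - beta) * theta    # one variable is all the memory needed
print(v)                                 # the running exponentially weighted average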

Bias correction in exponentially weighted averages

Bias correction can make your computation of these averages more accurate.

If you are concerned about the bias during the initial phase, while your exponentially weighted moving average is still warming up, then bias correction can help you get a better estimate early on.
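The standard correction is to divide by \(1 - \beta^t\), which matters only for small \(t\) and approaches 1 as \(t\) grows:

\(v_t^{corrected} = \frac{v_t}{1 - \beta^t}\)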

Gradient descent with Momentum

Momentum, or gradient descent with momentum, almost always works faster than the standard gradient descent algorithm.

In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead.

This will almost always work better than the straightforward gradient descent algorithm without momentum.
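On each iteration \(t\), compute \(\mathrm{d}W\), \(\mathrm{d}b\) on the current mini-batch, then update:

\(\begin{matrix}
v_{\mathrm{d}W} = \beta v_{\mathrm{d}W} + (1-\beta)\mathrm{d}W, & W := W - \alpha v_{\mathrm{d}W}\\
v_{\mathrm{d}b} = \beta v_{\mathrm{d}b} + (1-\beta)\mathrm{d}b, & b := b - \alpha v_{\mathrm{d}b}
\end{matrix}\)

\(\beta = 0.9\) is a robust default.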

RMSprop

RMSprop (Root Mean Square Prop) is another algorithm that can speed up gradient descent.

You want to slow down learning in the b direction (the vertical direction) and speed up learning, or at least not slow it down, in the horizontal direction. This is what the RMSprop algorithm accomplishes:
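The update keeps an exponentially weighted average of the squared gradients and divides by its square root (\(\varepsilon \approx 10^{-8}\) is added for numerical stability):

\(\begin{matrix}
S_{\mathrm{d}W} = \beta_2 S_{\mathrm{d}W} + (1-\beta_2)\mathrm{d}W^2 \ (element\text{-}wise \ square)\\
W := W - \alpha \frac{\mathrm{d}W}{\sqrt{S_{\mathrm{d}W}} + \varepsilon}
\end{matrix}\)

Directions with large gradients accumulate a large \(S\) and get damped; directions with small gradients get relatively amplified.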

Adam optimization algorithm

RMSprop and the Adam optimization algorithm are among those rare algorithms that have really stood up and been shown to work well across a wide range of deep learning architectures.

And the Adam optimization algorithm is basically taking momentum and RMSprop and putting them together.
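A minimal sketch of one Adam update for a single parameter matrix, assuming \(\mathrm{d}W\) has already been computed on the current mini-batch; the hyperparameter defaults follow the lecture (\(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\varepsilon = 10^{-8}\)):

import numpy as np

def adam_update(W, dW, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the iteration count, starting from 1
    v = beta1 * v + (1 - beta1) * dW             # momentum-like first moment
    s = beta2 * s + (1 - beta2) * dW ** 2        # RMSprop-like second moment
    v_hat = v / (1 - beta1 ** t)                 # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)  # combined update
    return W, v, s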

Learning rate decay

One of the things that might help speed up your learning algorithm, is to slowly reduce your learning rate over time.

Some formulas:

\(\begin{matrix}
\alpha = \frac {1}{1 + \text{decay-rate} \times \text{epoch-num}} \alpha _0\\
\alpha = \frac {k}{\sqrt{\text{epoch-num}}} \alpha _0\\
\alpha = \frac {k}{\sqrt{t}} \alpha _0
\end{matrix}\)
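For example, with \(\alpha_0 = 0.2\) and decay-rate = 1, the first formula gives \(\alpha = 0.1\) on epoch 1, \(\alpha \approx 0.067\) on epoch 2, \(\alpha = 0.05\) on epoch 3, and so on: the learning rate gradually decreases as training proceeds.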

The problem of local optima

In high dimensions, most points of zero gradient in a cost function are saddle points rather than local optima.

In very high-dimensional spaces you’re actually much more likely to run into a saddle point.

1 Practical aspects of Deep Learning

Train / Dev / Test sets

Applied deep learning is a very iterative process.

  • In the previous era of machine learning: a 70/30 train/test split if you don’t have an explicit dev set, or maybe a 60/20/20% train/dev/test split
  • In the modern big data era: with, say, 1,000,000 examples, 98/1/1 or even 99.5/0.25/0.25
  • Make sure that the dev and test sets come from the same distribution
  • It might be okay to not have a test set. The goal of the test set is to give you an unbiased estimate
    of the performance of your final network, of the network that you selected. But if you don’t need that unbiased estimate, then it might be okay to not have a test set.

Bias / Variance

  • High bias: not a very good fit to the data; we say that this is underfitting the data.
  • High variance: this is overfitting the data and not generalizing well.
  • The optimal error is sometimes called the Bayes error.

How to analyze bias and variance when no classifier can do very well :

  • Get a sense of how well you are fitting by looking at your training set error
  • Go to the dev set and look at how bad is the variance problem

Basic Recipe for Machine Learning

  1. Does your algorithm have high bias? To evaluate this, look at how well it fits the training set. If it does not even fit the training set well, some things you could try are picking a bigger network (more hidden layers or hidden units) or training longer.
  2. Maybe these work, maybe not, but getting a bigger network almost always helps, and training longer doesn’t always help but certainly never hurts. Keep trying these things until you can at least fit the training set pretty well and the bias problem is gone.
  3. Once you reduce bias to an acceptable amount, ask: do you have a variance problem?
  4. If you have high variance, the best way to solve it is to get more data. But sometimes you can’t get more data, in which case you could try regularization.

Repeat until hopefully you find something with both low bias and low variance.

Notes :

  • If you actually have a high bias problem, getting more training data is actually not going to help.
  • Getting a bigger network almost always just reduces your bias without necessarily hurting your variance, so long as you regularize appropriately. And getting more data pretty much always reduces your variance and doesn’t hurt your bias much.

Training a bigger network almost never hurts. And the main cost of training a neural network that’s too big is just computational time, so long as you’re regularizing.

Regularization

High Variance Problem :

  • try regularization
  • get more training data

Regularization will often help to prevent overfitting, or to reduce the errors in your network.

To add regularization to logistic regression, you add a penalty term weighted by \(\lambda\), which is called the regularization parameter.

  • L2 regularization (the most common type of regularization) \(J(w,b) = \frac {1}{m} \sum _{i=1}^{m} L(\hat y ^{(i)}, y ^{(i)}) + \frac {\lambda}{2m} \left \| w \right \| ^{2}_{2}\)
  • L1 regularization \(J(w,b) = \frac {1}{m} \sum _{i=1}^{m} L(\hat y ^{(i)}, y ^{(i)}) + \frac {\lambda}{2m} \left \| w \right \| _{1}\)

Frobenius norm (used instead of calling it the L2 norm of a matrix): its square is just the sum of the squares of the elements of the matrix.

L2 regularization is sometimes also called weight decay.
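The name comes from the gradient descent update: the L2 term adds \(\frac{\lambda}{m}W^{[l]}\) to \(\mathrm{d}W^{[l]}\), so

\(W^{[l]} := (1 - \frac{\alpha\lambda}{m})W^{[l]} - \alpha(\text{term from backprop})\)

i.e. the weights are multiplied by a factor slightly less than 1 on every step.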

Why regularization reduces overfitting?

One piece of intuition is that if you crank the regularization parameter \(\lambda\) up to be really, really big, the optimization will be strongly incentivized to set the weight matrices W reasonably close to zero. Setting the weights so close to zero for a lot of hidden units basically zeroes out the impact of those hidden units, leaving a much simpler network.

Dropout Regularization

With dropout, what we’re going to do is go through each of the layers of the network and set some probability of eliminating each node in the network. You end up with a much smaller, much diminished network, and then you do backpropagation training on it.

By far the most common implementation of dropout today is inverted dropout.
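The lecture illustrates inverted dropout on layer 3; a minimal sketch, with a stand-in activation matrix a3:

import numpy as np

a3 = np.random.randn(5, 10)   # stand-in activations for layer 3
keep_prob = 0.8
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean mask
a3 = np.multiply(a3, d3)      # zero out roughly 20% of the units
a3 = a3 / keep_prob           # the inversion: keeps the expected value of a3 unchanged

The division by keep_prob is what makes it inverted dropout, and it means no extra scaling is needed at test time.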

Understanding Dropout

  • It’s as if on every iteration you’re working with a smaller neural network, and using a smaller neural network seems like it should have a regularizing effect.
  • Similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights, which helps prevent overfitting.

Notice that a keep_prob of 1.0 means that you’re keeping every unit.

If you’re more worried about some layers overfitting than others, you can set a lower keep_prob for those layers. The downside is that this gives you even more hyperparameters to search over using cross-validation.

One other alternative might be to have some layers where you apply dropout and some layers where you don’t apply dropout and then just have one hyper parameter, which is the keep_prob for the layers for which you do apply dropout.

In computer vision, the input size is so big (you’re inputting all these pixels) that you almost never have enough data, so you’re almost always overfitting; dropout is therefore very frequently used in computer vision.

One big downside of dropout is that the cost function J is no longer well defined: on every iteration you are randomly killing off a bunch of nodes. So if you are double-checking the performance of gradient descent, it is harder to verify that you have a well-defined cost function J that is going downhill on every iteration.

Other regularization methods

  • data augmentation : flipping it horizontally, random rotations and distortions
  • early stopping : 
    1. And the advantage of early stopping is that running the gradient descent process just once, you get to try out values of small w, mid-size w, and large w, without needing to try a lot of values of the L2 regularization hyperparameter lambda.
    2. The problem is that by stopping gradient descent early, you’re not doing a great job of reducing the cost function J, while at the same time you’re also trying not to overfit; early stopping couples these two tasks together.

Normalizing inputs

Normalizing inputs is one of the techniques that will speed up training when you’re training a neural network.

Normalizing your inputs corresponds to two steps : 

  1. subtract out or to zero out the mean
  2. normalize the variances
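In symbols, over the training set \(x^{(1)}, \dots, x^{(m)}\):

\(\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \quad x := x - \mu, \quad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)})^2 \ (element\text{-}wise), \quad x := x / \sigma\)

Use the same \(\mu\) and \(\sigma\) to normalize the test set.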

If your features came in on similar scales, then this step is less important, although performing this type of normalization pretty much never does any harm, so I’ll often do it anyway if I’m not sure whether or not it will help with speeding up training for your algorithm.

Vanishing / Exploding gradients

When you’re training a very deep network, your derivatives or your slopes can sometimes get either very very big or very very small, maybe even exponentially small, and this makes training difficult.

Use careful choices of the random weight initialization to significantly reduce this problem.

Weight Initialization for Deep Networks

More careful choice of the random initialization for your neural network.

Some formulas give a default value to use for the variance of the initialization of the weight matrices:

  • tanh : Xavier initialization \(\sqrt{\frac{1}{n^{[l-1]}}}\), or \(\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}\)
  • ReLU : \(\sqrt{\frac{2}{n^{[l-1]}}}\) (see the numpy sketch below)
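A minimal numpy sketch of the ReLU (He) variant for one layer; the layer widths here are illustrative:

import numpy as np

n_prev, n_l = 64, 32                                       # illustrative layer widths
W = np.random.randn(n_l, n_prev) * np.sqrt(2.0 / n_prev)   # variance 2/n^{[l-1]}
b = np.zeros((n_l, 1))                                     # biases can start at zero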

Numerical approximation of gradients

When you implement backpropagation, you’ll find that there’s a test called gradient checking that can really help you make sure your implementation of backprop is correct, because sometimes you write all these equations and you’re just not 100% sure you’ve got all the details right. In order to build up to gradient checking, let’s first talk about how to numerically approximate computations of gradients.

How to numerically approximate computations of gradients

The two-sided difference approximation to the derivative: \(f'(\theta) \approx \frac {f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon }\)

Gradient checking

How you can use numerical gradient approximation to debug, or to verify, that your implementation of backprop is correct.

\(\mathrm{d} \theta _{approx}[i] = \frac {J(\theta_1, \theta_2, \cdots, \theta_i + \varepsilon, \cdots) - J(\theta_1, \theta_2, \cdots, \theta_i - \varepsilon, \cdots)}{2\varepsilon }\)

 

\(\frac{\left \| \mathrm{d} \theta _{approx} - \mathrm{d} \theta \right \|_2}{\left \| \mathrm{d} \theta _{approx} \right \|_2 + \left \| \mathrm{d} \theta \right \|_2} \left\{\begin{matrix}
< 10^{-7} & , that's great\\
> 10^{-5} & , maybe have a bug somewhere
\end{matrix}\right.\)

 

Gradient Checking Implementation Notes

  1. Don’t use in training – only to debug
  2. If the algorithm fails grad check, look at individual components to try to identify the bug (a sketch of the check follows this list)
  3. Remember regularization (include the regularization term in J when checking)
  4. Doesn’t work with dropout (turn dropout off, e.g. set keep_prob = 1.0, while grad checking)
  5. Run at random initialization; perhaps again after some training.
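Putting these notes into practice, a minimal sketch of the check over a flat parameter vector theta; the cost function J and the backprop gradient grad are assumed given:

import numpy as np

def grad_check(J, theta, grad, eps=1e-7):
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps                     # J(theta_1, ..., theta_i + eps, ...)
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - grad)
    denom = np.linalg.norm(approx) + np.linalg.norm(grad)
    return num / denom                     # ~1e-7 great; ~1e-5 or worse, investigate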

4 Deep Neural Networks

Deep L-layer neural network

Over the last several years, the AI and machine learning community has realized that there are functions that very deep neural networks can learn which shallower models are often unable to.

Although for any given problem it might be hard to predict in advance exactly how deep a network you need, it is reasonable to start by trying logistic regression and then progressively deeper models.

Symbol definition for deep learning : 

  • \(L\) : the number of layers in the network
  • \(n^{[l]}\) : the number of nodes, or units, in layer \(l\) (e.g. \(n^{[1]} = 5\))
  • \(a^{[l]}\) : the activations in layer l
  • \(a^{[l]} = g^{[l]}(z^{[l]})\) : activations are computed by applying the activation function g of layer l
  • \(W^{[l]}\) : weights of layer l
  • \(x\) : the input features, with \(x = a^{[0]}\)
  • \(\hat {y} = a^{[L]}\) : the activation of the final layer

Forward and backward propagation

forward propagation : \(\begin{matrix}
z^{[l]} = W^{[l]} \cdot a^{[l – 1]} + b^{[l]}\\
a^{[l]} = g^{[l]}(z^{[l]})
\end{matrix}\)

vectorized version : \(\begin{matrix}
Z^{[l]} = W^{[l]} \cdot A^{[l - 1]} + b^{[l]}\\
A^{[l]} = g^{[l]}(Z^{[l]})
\end{matrix}\)

backward propagation : \(\begin{matrix}
\mathrm{d}z^{[l]} = \mathrm{d}a^{[l]} * g^{[l]\prime}(z^{[l]})\\
\mathrm{d}W^{[l]} = \mathrm{d}z^{[l]} \cdot a^{[l-1]T}\\
\mathrm{d}b^{[l]} = \mathrm{d}z^{[l]} \\
\mathrm{d}a^{[l-1]} = W^{[l]T} \cdot \mathrm{d}z^{[l]}\\
\mathrm{d}z^{[l]} = W^{[l+1]T} \mathrm{d}z^{[l+1]} * g^{[l]\prime}(z^{[l]})
\end{matrix}\)

vectorized version : \(\begin{matrix}
\mathrm{d}Z^{[l]} = \mathrm{d}A^{[l]} * g^{[l]\prime}(Z^{[l]})\\
\mathrm{d}W^{[l]} = \frac{1}{m} \mathrm{d}Z^{[l]} \cdot A^{[l-1]T}\\
\mathrm{d}b^{[l]} = \frac{1}{m} np.sum(\mathrm{d}Z^{[l]} , axis = 1, keepdims = True)\\
\mathrm{d}A^{[l-1]} = W^{[l]T} \cdot \mathrm{d}Z^{[l]}
\end{matrix}\)

Forward propagation in a Deep Network

for a single training example : \(z^{[l]} = w^{[l]}a^{[l-1]} + b^{[l]}, \ a^{[l]} = g^{[l]}(z^{[l]})\)

vectorized way : \(Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}, \ A^{[l]} = g^{[l]}(Z^{[l]}) \ (A^{[0]} = X)\)

Getting your matrix dimensions right

One of the debugging tools to check the correctness of my code is to work through the dimensions of all the matrices involved.

Making sure that all the matrix dimensions are consistent will usually help you go some way toward eliminating a whole class of possible bugs.
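The rules to check, with \(m\) training examples:

\(W^{[l]}, \mathrm{d}W^{[l]} : (n^{[l]}, n^{[l-1]}), \quad b^{[l]}, \mathrm{d}b^{[l]} : (n^{[l]}, 1), \quad Z^{[l]}, A^{[l]} : (n^{[l]}, m)\)

(\(b^{[l]}\) is broadcast across the \(m\) columns.)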

Why deep representations?

Deep neural networks work really well for a lot of problems. It’s not just that they need to be big; specifically, they need to be deep, i.e. have a lot of hidden layers.

  1. The earlier layers learn low-level, simpler features, and the later, deeper layers put together the simple things detected earlier in order to detect more complex things.
  2. If you try to compute the same function with a shallow network, i.e. without enough hidden layers, you might require exponentially more hidden units.

Starting out on a new problem : 

  • Start out with logistic regression, then try something with one or two hidden layers, and treat the number of layers as another hyperparameter that you tune.
  • Over the last several years, though, there has been a trend toward finding that for some applications very, very deep neural networks can be the best model for the problem.

Building blocks of deep neural networks

Nothing ……

Parameters vs Hyperparameters

Hyperparameters are parameters that control the ultimate parameters W and b, so we call all of the things below hyperparameters:

  • \(\alpha\) (learning rate)
  • iterations (the number of iterations of gradient descent)
  • L (the number of hidden layers)
  • \(n^{[l]}\) (the number of hidden units)
  • choice of activation function

Find the best value :

Idea—Code—Experiment—Idea— ……

Try a few values for the hyperparameters, double-check whether there’s a better value, and as you do so you slowly gain intuition about how the hyperparameters behave.

What does this have to do with the brain?

Maybe that was useful but now the field has moved to the point where that analogy is breaking down.

3 Shallow Neural Networks

Neural Network Overview

Superscript \((i)\) refers to training example i : \(x^{(i)}\)

Superscript \([m]\) refers to layer m : \(a^{[m]}\)

Algorithm 3.1 : \(\left.\begin{matrix}
x\\
w\\
b
\end{matrix}\right\}
\Rightarrow z = w^Tx + b\)

Algorithm 3.2 : \(\left.\begin{matrix}
x\\
w\\
b
\end{matrix}\right\}
\Rightarrow z = w^Tx + b
\Rightarrow a = \sigma(z)
\Rightarrow L(a,y) \ (Loss \ Function)\)

Algorithm 3.3 : \(\left.\begin{matrix}
x\\
W^{[1]}\\
b^{[1]}
\end{matrix}\right\}
\Rightarrow z^{[1]} = W^{[1]}x + b^{[1]}
\Rightarrow a^{[1]} = \sigma(z^{[1]})\)

Algorithm 3.4 : \(\left.\begin{matrix}
x\\
dW^{[1]}\\
db^{[1]}
\end{matrix}\right\}
\Leftarrow dz^{[1]} = d(W^{[1]}x + b^{[1]})
\Leftarrow da^{[1]} = d\sigma(z^{[1]})\)

Algorithm 3.6 : \(\left.\begin{matrix}
da^{[1]} = d\sigma(z^{[1]})\\
dW^{[2]}\\
db^{[2]}
\end{matrix}\right\}
\Leftarrow dz^{[2]} = d(W^{[2]}a^{[1]} + b^{[2]})
\Leftarrow da^{[2]} = d\sigma(z^{[2]})
\Leftarrow dL(a^{[2]}, y)\)

Neural Network Representation

  • input layer
  • hidden layer
  • output layer

\(a\) : activations

\(a^{[0]}\) : the activations of the input layer

\(a^{[0]}_1\) : first node

Algorithm 3.7 : \(a^{[1]}=\begin{bmatrix}
a^{[1]}_1\\
a^{[1]}_2\\
a^{[1]}_3\\
a^{[1]}_4
\end{bmatrix}\)

  • when we count layers in neural networks we don’t count the input layer so the hidden layer is layer 1
  • In our notational convention we’re calling the input layer layer 0

so a two layer neural network looks like a neural network with one hidden layer.

Computing a Neural Network’s output

Symbols in neural networks:

  • \(x\) : input features
  • \(a\) : output activations
  • \(W\) : weights
  • superscript \([l]\) : the layer
  • subscript \(i\) : the node number within the layer

How this neural network computes its outputs:

  1. \(z_1^{[1]} = w_1^{[1]T}x + b_1^{[1]}\)
  2. \(a_1^{[1]} = \sigma(z_1^{[1]})\)
  3. similarly for \(z_2^{[1]}, a_2^{[1]}\), \(z_3^{[1]}, a_3^{[1]}\), and \(z_4^{[1]}, a_4^{[1]}\)

Vectorizing : stack nodes in a layer vertically

Algorithm 3.10 : \(a^{[1]} = \begin{bmatrix}
a_1^{[1]}\\
a_2^{[1]}\\
a_3^{[1]}\\
a_4^{[1]}
\end{bmatrix}
= \sigma(z^{[1]})\)

Algorithm 3.11 : \(\begin{bmatrix}
z_1^{[1]}\\
z_2^{[1]}\\
z_3^{[1]}\\
z_4^{[1]}
\end{bmatrix}
=
\begin{bmatrix}
\cdots W_1^{[1]T} \cdots \\
\cdots W_2^{[1]T} \cdots \\
\cdots W_3^{[1]T} \cdots \\
\cdots W_4^{[1]T} \cdots
\end{bmatrix}
*
\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}
+
\begin{bmatrix}
b_1^{[1]}\\
b_2^{[1]}\\
b_3^{[1]}\\
b_4^{[1]}
\end{bmatrix}\)
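A minimal numpy sketch of this layer-1 computation, with shapes matching the example (3 inputs, 4 hidden units):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.random.randn(3, 1)          # one example with 3 features
W1 = np.random.randn(4, 3) * 0.01  # 4 hidden units, one row of weights each
b1 = np.zeros((4, 1))
z1 = W1 @ x + b1                   # stacks the four z_i^{[1]} into one vector
a1 = sigmoid(z1)                   # a^{[1]} = sigma(z^{[1]})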

Vectorizing across multiple examples

Take the equations you had from the previous algorithm and with very little modification, change them to make the neural network compute the outputs on all the examples, pretty much all at the same time.

\(a^{[2](i)}\) : Refers to training example i and layer two

Algorithm 3.12 : \(x = \begin{bmatrix}
\vdots & \vdots & \vdots & \vdots \\
x^{(1)} & x^{(2)} & \cdots & x^{(m)}\\
\vdots & \vdots & \vdots & \vdots
\end{bmatrix}\)

Algorithm 3.13 : \(Z^{[1]} = \begin{bmatrix}
\vdots & \vdots & \vdots & \vdots \\
z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)}\\
\vdots & \vdots & \vdots & \vdots
\end{bmatrix}\)

Algorithm 3.14 : \(A^{[1]} = \begin{bmatrix}
\vdots & \vdots & \vdots & \vdots \\
a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)}\\
\vdots & \vdots & \vdots & \vdots
\end{bmatrix}\)

Algorithm 3.15 : \(\left.\begin{matrix}
z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]}\\
a^{[1](i)} = \sigma(z^{[1](i)})\\
z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]}\\
a^{[2](i)} = \sigma(z^{[2](i)})
\end{matrix}\right\}
\Rightarrow
\left\{\begin{matrix}
Z^{[1]} = W^{[1]}X + b^{[1]}\\
A^{[1]} = \sigma (Z^{[1]})\\
Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}\\
A^{[2]} = \sigma (Z^{[2]})
\end{matrix}\right.\)

Justification for vectorized implementation

Algorithm 3.16 : \(\begin{matrix}
z^{[1](1)} = W^{[1]}x^{(1)} + b^{[1]}\\
z^{[1](2)} = W^{[1]}x^{(2)} + b^{[1]}\\
z^{[1](3)} = W^{[1]}x^{(3)} + b^{[1]}
\end{matrix}\)

Algorithm 3.17 : \(W^{[1]}x = \begin{bmatrix}
\cdots \\
\cdots \\
\cdots
\end{bmatrix}
\begin{bmatrix}
\vdots &\vdots &\vdots &\vdots \\
x^{(1)} &x^{(2)} &x^{(3)} &\vdots \\
\vdots &\vdots &\vdots &\vdots
\end{bmatrix}
=
\begin{bmatrix}
\vdots &\vdots &\vdots &\vdots \\
W^{[1]}x^{(1)} &W^{[1]}x^{(2)} &W^{[1]}x^{(3)} &\vdots \\
\vdots &\vdots &\vdots &\vdots
\end{bmatrix}
=
\begin{bmatrix}
\vdots &\vdots &\vdots &\vdots \\
z^{[1](1)} &z^{[1](2)} &z^{[1](3)} &\vdots \\
\vdots &\vdots &\vdots &\vdots
\end{bmatrix}
= Z^{[1]}\)

Stack the training examples in the columns of matrix \(X\), and their outputs are likewise stacked in the columns of matrix \(Z^{[1]}\).

Activation functions

Algorithm 3.18 Sigmoid : \(a = \sigma (z) = \frac{1}{1 + e^{-z}}\)

Algorithm 3.19 tanh : \(a = \tanh (z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\) (almost always works better than the sigmoid function)

Algorithm 3.20 hidden layer : \(g(z^{[1]}) = tanh(z^{[1]})\) (almost always strictly superior)

Algorithm 3.21 binary output layer : \(g(z^{[2]}) = \sigma(z^{[2]})\) (if y is either 0 or 1)

If z is either very large or very small, then the gradient (the slope of the function) becomes very small, close to zero, and this can slow down gradient descent.

Algorithm 3.22 ReLU (Rectified Linear Unit) : \(a = max(0, z)\)

Algorithm 3.23 Leaky ReLU : \(a = max(0.01z, z)\)

some rules of thumb for choosing activation functions :

  • sigmoid : only really for the output layer of binary classification
  • tanh : pretty much strictly superior to sigmoid for hidden layers
  • ReLU : the default choice

If you’re not sure which of these activation functions works best, try them all, evaluate on a holdout validation set (a development set, which we’ll talk about later), and see which one works better; then go with that.

Why need a nonlinear activation function?

It turns out that for your neural network to compute interesting functions, you do need to use a nonlinear activation function.

If you use a linear activation function, or equivalently if you don’t have an activation function, then no matter how many layers your network has, all it is ever doing is computing a linear function of the input, so you might as well not have any hidden layers.
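For a two-layer network with identity activations this is easy to verify:

\(a^{[2]} = W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]} = (W^{[2]}W^{[1]})x + (W^{[2]}b^{[1]} + b^{[2]}) = W'x + b'\)

which is again just a linear function of x.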

Derivatives of activation functions

Algorithm 3.25 : \(\frac{\mathrm{d} }{\mathrm{d} z}g(z) = \frac{1}{1+e^{-z}}(1 - \frac{1}{1+e^{-z}}) = g(z)(1 - g(z))\)

Algorithm 3.26 : \(g(z) = \tanh (z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\)

Algorithm 3.27 : \(\frac{\mathrm{d} }{\mathrm{d} z}g(z) = 1 - (tanh(z))^2\)

Algorithm Rectified Linear Unit (ReLU) : \(g'(z) = \left\{\begin{matrix}
0 & if\ z < 0\\
1 & if\ z > 0\\
undefined & if\ z = 0
\end{matrix}\right.\)

Algorithm Leaky ReLU : \(g'(z) = \left\{\begin{matrix}
0.01 & if\ z < 0\\
1 & if\ z > 0\\
undefined & if\ z = 0
\end{matrix}\right.\)

Gradient descent for neural networks

forward propagation : 

(1) : \(z^{[1]} = W^{[1]}x + b^{[1]}\)

(2) : \(a^{[1]} = \sigma(z^{[1]})\)

(3) : \(z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}\)

(4) : \(a^{[2]} = g^{[2]}(z^{[2]}) = \sigma(z^{[2]})\)

back propagation : 

Algorithm 3.32 : \(\mathrm{d}z^{[2]} = A^{[2]} - Y, \ Y = [y^{(1)} \ y^{(2)} \ \cdots \ y^{(m)}]\)

Algorithm 3.33 : \(\mathrm{d}W^{[2]} = \frac{1}{m} \mathrm{d}z^{[2]}A^{[1]T}\)

Algorithm 3.34 : \(\mathrm{d}b^{[2]} = \frac{1}{m} np.sum(\mathrm{d}z^{[2]}, axis = 1, keepdims = True)\)

Algorithm 3.35 : \(\mathrm{d}z^{[1]} = \underbrace{W^{[2]T}\mathrm{d}z^{[2]}}_{(n^{[1]}, m)} * \underbrace{g^{[1]\prime}(z^{[1]})}_{(n^{[1]}, m)}\) (element-wise product)

Algorithm 3.36 : \(\mathrm{d}W^{[1]} = \frac{1}{m}\mathrm{d}z^{[1]}x^T\)

Algorithm 3.37 : \(\mathrm{d}b^{[1]} = \frac {1}{m} np.sum(\mathrm{d}z^{[1]}, axis = 1, keepdims = True)\) (axis = 1 sums horizontally across examples; keepdims ensures that Python outputs a \((n^{[1]}, 1)\) column vector rather than a rank-1 array)

Backpropagation intuition

It is one of the very hardest pieces of math. One of the very hardest derivations in all of machine learning.

//TODO, maybe never ...

Random Initialization

It is important to initialize the weights randomly.

  1. Draw a matrix of Gaussian random values, e.g. 2×2 : \(W^{[1]} = np.random.randn(2, 2)\)
  2. then usually you multiply this by a very small number such as 0.01, so you initialize the weights to very small random values (sketched below)
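Putting the two steps together for the 2×2 example (b does not need random initialization, because symmetry is already broken by W):

import numpy as np

W1 = np.random.randn(2, 2) * 0.01  # small values keep z in the steep region of tanh/sigmoid
b1 = np.zeros((2, 1))              # zeros are fine for the biases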

2 Basics of Neural Network programming

Binary Classification

Logistic regression is an algorithm for binary classification.

Logistic Regression

\(\hat{y} = w^Tx + b\) : a first attempt at an algorithm that outputs a prediction. More formally, you want \(\hat{y}\) to be the probability that \(y = 1\), so \(\hat{y}\) should really be between zero and one, which is why the sigmoid is applied to this linear output.

sigmoid : \(\sigma (z) = \frac {1}{1 + e^{-z}}\)

Logistic Regression Cost Function

To train the parameters w and b of the logistic regression model, we need to define a cost function.

\(\hat y^{(i)} = \sigma(z^{(i)})\), where \(z^{(i)} = w^Tx^{(i)} + b\)

Loss function : \(L(\hat{y}, y)\) (measure how good our output \(\hat{y}\) is when the true label is \(y\))

Loss Function In Logistic Regression : \(L(\hat{y}, y) = -ylog(\hat{y}) - (1-y)log(1 - \hat{y})\) (It measures how well you’re doing on a single training example)

Cost Function : \(J(w,b) = \frac {1}{m} \sum_{i=1}^{m} L(\hat{y} ^{(i)}, y ^{(i)}) =\frac {1}{m} \sum_{i=1}^{m} (-y ^{(i)}log\hat{y} ^{(i)} - (1-y ^{(i)})log(1 - \hat{y} ^{(i)}))\) (It measures how well you’re doing on the entire training set)

Gradient Descent

Cost function J is a convex function

  1. Random initialization
  2. takes a step in the steepest downhill direction repeatedly \(\left\{\begin{matrix}
    w := w – \alpha \frac {\partial J(w,b)}{\partial w} \\
    b := b – \alpha \frac {\partial J(w,b)}{\partial b}
    \end{matrix}\right.\)
  3. converge to this global optimum or get to something close to the global optimum

Derivatives

Really all you need is an intuitive understanding of this in order to build and successfully apply these algorithms.

Watch the videos, and if you can do the homework and complete the programming exercises successfully, then you can apply deep learning.

More Derivative Examples

  • The derivative of a function just means its slope, and the slope of a function can be different at different points on the function.
  • If you want to look up the derivative of a function, you can flip open your calculus textbook or look at Wikipedia and often get a formula for the slope at different points.

Computation Graph

In order to compute derivatives, you make a right-to-left pass, going in the opposite direction of the blue arrows. To recap: the computation graph organizes a computation as a left-to-right (blue arrow) forward pass, and derivatives are then computed with a right-to-left backward pass.

Derivatives with a Computation Graph

If you want to compute the derivative of the final output variable (the variable you usually care most about) with respect to v, then you have done one step of backpropagation; we call it one step backwards in this graph.

By changing a, you end up increasing v; how much v increases is determined by dv/da, and the change in v in turn causes the value of J to increase. In calculus this is called the chain rule.

A computation graph uses a forward, left-to-right calculation to compute the cost function you want to optimize, and a backward, right-to-left calculation to compute the derivatives.

Logistic Regression Gradient Descent

How to compute derivatives for you to implement gradient descent for logistic regression.

\(\frac{\mathrm{d} J}{\mathrm{d} u} = \frac{\mathrm{d} J}{\mathrm{d} v} \frac{\mathrm{d} v}{\mathrm{d} u} \), 

\(\frac{\mathrm{d} J}{\mathrm{d} b} = \frac{\mathrm{d} J}{\mathrm{d} u} \frac{\mathrm{d} u}{\mathrm{d} b} \), 

\(\frac{\mathrm{d} J}{\mathrm{d} a} = \frac{\mathrm{d} J}{\mathrm{d} u} \frac{\mathrm{d} u}{\mathrm{d} a} \)

 

The goal is to get you familiar with these ideas so that they’ll hopefully make a bit more sense when we talk about full-fledged neural networks.

Logistic Regression \(\left\{\begin{matrix}
Loss \ Function : & L(\hat y^{(i)}, y^{(i)}) = -y^{(i)}log\hat y^{(i)} - (1-y^{(i)})log(1-\hat y^{(i)}) \\
Cost \ Function : & J(w,b) = \frac {1}{m} \sum _{i}^{m} L(\hat y^{(i)}, y^{(i)})
\end{matrix}\right.\)

Gradient Descent : \(\left\{\begin{matrix}
w := w – \alpha \frac {\partial J(w,b)}{\partial w} \\
b := b – \alpha \frac {\partial J(w,b)}{\partial b}
\end{matrix}\right.\)

Calculus : \(\frac{\mathrm{d} L(a,y)}{\mathrm{d} a} = -y/a + (1-y)/(1-a)\)

\(\left\{\begin{matrix}
\frac{\mathrm{d} L(a,y)}{\mathrm{d} z} = \frac {\mathrm{d} L}{\mathrm{d} z} = \frac {\mathrm{d} L}{\mathrm{d} a} \frac {\mathrm{d} a}{\mathrm{d} z} \\
\frac {\mathrm{d} a}{\mathrm{d} z} = a \cdot (1-a) \\
\frac{\mathrm{d} L}{\mathrm{d} a} = -\frac{y}{a} + \frac{(1-y)}{(1-a)}
\end{matrix}\right.\)

 

\({\mathrm{d} z} = \frac{\mathrm{d} L(a,y)}{\mathrm{d} z} = (\frac {\mathrm{d} L}{\mathrm{d} a}) \cdot (\frac {\mathrm{d} a}{\mathrm{d} z} ) = (-\frac{y}{a} + \frac{(1-y)}{(1-a)}) \cdot a(1-a) = a - y\) \(\left\{\begin{matrix}
{\mathrm{d} w_1} = \frac {1}{m} \sum _i^m x_1^{(i)} (a^{(i)} - y^{(i)})\\
{\mathrm{d} w_2} = \frac {1}{m} \sum _i^m x_2^{(i)} (a^{(i)} - y^{(i)})\\
{\mathrm{d} b} = \frac {1}{m} \sum _i^m (a^{(i)} - y^{(i)})
\end{matrix}\right.\)

Gradient Descent on m Examples

\(J(w,b) = \frac {1}{m}\sum _{i=1}^m L (a^{(i)}, y^{(i)})\)

 

# assumes x (2 features by m examples), y, and arrays z, a, dz are set up
J = 0; dw1 = 0; dw2 = 0; db = 0
for i in range(m):
    z[i] = np.dot(w, x[:, i]) + b
    a[i] = sigmoid(z[i])
    J += -(y[i] * np.log(a[i]) + (1 - y[i]) * np.log(1 - a[i]))
    dz[i] = a[i] - y[i]
    dw1 += x[0, i] * dz[i]              # accumulate gradients (2 features)
    dw2 += x[1, i] * dz[i]
    db += dz[i]
J /= m; dw1 /= m; dw2 /= m; db /= m     # average over the training set
w[0] -= alpha * dw1                     # one gradient descent step
w[1] -= alpha * dw2
b -= alpha * db

Vectorization

Vectorization is basically the art of getting rid of explicit for loops in your code.

non-vectorized implementation :

z = 0
for i in range(n_x):      # explicit loop over the features (slow)
    z += w[i] * x[i]
z += b

vectorized implementation :

z=np.dot(w,x)+b

They’re sometimes called SIMD instructions. This stands for a single instruction multiple data.

The rule of thumb to remember: whenever possible, avoid using explicit for loops.

More Examples of Vectorization

It’s not always possible to never use a for-loop, but when you can use a built in function or find some other way to compute whatever you need, you’ll often go faster than if you have an explicit for-loop.

Vectorizing Logistic Regression

How you can vectorize the implementation of logistic regression

\(\left\{\begin{matrix}
z^{(i)} = w^Tx^{(i)} + b \\
a^{(i)} = \sigma (z^{(i)}) \\
\hat y^{(i)} = a^{(i)}
\end{matrix}\right.\)

 

Z=np.dot(w.T, X)+b

It turns out, you can also use vectorization very efficiently to compute the backward propagation, to compute the gradients.

Vectorizing Logistic Regression’s Gradient

How you can use vectorization to also perform the gradient computations for all m training samples.

\(\begin{matrix}
Z = w^TX + b = np.dot(w.T, X) + b\\
A = \sigma (Z)\\
{\mathrm{d} Z} = A - Y\\
{\mathrm{d} w} = \frac {1}{m} X \cdot {\mathrm{d} Z} ^T\\
{\mathrm{d} b} = \frac {1}{m} np.sum({\mathrm{d} Z})\\
w := w - \alpha \cdot {\mathrm{d} w}\\
b := b - \alpha \cdot {\mathrm{d} b}
\end{matrix} \)
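A minimal runnable sketch of one such iteration on synthetic data; all names here are illustrative:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m, n_x, alpha = 100, 2, 0.1
X = np.random.randn(n_x, m)           # columns are examples
Y = (np.random.rand(1, m) > 0.5) * 1.0
w, b = np.zeros((n_x, 1)), 0.0

Z = np.dot(w.T, X) + b                # forward pass over all m examples
A = sigmoid(Z)
dZ = A - Y                            # backward pass
dw = (1 / m) * np.dot(X, dZ.T)
db = (1 / m) * np.sum(dZ)
w = w - alpha * dw                    # one gradient descent step
b = b - alpha * db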

 

Broadcasting in Python

A note on python or numpy vectors

  • With broadcasting and this great amount of flexibility, it’s sometimes possible to introduce very subtle, strange-looking bugs.
  • When you’re coding neural networks, don’t use data structures whose shape is (5,) or (n,), i.e. rank-1 arrays.
  • Don’t hesitate to throw in assertion statements on array shapes whenever you feel like it.
  • Don’t use rank-1 arrays; you can reshape them into explicit column or row vectors, as sketched below.
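The lecture's rank-1 array example, sketched:

import numpy as np

a = np.random.randn(5)       # rank-1 array: shape (5,), neither row nor column vector
print(a.shape)               # (5,)
a = a.reshape(5, 1)          # fix: reshape into an explicit column vector
assert a.shape == (5, 1)     # throw in assertions like this freely
b = np.random.randn(5, 1)    # better: create column vectors directly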

Quick tour of Jupyter/iPython Notebooks

It’s simple enough that nothing needs to be said.

Explanation of logistic regression cost function

A quick justification for why we like to use that cost function for logistic regression.

\(\begin{matrix}
\hat y = \sigma (w^Tx + b), \ where \ \sigma (z) = \frac {1}{1+e^{-z}}\\
\hat y = p(y = 1 | x)
\end{matrix}\)

 

\(\left.\begin{matrix}
If \ y = 1 \ : & p(y|x) = \hat y\\
If \ y = 0 \ : & p(y|x) = 1 - \hat y
\end{matrix}\right\}
p(y|x) = \hat y^{y}(1 - \hat y)^{(1-y)}\)

 

Taking logs: \(log \ p(y|x) = ylog\hat y + (1-y)log(1-\hat y) = -L(\hat y, y)\), so maximizing the probability corresponds to minimizing the loss.

 

In statistics, there’s a principle called maximum likelihood estimation, which just means choosing the parameters that maximize the probability of the observed labels:

\(P(labels\ in\ training\ set) = \prod _{i=1}^{m} P(y^{(i)} | x^{(i)})\)

 

\(logP(labels\ in\ training\ set) = log\prod _{i=1}^{m} P(y^{(i)} | x^{(i)}) = \sum _{i=1}^{m}log P(y^{(i)} | x^{(i)}) = \sum _{i=1}^{m} -L(\hat y^{(i)}, y^{(i)})\)

 

\(J(w,b) = \frac {1}{m} \sum _{i=1}^{m} L(\hat y^{(i)}, y^{(i)})\) (minimizing this cost is equivalent to maximizing the log likelihood; the minus sign and the \(\frac{1}{m}\) scaling don’t change the optimal parameters)

 

1 Introduction to Deep Learning

Welcome

Deep learning has already transformed traditional internet businesses.

What is a Neural Network

You’ll probably find neural networks to be most useful and most powerful in supervised learning settings, meaning you’re trying to take an input x and map it to some output y.

Supervised Learning with Neural Networks

It turns out that so far, almost all the economic value created by neural networks has been through one type of machine learning, called supervised learning.

Possibly the single most lucrative application of deep learning today is online advertising.

  • Computer Vision
  • Speech Recognition
  • Autonomous Driving

It turns out that slightly different types of neural networks are useful for different applications.

CNN (Convolutional Neural Network)

RNN (Recurrent Neural Network)

  • Structured Data means basically databases of data.
  • In contrast, unstructured data refers to things like audio, raw audio, or images where you might want to recognize what’s in the image or text. Historically, it has been much harder for computers to make sense of unstructured data compared to structured data.

It turns out that a lot of the short-term economic value neural networks are creating has also been on structured data: much better advertising systems, much better product recommendations, and just a much better ability to process the giant databases that many companies have, to make accurate predictions from them.

Why is Deep Learning taking off?

Go over some of the main drivers behind the rise of deep learning, because this will help you better spot the best opportunities within your own organization to apply deep learning.

More and more data has been collected; over the last 20 years, for a lot of applications, we’ve accumulated much more data than traditional learning algorithms could effectively take advantage of.

To hit a very high level of performance, you need to:

  1. Train a big enough neural network
  2. Throw more data

If you don’t have a lot of training data, performance is often determined by your skill at hand-engineering features.

In the small-data regime, performance depends much more on your skill at hand-engineering features and other details of the algorithms; it’s only in the big-data regime (very large training sets, i.e. very large m) that we consistently see large neural networks dominating the other approaches.

  • Data
  • Computation
  • Algorithmic Innovation

One of the huge breakthroughs in neural networks has been switching from a sigmoid function which looks like this to a ReLU function.

  • Ultimately, the impact of this algorithmic innovation was that it really helped computation. There are quite a lot of examples like this where changing the algorithm allows code to run much faster, and this allows us to train bigger neural networks within a reasonable amount of time, even with a large network and a lot of data.
  • The process of training a network is very iterative: you have an idea for a neural network architecture, you implement the idea in code, you run an experiment that tells you how well the network does, and then you go back and change the details, going around this circle over and over. When your network takes a long time to train, it takes a long time to go around this cycle, so faster training makes a huge difference in your productivity at building effective neural networks.

Faster Computation → Iterate Much Faster → Improve Ideas Much Faster

About this Course

Just use the multiple choice questions to check your understanding. And don’t worry: you can try again and again until you get them all right.

Course Resources

If you have any questions or you want to discuss anything with the classmates or … the best place to do that is the discussion forum.

18 Application Example: Photo OCR

Problem Description and Pipeline

  1. Text detection
  2. Character segmentation
  3. Character classification

Sliding Windows

Go out and collect large training sets of positive and negative examples.

Take that green rectangle and we slide it over a bit and then run that new image patch through our classifier to decide if there’s a pedestrian there.

Getting Lots of Data and Artificial Data

  1. artificial data synthesis
  2. just collect the data and you label it yourself
  3. crowd sourcing

Ceiling Analysis: What Part of the Pipeline to Work on Next

Where should you allocate resources?
Which of these modules is most worth your effort to improve?

  1. choose one module
  2. manually provide it perfect, ground-truth inputs (e.g. the correct text detection outputs)
  3. then use the same evaluation metric as before to measure the overall accuracy of the entire system
  4. go back to step 1 and choose another module

17 Large Scale Machine Learning

Learning With Large Datasets

Draw a learning curve and determine if more data needs to be collected.

Stochastic Gradient Descent

Cost Function in SGD : \(cost(\theta, (x^{(i)}, y^{(i)})) = \frac {1}{2}(h_{\theta}(x^{(i)})-y^{(i)})^2\)

  1. randomly shuffle the data set
  2. take a little gradient descent step using just one single training example
  3. each step may head in a bad direction, but the parameters generally move in the direction of the global minimum, though not always
  4. it ends up wandering around continuously in some region close to the global minimum

Mini-Batch Gradient Descent

In batch gradient descent we use all m examples in each iteration,
whereas in stochastic gradient descent we use a single example in each iteration.
Mini-batch gradient descent does something in between.

Stochastic Gradient Descent Convergence

\(\alpha = \frac {const1}{iterationNumber + const2}\)

We can compute the cost function on the last 1000 examples or so. And we can use this method both to make sure the stochastic gradient descent is okay and is converging or to use it to tune the learning rate alpha.

Online Learning

The online learning setting allows us to model problems where we have a continuous flood or a continuous stream of data coming in and we would like an algorithm to learn from that.

We learn using that example like so and then we throw that example away.

If you really have a continuous stream of data, then an online learning algorithm can be very effective.

If you have a changing pool of users, or if the things you’re trying to predict are slowly changing like your user taste is slowly changing, the online learning algorithm can slowly adapt your learned hypothesis to whatever the latest sets of user behaviors are like as well.

Map Reduce and Data Parallelism

In the MapReduce idea, one way to do, is split this training set in to different subsets and use many different machines.

  • multi-core machine
  • multiple machines
  • numerical linear algebra libraries
  • like Hadoop

16 Recommender Systems

Problem Formulation

  1. an important application of machine learning
  2. this idea of learning the features

Content Based Recommendations

user j ‘s parameter vector : \(\theta^{(j)}\)

movie i’s feature vector : \(x^{(i)}\)

Predicting rating: \((\theta^{(j)})^Tx^{(i)}\)

user \(j\) ‘s cost function : \(\underset {\theta ^{(j)}}{min} \frac {1}{2} \sum _{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+ \frac {\lambda}{2} \sum_{k=1}^{n} (\theta_k^{(j)})^2\)

all user’s cost function : \(\underset {\theta ^{(j)}, \cdots , \theta ^{(n_u)}}{min} \frac {1}{2} \sum _{j=1}^{n_u} \sum _{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+ \frac {\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2\)

Gradient descent : 

\(\left\{\begin{matrix}
\theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_{k}^{(i)} &
\ (for \ k \ = \ 0) \\
\theta_k^{(j)} := \theta_k^{(j)} - \alpha (\sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_{k}^{(i)} + \lambda \theta_{k}^{(j)}) & \ (for \ k \ \neq \ 0)
\end{matrix}\right.\)

 

Collaborative Filtering

Even when you have neither the users’ parameter vectors nor the movies’ features, you can learn both simultaneously.

The term collaborative filtering refers to the observation that when you run this algorithm with a large set of users, all of these users are effectively collaborating to get better movie ratings for everyone. With every user rating some subset of the movies, every user is helping the algorithm learn better features, and those features can then be used by the system to make better movie predictions for everyone else. Every user is helping the system learn better features for the common good; this is collaborative filtering.

Collaborative Filtering Algorithm

  1. Initialize x and theta to small random values
  2. minimize the cost function using gradient descent or one of the advanced optimization algorithms
  3. predict

Vectorization: Low Rank Matrix Factorization

A user has recently been looking at one product. Are there other related products that you could recommend to this user?

If you can find a different movie \(j\) such that the distance between \(x^{(i)}\) and \(x^{(j)}\) is small, then this is a pretty strong indication that movies j and i are somehow similar.

Use the learned features to find movies or products that are related to each other.

Implementational Detail: Mean Normalization

If a user has not evaluated any movies, which movie should we recommend?

The idea of mean normalization will let us fix this problem.

15 Anomaly Detection

Problem Motivation

It’s mainly an unsupervised problem, though some aspects of it are similar to supervised learning.

some examples : 

  • detect strange behavior or fraudulent behavior
  • manufacturing
  • monitoring computers in a data center

Gaussian Distribution

Gaussian distribution

\(x\sim N(\mu , \sigma ^2)\)

 

Gaussian probability density

\(p(x; \mu , \sigma ^2) = \frac {1}{\sqrt{2\pi }\sigma } exp(-\frac {(x - \mu)^2}{2 \sigma ^2})\)

 

The location of the center of this bell-shaped curve

\(\mu = \frac {1}{m} \sum _{i=1}^{m}x^{(i)}\)

 

The width of this bell-shaped curve

\(\sigma ^ 2 = \frac {1}{m} \sum _{i=1}^{m} (x^{(i)} - \mu) ^2\)

 

Notice: this formula uses \(m\) instead of the \(m - 1\) used in statistics (the unbiased estimator); in practice, for large m, it makes little difference.

Algorithm

Address anomaly detection :

\(\mu _j = \frac {1}{m} \sum _{i=1}^{m}x^{(i)} _j\)

 

\(\sigma ^ 2 _j = \frac {1}{m} \sum _{i=1}^{m} (x^{(i)}_j - \mu _j) ^2\)

 

\(p(x) = \prod _{j=1}^{n}p(x_j; \mu _j, \sigma ^2_j) = \prod _{j=1}^{n}\frac {1}{\sqrt{2\pi } \sigma _j} exp(-\frac {(x_j - \mu_j)^2}{2 \sigma ^2_j})\)

 

If \(p(x) < \varepsilon \), flag it as an anomaly (a full sketch follows).
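A minimal sketch of the whole algorithm on a feature matrix X of shape (m, n); the data and the threshold value here are illustrative:

import numpy as np

def fit_gaussian(X):
    mu = X.mean(axis=0)                 # per-feature means
    sigma2 = X.var(axis=0)              # per-feature variances (the 1/m version)
    return mu, sigma2

def p(X, mu, sigma2):
    # product over features of univariate Gaussian densities
    d = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return d.prod(axis=1)

X = np.random.randn(1000, 2)            # synthetic data
mu, sigma2 = fit_gaussian(X)
eps = 1e-4                              # would be chosen on a cross-validation set
anomalies = p(X, mu, sigma2) < eps      # flag low-probability examples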

Developing and Evaluating an Anomaly Detection System

How to develop and evaluate an algorithm ?

  1. Fit the model \(p(x)\) on the training set
  2. On the cross-validation set, try different values of \(\varepsilon\) and compute the F1 score for each
  3. After choosing \(\varepsilon\), do the final evaluation of the algorithm on the test set

Anomaly Detection vs. Supervised Learning

Anomaly Detection:

  • very small number of positive examples, and a relatively large number of negative examples
  • many different types of anomalies; future anomalies may look nothing like the ones you’ve seen so far
  • examples: fraud detection, manufacturing, data center monitoring

Supervised Learning:

  • a reasonably large number of both positive and negative examples
  • enough positive examples for an algorithm to get a sense of what the positive examples are like
  • examples: spam email, weather prediction, classifying cancers

Choosing What Features to Use

  1. model the features using a Gaussian distribution (play with different transformations of the data in order to make it look more Gaussian)
  2. do an error analysis procedure to come up with new features for the anomaly detection algorithm
  3. create new features by combining existing features

Multivariate Gaussian Distribution

For comparison, the original model treats the features as independent: \(p(x) = \prod _{j=1}^{n}p(x_j; \mu_j, \sigma ^2_j) = \prod _{j=1}^{n}\frac {1}{\sqrt{2\pi } \sigma _j} exp(-\frac {(x_j - \mu_j)^2}{2 \sigma ^2_j})\)

 

\(\mu = \frac {1}{m} \sum _{i=1}^{m}x^{(i)}\)

 

\(\Sigma = \frac {1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu )(x^{(i)} - \mu )^T = \frac {1}{m} (X - \mu)^T(X - \mu)\)

 

\(p(x) = \frac {1}{(2 \pi)^{\frac {n}{2}}\left | \Sigma \right | ^{\frac {1}{2}}} exp(-\frac {1}{2} (x-\mu)^T\Sigma ^{-1}(x-\mu))\)

 

Original (univariate) Gaussian model:

  • you must manually create extra features to capture anomalies involving correlated features
  • computationally cheaper
  • OK even if the training set is small

Multivariate Gaussian model:

  • automatically captures correlations between features
  • computationally more expensive
  • must have m > 10n (roughly), or else \(\Sigma\) is non-invertible