2 Optimization algorithms

Mini-batch gradient descent

Mini-batch gradient descent, in contrast, refers to the algorithm in which you process a single mini-batch [latex]X^{\{t\}}[/latex], [latex]Y^{\{t\}}[/latex] at a time, rather than processing your entire training set X, Y at once. Mini-batch gradient descent runs much faster than batch gradient descent, and it is what pretty much everyone in deep learning uses when training on a large data set.
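As a rough illustration of the splitting step, the sketch below shuffles a toy training set and cuts it into mini-batches. The function name `make_mini_batches`, the list-of-examples layout, and the fixed seed are assumptions made for this example, not something the notes prescribe.

```python
import random

def make_mini_batches(X, Y, batch_size, seed=0):
    """Shuffle (X, Y) together and split them into mini-batches.

    X and Y are equal-length lists with one entry per training example.
    Returns a list of (X_t, Y_t) pairs; the last pair may be smaller.
    """
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of examples")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # one permutation shared by X and Y
    batches = []
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        batches.append(([X[i] for i in chunk], [Y[i] for i in chunk]))
    return batches

# Toy data (hypothetical): 10 examples with 2 features each.
X = [[float(i), float(i) ** 2] for i in range(10)]
Y = [i % 2 for i in range(10)]
mini_batches = make_mini_batches(X, Y, batch_size=4)
batch_sizes = [len(X_t) for X_t, _ in mini_batches]  # 4, 4 and a final batch of 2
```

One pass over `mini_batches` is one epoch; a training loop would take one gradient step per `(X_t, Y_t)` pair. Passing `batch_size=len(X)` returns a single batch, and `batch_size=1` returns one example per batch.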

Understanding mini-batch gradient descent

If the mini-batch size equals m, the number of training examples, you just end up with batch gradient descent.
If the mini-batch size is 1, you get an algorithm called stochastic gradient descent.

Disadvantages of stochastic gradient descent:

You lose almost all of the speed-up from vectorization, because you are processing a single training example at a time.
It does not always converge exactly; it may keep oscillating in a very small region around the minimum. If that is an issue, you can always reduce the learning rate slowly.

Guidelines:

If you have a small training set (say, around 2,000 examples), just use batch gradient descent.
If you have a bigger training set, typical mini-batch sizes range from 64 up to about 512.
Because of the way computer memory is laid out and accessed, your code sometimes runs faster if the mini-batch size is a power of 2.

Exponentially weighted averages

Exponentially weighted averages are also called exponentially weighted moving averages in statistics.

Understanding exponentially weighted averages

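The key equation for implementing exponentially weighted averages is [latex]v_t = \beta v_{t-1} + (1- \beta) \theta _t[/latex]. It takes very little memory, because you only keep the single running value [latex]v[/latex] and overwrite it with each new observation.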
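A minimal sketch of that update in plain Python follows. The starting value [latex]v_0 = 0[/latex], the default [latex]\beta = 0.9[/latex], and the sample readings are assumptions chosen for illustration.

```python
def exponentially_weighted_average(thetas, beta=0.9):
    """Return the running averages v_1, v_2, ... for the observations in thetas."""
    v = 0.0  # assumed starting value v_0 = 0
    averages = []
    for theta in thetas:
        v = beta * v + (1.0 - beta) * theta  # only one number is kept in memory
        averages.append(v)
    return averages

# Hypothetical readings: a constant signal of 10.0.
readings = [10.0, 10.0, 10.0, 10.0]
ewa = exponentially_weighted_average(readings, beta=0.9)
```

With a constant input of 10.0 the early averages come out far below 10.0 (the first one is 1.0), which is the warm-up bias that bias correction addresses.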

Bias correction in exponentially weighted averages

Bias correction can make your computation of these averages more accurate. If you are concerned about the bias during the initial phase, while your exponentially weighted moving average is still warming up, bias correction can help you get a better estimate early on.
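The notes do not spell out the correction itself. The commonly used form divides the running average by [latex]1 - \beta^t[/latex], and the sketch below assumes that form together with a starting value of zero; the function name and sample data are made up for the example.

```python
def bias_corrected_average(thetas, beta=0.9):
    """Exponentially weighted average with the usual 1 - beta**t correction."""
    v = 0.0  # assumed starting value v_0 = 0
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1.0 - beta) * theta
        corrected.append(v / (1.0 - beta ** t))  # undo the pull toward zero
    return corrected

# Hypothetical readings: a constant signal of 10.0.
corrected = bias_corrected_average([10.0, 10.0, 10.0, 10.0], beta=0.9)
```

For a constant input the corrected estimate matches the input from the very first step, while the uncorrected average would still be warming up. As t grows, [latex]1 - \beta^t[/latex] approaches 1 and the correction fades away.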

Gradient descent with Momentum

Gradient descent with momentum almost always works faster than the standard gradient descent algorithm. In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that average, instead of the raw gradient, to update your weights.
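The idea can be sketched on a toy problem. Everything specific here is an assumption for illustration: the quadratic cost, the starting point, the hyperparameters alpha = 0.1 and beta = 0.9, and the choice to weight the new gradient by (1 - beta), which mirrors the exponentially weighted average equation.

```python
def loss(w, b):
    # Toy cost (hypothetical): much steeper along b than along w.
    return 0.5 * (w ** 2 + 25.0 * b ** 2)

def gradients(w, b):
    return [w, 25.0 * b]

def momentum_step(params, velocity, grads, alpha=0.1, beta=0.9):
    """One update: average the gradients, then step along the average."""
    new_velocity = [beta * v + (1.0 - beta) * g for v, g in zip(velocity, grads)]
    new_params = [p - alpha * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

params, velocity = [10.0, 1.0], [0.0, 0.0]
initial_loss = loss(*params)
first_params, _ = momentum_step(params, velocity, gradients(*params))
for _ in range(300):
    params, velocity = momentum_step(params, velocity, gradients(*params))
final_loss = loss(*params)
```

Some libraries accumulate the velocity without the (1 - beta) factor, which amounts to rescaling the learning rate, so hyperparameters are not directly interchangeable between the two conventions.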

RMSprop

RMSprop (Root Mean Square prop) can also speed up gradient descent. Suppose gradient descent oscillates strongly in the vertical direction (the b direction) while making slow progress in the horizontal direction. You then want to slow down learning in the vertical direction and speed up learning, or at least not slow it down, in the horizontal direction. This is what the RMSprop algorithm accomplishes.
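A minimal sketch of the usual RMSprop update follows: keep an exponentially weighted average of the squared gradients and divide each gradient by its square root. The toy cost, the starting point, and the values alpha = 0.1, beta2 = 0.9 and eps = 1e-8 are assumptions for this example.

```python
import math

def loss(w, b):
    # Toy cost (hypothetical): much steeper along b than along w.
    return 0.5 * (w ** 2 + 25.0 * b ** 2)

def gradients(w, b):
    return [w, 25.0 * b]

def rmsprop_step(params, s, grads, alpha=0.1, beta2=0.9, eps=1e-8):
    """One update: scale each gradient by the root of its running mean square."""
    new_s = [beta2 * si + (1.0 - beta2) * g * g for si, g in zip(s, grads)]
    new_params = [p - alpha * g / (math.sqrt(si) + eps)  # eps avoids division by zero
                  for p, g, si in zip(params, grads, new_s)]
    return new_params, new_s

params, s = [10.0, 1.0], [0.0, 0.0]
initial_loss = loss(*params)
first_params, _ = rmsprop_step(params, s, gradients(*params))
for _ in range(200):
    params, s = rmsprop_step(params, s, gradients(*params))
final_loss = loss(*params)
```

On the first step both coordinates move by the same amount even though the gradient along b is 2.5 times larger than the gradient along w. The steep direction is damped relative to plain gradient descent and the shallow one is not.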

Adam optimization algorithm

RMSprop and the Adam optimization algorithm are among those rare algorithms that have really stood up and have been shown to work well across a wide range of deep learning architectures. The Adam optimization algorithm basically takes momentum and RMSprop and puts them together.
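A sketch of how the two pieces are usually combined: a momentum-style average of the gradients, an RMSprop-style average of the squared gradients, and bias correction on both. The defaults beta1 = 0.9, beta2 = 0.999 and eps = 1e-8 are the commonly quoted ones. The toy cost, the starting point, and alpha = 0.1 are assumptions for this example.

```python
import math

def loss(w, b):
    # Toy cost (hypothetical): much steeper along b than along w.
    return 0.5 * (w ** 2 + 25.0 * b ** 2)

def gradients(w, b):
    return [w, 25.0 * b]

def adam_step(params, m, s, grads, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    new_m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grads)]      # momentum part
    new_s = [beta2 * si + (1.0 - beta2) * g * g for si, g in zip(s, grads)]  # RMSprop part
    m_hat = [mi / (1.0 - beta1 ** t) for mi in new_m]  # bias-corrected first moment
    s_hat = [si / (1.0 - beta2 ** t) for si in new_s]  # bias-corrected second moment
    new_params = [p - alpha * mh / (math.sqrt(sh) + eps)
                  for p, mh, sh in zip(params, m_hat, s_hat)]
    return new_params, new_m, new_s

params, m, s = [10.0, 1.0], [0.0, 0.0], [0.0, 0.0]
initial_loss = loss(*params)
first_params, _, _ = adam_step(params, m, s, gradients(*params), t=1)
for t in range(1, 501):
    params, m, s = adam_step(params, m, s, gradients(*params), t)
final_loss = loss(*params)
```

Because of the bias correction, the very first step moves each parameter by almost exactly alpha in the direction opposite to its gradient, whatever the gradient's magnitude.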

Learning rate decay

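One of the things that might help speed up your learning algorithm is to slowly reduce your learning rate over time. Some formulas, where [latex]\alpha_0[/latex] is the initial learning rate, [latex]k[/latex] is a constant, and [latex]t[/latex] is the mini-batch number: [latex]\begin{matrix} \alpha = \frac{1}{1 + \text{decay-rate} \times \text{epoch-num}} \alpha_0 \\ \alpha = \frac{k}{\sqrt{\text{epoch-num}}} \alpha_0 \\ \alpha = \frac{k}{\sqrt{t}} \alpha_0 \end{matrix}[/latex]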
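These schedules are simple enough to write as small functions. In the sketch below the function names and the sample values (initial rate 0.2, decay rate 1.0, k = 1.0) are assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def inverse_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1.0 + decay_rate * epoch_num)

def sqrt_decay(alpha0, k, step):
    """alpha = k / sqrt(step) * alpha0, where step is an epoch or mini-batch number >= 1."""
    return k / math.sqrt(step) * alpha0

# Learning rate over the first four epochs for alpha0 = 0.2 and decay_rate = 1.0.
schedule = [inverse_decay(0.2, 1.0, epoch) for epoch in range(1, 5)]
alpha_at_step_4 = sqrt_decay(0.2, 1.0, 4)
```

With these values the first schedule halves the initial rate after one epoch and keeps shrinking it more slowly afterwards, so early steps stay large while later steps become fine-grained.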

The problem of local optima

Most points of zero gradient in a cost function are not local optima; instead, they are saddle points. In very high-dimensional spaces you are actually much more likely to run into a saddle point than a local optimum.
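A small sketch can make the distinction concrete. A zero-gradient point is a local minimum only if the cost curves upward in every direction; if the curvature signs are mixed, it is a saddle point. The function names are hypothetical, and the coin-flip model at the end is a deliberately crude heuristic, not a property of real cost functions.

```python
def classify_critical_point(curvatures):
    """Classify a zero-gradient point from the signs of its Hessian eigenvalues."""
    if all(c > 0 for c in curvatures):
        return "local minimum"
    if all(c < 0 for c in curvatures):
        return "local maximum"
    if any(c > 0 for c in curvatures) and any(c < 0 for c in curvatures):
        return "saddle point"
    return "degenerate"  # some curvature is exactly zero and none disagree in sign

# f(x, y) = x**2 - y**2 has zero gradient at the origin, and its Hessian has
# eigenvalues 2 and -2: curving up along x, down along y.
kind = classify_critical_point([2.0, -2.0])

def chance_all_curve_up(n):
    """Heuristic assumption: each of n curvature signs is an independent fair coin flip."""
    return 0.5 ** n

chance_10_dims = chance_all_curve_up(10)
```

Under that heuristic the chance that all directions curve upward halves with every added dimension, which gives an intuition for why saddle points dominate when there are many parameters.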