3 Object detection

Object localization

Object detection is one of the areas of computer vision that’s just exploding.

Object localization means that not only do you have to label this as, say, a car, but the algorithm is also responsible for putting a bounding box around it; that's called the classification with localization problem.

Defining the target label y : 

  1. pedestrian
  2. car
  3. motorcycle
  4. background

Need to output bx, by, bh, bw, class label (1-4) : \(y=\begin{bmatrix}
p_c \\
b_x \\
b_y \\
b_h \\
b_w \\
c_1 \\
c_2 \\
c_3
\end{bmatrix}\)
If using squared error, then the loss function is: \(L(\hat y, y) = (\hat y_1 - y_1)^2 + (\hat y_2 - y_2)^2 + \cdots + (\hat y_8 - y_8)^2\)

In practice you could use a log-likelihood loss for \(c_1\), \(c_2\), \(c_3\) with a softmax over those elements; you usually use squared error or something like squared error for the bounding box coordinates; and for \(p_c\) you could use something like the logistic regression loss, although even squared error would probably work okay.
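As a concrete illustration, here is a minimal numpy sketch of how such a target vector and a squared-error loss might be assembled. The helper names and the handling of the "no object" case (where only \(p_c\) contributes) are my assumptions, not from the course:

```python
import numpy as np

# Hypothetical 8-dim target for one example:
# [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]
def make_target(object_present, box=None, class_id=None):
    """Build the 8-dimensional target vector y described above (a sketch)."""
    y = np.zeros(8)
    if object_present:
        bx, by, bh, bw = box
        y[0] = 1.0                      # p_c: an object is present
        y[1:5] = [bx, by, bh, bw]       # bounding box
        y[4 + class_id] = 1.0           # one-hot class, class_id in 1..3
    # if no object, only p_c = 0 matters; the rest are "don't cares"
    return y

def squared_error_loss(y_hat, y):
    if y[0] == 1:                       # object present: compare all 8 terms
        return float(np.sum((y_hat - y) ** 2))
    return float((y_hat[0] - y[0]) ** 2)  # otherwise only p_c matters

y = make_target(True, box=(0.5, 0.5, 0.3, 0.4), class_id=2)
```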

Landmark detection

The neural network just outputs the x and y coordinates of important points in the image, sometimes called landmarks, that you want the neural network to recognize.

The labels have to be consistent across different images.

Object detection

Sliding windows detection algorithm : 

  • using a pretty large stride in this example just to make the animation go faster
  • repeat it, but now use a larger window
  • then slide the window over again using some stride and so on, and you run that throughout your entire image until you get to the end

There’s a huge disadvantage of sliding windows detection which is the computational cost : 

  • if you use a very coarse stride, a very big stride, a very big step size, then that will reduce the number of windows you need to pass through the ConvNet, but that coarser granularity may hurt performance
  • whereas if you use a very fine granularity or a very small stride, then the huge number of all these little regions you’re passing through the ConvNet means that there’s a very high computational cost

So before the rise of neural networks, people used to use much simpler classifiers

Convolutional implementation of sliding windows

Turn fully connected layers in your neural network into convolutional layers

It turns out a lot of the computation done by these four passes of the ConvNet is highly duplicated.

Sliding windows convolutionally makes the whole thing much more efficient, but it still has one weakness which is the position of the bounding boxes is not going to be too accurate.

Bounding box predictions

A good way to output more accurate bounding boxes is the YOLO algorithm; YOLO stands for "you only look once".

The basic idea is that you take the image classification and localization algorithm and apply it on a grid: the YOLO algorithm takes the midpoint of each object and assigns the object to the grid cell containing that midpoint.

The advantage of this algorithm is that the neural network outputs precise bounding boxes; so long as you don't have more than one object in each grid cell, this algorithm should work okay.

The way you assign an object to a grid cell is you look at the midpoint of the object and then assign the object to whichever grid cell contains that midpoint.
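The midpoint-to-cell assignment is just arithmetic. A tiny sketch (coordinates assumed normalized to [0, 1), grid size S is a parameter I chose for illustration):

```python
# Assign an object to the grid cell containing its midpoint.
# Midpoint coordinates are assumed normalized to [0, 1); the grid is S x S.
def assign_to_grid_cell(mid_x, mid_y, S=3):
    col = int(mid_x * S)   # which column of the grid the midpoint falls in
    row = int(mid_y * S)   # which row
    return row, col

# A midpoint at (0.4, 0.8) on a 3 x 3 grid falls in row 2, column 1.
cell = assign_to_grid_cell(0.4, 0.8, S=3)
```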

This is a pretty efficient algorithm, and in fact one nice thing about YOLO, which accounts for its popularity, is that because this is a convolutional implementation it runs very fast, so it works even for real-time object detection.

The YOLO paper is one of the harder papers to read.

It’s not that uncommon sadly for even you know senior researchers to read research papers and have a hard time figuring out the details and have to look at the open source code or contact the authors or something else to figure out the details of these algorithms.

Intersection over union

Intersection over union is used both for evaluating your object detection algorithm and as a building block inside other components (such as non-max suppression).

So, what the intersection over union function does or IoU does is it computes the intersection over union of these two bounding boxes.

So, the union of the two bounding boxes is the area contained in either bounding box, whereas the intersection is the smaller overlapping region. What intersection over union does is compute the size of the intersection divided by the size of the union.

And by convention, a lot of computer vision tasks will judge that your answer is correct if the IoU is greater than or equal to 0.5 (just a human-chosen convention; there's no particularly deep theoretical reason for it).
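IoU is short enough to write out directly. A sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (other conventions, like midpoint plus width and height, exist):

```python
# Boxes as (x1, y1, x2, y2) corners; an assumed convention for illustration.
def iou(box_a, box_b):
    # corners of the intersection rectangle
    xi1 = max(box_a[0], box_b[0])
    yi1 = max(box_a[1], box_b[1])
    xi2 = min(box_a[2], box_b[2])
    yi2 = min(box_a[3], box_b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```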

Non-max suppression

One of the problems of object detection as you've learned about it so far is that your algorithm may find multiple detections of the same object: rather than detecting an object just once, it might detect it multiple times. Non-max suppression is a way for you to make sure that your algorithm detects each object only once.

  • Concretely, it first looks at the probabilities associated with each of the detections, the p_c values, and takes the largest one
    • and says that's my most confident detection
    • so let's highlight it and output it
    • and all the remaining ones with a high overlap, a high IoU, with the one you've just output get suppressed
  • then repeat: find the remaining detection with the highest probability, output it, and suppress its high-overlap neighbors, until none are left

Non-max means that you output your maximal-probability classifications but suppress the close-by ones that are non-maximal; hence the name non-max suppression.
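The steps above can be sketched as a greedy loop. This is a minimal self-contained version (the detection format and threshold default are my assumptions):

```python
def nms(detections, iou_threshold=0.5):
    """detections: list of (p_c, (x1, y1, x2, y2)). Greedy non-max suppression."""
    def iou(a, b):
        xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
        xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)           # most confident remaining detection
        kept.append(best)
        # suppress detections that overlap the kept box too much
        remaining = [d for d in remaining
                     if iou(best[1], d[1]) < iou_threshold]
    return kept

dets = [(0.9, (0, 0, 2, 2)),
        (0.8, (0.1, 0.1, 2.1, 2.1)),   # near-duplicate of the first box
        (0.7, (5, 5, 7, 7))]           # a separate object
kept = nms(dets)
```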

Anchor Boxes

One of the problems with object detection as you've seen it so far is that each grid cell can detect only one object. What if a grid cell wants to detect multiple objects? Here's what you can do: use the idea of anchor boxes.


The idea of anchor boxes is that you predefine two different shapes, called anchor boxes (or anchor box shapes), and you are now able to associate two predictions with the two anchor boxes. In general you might use more anchor boxes, maybe five or even more.

Anchor box algorithm :

  • Previously :

Each object in the training image is assigned to the grid cell that contains that object's midpoint.

  • With two anchor boxes :

Each object in the training image is assigned to the grid cell that contains the object's midpoint, and to the anchor box for that grid cell with the highest IoU.

Now some additional details: what if you have two anchor boxes but three objects in the same grid cell? That's one case this algorithm doesn't handle well.

What anchor boxes give you is the ability for your learning algorithm to specialize better; in particular, if your data set has some tall, skinny objects like pedestrians and some wide objects like cars, then this allows your learning algorithm to specialize.

How to choose the anchor boxes :

  • People used to just choose them by hand: you choose maybe five or ten anchor box shapes that span a variety of shapes and seem to cover the types of objects you want to detect.
  • One of the later YOLO research papers uses a k-means algorithm to group together the types of object shapes you tend to get, and then uses that to select a set of anchor boxes that are most stereotypically representative of the (maybe dozens of) object classes you're trying to detect. That's a more advanced way to automatically choose the anchor boxes.

Putting it together: YOLO algorithm


YOLO is one of the most effective object detection algorithms, and it encompasses many of the best ideas across the entire computer vision literature that relate to object detection.

Region proposals (Optional)

You can run this algorithm convolutionally, but one downside is that it still classifies a lot of regions where there's clearly no object.

Faster algorithms : 

  • R-CNN : Propose regions. Classify proposed regions one at a time. Output label + bounding box.
  • Fast R-CNN : Propose regions. Use convolution implementation of sliding windows to classify all the proposed regions.
  • Faster R-CNN : Use convolutional network to propose regions.

Although, even for the faster R-CNN algorithm, most implementations are usually still quite a bit slower than the YOLO algorithm.

The idea of region proposals has been quite influential in computer vision.

2 Deep convolutional models : case studies

Why look at case studies?

It turns out a lot of the past few years of computer vision research has been on how to put together these basic building blocks to form effective convolutional neural networks. 

And one of the best ways for you to gain intuition yourself, is to see some of these examples.

After the next few chapters, you’ll be able to read some of the research papers from the field of computer vision.

Classic networks

  • LeNet-5

And back then, when this paper was written, people used average pooling much more. If you're building a modern variant, you'd probably use max pooling instead.

It turns out that if you read the original paper, back then people used sigmoid and tanh non-linearities; people weren't using ReLU non-linearities back then.

But back then, computers were much slower. And so, to save on computation as well as on parameters, the original LeNet-5 had some crazy complicated scheme where different filters looked at different channels of the input block. The paper talks about those details, but a more modern implementation wouldn't have that type of complexity these days.

  • AlexNet

So, this neural network actually had a lot of similarities to LeNet, but it was much bigger.

They took pretty similar basic building blocks, with a lot more hidden units, and trained on a lot more data (the ImageNet data set). Another aspect of this architecture that made it much better than LeNet was using the ReLU activation function.

One is that when this paper was written, GPUs were still a little bit slower. So, it had a complicated way of training on two GPUs.

The original AlexNet architecture also had another type of layer called local response normalization. The basic idea of local response normalization is: if you look at one of these volumes, say a 13 by 13 by 256 one, then for each position in the 13 by 13 image you look at all 256 numbers and normalize them. The motivation was that for each position, maybe you don't want too many neurons with a very high activation. But subsequently, many researchers have found that this doesn't help that much.

It was really this paper that convinced a lot of the computer vision community to take a serious look at deep learning, and to convince them that deep learning really works in computer vision, and then it grew on to have a huge impact, not just in computer vision but beyond computer vision as well.

  • VGG-16

Instead of having so many hyper parameters, let’s use a much simpler network where you focus on just having conv layers that are just three by three filters with stride one and always use the same padding, and make all your max pooling layers two by two with a stride of two. And so, one very nice thing about the VGG network was, it really simplified these neural network architectures.

But VGG-16 is a relatively deep network.

The 16 in the name VGG-16 refers to the fact that this has 16 layers that have weights. And this is a pretty large network: it has a total of about 138 million parameters.

And that's pretty large even by modern standards. But the simplicity of the VGG-16 architecture made it quite appealing. You can tell the architecture is really quite uniform: a few conv layers followed by a pooling layer, which reduces the height and width, then a few more of the same. And if you look at the number of filters in the conv layers, you have 64 filters, and then you double to 128, to 256, and then to 512. Roughly doubling through every stack of conv layers was another simple principle used to design the architecture of this network.

And so, I think the relative uniformity of this architecture made it quite attractive to researchers. The main downside was that it was a pretty large network in terms of the number of parameters you had to train. So there's this pattern: as you go deeper, the height and width go down, by a factor of two each time through the pooling layers, whereas the number of channels goes up, roughly by a factor of two every time you have a new stack of conv layers.

ResNets (Residual Networks)

Very, very deep neural networks are difficult to train because of vanishing and exploding gradient problems. Skip connections allow you to take the activation from one layer and feed it to another layer much deeper in the neural network. Using them, you can build ResNets, which enable you to train very, very deep networks, sometimes even networks of over 100 layers.

ResNets are built out of something called a residual block.

Plain network : \(\begin{matrix}
z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]} & a^{[l+1]} = g(z^{[l+1]}) \\
z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]} & a^{[l+2]} = g(z^{[l+2]})
\end{matrix}\)

Residual block : \(\begin{matrix}
z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]} & a^{[l+1]} = g(z^{[l+1]}) \\
z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]} & a^{[l+2]} = g(z^{[l+2]} + a^{[l]})
\end{matrix}\)

In practice, having a plain network (no ResNet) that's very deep means that your optimization algorithm has a much harder time training, and so your training error gets worse if you pick a network that's too deep. But with ResNets, even as the number of layers gets deeper, you can have the training error keep going down, even if you train a network with over 100 layers.

Why ResNets work?

Doing well on the training set is usually a prerequisite to doing well on your hold out, or on your dev, on your test sets. So being able to at least train the ResNets to do well on a training set is a good first step toward that.

Adding this residual block somewhere in the middle or to the end of this big neural network, it doesn’t hurt performance.

The reason residual networks work is that it's so easy for these extra layers to learn the identity function, so at a minimum adding them doesn't hurt performance; from that decent baseline, training can only improve the solution from there.

And then as is common in these networks, you have conv, conv, conv, pool, conv, conv, conv, pool, conv, conv, conv, pool. And then at the end, I have a fully connected layer that then makes a prediction using a softmax.

Network in Network and 1×1 convolutions

1 x 1 filter

  • On a 6 x 6 x 1 image, a 1 x 1 convolution just multiplies by a single number, which doesn't seem particularly useful.
  • On a 6 x 6 x 32 volume, though, a 1 x 1 x 32 convolution looks at each of the 36 different positions, takes the element-wise product between the 32 numbers at that position and the 32 numbers in the filter, sums them, and then applies a ReLU non-linearity.

And in fact, one way to think about the 32 numbers in this 1 x 1 x 32 filter is as the weights of a single neuron.

So one way to think about the 1 x 1 convolution is that it is basically having a fully connected neural network that applies to each of the 32 different positions.

It’s sometimes also called Network in Network.

A pretty non-trivial operation that allows you to shrink the number of channels in your volumes, or keep it the same, or even increase it if you want.
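The "fully connected layer at each position" view makes a 1 x 1 convolution a one-liner. A numpy sketch (shapes and the ReLU choice are illustrative):

```python
import numpy as np

def conv_1x1(volume, filters):
    """1 x 1 convolution: volume is H x W x C_in, filters is C_in x C_out.
    At each of the H*W positions this is a tiny fully connected layer
    across the channels, followed by ReLU (a sketch)."""
    H, W, C_in = volume.shape
    out = volume.reshape(H * W, C_in) @ filters   # per-position dot products
    return np.maximum(0, out).reshape(H, W, -1)

# Shrink a 6 x 6 x 32 volume to 6 x 6 x 8 (or keep / increase the channels).
vol = np.random.rand(6, 6, 32)
out = conv_1x1(vol, np.random.rand(32, 8))
```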

Inception network motivation

When designing a layer for a ConvNet you might have to pick: do you want a 1 x 1 filter, or 3 x 3, or 5 x 5? Or do you want a pooling layer? What the inception network says is: why not do them all? This makes the network architecture more complicated, but it also works remarkably well.

The inception network, or an inception layer, says: instead of choosing what filter size you want in a conv layer, or even whether you want a convolutional layer or a pooling layer, do them all.

And the basic idea is that instead of needing to pick one of these filter sizes or pooling operations and committing to that, you can do them all, just concatenate all the outputs, and let the network learn whatever parameters and whatever combinations of these filter sizes it wants to use.

There's a problem with the inception layer as I've described it here, which is computational cost.

A bottleneck layer is the smallest part of the network: we shrink the representation before increasing the size again. The total number of multiplications you need to do is the sum over both convolutions.

  • If you are building a layer of a neural network and you don’t want to have to decide do you want a 1 x 1 or 3 x 3 or 5 x 5 of pooling layer. The inception module, let’s do them all. And let’s concatenate the results.
  • The problem of computational cost and we just saw here was how using a 1 x 1 convolution, you can create this bottleneck layer thereby reducing the computational cost significantly.
  • It turns out that so long as you implement this bottleneck layer within reason, you can shrink down the representation size significantly without it seeming to hurt performance, and that saves you a lot of computation.
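The saving is easy to verify with the multiplication counts for the classic illustration (a 28 x 28 x 192 input going through a 5 x 5 conv with 32 filters, with and without a 16-channel 1 x 1 bottleneck). The helper below is just arithmetic:

```python
# Multiplication counts for a conv layer: each output value needs
# f_h * f_w * in_c multiplications.
def conv_mults(out_h, out_w, out_c, f_h, f_w, in_c):
    return out_h * out_w * out_c * f_h * f_w * in_c

direct = conv_mults(28, 28, 32, 5, 5, 192)        # 5x5 conv straight away

bottleneck = (conv_mults(28, 28, 16, 1, 1, 192)   # 1x1 conv down to 16 channels
              + conv_mults(28, 28, 32, 5, 5, 16)) # then the 5x5 conv

# direct is about 120 million multiplications; the bottleneck version
# is about 12.4 million, roughly a 10x saving.
```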

Inception network


Using open-source implementations

It turns out that a lot of these neural networks are difficult or finicky to replicate, because of all the details about tuning the hyperparameters. It's sometimes difficult even for, say, AI or deep learning Ph.D. students at top universities to replicate someone else's published work just from reading the research paper.

Fortunately, a lot of deep learning researchers routinely open source their work on the internet such as on GitHub.

If you see a research paper whose results you would like to build on top of, one thing you should consider doing, one thing I do quite often is just look online for an open-source implementation.

The MIT license is one of the more permissive open source licenses.

  1. If you’re developing a computer vision application, a very common workflow would be to pick an architecture that you’d like. Maybe one of the ones you’ve learned about in this course, or maybe one that you’ve heard about from a friend, or from some of the literature.
  2. And look for an open-source implementation and download it from GitHub to start building from there.

One of the advantages of doing so is that sometimes these networks take a long time to train, and someone else might have used multiple GPUs and a very large data set to pre-train some of these networks. And that allows you to do transfer learning using these networks.

Transfer Learning

If you're building a computer vision application, rather than training the weights from scratch, from random initialization, you often make much faster progress if you download weights that someone else has already trained on a network architecture.

And use that as pre-training and transfer that to a new task that you might be interested in. Use transfer learning to sort of transfer knowledge from some of these very large public data sets to your own problem.

Go online and download some open source implementation of a neural network. And download not just the code, but also the weights.

What you can do is then get rid of the softmax layer and create your own softmax unit. By using someone else's pre-trained weights, you're likely to get pretty good performance even with a small data set. Fortunately, a lot of deep learning frameworks support this mode of operation.

Different deep learning programming frameworks have different ways of letting you specify whether or not to train the weights associated with a particular layer.

If you have a bigger data set, then maybe you have enough data not just to train a single softmax unit, but to train some modest-sized neural network comprising the last few layers of the final network you end up using. And finally, if you have a lot of data, you might take the open source network and weights, use the whole thing just as initialization, and train the whole network.

Computer vision is an area where transfer learning is something you should almost always do, unless you have an exceptionally large data set and can train everything from scratch yourself.

Data augmentation

Most computer vision tasks could use more data and so data augmentation is one of the techniques that is often used to improve the performance of computer vision systems.

The majority of computer vision problems is that we just can’t get enough data.

The common data augmentation methods :

  • mirroring on the vertical axis
  • random cropping
  • rotation
  • local warping
  • color shifting
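The first two methods above, mirroring and random cropping, can be sketched in a few lines of numpy (shapes and helper names are illustrative):

```python
import numpy as np

def mirror(image):
    """Flip on the vertical axis (left-right mirror)."""
    return image[:, ::-1]

def random_crop(image, crop_h, crop_w, rng=np.random.default_rng()):
    """Take a random crop_h x crop_w crop of an H x W x C image."""
    H, W = image.shape[:2]
    top = rng.integers(0, H - crop_h + 1)
    left = rng.integers(0, W - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

img = np.arange(2 * 4 * 1).reshape(2, 4, 1)   # a tiny 2 x 4 "image"
flipped = mirror(img)
crop = random_crop(img, 2, 2)
```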

One of the ways to implement color distortion uses an algorithm called PCA (Principal Components Analysis).

The rough idea of PCA color augmentation is: if your image is mainly purple, i.e. it has mainly red and blue tints and very little green, then PCA color augmentation will add to and subtract from red and blue a lot and change green relatively little, so it keeps the overall tint the same.

A pretty common way of implementing data augmentation is to have one thread (or multiple threads) responsible for loading the data and applying the distortions, and then passing the results to some other thread or process that does the training; these can run in parallel.

A good place to get started might be to use someone else’s open source implementation for how they use data augmentation.

The state of computer vision

Deep learning has been successfully applied to computer vision, natural language processing, speech recognition, online advertising, logistics, many, many, many problems.

Image recognition is the problem of looking at a picture and telling you: is this a cat or not? Object detection looks at a picture and additionally puts bounding boxes telling you where in the picture the objects, such as cars, are. Because getting the bounding boxes is more expensive to label, we tend to have less data for object detection than for image recognition.

On average, when you have a lot of data, you tend to find people getting away with simpler algorithms and less hand-engineering: there's just less need to carefully design features for the problem.

  • Instead you can have a giant neural network, even with a simpler architecture, and let it learn whatever it wants to learn when you have a lot of data.
  • Whereas in contrast, when you don't have that much data, on average you see people engaging in more hand-engineering.

Two sources of knowledge :

  • labeled data
  • hand engineering features / network architectures / other components of your system

And someone that is insightful with hand engineering will get better performance.

If you look at the computer vision literature, look at the set of ideas out there, you’ll also find that people are really enthusiastic. They’re really into doing well on standardized benchmark data sets and on winning competitions. And for computer vision researchers, if you do well on the benchmarks it’s easier to get the paper published. So there is just a lot of attention on doing well on these benchmarks. 

  • And the positive side of this is that it helps the whole community figure out what are the most effective algorithms
  • but you also see in the papers, people do things that allow you to do well on a benchmark,
  • but that you wouldn’t really use in a production or a system that you deploy in an actual application.

Tips for doing well on benchmarks / winning competitions

  • Ensembling

Train several neural networks independently and average their outputs

But it's almost never used in production to serve actual customers, unless you have a huge computational budget and don't mind burning a lot more of it per customer image.

  • Multi-crop at test time

Take the central crop. Then, take the four corners crops. Run these images through your classifier and then average the results.

A neural network that works well on one vision problem often, maybe surprisingly, works well on other vision problems as well. So, to build a practical system you often do well starting off with someone else's neural network architecture.

  • And you can use an open source implementation if possible because the open source implementation might have figured out all the finicky details.
  • But if you have the computer resources and the inclination, don’t let me stop you from training your own networks from scratch. And, in fact, if you want to invent your own computer vision algorithm, that’s what you might have to do.

1 Foundations of Convolutional Neural Networks

Computer vision

  • Rapid advances in computer vision are enabling brand new applications that weren't possible before.
  • Even if you don't end up building computer vision systems per se, the computer vision research community has been so creative and inventive in coming up with new neural network architectures and algorithms that it creates a lot of cross-fertilization into other areas as well.

Some examples of computer vision problems :

  • Image classification, sometimes also called image recognition
  • Object detection
  • Neural style transfer

One of the challenges of computer vision problems is that the inputs can get really big.

To do that, you need to implement the convolution operation.

Edge detection example

  1. The early layers of a neural network might detect edges.
  2. And then somewhat later layers might detect parts of objects.
  3. And then even later layers may detect complete objects.

Edges are detected by convolving the image with a small matrix. In the terminology of convolutional neural networks, this matrix is going to be called a filter. Sometimes research papers will call it a kernel instead of a filter.

More edge detection

Sobel filter : \(\begin{bmatrix}
1 & 0 & -1\\
2 & 0 & -2\\
1 & 0 & -1
\end{bmatrix}\)

If we have a n-by-n image, and convolve that with an f-by-f filter, then the dimension of the output will be \((n-f+1) * (n-f+1)\)
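A "valid" 2D convolution (strictly, cross-correlation, as noted later) and the \((n-f+1)\) output size can be sketched directly in numpy. The half-bright test image is my own illustration:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation (what deep learning calls convolution):
    an n x n image with an f x f filter gives an (n-f+1) x (n-f+1) output."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# A vertical edge in a half-bright / half-dark 6 x 6 image lights up
# under the Sobel filter above.
sobel = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])
img = np.hstack([np.ones((6, 3)) * 10, np.zeros((6, 3))])
edges = conv2d_valid(img, sobel)   # 4 x 4 output
```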

The two downsides to this

  • every time you apply a convolutional operator, your image shrinks
  • if you look the pixel at the corner of the edge, this pixel is touched or used only in one of the outputs

So, to solve both of these problems: before applying the convolution operation, you can pad the image all around with an extra border of pixels (p pixels on each side), so that the output becomes \((n+2p-f+1)\times(n+2p-f+1)\). This way the effect of throwing away, or at least under-counting, the information from the corners and edges of the image is reduced.

How much to pad : 

  • Valid convolution : this basically means no padding.
  • Same convolution : pad so that the output size is the same as the input size.

And you rarely see even-numbered filter sizes used in computer vision :

  • One reason is that if f were even, you would need some asymmetric padding
  • And then second, when you have an odd dimension filter, then it has a central position.

Strided convolutions

If you have an \(n \times n\) image that you convolve with an \(f \times f\) filter with padding p and stride s, then the output size will be \(\left( \frac{n+2p-f}{s} + 1 \right) \times \left( \frac{n+2p-f}{s} + 1 \right)\)

If this fraction is not an integer, we round it down: \(\left \lfloor \frac{n+2p-f}{s} + 1 \right \rfloor\)

And technically, what we’re actually doing, really, is sometimes called cross-correlation instead of convolution. But in deep learning literature, by convention we just call this a convolution operation.

Convolutions over volumes

Convolve this not with a 3 x 3 filter as you had previously, but with a 3D filter that is 3 x 3 x 3, so the filter itself also has three layers, one per channel.

You can now detect two features or maybe several hundred different features, and the output will then have a number of channels equal to the number of features you are detecting.

One layer of a convolutional network

Suppose you have 10 filters, not just 2, that are 3 x 3 x 3 in one layer of a neural network. How many parameters does this layer have? Each filter is a 3 x 3 x 3 volume, so each filter has 27 parameters, 27 numbers to be learned, plus the bias b, giving 28 parameters per filter. Altogether you would have 28 times 10, which is 280 parameters.
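That count generalizes to a small formula:

```python
def conv_layer_params(f, in_channels, num_filters):
    """Learnable parameters in a conv layer: each filter is
    f x f x in_channels weights plus one bias."""
    per_filter = f * f * in_channels + 1
    return per_filter * num_filters

# The example above: ten 3 x 3 x 3 filters -> (27 + 1) * 10 = 280
```

Note the count does not depend on the input's height and width at all, which is part of why convolutions are so parameter-efficient.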

size of the output : \(n_H^{[l]} = \left \lfloor \frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right \rfloor\) (and similarly for \(n_W^{[l]}\))

the size of each filter : \(f^{[l]} \times f^{[l]} \times n_c^{[l-1]}\)

A simple convolution network example

A lot of the work in designing a convolutional neural net is selecting hyperparameters like these, deciding what’s the filter size, what’s the stride, what’s the padding, and how many filters you use.

Types of layer in a convolutional network : 

  • Convolution
  • Pooling
  • Fully connected

Pooling layers

Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of their representation to speed up computation, as well as to make some of the features it detects a bit more robust

Suppose you have a 4×4 input, and you want to apply a type of pooling called max pooling. And the output of this particular implementation of max pooling will be a 2×2 output. And the way you do that is quite simple. Take your 4×4 input and break it into different regions. And I’m going to color the four regions as follows. And then in the output, which is 2×2, each of the outputs will just be the max from the correspondingly shaded region.

So what the max operation does is so long as the feature is detected anywhere in one of these quadrants, it then remains preserved in the output of Max pooling. So what the max operator does is really says, if this feature is detected anywhere in this filter, then keep a high number. But if this feature is not detected, so maybe this feature doesn’t exist in the upper right hand quadrant, then the max of all those numbers is still itself quite small. So maybe that’s the intuition behind max pooling.
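The 2 x 2, stride 2 max pooling described above is a few lines of numpy (2D input for simplicity; per-channel pooling just applies this channel by channel):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling with an f x f window and stride s on a 2D input."""
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
pooled = max_pool(x)   # 2 x 2 output: the max of each quadrant
```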

The main reason people use max pooling is because it’s been found in a lot of experiments to work well.

Average pooling : So that’s pretty much what you’d expect, which is instead of taking the maxes within each filter, you take the average.

So these days max pooling is used much more often than average pooling,

One thing to note about pooling is that there are no parameters to learn, right.

Convolutional neural network example

It turns out that in the ConvNet literature, there are two conventions which are slightly inconsistent about what you call a layer.

  • One convention is to count a conv layer together with the pooling layer that follows it as one layer; that pair would be Layer 1 of the neural network.
  • Another convention is to count the conv layer as one layer and the pool layer as another layer.

When people report the number of layers in a neural network, they usually report just the number of layers that have weights, that have parameters; because the pooling layer has no weights and no parameters, only a few hyperparameters, it's usually not counted.

Maybe one common guideline is to not try to invent your own settings of hyperparameters, but to look in the literature to see what hyperparameters have worked for others, and to just choose an architecture that has worked well for someone else; there's a chance it will work for your application as well.

Why convolutions?

  • Parameter sharing: A feature detector(such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
  • Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
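The payoff of parameter sharing and sparse connections is easy to see by counting parameters. The layer sizes below are hypothetical (a 32×32×3 input mapped to a 28×28×6 output, e.g. by six 5×5 filters):

```python
# Hypothetical sizes: a 32x32x3 input mapped to a 28x28x6 output.
in_units = 32 * 32 * 3    # 3072 input values
out_units = 28 * 28 * 6   # 4704 output values

# Fully connected layer: every output connects to every input, plus biases.
fc_params = in_units * out_units + out_units

# Conv layer: six 5x5x3 filters, one bias each, shared across all positions.
conv_params = 6 * (5 * 5 * 3 + 1)

print(fc_params)    # 14455392 -- about 14 million parameters
print(conv_params)  # 456
```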

2 ML strategy (2)

Carrying out error analysis

If you’re trying to get a learning algorithm to do a task that humans can do, and your learning algorithm is not yet at the performance of a human, then manually examining the mistakes your algorithm is making can give you insight into what to do next. This process is called error analysis, and it can let you very quickly tell whether a direction is worth your effort.

In machine learning, we sometimes call this the ceiling on performance, which just means: what’s the best you could possibly do?

In machine learning, sometimes we speak disparagingly of hand engineering things, or using too much manual insight. But if you’re building applied systems, then this simple counting procedure, error analysis, can save you a lot of time. In terms of deciding what’s the most important, or what’s the most promising direction to focus on.

Maybe this is a 5 to 10 minute effort, but it gives you an estimate of how worthwhile a direction is, and it can help you make a much better decision about whether it is worth working on. Sometimes you can also evaluate multiple ideas in parallel with error analysis.

During error analysis, you’re just looking at dev set examples that your algorithm has misrecognized.

Quick counting procedure, which you can often do in, at most, small numbers of hours can really help you make much better prioritization decisions, and understand how promising different approaches are to work on.

  • To carry out error analysis, you should find a set of mislabeled examples in your dev (development) set.
  • And look at the mislabeled examples for false positives and false negatives. And just count up the number of errors that fall into various different categories.

During this process,

  • you might be inspired to generate new categories of errors, and you can create those new categories as you go.
  • By counting up the fraction of examples that are mislabeled in different ways, you can often prioritize better, or find inspiration for new directions to go in.

Cleaning up Incorrectly labeled data

If, going through your data, you find that some of the output labels Y are incorrect, is it worth your while to go in and fix up some of these labels?

  • It turns out that deep learning algorithms are quite robust to random errors in the training set.
  • If the errors are reasonably random, then it’s probably okay to just leave the errors as they are and not spend too much time fixing them.
  • So long as the total data set size is big enough and the actual percentage of errors is maybe not too high.
  • There is one caveat to this which is that deep learning algorithms are robust to random errors. They are less robust to systematic errors.

If it makes a significant difference to your ability to evaluate algorithms on your dev set, then go ahead and spend the time to fix incorrect labels.

But if it doesn’t make a significant difference to your ability to use the dev set to evaluate classifiers, then it might not be the best use of your time.

Apply whatever process you apply to both your dev and test sets at the same time. It’s actually less important to correct the labels in your training set.

  • In building practical systems, often there’s also more manual error analysis and more human insight that goes into the systems than sometimes deep learning researchers like to acknowledge.
  • Actually go in and look at the data yourself and try to count the fraction of errors. These minutes, or maybe small number of hours, of counting data can really help you prioritize where to go next.

Build your first system quickly, then iterate

And more generally, for almost any machine learning application, there could be 50 different directions you could go in and each of these directions is reasonable and would make your system better. But the challenge is, how do you pick which of these to focus on.

  • If you’re starting to build a brand new machine learning application, build your first system quickly and then iterate.
    • First quickly set up a dev/test set and metric. So this is really deciding where to place your target.

All the value of the initial system is that having some trained system allows you to localize bias/variance, to prioritize what to do next, and to do error analysis: look at some mistakes and figure out, of all the different directions you could go in, which are actually the most worthwhile.

  • If there’s a significant body of academic literature that you can draw on for pretty much the exact same problem you’re building. It might be okay to build a more complex system from the get-go by building on this large body of academic literature.
    • But if you are tackling a new problem for the first time, I would encourage you not to overthink it: far more teams overthink and build something too complicated than build something too simple.

If you are applying machine learning to a new application, and your main goal is to build something that works (as opposed to inventing a new machine learning algorithm, which is a different goal), then aim to get something that works quickly and improve it from there.

Training and testing on different distributions

How to deal with the situation where your training and test distributions differ from each other.

  • Option 1: put both of these data sets together and randomly shuffle them into train/dev/test sets. The advantage is that all the sets then come from the same distribution; the disadvantage is that a lot of your dev set will come from the distribution you have plenty of data from, rather than the distribution you actually care about.

  • Option 2: keep the abundant data in the training set, and make the dev and test sets consist entirely of the data you care about (e.g. all app images). Now you’re aiming the target where you want it to be.

Bias and Variance with mismatched data distributions

In order to tease out these two effects it will be useful to define a new piece of data which we’ll call the training-dev set.

  • What we’re going to do is randomly shuffle the training set and then carve out a piece of it to be the training-dev set. So, just as the dev and test sets have the same distribution, the training set and the training-dev set also have the same distribution.
  • But the difference is that now you train your neural network only on the training set proper; you won’t run backpropagation on the training-dev portion of the data.

Bias / variance on mismatched training and dev / test sets

  • If, when you go from the training data to the training-dev data, the error goes up a lot: the only difference between them is that the network was trained explicitly on the training data but not on the training-dev data, so this tells you that you have a variance problem.
  • But if the error instead jumps when you go from the training-dev set to the dev set, that is a data mismatch problem: your algorithm has learned to do well on a different distribution than the one you actually care about.

Addressing data mismatch

  • Carry out manual error analysis to try to understand the differences between the training set and the dev/test sets.
  • Make training data more similar, or collect more data similar to the dev/test sets

One of the ways we talked about is artificial data synthesis. And artificial data synthesis does work. But, if you’re using artificial data synthesis, just be cautious and bear in mind whether or not you might be accidentally simulating data only from a tiny subset of the space of all possible examples.

Transfer learning

One of the most powerful ideas in deep learning is that sometimes you can take knowledge the neural network has learned from one task and apply that knowledge to a separate task. What you can do is take the last output layer of the neural network, delete it along with the weights feeding into it, create a new set of randomly initialized weights just for the last layer, and have that layer output the labels for the new task.

A couple options of how you retrain neural network with radiology data :

  • If you have a small radiology dataset, you might want to just retrain the weights of the last layer.
  • But if you have a lot of data, then maybe you can retrain all the parameters in the network.

When transfer learning makes sense :

  • Task A and B have the same input X
  • You have a lot more data for Task A than Task B
  • Low level features from A could be helpful for learning B

Multi-task learning

So whereas in transfer learning, you have a sequential process where you learn from task A and then transfer that to task B.

In multi-task learning, you start off simultaneously trying to have one neural network do several things at the same time, and each of these tasks hopefully helps all of the other tasks.

When multi-task learning makes sense

  • Training on a set of tasks that could benefit from having shared lower-level features.
  • Usually: Amount of data you have for each task is quite similar.
  • Can train a big enough neural network to do well on all the tasks.

What is end-to-end deep learning?

Briefly, there have been some data processing systems, or learning systems, that require multiple stages of processing. What end-to-end deep learning does is take all those multiple stages and replace them, usually, with just a single neural network.

  • It turns out that one of the challenges of end-to-end deep learning is that you might need a lot of data before it works well.
  • If you’re training on a smaller data set to build a speech recognition system, then the traditional full pipeline works really well.

So why is it that the two step approach works better?

  • One is that each of the two problems you’re solving is actually much simpler.
  • But second, is that you have a lot of data for each of the two sub-tasks.

Although if you had enough data for the end-to-end approach, maybe the end-to-end approach would work better.

Whether to use end-to-end learning?

The benefits of applying end-to-end learning :

  • end-to-end learning really just lets the data speak.
  • there’s less hand designing of components needed.

The disadvantages :

  • it may need a large amount of data.
  • it excludes potentially useful hand designed components, but the hand-designed components could be very helpful if well designed.

1 ML strategy (1)

Why ML Strategy?

How to improve your system : 

  • more data
  • collect a more diverse training set
    • e.g. examples in unusual poses
  • collect more negative examples
  • train the algorithm longer
  • try a different optimization algorithm
    • trying a bigger network or a smaller network
    • try to dropout or maybe L2 regularization
    • change the network architecture
    • changing activation functions
    • changing the number of hidden units and so on

And the problem is that if you choose poorly, it is entirely possible that you end up spending six months charging in some direction only to realize after six months that that didn’t do any good. So we need a number of strategies, that is, ways of analyzing a machine learning problem that will point you in the direction of the most promising things to try.


You must be very clear-eyed about what to tune in order to try to achieve one effect. This is a process we call orthogonalization.

The concept of orthogonalization refers to this: think of one dimension of what you want to do as controlling the steering angle, and another dimension as controlling your speed.

Orthogonal means at 90 degrees to each other. By having orthogonal controls that are ideally aligned with the things you actually want to control, it becomes much easier to tune the knobs you have to tune.

Chain of assumptions in ML

  • Fit training set well on cost function
  • Fit dev set well on cost function
  • Fit test set well on cost function
  • Performs well in real world

Single number evaluation metric

Whether you’re tuning hyperparameters, or trying out different ideas for learning algorithms, or just trying out different options for building your machine learning system. You’ll find that your progress will be much faster if you have a single real number evaluation metric that lets you quickly tell if the new thing you just tried is working better or worse than your last idea.

One reasonable way to evaluate the performance of your classifiers is to look at its precision and recall.

It turns out that there’s often a tradeoff between precision and recall, and you care about both.

\(F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}\)

And in mathematics, this function is called the harmonic mean of precision P and recall R.

A well-defined dev set, which is how you’re measuring precision and recall, plus a single number evaluation metric, allows you to quickly tell whether classifier A or classifier B is better.
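As a quick sketch (the precision/recall numbers are illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision P and recall R: 2 / (1/P + 1/R)."""
    return 2 / (1 / precision + 1 / recall)

# Classifier A: 95% precision, 90% recall; B: 98% precision, 85% recall.
a = f1_score(0.95, 0.90)  # about 0.924
b = f1_score(0.98, 0.85)  # about 0.910
print(a > b)  # True: a single number makes the comparison immediate
```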

Satisficing and optimizing metrics

It’s not always easy to combine all the things you care about into a single real number evaluation metric. In those cases it is sometimes useful to set up satisficing as well as optimizing metrics.

If you have N metrics that you care about, it’s sometimes reasonable to pick one of them to be optimizing (you want to do as well as possible on that one) and the N − 1 others to be satisficing: they just have to reach some threshold, such as a running time faster than 100 milliseconds, and once they reach that threshold you don’t care how much better they are.

If there are multiple things you care about, designate one as the optimizing metric, which you want to do as well as possible on, and one or more as satisficing metrics, which just need to be satisfied. As long as a classifier does better than those thresholds, you now have an almost automatic way of quickly evaluating multiple classifiers and picking the, quote, best one.
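A small sketch of that selection rule (the classifier names, accuracies, and running times are made up):

```python
# (name, accuracy, running time in ms) -- hypothetical classifiers.
classifiers = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.95, 1500)]

# Satisficing metric: running time must be at most 100 ms.
feasible = [c for c in classifiers if c[2] <= 100]

# Optimizing metric: among the feasible ones, maximize accuracy.
best = max(feasible, key=lambda c: c[1])
print(best[0])  # B -- C is more accurate but misses the latency threshold
```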

Train/dev/test distributions

The dev set is also called the development set, or sometimes called the hold out cross validation set.

Make your dev and test sets come from the same distribution: take all the data and randomly shuffle it into the dev and test sets, so that both contain data from all eight regions and really do come from the same distribution.

Machine learning teams are often very good at shooting different arrows into targets and iterating to get closer and closer to the bullseye. So once you know what data you need to do well on, put data that looks like that into both your dev set and your test set. Aiming the team at one target for months, only to move the target to a totally different location afterward, is a very frustrating experience for the team.

Setting up the dev set, as well as the evaluation metric, is really defining what target you want to aim at.

Size of dev and test sets

  • older era, train/test split : 70/30
  • or train/dev/test split : 60/20/20
  • modern big-data era, train/dev/test split : 98/1/1

Size of test set

Set your test set to be big enough to give high confidence in the overall performance of your system.

When to change dev/test sets and metrics

Sometimes partway through a project you might realize you put your target in the wrong place. In that case you should move your target.

Misclassification error metric : \(Error = \frac{1}{m_{dev}} \sum _{i=1}^{m_{dev}}I\{y_{pred}^{(i)} \neq y^{(i)}\}\)

One way to change this evaluation metric : \(Error = \frac{1}{m_{dev}} \sum _{i=1}^{m_{dev}}w^{(i)}I\{y_{pred}^{(i)} \neq y^{(i)}\}\)

If you want this to remain a number between zero and one, replace the normalization constant with the sum of the weights: \(Error = \frac{1}{\sum _i w^{(i)}} \sum _{i=1}^{m_{dev}}w^{(i)}I\{y_{pred}^{(i)} \neq y^{(i)}\}\)
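A sketch of both versions of this metric in NumPy (labels and weights are made up; a weight of 10 marks an example whose mistake you consider much worse):

```python
import numpy as np

def weighted_error(y_pred, y, w):
    """Weighted misclassification error, normalized by the total weight."""
    mistakes = (y_pred != y).astype(float)
    return np.sum(w * mistakes) / np.sum(w)

y      = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1])
w      = np.array([1, 10, 1, 1, 1, 1])  # example 2's mistake counts 10x

print(weighted_error(y_pred, y, np.ones(6)))  # plain error: 3/6 = 0.5
print(weighted_error(y_pred, y, w))           # weighted: 12/15 = 0.8
```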

The goal of the evaluation metric is to accurately tell you, given two classifiers, which one is better for your application.

If you’re not satisfied with your old error metric then don’t keep coasting with an error metric you’re unsatisfied with, instead try to define a new one that you think better captures your preferences in terms of what’s actually a better algorithm.

Take a machine learning problem and break it into distinct steps.

  • placing the target (defining the metric and dev/test sets)
  • aiming and shooting at the target (doing well on that metric)

This separation is the philosophy of orthogonalization.

If doing well on your metric and your current dev set (or dev and test sets’) distribution does not correspond to doing well on the application you actually care about, then change your metric and your dev/test sets.

The overall guideline is if your current metric and data you are evaluating on doesn’t correspond to doing well on what you actually care about, then change your metrics and/or your dev/test set to better capture what you need your algorithm to actually do well on.

Even if you can’t define the perfect evaluation metric and dev set, just set something up quickly and use that to drive the speed of your team iterating.

And if later down the line you find out that it wasn’t a good one and you have a better idea, it’s perfectly okay to change it at that time.

Why human-level performance?

  1. In deep learning, machine learning algorithms are suddenly working much better and so it has become much more feasible in a lot of application areas for machine learning algorithms to actually become competitive with human-level performance.
  2. The workflow of designing and building a machine learning system, the workflow is much more efficient when you’re trying to do something that humans can also do.

And over time, as you keep training the algorithm, maybe bigger and bigger models on more and more data, the performance approaches but never surpasses some theoretical limit, which is called the Bayes optimal error. So Bayes optimal error, think of this as the best possible error. And Bayes optimal error, or Bayesian optimal error, or sometimes Bayes error for short, is the very best theoretical function for mapping from x to y. That can never be surpassed.

It turns out that progress is often quite fast until you surpass human level performance. And it sometimes slows down after you surpass human level performance.

  1. One reason is that human level performance is for many tasks not that far from Bayes’ optimal error.
  2. so long as your performance is worse than human level performance, then there are actually certain tools you could use to improve performance that are harder to use once you’ve surpassed human level performance.
    • For tasks that humans are good at, so long as your machine learning algorithm is still worse than the human, you can get labeled data from humans. That is you can ask people, ask or hire humans, to label examples for you so that you can have more data to feed your learning algorithm.

Knowing how well humans can do on a task can help you understand better how much you should try to reduce bias and how much you should try to reduce variance.

Avoidable bias

If there’s a huge gap between how well your algorithm does on your training set and how well humans do, it shows that your algorithm isn’t even fitting the training set well. So in terms of tools to reduce bias or variance, in this case I would say focus on reducing bias.

In another case, even though your training error and dev error are the same as the other example, you see that maybe you’re actually doing just fine on the training set. It’s doing only a little bit worse than human level performance. You would maybe want to focus on reducing this component, reducing the variance in your learning algorithm.

Think of human-level error as a proxy, or an estimate, for Bayes error (Bayes optimal error). For computer vision tasks this is a pretty reasonable proxy, because humans are actually very good at computer vision, so whatever a human can do is maybe not too far from Bayes error.

The difference between Bayes error or approximation of Bayes error and the training error is the avoidable bias.

The difference between your training error and the dev error is still a measure of the variance problem of your algorithm.
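A tiny numeric sketch of that decision (the error rates are made up, with human-level error standing in for Bayes error):

```python
# Hypothetical error rates, in percent.
human_error = 1.0   # proxy for Bayes error
train_error = 8.0
dev_error   = 10.0

avoidable_bias = train_error - human_error  # 7.0: big gap to human level
variance       = dev_error - train_error    # 2.0: small train -> dev gap

# Focus on whichever gap is bigger.
focus = "bias" if avoidable_bias > variance else "variance"
print(focus)  # bias
```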

Understanding human-level performance

Human-level error gives us a way of estimating Bayes error: the best possible error any function could achieve, either now or in the future.

How should you define human-level error?

Be clear about what your purpose is in defining the term human-level error.

The gap between Bayes error (or your estimate of it) and the training error is a measure of the avoidable bias; the gap between the training error and the dev error is a measure, or an estimate, of how much of a variance problem you have in your learning algorithm.

  • The difference between your estimate of Bayes error and the training error tells you how much avoidable bias there is.
  • And the difference between training error and dev error, that tells you how much variance is a problem, whether your algorithm’s able to generalize from the training set to the dev set.

A better estimate for Bayes error can help you better estimate avoidable bias and variance. And therefore make better decisions on whether to focus on bias reduction tactics, or on variance reduction tactics.

Surpassing human-level performance

Surpassing human-level performance : 

  • Team of humans
  • One human
  • Training error
  • Dev error

If your error is already better than even a team of humans looking at, discussing, and debating the right label, then it’s also harder to rely on human intuition to tell your algorithm what ways it could still improve its performance.

Humans tend to be very good at natural perception tasks. So it is possible, but a bit harder, for computers to surpass human-level performance on natural perception tasks.

Problems where ML significantly surpasses human-level performance :

  • Online advertising
  • Product recommendations
  • Logistics (predicting transit time)
  • Loan approvals

And finally, all of these are problems where there are teams that have access to huge amounts of data. So for example, the best systems for all four of these applications have probably looked at far more data of that application than any human could possibly look at. And so, that’s also made it relatively easy for a computer to surpass human-level performance.

Improving your model performance

The two fundamental assumptions of supervised learning

  • You can fit the training set pretty well
  • The training set performance generalizes pretty well to the dev/test set

Reducing (avoidable) bias and variance

  • Human-level error
    (the gap down to training error is the avoidable bias; to reduce it:)
    • Train bigger model
    • Train longer / better optimization algorithms
    • NN architecture / hyperparameters search
  • Training error
    (the gap down to dev error is the variance; to reduce it:)
    • More data
    • Regularization
    • NN architecture / hyperparameters search
  • Dev error

This notion of avoidable bias and variance is one of those things that is easily learned, but tough to master.

3 Hyperparameter tuning

Tuning process

How to systematically organize your hyperparameter tuning process

One of the painful things about training deep networks is the sheer number of hyperparameters:

  • the learning rate
  • the momentum term
  • the number of layers
  • the number of hidden units in the different layers
  • learning rate decay
  • mini-batch size

How do you select a set of values to explore

  • It was common practice to sample the points in a grid and systematically explore these values.
  • In deep learning, what we tend to do, is choose the points at random.
  • Another common practice is a coarse-to-fine sampling scheme:
    • zoom in to a smaller region of the hyperparameter space, and then sample more densely within that region
    • again using random sampling within the zoomed-in region

Using an appropriate scale to pick hyperparameters

It’s important to pick the appropriate scale on which to explore the hyperparameters.

  • some hyperparameters (such as the number of hidden units or the number of layers) can reasonably be sampled uniformly at random
  • others (such as the learning rate) should be sampled on a log scale
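For example, to sample a learning rate between 0.0001 and 1, sample the exponent uniformly rather than the value itself; otherwise about 90% of your samples would land between 0.1 and 1. A sketch (the range and sample count are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample r uniformly in [-4, 0], then set alpha = 10^r.
r = rng.uniform(-4, 0, size=10_000)
alpha = 10.0 ** r

# Each decade [1e-4, 1e-3), [1e-3, 1e-2), ... gets ~25% of the samples.
print(round(np.mean(alpha < 1e-3), 2))  # close to 0.25
```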

Pandas vs. Caviar (Hyperparameter tuning in practice: Pandas vs. Caviar)

Deep learning today is applied to many different application areas, and intuitions about hyperparameter settings from one application area may or may not transfer to a different one.

People from different application domains increasingly read research papers from other application domains to look for inspiration and cross-fertilization.

How to search for good hyperparameters: the panda approach (babysitting a single model as it trains) versus the caviar approach (training many models in parallel).

Normalizing activations in a network

Batch normalization makes your hyperparameter search problem much easier, makes the neural network much more robust to the choice of hyperparameters, there’s a much bigger range of hyperparameters that work well, and will also enable you to much more easily train even very deep networks.

Fitting Batch Norm into a neural network

Programming frameworks make using Batch Norm much easier.

Why does Batch Norm work?

  • normalizing all the input features to take on a similar range of values can speed up learning
  • it makes the weights in later or deeper layers of your network more robust to changes in the weights of earlier layers
  • It reduces the amount that the distribution of these hidden unit values shifts around.
  • Batch norm reduces the problem of the input values changing; it causes these values to become more stable, so that the later layers of the neural network have firmer ground to stand on.
  • It weakens the coupling between what the early layers parameters has to do and what the later layers parameters have to do. And so it allows each layer of the network to learn by itself, a little bit more independently of other layers, and this has the effect of speeding up learning in the whole network.
  • Batch norm therefore has a slight regularization effect. Because by adding noise to the hidden units, it’s forcing the downstream hidden units not to rely too much on any one hidden unit.

Batch Norm at test time

Batch norm processes your data one mini-batch at a time, but at test time you might need to process examples one at a time.

So the takeaway is that during training \(\mu\) and \(\sigma ^2\) are computed on an entire mini-batch of, say, 64 or 128 examples, but at test time you might need to process a single example at a time. The way to do that is to estimate \(\mu\) and \(\sigma ^2\) from your training set, and there are many ways to do that.

But in practice, what people usually do is implement an exponentially weighted average where you just keep track of the \(\mu\) and \(\sigma ^2\) values you’re seeing during training and use an exponentially weighted average, also sometimes called the running average, to just get a rough estimate of \(\mu\) and \(\sigma ^2\) and then you use those values of \(\mu\) and \(\sigma ^2\) at test time to do the scaling you need of the hidden unit values Z.

Deep learning frameworks usually have some default way to estimate \(\mu\) and \(\sigma ^2\) that should work reasonably well.
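A sketch of that running-average bookkeeping (the decay value, batch size, and data distribution are my choices; frameworks keep equivalent "moving mean" / "moving variance" buffers for you):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9                        # decay for the running averages
running_mu, running_var = 0.0, 1.0

for step in range(200):
    z = rng.normal(5.0, 2.0, size=64)   # one mini-batch of z values
    # Exponentially weighted (running) averages updated during training.
    running_mu = beta * running_mu + (1 - beta) * z.mean()
    running_var = beta * running_var + (1 - beta) * z.var()

# Test time: normalize a single example with the running estimates.
z_norm = (6.0 - running_mu) / np.sqrt(running_var + 1e-8)
print(round(running_mu, 1))  # close to the true mean, 5.0
```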

Softmax regression

Softmax regression that lets you make predictions where you’re trying to recognize one of C or one of multiple classes, rather than just recognize two classes.

Training a Softmax classifier

Loss function in softmax classification : \(L(\hat y, y) = -\sum _{j=1}^{C}y_{j}\log \hat y_{j}\), where C is the number of classes

It looks at whatever is the ground-truth class in your training set and tries to make the corresponding probability of that class as high as possible. If you’re familiar with maximum likelihood estimation in statistics, this turns out to be a form of maximum likelihood estimation.

The cost J on the entire training set : \(J(w^{[1]}, b^{[1]}, \cdots \cdots ) = \frac {1}{m} \sum _{i=1}^{m} L(\hat y ^{(i)}, y ^{(i)})\)
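A sketch of both pieces in NumPy (the logit values are made up; C = 4 classes):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

def loss(y_hat, y):
    """Cross-entropy -sum_j y_j * log(y_hat_j); y is a one-hot vector."""
    return -np.sum(y * np.log(y_hat))

z = np.array([5.0, 2.0, -1.0, 3.0])   # logits for C = 4 classes
y = np.array([1.0, 0.0, 0.0, 0.0])    # ground truth is class 0

y_hat = softmax(z)
# Only the true class's predicted probability enters the loss.
print(round(loss(y_hat, y), 3))       # equals -log(y_hat[0])
```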

Usually you just need to focus on getting the forward prop right; so long as you specify the forward pass, the programming framework will figure out how to do backprop, the backward pass, for you.

Deep Learning frameworks

For most people, it is not practical to implement everything from scratch. Fortunately, there are now many good deep learning software frameworks that can help you implement these models.

Criteria for choosing a framework :

  • Ease of programming
  • Running speed
  • Truly open
  • Preferences of language
  • What application you’re working on


Example (reconstructed from the course’s cost-minimization demo; this uses the TensorFlow 1.x API, where tf.placeholder and tf.Session still exist):

import numpy as np
import tensorflow as tf

coefficients = np.array([[1.], [-10.], [25.]])
w = tf.Variable(0, dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])      # data fed in at run time
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]   # (w - 5)^2 for these coefficients
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for i in range(1000):
        session.run(train, feed_dict={x: coefficients})
    print(session.run(w))   # approaches 5, the minimizer of the cost

The TensorFlow documentation tends to just write the operations (tf.add, tf.multiply, and so on); the overloaded +, * and ** operators above build the same computation graph.

2 Optimization algorithms

Mini-batch gradient descent

Mini-batch gradient descent, in contrast, refers to the algorithm in which you process a single mini-batch \(X^{\{t\}}\), \(Y^{\{t\}}\) at a time, rather than processing your entire training set X, Y at the same time.

Mini-batch gradient descent runs much faster than batch gradient descent, and it’s pretty much what everyone in deep learning uses when training on a large data set.

Understanding mini-batch gradient descent

  • If the mini-batch size=m then you just end up with Batch Gradient Descent.
  • If your mini-batch size=1 and this gives you an algorithm called Stochastic Gradient Descent.


  • With stochastic gradient descent you lose almost all the speedup from vectorization, because you’re processing a single training example at a time.
  • It also never exactly converges; it oscillates around in a small region near the minimum. If that’s an issue, you can slowly reduce the learning rate.

Guidelines : 

  • If you have a small training set (maybe 2000), just use batch gradient descent.
  • If you have a bigger training set, typical mini batch sizes would be, anything from 64 up to maybe 512 are quite typical.
  • Because of the way computer memory is laid out and accessed, sometimes your code runs faster if your mini-batch size is a power of 2. 
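The guidelines above can be sketched on a toy problem (learning y = 3x with squared loss; the data, learning rate, and epoch count are my choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 1))
Y = 3 * X                          # toy target: y = 3x, so w should reach 3

w = 0.0
alpha, batch_size = 0.1, 64        # power-of-2 mini-batch size

for epoch in range(5):
    perm = rng.permutation(len(X))              # reshuffle every epoch
    for t in range(0, len(X), batch_size):
        idx = perm[t:t + batch_size]
        xb, yb = X[idx], Y[idx]
        grad = np.mean(2 * (w * xb - yb) * xb)  # gradient of the batch MSE
        w -= alpha * grad                       # one step per mini-batch

print(round(w, 3))  # close to 3.0
```

Note that one pass through the training set takes many gradient steps (one per mini-batch), which is where the speedup over batch gradient descent comes from.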

Exponentially weighted averages

Exponentially weighted averages are also called exponentially weighted moving averages in statistics.

Understanding exponentially weighted averages

The key equation for implementing exponentially weighted averages : \(v_t = \beta v_{t-1} + (1- \beta) \theta _t\)

It takes very little memory.
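A direct implementation (the temperature-like numbers are made up; with \(\beta = 0.9\) this averages over roughly \(1/(1-\beta) = 10\) previous values):

```python
def ewa(thetas, beta):
    """v_t = beta * v_{t-1} + (1 - beta) * theta_t, starting from v_0 = 0."""
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

temps = [10.0, 12.0, 8.0, 11.0, 9.0, 13.0, 10.0]
v = ewa(temps, beta=0.9)
print(round(v[0], 2), round(v[1], 2))  # 1.0 2.1 -- early values are
                                       # biased toward zero
```

Only the single running value v needs to be stored, which is why it takes so little memory.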

Bias correction in exponentially weighted averages

Bias correction can make your computation of these averages more accurate.

If you are concerned about the bias during this initial phase, while your exponentially weighted moving average is still warming up. Then bias correction can help you get a better estimate early on.
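A sketch of the correction \(v_t / (1 - \beta ^t)\) on the first few values (the numbers are made up):

```python
beta = 0.9
thetas = [10.0, 12.0, 8.0]

v, corrected = 0.0, []
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    corrected.append(v / (1 - beta ** t))   # divide by (1 - beta^t)

print(round(corrected[0], 2))  # 10.0: no longer biased toward zero
```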

Gradient descent with Momentum

Momentum, or gradient descent with momentum, almost always works faster than the standard gradient descent algorithm.

In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead.

This will almost always work better than the straightforward gradient descent algorithm without momentum.
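A minimal sketch of the update rule on a toy quadratic bowl that is much steeper in one direction than the other (the function, \(\alpha\), \(\beta\), and iteration count are my choices):

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5*w[0]^2 + 10*w[1]^2: steep in w[1], gentle in w[0].
    return np.array([w[0], 20 * w[1]])

w = np.array([10.0, 1.0])
v = np.zeros(2)
alpha, beta = 0.04, 0.9

for _ in range(300):
    v = beta * v + (1 - beta) * grad(w)  # exponentially weighted avg of grads
    w = w - alpha * v                    # step along the smoothed gradient

print(np.round(np.abs(w), 3))  # both coordinates near the minimum at 0
```

The averaging damps the back-and-forth oscillation in the steep direction while the consistent gradient in the gentle direction keeps accumulating.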


RMSprop

RMSprop (Root Mean Square prop) is another algorithm that can speed up gradient descent.

And so, you want to slow down the learning in the b direction, or in the vertical direction. And speed up learning, or at least not slow it down in the horizontal direction. So this is what the RMSprop algorithm does to accomplish this.
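That damping can be seen in a tiny deterministic sketch (all numbers are mine): feed RMSprop a constant small gradient dW ("horizontal") and a constant large gradient db ("vertical"); because each update is divided by the square root of the running average of its own squared gradients, the two actual step sizes come out equal:

```python
import numpy as np

dW, db = 0.1, 2.0                 # constant gradients: gentle vs steep
s_dW = s_db = 0.0
alpha, beta2, eps = 0.01, 0.999, 1e-8

for _ in range(100):              # accumulate the squared-gradient averages
    s_dW = beta2 * s_dW + (1 - beta2) * dW**2
    s_db = beta2 * s_db + (1 - beta2) * db**2

step_W = alpha * dW / (np.sqrt(s_dW) + eps)
step_b = alpha * db / (np.sqrt(s_db) + eps)
print(round(step_W, 4), round(step_b, 4))  # equal: the steep direction is
                                           # damped by a larger sqrt(s)
```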

Adam optimization algorithm

Like RMSprop, the Adam optimization algorithm is one of those rare algorithms that has really stood up and been shown to work well across a wide range of deep learning architectures.

And the Adam optimization algorithm is basically taking momentum and RMSprop and putting them together.
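A sketch of the combined update on a one-dimensional problem, minimizing (w − 5)² (β₁ = 0.9, β₂ = 0.999, ε = 1e-8 are the commonly recommended defaults; the learning rate and iteration count are my choices):

```python
import numpy as np

w = 0.0                    # minimize f(w) = (w - 5)^2, optimum at w = 5
v = s = 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = 2 * (w - 5)                      # gradient of (w - 5)^2
    v = beta1 * v + (1 - beta1) * g      # momentum: EWA of gradients
    s = beta2 * s + (1 - beta2) * g**2   # RMSprop: EWA of squared gradients
    v_hat = v / (1 - beta1 ** t)         # bias corrections
    s_hat = s / (1 - beta2 ** t)
    w -= alpha * v_hat / (np.sqrt(s_hat) + eps)

print(round(w, 2))  # approaches 5
```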

Learning rate decay

One of the things that might help speed up your learning algorithm, is to slowly reduce your learning rate over time.

some formulas : 

\(\alpha = \frac {1}{1 + \text{decay-rate} \times \text{epoch-num}} \alpha _0\)

\(\alpha = \frac {k}{\sqrt{\text{epoch-num}}} \alpha _0\)

\(\alpha = \frac {k}{\sqrt{t}} \alpha _0\)
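The first formula, as a quick sketch (alpha0 = 0.2 and decay-rate = 1 are just example values):

```python
def decayed_lr(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

# With alpha0 = 0.2 and decay_rate = 1, the rate falls epoch by epoch
lrs = [decayed_lr(0.2, 1.0, epoch) for epoch in range(1, 5)]
```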

The problem of local optima

Local optima turn out to be rare when training deep networks; instead, most points of zero gradient in a cost function are saddle points.

In very high-dimensional spaces you’re actually much more likely to run into a saddle point.

1 Practical aspects of Deep Learning

Train / Dev / Test sets

Applied deep learning is a very iterative process.

  • In the previous era of machine learning : 70/30 train/test splits if you don't have an explicit dev set, or maybe a 60/20/20% split
  • In the modern big data era : with 1,000,000 examples, 98/1/1 or even 99.5/0.25/0.25
  • Make sure that the dev and test sets come from the same distribution
  • It might be okay to not have a test set. The goal of the test set is to give you an unbiased estimate of the performance of your final network, of the network that you selected. But if you don't need that unbiased estimate, then it might be okay to not have a test set.

Bias / Variance

  • High bias : not a very good fit to the data; we say that this is underfitting the data.
  • High variance : this is overfitting the data, and the model will not generalize well.
  • The optimal error is sometimes called Bayes error.

How to analyze bias and variance when no classifier can do very well :

  • Get a sense of how well you are fitting by looking at your training set error
  • Go to the dev set and look at how bad is the variance problem

Basic Recipe for Machine Learning

  1. Does your algorithm have high bias? To evaluate this, look at how well it fits the training set. If it does not even fit the training set well, some things you could try are picking a bigger network or training longer.
  2. Maybe you can make it work, maybe not; getting a bigger network almost always helps, and training longer doesn't always help but certainly never hurts. Try these things, going back and repeating, until you can at least fit the training set pretty well and the bias problem is gone.
  3. Once you reduce bias to acceptable amounts, ask: do you have a variance problem?
  4. If you have high variance, the best way to solve it is to get more data. But sometimes you can't get more data, so you could also try regularization.

Repeat until hopefully you find something with both low bias and low variance.

Notes :

  • If you actually have a high bias problem, getting more training data is actually not going to help.
  • Getting a bigger network almost always just reduces your bias without necessarily hurting your variance, so long as you regularize appropriately. And getting more data pretty much always reduces your variance and doesn’t hurt your bias much.

Training a bigger network almost never hurts. And the main cost of training a neural network that’s too big is just computational time, so long as you’re regularizing.


High Variance Problem :

  • probably regularization
  • get more training data

Regularization

Regularization will often help to prevent overfitting, or to reduce the errors in your network.

To add regularization to logistic regression, you add a penalty term scaled by \(\lambda\), which is called the regularization parameter.

  • L2 regularization (the most common type of regularization) \(J(w,b) = \frac {1}{m} \sum _{i=1}^{m} L(\hat y ^{(i)}, y ^{(i)}) + \frac {\lambda}{2m} \left \| w \right \| ^{2}_{2}\)
  • L1 regularization \(J(w,b) = \frac {1}{m} \sum _{i=1}^{m} L(\hat y ^{(i)}, y ^{(i)}) + \frac {\lambda}{2m} \left \| w \right \| _{1}\)
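For a multi-layer network, the penalty is the sum of the squared Frobenius norms of the weight matrices. A sketch of the regularized cost (the helper name and toy matrices are illustrative):

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Add (lambda / 2m) * sum of squared weight entries to the unregularized cost."""
    l2_term = sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + (lambd / (2 * m)) * l2_term

W1 = np.ones((3, 2))   # hypothetical weight matrices, for illustration only
W2 = np.ones((1, 3))
cost = l2_regularized_cost(0.5, [W1, W2], lambd=0.7, m=10)
```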

Frobenius norm (the L2 norm of a matrix) : its square, \(\left \| W \right \| ^{2}_{F}\), is just the sum of the squares of the elements of the matrix.

L2 regularization is sometimes also called weight decay.

Why regularization reduces overfitting?

One piece of intuition is that if you crank the regularization parameter lambda up to be really, really big, the network will be strongly incentivized to set the weight matrices W reasonably close to zero. Setting the weights so close to zero for a lot of hidden units basically zeroes out much of the impact of those hidden units, leaving a much simpler network.

Dropout Regularization

With dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating each node in the neural network. So you end up with a much smaller, really much-diminished network, and then you do backpropagation training.

By far the most common implementation of dropouts today is inverted dropouts.
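A sketch of inverted dropout applied to the activations of one layer (layer 3 here, with keep_prob = 0.8 as an example value); the final division by keep_prob is the "inverted" step that keeps the expected value of the activations unchanged:

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8
a3 = np.random.randn(4, 5)                   # activations of some layer, e.g. layer 3

d3 = np.random.rand(*a3.shape) < keep_prob   # boolean mask: keep ~80% of units
a3_dropped = a3 * d3                         # zero out the eliminated units
a3_dropped = a3_dropped / keep_prob          # inverted step: preserve expected value
```

At test time you don't apply dropout at all; the division above is what makes that work without any extra scaling.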

Understanding Dropout

  • So it’s as if on every iteration, you’re working with a smaller neural network, and so using a smaller neural network seems like it should have a regularizing effect.
  • Similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights and has a similar outer-regularization effect that helps prevent over-fitting.

Notice that a keep_prob of 1.0 means that you're keeping every unit.

If you’re more worried about some layers overfitting than others,
you can set a lower keep_prob for some layers than others. The downside is, this gives you even more hyper parameters to search for using cross-validation.

One other alternative might be to have some layers where you apply dropout and some layers where you don’t apply dropout and then just have one hyper parameter, which is the keep_prob for the layers for which you do apply dropout.

In computer vision, the input size is so big (you're inputting all these pixels) that you almost never have enough data, so you're almost always overfitting. That's why dropout is very frequently used in computer vision.

One big downside of dropout is that the cost function J is no longer well-defined: on every iteration you are randomly killing off a bunch of nodes. So if you are double-checking the performance of gradient descent, it's harder to verify that you have a well-defined cost function J that is going downhill on every iteration.

Other regularization methods

  • data augmentation : flipping it horizontally, random rotations and distortions
  • early stopping : 
    1. And the advantage of early stopping is that running the gradient descent process just once, you get to try out values of small w, mid-size w, and large w, without needing to try a lot of values of the L2 regularization hyperparameter lambda.
    2. The problem is that by stopping gradient descent earlier, you're not doing a great job of reducing the cost function J, while at the same time you're also trying not to overfit.
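A minimal early-stopping sketch; everything here (the error curve, the patience parameter) is illustrative, not from the notes:

```python
def early_stopping_point(dev_errors, patience=3):
    """Return (step, error) of the best dev error, stopping after `patience` non-improvements."""
    best_err, best_step, bad_checks = float("inf"), -1, 0
    for step, err in enumerate(dev_errors):
        if err < best_err:
            best_err, best_step, bad_checks = err, step, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break   # dev error has stopped improving: stop training here
    return best_step, best_err

# Dev error falls at first, then rises as the network starts to overfit
step, err = early_stopping_point([0.9, 0.6, 0.4, 0.35, 0.4, 0.5, 0.6])
```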

Normalizing inputs

Normalizing your inputs is one of the techniques that will speed up training when training a neural network.

Normalizing your inputs corresponds to two steps : 

  1. subtract out or to zero out the mean
  2. normalize the variances

If your features came in on similar scales, then this step is less important, although performing this type of normalization pretty much never does any harm, so I’ll often do it anyway if I’m not sure whether or not it will help with speeding up training for your algorithm.
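The two steps in numpy, assuming X has shape (n_x, m) with examples in columns (as in these notes):

```python
import numpy as np

def normalize_inputs(X):
    """Step 1: subtract the per-feature mean. Step 2: normalize the variances."""
    mu = np.mean(X, axis=1, keepdims=True)            # zero out the mean
    X = X - mu
    sigma2 = np.mean(X ** 2, axis=1, keepdims=True)   # per-feature variance
    X = X / np.sqrt(sigma2)
    return X, mu, sigma2

np.random.seed(0)
X = np.random.randn(2, 100) * np.array([[10.0], [0.1]]) + 5.0   # very different scales
X_norm, mu, sigma2 = normalize_inputs(X)
```

Use the same mu and sigma2 computed on the training set to normalize your dev and test sets, so all three go through the identical transformation.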

Vanishing / Exploding gradients

When you’re training a very deep network, your derivatives or your slopes can sometimes get either very very big or very very small, maybe even exponentially small, and this makes training difficult.

Use careful choices of the random weight initialization to significantly reduce this problem.

Weight Initialization for Deep Networks

More careful choice of the random initialization for your neural network.

Some formulas give a default value to use for the variance of the initialization of the weight matrices :

  • tanh : Xavier initialization \(\sqrt{\frac{1}{n^{[l-1]}}}\), or \(\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}\)
  • Relu : \(\sqrt{\frac{2}{n^{[l-1]}}}\)
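A sketch of the initialization: scale a standard Gaussian by \(\sqrt{2/n^{[l-1]}}\) for ReLU layers, or \(\sqrt{1/n^{[l-1]}}\) for tanh (the function name and layer sizes are illustrative):

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu"):
    """He-style initialization for ReLU, Xavier-style for tanh."""
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        scale = np.sqrt(2.0 / fan_in) if activation == "relu" else np.sqrt(1.0 / fan_in)
        params["W" + str(l)] = np.random.randn(layer_dims[l], fan_in) * scale
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))  # biases can start at zero
    return params

params = initialize_weights([4, 3, 1], activation="relu")
```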

Numerical approximation of gradients

When you implement back propagation you’ll find that there’s a test called gradient checking that can really help you make sure that your implementation of back prop is correct. Because sometimes you write all these equations and you’re just not 100% sure if you’ve got all the details right and implementing back propagation. So in order to build up to gradient checking, let’s first talk about how to numerically approximate computations of gradients.

How to numerically approximate computations of gradients

The formal definition of a derivative : \(f'(\theta) = \frac {f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon }\)
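A quick sketch of this two-sided difference on \(f(\theta) = \theta^3\), whose true derivative at \(\theta = 1\) is 3:

```python
def numeric_derivative(f, theta, eps=1e-7):
    """Two-sided difference: (f(theta + eps) - f(theta - eps)) / (2 * eps)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

approx = numeric_derivative(lambda t: t ** 3, 1.0)   # should be very close to 3
```

The two-sided version has error O(eps^2), which is why it's preferred over the one-sided difference (error O(eps)) for gradient checking.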

Gradient checking

How you can use this to debug, or to verify that your implementation of backprop is correct.

\(\mathrm{d} \theta _{approx}[i] = \frac {J(\theta_1, \theta_2, \cdots \theta_i + \varepsilon, \cdots) – J(\theta_1, \theta_2, \cdots \theta_i – \varepsilon, \cdots)}{2\varepsilon }\)


\(\frac{\left \| \mathrm{d} \theta _{approx} - \mathrm{d} \theta \right \|_2}{\left \| \mathrm{d} \theta _{approx} \right \|_2 + \left \| \mathrm{d} \theta \right \|_2} \left\{\begin{matrix}
< 10^{-7} & \text{, that's great}\\
> 10^{-5} & \text{, maybe have a bug somewhere}
\end{matrix}\right.\)

Gradient Checking Implementation Notes

  1. Don’t use in training – only to debug
  2. If algorithm fails grad check , look at components to try to identify bug.
  3. Remember regularization
  4. Doesn’t work with dropout
  5. Run at random initialization; perhaps again after some training.

4 Deep Neural Networks

Deep L-layer neural network

Over the last several years, the AI and machine learning community has realized that there are functions that very deep neural networks can learn that shallower models are often unable to.

Although for any given problem it might be hard to predict in advance exactly how deep a neural network you would want, it would be reasonable to try logistic regression first, then networks with one or two hidden layers, treating the depth as a hyperparameter to tune.

Symbol definition for deep learning : 

  • \(L\) : the number of layers in the network
  • \(n^{[l]}\) : the number of nodes, or units, in layer l (e.g. \(n^{[1]} = 5\))
  • \(a^{[l]}\) : the activations in layer l, computed as \(a^{[l]} = g^{[l]}(z^{[l]})\)
  • \(W^{[l]}\) : the weights of layer l
  • \(x\) : the input features, with \(x = a^{[0]}\)
  • \(\hat {y} = a^{[L]}\) : the activation of the final layer

Forward and backward propagation

forward propagation : \(\begin{matrix}
z^{[l]} = W^{[l]} \cdot a^{[l-1]} + b^{[l]}\\
a^{[l]} = g^{[l]}(z^{[l]})
\end{matrix}\)

vectorized version : \(\begin{matrix}
Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}\\
A^{[l]} = g^{[l]}(Z^{[l]})
\end{matrix}\)

backward propagation : \(\begin{matrix}
\mathrm{d}z^{[l]} = \mathrm{d}a^{[l]} * g^{[l]\prime}(z^{[l]})\\
\mathrm{d}W^{[l]} = \mathrm{d}z^{[l]} \cdot a^{[l-1]T}\\
\mathrm{d}b^{[l]} = \mathrm{d}z^{[l]}\\
\mathrm{d}a^{[l-1]} = W^{[l]T} \cdot \mathrm{d}z^{[l]}\\
\mathrm{d}z^{[l]} = W^{[l+1]T} \mathrm{d}z^{[l+1]} * g^{[l]\prime}(z^{[l]})
\end{matrix}\)

vectorized version : \(\begin{matrix}
\mathrm{d}Z^{[l]} = \mathrm{d}A^{[l]} * g^{[l]\prime}(Z^{[l]})\\
\mathrm{d}W^{[l]} = \frac{1}{m} \mathrm{d}Z^{[l]} \cdot A^{[l-1]T}\\
\mathrm{d}b^{[l]} = \frac{1}{m} np.sum(\mathrm{d}Z^{[l]}, axis = 1, keepdims = True)\\
\mathrm{d}A^{[l-1]} = W^{[l]T} \cdot \mathrm{d}Z^{[l]}
\end{matrix}\)
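The vectorized equations for a single layer, sketched in numpy (ReLU is used as the activation purely as an illustrative choice):

```python
import numpy as np

def layer_forward(A_prev, W, b):
    """Z = W A_prev + b, A = g(Z), with g = ReLU."""
    Z = W @ A_prev + b
    A = np.maximum(0, Z)
    return A, Z

def layer_backward(dA, Z, A_prev, W, m):
    """The four vectorized backward equations for one layer."""
    dZ = dA * (Z > 0)                                  # dA * g'(Z); ReLU' is 0 or 1
    dW = (1 / m) * dZ @ A_prev.T
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

np.random.seed(0)
m = 8                                                  # number of examples
A0 = np.random.randn(4, m)                             # 4 input features
W1, b1 = np.random.randn(3, 4), np.zeros((3, 1))       # layer with 3 units
A1, Z1 = layer_forward(A0, W1, b1)
dA0, dW1, db1 = layer_backward(np.random.randn(*A1.shape), Z1, A0, W1, m)
```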

Forward propagation in a Deep Network

for a single training example : \(z^{[l]} = w^{[l]}a^{[l-1]} + b^{[l]}, \ a^{[l]} = g^{[l]}(z^{[l]})\)

vectorized way : \(Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}, \ A^{[l]} = g^{[l]}(Z^{[l]}) \ (A^{[0]} = X)\)

Getting your matrix dimensions right

One of the debugging tools to check the correctness of your code is to work through the dimensions of your matrices.

Making sure that all the matrix dimensions are consistent will usually help you go some way toward eliminating certain classes of possible bugs.

Why deep representations?

Deep neural networks work really well for a lot of problems. It's not just that they need to be big neural networks; specifically, they need to be deep, i.e. have a lot of hidden layers.

  1. The earlier layers learn low-level, simpler features, and the later, deeper layers then put together the simpler things detected earlier in order to detect more complex things.
  2. If you try to compute the same function with a shallow network (one not allowed enough hidden layers), then you might require exponentially more hidden units.

Starting out on a new problem : 

  • Start out with something simple, even logistic regression, then try something with one or two hidden layers, and treat the number of layers as a hyperparameter that you tune
  • But over the last several years there has been a trend toward people finding that for some applications very very deep neural networks sometimes can be the best model for a problem

Building blocks of deep neural networks

Nothing ……

Parameters vs Hyperparameters

These are parameters that control the ultimate parameters W and b, so we call all of the things below hyperparameters : 

  • \(\alpha\) (learning rate)
  • iterations (the number of iterations of gradient descent)
  • L (the number of hidden layers)
  • \(n^{[l]}\) (the number of hidden units)
  • choice of activation function

Find the best value :

Idea—Code—Experiment—Idea— ……

Try a few values for the hyper parameters and double check if there’s a better value for the hyper parameters and as you do so you slowly gain intuition as well about the hyper parameters.

What does this have to do with the brain?

Maybe that was useful but now the field has moved to the point where that analogy is breaking down.

3 Shallow Neural Networks

Neural Network Overview

Refer to example i : superscript \((i)\), as in \(x^{(i)}\)

Refer to layer m : superscript \([m]\), as in \(a^{[m]}\)

Algorithm 3.1 : \(z = w^Tx + b\)

Algorithm 3.2 : \(z = w^Tx + b \Rightarrow a = \sigma(z) \Rightarrow L(a,y) \ (Loss \ Function)\)

Algorithm 3.3 : \(z^{[1]} = W^{[1]}x + b^{[1]} \Rightarrow a^{[1]} = \sigma(z^{[1]})\)

Algorithm 3.4 : \(dz^{[1]} = d(W^{[1]}x + b^{[1]}) \Leftarrow da^{[1]} = d\sigma(z^{[1]})\)

Algorithm 3.5 : \(dz^{[1]} = d(W^{[1]}x + b^{[1]}) \Leftarrow da^{[1]} = d\sigma(z^{[1]})\)

Algorithm 3.6 : \(da^{[1]} = d\sigma(z^{[1]}) \Leftarrow dz^{[2]} = d(W^{[2]}a^{[1]} + b^{[2]}) \Leftarrow da^{[2]} = d\sigma(z^{[2]}) \Leftarrow dL(a^{[2]}, y)\)

Neural Network Representation

  • input layer
  • hidden layer
  • output layer

\(a\) : activations

\(a^{[0]}\) : the activations of the input layer

\(a^{[0]}_1\) : first node

Algorithm 3.7 : \(a^{[1]}=\begin{bmatrix}
a^{[1]}_1 \\
a^{[1]}_2 \\
a^{[1]}_3 \\
a^{[1]}_4
\end{bmatrix}\)

  • when we count layers in neural networks we don’t count the input layer so the hidden layer is layer 1
  • In our notational convention we’re calling the input layer layer 0

so a two layer neural network looks like a neural network with one hidden layer.

Computing a Neural Network’s output

Symbols in neural networks:

  • 𝑥 : features
  • 𝑎 : output
  • 𝑊 : weight
  • superscript : layers
  • subscript : number of the items

How this neural network computes its output :

  1. \(z_1^{[1]} = w_1^{[1]T}x + b_1^{[1]}\)
  2. \(a_1^{[1]} = \sigma(z_1^{[1]})\)
  3. and similarly for \(a_2^{[1]}, a_3^{[1]}, a_4^{[1]}\)

Vectorizing : stack nodes in a layer vertically

Algorithm 3.10 : \(a^{[1]} = \begin{bmatrix}
a_1^{[1]} \\
a_2^{[1]} \\
a_3^{[1]} \\
a_4^{[1]}
\end{bmatrix} = \sigma(z^{[1]})\)

Algorithm 3.11 : \(z^{[1]} = \begin{bmatrix}
\cdots & W_1^{[1]T} & \cdots \\
\cdots & W_2^{[1]T} & \cdots \\
\cdots & W_3^{[1]T} & \cdots \\
\cdots & W_4^{[1]T} & \cdots
\end{bmatrix} x + b^{[1]} = W^{[1]}x + b^{[1]}\)

Vectorizing across multiple examples

Take the equations you had from the previous algorithm and with very little modification, change them to make the neural network compute the outputs on all the examples, pretty much all at the same time.

\(a^{[2](i)}\) : Refers to training example i and layer two

Algorithm 3.12 : \(X = \begin{bmatrix}
\vdots & \vdots & \vdots & \vdots \\
x^{(1)} & x^{(2)} & \cdots & x^{(m)}\\
\vdots & \vdots & \vdots & \vdots
\end{bmatrix}\)

Algorithm 3.13 : \(Z^{[1]} = \begin{bmatrix}
\vdots & \vdots & \vdots & \vdots \\
z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)}\\
\vdots & \vdots & \vdots & \vdots
\end{bmatrix}\)

Algorithm 3.14 : \(A^{[1]} = \begin{bmatrix}
\vdots & \vdots & \vdots & \vdots \\
a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)}\\
\vdots & \vdots & \vdots & \vdots
\end{bmatrix}\)

Algorithm 3.15 : for \(i = 1\) to \(m\) : \(\left.\begin{matrix}
z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]}\\
a^{[1](i)} = \sigma(z^{[1](i)})\\
z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]}\\
a^{[2](i)} = \sigma(z^{[2](i)})
\end{matrix}\right\} \Rightarrow \begin{matrix}
Z^{[1]} = W^{[1]}X + b^{[1]}\\
A^{[1]} = \sigma (Z^{[1]})\\
Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}\\
A^{[2]} = \sigma (Z^{[2]})
\end{matrix}\)

Justification for vectorized implementation

Algorithm 3.16 : \(\begin{matrix}
z^{[1](1)} = W^{[1]}x^{(1)} + b^{[1]}\\
z^{[1](2)} = W^{[1]}x^{(2)} + b^{[1]}\\
z^{[1](3)} = W^{[1]}x^{(3)} + b^{[1]}
\end{matrix}\)

Algorithm 3.17 : \(W^{[1]}X = W^{[1]}\begin{bmatrix}
\vdots & \vdots & \vdots \\
x^{(1)} & x^{(2)} & x^{(3)} \\
\vdots & \vdots & \vdots
\end{bmatrix} = \begin{bmatrix}
\vdots & \vdots & \vdots \\
W^{[1]}x^{(1)} & W^{[1]}x^{(2)} & W^{[1]}x^{(3)} \\
\vdots & \vdots & \vdots
\end{bmatrix} = \begin{bmatrix}
\vdots & \vdots & \vdots \\
z^{[1](1)} & z^{[1](2)} & z^{[1](3)} \\
\vdots & \vdots & \vdots
\end{bmatrix} = Z^{[1]}\)

Stack up the training examples in the columns of matrix X, and their outputs are also stacked into the columns of matrix \(Z^{[1]}\).

Activation functions

Algorithm 3.18 Sigmoid : \(a = \sigma (z) = \frac{1}{1 + e^{-z}}\)

Algorithm 3.19 tanh : \(a = \tanh (z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\) (almost always works better than the sigmoid function)

Algorithm 3.20 hidden layer : \(g(z^{[1]}) = \tanh(z^{[1]})\) (almost always strictly superior)

Algorithm 3.21 binary : \(g(z^{[2]}) = \sigma(z^{[2]})\) (if y is either 0 or 1)

If z is either very large or very small, then the gradient (the slope) of the sigmoid or tanh function becomes very small, ending up close to zero, and this can slow down gradient descent.

Algorithm 3.22 Relu (Rectified Linear Unit) : \(a = max(0, z)\)

Algorithm 3.23 Leaky Relu: \(a = max(0.01z, z)\)

some rules of thumb for choosing activation functions :

  • sigmoid : binary classification
  • tanh : pretty much strictly superior
  • ReLu : default

If you’re not sure which one of these activation functions work best you know try them all and then evaluate on like a holdout validation set or like a development set which we’ll talk about later and see which one works better and then go with that.

Why need a nonlinear activation function?

It turns out that for your neural network to compute interesting functions you do need to take a nonlinear activation function.

It turns out that if you use a linear activation function, or alternatively if you don't have an activation function at all, then no matter how many layers your neural network has, all it is doing is computing a linear function, so you might as well not have any hidden layers.

Derivatives of activation functions

Algorithm 3.25 : \(\frac{\mathrm{d} }{\mathrm{d} z}g(z) = \frac{1}{1+e^{-z}}(1 – \frac{1}{1+e^{-z}}) = g(z)(1 – g(z))\)

Algorithm 3.26 : \(g(z) = \tanh (z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\)

Algorithm 3.27 : \(\frac{\mathrm{d} }{\mathrm{d} z}g(z) = 1 – (tanh(z))^2\)

Algorithm Rectified Linear Unit (ReLU) : \(g(z)' = \left\{\begin{matrix}
0 & if\ z < 0\\
1 & if\ z > 0\\
undefined & if\ z = 0
\end{matrix}\right.\)

Algorithm Leaky linear unit (Leaky ReLU) : \(g(z)' = \left\{\begin{matrix}
0.01 & if\ z < 0\\
1 & if\ z > 0\\
undefined & if\ z = 0
\end{matrix}\right.\)
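All four activations and their derivatives fit in a few lines of numpy (in code the derivative at z = 0 is taken to be the z < 0 branch, a common convention since the mathematical derivative is undefined there):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1 - g)              # g(z)(1 - g(z))

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2      # 1 - tanh(z)^2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return np.where(z > 0, 1.0, 0.0)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def leaky_relu_prime(z):
    return np.where(z > 0, 1.0, 0.01)
```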

Gradient descent for neural networks

forward propagation : 

(1) : \(z^{[1]} = W^{[1]}x + b^{[1]}\)

(2) : \(a^{[1]} = \sigma(z^{[1]})\)

(3) : \(z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}\)

(4) : \(a^{[2]} = g^{[2]}(z^{[2]}) = \sigma(z^{[2]})\)

back propagation : 

Algorithm 3.32 : \(\mathrm{d}z^{[2]} = A^{[2]} - Y, \ Y = [y^{(1)} \ y^{(2)} \ \cdots \ y^{(m)}]\)

Algorithm 3.33 : \(\mathrm{d}W^{[2]} = \frac{1}{m} \mathrm{d}z^{[2]}A^{[1]T}\)

Algorithm 3.34 : \(\mathrm{d}b^{[2]} = \frac{1}{m} np.sum(\mathrm{d}z^{[2]}, axis = 1, keepdims = True)\)

Algorithm 3.35 : \(\mathrm{d}z^{[1]} = W^{[2]T}\mathrm{d}z^{[2]} * g^{[1]\prime}(z^{[1]})\) (element-wise product)

Algorithm 3.36 : \(\mathrm{d}W^{[1]} = \frac{1}{m}\mathrm{d}z^{[1]}x^T\)

Algorithm 3.37 : \(\mathrm{d}b^{[1]} = \frac {1}{m} np.sum(\mathrm{d}z^{[1]}, axis = 1, keepdims = True)\) (axis = 1 : sum horizontally; keepdims ensures that Python outputs an \((n^{[1]}, 1)\) vector rather than a rank-1 array)
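Equations 3.32-3.37 as one backward pass, assuming a tanh hidden layer (so \(g^{[1]\prime}(z^{[1]}) = 1 - (a^{[1]})^2\)) and a sigmoid output; the random data is only there to exercise the shapes:

```python
import numpy as np

def backward_pass(X, Y, A1, A2, W2, m):
    """Equations 3.32-3.37 for a 2-layer network with tanh hidden units."""
    dZ2 = A2 - Y                                         # 3.32
    dW2 = (1 / m) * dZ2 @ A1.T                           # 3.33
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)   # 3.34
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)                   # 3.35 (tanh derivative)
    dW1 = (1 / m) * dZ1 @ X.T                            # 3.36
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)   # 3.37
    return dW1, db1, dW2, db2

np.random.seed(2)
m = 5
X = np.random.randn(2, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
W1, b1 = np.random.randn(4, 2) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))
A1 = np.tanh(W1 @ X + b1)                  # forward pass
A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
dW1, db1, dW2, db2 = backward_pass(X, Y, A1, A2, W2, m)
```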

Backpropagation intuition

It is one of the very hardest pieces of math. One of the very hardest derivations in all of machine learning.

//TODO, maybe never ...

Random Initialization

It is important to initialize the weights randomly.

  1. Gaussian random variable with shape (2, 2) : \(W^{[1]} = np.random.randn(2, 2)\)
  2. then multiply this by a very small number, such as 0.01, so that the weights are initialized to very small random values
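The two steps together, in numpy (the 2x2 shape and the 0.01 factor are from the notes; the bias can safely be initialized to zero, since randomizing W already breaks symmetry):

```python
import numpy as np

np.random.seed(1)
W1 = np.random.randn(2, 2) * 0.01   # small random values break symmetry
b1 = np.zeros((2, 1))               # zeros are fine for the bias
```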