# 2 Deep convolutional models : case studies

## Why look at case studies?

It turns out a lot of the past few years of computer vision research has been on how to put together these basic building blocks to form effective convolutional neural networks.

And one of the best ways for you to gain intuition yourself is to see some of these examples.

## Classic networks

• #### LeNet-5

And back then, when this paper was written, people used average pooling much more. If you’re building a modern variant, you’ll probably use max pooling instead.

So it turns out that if you read the original paper, back then people used sigmoid and tanh non-linearities, and people weren’t using ReLU non-linearities back then.

But back then, computers were much slower. And so, to save on computation as well as on parameters, the original LeNet-5 had some crazy complicated way where different filters look at different channels of the input block. And so, the paper talks about those details, but a more modern implementation wouldn’t have that type of complexity these days.

• #### AlexNet

So, this neural network actually had a lot of similarities to LeNet, but it was much bigger.

They took pretty similar basic building blocks, but with a lot more hidden units, and trained on a lot more data: they trained on the ImageNet data set. Another aspect of this architecture that made it much better than LeNet was using the ReLU activation function.

When this paper was written, GPUs were still a little bit slower. So, it had a complicated way of training on two GPUs.

The original AlexNet architecture also had another type of layer called local response normalization. The basic idea of local response normalization is, if you look at one of these blocks, one of these volumes that we have on top, let’s say for the sake of argument this one, 13 by 13 by 256: for each position in that 13 by 13 grid, you look at all 256 numbers across the channels and normalize them. And the motivation for this local response normalization was that for each position in this 13 by 13 image, maybe you don’t want too many neurons with a very high activation. But subsequently, many researchers have found that this doesn’t help that much.
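As a rough illustration of the idea described above, here is a minimal NumPy sketch that normalizes across the channels at each spatial position. The function name and the constants `k`, `alpha`, and `beta` are assumptions chosen for illustration, and the normalization here spans all channels at a position (as in the description above), which may differ from the exact window used in the original paper.

```python
import numpy as np

def local_response_norm(a, k=2.0, alpha=1e-4, beta=0.75):
    """Simplified local response normalization (illustrative sketch).

    a: activations of shape (height, width, channels).
    At each (height, width) position, every channel value is divided by a
    term that grows with the sum of squares of all channel values there,
    so positions with many very high activations get damped the most.
    k, alpha, beta are assumed constants, not values taken from these notes.
    """
    sq_sum = np.sum(a ** 2, axis=-1, keepdims=True)  # shape (H, W, 1)
    return a / (k + alpha * sq_sum) ** beta

rng = np.random.default_rng(0)
volume = rng.normal(size=(13, 13, 256))   # stand-in for the 13 x 13 x 256 block
normalized = local_response_norm(volume)
print(normalized.shape)
```

The output has the same shape as the input; only the magnitudes change.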

It was really this paper that convinced a lot of the computer vision community to take a serious look at deep learning, and convinced them that deep learning really works in computer vision. And then it went on to have a huge impact, not just in computer vision but beyond computer vision as well.

• #### VGG-16

Instead of having so many hyperparameters, let’s use a much simpler network where you focus on just having conv layers that are just three by three filters with stride one and always use “same” padding, and make all your max pooling layers two by two with a stride of two. And so, one very nice thing about the VGG network was that it really simplified these neural network architectures.

But VGG-16 is a relatively deep network.

The 16 in the name VGG-16 refers to the fact that this has 16 layers that have weights. And this is a pretty large network. This network has a total of about 138 million parameters.

And that’s pretty large even by modern standards. But the simplicity of the VGG-16 architecture made it quite appealing. You can tell its architecture is really quite uniform. There’s a few conv layers followed by a pooling layer, which reduces the height and width. You have a few of them here. But then also, if you look at the number of filters in the conv layers, here you have 64 filters, and then you double to 128, double to 256, double to 512. Roughly doubling on every step, or doubling through every stack of conv layers, was another simple principle used to design the architecture of this network.

And so, I think the relative uniformity of this architecture made it quite attractive to researchers. The main downside was that it was a pretty large network in terms of the number of parameters you had to train. The uniformity also makes the pattern clear: as you go deeper, height and width go down, by a factor of two each time through the pooling layers, whereas the number of channels increases, roughly by a factor of two every time you have a new set of conv layers.
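To check the “16 weight layers” and “about 138 million parameters” figures, here is a small sketch that walks through the layer layout and counts parameters. The exact layout used here (224 x 224 x 3 input, conv stacks of 2, 2, 3, 3, 3 layers, fully connected layers of 4096, 4096 and 1000 units) is the commonly described VGG-16 configuration and is an assumption, since these notes do not spell it out.

```python
# Assumed VGG-16 layout (not spelled out in these notes): stacks of 3x3 "same"
# conv layers, each stack followed by a 2x2 max pool with stride 2, then
# three fully connected layers.
CONV_STACKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]  # (layers, filters)
FC_SIZES = [4096, 4096, 1000]

def vgg16_summary(height=224, width=224, channels=3):
    """Return (volume shape after each pool, total parameters, weight layers)."""
    shapes, params, weight_layers = [], 0, 0
    for n_layers, n_filters in CONV_STACKS:
        for _ in range(n_layers):
            # One 3x3 filter spans all input channels, plus one bias per filter.
            params += 3 * 3 * channels * n_filters + n_filters
            channels = n_filters              # "same" padding keeps height/width
            weight_layers += 1
        height, width = height // 2, width // 2   # 2x2 pool, stride 2
        shapes.append((height, width, channels))
    n_in = height * width * channels
    for n_out in FC_SIZES:
        params += n_in * n_out + n_out
        n_in = n_out
        weight_layers += 1
    return shapes, params, weight_layers

shapes, total_params, weight_layers = vgg16_summary()
for shape in shapes:
    print(shape)
print(weight_layers, total_params)
```

Under these assumptions the height and width halve after every stack while the channel count doubles up to 512, and the parameter total comes out at roughly 138 million, most of it in the first fully connected layer.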

## ResNets (Residual Networks)

Very, very deep neural networks are difficult to train because of vanishing and exploding gradient types of problems. Skip connections allow you to take the activation from one layer and feed it directly to another layer, even much deeper in the neural network. And using that, you’re going to build ResNets, which enable you to train very, very deep networks, sometimes even networks of over 100 layers.

ResNets are built out of something called a residual block.

Plain network : $$\begin{matrix} z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]} & a^{[l+1]} = g(z^{[l+1]}) \\ z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]} & a^{[l+2]} = g(z^{[l+2]})\\ \end{matrix}$$

Residual block : $$\begin{matrix} z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]} & a^{[l+1]} = g(z^{[l+1]}) \\ z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]} & a^{[l+2]} = g(z^{[l+2]} + a^{[l]})\\ \end{matrix}$$
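The two sets of equations above can be sketched in NumPy as follows. This is a minimal fully connected version, assuming $g$ is ReLU and that $a^{[l]}$ has the same dimension as $z^{[l+2]}$ so the two can be added; the function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def plain_block(a_l, W1, b1, W2, b2):
    """Two ordinary layers: a[l] -> a[l+1] -> a[l+2]."""
    a1 = relu(W1 @ a_l + b1)
    return relu(W2 @ a1 + b2)

def residual_block(a_l, W1, b1, W2, b2):
    """Same two layers, but a[l] is added to z[l+2] before the final ReLU."""
    a1 = relu(W1 @ a_l + b1)
    return relu(W2 @ a1 + b2 + a_l)   # skip connection

rng = np.random.default_rng(0)
n = 4                                  # same width everywhere so shapes match
a_l = rng.normal(size=(n, 1))
W1, W2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
b1, b2 = np.zeros((n, 1)), np.zeros((n, 1))

out_plain = plain_block(a_l, W1, b1, W2, b2)
out_res = residual_block(a_l, W1, b1, W2, b2)
print(out_plain.ravel(), out_res.ravel())
```

The only difference between the two functions is the `+ a_l` term inside the last non-linearity.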

In practice, or in reality, having a plain network (so no ResNet) that’s very deep means that your optimization algorithm just has a much harder time training. And so, in reality, your training error gets worse if you pick a network that’s too deep. But what happens with ResNets is that even as the number of layers gets deeper, you can have the training error kind of keep on going down, even if you train a network with over 100 layers.

## Why ResNets work?

Doing well on the training set is usually a prerequisite to doing well on your hold-out sets, that is, your dev or test sets. So being able to at least train the ResNets to do well on a training set is a good first step toward that.

Adding this residual block somewhere in the middle or to the end of this big neural network, it doesn’t hurt performance.

The main reason the residual network works is that it’s so easy for these extra layers to learn the identity function. Or at least it is easier to go from a decent baseline of not hurting performance, and then gradient descent can only improve the solution from there.
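To make the identity argument concrete, here is a short worked case using the residual block equation $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$. It assumes $g$ is ReLU, so that $a^{[l]} \geq 0$, and considers the case where the extra layer’s parameters are zero, $W^{[l+2]} = 0$ and $b^{[l+2]} = 0$:

$$a^{[l+2]} = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}) = g(a^{[l]}) = a^{[l]}$$

So under these assumptions the two extra layers simply pass $a^{[l]}$ through, which is why adding the block does not hurt performance.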

And then as is common in these networks, you have conv, conv, conv, pool, conv, conv, conv, pool, conv, conv, conv, pool. And then at the end, I have a fully connected layer that then makes a prediction using a softmax.

## Network in Network and 1×1 convolutions

1 x 1 filter

• On a 6 x 6 x 1 image, a 1 x 1 filter doesn’t seem particularly useful: it just multiplies the image by one number.
• On a 6 x 6 x 32 volume, a 1 x 1 convolution will look at each of the 36 different positions here. It will take the element-wise product between the 32 numbers at that position and the 32 numbers in the filter, sum them, and then apply a ReLU nonlinearity to the result.

And in fact, one way to think about the 32 numbers you have in this 1 x 1 x 32 filter is as the weights of a single neuron that takes the 32 channel values at one position as its input.

So one way to think about the 1 x 1 convolution is that it is basically having a fully connected neural network that applies to each of the 36 different positions.

It’s sometimes also called Network in Network.

A pretty non-trivial operation that allows you to shrink the number of channels in your volumes, or keep it the same, or even increase it if you want.
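A minimal NumPy sketch of this view, assuming a 6 x 6 x 32 input and 16 filters (the filter count and the function name are illustrative choices): the 1 x 1 convolution is the same small fully connected layer applied at every position.

```python
import numpy as np

def conv_1x1(x, W, b):
    """1 x 1 convolution followed by ReLU.

    x: (height, width, c_in), W: (c_in, c_out), b: (c_out,).
    Each of the c_out columns of W plays the role of one 1 x 1 x c_in filter.
    """
    return np.maximum(0, x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 32))
W = rng.normal(size=(32, 16))      # 16 filters shrink 32 channels to 16
b = np.zeros(16)
y = conv_1x1(x, W, b)              # shape (6, 6, 16)

# The same computation written as a fully connected layer per position.
y_loop = np.zeros((6, 6, 16))
for i in range(6):
    for j in range(6):
        y_loop[i, j] = np.maximum(0, W.T @ x[i, j] + b)

print(y.shape, np.allclose(y, y_loop))
```

Choosing fewer, the same number of, or more filters than input channels is what shrinks, keeps, or increases the channel count.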

## Inception network motivation

When designing a CONV layer, you might have to pick: do you want a 1 x 1 filter, or 3 x 3, or 5 x 5? Or do you want a pooling layer? What the inception network does is say, why not do them all? And this makes the network architecture more complicated, but it also works remarkably well.

What an inception layer says is: instead of choosing what filter size you want in a CONV layer, or even whether you want a convolutional layer or a pooling layer, let’s do them all.

And the basic idea is that instead of you needing to pick one of these filter sizes or pooling and committing to that, you can do them all and just concatenate all the outputs, and let the network learn whatever parameters it wants to use and whatever combinations of these filter sizes it wants.
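The “do them all and concatenate” idea can be sketched with a deliberately naive NumPy implementation. The helper names, the small 8 x 8 x 4 input, and the filter counts are assumptions picked to keep the example fast; the key point is that every branch uses “same” padding so the outputs can be stacked along the channel axis.

```python
import numpy as np

def conv_same(x, w):
    """Naive stride-1 convolution with 'same' padding, followed by ReLU.
    x: (H, W, C_in); w: (f, f, C_in, C_out) with f odd."""
    f = w.shape[0]
    p = f // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + f, j:j + f, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(0, out)

def max_pool_same(x, f=3):
    """Stride-1 max pooling with 'same' padding, so height and width are kept."""
    p = f // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    H, W, _ = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + f, j:j + f, :].max(axis=(0, 1))
    return out

def inception_layer(x, w1, w3, w5):
    """Run 1x1, 3x3, 5x5 convs and pooling in parallel, then concatenate channels."""
    branches = [conv_same(x, w1), conv_same(x, w3), conv_same(x, w5), max_pool_same(x)]
    return np.concatenate(branches, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
w1 = rng.normal(size=(1, 1, 4, 2))
w3 = rng.normal(size=(3, 3, 4, 3))
w5 = rng.normal(size=(5, 5, 4, 2))
out = inception_layer(x, w1, w3, w5)
print(out.shape)   # channels: 2 + 3 + 2 + 4 from the pooling branch
```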

There’s a problem with the inception layer as I’ve described it here, which is computational cost.

A bottleneck layer is the smallest part of this network. We shrink the representation before increasing the size again. The total number of multiplications you need to do is the sum of those.

• If you are building a layer of a neural network and you don’t want to have to decide whether you want a 1 x 1, or 3 x 3, or 5 x 5, or pooling layer, the inception module says, let’s do them all. And let’s concatenate the results.
• That runs into the problem of computational cost, and what we just saw here was how, using a 1 x 1 convolution, you can create this bottleneck layer, thereby reducing the computational cost significantly.
• It turns out that so long as you implement this bottleneck layer within reason, you can shrink down the representation size significantly, and it doesn’t seem to hurt the performance. That saves you a lot of computation.
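The saving from the bottleneck can be worked out with simple arithmetic. The sizes below (a 28 x 28 x 192 input, a 5 x 5 convolution producing 32 channels, and a 16-channel bottleneck) are assumed for illustration and are not given in these notes; the helper name is hypothetical.

```python
def conv_multiplications(out_h, out_w, n_filters, f, in_channels):
    """Multiplications for a conv layer: one f x f x in_channels product
    per output value, and there are out_h * out_w * n_filters output values."""
    return out_h * out_w * n_filters * f * f * in_channels

# Direct 5x5 convolution: 28x28x192 -> 28x28x32
direct = conv_multiplications(28, 28, 32, 5, 192)

# With a bottleneck: 1x1 conv down to 16 channels, then 5x5 conv up to 32
reduce_step = conv_multiplications(28, 28, 16, 1, 192)
expand_step = conv_multiplications(28, 28, 32, 5, 16)
bottleneck = reduce_step + expand_step   # total is the sum of the two steps

print(direct, bottleneck, round(direct / bottleneck, 1))
```

With these assumed sizes the direct version needs roughly 120 million multiplications and the bottleneck version roughly 12.4 million, close to a tenfold reduction.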

## Using open-source implementations

It turns out that a lot of these neural networks are difficult or finicky to replicate, because of a lot of details about tuning the hyperparameters. It is sometimes difficult even for, say, AI or deep learning Ph.D. students at the top universities to replicate someone else’s published work just from reading the research paper.

Fortunately, a lot of deep learning researchers routinely open source their work on the internet such as on GitHub.

If you see a research paper whose results you would like to build on top of, one thing you should consider doing, one thing I do quite often is just look online for an open-source implementation.

The MIT license is one of the more permissive open source licenses.

1. If you’re developing a computer vision application, a very common workflow would be to pick an architecture that you’d like. Maybe one of the ones you’ve learned about in this course, or maybe one that you’ve heard about from a friend, or from some of the literature.
2. And look for an open-source implementation and download it from GitHub to start building from there.

One of the advantages of doing so also is that sometimes these networks take a long time to train, and someone else might have used multiple GPUs and a very large data set to pre-train some of these networks. And that allows you to do transfer learning using these networks.

## Transfer Learning

If you’re building a computer vision application, rather than training the weights from scratch, from random initialization, you often make much faster progress if you download weights that someone else has already trained on a network architecture.

And use that as pre-training and transfer that to a new task that you might be interested in. Use transfer learning to sort of transfer knowledge from some of these very large public data sets to your own problem.

What you can do is then get rid of the softmax layer and create your own softmax unit. By using someone else’s pre-trained weights, you’re likely to get pretty good performance on this, even with a small data set. Fortunately, a lot of deep learning frameworks support this mode of operation.

Different deep learning programming frameworks have different ways of letting you specify whether or not to train the weights associated with a particular layer.
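Rather than guess at any one framework’s API, here is a framework-free NumPy sketch of the idea: the earlier layers are frozen and only a new softmax layer is trained. Everything here is a stand-in, including the random “pre-trained” weights, the tiny synthetic data set, the layer sizes, and the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for downloaded pre-trained weights (random here, purely illustrative).
W_frozen = rng.normal(size=(20, 8)) / np.sqrt(20)
W_frozen_before = W_frozen.copy()

# New softmax layer for our own 3-class task: the only trainable parameters.
W_new = np.zeros((8, 3))
b_new = np.zeros(3)

def features(x):
    """Frozen feature extractor: its weights are never updated."""
    return np.maximum(0, x @ W_frozen)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Tiny synthetic data set standing in for your own small labeled set.
X = rng.normal(size=(60, 20))
y = rng.integers(0, 3, size=60)
Y = np.eye(3)[y]

def loss():
    p = softmax(features(X) @ W_new + b_new)
    return -np.mean(np.log(p[np.arange(len(y)), y]))

loss_start = loss()
for _ in range(200):
    F = features(X)                       # no gradient flows into W_frozen
    P = softmax(F @ W_new + b_new)
    grad_logits = (P - Y) / len(X)
    W_new -= 0.1 * F.T @ grad_logits      # update only the new layer
    b_new -= 0.1 * grad_logits.sum(axis=0)
loss_end = loss()
print(loss_start, loss_end)
```

Because the frozen part never changes, its outputs could also be computed once and reused across training steps, which is one reason this mode is cheap for small data sets.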

If you have a bigger data set, then maybe you have enough data not just to train a single softmax unit, but to train some modest-sized neural network that comprises the last few layers of the final network that you end up using. And then finally, if you have a lot of data, one thing you might do is take this open source network and weights, use the whole thing just as initialization, and train the whole network.

Computer vision is one area where transfer learning is something that you should almost always do, unless you actually have an exceptionally large data set to train everything from scratch yourself.

## Data augmentation

Most computer vision tasks could use more data and so data augmentation is one of the techniques that is often used to improve the performance of computer vision systems.

For the majority of computer vision problems, we just can’t get enough data.

The common data augmentation methods :

• mirroring on the vertical axis
• random cropping
• rotation
• shearing
• local warping
• color shifting
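Three of the methods in the list above can be sketched in a few lines of NumPy. The function names, the image layout (height, width, RGB channels, `uint8`), and the particular shift values are assumptions for illustration.

```python
import numpy as np

def mirror(img):
    """Mirror about the vertical axis (flip left to right)."""
    return img[:, ::-1, :]

def random_crop(img, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w window out of the image."""
    H, W, _ = img.shape
    top = rng.integers(0, H - crop_h + 1)
    left = rng.integers(0, W - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

def color_shift(img, shift):
    """Add a per-channel (R, G, B) offset to a uint8 image, clipping to [0, 255]."""
    shifted = img.astype(np.int16) + np.asarray(shift, dtype=np.int16)
    return np.clip(shifted, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

flipped = mirror(image)
cropped = random_crop(image, 24, 24, rng)
purple_tint = color_shift(image, (20, -20, 20))   # more red and blue, less green
print(flipped.shape, cropped.shape, purple_tint.dtype)
```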

One of the ways to implement color distortion uses an algorithm called PCA (Principal Component Analysis).

The rough idea of what is called PCA color augmentation is, for example, if your image is mainly purple, if it has mainly red and blue tints and very little green, then PCA color augmentation will add and subtract a lot to red and blue and relatively little to green, so it kind of keeps the overall tint of the color the same.
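A rough sketch of this idea in NumPy follows. It is a simplification under stated assumptions: the PCA is computed over the pixels of a single image rather than a whole training set, the random scale `sigma=0.1` is an assumed value, and the function names are hypothetical.

```python
import numpy as np

def pca_color_delta(img, alphas):
    """RGB offset built from the principal components of the image's pixel colors.

    Each component is scaled by its eigenvalue, so color directions with a lot
    of variance (for example red and blue in a purple image) change the most.
    """
    pixels = img.reshape(-1, 3).astype(np.float64) / 255.0
    cov = np.cov(pixels, rowvar=False)          # 3 x 3 covariance of R, G, B
    eigvals, eigvecs = np.linalg.eigh(cov)      # columns of eigvecs are components
    return eigvecs @ (np.asarray(alphas) * eigvals)

def pca_color_augment(img, rng, sigma=0.1):
    """Return a float image in [0, 1] with a random PCA-based color shift."""
    alphas = rng.normal(0.0, sigma, size=3)
    shifted = img.astype(np.float64) / 255.0 + pca_color_delta(img, alphas)
    return np.clip(shifted, 0.0, 1.0)

rng = np.random.default_rng(0)
# Synthetic "mainly purple" image: red and blue vary a lot, green very little.
red = rng.integers(0, 256, size=(32, 32))
green = rng.integers(0, 13, size=(32, 32))
blue = rng.integers(0, 256, size=(32, 32))
purple = np.stack([red, green, blue], axis=-1).astype(np.uint8)

delta = pca_color_delta(purple, alphas=[0.1, 0.1, 0.1])
augmented = pca_color_augment(purple, rng)
print(delta, augmented.shape)
```

For this synthetic image the green component of the offset is much smaller than the larger of the red and blue components, which matches the description above.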

A pretty common way of implementing data augmentation is to have one thread or multiple threads responsible for loading the data and implementing distortions, and then passing that to some other thread or some other process that then does the training. Often the loading and the training can run in parallel.
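This loader-and-trainer arrangement can be sketched with Python’s standard `threading` and `queue` modules. The “loading”, “distortion”, and “training” steps below are placeholders (lists of numbers instead of images), and the function names are hypothetical.

```python
import queue
import threading

def loader(n_batches, out_q):
    """Producer thread: load a batch, distort it, hand it to the trainer."""
    for i in range(n_batches):
        batch = [i, i + 1, i + 2]      # stand-in for loading images from disk
        distorted = batch[::-1]        # stand-in for a distortion such as mirroring
        out_q.put(distorted)           # blocks if the trainer falls behind
    out_q.put(None)                    # sentinel: no more data

def train(in_q, seen):
    """Consumer: pull distorted batches and 'train' on them."""
    while True:
        batch = in_q.get()
        if batch is None:
            break
        seen.append(batch)             # stand-in for one training step

batches = queue.Queue(maxsize=4)       # small buffer between the two sides
seen = []
worker = threading.Thread(target=loader, args=(5, batches))
worker.start()
train(batches, seen)                   # runs while the loader keeps producing
worker.join()
print(len(seen), seen[0])
```

The bounded queue keeps the loader from running arbitrarily far ahead of training while still letting the two overlap.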

A good place to get started might be to use someone else’s open source implementation for how they use data augmentation.

## The state of computer vision

Deep learning has been successfully applied to computer vision, natural language processing, speech recognition, online advertising, logistics, many, many, many problems.

Image recognition was the problem of looking at a picture and telling you, is this a cat or not? Whereas object detection is looking at a picture and actually putting the bounding boxes, telling you where in the picture the objects, such as the cars, are as well. And because it is more expensive to label the objects and the bounding boxes, we tend to have less data for object detection than for image recognition.

On average, when you have a lot of data, you tend to find people getting away with using simpler algorithms as well as less hand engineering. So there’s just less need to carefully design features for the problem.

• But instead you can have a giant neural network, even a simpler architecture and have a neural network just learn whatever it wants to learn when you have a lot of data.
• Whereas in contrast, when you don’t have that much data, then, on average, you see people engaging in more hand engineering.

Two sources of knowledge :

• labeled data
• hand engineering features / network architectures / other components of your system

And someone that is insightful with hand engineering will get better performance.

If you look at the computer vision literature, look at the set of ideas out there, you’ll also find that people are really enthusiastic. They’re really into doing well on standardized benchmark data sets and on winning competitions. And for computer vision researchers, if you do well on the benchmarks it’s easier to get the paper published. So there is just a lot of attention on doing well on these benchmarks.

• And the positive side of this is that it helps the whole community figure out what are the most effective algorithms
• But you also see in the papers that people do things that allow you to do well on a benchmark, but that you wouldn’t really use in production or in a system that you deploy in an actual application.

Tips for doing well on benchmarks / winning competitions

• Ensembling

Train several neural networks independently and average their outputs

But it’s almost never used in production to serve actual customers, I guess unless you have a huge computational budget and don’t mind burning a lot more of it per customer image.
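As a minimal sketch of ensembling, the function below averages the output probabilities of several models. The “models” here are hypothetical stand-ins that return fixed probabilities; in practice each would be a separately trained network, which is also why the cost per image grows with the number of models.

```python
import numpy as np

def ensemble_predict(models, x):
    """Run every model on x and average their outputs (not their weights)."""
    outputs = [model(x) for model in models]
    return np.mean(outputs, axis=0)

def make_model(probs):
    """Stand-in for a trained network: always returns the same class probabilities."""
    return lambda x: np.array(probs)

models = [make_model([0.6, 0.4]), make_model([0.8, 0.2]), make_model([0.7, 0.3])]
averaged = ensemble_predict(models, x=None)
print(averaged)
```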

• Multi-crop at test time

Take the central crop. Then, take the four corner crops. Run these images through your classifier and then average the results.
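The central-plus-four-corners procedure can be sketched as follows. The crop size, the function names, and the toy classifier (which just reports mean brightness) are assumptions for illustration.

```python
import numpy as np

def five_crops(img, size):
    """Return the central crop followed by the four corner crops."""
    H, W, _ = img.shape
    center = ((H - size) // 2, (W - size) // 2)
    corners = [(0, 0), (0, W - size), (H - size, 0), (H - size, W - size)]
    return [img[top:top + size, left:left + size, :]
            for top, left in [center] + corners]

def multi_crop_predict(classifier, img, size):
    """Run the classifier on every crop and average the results."""
    return np.mean([classifier(crop) for crop in five_crops(img, size)], axis=0)

def toy_classifier(crop):
    """Stand-in classifier: 'probabilities' derived from mean brightness."""
    m = crop.mean()
    return np.array([m, 1.0 - m])

image = np.full((10, 10, 3), 0.25)
crops = five_crops(image, 6)
prediction = multi_crop_predict(toy_classifier, image, 6)
print(len(crops), prediction)
```

Like ensembling, this multiplies the test-time cost, here by the number of crops.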

And a neural network that works well on one vision problem often, maybe surprisingly, will work on other vision problems as well. So, to build a practical system, you often do well starting off with someone else’s neural network architecture.

• And you can use an open source implementation if possible because the open source implementation might have figured out all the finicky details.
• But if you have the computer resources and the inclination, don’t let me stop you from training your own networks from scratch. And, in fact, if you want to invent your own computer vision algorithm, that’s what you might have to do.