1 Foundations of Convolutional Neural Networks

Computer vision

Rapid advances in computer vision are enabling brand-new applications.
Even if you don’t end up building computer vision systems per se, the computer vision research community has been so creative and inventive in coming up with new neural network architectures and algorithms that its ideas inspire a lot of cross-fertilization into other areas as well.

Some examples of computer vision problems:

Image classification, sometimes also called image recognition
Object detection
Neural style transfer

One of the challenges of computer vision problems is that the inputs can get really big. To work with such large images, you need to implement the convolution operation.

Edge detection example

The early layers of a neural network might detect edges.
The somewhat later layers might detect parts of objects.
Even later layers might detect complete objects.

To detect edges, you construct a small matrix (for example 3 x 3) and convolve the image with it. In the terminology of convolutional neural networks, this matrix is called a filter. Research papers sometimes call it a kernel instead of a filter.
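The filter idea above can be sketched in a few lines of NumPy. This is only an illustration: the 6 x 6 toy image, the particular vertical edge filter, and the helper name `conv2d_valid` are assumptions, not something these notes specify.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the filter and the patch under it, then sum.
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# Assumed toy image: bright (10) on the left half, dark (0) on the right half.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Assumed vertical edge filter: positive left column, negative right column.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

edges = conv2d_valid(image, vertical_edge)
print(edges)  # every row is [0, 30, 30, 0]: the edge shows up in the middle
```

The output is large only where the filter straddles the bright-to-dark boundary, which is the sense in which the filter "detects" a vertical edge.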

More edge detection

Sobel filter: [latex]\begin{bmatrix} 1 & 0 & -1\\ 2 & 0 & -2\\ 1 & 0 & -1 \end{bmatrix}[/latex]
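As a rough sketch of how the Sobel filter behaves, the snippet below applies it to two assumed toy images, one going from light to dark and one from dark to light. The images, the helper `cross_correlate`, and the use of the transposed matrix as a horizontal edge detector are illustrative assumptions.

```python
import numpy as np

def cross_correlate(image, kernel):
    """No padding, stride 1: sum of element-wise products at each position."""
    f = kernel.shape[0]
    out_h = image.shape[0] - f + 1
    out_w = image.shape[1] - f + 1
    return np.array([[np.sum(image[i:i + f, j:j + f] * kernel)
                      for j in range(out_w)]
                     for i in range(out_h)])

sobel_vertical = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]], dtype=float)
# Assumption for illustration: the transpose responds to horizontal edges.
sobel_horizontal = sobel_vertical.T

# Assumed toy images with a single vertical edge in the middle.
light_to_dark = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
dark_to_light = light_to_dark[:, ::-1]

resp_ld = cross_correlate(light_to_dark, sobel_vertical)
resp_dl = cross_correlate(dark_to_light, sobel_vertical)
resp_h = cross_correlate(light_to_dark, sobel_horizontal)

print(resp_ld[0])  # [ 0. 40. 40.  0.]   positive response: light to dark
print(resp_dl[0])  # [  0. -40. -40.   0.]   negative response: dark to light
print(resp_h[0])   # [0. 0. 0. 0.]   no horizontal edge in this image
```

The sign of the response tells you the direction of the transition; taking the absolute value would discard that distinction.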

Padding

If we have an n-by-n image and convolve it with an f-by-f filter, then the dimension of the output will be [latex](n-f+1) \times (n-f+1)[/latex]. There are two downsides to this:

Every time you apply a convolution operator, your image shrinks.
A pixel at a corner or edge of the image is used in far fewer outputs than a pixel in the middle; a corner pixel is touched by only one of the outputs.

To solve both of these problems, you can pad the image before applying the convolution operation, for example with an extra border of one pixel all around. With a padding of p, the output becomes [latex](n+2p-f+1) \times (n+2p-f+1)[/latex]. Padding also reduces the effect of, maybe not quite throwing away, but counting less the information from the corners and edges of the image.

How much to pad:

Valid convolution: this basically means no padding.
Same convolution: you pad so that the output size is the same as the input size, which for stride 1 means [latex]p = \frac{f-1}{2}[/latex].
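The output-size formula and the two padding choices can be checked with a small sketch. The helper names `output_size` and `same_padding` are made up for illustration, and padding with zeros is an assumption (the notes do not say which value to pad with).

```python
import numpy as np

def output_size(n, f, p=0):
    """Output side length for an n x n input, f x f filter, padding p, stride 1."""
    return n + 2 * p - f + 1

def same_padding(f):
    """Padding that keeps the output the same size as the input (stride 1, odd f)."""
    if f % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd filter size")
    return (f - 1) // 2

n, f = 6, 3
image = np.ones((n, n))

p = same_padding(f)
# Assumption: pad with zeros on every side.
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)

print(output_size(n, f))      # 4  (valid convolution: the image shrinks)
print(padded.shape)           # (8, 8)
print(output_size(n, f, p))   # 6  (same convolution: size is preserved)
```

With f = 3 a one-pixel border is enough; with f = 5 the same helper gives p = 2.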

You rarely see even-sized filters used in computer vision, for two reasons:

If f were even, you would need asymmetric padding.
An odd-sized filter has a central position, so there is a central pixel you can use to talk about where the filter sits.

Strided convolutions

If you have an [latex]n \times n[/latex] image that you convolve with an [latex]f \times f[/latex] filter, with padding p and stride s, then the output will have dimension [latex]\left \lfloor \frac{n+2p-f}{s} + 1 \right \rfloor \times \left \lfloor \frac{n+2p-f}{s} + 1 \right \rfloor[/latex]. If the fraction is not an integer, we round it down; [latex]\left \lfloor z \right \rfloor[/latex] denotes the floor of z.

Technically, what we’re actually doing is sometimes called cross-correlation instead of convolution, because the mathematical definition of convolution flips the filter before sliding it over the image. But in the deep learning literature, by convention we just call this a convolution operation.
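A minimal sketch of a strided convolution, assuming zero padding and a square filter, shows the rounding-down behaviour and the relationship to the flipped-filter definition. The function names and the toy 7 x 7 input are assumptions for illustration.

```python
import numpy as np

def output_size(n, f, p, s):
    """floor((n + 2p - f) / s + 1), using integer floor division."""
    return (n + 2 * p - f) // s + 1

def cross_correlate2d(image, kernel, p=0, s=1):
    """What deep learning calls 'convolution': no filter flip."""
    padded = np.pad(image, p, mode="constant")  # assumption: zero padding
    f = kernel.shape[0]
    out_h = output_size(image.shape[0], f, p, s)
    out_w = output_size(image.shape[1], f, p, s)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The window moves s pixels at a time instead of 1.
            patch = padded[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.array([[1, 0, -1]] * 3, dtype=float)

strided = cross_correlate2d(image, kernel, p=0, s=2)
print(strided.shape)             # (3, 3)
print(output_size(6, 3, 0, 2))   # 2, because (6 - 3) / 2 + 1 = 2.5 is rounded down

# Mathematical convolution flips the filter along both axes first.
true_conv = cross_correlate2d(image, np.flip(kernel), p=0, s=1)
```

Since the filter weights are learned anyway, skipping the flip does not change what the network can represent, which is one way to see why the convention is harmless.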

Convolutions over volumes

To convolve an image that has three channels, you use not a 3 x 3 filter as before but a 3D filter that is 3 x 3 x 3, so the filter itself also has three layers, one per input channel. The number of channels in the filter must match the number of channels in the input. By using several filters, you can detect two features or maybe several hundred different features, and the output will then have a number of channels equal to the number of features you are detecting.
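The shapes involved can be made concrete with a short sketch. The array layout (height, width, channels) for the image and (f, f, channels, number of filters) for the filters is an assumed convention, and the random values are placeholders.

```python
import numpy as np

def conv_volume(image, filters):
    """image: (n_h, n_w, n_c); filters: (f, f, n_c, n_filters).

    Returns an array of shape (n_h - f + 1, n_w - f + 1, n_filters).
    """
    n_h, n_w, n_c = image.shape
    f, _, n_c_filter, n_filters = filters.shape
    assert n_c == n_c_filter, "filter depth must match the input channels"
    out = np.zeros((n_h - f + 1, n_w - f + 1, n_filters))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f, j:j + f, :]  # an f x f x n_c volume
            for k in range(n_filters):
                # Each filter collapses the whole volume to a single number.
                out[i, j, k] = np.sum(patch * filters[:, :, :, k])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))       # assumed 6 x 6 input, 3 channels
filters = rng.standard_normal((3, 3, 3, 2))  # two 3 x 3 x 3 filters

out = conv_volume(image, filters)
print(out.shape)  # (4, 4, 2): one output channel per filter
```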

One layer of a convolutional network

Suppose you have 10 filters, not just 2, that are each 3 x 3 x 3 in one layer of a neural network. How many parameters does this layer have? Each filter is a 3 x 3 x 3 volume, so each filter has 27 numbers to be learned. Adding the bias, the b parameter, gives 28 parameters per filter. With 10 filters, that is 28 x 10 = 280 parameters altogether.

Size of the output: [latex]n_H^{[l]} = \left \lfloor \frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right \rfloor[/latex], and likewise for [latex]n_W^{[l]}[/latex].

Size of each filter: [latex]f^{[l]} \times f^{[l]} \times n_c^{[l-1]}[/latex]
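The parameter count above is easy to reproduce. The helper name `conv_layer_params` is made up for this sketch; it assumes one bias per filter, as in the worked example.

```python
def conv_layer_params(f, n_c_prev, n_filters):
    """Learnable parameters in one conv layer: weights plus one bias per filter."""
    weights_per_filter = f * f * n_c_prev
    return (weights_per_filter + 1) * n_filters

print(conv_layer_params(f=3, n_c_prev=3, n_filters=10))  # 280
print(conv_layer_params(f=3, n_c_prev=3, n_filters=2))   # 56
```

Note that the input height and width do not appear in the function at all: the count stays at 280 whether the image is small or very large.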

A simple convolution network example

A lot of the work in designing a convolutional neural net is selecting hyperparameters like these: deciding the filter size, the stride, the padding, and how many filters to use.

Types of layer in a convolutional network:

Convolution
Pooling
Fully connected
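To see how the hyperparameters (filter size, stride, padding, number of filters) determine the shape of each layer's output, the sketch below traces the shape through a stack of convolution layers. The input size and all layer settings are hypothetical values chosen for illustration, not a recommended architecture.

```python
def conv_output_shape(n_h, n_w, f, p, s, n_filters):
    """Shape after one conv layer: floor((n + 2p - f) / s + 1) per side."""
    out_h = (n_h + 2 * p - f) // s + 1
    out_w = (n_w + 2 * p - f) // s + 1
    return out_h, out_w, n_filters

# Hypothetical hyperparameters for three conv layers.
layers = [
    {"f": 3, "p": 0, "s": 1, "n_filters": 10},
    {"f": 5, "p": 0, "s": 2, "n_filters": 20},
    {"f": 5, "p": 0, "s": 2, "n_filters": 40},
]

shape = (39, 39, 3)  # assumed input: 39 x 39 image with 3 channels
shapes = [shape]
for layer in layers:
    n_h, n_w, _ = shape
    shape = conv_output_shape(n_h, n_w, **layer)
    shapes.append(shape)
    print(shape)  # (37, 37, 10), then (17, 17, 20), then (7, 7, 40)

# Unrolling the last volume gives the input to a fully connected layer.
flattened = shape[0] * shape[1] * shape[2]
print(flattened)  # 1960
```

The height and width shrink as you go deeper while the number of channels grows, which is a pattern such stacks commonly follow.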

Pooling layers

Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of their representation, which speeds up computation, and to make some of the features they detect a bit more robust.

Suppose you have a 4x4 input and you want to apply a type of pooling called max pooling. The output of this particular implementation of max pooling will be 2x2. The way you do that is quite simple: take your 4x4 input and break it into four 2x2 regions, one per quadrant. Then each value of the 2x2 output is just the max of the corresponding region. This is the same as applying a filter of size 2 with a stride of 2.

What the max operation does is this: as long as a feature is detected anywhere in one of these quadrants, it remains preserved in the output of max pooling. If the feature is detected anywhere in the filter’s region, the output keeps a high number. But if the feature is not detected, so maybe it doesn’t exist in the upper right-hand quadrant, then the max of all those numbers is itself still quite small. So maybe that’s the intuition behind max pooling. The main reason people use max pooling, though, is that it has been found in a lot of experiments to work well.

Average pooling: this is pretty much what you’d expect. Instead of taking the max within each filter region, you take the average. These days max pooling is used much more often than average pooling.

One thing to note about pooling is that it has no parameters to learn.
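Both kinds of pooling can be sketched with one small function. The 4x4 input values and the function name `pool2d` are assumptions for illustration; the filter size of 2 and stride of 2 match the quadrant-by-quadrant description above.

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Max or average pooling over f x f windows moved with stride s."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reduce = np.max if mode == "max" else np.mean
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = reduce(x[i * s:i * s + f, j * s:j * s + f])
    return out

# Assumed 4x4 input.
x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)

max_pooled = pool2d(x, mode="max")
avg_pooled = pool2d(x, mode="average")
print(max_pooled)  # [[9. 2.] [6. 3.]]
print(avg_pooled)  # [[3.75 1.25] [3.75 2.  ]]
```

The function holds no weights: `f`, `s`, and `mode` are fixed choices, not values learned by training.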

Convolutional neural network example

It turns out that in the ConvNet literature, there are two slightly inconsistent conventions about what you call a layer.

One convention is that a Conv layer followed by a Pool layer is called one layer, so the first such pair is Layer 1 of the neural network.
Another convention is to count the Conv layer as one layer and the Pool layer as another.

When people report the number of layers in a neural network, they usually report just the number of layers that have weights, that is, parameters. Because the pooling layer has no weights, only a few hyperparameters, it is natural under this practice to count a Conv layer and its Pool layer together as one layer.

One common guideline is not to try to invent your own settings of hyperparameters, but to look in the literature to see what hyperparameters have worked for others, and to choose an architecture that has worked well for someone else. There’s a chance it will work for your application as well.

Why convolutions?

Parameter sharing: A feature detector (such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
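A rough way to see what these two properties buy you is to compare parameter counts for a convolutional layer and a fully connected layer that map between the same shapes. The sizes below (a 32 x 32 x 3 input and six 5 x 5 filters) are assumed for illustration, as are the helper names.

```python
def conv_params(f, n_c_prev, n_filters):
    """Weights plus one bias per filter; shared across all image positions."""
    return (f * f * n_c_prev + 1) * n_filters

def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

in_h, in_w, in_c = 32, 32, 3                  # assumed input volume
f, n_filters = 5, 6                           # assumed filter size and count
out_h, out_w = in_h - f + 1, in_w - f + 1     # 28 x 28, no padding, stride 1

n_inputs = in_h * in_w * in_c                 # 3072
n_outputs = out_h * out_w * n_filters         # 4704

conv_count = conv_params(f, in_c, n_filters)
dense_count = dense_params(n_inputs, n_outputs)
inputs_per_output = f * f * in_c              # sparsity of connections

print(conv_count)          # 456
print(dense_count)         # 14455392, roughly 14 million
print(inputs_per_output)   # 75 of the 3072 inputs feed each output value
```

Parameter sharing is why the convolutional count ignores the image size, and sparsity of connections is why each output looks at only 75 inputs rather than all 3072.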