Speed up by ignoring std::cout

Test Code

  #include <iostream>

  int main(int argc, char *argv[])
  {
  	// t1_2 only: put std::cout into a failed state so every << becomes a no-op
  	//std::cout.setstate(std::ios_base::badbit);
  	for(int i = 0; i < 100; i ++) {
  		for(int j = 0; j < 100; j ++) {
  			// t1_0 and t1_2: enable the output statement below
  			;//std::cout << "" << std::endl;
  		}
  	}
  	return 0;
  }

Test it without std::cout

  root@imx6ul7d:~/tmp# time ./t1_1

  real    0m0.041s
  user    0m0.020s
  sys     0m0.000s

Test it with std::cout, but redirect to /dev/null

  root@imx6ul7d:~/tmp# time ./t1_0 > /dev/null

  real    0m0.096s
  user    0m0.030s
  sys     0m0.030s

Test it with std::cout, but set io state

  root@imx6ul7d:~/tmp# time ./t1_2

  real    0m0.061s
  user    0m0.040s
  sys     0m0.000s

Profile : Linux

Example 1

Add -pg

  arm-linux-gnueabihf-g++ -Wall -g -pg hello.cpp -o hello -std=c++17


  root@imx6ul7d:~# gprof -b hello
  Flat profile:

  Each sample counts as 0.01 seconds.
    %   cumulative   self              self     total
   time   seconds   seconds    calls   s/call   s/call  name
  100.00      3.45     3.45        1     3.45     3.45  hehe()
    0.00      3.45     0.00        4     0.00     0.00  std::_Optional_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::_M_is_engaged() const
  ......

3 Sequence models & Attention mechanism

Various sequence to sequence architectures

Sequence to sequence models are useful for everything from machine translation to speech recognition.

  • machine translation
  • image captioning

Picking the most likely sentence


Beam Search

You don’t want to output a random English translation, you want to output the best and the most likely English translation. Beam search is the most widely used algorithm to do this.

Whereas greedy search picks only the single most likely word at each step and moves on, beam search considers multiple alternatives. It has a parameter B, called the beam width, which is the number of candidates kept at each step.

What we ultimately care about in the second step is finding the most likely pair of first and second words, not just the most likely second word on its own.

With a beam width of three and a vocabulary of 10,000 words, you evaluate all of these 30,000 options according to the probability of the first and second words and then pick the top three. Because the beam width is three, at every step you instantiate three copies of the network, each with a different choice for the first words, to evaluate these partial sentence fragments and their outputs.

Beam search will usually find a much better output sentence than greedy search.
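
The search described above can be sketched in a few lines. This is a minimal illustration, not the course's implementation: `next_log_probs` stands in for the decoder network, and `toy_model` with its three-word table is a made-up example chosen so that the greedy choice is not the best one.

```python
import math

def beam_search(next_log_probs, beam_width, max_len, eos="<EOS>"):
    """Return up to beam_width (tokens, log_prob) hypotheses, best first."""
    beams = [([], 0.0)]   # (tokens so far, summed log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for word, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [word], score + logp))
        # Keep only the beam_width most likely partial sentences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)
    return sorted(finished, key=lambda c: c[1], reverse=True)[:beam_width]

def toy_model(prefix):
    """Hypothetical next-word distribution; a real system would call the RNN."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "x": 0.1},
    }
    probs = table.get(tuple(prefix), {"<EOS>": 1.0})
    return {w: math.log(p) for w, p in probs.items()}

best_tokens, best_logp = beam_search(toy_model, beam_width=2, max_len=5)[0]
greedy_tokens, greedy_logp = beam_search(toy_model, beam_width=1, max_len=5)[0]
```

With a beam width of 1 this reduces to greedy search, which commits to "a" and ends with probability 0.3. With a beam width of 2 the sentence starting with "b" survives the first step and wins with probability 0.36.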

Refinements to Beam Search

Length normalization is a small change to the beam search algorithm that can help you get much better results.

Multiplying many probabilities that are all less than 1 gives a very small number, which can cause numerical underflow, meaning the value is too small for the floating point representation in your computer to store accurately.

So in most implementations, you keep track of the sum of the logs of the probabilities rather than the product of the probabilities : \(\arg\max_y \sum _{t=1}^{T_y} log P(y^{<t>}|x, y^{<1>}, \ldots, y^{<t-1>})\)

Instead of using this as the objective you’re trying to maximize, one thing you could do is normalize it by the number of words in your translation. This takes the average of the log probability of each word, and it significantly reduces the penalty for outputting longer translations.

In practice, as a heuristic, instead of dividing by \(T_y\), the number of words in the output sentence, you sometimes use a softer approach and divide by \(T_y^{\alpha}\), where maybe \(\alpha = 0.7\) : \(\frac{1}{T_y^{\alpha}} \sum _{t=1}^{T_y} log P(y^{<t>}|x, y^{<1>}, \ldots, y^{<t-1>})\). If \(\alpha = 1\), you are completely normalizing by length. If \(\alpha = 0\), then \(T_y^{0} = 1\) and you are not normalizing at all. A value such as 0.7 is somewhere in between full normalization and no normalization. \(\alpha\) is another hyperparameter of the algorithm that you can tune to try to get the best results.

Among the sentences beam search has seen, pick the one that achieves the highest value on this normalized log probability objective. Sometimes it’s called a normalized log likelihood objective.
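
As a small numeric illustration of length normalization, the sketch below scores two made-up candidate translations from their per-word probabilities. The probabilities and the helper name `normalized_log_prob` are assumptions for the example only.

```python
import math

def normalized_log_prob(word_probs, alpha=0.7):
    """Sum of log probabilities divided by T_y ** alpha."""
    t_y = len(word_probs)
    return sum(math.log(p) for p in word_probs) / (t_y ** alpha)

short = [0.5, 0.5]               # 2 words
longer = [0.6, 0.6, 0.6, 0.6]    # 4 words, each individually more likely

# alpha = 0 is the unnormalized sum of logs, which favors the short sentence.
raw_prefers_short = normalized_log_prob(short, 0.0) > normalized_log_prob(longer, 0.0)
# alpha = 0.7 removes most of the length penalty, so the longer sentence wins.
norm_prefers_longer = normalized_log_prob(longer, 0.7) > normalized_log_prob(short, 0.7)
```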

In production systems, it’s not uncommon to see a beam width maybe around 10.

Exact search algorithms : 

  • BFS, Breadth First Search
  • DFS, Depth First Search

Compared with these exact search algorithms, beam search runs much faster but is not guaranteed to find the exact maximum of the arg max that you would like to find.

Error analysis in beam search

Beam search is an approximate search algorithm, also called a heuristic search algorithm.

Error analysis lets you figure out whether it is the beam search algorithm that is causing problems and is worth spending time on, or whether it is your RNN model.

Model : 

  • RNN model (neural network model or sequence to sequence model)
    • It’s really an encoder and a decoder.
    • P(y|x)
  • Beam search algorithm
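Let \(y^{*}\) be the human translation and \(\hat y\) the beam search output. Use the RNN model to compute \(P(y^{*}|x)\) and \(P(\hat y|x)\), then compare them :

  • \(P(y^{*}|x) > P(\hat y|x)\) : beam search is at fault, because it failed to find the more probable sentence.
  • \(P(y^{*}|x) \leq P(\hat y|x)\) : the RNN model is at fault, because it rates the worse sentence as at least as probable.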
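
A minimal sketch of this error analysis, assuming you have already used the RNN model to compute both probabilities for each dev set mistake. The function name and the numbers in `examples` are made up for illustration.

```python
from collections import Counter

def attribute_error(p_human, p_beam):
    """p_human = P(y*|x), p_beam = P(y_hat|x), both computed by the RNN model."""
    return "beam search" if p_human > p_beam else "RNN model"

# Hypothetical (P(y*|x), P(y_hat|x)) pairs for three mistranslated sentences.
examples = [(2e-10, 1e-10), (1e-12, 5e-12), (3e-9, 1e-9)]
fault_counts = Counter(attribute_error(h, b) for h, b in examples)
```

Tallying the faults over many examples shows which of the two components is worth spending time on.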



Bleu Score

How to evaluate a machine translation system

The way this is done conventionally is through something called the BLEU score.

Given a machine generated translation, the BLEU score lets you automatically compute a score that measures how good that translation is.

BLEU, by the way, stands for bilingual evaluation understudy.

The intuition behind the BLEU score is that we look at the machine generated output and see whether the types of words it generates appear in at least one of the human generated references.

The reason the BLEU score was revolutionary for machine translation was because this gave a pretty good, by no means perfect, but pretty good single real number evaluation metric. And so that accelerated the progress of the entire field of machine translation.

Today, BLEU score is used to evaluate many systems that generate text, such as machine translation systems, as well as the example I showed briefly earlier of image captioning systems.
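
One building block of BLEU is the modified n-gram precision, where each n-gram in the machine output is credited at most as many times as it appears in any single reference. The sketch below shows only that component, under the assumption of whitespace tokenization and lowercase text; the full BLEU score also combines several n-gram sizes and a brevity penalty, which are not shown here.

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate against reference sentences."""
    def ngrams(words):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    cand = ngrams(candidate.split())
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split()).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Credit each n-gram at most as often as it appears in one reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

refs = ["the cat is on the mat", "there is a cat on the mat"]
unigram = modified_precision("the the the the the the the", refs, n=1)
bigram = modified_precision("the cat the cat on the mat", refs, n=2)
```

The clipping is what stops a degenerate output that repeats "the" seven times from getting a perfect score.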

Attention Model Intuition

The Attention Model makes RNNs work much better.

  • It is difficult to get a neural network to memorize a super long sentence, so the performance of a plain encoder-decoder network drops as sentences get longer.
  • With an Attention Model, machine translation performance holds up on long sentences, because the model works on one part of the sentence at a time.
    • What the Attention Model computes is a set of attention weights.

Attention Model

This algorithm runs in quadratic cost, because there is one attention weight for every pair of input and output positions. In machine translation applications, where neither the input nor the output sentence is usually that long, quadratic cost may be acceptable.

Speech recognition

One of the most exciting developments with sequence-to-sequence models has been the rise of very accurate speech recognition.

A common pre-processing step for audio data is to take your raw audio clip and generate a spectrogram. This is a plot where the horizontal axis is time, the vertical axis is frequency, and the intensity of the different colors shows the amount of energy.

Once upon a time, speech recognition systems were built using phonemes, which are hand-engineered basic units of sound. But with end-to-end deep learning, we’re finding that phoneme representations are no longer necessary.


Trigger Word Detection

With the rise of speech recognition, there have been more and more devices you can wake up with your voice. These are sometimes called trigger word detection systems.

The literature on trigger word detection algorithms is still evolving, so there isn’t wide consensus yet on what the best algorithm for trigger word detection is.

One example of an algorithm you can use is an RNN. You take an audio clip, maybe compute spectrogram features, and that generates audio features \(x^{<1>},x^{<2>},x^{<3>}\) and so on. Then you pass these to an RNN, and all that remains to be done is to define the target labels \(y\). Suppose this point in the audio clip is when someone has just finished saying the trigger word, such as Alexa, xiaodunihao, hey Siri or okay Google.

Then in the training set you can set the target labels to zero for everything before that point, and set the target label to one right after it. If the trigger word is said again a little later, you again set the target label to one right after that point.

One slight disadvantage of this is that it creates a very imbalanced training set, with a lot more zeros than ones.

Instead of setting only a single time step to output one, you can make it output ones for several time steps, or for a fixed period of time, before reverting back to zero. That slightly evens out the ratio of ones to zeros, but it is a little bit of a hack.
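
The labeling scheme can be sketched as follows. The helper `make_labels`, the step counts and the choice of three ones per trigger word are illustrative assumptions, not values from the course.

```python
def make_labels(num_steps, trigger_end_steps, ones_after=1):
    """Target labels: 0 everywhere, then ones right after each trigger word ends."""
    y = [0] * num_steps
    for end in trigger_end_steps:
        for t in range(end + 1, min(end + 1 + ones_after, num_steps)):
            y[t] = 1
    return y

single = make_labels(10, [3], ones_after=1)   # one label set to 1 per trigger word
wider = make_labels(10, [3], ones_after=3)    # a few ones, slightly less imbalanced
```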

2 Natural Language Processing and Word Embeddings

Word Representation

NLP, Natural Language Processing

Word embeddings are a way of representing words that lets your algorithms automatically understand analogies like man is to woman as king is to queen, and many other examples.

The simplest representation uses a vocabulary and a one-hot vector for each word.

One of the weaknesses of this representation is that it treats each word as a thing unto itself, and it doesn’t allow an algorithm to easily generalize across words.

You sometimes see plots on the internet that visualize these 300 or higher dimensional embeddings.

To visualize them, algorithms like t-SNE map the embeddings to a much lower dimensional space.

Using Word Embeddings

Transfer learning and word embeddings

  1. Learn word embeddings from large text corpus. (1-100B words or download pre-trained embedding online.)
  2. Transfer embedding to new task with smaller training set. (say, 100k words)
  3. Optional: Continue to finetune the word embeddings with new data.

Properties of Word Embeddings

One of the most fascinating properties of word embeddings is that they can also help with analogy reasoning.

The most commonly used similarity function is called cosine similarity : \(CosineSimilarity(u,v) = \frac{u^T v}{\left \| u \right \|_2\left \| v \right \|_2} = cos(\theta)\)
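
To make the analogy reasoning concrete, here is a sketch that uses cosine similarity to complete "man is to woman as king is to ?". The two-dimensional vectors in `embeddings` are made up (roughly a gender axis and a royalty axis); real embeddings are learned and have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings: (gender, royalty).
embeddings = {
    "man": [-1.0, 0.0], "woman": [1.0, 0.0],
    "king": [-1.0, 1.0], "queen": [1.0, 1.0],
    "orange": [0.0, -1.0],
}

def complete_analogy(a, b, c):
    """Find the word w that maximizes sim(e_w, e_c - e_a + e_b)."""
    target = [ec - ea + eb for ea, eb, ec in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

answer = complete_analogy("man", "woman", "king")
```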

Embedding Matrix

When you implement an algorithm to learn a word embedding, what you end up learning is an embedding matrix.

And the columns of this matrix would be the different embeddings for the 10,000 different words you have in your vocabulary.
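
A small sketch of how the embedding matrix is used: multiplying it by a one-hot vector selects one column. The 3 x 4 matrix `E` below is a made-up stand-in for a real 300 x 10,000 matrix. In practice you would look the column up directly, since the multiplication is mostly multiplying by zeros.

```python
def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical embedding matrix: 3 features (rows) x 4 vocabulary words (columns).
E = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 1.0, 1.1, 1.2]]

e_2 = matvec(E, one_hot(2, 4))      # E times the one-hot vector of word 2
column_2 = [row[2] for row in E]    # the same embedding by direct lookup
```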


Learning Word Embeddings

It turns out that building a neural language model is a reasonable way to learn a set of embeddings.

What’s actually more commonly done is to have a fixed historical window.

Using a fixed history means that you can deal with even arbitrarily long sentences, because the input sizes are always fixed.

If your goal is to learn an embedding, researchers have experimented with many different types of context.

  • If your goal is to build a language model then it is natural for the context to be a few words right before the target word.
  • But if your goal isn’t to learn the language model per se, then you can choose other contexts.


The Word2Vec algorithm is a simpler and computationally more efficient way to learn these types of embeddings.

Skip-Gram model

Softmax : \(p(t|c) = \frac{e^{\theta ^T_t e_c}}{\sum _{j=1}^{10,000} e^{\theta ^T_j e_c}}\)

Loss Function : \(L(\hat y, y) = - \sum _{i=1}^{10,000} y_i log \hat y _i\)

The primary problem is computational speed: the softmax step is very expensive to calculate, because you need to sum over your entire vocabulary in the denominator of the softmax every time you evaluate it.
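
The cost is easy to see in code. In this sketch `theta` holds one parameter vector per vocabulary word and `e_c` is the embedding of the context word; the three-word vocabulary is a toy stand-in for the 10,000 words in the formula.

```python
import math

def softmax_p(theta, e_c):
    """p(t|c) for every target word t; theta has one row per vocabulary word."""
    logits = [sum(t * e for t, e in zip(theta_t, e_c)) for theta_t in theta]
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)                     # this sum runs over the whole vocabulary
    return [x / total for x in exps]

theta = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical parameters
probs = softmax_p(theta, [2.0, 1.0])
```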

a few solutions

    • hierarchical softmax classifier
    • negative sampling


The other variant is the Continuous Bag-Of-Words (CBOW) model, which takes the context surrounding a middle word and uses those surrounding words to try to predict the middle word.

Negative Sampling

This algorithm creates a new supervised learning problem: given a pair of words like orange and juice, predict whether it is a context-target pair. The classifier tries to distinguish between the two distributions from which you might sample a pair of words: true context-target pairs taken from the text, and pairs whose second word is picked at random from the vocabulary.

How do you choose the negative examples?

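  • sample the candidate target words according to their empirical frequency in the corpus, but then very common words like the, of and and are heavily over-represented.
  • sample the negative examples uniformly at random, with probability 1 over the vocab size, but that’s also very non-representative of the distribution of English words.
  • the authors, Mikolov et al, reported that empirically what works best is something in between, sampling in proportion to the word frequency to the power of 3/4 : \(P(w_i) = \frac{f(w_i)^{\frac{3}{4}}}{\sum _{j=1}^{10,000}f(w_j)^{\frac{3}{4}}}\), where \(f(w_i)\) is the observed frequency of the word.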
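
A short sketch of the 3/4-power heuristic. The word frequencies in `freqs` are invented for the example; the point is only that frequent words are damped and rare words are boosted relative to their raw frequency.

```python
def negative_sampling_distribution(freqs):
    """P(w_i) proportional to f(w_i) ** (3/4)."""
    weights = {w: f ** 0.75 for w, f in freqs.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

freqs = {"the": 0.90, "orange": 0.09, "durian": 0.01}   # hypothetical frequencies
p = negative_sampling_distribution(freqs)
```

Negative examples could then be drawn with `random.choices(list(p), weights=list(p.values()), k=...)` from the standard library.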

GloVe Word Vectors

GloVe stands for global vectors for word representation.

Previously we were sampling pairs of words, context and target words, by picking two words that appear in close proximity to each other in our text corpus. What the GloVe algorithm does is start by making that explicit: let \(X_{ij}\) be the number of times word i appears in the context of word j.


Sentiment Classification

Sentiment classification is the task of looking at a piece of text and telling if someone likes or dislikes the thing they’re talking about.

One of the challenges of sentiment classification is that you might not have a huge labeled training set for it. But with word embeddings, you’re able to build good sentiment classifiers even with only modest-size labeled training sets.

A simple model averages the embeddings of all the words in the text and feeds the average to a softmax classifier. One of the problems with this algorithm is that it ignores word order.

More Sophisticated Model : 


Debiasing Word Embeddings

Machine learning and AI algorithms are increasingly trusted to help with, or to make, extremely important decisions. And so we like to make sure that as much as possible that they’re free of undesirable forms of bias, such as gender bias, ethnicity bias and so on.

  • The first step is to identify the direction corresponding to the particular bias we want to reduce or eliminate.
  • The next step is a neutralization step. For every word that’s not definitional, project it to get rid of the bias, that is, remove its component along the bias direction.
  • The final step is called equalization. You might have pairs of words such as grandmother and grandfather, or girl and boy, where you want the only difference in their embeddings to be the gender.
  • Finally, the number of pairs you want to equalize is relatively small, and at least for the gender example it is quite feasible to hand-pick them.
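
The neutralization step is a vector projection, which can be sketched as below. The bias direction `g` and the embedding `e` are made-up three-dimensional vectors; the equalization step is not shown.

```python
def neutralize(e, g):
    """Remove from embedding e its component along the bias direction g."""
    scale = sum(a * b for a, b in zip(e, g)) / sum(b * b for b in g)
    return [a - scale * b for a, b in zip(e, g)]

g = [1.0, 1.0, 0.0]       # hypothetical bias direction
e = [0.4, 0.3, -0.2]      # hypothetical embedding of a non-definitional word
e_debiased = neutralize(e, g)
remaining_bias = sum(a * b for a, b in zip(e_debiased, g))
```

After neutralization the embedding is orthogonal to the bias direction, so `remaining_bias` is zero up to rounding.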

1 Recurrent Neural Networks

Why Sequence Models?

Models like recurrent neural networks or RNNs have transformed speech recognition, natural language processing and other areas.


Suppose the input is a sequence of nine words. Eventually we’re going to have nine sets of features to represent these nine words. To index into the positions in the sequence, I’m going to use \(x^{<1>}\), \(x^{<2>}\), \(x^{<3>}\) and so on up to \(x^{<9>}\).

Use \(x^{<t>}\) to index into a position in the middle of the sequence. The t implies that these are temporal sequences, but whether or not a sequence is temporal, I’m going to use the index t to index into its positions.

Use \(T_{x}\) to denote the length of the input sequence.

\(x^{(i)<t>}\) refers to the t-th element in the sequence of training example i.

\(T_{x}^{(i)}\) is the length of the input sequence of training example i.

NLP or Natural Language Processing

Use one-hot representations to represent each of these words.

What if you encounter a word that is not in your vocabulary? The answer is that you create a new token, or a new fake word, called Unknown Word, written as UNK in angle brackets, to represent words not in your vocabulary.
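
A minimal sketch of the one-hot representation with an unknown-word token. The four-word vocabulary is made up, and lowercasing plus whitespace splitting is an assumed tokenization.

```python
def build_index(vocab):
    """Map each vocabulary word to a position, with <UNK> as the last entry."""
    return {word: i for i, word in enumerate(vocab + ["<UNK>"])}

def encode(sentence, index):
    """One one-hot vector per word; out-of-vocabulary words map to <UNK>."""
    unk = index["<UNK>"]
    vectors = []
    for word in sentence.lower().split():
        vec = [0] * len(index)
        vec[index.get(word, unk)] = 1
        vectors.append(vec)
    return vectors

index = build_index(["a", "and", "harry", "potter"])
x = encode("Harry Potter and Hermione", index)
```

Here "hermione" is not in the vocabulary, so its vector has its single 1 in the `<UNK>` position.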

Recurrent Neural Network Model

Why not a standard network?


  • Inputs, outputs can be different lengths in different examples.
  • Doesn’t share features learned across different positions of text.

What a recurrent neural network does is this: when it goes on to read the second word in the sentence, say \(x^{<2>}\), instead of predicting \(\hat y^{<2>}\) using only \(x^{<2>}\), it also gets to input some information from what it computed at time-step one. At each time-step, the recurrent neural network passes its activation on to the next time-step for it to use.

One limitation of this particular neural network structure is that the prediction at a certain time uses information from the inputs earlier in the sequence but not information from later in the sequence. We will address this in a later video where we talk about bidirectional recurrent neural networks, or BRNNs.

The activation function used to compute the activations is often tanh. Sometimes ReLU is also used, although tanh is actually a pretty common choice.

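Simplified RNN notation : \(\begin{matrix}
a^{<t>} = g_1(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)\\
\hat y ^{<t>} = g_2(W_{ya}a^{<t>} + b_y)
\end{matrix}\)

Stacking \(W_a = [W_{aa} | W_{ax}]\) gives the compact form \(a^{<t>} = g_1(W_a[a^{<t-1>}, x^{<t>}] + b_a)\).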
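
One time-step of this forward pass can be sketched with plain lists. The sketch assumes \(g_1\) is tanh and \(g_2\) is a softmax, and all the weight values in the usage lines are made up.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(a_prev, x_t, W_aa, W_ax, b_a, W_ya, b_y):
    """a_t = tanh(W_aa a_prev + W_ax x_t + b_a); y_hat = softmax(W_ya a_t + b_y)."""
    z = [p + q + b for p, q, b in zip(matvec(W_aa, a_prev), matvec(W_ax, x_t), b_a)]
    a_t = [math.tanh(v) for v in z]
    logits = [p + b for p, b in zip(matvec(W_ya, a_t), b_y)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return a_t, [e / total for e in exps]

# Hypothetical sizes: 2 hidden units, 2 input features, 3 output classes.
a_t, y_hat = rnn_step(
    [0.0, 0.0], [1.0, 0.0],
    [[0.1, 0.0], [0.0, 0.1]], [[0.5, -0.5], [0.3, 0.8]], [0.0, 0.0],
    [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]], [0.0, 0.0, 0.0],
)
```

Calling `rnn_step` repeatedly, feeding each returned `a_t` back in as `a_prev`, scans the sequence from left to right.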

Backpropagation through time

As usual, when you implement this in one of the programming frameworks, often, the programming framework will automatically take care of backpropagation.

Element-wise loss function : \(L^{<t>}(\hat y ^{<t>}, y ^{<t>}) = -y ^{<t>}log \hat y ^{<t>} - (1 - y ^{<t>})log(1 - \hat y ^{<t>})\)
This is the standard logistic regression loss, also called the cross entropy loss.

Overall loss of the entire sequence : \(L(\hat y, y) = \sum _{t=1}^{T_x} L ^{<t>}(\hat y ^{<t>}, y^{<t>})\)
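
As a worked example, the sketch below sums the per-step binary cross entropy over a short sequence. The predictions and labels are made-up numbers.

```python
import math

def step_loss(y_hat, y):
    """Binary cross entropy for one time-step."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def sequence_loss(y_hats, ys):
    """Overall loss: the sum of the per-step losses."""
    return sum(step_loss(p, t) for p, t in zip(y_hats, ys))

loss = sequence_loss([0.9, 0.2, 0.7], [1, 0, 1])
```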

This is called backpropagation through time. The motivation for this name is that in forward prop you scan from left to right, with increasing indices of the time t, whereas in backpropagation you go from right to left, kind of going backwards in time.

Different types of RNNs

Language model and sequence generation

What a language model does is, given any sentence, tell you the probability of that particular sentence. This is a fundamental component of both speech recognition systems and machine translation systems, where the system wants to output only sentences that are likely.

How do you build a language model?

  • First you need a training set comprising a large corpus of English text, or text from whatever language you want to build a language model of. The word corpus is NLP terminology that just means a large body or a very large set of sentences.
    • The first thing you would do is tokenize each sentence. That means you form a vocabulary, as we saw in an earlier video, and then map each of the words to, say, one-hot vectors or to indices in your vocabulary.
    • One thing you might also want to do is model when sentences end. So another common thing to do is to add an extra token called EOS.
  • Go on to build the RNN model.
    • At the first step, \(a^{<1>}\) is used to make a softmax prediction \(\hat y^{<1>}\) of the probability of the first word. It is a softmax over the whole vocabulary: what is the probability of any word in the dictionary being the first word?
    • Then the RNN steps forward and passes the activation \(a^{<1>}\) on to the next step. At this step, the job is to figure out what the second word is, given the correct first word.
    • This continues, each step predicting the next word given everything that comes before, and hopefully at the end it predicts a high chance of the EOS end of sentence token.

Sampling novel sequences

After you train a sequence model, one of the ways you can informally get a sense of what it has learned is to sample novel sequences from it.

  • What you want to do is first sample the first word you want your model to generate, according to the softmax distribution \(\hat y^{<1>}\). Then feed the sampled word in as the input to the next time-step, and keep going until you sample EOS or reach a maximum length.

Then you will have generated a randomly chosen sentence from your RNN language model.

  • word level RNN
  • character level RNN
    • advantage : you don’t ever have to worry about unknown word tokens.
    • disadvantage : you end up with much longer sequences.
      • so they are not in widespread use today, except maybe for specialized applications where you need to deal with a lot of unknown or out-of-vocabulary words.

Vanishing gradients with RNNs

It turns out the basic RNN we’ve seen so far is not very good at capturing very long-term dependencies.

  • It turns out that vanishing gradients tend to be the bigger problem when training RNNs.
  • When exploding gradients happen, though, it can be catastrophic, because the exponentially large gradients can make your parameters so large that your neural network gets really messed up. Exploding gradients are easier to spot, because the parameters just blow up and you might often see NaNs, or not a number, meaning the result of a numerical overflow in your neural network computation.
    • If you do see exploding gradients, one solution is to apply gradient clipping. All that means is: look at your gradient vectors, and if they are bigger than some threshold, re-scale them so that they are not too big, so they are clipped according to some maximum value. So if your derivatives explode or you see NaNs, just apply gradient clipping. It is a relatively robust solution that will take care of exploding gradients.
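
One common variant of gradient clipping re-scales the whole gradient vector when its norm exceeds a threshold. The sketch below shows that variant as an assumption; clipping each element to a maximum absolute value is another option, and the notes do not say which one is meant.

```python
import math

def clip_by_norm(grad, max_norm):
    """Re-scale grad so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)     # norm 5 is scaled down to 1
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is left alone
```

Re-scaling keeps the direction of the gradient and only shortens the step.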

Gated Recurrent Unit(GRU)

The Gated Recurrent Unit is a modification to the RNN hidden layer that makes it much better at capturing long range connections and helps a lot with the vanishing gradient problem.

The GRU is going to have a new variable called c, which stands for memory cell. What the memory cell does is provide a bit of memory to remember information from earlier in the sequence. At each time-step you compute a candidate value for it : \(\tilde{c} ^{<t>} = tanh (W_c [c ^{<t-1>}, x ^{<t>}] + b_c)\)

The important idea of the GRU : \(\begin{matrix}
\Gamma _u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u) \\
c^{<t>} = \Gamma _u * \tilde{c} ^{<t>} + (1 - \Gamma _u) * c^{<t-1>}
\end{matrix}\)
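
A scalar sketch of this simplified GRU update shows how the gate switches between keeping and overwriting the memory cell. The function `gru_step`, the scalar weights and the extreme gate biases are all illustrative assumptions; a real GRU uses vectors and matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(c_prev, x_t, w_c, b_c, w_u, b_u):
    """Scalar simplified GRU; w_c and w_u are (weight on c_prev, weight on x_t)."""
    c_tilde = math.tanh(w_c[0] * c_prev + w_c[1] * x_t + b_c)   # candidate value
    gamma_u = sigmoid(w_u[0] * c_prev + w_u[1] * x_t + b_u)     # update gate
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev

# Gate close to 0: the memory cell keeps its old value.
kept = gru_step(0.8, 1.0, (0.5, 0.5), 0.0, (0.0, 0.0), -50.0)
# Gate close to 1: the memory cell is overwritten by the candidate.
updated = gru_step(0.8, 1.0, (0.5, 0.5), 0.0, (0.0, 0.0), 50.0)
```

Because a gate near 0 copies the old value almost unchanged, the cell can carry information across many time-steps, which is what helps with vanishing gradients.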

LSTM(long short term memory)unit

The long short term memory (LSTM) unit is even more powerful than the GRU.


Perhaps the most common variation is that, instead of having the gate values depend only on \(a^{<t-1>}\) and \(x^{<t>}\), people sometimes also sneak in the value \(c^{<t-1>}\) as well. This is called a peephole connection.


GRU : 

  • relatively recent invention
  • a simpler model, so it is easier to build a much bigger network. It only has two gates, so computationally it runs a bit faster and scales to building somewhat bigger models.

LSTM : 

  • actually came much earlier
  • more powerful and more flexible since it has three gates instead of two.

LSTM has been the historically more proven choice.

Bidirectional RNN

Bidirectional RNNs let you, at a point in time, take information from both earlier and later in the sequence.

In fact, for a lot of NLP problems, a bidirectional RNN with LSTM units is commonly used.

The disadvantage of the bidirectional RNN is that you do need the entire sequence of data before you can make predictions anywhere.

Deep RNNs

The different versions of RNNs you’ve seen so far will already work quite well by themselves. But for learning very complex functions sometimes it’s useful to stack multiple layers of RNNs together to build even deeper versions of these models.

For RNNs, having three layers is already quite a lot. Because of the temporal dimension, these networks can already get quite big even if you have just a small handful of layers, and you don’t usually see them stacked up to 100 layers. One thing you do see sometimes is recurrent layers stacked on top of each other, whose output is then fed into a bunch of deep layers that are not connected horizontally, forming a deep network that finally predicts \(\hat y^{<1>}\).

4 Special applications : Face recognition & Neural style transfer

What is face recognition?

Liveness detection

Face Verification

  • Input image, name/ID
  • Output whether the input image is that of the claimed person

Face Recognition

  • Has a database of K persons
  • Get an input image
  • Output ID if the image is any of the K persons (or “not recognized”)

In fact, if you have a database of a hundred persons, the verification accuracy probably needs to be quite a bit higher than 99% for recognition to work well.

One-shot learning

One of the challenges of face recognition is that you need to solve the one-shot learning problem. What that means is that, for most face recognition applications, you need to recognize a person given just one single image, or given just one example of that person’s face. 

Historically, deep learning algorithms don’t work well if you have only one training example. So instead, to carry out face recognition as one-shot learning, what you’re going to do is learn a similarity function.

\(d(img1, img2) = degree\ of\ difference\ between\ images\)

If \(d(img1, img2) \leq \tau\), the two images show the same person. If \(d(img1, img2) > \tau\), they show different persons.


Siamese network

Triplet loss

One way to learn the parameters of the neural network so that it gives you a good encoding for your pictures of faces is to define and apply gradient descent on the triplet loss function.

In the terminology of the triplet loss, you always look at one anchor image. You want the distance between the anchor and a positive image, meaning an image of the same person, to be small, whereas you want the distance between the anchor and a negative example to be much larger. This is what gives rise to the term triplet loss: you are always looking at three images at a time, an anchor image, a positive image and a negative image.

Constraint : \(\left \| f(A) - f(P) \right \| ^2 - \left \| f(A) - f(N) \right \| ^2 + a \leq 0\)

Loss : \(L(A,P,N) = max (\left \| f(A) - f(P) \right \| ^2 - \left \| f(A) - f(N) \right \| ^2 + a, 0)\)

Here \(a\) is the margin.
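
The loss can be evaluated directly on encodings. In this sketch the two-dimensional encodings and the margin of 0.2 are made-up values; in a real system the encodings come from the network and are typically 128-dimensional.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + margin, 0)."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + margin, 0.0)

anchor, positive = [0.0, 0.0], [0.1, 0.0]
easy = triplet_loss(anchor, positive, [1.0, 0.0])   # negative is far away
hard = triplet_loss(anchor, positive, [0.2, 0.0])   # negative is close to the anchor
```

The easy triplet already satisfies the constraint, so its loss is zero and it contributes no gradient. Only the hard triplet gives the network something to learn from.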

For your face recognition system maybe you have only a single picture of someone you might be trying to recognize but for your training set you do need to make sure you have multiple images of the same person at least for some people in your training set so that you can have pairs of anchor and positive images.

Choosing the triplets A,P,N : 

During training, if A,P,N are chosen randomly,
\(d(A,P) + a \leq d(A,N)\) is easily satisfied, so the network learns little from those triplets.

So to construct a training set, what you want to do is choose triplets A, P and N that are hard to train on. Because of the sheer data volume, this is also one domain where it is often useful to download someone else’s pretrained model rather than do everything from scratch yourself.

Face verification and binary classification

Take the pair of neural networks that make up the siamese network and have them both compute embeddings, maybe 128 dimensional or even higher dimensional, and then feed these into a logistic regression unit that makes a prediction. The target output is 1 if both images are of the same person and 0 if they are of different persons. This is a way to treat face recognition as a binary classification problem.

\(\hat y = \sigma (\sum _{k=1}^{128} w_k | f(x^{(i)})_k - f(x^{(j)})_k| + b)\)


Help your deployment significantly : 

Instead of recomputing the encoding of every database image each time, you can precompute those encodings. When a new employee walks in, you use the upper ConvNet to compute the encoding of the new image, compare it to your precomputed encoding, and use that to make the prediction \(\hat y\).

What is neural style transfer?

In order to implement neural style transfer, you need to look at the features extracted by a ConvNet at various layers, both the shallow and the deeper layers.

What are deep ConvNets learning?

Cost function

Given a content image C and the style image S, then the goal is to generate a new image G.

\(J(G) = \alpha J_{content}(C,G) + \beta J_{style}(S,G)\)

Find the generated image G : 

  1. Initialize G randomly (G : 100 * 100 * 3)
  2. Use gradient descent to minimize J(G)

Content cost function

  • Say you use hidden layer l to compute content cost.
  • Use pre-trained ConvNet. (E.g., VGG network)
\(J_{content}(C,G) = \frac {1}{2} \left \| a^{[l](C)} - a^{[l](G)} \right \| ^ 2\)
  • Let \(a^{[l](C)}\) and \(a^{[l](G)}\) be the activation of layer l on the images
  • If \(a^{[l](C)}\) and \(a^{[l](G)}\) are similar, both images have similar content

Style cost function

Style matrix : 

Let \(a_{i,j,k}^{[l]}\) = activation at \((i,j,k)\). \(G^{[l](S)}\) is \(n_{C}^{[l]} \times n_{C}^{[l]}\)

The degree of correlation gives you one way of measuring how often different high level features, such as vertical texture or an orange tint, occur together or don’t occur together in different parts of an image.

Define the style matrix of the style image : \(G_{kk'}^{[l](S)} = \sum _{i=1}^{n_H^{[l]}} \sum _{j=1}^{n_W^{[l]}} a_{i,j,k}^{[l](S)} a_{i,j,k'}^{[l](S)} \) and compute \(G^{[l](G)}\) the same way from the activations on the generated image. So G, defined using layer l, is a matrix whose height and width are the number of channels by the number of channels. The k, k' element of this matrix measures how correlated channels k and k' are.

Style cost function :

\(J_{style}^{[l]}(S,G) = \frac{1}{(2n_H^{[l]}n_W^{[l]}n_C^{[l]})^2} \sum _{k} \sum _{k'} (G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)})^2\)
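
The style matrix and the per-layer style cost can be sketched with nested lists indexed as height, width, channel. The tiny activation volumes `a_S` and `a_G` are made up, and the normalization follows the squared-difference form of the style cost used in the course.

```python
def gram_matrix(a):
    """Style matrix: G[k][k2] = sum over positions (i, j) of a[i][j][k] * a[i][j][k2]."""
    n_c = len(a[0][0])
    G = [[0.0] * n_c for _ in range(n_c)]
    for row in a:
        for pixel in row:
            for k in range(n_c):
                for k2 in range(n_c):
                    G[k][k2] += pixel[k] * pixel[k2]
    return G

def layer_style_cost(a_S, a_G):
    """Squared difference of the two style matrices, normalized by (2 n_H n_W n_C)^2."""
    n_h, n_w, n_c = len(a_S), len(a_S[0]), len(a_S[0][0])
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    diff = sum((G_S[k][j] - G_G[k][j]) ** 2 for k in range(n_c) for j in range(n_c))
    return diff / (2 * n_h * n_w * n_c) ** 2

a_S = [[[1.0, 2.0]], [[3.0, 4.0]]]   # hypothetical activations: 2 x 1 x 2
a_G = [[[1.0, 0.0]], [[0.0, 1.0]]]
style_matrix = gram_matrix(a_S)
cost = layer_style_cost(a_S, a_G)
```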


1D and 3D generalizations of models

For a lot of 1D data applications you actually use a recurrent neural network.

3D data : one way to think of this data is that it has some height, some width and also some depth.

3 Object detection

Object localization

Object detection is one of the areas of computer vision that’s just exploding.

Object localization means that not only do you have to label the image as, say, a car, but the algorithm is also responsible for putting a bounding box around the object. That is called the classification with localization problem.

Defining the target label y : 

  1. pedestrian
  2. car
  3. motorcycle
  4. background

Need to output \(b_x\), \(b_y\), \(b_h\), \(b_w\), class label (1-4) : \(y=\begin{bmatrix}
p_c \\
b_x \\
b_y \\
b_h \\
b_w \\
c_1 \\
c_2 \\
c_3
\end{bmatrix}\)

If using squared error, then loss function : \(L(\hat y, y) = (\hat y_1 - y_1)^2 + (\hat y_2 - y_2)^2 + \cdots + (\hat y_8 - y_8)^2\). This applies when \(y_1 = p_c = 1\). When \(p_c = 0\), only \((\hat y_1 - y_1)^2\) is counted, because the other components are don’t-cares.

In practice you could use a log likelihood loss for \(c_1\), \(c_2\), \(c_3\) with a softmax output, squared error or something like it for the bounding box coordinates, and something like the logistic regression loss for \(p_c\), although even if you use squared error throughout it will probably work okay.

Landmark detection

The neural network just outputs the x and y coordinates of important points in the image, sometimes called landmarks, that you want the neural network to recognize.

The labels have to be consistent across different images.

Object detection

Sliding windows detection algorithm : 

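  • pick a window size, slide the window across the image with some stride, and pass each cropped region through the ConvNet (the example uses a pretty large stride just to make the animation go faster)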
  • repeat it, but now use a larger window
  • then slide the window over again using some stride and so on, and you run that throughout your entire image until you get to the end

There’s a huge disadvantage of sliding windows detection which is the computational cost : 

  • if you use a very coarse stride, a very big stride, a very big step size, then that will reduce the number of windows you need to pass through the ConvNet, but that coarser granularity may hurt performance
  • whereas if you use a very fine granularity or a very small stride, then the huge number of all these little regions you’re passing through the ConvNet means that there’s a very high computational cost

So before the rise of neural networks, people used to use much simpler classifiers

Convolutional implementation of sliding windows

Turn fully connected layers in your neural network into convolutional layers

It turns out a lot of the computation done by these 4 ConvNet passes (one per window position) is highly duplicated.

Sliding windows convolutionally makes the whole thing much more efficient, but it still has one weakness which is the position of the bounding boxes is not going to be too accurate.

Bounding box predictions

A good way to get more accurate bounding boxes is the YOLO algorithm. YOLO stands for "you only look once".

The basic idea is to place a grid over the image and apply the image classification and localization algorithm to each grid cell. The YOLO algorithm takes the midpoint of each object and assigns the object to the grid cell containing that midpoint.

The advantage of this algorithm is that the neural network outputs precise bounding boxes. As long as you don’t have more than one object in each grid cell, this algorithm should work okay.

To assign an object to a grid cell, you look at the midpoint of the object and assign the object to whichever grid cell contains that midpoint.
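The midpoint rule can be sketched as a small helper. This is a minimal sketch, assuming pixel coordinates with the origin at the top-left corner and a 3 x 3 grid by default; the function name `assign_grid_cell` and the grid size are illustrative assumptions.

```python
def assign_grid_cell(mid_x, mid_y, image_w, image_h, grid=3):
    """Return (row, col) of the grid cell that contains the object's midpoint."""
    # Scale the midpoint into grid units, then clamp so that a midpoint lying
    # exactly on the right or bottom edge still falls in the last cell.
    col = min(int(mid_x / image_w * grid), grid - 1)
    row = min(int(mid_y / image_h * grid), grid - 1)
    return row, col


# Usage: an object centered at (150, 250) in a 300 x 300 image.
print(assign_grid_cell(150, 250, 300, 300))
```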

This is a pretty efficient algorithm. In fact, one nice thing about the YOLO algorithm, which accounts for its popularity, is that because it is a convolutional implementation it runs very fast, so it works even for real-time object detection.

The YOLO paper is one of the harder papers to read.

Sadly, it’s not that uncommon for even senior researchers to read research papers and have a hard time figuring out the details, and to have to look at the open source code or contact the authors to figure out the details of these algorithms.

Intersection over union

Intersection over union (IoU) is used both for evaluating your object detection algorithm and as a building block inside it, as in non-max suppression.

So, what the intersection over union function does or IoU does is it computes the intersection over union of these two bounding boxes.

So, the union of these two bounding boxes is the area that is contained in either bounding box, whereas the intersection is the smaller region they share. What the intersection over union does is compute the size of the intersection divided by the size of the union.

And by convention, a lot of computer vision tasks will judge that your answer is correct if the IoU is greater than or equal to 0.5 (just a human-chosen convention, there’s no particularly deep theoretical reason for it).
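A minimal sketch of the IoU computation, assuming each box is given as a tuple (x1, y1, x2, y2) of its top-left and bottom-right corners; the box format and the function name `iou` are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Usage: judge a prediction against the ground truth with the 0.5 convention.
predicted, truth = (0, 0, 10, 10), (1, 1, 11, 11)
print(iou(predicted, truth) >= 0.5)
```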

Non-max suppression

One of the problems of object detection as you’ve learned about it so far is that your algorithm may find multiple detections of the same object. Rather than detecting an object just once, it might detect it multiple times. Non-max suppression is a way to make sure that your algorithm detects each object only once.

  • concretely, it first looks at the probabilities associated with each of these detections (the p_c values) and takes the largest one
    • that is the most confident detection
    • so highlight it and output it as a prediction
    • and all the remaining detections with a high overlap (a high IoU) with the one you’ve just output get suppressed
  • then find the one with the highest probability among the remaining detections, and repeat until every detection has been either output or suppressed

Non-max means that you output your maximal-probability classifications but suppress the close-by ones that are non-maximal, hence the name non-max suppression.
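The procedure can be sketched in Python as follows. This is a minimal sketch for a single class, assuming detections are (p_c, box) pairs with boxes as (x1, y1, x2, y2); the 0.5 suppression threshold and the function names are assumptions, not values fixed by these notes.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the most confident detection of each object, suppress its overlaps."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest remaining p_c
        kept.append(best)
        # Suppress every remaining box that overlaps the one just output.
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


# Usage: two overlapping detections of one object plus one separate object.
detections = [(0.8, (1, 1, 11, 11)), (0.9, (0, 0, 10, 10)), (0.7, (50, 50, 60, 60))]
print(non_max_suppression(detections))
```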

Anchor Boxes

One of the problems with object detection as you’ve seen it so far is that each of the grid cells can detect only one object. What if a grid cell wants to detect multiple objects? You can use the idea of anchor boxes.


The idea of anchor boxes is to predefine two different shapes, called anchor boxes or anchor box shapes, and to associate two predictions with the two anchor boxes. In general you might use more anchor boxes, maybe five or even more.

Anchor box algorithm :

  • Previously :

Each object in training image is assigned to grid cell that contains that object’s midpoint.

  • With two anchor boxes :

Each object in training image is assigned to grid cell that contains object’s midpoint and anchor box for the grid cell with highest IoU.
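Picking the anchor box with the highest IoU can be sketched like this. It is a minimal sketch, assuming shapes are (width, height) pairs and that the IoU is measured between boxes sharing the same center, so only the shapes matter; the function names and the example anchor shapes are illustrative.

```python
def shape_iou(shape_a, shape_b):
    """IoU of two boxes that share a center, given as (width, height)."""
    (wa, ha), (wb, hb) = shape_a, shape_b
    inter = min(wa, wb) * min(ha, hb)
    return inter / (wa * ha + wb * hb - inter)


def best_anchor(object_shape, anchor_shapes):
    """Index of the anchor box whose shape has the highest IoU with the object."""
    return max(range(len(anchor_shapes)),
               key=lambda i: shape_iou(object_shape, anchor_shapes[i]))


# Usage: anchor 0 is tall and skinny, anchor 1 is wide.
anchors = [(1, 3), (3, 1)]
print(best_anchor((0.8, 2.5), anchors))  # a pedestrian-like shape
print(best_anchor((2.5, 0.9), anchors))  # a car-like shape
```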

Now some additional details: what if you have two anchor boxes but 3 objects in the same grid cell? That’s one case this algorithm doesn’t handle well.

What anchor boxes give you is that they allow your learning algorithm to specialize better. In particular, if your data set has some tall, skinny objects like pedestrians and some wide objects like cars, then this allows your learning algorithm to specialize.

How to choose the anchor boxes :

  • People used to just choose them by hand: you choose maybe five or ten anchor box shapes that span a variety of shapes and seem to cover the types of objects you want to detect.
  • One of the later YOLO research papers uses a k-means algorithm to group together the types of object shapes you tend to get, and uses that to select a set of anchor boxes that is most stereotypically representative of the maybe dozens of object classes you’re trying to detect. That’s a more advanced way to automatically choose the anchor boxes.

Putting it together: YOLO algorithm


YOLO is one of the most effective object detection algorithms, and it also encompasses many of the best ideas across the entire computer vision literature that relate to object detection.

Region proposals (Optional)

You can run the sliding windows algorithm convolutionally, but one downside is that it classifies a lot of regions where there’s clearly no object.

Faster algorithms : 

  • R-CNN : Propose regions. Classify proposed regions one at a time. Output label + bounding box.
  • Fast R-CNN : Propose regions. Use convolution implementation of sliding windows to classify all the proposed regions.
  • Faster R-CNN : Use convolutional network to propose regions.

Although the Faster R-CNN algorithm is faster, most implementations are usually still quite a bit slower than the YOLO algorithm.

The idea of region proposals has been quite influential in computer vision.

2 Deep convolutional models : case studies

Why look at case studies?

It turns out a lot of the past few years of computer vision research has been on how to put together these basic building blocks to form effective convolutional neural networks. 

And one of the best ways for you to gain intuition yourself, is to see some of these examples.

After the next few chapters, you’ll be able to read some of the research papers from the field of computer vision.

Classic networks

  • LeNet-5

And back then, when this paper was written, people used average pooling much more. If you’re building a modern variant, you’ll probably use max pooling instead.

So it turns out that if you read the original paper, back then people used sigmoid and tanh non-linearities, and people weren’t using ReLU non-linearities.

But back then, computers were much slower. And so, to save on computation as well as on parameters, the original LeNet-5 had some crazy complicated way where different filters look at different channels of the input block. The paper talks about those details, but a more modern implementation wouldn’t have that type of complexity.

  • AlexNet

So, this neural network actually had a lot of similarities to LeNet, but it was much bigger.

The fact that they could take pretty similar basic building blocks, with a lot more hidden units, and train on a lot more data (they trained on the ImageNet data set) is part of what made it perform so well. Another aspect of this architecture that made it much better than LeNet was using the ReLU activation function.

When this paper was written, GPUs were still a little bit slower, so it had a complicated way of training on two GPUs.

The original AlexNet architecture also had another type of layer called local response normalization. The basic idea of local response normalization is: if you look at one of these volumes, say for the sake of argument the 13 by 13 by 256 one, you take one position, look at all 256 numbers there, and normalize them. The motivation for local response normalization was that for each position in this 13 by 13 image, maybe you don’t want too many neurons with a very high activation. But subsequently, many researchers have found that this doesn’t help that much.

It was really this paper that convinced a lot of the computer vision community to take a serious look at deep learning, and to convince them that deep learning really works in computer vision, and then it grew on to have a huge impact, not just in computer vision but beyond computer vision as well.

  • VGG-16

Instead of having so many hyperparameters, let’s use a much simpler network where you focus on just having conv layers that are three by three filters with stride one and always use "same" padding, and make all your max pooling layers two by two with a stride of two. And so, one very nice thing about the VGG network was that it really simplified these neural network architectures.

But VGG-16 is a relatively deep network.

The 16 in the name VGG-16 refers to the fact that this network has 16 layers that have weights. And this is a pretty large network, with a total of about 138 million parameters.

And that’s pretty large even by modern standards. But the simplicity of the VGG-16 architecture made it quite appealing. You can tell its architecture is really quite uniform. There are a few conv layers followed by a pooling layer, which reduces the height and width. Also, if you look at the number of filters in the conv layers, you have 64 filters, then you double to 128, double to 256, and double to 512. Roughly doubling through every stack of conv layers was another simple principle used to design the architecture of this network.

And so, I think the relative uniformity of this architecture made it quite attractive to researchers. The main downside was that it was a pretty large network in terms of the number of parameters you had to train. The pattern is that as you go deeper, the height and width go down by a factor of two each time through the pooling layers, whereas the number of channels roughly goes up by a factor of two every time you have a new set of conv layers.

ResNets (Residual Networks)

Very, very deep neural networks are difficult to train because of vanishing and exploding gradient types of problems. Skip connections allow you to take the activation from one layer and suddenly feed it to another layer, even much deeper in the neural network. Using that, you can build ResNets, which enable you to train very, very deep networks, sometimes even networks of over 100 layers.

ResNets are built out of something called a residual block.

Plain network : \(\begin{matrix}
z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]} & a^{[l+1]} = g(z^{[l+1]}) \\
z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]} & a^{[l+2]} = g(z^{[l+2]})
\end{matrix}\)

Residual block : \(\begin{matrix}
z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]} & a^{[l+1]} = g(z^{[l+1]}) \\
z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]} & a^{[l+2]} = g(z^{[l+2]} + a^{[l]})
\end{matrix}\)

In practice, having a plain network (so no ResNet) that’s very deep means that your optimization algorithm has a much harder time training, so in reality your training error gets worse if you pick a network that’s too deep. But what happens with ResNets is that even as the number of layers gets deeper, the training error can keep going down, even if you train a network with over 100 layers.
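The residual block equations can be sketched with plain Python lists. This is a minimal sketch, assuming fully connected layers, g = ReLU, and that a^{[l]} has the same dimension as z^{[l+2]} so the two can be added; the helper names are illustrative.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def affine(W, a, b):
    """Compute W a + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = g(z[l+2] + a[l]), with g = ReLU."""
    a_l1 = relu(affine(W1, a_l, b1))   # a[l+1] = g(z[l+1])
    z_l2 = affine(W2, a_l1, b2)        # z[l+2]
    # Skip connection: add a[l] before the final nonlinearity.
    return relu([z + skip for z, skip in zip(z_l2, a_l)])


# Usage: with all weights and biases at zero, the block passes a[l] through.
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block([1.0, 2.0], zeros, [0.0, 0.0], zeros, [0.0, 0.0]))
```

With zero weights the block reduces to the identity on non-negative activations, which is why adding such a block is unlikely to make training error worse.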

Why ResNets work?

Doing well on the training set is usually a prerequisite to doing well on your hold out, or on your dev, on your test sets. So being able to at least train the ResNets to do well on a training set is a good first step toward that.

Adding this residual block somewhere in the middle or to the end of this big neural network, it doesn’t hurt performance.

The reason the residual network works is that it’s so easy for these extra layers to learn the identity function. So at the very least you start from a decent baseline of not hurting performance, and gradient descent can only improve the solution from there.

And then as is common in these networks, you have conv, conv, conv, pool, conv, conv, conv, pool, conv, conv, conv, pool. And then at the end, I have a fully connected layer that then makes a prediction using a softmax.

Network in Network and 1×1 convolutions

1 x 1 filter

  • 6 x 6 x 1 image, doesn’t seem particularly useful
  • 6 x 6 x 32 volume: here a 1 x 1 convolution looks at each of the 36 different positions, takes the element-wise product between the 32 numbers at that position and the 32 numbers in the filter, sums them, and then applies a ReLU nonlinearity.

And in fact, one way to think about the 32 numbers you have in this 1 x 1 x 32 filter is as the weights of a single neuron that takes the 32 numbers at one position as input.

So one way to think about the 1 x 1 convolution is that it is basically a fully connected neural network that applies to each of the 36 different positions.

It’s sometimes also called Network in Network.

A 1 x 1 convolution is a pretty non-trivial operation that allows you to shrink the number of channels in your volumes, or keep it the same, or even increase it if you want.
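A 1 x 1 convolution can be sketched as the same small fully connected layer applied at every position. This is a minimal sketch, assuming the volume is a nested list of shape H x W x C_in and each filter is a list of C_in weights; the function name `conv1x1` and the tiny sizes in the usage lines are illustrative.

```python
def conv1x1(volume, filters, biases):
    """Apply a 1 x 1 convolution followed by ReLU.

    volume:  H x W x C_in nested lists
    filters: C_out lists of C_in weights each
    Returns an H x W x C_out nested list.
    """
    out = []
    for row in volume:
        out_row = []
        for pixel in row:  # the C_in numbers at one position
            out_row.append([
                max(0.0, sum(w * x for w, x in zip(f, pixel)) + b)
                for f, b in zip(filters, biases)
            ])
        out.append(out_row)
    return out


# Usage: shrink a 2 x 2 x 3 volume to 2 x 2 x 2 by using two filters.
volume = [[[1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3]]]
result = conv1x1(volume, [[1, 0, -1], [1, 1, 1]], [0, 0])
print(len(result), len(result[0]), len(result[0][0]))
```

The height and width are unchanged; only the number of channels follows the number of filters.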

Inception network motivation

When designing a CONV layer you might have to pick: do you want a 1 x 1 filter, or 3 x 3, or 5 x 5? Or do you want a pooling layer? What the inception network does is say, why not do them all? This makes the network architecture more complicated, but it also works remarkably well.

What the inception network, or an inception layer, says is: instead of choosing what filter size you want in a CONV layer, or even whether you want a convolutional layer or a pooling layer, let’s do them all.

And the basic idea is that instead of needing to pick one of these filter sizes or pooling and committing to that, you can do them all, concatenate all the outputs, and let the network learn whatever parameters it wants to use and whatever combinations of these filter sizes it wants.

There’s a problem with the inception layer as I’ve described it here, which is computational cost.

A bottleneck layer is the smallest part of this network. We shrink the representation before increasing the size again. The total number of multiplications you need to do is the sum of those.

  • If you are building a layer of a neural network and you don’t want to have to decide whether you want a 1 x 1 or 3 x 3 or 5 x 5 or pooling layer, the inception module says, let’s do them all and concatenate the results.
  • That runs into the problem of computational cost, and using a 1 x 1 convolution you can create a bottleneck layer, thereby reducing the computational cost significantly.
  • It turns out that so long as you implement this bottleneck layer within reason, you can shrink down the representation size significantly, and it doesn’t seem to hurt the performance. That saves you a lot of computation.
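The saving from a bottleneck layer can be checked with a little arithmetic. This is a minimal sketch; the layer sizes (a 28 x 28 x 192 input, 32 output channels from a 5 x 5 "same" convolution, and a 16-channel 1 x 1 bottleneck) are assumed example values that are not stated in these notes, and `conv_multiplications` is an illustrative helper name.

```python
def conv_multiplications(out_h, out_w, out_c, k, in_c):
    """Multiplications for a k x k convolution producing out_h x out_w x out_c."""
    return out_h * out_w * out_c * k * k * in_c


# Direct 5 x 5 convolution: 28 x 28 x 192 -> 28 x 28 x 32.
direct = conv_multiplications(28, 28, 32, 5, 192)

# With a bottleneck: 1 x 1 conv down to 16 channels, then 5 x 5 conv up to 32.
# The total is the sum of the two steps.
bottleneck = (conv_multiplications(28, 28, 16, 1, 192)
              + conv_multiplications(28, 28, 32, 5, 16))

print(direct)       # 120422400
print(bottleneck)   # 12443648
```

Under these assumed sizes the bottleneck version needs roughly a tenth of the multiplications.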

Inception network


Using open-source implementations

It turns out that a lot of these neural networks are difficult or finicky to replicate, because a lot of details about tuning the hyperparameters make a difference. It is sometimes difficult even for, say, AI or deep learning Ph.D. students at the top universities to replicate someone else’s published work just from reading the research paper.

Fortunately, a lot of deep learning researchers routinely open source their work on the internet such as on GitHub.

If you see a research paper whose results you would like to build on top of, one thing you should consider doing, one thing I do quite often is just look online for an open-source implementation.

The MIT license is one of the more permissive open source licenses.

  1. If you’re developing a computer vision application, a very common workflow would be to pick an architecture that you’d like. Maybe one of the ones you’ve learned about in this course, or maybe one that you’ve heard about from a friend, or from some of the literature.
  2. And look for an open-source implementation and download it from GitHub to start building from there.

One of the advantages of doing so is that sometimes these networks take a long time to train, and someone else might have used multiple GPUs and a very large data set to pre-train some of these networks. That allows you to do transfer learning using these networks.

Transfer Learning

If you’re building a computer vision application, rather than training the weights from scratch, from random initialization, you often make much faster progress if you download weights that someone else has already trained on a network architecture.

And use that as pre-training and transfer that to a new task that you might be interested in. Use transfer learning to sort of transfer knowledge from some of these very large public data sets to your own problem.

Go online and download some open source implementation of a neural network. And download not just the code, but also the weights.

What you can do is then get rid of the softmax layer and create your own softmax unit. By using someone else’s pre-trained weights, you’re likely to get pretty good performance on this, even with a small data set. Fortunately, a lot of deep learning frameworks support this mode of operation.

Different deep learning programming frameworks have different ways of letting you specify whether or not to train the weights associated with a particular layer.

If you have a bigger data set, then maybe you have enough data not just to train a single softmax unit, but to train some modest-sized neural network that comprises the last few layers of the final network that you end up using. And finally, if you have a lot of data, one thing you might do is take this open source network and weights, use the whole thing just as initialization, and train the whole network.
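The three regimes (train only the new softmax unit, train the last few layers, train everything) differ only in how many trailing layers are left trainable. This is a framework-independent sketch: the `Layer` class, its `trainable` flag, and `freeze_all_but_last` are hypothetical names for illustration, not the API of any real deep learning library.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    trainable: bool = True  # hypothetical flag; real frameworks differ


def freeze_all_but_last(layers, n_trainable):
    """Freeze every layer except the last n_trainable ones."""
    cutoff = len(layers) - n_trainable
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff
    return layers


# Usage: a small data set, so only the new softmax unit is trained.
network = [Layer("conv1"), Layer("conv2"), Layer("fc"), Layer("softmax")]
freeze_all_but_last(network, 1)
print([(layer.name, layer.trainable) for layer in network])
```

Passing a larger `n_trainable` corresponds to having more data, up to training the whole network from the downloaded weights as initialization.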

Computer vision is one area where transfer learning is something that you should almost always do, unless you have an exceptionally large data set to train everything from scratch yourself.

Data augmentation

Most computer vision tasks could use more data and so data augmentation is one of the techniques that is often used to improve the performance of computer vision systems.

For the majority of computer vision problems, we just can’t get enough data.

The common data augmentation methods :

  • mirroring on the vertical axis
  • random cropping
  • rotation
  • local warping
  • color shifting
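Three of these methods can be sketched on a tiny image held as nested lists. This is a minimal sketch, assuming an image is a list of rows of (r, g, b) tuples with values in 0..255; the function names and the clamping of shifted values are illustrative choices.

```python
import random


def mirror(image):
    """Mirror on the vertical axis (flip left to right)."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w patch at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def color_shift(image, dr, dg, db):
    """Add an offset to each channel, clamped to the valid 0..255 range."""
    def clamp(v):
        return max(0, min(255, v))
    return [[(clamp(r + dr), clamp(g + dg), clamp(b + db)) for r, g, b in row]
            for row in image]


# Usage on a 2 x 2 image.
image = [[(10, 20, 30), (40, 50, 60)], [(70, 80, 90), (100, 110, 120)]]
print(mirror(image)[0])
print(color_shift(image, 20, -20, 250)[0][0])
```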

One of the ways to implement color distortion uses an algorithm called PCA (Principal Component Analysis).

The rough idea of PCA color augmentation is, for example, if your image is mainly purple, if it has mainly red and blue tints and very little green, then PCA color augmentation will add and subtract a lot to red and blue and relatively little to green, so it keeps the overall color of the tint the same.

A pretty common way of implementing data augmentation is to have one thread or multiple threads responsible for loading the data and implementing distortions, and then passing that to some other thread or process that does the training. Often the loading and distortion and the training can run in parallel.

A good place to get started might be to use someone else’s open source implementation for how they use data augmentation.

The state of computer vision

Deep learning has been successfully applied to computer vision, natural language processing, speech recognition, online advertising, logistics, many, many, many problems.

Image recognition is the problem of looking at a picture and telling you, is this a cat or not? Object detection is looking at a picture and also putting the bounding boxes on it, telling you where in the picture the objects, such as the cars, are. Because it is more expensive to label the objects and the bounding boxes, we tend to have less data for object detection than for image recognition.

On average, when you have a lot of data, you tend to find people getting away with using simpler algorithms as well as less hand engineering. So there’s just less need to carefully design features for the problem.

  • But instead you can have a giant neural network, even a simpler architecture and have a neural network just learn whatever it wants to learn when you have a lot of data.
  • Whereas in contrast, when you don’t have that much data, then, on average you see people engaging in more hand engineering.

Two sources of knowledge :

  • labeled data
  • hand engineering features / network architectures / other components of your system

And someone that is insightful with hand engineering will get better performance.

If you look at the computer vision literature, look at the set of ideas out there, you’ll also find that people are really enthusiastic. They’re really into doing well on standardized benchmark data sets and on winning competitions. And for computer vision researchers, if you do well on the benchmarks it’s easier to get the paper published. So there is just a lot of attention on doing well on these benchmarks. 

  • And the positive side of this is that it helps the whole community figure out what are the most effective algorithms
  • but you also see in the papers, people do things that allow you to do well on a benchmark,
  • but that you wouldn’t really use in a production or a system that you deploy in an actual application.

Tips for doing well on benchmarks / winning competitions

  • Ensembling

Train several neural networks independently and average their outputs

But it’s almost never used in production to serve actual customers, I guess unless you have a huge computational budget and don’t mind burning a lot more of it per customer image.

  • Multi-crop at test time

Take the central crop. Then, take the four corner crops. Run these images through your classifier and then average the results.
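The crop-and-average step can be sketched as follows. This is a minimal sketch, assuming crops are described by (x1, y1, x2, y2) boxes and the classifier returns one list of class scores per crop; the function names and the 256/224 sizes in the usage lines are illustrative assumptions. The same averaging helper also covers ensembling, where the predictions come from several networks instead of several crops.

```python
def five_crop_boxes(image_w, image_h, crop_w, crop_h):
    """Boxes (x1, y1, x2, y2) for the central crop and the four corner crops."""
    cx, cy = (image_w - crop_w) // 2, (image_h - crop_h) // 2
    origins = [
        (cx, cy),                                # center
        (0, 0),                                  # top-left
        (image_w - crop_w, 0),                   # top-right
        (0, image_h - crop_h),                   # bottom-left
        (image_w - crop_w, image_h - crop_h),    # bottom-right
    ]
    return [(x, y, x + crop_w, y + crop_h) for x, y in origins]


def average_predictions(predictions):
    """Average class scores element-wise over several predictions."""
    n = len(predictions)
    return [sum(p[i] for p in predictions) / n for i in range(len(predictions[0]))]


# Usage: crop boxes for a 256 x 256 image, then average two score vectors.
print(five_crop_boxes(256, 256, 224, 224)[0])
print(average_predictions([[0.5, 0.5], [1.0, 0.0]]))
```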

A neural network that works well on one vision problem will often, maybe surprisingly, work on other vision problems as well. So, to build a practical system you often do well starting off with someone else’s neural network architecture.

  • And you can use an open source implementation if possible because the open source implementation might have figured out all the finicky details.
  • But if you have the computer resources and the inclination, don’t let me stop you from training your own networks from scratch. And, in fact, if you want to invent your own computer vision algorithm, that’s what you might have to do.