4 Special applications : Face recognition & Neural style transfer

What is face recognition?

Liveness detection

Face Verification

  • Input image, name/ID
  • Output whether the input image is that of the claimed person

Face Recognition

  • Has a database of K persons
  • Get an input image
  • Output ID if the image is any of the K persons (or “not recognized”)

In fact, if you have a database of a hundred persons, you probably need the verification accuracy to be quite a bit higher than 99% for recognition to work well.
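As a rough illustration of why 99% may not be enough, the sketch below assumes (a simplification that is not part of these notes) that recognizing one input means making K independent comparisons, each correct with probability p. The function name is hypothetical.

```python
def all_comparisons_correct(p, K):
    """Probability that K independent comparisons are all correct.

    Assumption: errors are independent across the K database entries,
    which is a simplification made only for this illustration.
    """
    return p ** K

# With 99% per-comparison accuracy and K = 100, fewer than 4 in 10
# inputs get every comparison right under this assumption.
print(all_comparisons_correct(0.99, 100))
# A much higher per-comparison accuracy is needed to bring that close to 1.
print(all_comparisons_correct(0.9999, 100))
```

Under this assumption, 99% per comparison gives roughly 0.37 for all hundred comparisons, which is why the per-pair accuracy has to be much higher.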

One-shot learning

One of the challenges of face recognition is that you need to solve the one-shot learning problem. What that means is that, for most face recognition applications, you need to recognize a person given just one single image, or given just one example of that person’s face. 

And historically, deep learning algorithms don’t work well if you have only one training example. So to carry out face recognition as one-shot learning, what you’re going to do instead is learn a similarity function.

\(d(img1, img2) = \) degree of difference between images.

If \(d(img1, img2) \leq \tau\) : same
If \(d(img1, img2) > \tau\) : different
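The thresholding rule and the K-person database lookup can be sketched as follows. This is a minimal sketch that assumes each image has already been mapped to a numeric encoding vector by some encoder (not shown), and that d is the squared Euclidean distance between encodings. The names `recognize` and `database` are hypothetical.

```python
import numpy as np

def d(enc1, enc2):
    """Degree of difference: squared Euclidean distance between encodings (assumed)."""
    diff = np.asarray(enc1, dtype=float) - np.asarray(enc2, dtype=float)
    return float(np.dot(diff, diff))

def is_same_person(enc1, enc2, tau):
    """Verification: 'same' if the difference is at most the threshold tau."""
    return d(enc1, enc2) <= tau

def recognize(enc, database, tau):
    """Recognition: return the closest ID in the database, or 'not recognized'."""
    best_id, best_dist = None, float("inf")
    for person_id, reference in database.items():
        dist = d(enc, reference)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    return best_id if best_dist <= tau else "not recognized"
```

Because only one reference encoding per person is stored, adding a new person means adding one entry to `database`, with no retraining.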


Siamese network

Triplet loss

One way to learn the parameters of the neural network so that it gives you a good encoding for your pictures of faces is to define and apply gradient descent on the triplet loss function.

In the terminology of the triplet loss, you always look at one anchor image. You want the distance between the anchor and a positive image (positive meaning the same person) to be small, whereas you want the distance between the anchor and a negative example (a different person) to be much larger. This is what gives rise to the term triplet loss: you always look at three images at a time, an anchor image, a positive image, and a negative image.

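You want the encodings to satisfy the following constraint, where \(a\) is the margin:

\(\left \| f(A) - f(P) \right \| ^2 - \left \| f(A) - f(N) \right \| ^2 + a \leq 0\)

Given three images A, P, N, the loss is:

\(L(A,P,N) = \max (\left \| f(A) - f(P) \right \| ^2 - \left \| f(A) - f(N) \right \| ^2 + a, 0)\)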
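The loss above can be written in a few lines of NumPy. This is a sketch for a single triplet of already-computed encodings f(A), f(P), f(N); the default margin of 0.2 is an arbitrary value chosen for illustration, not one given in these notes.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + margin, 0) for one triplet."""
    f_a, f_p, f_n = (np.asarray(v, dtype=float) for v in (f_a, f_p, f_n))
    d_ap = np.sum((f_a - f_p) ** 2)  # squared distance anchor-positive
    d_an = np.sum((f_a - f_n) ** 2)  # squared distance anchor-negative
    return max(d_ap - d_an + margin, 0.0)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so such triplets contribute no gradient.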

For your face recognition system, maybe you have only a single picture of someone you are trying to recognize. But for your training set, you do need to make sure you have multiple images of the same person, at least for some people, so that you can form pairs of anchor and positive images.

Choosing the triplets A,P,N : 

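During training, if A,P,N are chosen randomly,
\(d(A,P) + a \leq d(A,N)\) is easily satisfied, where \(d(A,P) = \left \| f(A) - f(P) \right \| ^2\), so gradient descent learns little from most triplets.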

So to construct a training set, what you want to do is choose triplets A, P, and N that are hard to train on, that is, triplets where d(A,P) is close to d(A,N). Because of the sheer volume of data involved, this is one domain where it is often useful to download someone else’s pretrained model rather than do everything from scratch yourself.
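One simple way to pick hard triplets is to keep only those that still violate the margin constraint. The sketch below assumes the encodings have already been computed and that d is the squared Euclidean distance; the function names are hypothetical, and real systems typically do this selection per mini-batch with the current network.

```python
import numpy as np

def sq_dist(u, v):
    """Squared Euclidean distance between two encodings."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(diff @ diff)

def hard_triplets(triplets, margin):
    """Keep the (A, P, N) encoding triplets that violate d(A,P) + margin <= d(A,N)."""
    return [
        (a, p, n)
        for a, p, n in triplets
        if sq_dist(a, p) + margin > sq_dist(a, n)
    ]
```

Triplets that are filtered out already have zero triplet loss, so dropping them does not change the gradient.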

Face verification and binary classification

Take this Siamese network, a pair of identical neural networks, and have both compute embeddings (maybe 128-dimensional, maybe even higher-dimensional). Then feed these embeddings into a logistic regression unit to make a prediction, where the target output is 1 if both images are of the same person and 0 if they are of different persons. So this is a way to treat face recognition just as a binary classification problem.

\(\hat y = \sigma \left( \sum _{k=1}^{128} w_k \left| f(x^{(i)})_k - f(x^{(j)})_k \right| + b \right)\)
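The logistic regression unit on top of the two embeddings can be sketched as below. It takes the element-wise absolute difference of the two encodings as features. The weights `w` and bias `b` would be learned during training; the function name is hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f_i, f_j, w, b):
    """y_hat = sigmoid(sum_k w_k * |f_i[k] - f_j[k]| + b)."""
    f_i = np.asarray(f_i, dtype=float)
    f_j = np.asarray(f_j, dtype=float)
    features = np.abs(f_i - f_j)  # one feature per embedding dimension
    return float(sigmoid(np.dot(w, features) + b))
```

With negative weights and a positive bias, identical encodings give a probability above 0.5 and the probability falls as the encodings move apart.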


Precomputation can help your deployment significantly : 

What you can do is precompute the encoding of each image in your database. Then, when a new employee walks in, you use the upper ConvNet to compute the encoding of the new image, compare it to your precomputed encoding, and use that to make a prediction \(\hat y\).
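The precomputation idea can be sketched as a small cache of encodings. Here `encoder` stands in for the ConvNet and `classifier` for the comparison step; both are hypothetical callables supplied by the caller, and the class and method names are made up for this illustration.

```python
import numpy as np

class EncodingCache:
    """Stores one precomputed encoding per person instead of the raw image."""

    def __init__(self, encoder):
        self.encoder = encoder  # hypothetical: maps an image to an encoding
        self.encodings = {}

    def enroll(self, person_id, image):
        # Run the encoder once, at enrollment time.
        self.encodings[person_id] = self.encoder(image)

    def verify(self, person_id, new_image, classifier):
        # Only the new image is encoded here; the stored encoding is reused.
        new_encoding = self.encoder(new_image)
        return classifier(new_encoding, self.encodings[person_id])
```

With this layout each verification costs one forward pass through the encoder rather than two, and the database does not need to keep the original images.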

What is neural style transfer?

In order to implement neural style transfer, you need to look at the features extracted by a ConvNet at various layers, both the shallow and the deeper ones.

What are deep ConvNets learning?

Cost function

Given a content image C and the style image S, then the goal is to generate a new image G.

\(J(G) = \alpha J_{content}(C,G) + \beta J_{style}(S,G)\)

Find the generated image G : 

  1. Initialize G randomly (e.g., G : 100 × 100 × 3)
  2. Use gradient descent to minimize J(G)
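The two steps above amount to running gradient descent directly on the pixels of G. The sketch below shows only that loop: `grad_J` is a hypothetical function returning the gradient of J with respect to G, which in practice would come from backpropagation through a pretrained ConvNet. The learning rate and step count are arbitrary illustrative values.

```python
import numpy as np

def generate_image(grad_J, shape=(100, 100, 3), learning_rate=0.1, steps=200, seed=0):
    """Gradient descent on the image itself: G := G - learning_rate * dJ/dG."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal(shape)  # step 1: random initialization
    for _ in range(steps):          # step 2: minimize J(G)
        G = G - learning_rate * grad_J(G)
    return G

# Toy stand-in cost (NOT the style transfer cost): J(G) = 0.5 * ||G - T||^2,
# whose gradient is G - T, so G should converge to the target T.
T = np.full((4, 4, 3), 0.5)
G = generate_image(lambda G: G - T, shape=T.shape)
```

Note that the network weights stay fixed during this process; only the pixel values of G are updated.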

Content cost function

  • Say you use hidden layer l to compute content cost.
  • Use pre-trained ConvNet. (E.g., VGG network)
  • Let \(a^{[l](C)}\) and \(a^{[l](G)}\) be the activations of layer l on the images C and G
  • If \(a^{[l](C)}\) and \(a^{[l](G)}\) are similar, both images have similar content
\(J_{content}(C,G) = \frac {1}{2} \left \| a^{[l](C)} - a^{[l](G)} \right \| ^ 2\)
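Given the layer-l activations for both images, the content cost is half the sum of squared element-wise differences. A minimal sketch, assuming the activations have already been extracted from the ConvNet as arrays of the same shape:

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content = 0.5 * ||a_C - a_G||^2 over all activation entries."""
    a_C = np.asarray(a_C, dtype=float)
    a_G = np.asarray(a_G, dtype=float)
    return 0.5 * float(np.sum((a_C - a_G) ** 2))
```

The cost is zero exactly when the two activation volumes are identical, which matches the intuition that similar activations mean similar content.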

Style cost function

Style matrix : 

Let \(a_{i,j,k}^{[l]}\) be the activation at position \((i,j,k)\) of layer l, where i indexes height, j width, and k channel. The style matrix \(G^{[l](S)}\) is \(n_{C}^{[l]} \times n_{C}^{[l]}\).

The degree of correlation gives you one way of measuring how often these different high-level features, such as vertical texture or an orange tint, occur, and how often they occur together or don’t occur together in different parts of an image.

Define the style matrix. For the generated image G:

\(G_{kk'}^{[l](G)} = \sum _{i=1}^{n_H^{[l]}} \sum _{j=1}^{n_W^{[l]}} a_{i,j,k}^{[l](G)} a_{i,j,k'}^{[l](G)} \)

The style matrix of the style image, \(G^{[l](S)}\), is defined the same way using the activations \(a^{[l](S)}\). So G, defined using layer l, is a matrix whose height and width are the number of channels by the number of channels. In this matrix, the (k, k') element measures how correlated channels k and k' are.
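The double sum over positions is a matrix product once the activation volume is flattened to one row per spatial position. A sketch assuming activations are stored with shape (n_H, n_W, n_C); the function name is hypothetical:

```python
import numpy as np

def style_matrix(a):
    """G[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k']."""
    a = np.asarray(a, dtype=float)
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)  # one row per spatial position
    return flat.T @ flat              # shape (n_C, n_C)
```

The result is symmetric, and its diagonal entry (k, k) measures how strongly channel k is active overall.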

Style cost function :

\(J_{style}^{[l]}(S,G) = \frac{1}{(2n_H^{[l]}n_W^{[l]}n_C^{[l]})^2} \sum _{k} \sum _{k'} (G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)})^2\)
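The per-layer style cost compares the two style matrices entry by entry, using the squared difference (the squared Frobenius norm) and the normalization constant from the formula. A sketch assuming both activation volumes have shape (n_H, n_W, n_C); the function names are hypothetical:

```python
import numpy as np

def style_matrix(a):
    """Channel-by-channel correlation matrix of one activation volume."""
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)
    return flat.T @ flat

def style_cost_layer(a_S, a_G):
    """Sum of squared style-matrix differences, scaled by 1 / (2 n_H n_W n_C)^2."""
    a_S = np.asarray(a_S, dtype=float)
    a_G = np.asarray(a_G, dtype=float)
    n_H, n_W, n_C = a_S.shape
    diff = style_matrix(a_S) - style_matrix(a_G)
    return float(np.sum(diff ** 2)) / (2 * n_H * n_W * n_C) ** 2
```

The normalization constant only rescales the cost, so in the overall objective it can be absorbed into the weight \(\beta\).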


1D and 3D generalizations of models

For a lot of 1D data applications, you actually use a recurrent neural network.

Three-dimensional data: one way to think of this is that your data now has some height, some width, and also some depth.
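To make the generalization concrete, here is a sketch of a "valid" convolution (as in ConvNets, implemented as cross-correlation without flipping the filter) in 1D and 3D. In each dimension an input of size n and a filter of size f give an output of size n - f + 1. The function names are hypothetical, and the 3D version uses plain loops for readability rather than speed.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide a 1D filter over a 1D signal; output length is n - f + 1."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    return np.correlate(signal, kernel, mode="valid")

def conv3d_valid(volume, kernel):
    """Slide a 3D filter over a 3D volume (height, width, depth)."""
    volume = np.asarray(volume, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    out_shape = tuple(n - f + 1 for n, f in zip(volume.shape, kernel.shape))
    out = np.zeros(out_shape)
    for idx in np.ndindex(*out_shape):
        # Window of the same size as the filter, starting at position idx.
        window = volume[tuple(slice(i, i + f) for i, f in zip(idx, kernel.shape))]
        out[idx] = np.sum(window * kernel)
    return out
```

For example, a 4 × 4 × 4 volume with a 2 × 2 × 2 filter yields a 3 × 3 × 3 output.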
