3 Hyperparameter tuning

Tuning process

How to systematically organize your hyperparameter tuning process

One of the painful things about training deepness :

  • the sheer number of hyperparameters
  • Momentum
  • the number of layers
  • the number of hidden units for the different layers
  • learning rate decay
  • mini-batch size

How do you select a set of values to explore

  • It was common practice to sample the points in a grid and systematically explore these values.
  • In deep learning, what we tend to do, is choose the points at random.
  • Another common practice is to use a coarse to fine sampling scheme.
    • zoom in to a smaller region of the hyperparameters and then sample more density within this space 
    • use random sampling and adequate search

Using an appropriate scale to pick hyperparameters

It’s important to pick the appropriate scale on which to explore the hyperparamaters.

  • uniformly at random
  • log scale

Pandas VS Caviar(Hyperparameters tuning in practice: Pandas vs. Caviar)

Deep learning today is applied to many different application areas and that intuitions about hyperparameter settings from one application area may or may not transfer to a different one.

People from different application domains do read increasingly research papers from other application domains to look for inspiration for cross-fertilization.

How to search for good hyperparameters : the panda approach versus the caviar approach

Normalizing activations in a network

Batch normalization makes your hyperparameter search problem much easier, makes the neural network much more robust to the choice of hyperparameters, there’s a much bigger range of hyperparameters that work well, and will also enable you to much more easily train even very deep networks.

Fitting Batch Norm into a neural network

The Programming framework which will make using Batch Norm much easier.

Why does Batch Norm work?

  • normalizing all the features to take on a similar range of values that can speed up learning
  • makes weights,later or deeper than your network
  • It reduces the amount that the distribution of these hidden unit values shifts around.
  • Batch norm reduces the problem of the input values changing, it really causes these values to become more stable, so that the later layers of the neural network has more firm ground to stand on.
  • It weakens the coupling between what the early layers parameters has to do and what the later layers parameters have to do. And so it allows each layer of the network to learn by itself, a little bit more independently of other layers, and this has the effect of speeding up learning in the whole network.
  • Batch norm therefore has a slight regularization effect. Because by adding noise to the hidden units, it’s forcing the downstream hidden units not to rely too much on any one hidden unit.

Batch Norm at test time

Batch norm processes your data one mini batch at a time, but the test time you might need to process the examples one at a time.

So the takeaway from this is that during training time \(\mu\) and \(\sigma ^2\) are computed on an entire mini batch of, say, 64, 28 or some number of examples. But at test time, you might need to process a single example at a time. So, the way to do that is to estimate \(\mu\) and \(\sigma ^2\) from your training set and there are many ways to do that.

But in practice, what people usually do is implement an exponentially weighted average where you just keep track of the \(\mu\) and \(\sigma ^2\) values you’re seeing during training and use an exponentially weighted average, also sometimes called the running average, to just get a rough estimate of \(\mu\) and \(\sigma ^2\) and then you use those values of \(\mu\) and \(\sigma ^2\) at test time to do the scaling you need of the hidden unit values Z.

Deep learning framework usually have some default way to estimate the \(\mu\) and \(\sigma ^2\) that should work reasonably well as well.

Softmax regression

Softmax regression that lets you make predictions where you’re trying to recognize one of C or one of multiple classes, rather than just recognize two classes.

Training a Softmax classifier

Loss Function in softmax classification : \(L(\hat y, y) = -\sum _{j=1}^{j}y_{j}log(\hat y_{j})\)

It looks at whatever is the ground truth class in your training set, and it tries to make the corresponding probability of that class as high as possible. If you’re familiar with maximum likelihood estimation statistics, this turns out to be a form of maximum likelyhood estimation.

The cost J on the entire training set : \(J(w^{[1]}, b^{[1]}, \cdots \cdots ) = \frac {1}{m} \sum _{i=1}^{m} L(\hat y ^{(i)}, y ^{(i)})\)

Usually it turns out you just need to focus on getting the forward prop right. And so long as you specify it as a program framework, the forward prop pass, the program framework will figure out how to do back prop, how to do the backward pass for you.

Deep Learning frameworks

At least for most people, is not practical to implement everything yourself from scratch. Fortunately, there are now many good deep learning software frameworks that can help you implement these models.

choose frameworks :

  • Ease of programming
  • Running speed
  • Truly open
  • Preferences of language
  • What application you’re working on


Example :

import numpy as np
import tensorflow as tf
for i in range(1000):

Example (placeholder):

for i in range(1000):

the TensorFlow documentation tends to just write the operation.