18 Application Example: Photo OCR

Problem Description and Pipeline

  1. Text detection
  2. Character segmentation
  3. Character classification

Sliding Windows

Go out and collect large training sets of positive and negative examples.

Take that green rectangle and we slide it over a bit and then run that new image patch through our classifier to decide if there’s a pedestrian there.

Getting Lots of Data and Artificial Data

  1. artificial data synthesis
  2. just collect the data and you label it yourself
  3. crowd sourcing

Ceiling Analysis What Part of the Pipeline to Work on Next

Where should you allocate resources?
Which of these boxes is most worth your efforts, trying to improve the performance of.

  1. choose one module
  2. provide it the correct text detection outputs
  3. And then, use the same evaluation metric as before, to measure what is the overall accuracy of the entire system
  4. go to step 1, but choose another one

17 Large Scale Machine Learning

Learning With Large Datasets

Draw a learning curve and determine if more data needs to be collected.

Stochastic Gradient Descent

Cost Function in SGD : \(cost(\theta, (x^{(i)}, y^{(i)})) = \frac {1}{2}(h_{\theta}(x^{(i)})-y^{(i)})^2\)

  1. randomly shuffle the data set
  2. a little gradient descent step using just one single training example
  3. maybe head in a bad direction, generally move the parameters in the direction of the global minimum, but not always
  4. it ends up doing is wandering around continuously in some region that’s in some region close to the global minimum

Mini-Batch Gradient Descent

In Batch gradient descent we will use all m examples in each generation.
Whereas in Stochastic gradient descent we will use a single example in each generation.
What Mini-batch gradient descent does is somewhere in between.

Stochastic Gradient Descent Convergence

\(\alpha = \frac {const1}{iterationNumber + const2}\)

We can compute the cost function on the last 1000 examples or so. And we can use this method both to make sure the stochastic gradient descent is okay and is converging or to use it to tune the learning rate alpha.

Online Learning

The online learning setting allows us to model problems where we have a continuous flood or a continuous stream of data coming in and we would like an algorithm to learn from that.

We learn using that example like so and then we throw that example away.

If you really have a continuous stream of data, then an online learning algorithm can be very effective.

If you have a changing pool of users, or if the things you’re trying to predict are slowly changing like your user taste is slowly changing, the online learning algorithm can slowly adapt your learned hypothesis to whatever the latest sets of user behaviors are like as well.

Map Reduce and Data Parallelism

In the MapReduce idea, one way to do, is split this training set in to different subsets and use many different machines.

  • multi-core machine
  • multiple machines
  • numerical linear algebra libraries
  • like Hadoop

16 Recommender Systems

Problem Formulation

  1. an important application of machine learning
  2. this idea of learning the features

Content Based Recommendations

user j ‘s parameter vector : \(\theta^{(j)}\)

movie i’s feature vector : \(x^{(i)}\)

Predicting rating: \((\theta^{(j)})^Tx^{(i)}\)

user \(j\) ‘s cost function : \(\underset {\theta ^{(j)}}{min} \frac {1}{2} \sum _{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+ \frac {\lambda}{2}(\theta_k^{(j)})^2\)

all user’s cost function : \(\underset {\theta ^{(j)}, \cdots , \theta ^{(n_u)}}{min} \frac {1}{2} \sum _{j=1}^{n_u} \sum _{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+ \frac {\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2\)

Gradient descent : 

\(\left\{\begin{matrix}
\theta_k^{(j)} := \theta_k^{(j)} – \alpha \sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} – y^{(i,j)})x_{k}^{(i)} &
\ (for \ k \ = \ 0) \\
\theta_k^{(j)} := \theta_k^{(j)} – \alpha (\sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} – y^{(i,j)})x_{k}^{(i)} + \lambda \theta_{k}^{(j)}) & \ (for \ k \ \neq \ 0)
\end{matrix}\right.\)

 

Collaborative Filtering

No User’s Parameter and No Movie’s Features, you can do this.

The term collaborative filtering refers to the observation that when you run this algorithm with a large set of users, what all of these users are effectively doing are sort of collaboratively or collaborating to get better movie ratings for everyone because with every user rating some subset with the movies, every user is helping the algorithm a little bit to learn better features, and then by helping– by rating a few movies myself, I will be helping the system learn better features and then these features can be used by the system to make better movie predictions for everyone else. And so there is a sense of collaboration where every user is helping the system learn better features for the common good. This is this collaborative filtering.

Collaborative Filtering Algorithm

  1. Initialize x and theta to small random values
  2. minimize the cost function using great intercepts or one of the advance optimization algorithms
  3. predict

Vectorization Low Rank Matrix Factorization

A user has recently been looking at one product. Are there other related products that you could recommend to this user?

If you can find a different movie i, j, so that the distance between \(x^{(i)}\) and \(x^{(j)}\) is small, then this is a pretty strong indication that, you know, movies j and i are somehow similar

Use learned features to find what might be movies and what might be products that aren’t related to each other.

Implementational Detail Mean Normalization

If a user has not evaluated any movies, which movie should we recommend?

The idea of mean normalization will let us fix this problem.

15 Anomaly Detection

Problem Motivation

It’s mainly for unsupervised problem, that there’s some aspects of it that are also very similar to sort of the supervised learning problem.

some examples : 

  • detect strange behavior or fraudulent behavior
  • manufacturing
  • monitoring computers in a data center

Gaussian Distribution

Gaussian distribution

\(x\sim N(\mu , \sigma ^2)\)

 

Gaussian probability density

\(p(x, \mu , \sigma ^2) = \frac {1}{\sqrt{2\pi }\sigma } exp(-\frac {(x – \mu)^2}{2 \sigma ^2})\)

 

The location of the center of this bell-shaped curve

\(\mu = \frac {1}{m} \sum _{i=1}^{m}x^{(i)}\)

 

The width of this bell-shaped curve

\(\sigma ^ 2 = \frac {1}{m} \sum _{i=1}^{m} (x^{(i)} – \mu) ^2\)

 

Notice : The formula here  we use \(m\) instead of \(m – 1\) which is used in a statistics.

Algorithm

Address anomaly detection :

\(\mu _j = \frac {1}{m} \sum _{i=1}^{m}x^{(i)} _j\)

 

\(\sigma ^ 2 _j = \frac {1}{m} \sum _{i=1}^{m} (x^{(i)}_j – \mu _j) ^2\)

 

\(p(x) = \prod _{j=1}^{n}p(x_j; \mu _j, \sigma ^2_j) = \prod _{j=1}^{1}\frac {1}{\sqrt{2\pi } \sigma _j} exp(-\frac {(x_j – \mu_j)^2}{2 \sigma ^2_j})\)

 

If \(p(x) < \varepsilon \), it’s anomaly.

Developing and Evaluating an Anomaly Detection System

How to develop and evaluate an algorithm ?

  1. Take the training sets and fit the model \(p(x)\)
  2. On the cross validation of the test set, try to use different \(\varepsilon\), and then compute the F1 score
  3. After choosed \(\varepsilon\), evaluation of the algorithm on the test sets

Anomaly Detection vs. Supervised Learning

Anomaly DetectionSupervised Learning
very small number of positive, and a relatively large number of negative examplesa reasonably large number of both positive and negative examples
many different types of anomalieshave enough positive examples for an algorithm to get a sense of what the positive examples are like
future anomalies may look nothing like the ones you've seen so far
fraud detection, manufacturing, data centerSPAM email, weather prediction, classifying cancers

Choosing What Features to Use

  1. model the features using this sort of Gaussian distribution (play with different transformations of the data in order to make it look more Gaussian)
  2. do an error analysis procedure to come up with features for an anomaly detection algorithm
  3. create new features by combining me features

Multivariate Gaussian Distribution

\(p(x) = \prod _{j=1}^{n}p(x_j; \mu, \sigma ^2_j) = \prod _{j=1}^{n}\frac {1}{\sqrt{2\pi } \sigma _j} exp(-\frac {(x_j – \mu_j)^2}{2 \sigma ^2_j})\)

 

\(\mu = \frac {1}{m} \sum _{i=1}^{m}x^{(i)}\)

 

\(\sum = \frac {1}{m} \sum_{i=1}^{m} (x^{(i)} – \mu )(x^{(i)} – \mu )^T = \frac {1}{m} (X – \mu)^T(X – \mu)\)

 

\(p(x) = \frac {1}{(2 \pi)^{\frac {n}{2}}\left | \sum \right | ^{\frac {1}{2}}} exp(-\frac {1}{2} (x-\mu)^T\sum ^{-1}(x-\mu))\)

 

Gaussian Distribution Multivariate Gaussian Distribution
Manually create features to capture anomaliesAutomatically captures correlations between features
Computationally cheaper
Must have m > 10n or else sum is non-invertible

14 Dimensionality Reduction

Motivation I Data Compression

  1. compress the data, use up less computer memory or disk space
  2. speed up our learning algorithms.

Motivation II Visualization

If you have 50 features, it’s very difficult to plot 50-dimensional data.

But if you reduce dimensions, the problems is what these new features means.

Principal Component Analysis Problem Formulation

By far the most commonly used algorithm is something called principal components analysis or PCA.

What PCA does is it tries to find the surface onto which to project the data so as to minimize that.

Before applying PCA it’s standard practice to first perform mean normalization and feature scaling.

Principal Component Analysis Algorithm

Reduce the dimensions :

  1. Mean normalization, maybe perform feature scaling as well
  2. Covariance matrix, \(\sum = \frac {1}{m} \sum _{i=1}^{n}(x^{(i)})(x^{(i)})^T\)
  3. Eigenvectors of the matrix sigma

Choosing The Number Of Principal Components

The variation of the training sets : \(\frac {1}{m} \sum_{i=1}^{m} \left \| x^{(i)} \right \|^2\)

Try to choose k, a pretty common rule of thumb for choosing k is to choose the smaller values so that the ratio of  the average square projection error and the total variation in the data between these is less than 0.01.

\(\frac {\frac {1}{m} \sum_{i=1}^{m} \left \| x^{(i)} – x^{(i)}_{approx} \right \|^2}{\frac {1}{m} \sum_{i=1}^{m} \left \| x^{(i)} \right \| ^2} = 1 – \frac {\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}} \leq 1 \%\)

 

\(\frac {\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}} \geq 99 \%\)

 

After compressed : 

\(x^{(i)}_{approx} = U_{reduce}Z^{(i)}\)

 

Advice for Applying PCA

  • Don’t think of PCA as a way to prevent over-fitting. A much better way to address it, to use regularization. And the reason is that it throws away or reduces the dimension of your data without knowing what the values of y is so that it might throw away some valueable information.
  • First consider doing it with your original raw data \(x^{(i)}\), and only if that doesn’t do what you want, then implement PCA before using \(Z^{(i)}\).

 

13 Clustering

Unsupervised Learning Introduction

In unsupervised learning, what we do is, we give this sort of unlabeled training set to an algorithm and we just ask the algorithm: find some structure in the data for us. Given this data set, one type of structure we might have an algorithm find, is that it looks like this data set has points grouped into two separate clusters and so an algorithm that finds that clusters like the ones I just circled, is called a clustering algorithm.

So what is clustering good for?

  • Market segmentation
  • Social network analysis
  • Organize compute clusters or to organize data centers
  • Understand galaxy formation and astronomical detail

K-Means Algorithm

The K Means algorithm is by far the most popular, by far the most widely used clustering algorithm.

K Means is an iterative algorithm and it does two things.

randomly initialize two points, called the cluster centroids

  1. cluster assignment step
  2. move centroid step
Repeat {
	for i = 1 to m
		c(i) := index (from 1 to K) of cluster centroid closet to x(i)
	for k = 1 to K
		µk := average (mean) of points assigned to cluster k
}

Optimization Objective

Distortion function : 

\(J(c^{(1)},\cdots c^{(m)}, \mu_1,\cdots \mu_K) = \frac{1}{m} \sum_{i=1}^{m}\left \| X^{(i)}-\mu_c(i) \right \|^2\)

 

\(\mu_c(i)\) : the distance between \(X^{(i)}\) and the cluster centroid

Random Initialization

K-means can end up converging to different solutions depending on exactly how the clusters were initialized, and so, depending on the random initialization. K-means can end up at different solutions. And, in particular, K-means can actually end up at local optima.

How to initialize K-means and how to make K-means avoid local optima as well.

What we can do is, initialize K-means lots of times and run K-means lots of times, and use that to try to make sure we get as good a solution, as good a local or global optima as possible.

If the number of clusters is anywhere from two up to maybe 10 then doing multiple random initialization can often, can sometimes make sure that you find a better local optima.

But if K is very large, less likely to make a huge difference.

Choosing the Number of Clusters

There actually isn’t a great way of answering this or doing this automatically and by far the most common way of choosing the number of clusters, is still choosing it manually by looking at visualizations or by looking at the output of the clustering algorithm or something else.

One method  is called the Elbow Method, but don’t always expect that to work well.

12 Support Vector Machines

Optimization Objective

Within supervised learning, the performance of many supervised learning algorithms will be pretty similar and when that is less more often be whether you use learning algorithm A or learning algorithm B but when that is small there will often be things like the amount of data you are creating these algorithms on. That’s always your skill in applying this algorithms. Seems like your choice of the features that you designed to give the learning algorithms and how you choose the regularization parameter and things like that.

Support Vector Machine (SVM) : sometimes gives a cleaner and sometimes more powerful way of learning complex nonlinear functions.

Alternative view of logistic regression

\(h_\theta (x) = \frac {1}{1 + e^{-\theta ^{T}x}}\)

 

Cost of example : 

\(-(ylogh_\theta (x) + (1-y)log(1-h_\theta(x))) \)

 

\(= -ylog\frac {1}{1 + e^{-\theta ^{T}x}} + (1-y)log(1-\frac {1}{1 + e^{-\theta ^{T}x}})\)

 

Support vector machine

Logistic regression : 

\(\underset{\theta }{min} \frac{1}{m} [\sum_{i=1}^{m}y^{(i)}(-logh_\theta(x^{(i)})) + (1-y^{(i)})(-log(1-h_\theta (x^{(i)})))] + \frac{\lambda }{2m}\sum_{j=1}^{n} \theta _j^2\)

 

Support vector machine : 

\(\underset{\theta }{min} C [\sum_{i=1}^{m}y^{(i)}cost_1(\theta ^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta ^Tx^{(i)})] + \frac{1}{2}\sum_{i=1}^{n} \theta _j^2\)

 

Large Margin Intuition

Sometimes people talk about support vector machines, as large margin classifiers.

The margin of the support vector machine and this gives the SVM a certain robustness, because it tries to separate the data with as a large a margin as possible. So the support vector machine is sometimes also called a large  margin classifier.

\(min \frac{1}{2} \sum _{j=1}^{n} \ \theta _j^2 \ s.t \
\left\{\begin{matrix}
\theta^Tx^{(i)} \geq \ 1 \ \ if \ y^{(i)} = 1\\
\theta^Tx^{(i)} \leq -1 \ \ if \ y^{(i)} = 0\\
\end{matrix}\right.\)

 

In practice when applying support vector machines, when C is not very very large like that, it can do a better job ignoring the few outliers.

Mathematics Behind Large Margin Classification

\(\left \| u \right \| = \sqrt{u_1^2 + u_2^2}\)

p is the length of the projection of the vector V onto the vector U.

so \(u^Tv = p \cdot \left \| u \right \|\)

and \(u^Tv = u_1 \times  v_1 + u_2 \times v_2\)

so \(p \cdot \left \| u \right \| = u_1 \times  v_1 + u_2 \times v_2\)

SVM Decision Boundary

\(\underset {\theta }{min} \frac{1}{2} \sum _{j=1}^{n} \ \theta _j^2 \ s.t \
\left\{\begin{matrix}
\theta^Tx^{(i)} \geq \ 1 \ \ if \ y^{(i)} = 1\\
\theta^Tx^{(i)} \leq -1 \ \ if \ y^{(i)} = 0\\
\end{matrix}\right.\)

 

if n == 2:

\(\begin{Vmatrix}
u
\end{Vmatrix}
= \sqrt{u_1^2 + u_2^2}\)

 

\(\frac{1}{2} (\theta _1^2 + \theta _2^2) = \frac{1}{2} (\sqrt {\theta _1^2 + \theta _2^2})^2 = \frac{1}{2}\left \| \theta \right \| ^2\)

 

So all the support vector machine is doing in the optimization objective is it’s minimizing the squared norm of the square length of the parameter vector theta.

\(p^{(i)}\) : a projection of the i-th training example onto the parameter vector \(\theta \).

\(\theta ^Tx^{(i)} = \theta_1 \cdot x_1^{(i)} + \theta_2 \cdot x_2^{(i)} \)

 

SVM Decision Boundary

\(\underset {\theta }{min} \frac{1}{2} \sum _{j=1}^{n} \ \theta _j^2 \ s.t \
\left\{\begin{matrix}
p^{(i)} \cdot \left \| \theta \right \| \geq \ 1 \ \ if \ y^{(i)} = 1\\
p^{(i)} \cdot \left \| \theta \right \| \leq -1 \ \ if \ y^{(i)} = 0\\
\end{matrix}\right.\)

If we can make the norm of theta smaller and therefore make the squared norm of theta smaller, which is why the SVM would choose this hypothesis on the right instead. And this is how the SVM gives rise to this large margin certification effect. Mainly, if you look at this green line, if you look at this green hypothesis we want the projections of my positive and negative examples onto theta to be large, and the only way for that to hold true this is if surrounding the green line. There’s this large margin, there’s this large gap that separates positive and negative examples is really the magnitude of this gap. The magnitude of this margin is exactly the values of P1, P2, P3 and so on. And so by making the margin large, by these tyros P1, P2, P3 and so on that’s the SVM can end up with a smaller value for the norm of theta which is what it is trying to do in the objective. And this is why this machine ends up with enlarge margin classifiers because itss trying to maximize the norm of these P1 which is the distance from the training examples to the decision boundary.

Kernels

  • complex polynomial features
\(\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1x_2 + \theta_4x_1^2 + \theta_5x_2^2 + \cdots\)

 

  • new denotation
\(f_1 = x_1, \ f_2 = x_2, \ f_3 = x_1x_2, \ f_4 = x_1^2, \ f_5 = x_2^2\) \(h_\theta (x) = \theta_1f_1 + \theta_2f_2 + \cdots + \theta_nf_n\)

 

Gaussian Kernel

\(f_1 = similarity(x, l^{(1)}) = e^{(-\frac {\left \| x – l^{(1)} \right \| ^ 2}{2\sigma ^2})}\)

 

\(\left \| x – l^{(1)} \right \| ^ 2 = \sum _{j=1}^{n}(x_j – l_j^{(1)})^2\)

 

We define some extra features using landmarks and similarity functions to learn more complex nonlinear classifiers.

We use new features that are computed by Kernels, not original features.

How the landmarks are chosen

Choose the the location of my landmarks to be exactly near the locations of my m training examples.

Given \((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots , (x^{(m)}, y^{(m)})\)

choose \(l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \cdots \cdots , l^{(m)} = x^{(m)}\)

Given example x : 

\(f_1 = similarity(x, l^{(1)})\) \(f_2 = similarity(x, l^{(2)})\) \(\cdots \)

Cost Function : 

\(minC\sum _{i=1}^{m}[y^{(i)}cost_1(\theta^Tf^{(i)}) + (1-y^{(i)})cost_0(\theta^Tf^{(i)})]+\frac{1}{2}\sum_{j=1}^{n=m}\theta_j^2\)

 

Using kernels with logistic regression is going too very slow.

How to choose \(C \ and \ \sigma\)

\(C = 1 \ / \ \lambda\)
  • \(C\) big: overfitting, higher variance
  • \(C\) small: underfitting, higher bias
  • \(\sigma\) big : lower variance, higher bias
  • \(\sigma\) small : lower bias, higher variance

Using An SVM

Some library function :

  • liblinear
  • libsvm

There are a few things need to do :

  1. parameter’s C
  2. choose the kernel (If we decide not to use any kernel. And the idea of no kernel is also called a linear kernel)

Logistic Regression or Support Vector Machine

  • n : number of features
  • m : number of training examples
  1. n >> m, Logistic Regression or Linear Kernel
  2. n small, and m middle, like 1 < n < 1000, 10 < m < 10000, SVM with Gaussian Kernel
  3. n small, and m big, like 1 < n < 1000, 50000 < m, SVM is slower, so try to munually create more features and then use logistic regression or an SVM without the Kernel

Well for all of these problems, for all of these different regimes, a well designed neural network is likely to work well as well.

The algorithm does matter, but what often matters even more is things like, how much data do you have. And how skilled are you, how good are you at doing error analysis and debugging learning algorithms, figuring out how to design new features and figuring out what other features to give you learning algorithms and so on.

 

11 Machine Learning System Design

Prioritizing What to Work On

How to strategize putting together a complex machine learning system.

It’s hard to choose the options which is the best use of your time.

In fact, if you even get to the stage where you brainstorm a list of different options to try, you’re probably already ahead of the curve.

We must have a more systematic way to choose among the options of the many different things.

Error Analysis

If you’re starting work on a machine learning product or building a machine learning application, it is often considered very good practice to start, not by building a very complicated system with lots of complex features and so on, but to instead start by building a very simple algorithm, the you can implement quickly.

It’s often by implementing even a very, very quick and dirty implementation and by plotting learning curves that that helps you make these decisions.

And often by doing that, this is the process that would inspire you to design new features. Or they’ll tell you whether the current things or current shortcomings of the system and give you the inspiration you need to come up with improvements to it.

Recommended way :

  1. start by building a very simple algorithm
  2. plot a learning curve
  3. error analysis (a single rule number evaluation metric)

Strongly recommended way to do error analysis is on the cross validation set rather than the test set.

Error Metrics for Skewed Classes

It’s particularly tricky to come up with an appropriate error metric, or evaluation metric, for your learning algorithm.

MatrixPrediction Value
PositiveNegtive
Actual ValueNegtiveFPTN
PositiveTPFN

Trading Off Precision and Recall

\(Precision = TP / (TP + FP)\)
  • If you want to make predicting only when you’re more confident, and so you end up with a classifier that has higher precision.
\(Recall= TP / (TP + FN)\)
  • If we want to avoid missing too many actual cases. So we want to avoid the false negatives. And in this case, what we would have is going to be a higher recall classifier.
\(F_1Score : 2 \frac{PR}{P + R}\)
  • Maybe we can choose the higher F1 value in some cases.

Data For Machine Learning

In some cases, I had cautioned against blindly going out and just spending lots of time collecting lots of data, because it’s only sometimes that that would actually help.

In machine learning that often in machine learning it’s not who has the best algorithm that wins, it’s who has the most data.

If you have a lot of data and you train a learning algorithm with lot of parameters, that might be a good way to give a high performance learning algorithm.

The Key : 

  1. Find some features x and confidently predict the value of y.
  2. Actually get a large training set, and train the learning algorithm with a lot of parameters in the training set.

If you can’t do both then you need a very kind performance learning algorithm.

10 Advice for Applying Machine Learning

Deciding What to Try Next

If you are developing a machine learning system or trying to improve the performance of a machine learning system, how do you go about deciding what are the proxy avenues to try next?

If you find that this is making huge errors in this prediction. What should you then try mixing in order to improve the learning algorithm?

One thing they could try, is to get more training examples. But sometimes getting more training data doesn’t actually help.

Other things you might try are to well maybe try a smaller set of features.

There is a pretty simple technique that can let you very quickly rule out half of the things on this list as being potentially promising things to pursue. And there is a very simple technique, that if you run, can easily rule out many of these options, and potentially save you a lot of time pursuing something that’s just is not going to work.

Machine Learning Diagnostics, and what a diagnostic is, is a test you can run, to get insight into what is or isn’t working with an algorithm, and which will often give you insight as to what are promising things to try to improve a learning algorithm’s performance.

Evaluating a Hypothesis

If there is any sort of ordinary to the data. That should be better to send a random 70% of your data to the training set and a random 30% of your data to the test set.

Model Selection and Train_Validation_Test Sets

To send 60% of your data’s, your training set, maybe 20% to your cross validation set, and 20% to your test set.

Diagnosing Bias vs. Variance

The training set error, will be high. And you might find that the cross validation error will also be high. It might be a close. Maybe just slightly higher than a training error. The algorithm may be suffering from high bias. In contrast if your algorithm is suffering from high variance.

Regularization and Bias_Variance

Looking at the plot of the whole or cross validation error, you can either manually, automatically try to select a point that minimizes 
the cross-validation error and select the value of lambda corresponding to low cross-validation error.

Learning Curves

Learning curves is often a very useful thing to plot. If either you wanted to sanity check that your algorithm is working correctly, or if you want to improve the performance of the algorithm.

To plot a learning curve, is plot j train which is, say, average squared error on my training set or Jcv which is the average squared error on my cross validation set. And I’m going to plot that as a function of m, that is as a function of the number of training examples.

In the high variance setting, getting more training data is, indeed, likely to help.

Deciding What to Do Next Revisited

  1. getting more training examples is good for high variance
  2. a smaller set of features fixes high variance
  3. adding features usually is a solution for fixing high bias
  4. similarly, adding polynomial features
  5. decreasing lambda fixes fixes high bias
  6. increasing lambda fixes high variance

It turns out if you’re applying neural network very often using a large neural network often it’s actually the larger, the better.

Using a single hidden layer is a reasonable default, but if you want to choose the number of hidden layers, one other thing you can try is find yourself a training cross-validation, and test set split and try training neural networks with one hidden layer or two hidden layers or three hidden layers and see which of those neural networks performs best on the cross-validation sets.

9 Neural Networks: Learning

Cost Function

Two types of classification problems:

  • Binary classification : where the labels y are either zero or one.
  • multiclass classification : where we may have k distinct classes.

Backpropagation Algorithm

It’s too hard to describe.

Backpropagation Intuition

It’s too hard to describe.

Implementation Note_ Unrolling Parameters

It’s too hard to describe.

Gradient Checking

Back prop as an algorithm has one unfortunate property is that there are many ways to have subtle bugs in back prop so that if you run it with gradient descent or some other optimization algorithm, it could actually look like it’s working. And, you know, your cost function \(J(\theta )\) may end up decreasing on every iteration of gradient descent, but this could pull through even though there might be some bug in your implementation of back prop. So it looks like \(J(\theta )\) is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation and you might just not know that there was this subtle bug that’s giving you this performance.
Gradient checking that eliminates almost all of these problems.

Random Initialization

To train a neural network, what you should do is randomly initialize the weights to, you know, small values close to 0, between \(-\epsilon\) and \(+\epsilon\).

Putting It Together

How to implement a neural network learning algorithm ?

  • Pick some network architecture, it means connectivity pattern between the neurons.
  • Once you decides on the fix set of features x the number of input units will just be, the dimension of your features x(i) would be determined by that.
  • The number of output of this will be determined by the number of classes in your classification problem.
  • If you use more than one hidden layer, again the reasonable default will be to have the same number of hidden units in every single layer.
  • (As for the number of hidden units – usually, the more hidden units the better)

What we need to implement in order to trade in neural network ?

  1. set up the neural network and to randomly initialize the values of the weights
  2. forward propagation
  3. compute this cost function \(J(\theta )\)
  4. back-propagation
  5. gradient checking
  6. use an optimization algorithm

Autonomous Driving

A fun and historically important example of Neural Network Learning, just so so.