## Motivation I Data Compression

1. compress the data, use up less computer memory or disk space
2. speed up our learning algorithms.

## Motivation II Visualization

If you have 50 features, it’s very difficult to plot 50-dimensional data.

But if you reduce dimensions, the problems is what these new features means.

## Principal Component Analysis Problem Formulation

By far the most commonly used algorithm is something called principal components analysis or PCA.

What PCA does is it tries to find the surface onto which to project the data so as to minimize that.

Before applying PCA it’s standard practice to first perform mean normalization and feature scaling.

## Principal Component Analysis Algorithm

Reduce the dimensions :

1. Mean normalization, maybe perform feature scaling as well
2. Covariance matrix, $$\sum = \frac {1}{m} \sum _{i=1}^{n}(x^{(i)})(x^{(i)})^T$$
3. Eigenvectors of the matrix sigma

## Choosing The Number Of Principal Components

The variation of the training sets : $$\frac {1}{m} \sum_{i=1}^{m} \left \| x^{(i)} \right \|^2$$

Try to choose k, a pretty common rule of thumb for choosing k is to choose the smaller values so that the ratio of  the average square projection error and the total variation in the data between these is less than 0.01.

$$\frac {\frac {1}{m} \sum_{i=1}^{m} \left \| x^{(i)} – x^{(i)}_{approx} \right \|^2}{\frac {1}{m} \sum_{i=1}^{m} \left \| x^{(i)} \right \| ^2} = 1 – \frac {\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}} \leq 1 \%$$

$$\frac {\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}} \geq 99 \%$$

After compressed :

$$x^{(i)}_{approx} = U_{reduce}Z^{(i)}$$

## Advice for Applying PCA

• Don’t think of PCA as a way to prevent over-fitting. A much better way to address it, to use regularization. And the reason is that it throws away or reduces the dimension of your data without knowing what the values of y is so that it might throw away some valueable information.
• First consider doing it with your original raw data $$x^{(i)}$$, and only if that doesn’t do what you want, then implement PCA before using $$Z^{(i)}$$.

## Unsupervised Learning Introduction

In unsupervised learning, what we do is, we give this sort of unlabeled training set to an algorithm and we just ask the algorithm: find some structure in the data for us. Given this data set, one type of structure we might have an algorithm find, is that it looks like this data set has points grouped into two separate clusters and so an algorithm that finds that clusters like the ones I just circled, is called a clustering algorithm.

#### So what is clustering good for?

• Market segmentation
• Social network analysis
• Organize compute clusters or to organize data centers
• Understand galaxy formation and astronomical detail

## K-Means Algorithm

The K Means algorithm is by far the most popular, by far the most widely used clustering algorithm.

K Means is an iterative algorithm and it does two things.

randomly initialize two points, called the cluster centroids

1. cluster assignment step
2. move centroid step
Repeat {
for i = 1 to m
c(i) := index (from 1 to K) of cluster centroid closet to x(i)
for k = 1 to K
µk := average (mean) of points assigned to cluster k
}

## Optimization Objective

Distortion function :

$$J(c^{(1)},\cdots c^{(m)}, \mu_1,\cdots \mu_K) = \frac{1}{m} \sum_{i=1}^{m}\left \| X^{(i)}-\mu_c(i) \right \|^2$$

$$\mu_c(i)$$ : the distance between $$X^{(i)}$$ and the cluster centroid

## Random Initialization

K-means can end up converging to different solutions depending on exactly how the clusters were initialized, and so, depending on the random initialization. K-means can end up at different solutions. And, in particular, K-means can actually end up at local optima.

How to initialize K-means and how to make K-means avoid local optima as well.

What we can do is, initialize K-means lots of times and run K-means lots of times, and use that to try to make sure we get as good a solution, as good a local or global optima as possible.

If the number of clusters is anywhere from two up to maybe 10 then doing multiple random initialization can often, can sometimes make sure that you find a better local optima.

But if K is very large, less likely to make a huge difference.

## Choosing the Number of Clusters

There actually isn’t a great way of answering this or doing this automatically and by far the most common way of choosing the number of clusters, is still choosing it manually by looking at visualizations or by looking at the output of the clustering algorithm or something else.

One method  is called the Elbow Method, but don’t always expect that to work well.

## Optimization Objective

Within supervised learning, the performance of many supervised learning algorithms will be pretty similar and when that is less more often be whether you use learning algorithm A or learning algorithm B but when that is small there will often be things like the amount of data you are creating these algorithms on. That’s always your skill in applying this algorithms. Seems like your choice of the features that you designed to give the learning algorithms and how you choose the regularization parameter and things like that.

Support Vector Machine (SVM) : sometimes gives a cleaner and sometimes more powerful way of learning complex nonlinear functions.

Alternative view of logistic regression

$$h_\theta (x) = \frac {1}{1 + e^{-\theta ^{T}x}}$$

Cost of example :

$$-(ylogh_\theta (x) + (1-y)log(1-h_\theta(x)))$$

$$= -ylog\frac {1}{1 + e^{-\theta ^{T}x}} + (1-y)log(1-\frac {1}{1 + e^{-\theta ^{T}x}})$$

Support vector machine

Logistic regression :

$$\underset{\theta }{min} \frac{1}{m} [\sum_{i=1}^{m}y^{(i)}(-logh_\theta(x^{(i)})) + (1-y^{(i)})(-log(1-h_\theta (x^{(i)})))] + \frac{\lambda }{2m}\sum_{j=1}^{n} \theta _j^2$$

Support vector machine :

$$\underset{\theta }{min} C [\sum_{i=1}^{m}y^{(i)}cost_1(\theta ^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta ^Tx^{(i)})] + \frac{1}{2}\sum_{i=1}^{n} \theta _j^2$$

## Large Margin Intuition

Sometimes people talk about support vector machines, as large margin classifiers.

The margin of the support vector machine and this gives the SVM a certain robustness, because it tries to separate the data with as a large a margin as possible. So the support vector machine is sometimes also called a large  margin classifier.

$$min \frac{1}{2} \sum _{j=1}^{n} \ \theta _j^2 \ s.t \ \left\{\begin{matrix} \theta^Tx^{(i)} \geq \ 1 \ \ if \ y^{(i)} = 1\\ \theta^Tx^{(i)} \leq -1 \ \ if \ y^{(i)} = 0\\ \end{matrix}\right.$$

In practice when applying support vector machines, when C is not very very large like that, it can do a better job ignoring the few outliers.

## Mathematics Behind Large Margin Classification

$$\left \| u \right \| = \sqrt{u_1^2 + u_2^2}$$

p is the length of the projection of the vector V onto the vector U.

so $$u^Tv = p \cdot \left \| u \right \|$$

and $$u^Tv = u_1 \times v_1 + u_2 \times v_2$$

so $$p \cdot \left \| u \right \| = u_1 \times v_1 + u_2 \times v_2$$

SVM Decision Boundary

$$\underset {\theta }{min} \frac{1}{2} \sum _{j=1}^{n} \ \theta _j^2 \ s.t \ \left\{\begin{matrix} \theta^Tx^{(i)} \geq \ 1 \ \ if \ y^{(i)} = 1\\ \theta^Tx^{(i)} \leq -1 \ \ if \ y^{(i)} = 0\\ \end{matrix}\right.$$

if n == 2:

$$\begin{Vmatrix} u \end{Vmatrix} = \sqrt{u_1^2 + u_2^2}$$

$$\frac{1}{2} (\theta _1^2 + \theta _2^2) = \frac{1}{2} (\sqrt {\theta _1^2 + \theta _2^2})^2 = \frac{1}{2}\left \| \theta \right \| ^2$$

So all the support vector machine is doing in the optimization objective is it’s minimizing the squared norm of the square length of the parameter vector theta.

$$p^{(i)}$$ : a projection of the i-th training example onto the parameter vector $$\theta$$.

$$\theta ^Tx^{(i)} = \theta_1 \cdot x_1^{(i)} + \theta_2 \cdot x_2^{(i)}$$

SVM Decision Boundary

$$\underset {\theta }{min} \frac{1}{2} \sum _{j=1}^{n} \ \theta _j^2 \ s.t \ \left\{\begin{matrix} p^{(i)} \cdot \left \| \theta \right \| \geq \ 1 \ \ if \ y^{(i)} = 1\\ p^{(i)} \cdot \left \| \theta \right \| \leq -1 \ \ if \ y^{(i)} = 0\\ \end{matrix}\right.$$

If we can make the norm of theta smaller and therefore make the squared norm of theta smaller, which is why the SVM would choose this hypothesis on the right instead. And this is how the SVM gives rise to this large margin certification effect. Mainly, if you look at this green line, if you look at this green hypothesis we want the projections of my positive and negative examples onto theta to be large, and the only way for that to hold true this is if surrounding the green line. There’s this large margin, there’s this large gap that separates positive and negative examples is really the magnitude of this gap. The magnitude of this margin is exactly the values of P1, P2, P3 and so on. And so by making the margin large, by these tyros P1, P2, P3 and so on that’s the SVM can end up with a smaller value for the norm of theta which is what it is trying to do in the objective. And this is why this machine ends up with enlarge margin classifiers because itss trying to maximize the norm of these P1 which is the distance from the training examples to the decision boundary.

## Kernels

• complex polynomial features
$$\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1x_2 + \theta_4x_1^2 + \theta_5x_2^2 + \cdots$$

• new denotation
$$f_1 = x_1, \ f_2 = x_2, \ f_3 = x_1x_2, \ f_4 = x_1^2, \ f_5 = x_2^2$$ $$h_\theta (x) = \theta_1f_1 + \theta_2f_2 + \cdots + \theta_nf_n$$

Gaussian Kernel

$$f_1 = similarity(x, l^{(1)}) = e^{(-\frac {\left \| x – l^{(1)} \right \| ^ 2}{2\sigma ^2})}$$

$$\left \| x – l^{(1)} \right \| ^ 2 = \sum _{j=1}^{n}(x_j – l_j^{(1)})^2$$

We define some extra features using landmarks and similarity functions to learn more complex nonlinear classifiers.

We use new features that are computed by Kernels, not original features.

#### How the landmarks are chosen

Choose the the location of my landmarks to be exactly near the locations of my m training examples.

Given $$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots , (x^{(m)}, y^{(m)})$$

choose $$l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \cdots \cdots , l^{(m)} = x^{(m)}$$

Given example x :

$$f_1 = similarity(x, l^{(1)})$$ $$f_2 = similarity(x, l^{(2)})$$ $$\cdots$$

Cost Function :

$$minC\sum _{i=1}^{m}[y^{(i)}cost_1(\theta^Tf^{(i)}) + (1-y^{(i)})cost_0(\theta^Tf^{(i)})]+\frac{1}{2}\sum_{j=1}^{n=m}\theta_j^2$$

Using kernels with logistic regression is going too very slow.

#### How to choose $$C \ and \ \sigma$$

$$C = 1 \ / \ \lambda$$
• $$C$$ big: overfitting, higher variance
• $$C$$ small: underfitting, higher bias
• $$\sigma$$ big : lower variance, higher bias
• $$\sigma$$ small : lower bias, higher variance

## Using An SVM

Some library function :

• liblinear
• libsvm

There are a few things need to do :

1. parameter’s C
2. choose the kernel (If we decide not to use any kernel. And the idea of no kernel is also called a linear kernel)

#### Logistic Regression or Support Vector Machine

• n : number of features
• m : number of training examples
1. n >> m, Logistic Regression or Linear Kernel
2. n small, and m middle, like 1 < n < 1000, 10 < m < 10000, SVM with Gaussian Kernel
3. n small, and m big, like 1 < n < 1000, 50000 < m, SVM is slower, so try to munually create more features and then use logistic regression or an SVM without the Kernel

Well for all of these problems, for all of these different regimes, a well designed neural network is likely to work well as well.

The algorithm does matter, but what often matters even more is things like, how much data do you have. And how skilled are you, how good are you at doing error analysis and debugging learning algorithms, figuring out how to design new features and figuring out what other features to give you learning algorithms and so on.

## Prioritizing What to Work On

How to strategize putting together a complex machine learning system.

It’s hard to choose the options which is the best use of your time.

In fact, if you even get to the stage where you brainstorm a list of different options to try, you’re probably already ahead of the curve.

We must have a more systematic way to choose among the options of the many different things.

## Error Analysis

If you’re starting work on a machine learning product or building a machine learning application, it is often considered very good practice to start, not by building a very complicated system with lots of complex features and so on, but to instead start by building a very simple algorithm, the you can implement quickly.

It’s often by implementing even a very, very quick and dirty implementation and by plotting learning curves that that helps you make these decisions.

And often by doing that, this is the process that would inspire you to design new features. Or they’ll tell you whether the current things or current shortcomings of the system and give you the inspiration you need to come up with improvements to it.

#### Recommended way :

1. start by building a very simple algorithm
2. plot a learning curve
3. error analysis (a single rule number evaluation metric)

Strongly recommended way to do error analysis is on the cross validation set rather than the test set.

## Error Metrics for Skewed Classes

It’s particularly tricky to come up with an appropriate error metric, or evaluation metric, for your learning algorithm.

 Matrix Prediction Value Positive Negtive Actual Value Negtive FP TN Positive TP FN

## Trading Off Precision and Recall

$$Precision = TP / (TP + FP)$$
• If you want to make predicting only when you’re more confident, and so you end up with a classifier that has higher precision.
$$Recall= TP / (TP + FN)$$
• If we want to avoid missing too many actual cases. So we want to avoid the false negatives. And in this case, what we would have is going to be a higher recall classifier.
$$F_1Score : 2 \frac{PR}{P + R}$$
• Maybe we can choose the higher F1 value in some cases.

## Data For Machine Learning

In some cases, I had cautioned against blindly going out and just spending lots of time collecting lots of data, because it’s only sometimes that that would actually help.

In machine learning that often in machine learning it’s not who has the best algorithm that wins, it’s who has the most data.

If you have a lot of data and you train a learning algorithm with lot of parameters, that might be a good way to give a high performance learning algorithm.

#### The Key :

1. Find some features x and confidently predict the value of y.
2. Actually get a large training set, and train the learning algorithm with a lot of parameters in the training set.

If you can’t do both then you need a very kind performance learning algorithm.

## Deciding What to Try Next

If you are developing a machine learning system or trying to improve the performance of a machine learning system, how do you go about deciding what are the proxy avenues to try next?

If you find that this is making huge errors in this prediction. What should you then try mixing in order to improve the learning algorithm?

One thing they could try, is to get more training examples. But sometimes getting more training data doesn’t actually help.

Other things you might try are to well maybe try a smaller set of features.

There is a pretty simple technique that can let you very quickly rule out half of the things on this list as being potentially promising things to pursue. And there is a very simple technique, that if you run, can easily rule out many of these options, and potentially save you a lot of time pursuing something that’s just is not going to work.

Machine Learning Diagnostics, and what a diagnostic is, is a test you can run, to get insight into what is or isn’t working with an algorithm, and which will often give you insight as to what are promising things to try to improve a learning algorithm’s performance.

## Evaluating a Hypothesis

If there is any sort of ordinary to the data. That should be better to send a random 70% of your data to the training set and a random 30% of your data to the test set.

## Model Selection and Train_Validation_Test Sets

To send 60% of your data’s, your training set, maybe 20% to your cross validation set, and 20% to your test set.

## Diagnosing Bias vs. Variance

The training set error, will be high. And you might find that the cross validation error will also be high. It might be a close. Maybe just slightly higher than a training error. The algorithm may be suffering from high bias. In contrast if your algorithm is suffering from high variance.

## Regularization and Bias_Variance

Looking at the plot of the whole or cross validation error, you can either manually, automatically try to select a point that minimizes
the cross-validation error and select the value of lambda corresponding to low cross-validation error.

## Learning Curves

Learning curves is often a very useful thing to plot. If either you wanted to sanity check that your algorithm is working correctly, or if you want to improve the performance of the algorithm.

To plot a learning curve, is plot j train which is, say, average squared error on my training set or Jcv which is the average squared error on my cross validation set. And I’m going to plot that as a function of m, that is as a function of the number of training examples.

In the high variance setting, getting more training data is, indeed, likely to help.

## Deciding What to Do Next Revisited

1. getting more training examples is good for high variance
2. a smaller set of features fixes high variance
3. adding features usually is a solution for fixing high bias
4. similarly, adding polynomial features
5. decreasing lambda fixes fixes high bias
6. increasing lambda fixes high variance

It turns out if you’re applying neural network very often using a large neural network often it’s actually the larger, the better.

Using a single hidden layer is a reasonable default, but if you want to choose the number of hidden layers, one other thing you can try is find yourself a training cross-validation, and test set split and try training neural networks with one hidden layer or two hidden layers or three hidden layers and see which of those neural networks performs best on the cross-validation sets.

## Cost Function

Two types of classification problems:

• Binary classification : where the labels y are either zero or one.
• multiclass classification : where we may have k distinct classes.

## Backpropagation Algorithm

It’s too hard to describe.

## Backpropagation Intuition

It’s too hard to describe.

## Implementation Note_ Unrolling Parameters

It’s too hard to describe.

## Gradient Checking

Back prop as an algorithm has one unfortunate property is that there are many ways to have subtle bugs in back prop so that if you run it with gradient descent or some other optimization algorithm, it could actually look like it’s working. And, you know, your cost function $$J(\theta )$$ may end up decreasing on every iteration of gradient descent, but this could pull through even though there might be some bug in your implementation of back prop. So it looks like $$J(\theta )$$ is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation and you might just not know that there was this subtle bug that’s giving you this performance.
Gradient checking that eliminates almost all of these problems.

## Random Initialization

To train a neural network, what you should do is randomly initialize the weights to, you know, small values close to 0, between $$-\epsilon$$ and $$+\epsilon$$.

## Putting It Together

How to implement a neural network learning algorithm ?

• Pick some network architecture, it means connectivity pattern between the neurons.
• Once you decides on the fix set of features x the number of input units will just be, the dimension of your features x(i) would be determined by that.
• The number of output of this will be determined by the number of classes in your classification problem.
• If you use more than one hidden layer, again the reasonable default will be to have the same number of hidden units in every single layer.
• (As for the number of hidden units – usually, the more hidden units the better)

What we need to implement in order to trade in neural network ?

1. set up the neural network and to randomly initialize the values of the weights
2. forward propagation
3. compute this cost function $$J(\theta )$$
4. back-propagation
5. gradient checking
6. use an optimization algorithm

## Autonomous Driving

A fun and historically important example of Neural Network Learning, just so so.

## Non-linear Hypotheses

Neural Networks, which turns out to be a much better way to learn complex hypotheses, complex nonlinear hypotheses even when your input feature space, even when n is large.

## Neurons and the Brain

Neural Networks are a pretty old algorithm that was originally motivated by the goal of having machines that can mimic the brain.

It’s actually a very effective state of the art technique for modern day machine learning applications.

## Model Representation

A neuron is is a computational unit that gets a number of inputs through its input wires, does some computation, and then it sends outputs, via its axon to other nodes or other neurons in the brain.

Weights of a model just means exactly the same thing as parameters of the model.

Forward Propagation : Because it start of with the activation of the input-units and then it sort of forward-propagate that to the hidden layer and then it sort of forward propagate that and compute the activation of the output layer.

The more complex features will be better than x^n, and it will be more work well for prediction new data.

## Examples and Intuitions

In normal Logistic Regression, though we can use some polynomial to contract some features, we still be limited by original features.But in Neuron Network, the original features just be work on input layer.

We can use contract neuron network to  be more complex neuron network that will do more complex compute.

## Multiclass Classification

four output units represents four classification

## The Problem of Overfitting

Regularization, that will allow us to ameliorate or to reduce this overfitting problem and get these learning algorithms to maybe work much better.

If you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms of speeches, then, logistical regression may contort itself, may try really hard to find a decision boundary that fits your training data or go to great lengths to contort itself, to fit every single training example well.
But this really doesn’t look like a very good hypothesis, for making predictions.

The term generalized refers to how well a hypothesis applies even to new examples.

In order to address over fitting, there are two main options for things that we can do.

• reduce the number of features
• regularization, keep all the features, but we’re going to reduce the magnitude or the values of the parameters
• ## Cost Function

#### Orignal Model :

$$h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x\underset{2}{2} + \theta_3x\underset{3}{3} + \theta_4x\underset{4}{4}$$

#### Modified Model :

$$\underset{\theta }{min}\frac{1}{2m}[\sum_{i=1}^{m}(h_\theta(x^{(i)}-y^{(i)})^2+1000\theta\underset{3}{2}+10000\theta \underset{4}{2})]$$

#### Suppose :

$$J(\theta) = \frac {1}{2m} [\sum_{i=1}^{m} (h_\theta(x^{(i)}-y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta _j^2]$$

regularization parameter : $$\lambda$$

#### Notice :

$$\lambda \sum_{j=1}^{n} \theta _j^2$$

The extra regularization term at the end to shrink every single parameter and so this term we tend to shrink all of my parameters.

## Regularized Linear Regression

$$J(\theta) = \frac {1}{2m} [\sum_{i=1}^{m} (h_\theta(x^{(i)}-y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta _j^2]$$

repeat until convergence {

$$\theta _0 := \theta _0 – \alpha \frac {1}{m} \sum _{i=1}^{m} (h_\theta (x^{(i)}) – y^{(i)})x_0^{(i)}$$
$$\theta _j := \theta _j – \alpha [\frac {1}{m} \sum _{i=1}^{m} (h_\theta (x^{(i)}) – y^{(i)})x_j^{(i)} + \frac {\lambda }{m}\theta _j]$$

}

#### Modified :

$$\theta _j := \theta _j(1 – \alpha \frac {\lambda }{m}) – \alpha \frac {1}{m} \sum _{i=1}^{m} (h_\theta (x^{(i)}) – y^{(i)})x_j^{(i)}$$

## Regularized Logistic Regression

$$J(\theta) = \frac {1}{m}\sum _{i=1}^{m} [-y^{(i)} log(h_\theta (x^{(i)})) – (1 – y^{(i)})log(1 – h_\theta (x^{(i)}))] + \frac {\lambda}{2m} \sum _{i=1}^{n} \theta _j^{2}$$

#### Python Code :

1. import numpy as np
2. def costReg(theta, X, y, learningRate):
3. 	theta = np.matrix(theta)
4. 	X = np.matrix(X)
5. 	y = np.matrix(y)
6. 	first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
7. 	second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
8. 	reg = learningRate / (2 * len(X)) * np.sum(np.power(theta[:,1:theta.shape], 2))
9. 	return np.sum(first - second) / len(X) + reg

## Classification

Classification : $$y =$$ 0 or 1
$$h_\theta (x)$$ can be > 1 or < 0

Logistic Regression: $$0 \leq h_\theta (x) \leq 1$$

Logistic regression which has the property that the output, the predictions of logistic regression are always between zero and one.

Logistic Regression is actually a classification algorithm.

## Hypothesis Representation

Sigmoid function : $$g(z) = \frac {1}{1+e^{-z}}$$

1. import numpy as np
2. def sigmoid(z):
3.     return 1 / (1 + np.exp(-z))

## Decision Boundary

Much higher order polynomials, then it’s possible to show that you can get even more complex decision boundaries and logistic regression can be used to find the zero boundaries.

## Cost Function

How to fit the parameters theta for logistic regression. In particular, I’d like to define the optimization objective or the cost function that we’ll use to fit the parameters. Here’s to supervised learning problem of fitting a logistic regression model.

#### Linear regression the cost function

$$J(\theta ) = \frac {1}{m} \sum_{i = 1}^{m} \frac {1}{2}(h_\theta(x^{(i)}) – y^{(i)})^{2}$$

#### Logistic regression the cost function

$$J(\theta ) = \frac {1}{m} \sum_{i = 1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})$$ $$Cost(h_\theta (x), y) = \begin{cases} -log(h_\theta(x)) & \text{ if } y=1 \\ -log(1 – h_\theta(x)) & \text{ if } y=0 \end{cases}$$

## Simplified Cost Function and Gradient Descent

How to implement a fully working version of logistic regression.
It’s too hard to notes here , if you are interested in details, you can visit coursera.org

A vectorized implementation can update, you know, all of these N plus 1 parameters all in one fell swoop.

Feature scaling can help gradient descents converge faster for linear regression. The idea of feature scaling also applies to gradient descent for logistic regression.

## Advanced Optimization

For gradient descent, I guess technically you don’t actually need code to compute the cost function $$J_\theta$$. You only need code to compute the derivative terms.

Conjugate gradient BFGS and L-BFGS are examples of more sophisticated optimization algorithms.

These algorithms have a number of advantages:

• do not need to manually pick the learning rate alpha.
• It is actually entirely possible to use these algorithms successfully and apply to lots of different learning problems without actually understanding the inter-loop of what these algorithms do.

For these algorithms also what I would recommend you do is just use a software library.

Sophisticated optimization library, it makes the just a little bit more opaque and so just maybe a little bit harder to debug. But these algorithms often run much faster than gradient descent.

If you have a large machine learning problem, you can use these algorithms instead of using gradient descent.

## Multiclass Classification_ One-vs-all

one-versus-all classification

Do the same thing for the third class and fit a third classifier H superscript 3 of X and maybe this
or give us a classifier that separates the positive and negative examples like that.

Basically pick the classifier, pick whichever one of the three classifiers is most confident, or most enthusistically says that it thinks it has a right class.

And with this little method you can now take the logistic regression classifier and make it work on multi-class classification problems as well.

## Basic Operations

If you want to build a large scale deployment of a learning algorithm, what people will often do is prototype and the language is Octave.

Get your learning algorithms to work more quickly in Octave. Then overall you have a huge time savings by first developing the algorithms in Octave, and then implementing and maybe C++ or Java, only after we have the ideas working.

Octave is nice because open sourced.

• % : comment
• ~= : not equal
• ; : suppresses the print output
• DISP : For more complex printing
1. V=1：0.1：2 % it sets V to the bunch of elements that start from 1. And increments and steps of 0.1 until you get up to 2.
2. 
3. ones(2, 3) % generates a matrix that is a two by three matrix that is the matrix of all ones.
4. 
5. C = 2 * ones(2, 3) % that is all two's.
6. 
7. w = zeros(1, 3) % that is all zero's.
8. 
9. rand(3,3)
10. 
11. w = rand(1, 3) % normal random variable
12. 
13. hist % plot a histogram
14. 
15. help

## Moving Data Around

1. size(A) % the size of a matrix
2. 
3. size(A, 1) % the first dimension of A, size of the first dimension of A.
4. 
5. length(v) % the size of the longest dimension.
6. 
7. load('featureX.dat')
8. 
9. who % the variables that Octave has in memory currently
10. 
11. whos % the detailed view
12. 
13. clear featuresX
14. 
15. save hello.mat v % save the variable V into a file called hello.mat.
16. 
17. save hello.txt v -ascii % a human readable format
18. 
19. A(3,2)
20. 
21. A(2,:) % fetch everything in the second row.
22. 
23. A([1 3],:) % get all of the elements of A who's first indexes one or three.
24. 
25. A = [A, [100, 101, 102]] % this will do is add another column vector to the right.
26. 
27. A(:) % put all elements with A into a single column vector
28. 
29. C = [A B] % taking these two matrices and just concatenating onto each other.
30. 
31. C = [A; B] % The semicolon notation means that I go put the next thing at the bottom.

There’s no point at all to try to memorize all these commands.

It’s just, but what you should do is, hopefully from this video you have gotten a sense of the sorts of things you can do.

## Computing on Data

1. AxC % multiply 2 of matrices
2. 
3. A .* B % take each elements of A and multiply it by the corresponding elements of B.
4. 
5. A .^ 2 % the element wise squaring of A
6. 
7. 1 ./ V % the element wise reciprocal of V
8. 
9. log(v) % an element wise logarithm of v
10. 
11. exp(v)
12. 
13. abs(V) % the element wise absolute value of V
14. 
15. -v % the same as -1 x V
16. 
17. v + ones(3,1) % this increments V by one.
18. 
19. v + 1 % another simpler way
20. 
21. A' % the apostrophe symbol, a transpose of A
22. 
23. val=max(a) % set val equals max of A
24. 
25. [val, ind] = max(a) % val = the maximum value, ind = the index
26. 
27. a < 3 % a = [1 15 2 0.5], the result will be [1 0 1 1]
28. 
29. find(a < 3) % [1 3 4]
30. 
31. A = magic(3) % Returns this matrices called magic squares that all of their rows and columns and diagonals sum up to the same thing.
32. 
33. [r,c] = find( A>=7 ) % This finds all the elements of a that are greater than and equals to 7 and so, R C sense a row and column.
34. 
35. sum(a) % This adds up all the elements of A.
36. 
37. prod(a) % multiply them
38. 
39. floor(a) % Floor A rounds down
40. 
41. ceil(A) % rounded up
42. 
43. type(3) % sets a 3 by 3 matrix
44. 
45. max(A,[],1) % takes the column wise maximum
46. 
47. max(A,[],2) % takes the per row maximum
48. 
49. max(A) % it defaults to column
50. 
51. max(max(A)) % the maximum element in the entire matrix A
52. 
53. sum(A,1) % does a per column sum
54. 
55. sum(A,2) % do the row wise sum
56. 
57. eye(9) % nine identity matrix
58. 
59. sum(sum(A.*eye(9)) % the sum of these diagonal elements
60. 
61. flipup/flipud % Flip UD stands for flip up/down
62. 
63. pinv(A) % a pseudo inference

After running a learning algorithm, often one of the most useful things is to be able to look at your results, or to plot, or visualize your result.

## Plotting Data

Often, plots of the data or of all the learning algorithm outputs will also give you ideas for how to improve your learning algorithm.

1. t = [0:0.01:0.98]
2. y1 = sin(2 * pi * 4 * t)
3. plot(t, y1) % plot the sine function
4. 
5. y2 = cos(2 * pi * 4 * 4)
6. plot(t, y2)
7. 
8. hold on % figures on top of the old one
9. 
10. plot(t, y2, 'r') % different color
11. 
12. xlabel('time') % label the X axis, or the horizontal axis
13. ylabel('value')
14. 
15. legend('sin', 'cos') % puts this legend up on the upper right showing what the 2 lines are
16. 
17. title('myplot') % the title at the top of this figure
18. 
19. print -dpng 'myplot.png' % save figure
20. 
21. close % disappeared
22. 
23. figure(1); plot(t, y1); % Starts up first figure, and that plots t, y1.
24. figure(2); plot(t, y2);
25. 
26. subplot(1,2,1) % sub-divides the plot into a one-by-two grid with the first 2 parameters are
27. plot(t, y1) % fills up this first element
28. subplot(1,2,2)
29. plot(t, y2)
30. 
31. axis([0.5 1 -1 1]) % sets the x range and y range for the figure on the right
32. 
33. clf % clear
34. 
35. imagesc(A) % visualize the matrix and the different colors correspond to the different values in the A matrix.
36. colormap gray % color map gray
37. 
38. imagesc(magic(15))，colorbar，colormap gray % running three commands at a time

## Control Statements_ for, while, if statements

1. for i = 1 : 10,
2.     v(i) = 2 ^ i;
3. end;
4. 
5. indices = 1 : 10;
6. for i = indices,
7.     disp(i);
8. end;
9. 
10. i = 1;
11. while i <= 5,
12.     v(i) = 100;
13.     i = i + 1;
14. end;
15. 
16. i = 1;
17. while true,
18.     v(i) = 999;
19.     i = i + 1;
20.     if i == 6,
21.         break;
22.     end;
23. end;
24. 
25. v(1) = 2;
26. if v(1) == 1,
27.     disp('The value is one');
28. elseif v(1) == 2,
29.     disp('The value is two');
30. else,
31.     disp('The value is not one or two');
32. end;
33. 
34. function name (arg-list)
35.   body
36. endfunction
37. 
38. function wakeup (message)
39.   printf ("\a%s\n", message);
40. endfunction
41. 
42. wakeup ("Rise and shine!");
43. 
44. function y = squareThisNumber(x)
45. 
46. addpath % add path for search dirs
47. 
48. [a, b] = SquareAndCubeThisNumber(5) % a = 25, b = 125

## Vectorization

#### Unvectorized implementation

$$h_\theta (x) = \sum _{j=0}^{n} \theta _jx_j$$
1. prediction = 0.0;
2. for j = 1 : n + 1,
3.     prediction = prediction + theta(j) * x(j)
4. end;

#### Vectorized implementation

$$h_\theta (x) = \theta ^Tx$$
1. prediction = theta' * x;

Using a vectorized implementation, you should be able to get a much more efficient implementation of linear regression.

## Working on and Submitting Programming Exercises

How to use the submission system which will let you verify right away that you got the right answer for your machine learning program exercise.

If you are interested in details, you can visit coursera.org