10 Advice for Applying Machine Learning

Deciding What to Try Next

If you are developing a machine learning system, or trying to improve the performance of one, how do you decide which of the promising avenues to try next?

Suppose you find that your model is making unacceptably large errors in its predictions. What should you then try in order to improve the learning algorithm?

One thing you could try is to get more training examples. But sometimes getting more training data doesn’t actually help.

Other things you might try include using a smaller set of features.

There is a pretty simple technique that can let you very quickly rule out many of the things on this list as unpromising, potentially saving you a lot of time pursuing something that is just not going to work.

Machine learning diagnostics: a diagnostic is a test you can run to get insight into what is or isn’t working with an algorithm, and which will often tell you which things are promising to try in order to improve a learning algorithm’s performance.

Evaluating a Hypothesis

If there is any sort of ordering to the data, it is better to send a random 70% of your data to the training set and the remaining random 30% of your data to the test set.

Model Selection and Train_Validation_Test Sets

Send 60% of your data to the training set, maybe 20% to your cross-validation set, and 20% to your test set.
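
A minimal sketch of such a split in numpy (the function name split_data and the 60/20/20 ratios are illustrative; shuffling first guards against any ordering in the data):

    import numpy as np

    def split_data(X, y, seed=0):
        # shuffle, then take 60% train / 20% cross-validation / 20% test
        m = len(y)
        idx = np.random.RandomState(seed).permutation(m)
        i1, i2 = int(0.6 * m), int(0.8 * m)
        return ((X[idx[:i1]], y[idx[:i1]]),
                (X[idx[i1:i2]], y[idx[i1:i2]]),
                (X[idx[i2:]], y[idx[i2:]]))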

Diagnosing Bias vs. Variance

If the training set error is high, and the cross-validation error is also high (close to, maybe just slightly higher than, the training error), the algorithm may be suffering from high bias. In contrast, if the training error is low but the cross-validation error is much higher than the training error, the algorithm is suffering from high variance.

Regularization and Bias_Variance

Looking at the plot of the hold-out cross-validation error, you can (manually or automatically) select the point that minimizes the cross-validation error, and pick the value of lambda corresponding to that low cross-validation error.
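
A sketch of that selection loop, assuming hypothetical helpers train_model (fits theta with regularization strength lam) and cv_error (unregularized error on the cross-validation set); the doubling grid of lambda values mirrors the kind of grid used in the course:

    import numpy as np

    lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
    errors = [cv_error(train_model(X_train, y_train, lam), X_cv, y_cv)
              for lam in lambdas]
    best_lambda = lambdas[int(np.argmin(errors))]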

Learning Curves

A learning curve is often a very useful thing to plot, either to sanity-check that your algorithm is working correctly or to improve its performance.

To plot a learning curve, plot \(J_{train}\) (say, the average squared error on the training set) and \(J_{cv}\) (the average squared error on the cross-validation set) as a function of m, the number of training examples.

In the high variance setting, getting more training data is, indeed, likely to help.
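
As a sketch (fit_model and avg_squared_error are hypothetical helpers, and matplotlib is assumed for plotting):

    import matplotlib.pyplot as plt

    sizes = range(1, m + 1)
    j_train, j_cv = [], []
    for i in sizes:
        theta = fit_model(X_train[:i], y_train[:i])           # fit on the first i examples
        j_train.append(avg_squared_error(theta, X_train[:i], y_train[:i]))
        j_cv.append(avg_squared_error(theta, X_cv, y_cv))     # always the full CV set
    plt.plot(sizes, j_train, label='J_train')
    plt.plot(sizes, j_cv, label='J_cv')
    plt.xlabel('m (number of training examples)')
    plt.legend()
    plt.show()

A large gap between \(J_{cv}\) and \(J_{train}\) that narrows as m grows suggests high variance; two high curves that quickly converge suggest high bias.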

Deciding What to Do Next Revisited

  1. getting more training examples fixes high variance
  2. trying a smaller set of features fixes high variance
  3. adding features usually fixes high bias
  4. similarly, adding polynomial features fixes high bias
  5. decreasing lambda fixes high bias
  6. increasing lambda fixes high variance

It turns out that when applying neural networks, very often the larger the network, the better.

Using a single hidden layer is a reasonable default, but if you want to choose the number of hidden layers, one thing you can try is to make a training / cross-validation / test split, train neural networks with one, two, or three hidden layers, and see which performs best on the cross-validation set.
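
A sketch of that search using scikit-learn (not part of the course; the 25-unit layer width and the MLPClassifier settings are just illustrative choices):

    from sklearn.neural_network import MLPClassifier

    candidates = [(25,), (25, 25), (25, 25, 25)]   # one, two, or three hidden layers
    scores = [MLPClassifier(hidden_layer_sizes=h, max_iter=1000)
              .fit(X_train, y_train)
              .score(X_cv, y_cv) for h in candidates]
    best_architecture = candidates[scores.index(max(scores))]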

9 Neural Networks: Learning

Cost Function

Two types of classification problems:

  • Binary classification : where the labels y are either zero or one.
  • Multiclass classification : where we may have K distinct classes.

Backpropagation Algorithm

It’s too hard to describe.

Backpropagation Intuition

It’s too hard to describe.

Implementation Note_ Unrolling Parameters

It’s too hard to describe.

Gradient Checking

Backprop as an algorithm has one unfortunate property: there are many ways to introduce subtle bugs, so that if you run it with gradient descent or some other optimization algorithm, it can look like it is working. Your cost function \(J(\theta )\) may decrease on every iteration even though there is a bug in your implementation of backprop, and you might wind up with a neural network that has a higher level of error than a bug-free implementation would give, without ever knowing that this subtle bug was hurting performance.
Gradient checking eliminates almost all of these problems.
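
A sketch of the two-sided difference estimate (J is the cost as a function of the unrolled parameter vector; \(\epsilon \approx 10^{-4}\) is the value the course suggests). Compare the result against the backprop gradients, then disable the check before training, since it is very slow:

    import numpy as np

    def numerical_gradient(J, theta, eps=1e-4):
        # approximate dJ/dtheta_i by (J(theta + e_i*eps) - J(theta - e_i*eps)) / (2*eps)
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            e = np.zeros_like(theta)
            e[i] = eps
            grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
        return grad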

Random Initialization

To train a neural network, you should randomly initialize the weights to small values close to 0, between \(-\epsilon\) and \(+\epsilon\). (Initializing all weights to zero would make the hidden units compute identical functions.)
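
A sketch of that initialization for one layer (the function name is illustrative; the +1 column accounts for the bias unit, and 0.12 is a commonly used \(\epsilon\)):

    import numpy as np

    def random_init(l_in, l_out, epsilon=0.12):
        # uniform values in (-epsilon, +epsilon), breaking the symmetry between units
        return np.random.rand(l_out, l_in + 1) * 2 * epsilon - epsilon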

Putting It Together

How do you implement a neural network learning algorithm?

  • Pick a network architecture, meaning the connectivity pattern between the neurons.
  • Once you decide on the set of features x, the number of input units is determined: it is the dimension of your features \(x^{(i)}\).
  • The number of output units is determined by the number of classes in your classification problem.
  • If you use more than one hidden layer, a reasonable default is to have the same number of hidden units in every layer.
  • (As for the number of hidden units: usually, the more hidden units the better.)

What do we need to implement in order to train a neural network?

  1. set up the neural network and randomly initialize the weights
  2. implement forward propagation to compute \(h_\theta(x^{(i)})\)
  3. implement code to compute the cost function \(J(\theta )\)
  4. implement back-propagation to compute the partial derivatives
  5. use gradient checking to verify the back-propagation implementation, then disable it
  6. use gradient descent or an advanced optimization algorithm to minimize \(J(\theta )\)

Autonomous Driving

A fun and historically important example of neural network learning.

8 Neural Networks: Representation

Non-linear Hypotheses

Neural networks turn out to be a much better way to learn complex nonlinear hypotheses, even when the input feature space is large (n is large).

Neurons and the Brain

Neural Networks are a pretty old algorithm that was originally motivated by the goal of having machines that can mimic the brain.

It’s actually a very effective state of the art technique for modern day machine learning applications.

Model Representation

A neuron is a computational unit that gets a number of inputs through its input wires, does some computation, and then sends outputs via its axon to other nodes or neurons in the brain.

Weights of a model just means exactly the same thing as parameters of the model.

Forward Propagation : it starts with the activations of the input units, forward-propagates them to compute the activations of the hidden layer, and then forward-propagates those to compute the activations of the output layer.

These more complex learned features can be better than raw polynomial terms like \(x^n\), and can work better for predicting new data.

Examples and Intuitions

In ordinary logistic regression, even though we can construct polynomial terms from the features, we are still limited by the original features. In a neural network, the original features only feed the input layer; the hidden layers learn new features.

We can compose simple neural networks into more complex neural networks that carry out more complex computations.

Multiclass Classification

Four output units represent four classes.

7 Regularization

The Problem of Overfitting

Regularization will allow us to ameliorate or reduce the overfitting problem and get these learning algorithms to work much better.

If you were to fit a very high-order polynomial, generating lots of high-order polynomial terms from the features, then logistic regression may contort itself, trying really hard to find a decision boundary that fits your training data, going to great lengths to fit every single training example well.
But this really doesn’t look like a very good hypothesis for making predictions.

The term generalization refers to how well a hypothesis applies even to new examples.

In order to address overfitting, there are two main options for things that we can do.

  • reduce the number of features
  • regularization, keep all the features, but we’re going to reduce the magnitude or the values of the parameters

    Cost Function

    Original Model :

    \(
    h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4
    \)

    Modified Model :

    \(
    \underset{\theta }{min}\ \frac{1}{2m}\left[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 + 1000\,\theta_3^2 + 10000\,\theta_4^2\right]
    \)

    Suppose :

    \(
    J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]
    \)

    regularization parameter : \(\lambda\)

    Notice :

    \(\lambda \sum_{j=1}^{n} \theta _j^2\)

    The extra regularization term at the end shrinks every single parameter \(\theta_1, \dots, \theta_n\) (by convention \(\theta_0\) is not penalized), so all of the parameters tend to be shrunk.

    Regularized Linear Regression

    \(
    J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]
    \)

    repeat until convergence {

      \(\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}\)
      \(\theta_j := \theta_j - \alpha \left[\frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]\)

    }

    Modified :

    \(\theta_j := \theta_j\left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)
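
    A sketch of one such update step in numpy (the names are illustrative; note that \(\theta_0\) is deliberately not regularized):

    import numpy as np

    def gradient_step_reg(theta, X, y, alpha, lam):
        # one regularized gradient-descent step for linear regression
        m = len(y)
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lam / m) * theta[1:]   # skip theta_0
        return theta - alpha * grad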

    Regularized Logistic Regression

    \(J(\theta) = \frac{1}{m}\sum_{i=1}^{m} [-y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2}\)

    Python Code :

    import numpy as np

    def costReg(theta, X, y, learningRate):
        # regularized logistic-regression cost; learningRate here is the
        # regularization parameter lambda (uses sigmoid() defined below)
        theta = np.matrix(theta)
        X = np.matrix(X)
        y = np.matrix(y)
        first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
        second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
        reg = learningRate / (2 * len(X)) * np.sum(np.power(theta[:, 1:theta.shape[1]], 2))
        return np.sum(first - second) / len(X) + reg

    6 Logistic Regression

    Classification

    Classification : \(y \in \{0, 1\}\)
    With linear regression, \(h_\theta (x)\) can be > 1 or < 0.

    Logistic Regression: \(0 \leq h_\theta (x) \leq 1 \)

    Logistic regression has the property that its output, its predictions, are always between zero and one.

    Logistic Regression is actually a classification algorithm.

    Hypothesis Representation

    Sigmoid function : \(g(z) = \frac {1}{1+e^{-z}}\)

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    Decision Boundary

    With much higher-order polynomial features, it’s possible to get even more complex decision boundaries; logistic regression can learn highly nonlinear decision boundaries.

    Cost Function

    How do we fit the parameters theta for logistic regression? In particular, we’d like to define the optimization objective, or cost function, that we’ll use to fit the parameters. Here’s the supervised learning problem of fitting a logistic regression model.

    Linear regression cost function :

    \(
    J(\theta ) = \frac{1}{m} \sum_{i = 1}^{m} \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^{2}
    \)

    Logistic regression cost function :

    \(
    J(\theta ) = \frac{1}{m} \sum_{i = 1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})
    \)

    \(
    Cost(h_\theta (x), y) = \begin{cases}
    -\log(h_\theta(x)) & \text{ if } y=1 \\
    -\log(1 - h_\theta(x)) & \text{ if } y=0
    \end{cases}
    \)

    Simplified Cost Function and Gradient Descent

    How to implement a fully working version of logistic regression.
    It’s too hard to take notes here; if you are interested in the details, you can visit coursera.org.

    A vectorized implementation can update all n+1 parameters in one fell swoop.

    Feature scaling can help gradient descents converge faster for linear regression. The idea of feature scaling also applies to gradient descent for logistic regression.

    Advanced Optimization

    For gradient descent, technically you don’t actually need code to compute the cost function \(J(\theta)\); you only need code to compute the derivative terms.

    Conjugate gradient, BFGS, and L-BFGS are examples of more sophisticated optimization algorithms.

    These algorithms have a number of advantages:

  • do not need to manually pick the learning rate alpha.
  • It is entirely possible to use these algorithms successfully on lots of different learning problems without actually understanding the inner loop of what they do.

    For these algorithms, what I would recommend is that you just use a software library.

    A sophisticated optimization library makes things a little more opaque, and so maybe a little harder to debug, but these algorithms often run much faster than gradient descent.

    If you have a large machine learning problem, you can use these algorithms instead of using gradient descent.

    Multiclass Classification_ One-vs-all

    one-versus-all classification

    Do the same thing for the third class: fit a third classifier \(h_\theta^{(3)}(x)\), which gives us a decision boundary separating that class’s positive examples from the negative examples.

    Basically, pick whichever one of the three classifiers is most confident, i.e., most enthusiastically says that it thinks it has the right class.

    And with this little method you can now take the logistic regression classifier and make it work on multi-class classification problems as well.
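
    A sketch of the prediction step (all_theta holds one row of fitted parameters per class; sigmoid is the function defined above):

    import numpy as np

    def predict_one_vs_all(all_theta, X):
        # pick, for each example, the class whose classifier is most confident
        probs = sigmoid(X @ all_theta.T)
        return np.argmax(probs, axis=1)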

    5 Octave Tutorial

    Basic Operations

    If you want to build a large-scale deployment of a learning algorithm, what people will often do is first prototype it in a language like Octave.

    You can get your learning algorithms to work more quickly in Octave. Then you get a huge overall time saving by first developing the algorithms in Octave, and implementing in C++ or Java only after the ideas are working.

    Octave is nice because it is open source.

  • % : comment
  • ~= : not equal
  • ; : suppresses the print output
  • disp : for more complex printing
    v = 1:0.1:2 % sets v to a row vector of elements starting at 1, in steps of 0.1, up to 2

    ones(2, 3) % generates a two-by-three matrix of all ones

    C = 2 * ones(2, 3) % that is all two's

    w = zeros(1, 3) % that is all zero's

    rand(3, 3) % uniformly distributed random values

    w = randn(1, 3) % Gaussian (normal) random variables

    hist(w) % plot a histogram

    help

    Moving Data Around

    size(A) % the size of a matrix

    size(A, 1) % the size of the first dimension of A

    length(v) % the size of the longest dimension

    load('featureX.dat')

    who % the variables Octave currently has in memory

    whos % the detailed view

    clear featuresX

    save hello.mat v % save the variable v into a file called hello.mat

    save hello.txt v -ascii % a human-readable format

    A(3,2)

    A(2,:) % fetch everything in the second row

    A([1 3],:) % get all of the elements of A whose first index is 1 or 3

    A = [A, [100; 101; 102]] % append another column vector to the right

    A(:) % put all elements of A into a single column vector

    C = [A B] % concatenate the two matrices side by side

    C = [A; B] % the semicolon means put the next thing at the bottom

    There’s no point at all in trying to memorize all these commands. Hopefully, you have instead gotten a sense of the sorts of things you can do.

    Computing on Data

    A * C % multiply two matrices

    A .* B % take each element of A and multiply it by the corresponding element of B

    A .^ 2 % the element-wise squaring of A

    1 ./ v % the element-wise reciprocal of v

    log(v) % an element-wise logarithm of v

    exp(v)

    abs(v) % the element-wise absolute value of v

    -v % the same as -1 * v

    v + ones(3,1) % increments v by one

    v + 1 % another, simpler way

    A' % the apostrophe symbol, the transpose of A

    val = max(a) % set val to the maximum of a

    [val, ind] = max(a) % val = the maximum value, ind = its index

    a < 3 % for a = [1 15 2 0.5], the result is [1 0 1 1]

    find(a < 3) % [1 3 4]

    A = magic(3) % returns a magic square: all rows, columns, and diagonals sum to the same value

    [r,c] = find(A >= 7) % finds all elements of A greater than or equal to 7; r and c are the row and column indices

    sum(a) % adds up all the elements of a

    prod(a) % multiplies them

    floor(a) % rounds down

    ceil(A) % rounds up

    A = magic(3) % reset A to a 3 by 3 magic square

    max(A,[],1) % takes the column-wise maximum

    max(A,[],2) % takes the per-row maximum

    max(A) % defaults to column-wise

    max(max(A)) % the maximum element in the entire matrix A

    sum(A,1) % a per-column sum

    sum(A,2) % the row-wise sum

    eye(9) % the 9 by 9 identity matrix

    sum(sum(A .* eye(9))) % the sum of the diagonal elements

    flipud(A) % flipud stands for flip up/down

    pinv(A) % the pseudo-inverse

    After running a learning algorithm, often one of the most useful things is to be able to look at your results, or to plot, or visualize your result.

    Plotting Data

    Often, plots of the data or of all the learning algorithm outputs will also give you ideas for how to improve your learning algorithm.

    t = [0:0.01:0.98]
    y1 = sin(2 * pi * 4 * t)
    plot(t, y1) % plot the sine function

    y2 = cos(2 * pi * 4 * t)
    plot(t, y2)

    hold on % plot new figures on top of the old one

    plot(t, y2, 'r') % a different color

    xlabel('time') % label the x (horizontal) axis
    ylabel('value')

    legend('sin', 'cos') % puts a legend in the upper right showing what the two lines are

    title('myplot') % the title at the top of the figure

    print -dpng 'myplot.png' % save the figure

    close % close the figure

    figure(1); plot(t, y1); % starts up the first figure and plots t, y1
    figure(2); plot(t, y2);

    subplot(1,2,1) % sub-divides the plot into a 1-by-2 grid and accesses the first element
    plot(t, y1) % fills up the first element
    subplot(1,2,2)
    plot(t, y2)

    axis([0.5 1 -1 1]) % sets the x range and y range for the figure on the right

    clf % clear the figure

    imagesc(A) % visualize the matrix; different colors correspond to different values in A
    colormap gray

    imagesc(magic(15)), colorbar, colormap gray % running three commands at a time

    Control Statements_ for, while, if statements

    for i = 1 : 10,
        v(i) = 2 ^ i;
    end;

    indices = 1 : 10;
    for i = indices,
        disp(i);
    end;

    i = 1;
    while i <= 5,
        v(i) = 100;
        i = i + 1;
    end;

    i = 1;
    while true,
        v(i) = 999;
        i = i + 1;
        if i == 6,
            break;
        end;
    end;

    v(1) = 2;
    if v(1) == 1,
        disp('The value is one');
    elseif v(1) == 2,
        disp('The value is two');
    else,
        disp('The value is not one or two');
    end;

    function name (arg-list)
      body
    endfunction

    function wakeup (message)
      printf ("\a%s\n", message);
    endfunction

    wakeup ("Rise and shine!");

    function y = squareThisNumber(x)
      y = x ^ 2;
    endfunction

    addpath % add a directory to Octave's search path

    [a, b] = SquareAndCubeThisNumber(5) % a = 25, b = 125

    Vectorization

    Unvectorized implementation

    \(h_\theta (x) = \sum _{j=0}^{n} \theta _jx_j\)
    prediction = 0.0;
    for j = 1 : n + 1,
        prediction = prediction + theta(j) * x(j);
    end;

    Vectorized implementation

    \(h_\theta (x) = \theta ^Tx\)
    prediction = theta' * x;

    Using a vectorized implementation, you should be able to get a much more efficient implementation of linear regression.

    Working on and Submitting Programming Exercises

    How to use the submission system which will let you verify right away that you got the right answer for your machine learning program exercise.

    If you are interested in details, you can visit coursera.org

    4 Linear Regression with Multiple Variables

    Multiple Features

    \(x^{(i)}_j\) : refers to the value of feature j in the i-th training example.

    Gradient Descent for Multiple Variables

    import numpy as np

    def computeCost(X, y, theta):
        # average squared-error cost; X, y, theta are np.matrix objects, theta is 1 x n
        inner = np.power(((X * theta.T) - y), 2)
        return np.sum(inner) / (2 * len(X))
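
    A matching gradient-descent sketch in the same np.matrix conventions as computeCost above (alpha and iters are values you would tune):

    def gradientDescent(X, y, theta, alpha, iters):
        # batch gradient descent: update every theta_j simultaneously each iteration
        m = len(X)
        for _ in range(iters):
            error = (X * theta.T) - y
            theta = theta - (alpha / m) * (error.T * X)
        return theta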

    Gradient Descent in Practice I – Feature Scaling

    If the different features take on similar ranges of values, then gradient descent can converge more quickly.

    \(
    x_n := \frac{x_n - \mu_n}{s_n}
    \)
    \(\mu _n\) : the average value
    \(s_n\) : the range of values
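
    A mean-normalization sketch in numpy (using the range as \(s_n\); the standard deviation is also a common choice):

    import numpy as np

    def feature_normalize(X):
        mu = X.mean(axis=0)                 # the average value of each feature
        s = X.max(axis=0) - X.min(axis=0)   # the range of each feature
        return (X - mu) / s, mu, s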

    Gradient Descent in Practice II – Learning Rate

    Plot the cost function \(J(\theta)\) as gradient descent runs, with the x-axis being the number of iterations of gradient descent; if gradient descent is working properly, \(J(\theta)\) should decrease after every iteration.

    Maybe \(\alpha = 0.01, 0.03, 0.1, 0.3, 1, 3, 10\)

    Features and Polynomial Regression

    Look at the data and choose features.

    You can use polynomial features as well, and sometimes, with appropriate insight into the features, you can get a much better model for your data.

    Feature scaling becomes very important if you use polynomial features.

    Normal Equation

    The normal equation, which for some linear regression problems, will give us a much better way to solve for the optimal value of the parameters theta.

    \(\theta = (X^{T}X)^{-1}X^{T}y\)
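
    In numpy this is essentially a one-liner; pinv is used instead of inv so the non-invertible case (discussed below) is handled gracefully:

    import numpy as np

    def normal_equation(X, y):
        # theta = (X'X)^(-1) X'y
        return np.linalg.pinv(X.T @ X) @ X.T @ y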

    Disadvantages of gradient descent :

  • need to choose the learning rate alpha
  • needs many more iterations

    Advantages of gradient descent :

  • works pretty well even if you have millions of features
  • applies to many kinds of models

    Normal Equation Noninvertibility

    Some matrices are invertible and some matrices do not have an inverse; we call those non-invertible matrices.

    Look at your features and see whether any are redundant (e.g., one is a linear function of another). If you do have redundant features, you don’t really need both, and deleting one of them will usually solve your non-invertibility problem.

    3 Linear Algebra Review

    Matrices and Vectors

    The dimension of a matrix is written as the number of rows times the number of columns.

    A vector turns out to be a special case of a matrix.

    A matrix with just one column is what we call a vector.

    Addition and Scalar Multiplication

    Addition

    \(
    \begin{bmatrix}
    1 & 0 \\
    2 & 5 \\
    3 & 1
    \end{bmatrix}
    +
    \begin{bmatrix}
    4 & 0.5 \\
    2 & 5 \\
    0 & 1
    \end{bmatrix}
    =
    \begin{bmatrix}
    5 & 0.5 \\
    4 & 10 \\
    3 & 2
    \end{bmatrix}
    \)

    Scalar Multiplication

    \(
    3 *
    \begin{bmatrix}
    1 & 0 \\
    2 & 5 \\
    3 & 1
    \end{bmatrix}
    =
    \begin{bmatrix}
    3 & 0 \\
    6 & 15 \\
    9 & 3
    \end{bmatrix}
    =
    \begin{bmatrix}
    1 & 0 \\
    2 & 5 \\
    3 & 1
    \end{bmatrix}
    * 3
    \)

    Matrix Vector Multiplication

    \(
    \begin{bmatrix}
    1 & 3 \\
    4 & 0 \\
    2 & 1
    \end{bmatrix}
    *
    \begin{bmatrix}
    1 \\
    5
    \end{bmatrix}
    =
    \begin{bmatrix}
    16 = 1 * 1 + 3 * 5 \\
    4 = 4 * 1 + 0 * 5 \\
    7 = 2 * 1 + 1 * 5
    \end{bmatrix}
    \)

    Matrix Matrix Multiplication

    \(
    \begin{bmatrix}
    C0 & C1 \\
    C2 & C3
    \end{bmatrix}
    =
    \begin{bmatrix}
    A0 & A1 \\
    A2 & A3
    \end{bmatrix}
    *
    \begin{bmatrix}
    B0 & B1 \\
    B2 & B3
    \end{bmatrix}
    \)

    \(C0 = A0 * B0 + A1 * B2\)
    \(C1 = A0 * B1 + A1 * B3\)
    \(C2 = A2 * B0 + A3 * B2\)
    \(C3 = A2 * B1 + A3 * B3\)

    Matrix Multiplication Properties

    \(A * B \neq B * A\)
    \(A * (B * C) = (A * B) * C\)

    The identity matrix has ones along the diagonal and zeros everywhere else.

    \(AA^{-1} = A^{-1}A = I\)
    \(AI = IA = A\)

    Inverse and Transpose

    Inverse

    \(A^{-1}\)

    Transpose

    \(A^{T}\)

    \(
    \begin{bmatrix}
    a & b\\
    c & d\\
    e & f
    \end{bmatrix}
    ^{T}
    =
    \begin{bmatrix}
    a & c & e\\
    b & d & f
    \end{bmatrix}
    \)

    2 Linear Regression with One Variable

    Model Representation

    \(m\) : the number of training examples

    \(x\) : input variables

    \(y\) : output variables

    \((x, y)\) : a single training example

    \((x^{(i)}, y^{(i)})\) : refer to the ith training example

    \(h\) : representing the hypothesis

    Cost Function

    \(J({\theta }_0, {\theta }_1) = \frac{1}{2m}\sum^{m}_{i=1}(h_{\theta }(x^{(i)}) - y^{(i)})^2\)

    The cost function is also called the squared error function, or sometimes the squared error cost function.

    Gradient Descent

    repeat until convergence {

      \({\theta }_j := {\theta }_j - {\alpha }\frac{\partial }{\partial {\theta }_j}J({\theta }_0, {\theta }_1)\) (for j = 0 and j = 1)

    }

    When people talk about gradient descent, they always mean simultaneous update, as in the sketch below.
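
    A sketch of what simultaneous update means for the two-parameter case (dJ_dtheta0 and dJ_dtheta1 are hypothetical functions computing the partial derivatives): both new values are computed from the old \(\theta_0, \theta_1\) before either is overwritten.

    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta0, theta1 = temp0, temp1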

    Gradient Descent for Linear Regression

      The term batch gradient descent refers to the fact that in every step of gradient descent, we’re looking at all of the training examples.
      It turns out that gradient descent will scale better to larger data sets than the normal equation method.

    1 Introduction

    Why Have Machine Learning

    For the most part we just did not know how to write AI programs to do the more interesting things such as web search or photo tagging or email anti-spam.
    There was a realization that the only way to do these things was to have a machine learn to do it by itself.

    What is Machine Learning

    Even among machine learning practitioners there isn’t a well accepted definition of what is and what
    isn’t machine learning.

    • Someone defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed.
    • A computer program is said to learn from experience E, with respect to some task T, and some performance measure P, if its performance on T as measured by P improves with experience E.

    And if you actually develop a machine learning system, you need to know how to make those best-practice decisions about the way in which you build your system.

    Supervised Learning

    • Supervised learning refers to the fact that we gave the algorithm a data set in which the “right answers” were given.
    • To define with a bit more terminology this is also called a regression problem and by regression problem I mean we’re trying to predict a continuous value output.
    • Support Vector Machines, which will allow a computer to deal with an infinite number of features.

    Unsupervised Learning

    • Given this data set, an unsupervised learning algorithm might decide that the data lives in two different clusters, and break the data into those two separate clusters. This is called a clustering algorithm.
    • Because we’re not giving the algorithm the right answer for the examples in my data set, this is Unsupervised Learning.
    • Clustering is just one type of Unsupervised Learning.
    • Octave, is free open source software, and using a tool like Octave or Matlab, many learning algorithms become just a few lines of code to implement.

    You will learn much faster if you use Octave as your programming environment; using Octave as your learning and prototyping tool lets you learn and prototype learning algorithms much more quickly.
    In fact, what many people at the large Silicon Valley companies do is use a language like Octave to first prototype the learning algorithm, and only after it works migrate it to C++ or Java or whatever.