Model Representation
\(m\) : the number of training examples
\(x\) : input variables
\(y\) : output variables
\((x, y)\) : a single training example
\((x^{(i)}, y^{(i)})\) : the \(i\)-th training example
\(h\) : the hypothesis function
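For univariate linear regression the hypothesis takes the form \(h_\theta(x) = \theta_0 + \theta_1 x\). A minimal sketch of that hypothesis in Python (the function and parameter names are just illustrative):

```python
def h(theta0, theta1, x):
    """Hypothesis for univariate linear regression: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Example: with theta0 = 1 and theta1 = 2, the prediction for x = 3 is 7.
print(h(1.0, 2.0, 3.0))  # 7.0
```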
Cost Function
\(J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\)
The cost function is also called the squared error function, or sometimes the squared error cost function.
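A minimal sketch of the squared error cost function, assuming the univariate hypothesis \(h_\theta(x) = \theta_0 + \theta_1 x\) and Python lists x and y holding the m training examples:

```python
def compute_cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    total = 0.0
    for xi, yi in zip(x, y):
        prediction = theta0 + theta1 * xi   # h_theta(x^(i))
        total += (prediction - yi) ** 2
    return total / (2 * m)

# Example: a perfect fit gives zero cost.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(compute_cost(0.0, 2.0, x, y))  # 0.0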
Gradient Descent
repeat until convergence {
\(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)\) (for j = 0 and j = 1)
}
When people talk about gradient descent, they always mean the simultaneous update of the parameters.
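The simultaneous update computes every new parameter value from the old values before assigning any of them. A sketch of that pattern, assuming a helper dJ_dtheta(j, theta0, theta1) that returns \(\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)\) (the helper name is hypothetical):

```python
def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta):
    """One simultaneous-update step of gradient descent for two parameters."""
    # Compute both updates from the *current* parameter values ...
    temp0 = theta0 - alpha * dJ_dtheta(0, theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta(1, theta0, theta1)
    # ... then assign them together, so theta1's update does not see the new theta0.
    return temp0, temp1

# Toy example with J = theta0^2 + theta1^2, whose partials are 2*theta0 and 2*theta1.
step = gradient_descent_step(1.0, 1.0, 0.1, lambda j, t0, t1: 2 * t0 if j == 0 else 2 * t1)
print(step)  # (0.8, 0.8)
```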
Gradient Descent for Linear Regression
- The term batch gradient descent refers to the fact that every step of gradient descent looks at all of the training examples (see the sketch after this list).
- It turns out that gradient descent scales better to larger data sets than the normal equation method.
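A sketch of batch gradient descent for univariate linear regression, using the partial derivatives \(\frac{\partial}{\partial \theta_0} J = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\) and \(\frac{\partial}{\partial \theta_1} J = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\,x^{(i)}\); the learning rate and iteration count below are illustrative choices, not prescribed values:

```python
def batch_gradient_descent(x, y, alpha=0.1, num_iters=1000):
    """Fit theta0, theta1 by batch gradient descent; every step uses all m examples."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        # Errors h_theta(x^(i)) - y^(i) over the whole training set (the "batch").
        errors = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]
        grad0 = sum(errors) / m
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Example: recover approximately y = 2x on a tiny data set.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(batch_gradient_descent(x, y))  # roughly (0.0, 2.0)
```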