7 Regularization

The Problem of Overfitting

Regularization will allow us to ameliorate, or reduce, this overfitting problem and get these learning algorithms to work much better.

If you fit a very high-order polynomial, that is, if you generate lots of high-order polynomial terms of the features, then logistic regression may contort itself, going to great lengths to find a decision boundary that fits every single training example well.
But this does not look like a very good hypothesis for making predictions.
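
The same effect is easy to reproduce numerically. The sketch below is an illustration only: it uses polynomial least squares via `np.polyfit` rather than logistic regression, and the toy data, the noise level, and the helper names `mse` and `fit_and_score` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy quadratic sampled at 10 training points.
x_train = np.linspace(-1, 1, 10)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0, 0.3, x_train.size)

# Fresh examples drawn from the same underlying curve.
x_new = np.linspace(-0.95, 0.95, 50)
y_new = 1.0 + 2.0 * x_new - 3.0 * x_new**2 + rng.normal(0, 0.3, x_new.size)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with coefficients `coeffs`."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def fit_and_score(degree):
    """Fit on the training points; return (training error, error on new points)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return mse(coeffs, x_train, y_train), mse(coeffs, x_new, y_new)

train_err_2, new_err_2 = fit_and_score(2)
train_err_9, new_err_9 = fit_and_score(9)  # 10 coefficients for 10 points

print("degree 2:", train_err_2, new_err_2)
print("degree 9:", train_err_9, new_err_9)
```

The degree-9 fit passes through every training point, so its training error is essentially zero, yet its error on the new points is far from zero. A low training error alone says little about how well the hypothesis generalizes.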

The term generalize refers to how well a hypothesis applies to new examples that it has not seen in training.

To address overfitting, there are two main options:

  • Reduce the number of features.
  • Regularization: keep all the features, but reduce the magnitude of the parameters \(\theta_j\).

Cost Function

    Original Model:

    \(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2^2 + \theta_3 x_3^3 + \theta_4 x_4^4\)

    Modified Model, with large penalties on \(\theta_3\) and \(\theta_4\):

    \(\underset{\theta}{\min} \frac{1}{2m} [\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\theta_3^2 + 10000\theta_4^2]\)

    In general, we penalize all the parameters:

    \(J(\theta) = \frac{1}{2m} [\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2]\)

    Here \(\lambda\) is the regularization parameter.

    Notice the extra term at the end:

    \(\lambda \sum_{j=1}^{n} \theta_j^2\)

    This regularization term shrinks every parameter \(\theta_1, \dots, \theta_n\). The sum starts at \(j = 1\), so \(\theta_0\) is not penalized.
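
As a minimal sketch of this cost function for linear regression, the NumPy function below evaluates the squared-error term plus the penalty. The name `regularized_cost`, the argument `lam` (standing in for \(\lambda\)), and the tiny data set are assumptions made for illustration; `X` is assumed to carry a leading column of ones so that `theta[0]` plays the role of \(\theta_0\).

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]."""
    m = len(y)
    residuals = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 is left out of the penalty
    return (residuals @ residuals + penalty) / (2 * m)

# Hypothetical data that theta = [0, 1] fits exactly (y equals the single feature).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])

print(regularized_cost(theta, X, y, lam=0.0))  # squared-error term only
print(regularized_cost(theta, X, y, lam=6.0))  # penalty 6 * 1^2, divided by 2m = 6
```

With a perfect fit the first call returns 0, and the second returns exactly the penalty contribution, which makes it easy to see how \(\lambda\) trades training fit against parameter size.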

    Regularized Linear Regression

    \(J(\theta) = \frac{1}{2m} [\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2]\)

    repeat until convergence {

      \(\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}\)
      \(\theta_j := \theta_j - \alpha [\frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j]\)   (for \(j = 1, \dots, n\))

    }

    Rearranging the update for \(\theta_j\):

    \(\theta_j := \theta_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\)
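
The two forms of the update are algebraically identical, which a short NumPy sketch can confirm. The function names `step_direct` and `step_shrink`, the argument `lam` (for \(\lambda\)), and the sample values are assumptions made for illustration; `X` is assumed to include a leading column of ones.

```python
import numpy as np

def step_direct(theta, X, y, alpha, lam):
    """One update using the gradient plus the (lam / m) * theta_j term."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    reg = (lam / m) * theta
    reg[0] = 0.0  # theta_0 is not regularized
    return theta - alpha * (grad + reg)

def step_shrink(theta, X, y, alpha, lam):
    """The same update written as 'shrink theta_j, then take a plain gradient step'."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    shrink = np.full(theta.shape, 1.0 - alpha * lam / m)
    shrink[0] = 1.0  # no shrinkage for theta_0
    return theta * shrink - alpha * grad

# Hypothetical sample values.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.5, -1.0])

print(step_direct(theta, X, y, alpha=0.1, lam=3.0))
print(step_shrink(theta, X, y, alpha=0.1, lam=3.0))
```

Whenever \(\alpha \lambda / m\) lies between 0 and 1, the factor \(1 - \alpha \frac{\lambda}{m}\) is slightly less than 1, so each iteration first pulls \(\theta_j\) a little toward zero and then applies the usual unregularized gradient step.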

    Regularized Logistic Regression

    \(J(\theta) = \frac{1}{m} \sum_{i=1}^{m} [-y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2}\)

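    Python Code:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def costReg(theta, X, y, learningRate):
        # Note: learningRate is the regularization parameter lambda, not the step size alpha.
        theta = np.matrix(theta)  # 1 x (n+1) row vector
        X = np.matrix(X)          # m x (n+1), first column all ones
        y = np.matrix(y)          # expected as an m x 1 column
        first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
        second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
        # Penalize theta_1..theta_n only; column 0 (theta_0) is skipped.
        reg = learningRate / (2 * len(X)) * np.sum(np.power(theta[:, 1:], 2))
        return np.sum(first - second) / len(X) + reg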
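
The notes give only the cost for regularized logistic regression. As a hedged companion, the sketch below pairs an array-based version of that cost with the gradient that follows from differentiating it, which has the same form as the linear regression update with \(h_\theta\) replaced by the sigmoid hypothesis. The names `cost_reg`, `gradient_reg`, and `lam` (for \(\lambda\)), as well as the sample data, are assumptions for illustration and are not part of the original code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    """Regularized logistic regression cost; theta_0 is not penalized."""
    m = len(y)
    h = sigmoid(X @ theta)
    data_term = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return data_term + lam / (2 * m) * np.sum(theta[1:] ** 2)

def gradient_reg(theta, X, y, lam):
    """Gradient of cost_reg with respect to theta."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += (lam / m) * theta[1:]  # regularize every component except theta_0
    return grad

# Hypothetical sample data; the first column of X is the intercept term.
X = np.array([[1.0, 0.5, 1.5], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7], [1.0, 0.1, 0.9]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.array([0.2, -0.4, 0.6])

print(cost_reg(np.zeros(3), X, y, lam=1.0))  # equals log(2) when theta is all zeros
print(gradient_reg(theta, X, y, lam=1.0))
```

At \(\theta = 0\) every prediction is 0.5 and the penalty vanishes, so the cost is \(\log 2\) regardless of the data, which is a convenient sanity check before running an optimizer.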