12 Support Vector Machines

Optimization Objective

Within supervised learning, the performance of many supervised learning algorithms will be pretty similar and when that is less more often be whether you use learning algorithm A or learning algorithm B but when that is small there will often be things like the amount of data you are creating these algorithms on. That’s always your skill in applying this algorithms. Seems like your choice of the features that you designed to give the learning algorithms and how you choose the regularization parameter and things like that. Support Vector Machine (SVM) : sometimes gives a cleaner and sometimes more powerful way of learning complex nonlinear functions. Alternative view of logistic regression [latex]h_\theta (x) = \frac {1}{1 + e^{-\theta ^{T}x}}[/latex] Cost of example : [latex]-(ylogh_\theta (x) + (1-y)log(1-h_\theta(x))) [/latex] [latex]= -ylog\frac {1}{1 + e^{-\theta ^{T}x}} + (1-y)log(1-\frac {1}{1 + e^{-\theta ^{T}x}})[/latex] Support vector machine Logistic regression : [latex]\underset{\theta }{min} \frac{1}{m} [\sum_{i=1}^{m}y^{(i)}(-logh_\theta(x^{(i)})) + (1-y^{(i)})(-log(1-h_\theta (x^{(i)})))] + \frac{\lambda }{2m}\sum_{j=1}^{n} \theta _j^2[/latex] Support vector machine : [latex]\underset{\theta }{min} C [\sum_{i=1}^{m}y^{(i)}cost_1(\theta ^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta ^Tx^{(i)})] + \frac{1}{2}\sum_{i=1}^{n} \theta _j^2[/latex]

Large Margin Intuition

Sometimes people talk about support vector machines, as large margin classifiers. The margin of the support vector machine and this gives the SVM a certain robustness, because it tries to separate the data with as a large a margin as possible. So the support vector machine is sometimes also called a large margin classifier. [latex]min \frac{1}{2} \sum _{j=1}^{n} \ \theta _j^2 \ s.t \ \left\{\begin{matrix} \theta^Tx^{(i)} \geq \ 1 \ \ if \ y^{(i)} = 1\\ \theta^Tx^{(i)} \leq -1 \ \ if \ y^{(i)} = 0\\ \end{matrix}\right.[/latex] In practice when applying support vector machines, when C is not very very large like that, it can do a better job ignoring the few outliers.

Mathematics Behind Large Margin Classification

[latex]\left \ u \right \ = \sqrt{u_1^2 + u_2^2}[/latex] p is the length of the projection of the vector V onto the vector U. so [latex]u^Tv = p \cdot \left \ u \right \[/latex] and [latex]u^Tv = u_1 \times v_1 + u_2 \times v_2[/latex] so [latex]p \cdot \left \ u \right \ = u_1 \times v_1 + u_2 \times v_2[/latex] SVM Decision Boundary [latex]\underset {\theta }{min} \frac{1}{2} \sum _{j=1}^{n} \ \theta _j^2 \ s.t \ \left\{\begin{matrix} \theta^Tx^{(i)} \geq \ 1 \ \ if \ y^{(i)} = 1\\ \theta^Tx^{(i)} \leq -1 \ \ if \ y^{(i)} = 0\\ \end{matrix}\right.[/latex] if n == 2: [latex]\begin{Vmatrix} u \end{Vmatrix} = \sqrt{u_1^2 + u_2^2}[/latex] [latex]\frac{1}{2} (\theta _1^2 + \theta _2^2) = \frac{1}{2} (\sqrt {\theta _1^2 + \theta _2^2})^2 = \frac{1}{2}\left \ \theta \right \ ^2[/latex] So all the support vector machine is doing in the optimization objective is it’s minimizing the squared norm of the square length of the parameter vector theta. [latex]p^{(i)}[/latex] : a projection of the i-th training example onto the parameter vector [latex]\theta [/latex]. [latex]\theta ^Tx^{(i)} = \theta_1 \cdot x_1^{(i)} + \theta_2 \cdot x_2^{(i)} [/latex] SVM Decision Boundary [latex]\underset {\theta }{min} \frac{1}{2} \sum _{j=1}^{n} \ \theta _j^2 \ s.t \ \left\{\begin{matrix} p^{(i)} \cdot \left \ \theta \right \ \geq \ 1 \ \ if \ y^{(i)} = 1\\ p^{(i)} \cdot \left \ \theta \right \ \leq -1 \ \ if \ y^{(i)} = 0\\ \end{matrix}\right.[/latex] If we can make the norm of theta smaller and therefore make the squared norm of theta smaller, which is why the SVM would choose this hypothesis on the right instead. And this is how the SVM gives rise to this large margin certification effect. Mainly, if you look at this green line, if you look at this green hypothesis we want the projections of my positive and negative examples onto theta to be large, and the only way for that to hold true this is if surrounding the green line. There’s this large margin, there’s this large gap that separates positive and negative examples is really the magnitude of this gap. The magnitude of this margin is exactly the values of P1, P2, P3 and so on. And so by making the margin large, by these tyros P1, P2, P3 and so on that’s the SVM can end up with a smaller value for the norm of theta which is what it is trying to do in the objective. And this is why this machine ends up with enlarge margin classifiers because itss trying to maximize the norm of these P1 which is the distance from the training examples to the decision boundary.

Kernels

complex polynomial features

[latex]\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1x_2 + \theta_4x_1^2 + \theta_5x_2^2 + \cdots[/latex]

new denotation

[latex]f_1 = x_1, \ f_2 = x_2, \ f_3 = x_1x_2, \ f_4 = x_1^2, \ f_5 = x_2^2[/latex] [latex]h_\theta (x) = \theta_1f_1 + \theta_2f_2 + \cdots + \theta_nf_n[/latex] Gaussian Kernel [latex]f_1 = similarity(x, l^{(1)}) = e^{(-\frac {\left \ x - l^{(1)} \right \ ^ 2}{2\sigma ^2})}[/latex] [latex]\left \ x - l^{(1)} \right \ ^ 2 = \sum _{j=1}^{n}(x_j - l_j^{(1)})^2[/latex] We define some extra features using landmarks and similarity functions to learn more complex nonlinear classifiers. We use new features that are computed by Kernels, not original features.

How the landmarks are chosen

Choose the the location of my landmarks to be exactly near the locations of my m training examples. Given [latex](x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots , (x^{(m)}, y^{(m)})[/latex] choose [latex]l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \cdots \cdots , l^{(m)} = x^{(m)}[/latex] Given example x : [latex]f_1 = similarity(x, l^{(1)})[/latex] [latex]f_2 = similarity(x, l^{(2)})[/latex] [latex]\cdots [/latex] **Cost Function : ** [latex]minC\sum _{i=1}^{m}[y^{(i)}cost_1(\theta^Tf^{(i)}) + (1-y^{(i)})cost_0(\theta^Tf^{(i)})]+\frac{1}{2}\sum_{j=1}^{n=m}\theta_j^2[/latex] Using kernels with logistic regression is going too very slow.

How to choose [latex]C \ and \ \sigma[/latex]

[latex]C = 1 \ / \ \lambda[/latex]

[latex]C[/latex] big: overfitting, higher variance
[latex]C[/latex] small: underfitting, higher bias
[latex]\sigma[/latex] big : lower variance, higher bias
[latex]\sigma[/latex] small : lower bias, higher variance

Using An SVM

Some library function :

liblinear
libsvm

There are a few things need to do :

parameter’s C
choose the kernel (If we decide not to use any kernel. And the idea of no kernel is also called a linear kernel)

Logistic Regression or Support Vector Machine

n : number of features
m : number of training examples

n >> m, Logistic Regression or Linear Kernel
n small, and m middle, like 1 < n < 1000, 10 < m < 10000, SVM with Gaussian Kernel
n small, and m big, like 1 < n < 1000, 50000 < m, SVM is slower, so try to munually create more features and then use logistic regression or an SVM without the Kernel

Well for all of these problems, for all of these different regimes, a well designed neural network is likely to work well as well. The algorithm does matter, but what often matters even more is things like, how much data do you have. And how skilled are you, how good are you at doing error analysis and debugging learning algorithms, figuring out how to design new features and figuring out what other features to give you learning algorithms and so on.