alt text     alt text     alt text     alt text     alt text


This is a summary of Support Vector Machine, as for preparation of another post Linear SVM on top of a deep net

Hard-margined reasoning

Suppose we want to classify over data , with . Define the kernel and the implicit decision boundary (implicit, since is not explicitly defined). The euclidean distance from point to the decision boundary is then:

Suppose data is linearly separable, i.e. there is at least one solution surface, SVM tries to find the best one. Amongst all data points there is always one that minimises . Say this happens at , SVM formulation says that we want to maximize to obtain the best surface. Let’s say we did successfully maximize the thing by .

Since is the normal vector of our solution surface, we can scale it without affecting the surface’s orientation (the solution). That means, if is a solution, with being a non-zero constant, is also one. Assume , we know there is some that can be derived from such that:

Now investigate . By definition of , this means . Talking about the point , this means maximises

Equivalently, is also a maximizer of

In both and , we realise that and is coupled and thus can be regarded as a single variable . This allow a reformulation of the original optimization problem:

subject to $$t_n y(s_n; w, b) \geq 1 \ \ \forall n$$

Soft-margined reasoning

The above reasoning assumes separable data, due to this ideal situation we are able to establish a strict standard on all data points base on . Things need to relax when it comes to the linearly inseparable case. Namely we have to introduce a slack variable for each of the data points, so that some of them can violate the standard: . The soft-margined SVM objective is:

subject to $$\xi_n \geq 0 \ \forall n$$ $$t_n y(s_n; w, b) \geq 1 - \xi_n$$

Collectively denote the variables of interest as , we rewrite the problem declaration:

where $$ f(x) = C \sum_{n=1}^{N}\xi_n + \frac{1}{2} {\left \| w \right \|}^2$$ $$g_{1n}(x) = -\xi_n$$ $$g_{2n}(x) = 1 - \xi_n - t_ny(s_n; w, b)$$

Notice is a collection of parameters with component , whose number of parameters equals to the number of data training points.

Primal and Dual problems

Is there an unconstrained version of the above problem, cuz constraints are kinda ugly isn’t it? In other words, is there a function such that minimizing yields the solution to our problem? In fact, there is one:

Equivalence can be mentally checked easily by investigating two cases. First, for each that violates at least one constraint, check that reaches and thus cannot be a minimum. Second, for each that satisfies all constraints, check that . Minimizing is called the Primal problem (which is essentially a minimax problem), and turns out this one is no less nasty than the original one since minimax problems are definitely not easy to handle.

Fortunately the whole Primal idea is not just about throwing away constraints, a guy named Slater and his friend told us that, for the specific problem we are solving, the equivalence extends a bit further: the corresponding maximin problem yields the same solution to our minimax problem (Primal problem) and thus the same solution to our original problem. This maximin problem is called the Dual problem, and turns out to be quite nice to its pet human whenever they do some trick to its amuse. Next part will introduce that trick called KKT necessary conditions.

To have a clearer notation that faithfully serves the names “minimax” and “maximin”, let’s introduce . Then the Primal and Dual problems respectively become

Absolutely on-point, isn’t it?

Solving the dual problem by the help of KKT conditions

There is another giant’s shoulder to step on besides Slater’s in the quest for . It concerns something called Karush-Kuhn-Tucker (KKT) conditions: conditions that are necessary for any solution of our optimization problem. In other words, from the fact that is a solution to our problem, what can we infer? Just for the record, these conditions later greatly simplify the Dual problem.

We can actually derive such conditions intuitively. Check here for a longer talk, I’ll just cherry pick here the sweetest parts for the sake of our SVM. First, look at an illustrative plot of and . There are only two possible scenarios:


On the left: the solution of the unconstrained problem satifies , thus it is also solution to the whole problem. To obtain , we can optimize regardless of , i.e. minimizing with solves an equivalent problem. On the right: solution to the unconstrained problem does not satify , thus lies on the boundary of -satified region, where . Again, minimizing along solves an equivalent problem.

In fact, these two cases are analogous to the two checks we did with the Primal problem. They will be visited again in later parts. Either case, . I.e. our first two KKT conditions are:

Remember that solves the Dual problem , which means minimizes . This gives the next conditions: . Taking the partial derivative with respect to , , and separately yields:

The remaining conditions are obvious:

Final form of the Dual problem

Now that’s everything is settled, we are ready to simplify the Dual problem as promised. In fact, we will use all KKT conditions that involves as substitutions into the Dual objective. The obvious effect is that we completely eliminate variables out of the Dual objective. Also use to throw away . Finally we obtain the simplified objective of maximization: . This optimization is subjected to , and - conditions that involves - which, can be rewritten as

But there is something else happening, a side effect that reveals itself after the derivation of : , who emerges from the disappearance of . They say the thing did its kernel trick. But what’s good about anyway? To shortly explain the hype around , consider solving the Dual problem in this final form. Clearly we only have to deal with , not explicitly. There’s a body of theories around this situation that states that there are certain which are extremely convenient to deal with mathematically, who turn out to have their ultra-high-dimensional, sometimes even infinite.

In other words, we are able to learn non-parameterized high-quality features without the need to deal with the curse of dimensionality. In the context of Deep Learning, is explicitly defined as the feature extractor of the network and its parameters are trained jointly with . We’ll come back to this point in Part 2.

Sequential minimal optimization: the Dual solver

How come this final form of Dual objective nicer to solve than, say, the Original and the Primal?

Realise that the Dual objective is a polynomial of degree 2 for each of the , the problem is thus convex and if we are able to progressively better the objective until convergence, the solution is found. So, we are talking about an interative algorithm and wish to define a convergence criteria for it. In fact, the KKT conditions and implies three things that can be used to check convergence around the optimal point:

There is this guy John Platt who proposed the algorithm we are looking for. His reasoning goes roughly like this: pick a subset from and optimize the objective with respect to this subset, because that’s easier than dealing with the whole bunch. Pick another one and another one and repeat until convergence. Basically each time we optimize a sub-problem, and this problem has to have at least two variables because otherwise, the equality constraint gives no room for (single-variable) optimization. And in fact he did choose to be the size of all sub-problems.

The sub-problem goes as follows: Maximize with respect to a pair of variable such that . From we get where is the current value of . This gives a one to one correspondence between and , hence the sub-problem reduce to a single-variable optimization problem. Say we choose to express in terms of and solve the single-variable problem in , there is another thing to care about: the bound of must be adjusted so that also lies in .

Optimizing such single-variable quadratic function is something quite trivial: maximizer of is . And since the problem is convex, the constraints can be handled easily by clipping the solution of the non-constrained problem back the the boundary of the feasible region. All the nitty-gritties that ensure convergence and heuristics to maximise algorithm’s speed is discussed in the original paper. And there’s that for SVM optimization!


alt text     alt text     alt text     alt text     alt text