
Mathematical Introduction to SVM and Kernel Methods

The support vector machine (SVM) is very useful in real-world classification (or anomaly detection) problems, since this learner covers many scenarios and doesn't require the complicated tuning that is often needed in neural network modeling.
However, it's necessary to understand the idea behind this learner in order to tune parameters, minimize loss, and so on in practical use.

In this post, I describe how SVM (support vector machine) works and help you understand its strengths and weaknesses in practical use.
Of course, the idea behind SVM is based on mathematics (statistics); however, for the purpose of building your intuition, I'll try to explain with as many examples and visualizations as possible.
This post will also help you understand kernel methods, which are used behind SVM.

Maximum Margin Classification

To begin the discussion, I first show you the idea of maximum margin classification.
In this post, I'll start with a simple linear classification example (a trivial model with a simple linear function), and later we'll discuss more difficult and practical problems in margin maximization, which eventually lead to the method of SVM.
As you will see later, there are several kinds of SVM learners (C-SVM, ν-SVM, one-class SVM, ...), but all are commonly based on the idea of margin maximization, and you'll find why the name of this learner includes "support vector".

First, let's see the following 2-class linear classification example, which has inputs of 2-dimensional vectors $\mathbf{x}_n$ and labeled values $t_n = 1$ (which is marked as a circle) and $t_n = -1$ (which is marked as a cross).

Assume that we have $N$ input data $(\mathbf{x}_1, t_1), \dots, (\mathbf{x}_N, t_N)$.
$\mathbf{x}_n$ is the n-th input data, and $t_n$ is the n-th labeled value.

Note : I'm sorry, but the syntax for both $x$ and $\mathbf{x}$ might confuse you. In this post, I use an italic letter such as $x$ for a scalar value, and a bold letter such as $\mathbf{x}$ for a vector value. (Then $\mathbf{x}_n = (x_{n1}, x_{n2})^T$ here.)

As you can easily see in the above picture, this data can be linearly classified. Then, when we denote $\mathbf{w}$ as weights and $b$ (a scalar value) as a bias, the following relations will hold for all $n$.

$\mathbf{w}^T \mathbf{x}_n + b > 0$  when $t_n = 1$

$\mathbf{w}^T \mathbf{x}_n + b < 0$  when $t_n = -1$

Dividing these relations by $\|\mathbf{w}\|$, you will get the following equations. (Here $\|\mathbf{w}\|$ is the norm of $\mathbf{w}$.)

$\dfrac{\mathbf{w}^T \mathbf{x}_n + b}{\|\mathbf{w}\|} > 0$  when $t_n = 1 \quad \cdots (1)$

$\dfrac{\mathbf{w}^T \mathbf{x}_n + b}{\|\mathbf{w}\|} < 0$  when $t_n = -1 \quad \cdots (2)$

Now we pick some position vector $\mathbf{x}_0$ on the boundary (see the picture below). Since $\mathbf{x}_0$ is on the boundary, $\mathbf{w}^T \mathbf{x}_0 + b = 0$ holds.
Using this vector $\mathbf{x}_0$, the left side of the above equations (1) and (2) is represented as follows.

$\dfrac{\mathbf{w}^T \mathbf{x}_n + b}{\|\mathbf{w}\|} = \dfrac{\mathbf{w}^T (\mathbf{x}_n - \mathbf{x}_0)}{\|\mathbf{w}\|}$

What does this mean geometrically ?
The value $\mathbf{w}^T (\mathbf{x}_n - \mathbf{x}_0)$ is an inner product between $\mathbf{w}$ and $\mathbf{x}_n - \mathbf{x}_0$. Then the value $\dfrac{\mathbf{w}^T (\mathbf{x}_n - \mathbf{x}_0)}{\|\mathbf{w}\|}$ is equal to $\|\mathbf{x}_n - \mathbf{x}_0\| \cos\theta$ (where $\theta$ is the angle between $\mathbf{w}$ and $\mathbf{x}_n - \mathbf{x}_0$), i.e., the signed length of the projection of $\mathbf{x}_n - \mathbf{x}_0$ onto the direction of $\mathbf{w}$, as the following picture shows.

Note : That is, $\mathbf{w}$ decides the direction of the boundary (decision hyperplane).

Eventually, $\dfrac{|\mathbf{w}^T \mathbf{x}_n + b|}{\|\mathbf{w}\|}$ for each $\mathbf{x}_n$ represents the distance from the boundary, as the following picture shows. This distance is called a margin.
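To make this distance formula concrete, here is a minimal sketch in Python; the weights, bias, inputs, and labels below are made-up values for illustration, not the ones in the figures above.

```python
# A minimal sketch of the margin (distance) computation above:
# the distance of x_n from the hyperplane w^T x + b = 0 is |w^T x_n + b| / ||w||,
# and multiplying by t_n makes it positive when the point is on its correct side.
import numpy as np

w = np.array([2.0, -1.0])                 # hypothetical weights (direction of boundary)
b = 0.5                                   # hypothetical bias
X = np.array([[1.0, 1.0],                 # hypothetical inputs x_n
              [-1.0, 0.5],
              [2.0, -0.5]])
t = np.array([1, -1, 1])                  # hypothetical labels t_n in {+1, -1}

margins = t * (X @ w + b) / np.linalg.norm(w)
print(margins)        # all positive -> every point is classified correctly
print(margins.min())  # the minimum margin, which SVM will try to maximize
```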

As you can see below, this margin will differ when the decision boundary changes.

Now we assume that no input vector lies exactly on the boundary. (If there exists a vector exactly on the boundary, please move the boundary slightly away from it.)
As you saw above, $\dfrac{\mathbf{w}^T \mathbf{x}_n + b}{\|\mathbf{w}\|}$ is a positive number when $t_n = 1$, and a negative number when $t_n = -1$.
Then the equations (1) and (2) can be written as the following single equation.

$\dfrac{t_n (\mathbf{w}^T \mathbf{x}_n + b)}{\|\mathbf{w}\|} > 0$

We assume $\min_n \; t_n (\mathbf{w}^T \mathbf{x}_n + b) = \epsilon$ (where $\epsilon > 0$).
Then the above equation can be written as follows.

$t_n (\mathbf{w}^T \mathbf{x}_n + b) \ge \epsilon$

The decision boundary $\mathbf{w}^T \mathbf{x} + b = 0$ is the same as $(c \mathbf{w})^T \mathbf{x} + c b = 0$ for any positive scalar value $c$.
Then the above equation can also be written as follows without losing generality. In the following equation, I replaced $\mathbf{w} / \epsilon$ with $\mathbf{w}$, and $b / \epsilon$ with $b$.

$t_n (\mathbf{w}^T \mathbf{x}_n + b) \ge 1 \quad \cdots (3)$

Now let's consider which decision boundary is optimal.

First, as you can see below (see the following picture), the decision boundary becomes better when the minimum margin is larger.
Then you should find $\max_{\mathbf{w}, b} \left\{ \min_n \dfrac{t_n (\mathbf{w}^T \mathbf{x}_n + b)}{\|\mathbf{w}\|} \right\}$ to obtain the optimal decision boundary.

Second, the optimal direction $\mathbf{w}$ can be considered separately from the bias $b$.
Let's see the following picture. If you change $b$, only the intercept of the boundary function changes and the gradient stays the same.
Thus, in order to maximize the minimum margin, the optimal boundary is placed at the center between the two classes. That is, once $\mathbf{w}$ is given, the optimal position (i.e., the optimal $b$) is determined.

Note : $b$ decides the position of the boundary (decision hyperplane). Please remember that $\mathbf{w}$ has decided the "direction" of the boundary (decision hyperplane).

Here I note that there exist $\mathbf{x}_n$ which satisfy the equality $t_n (\mathbf{w}^T \mathbf{x}_n + b) = 1$ in equation (3). (These definitely exist, since, if none existed, you could rescale $\mathbf{w}$ and $b$ to get a better margin, and then the current $\mathbf{w}$ and $b$ would not be optimal.) We assume that these are $\mathbf{x}_i$ (with $t_i = 1$) and $\mathbf{x}_j$ (with $t_j = -1$), as follows.

Now let's consider how the margin is given by $\mathbf{w}$ under the condition of equation (3).
As you saw above, the margin of each $\mathbf{x}_n$ is given by $\dfrac{t_n (\mathbf{w}^T \mathbf{x}_n + b)}{\|\mathbf{w}\|}$. Thus, since the minimum of $t_n (\mathbf{w}^T \mathbf{x}_n + b)$ is 1, the minimum margin is given as :

$\dfrac{1}{\|\mathbf{w}\|}$

Then, in order to maximize the class margin $\dfrac{1}{\|\mathbf{w}\|}$ (which is equivalent to minimizing $\dfrac{1}{2}\|\mathbf{w}\|^2$), you should find parameters $\mathbf{w}$ and corresponding $b$ which satisfy :

$\min_{\mathbf{w}, b} \; \dfrac{1}{2} \|\mathbf{w}\|^2 \quad$ subject to $\; t_n (\mathbf{w}^T \mathbf{x}_n + b) \ge 1 \;$ for all $n$
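As a sanity check of this optimization problem, here is a minimal sketch (with hypothetical toy data) that solves the primal problem directly with a generic constrained solver from SciPy; it is only an illustration, not how SVM libraries actually train.

```python
# Minimize (1/2)||w||^2 subject to t_n (w^T x_n + b) >= 1, on tiny separable toy data.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])  # hypothetical inputs
t = np.array([1.0, 1.0, -1.0, -1.0])                                # hypothetical labels

def objective(params):
    w = params[:2]                      # params = (w1, w2, b)
    return 0.5 * w @ w

constraints = [
    {"type": "ineq", "fun": (lambda p, i=i: t[i] * (p[:2] @ X[i] + p[2]) - 1.0)}
    for i in range(len(X))
]

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
               constraints=constraints, method="SLSQP")
w_opt, b_opt = res.x[:2], res.x[2]
print(w_opt, b_opt)                     # the maximum-margin hyperplane
print(2.0 / np.linalg.norm(w_opt))      # width between the two margins, 2 / ||w||
```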

This problem is a constrained optimization problem with inequality constraints, and we therefore apply the following KKT (Karush–Kuhn–Tucker) conditions with Lagrange multipliers $a_n$, instead of applying the ordinary Lagrange conditions. (See Wikipedia "Karush–Kuhn–Tucker conditions" for details.)


KKT condition

Find parameters $\mathbf{w}$, $b$, and Lagrange multipliers $a_n$ to optimize the following Lagrange function $L$, subject to the following conditions 1 – 5. :

$L(\mathbf{w}, b, \mathbf{a}) = \dfrac{1}{2} \|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n (\mathbf{w}^T \mathbf{x}_n + b) - 1 \right\}$

1. $\dfrac{\partial L}{\partial \mathbf{w}} = 0$, i.e., $\mathbf{w} = \sum_{n=1}^{N} a_n t_n \mathbf{x}_n$
2. $\dfrac{\partial L}{\partial b} = 0$, i.e., $\sum_{n=1}^{N} a_n t_n = 0$
3. $t_n (\mathbf{w}^T \mathbf{x}_n + b) - 1 \ge 0$
4. $a_n \ge 0$
5. $a_n \left\{ t_n (\mathbf{w}^T \mathbf{x}_n + b) - 1 \right\} = 0$

Note : Especially, the condition 5 (complementary slackness) is important in KKT.

For simplicity, let me assume a single data point and thus a single Lagrange multiplier $a$ (i.e., $N = 1$).
When the Lagrange multiplier is equal to 0 (i.e., $a = 0$ in condition 5), condition 1 means that the optimal $\mathbf{w}$ is simply the unconstrained minimum of the norm $\frac{1}{2}\|\mathbf{w}\|^2$, and you will then find that the data point doesn't contribute to the result when $a = 0$.
This intuitively implies that the unconstrained minimum is within the region allowed by the inequality constraint $t (\mathbf{w}^T \mathbf{x} + b) - 1 \ge 0$, such as the left side of the following picture. (In this picture, $\mathbf{w}$ has 2 dimensions.)

When the Lagrange multiplier is not equal to 0 (i.e., $a > 0$), condition 5 says that the optimum should be on the boundary of the inequality constraint, i.e., $t (\mathbf{w}^T \mathbf{x} + b) - 1 = 0$, such as the right side of the following picture.
In this case, the data point is a support vector. (See below for support vectors.)

Here I don't go into details, but this problem reduces to the following equivalent problem with only the Lagrange multipliers, by substituting conditions 1 and 2 into $L$ and eliminating $\mathbf{w}$ and $b$.
This equivalent representation is called the dual representation in mathematics.


Dual representation

Find $\mathbf{a} = (a_1, \dots, a_N)^T$ to maximize the following $\tilde{L}(\mathbf{a})$ :

$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \dfrac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \mathbf{x}_n^T \mathbf{x}_m \quad \cdots (4)$

Subject to :

  • $a_n \ge 0$ for all $n$
  • $\sum_{n=1}^{N} a_n t_n = 0$

Note : The objective function is a quadratic function, and you can then solve this problem as a Quadratic Programming (QP) problem. (You can use any generic QP solver in computer systems.)

When $\mathbf{a}$ is obtained from the above dual representation, you can eventually get the decision boundary by the following formula. (As we saw above, $b$ can also be easily obtained, because $t_n (\mathbf{w}^T \mathbf{x}_n + b) = 1$ holds for any support vector $\mathbf{x}_n$.) :

$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = \sum_{n=1}^{N} a_n t_n \mathbf{x}_n^T \mathbf{x} + b \quad \cdots (5)$
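Following the note above about generic QP solvers, here is a minimal sketch of solving the dual (4) with the cvxopt package (assuming it is installed); the toy data is hypothetical, and the tiny ridge added to P is only for numerical stability.

```python
# Dual problem as a QP:  minimize (1/2) a^T P a - 1^T a
#   with P_{nm} = t_n t_m x_n^T x_m,  subject to  a_n >= 0  and  sum_n a_n t_n = 0.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])  # hypothetical data
t = np.array([1.0, 1.0, -1.0, -1.0])
N = len(X)

P = matrix(np.outer(t, t) * (X @ X.T) + 1e-8 * np.eye(N))  # small ridge for stability
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))           # -a_n <= 0, i.e. a_n >= 0
h = matrix(np.zeros(N))
A = matrix(t.reshape(1, -1))     # equality constraint: sum_n a_n t_n = 0
b_eq = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b_eq)
a = np.array(sol["x"]).ravel()

sv = a > 1e-6                            # support vectors are the points with a_n > 0
w = (a[sv] * t[sv]) @ X[sv]              # w = sum_n a_n t_n x_n  (KKT condition 1)
b = np.mean(t[sv] - X[sv] @ w)           # from t_n (w^T x_n + b) = 1 on support vectors
print(w, b, np.flatnonzero(sv))
```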

As you saw earlier, the vectors (input data) which have the minimum margin are especially important in this problem. These vectors are called support vectors in SVM.
For instance, all of the following 5 vectors are support vectors.

As you saw above, this problem finds the optimal parameters by minimizing $\frac{1}{2}\|\mathbf{w}\|^2$. By this mechanism, margin maximization in SVM essentially avoids overfitting through L2 regularization. (See here for L2 regularization in overfitting problems.)

Introducing Kernel Methods

In the above example, we saw the idea of support vector machines (SVM) using a trivial linear classification. But real problems are not so simple.
From here, we enter more practical topics step by step.

For instance, let's see the following input vectors.
As you can easily see, this data cannot be classified by any linear decision boundary.

Now let me introduce the kernel trick in this section.

In order to make the data classifiable by a linear decision boundary, we assume that the above 2-dimensional vectors are mapped into a higher dimensional space.
For example, let us assume that the above 2-dimensional vectors are mapped into 3-dimensional vectors by some mapping $\phi : \mathbb{R}^2 \rightarrow \mathbb{R}^3$.

With this mapping, the data is easily classified by a hyperplane in the 3-dimensional space, which is the grey-colored plane in the following picture.

Note : As you see in the above picture, the mapped coordinates are not dense. (They form a 2-dimensional manifold embedded in the 3-dimensional space.) Thus a learner with this method (in which the inputs are mapped into a high-dimensional space) is sometimes called a "sparse kernel machine".
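The exact mapping used in the figures above is not essential; as one hypothetical choice, the sketch below maps $(x_1, x_2)$ to $(x_1, x_2, x_1^2 + x_2^2)$, so that points on a small circle and a large circle become separable by a plane in 3 dimensions.

```python
# Circular 2-class data is not linearly separable in 2D, but after the (hypothetical)
# mapping phi(x1, x2) = (x1, x2, x1^2 + x2^2) it is separated by a plane z = const.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 50)
inner = np.c_[1.0 * np.cos(theta), 1.0 * np.sin(theta)]   # class -1, radius 1
outer = np.c_[3.0 * np.cos(theta), 3.0 * np.sin(theta)]   # class +1, radius 3

def phi(X):
    return np.c_[X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2]

# The third coordinate alone already separates the classes (about 1.0 vs about 9.0),
# so any plane z = c with 1 < c < 9 works as a linear decision boundary in 3D.
print(phi(inner)[:, 2].max(), phi(outer)[:, 2].min())
```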

$\phi$ is called a basis function.
When $\phi$ is given, this problem (to find the weight vector and bias) reduces to the following dual representation. (See the above section for the dual representation.)


Dual representation

Find $\mathbf{a}$ to maximize the following $\tilde{L}(\mathbf{a})$ :

$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \dfrac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \phi(\mathbf{x}_n)^T \phi(\mathbf{x}_m) \quad \cdots (6)$

Subject to :

  • $a_n \ge 0$ for all $n$
  • $\sum_{n=1}^{N} a_n t_n = 0$

When the Lagrange multipliers $a_n$ are given, you can obtain the decision boundary by the following formula. :

$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n)^T \phi(\mathbf{x}) + b \quad \cdots (7)$

Now we denote $\phi(\mathbf{x})^T \phi(\mathbf{x}')$ as $k(\mathbf{x}, \mathbf{x}')$. This is called a kernel function.
Using the kernel function, we can write the above (7) as follows. It's simply given by a linear combination of the kernel and the target values from the training set.

$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}_n, \mathbf{x}) + b$

As you can easily see, the above problem is now entirely written (described) in terms of the unknown kernel $k$, without the basis function $\phi$. The additional constraint is that $k$ should have a corresponding function $\phi$ such that $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$. From now on, we consider this problem (mapped into high-dimensional spaces) using only the kernel $k$.
Here I don't describe the proofs, but it's known that many other loss functions or predictive functions (also, dual representations) in popular machine learning algorithms can also be written with kernel functions. (You can then use kernel methods in various machine learning problems.) For instance, when we apply regularized least squares for regression problems with a basis function $\phi$ (see my previous post for linear regression by least squares), it's known that the obtained predictive function can be written as follows using some kernel function $k$.

$y(\mathbf{x}) = \mathbf{k}(\mathbf{x})^T (\mathbf{K} + \lambda \mathbf{I}_N)^{-1} \mathbf{t}$

where $\mathbf{k}(\mathbf{x})$ has elements $k(\mathbf{x}, \mathbf{x}_n)$ for the training inputs $\mathbf{x}_1, \dots, \mathbf{x}_N$, $\mathbf{K}$ is the matrix with elements $k(\mathbf{x}_n, \mathbf{x}_m)$, and $\mathbf{t} = (t_1, \dots, t_N)^T$ is the corresponding set of labels (target values).

Now $k$ is unknown and might be an arbitrary function, but one important constraint is that it should be a valid kernel function.
For instance, $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z})^2$ (for 2-dimensional $\mathbf{x}$ and $\mathbf{z}$) is a valid kernel function, since this function can be expanded as follows. :

$(\mathbf{x}^T \mathbf{z})^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = \phi(\mathbf{x})^T \phi(\mathbf{z})$

where $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T$.

It's known that sums, products, and compositions of kernel functions are also kernel functions.
However, in most cases, whether a function is a valid kernel is checked without explicitly constructing the underlying basis function.

Note : In kernel methods, a matrix called the Gram matrix $\mathbf{K}$ is often used, whose $(n, m)$ element is $k(\mathbf{x}_n, \mathbf{x}_m)$, where $\mathbf{x}_1, \dots, \mathbf{x}_N$ are training inputs.
The Gram matrix is symmetric, and $k$ is a valid kernel function if and only if the Gram matrix is positive semidefinite for all possible choices of the inputs.
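Both checks above can be tried numerically. The following minimal sketch verifies the expansion of $(\mathbf{x}^T\mathbf{z})^2$ into $\phi(\mathbf{x})^T\phi(\mathbf{z})$ and the positive semidefiniteness of a Gram matrix on random inputs (random data, illustration only).

```python
import numpy as np

rng = np.random.default_rng(1)

def k(x, z):
    return (x @ z) ** 2                   # candidate kernel (x^T z)^2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(k(x, z), phi(x) @ phi(z)))    # True: k has an underlying basis function

X = rng.normal(size=(20, 2))                   # random "training inputs"
K = (X @ X.T) ** 2                             # Gram matrix with K_nm = k(x_n, x_m)
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # eigenvalues are (numerically) non-negative
```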

How should we obtain (or approximate) an appropriate kernel function in support vector machines ?
I'll give you one solution to this question in the next section.

RBF Kernel – Why is it widely used ?

In this section, we see the RBF (Radial Basis Function) kernel, which has a flexible representation and is the most widely used kernel in practical kernel methods.

An RBF kernel is a kernel which depends only on the norm $\|\mathbf{x} - \mathbf{x}'\|$.
In particular, the following form of kernel is called the Gaussian kernel.

$k(\mathbf{x}, \mathbf{x}') = \exp\left( -\dfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2} \right)$

Note : It's known that $\exp(k(\mathbf{x}, \mathbf{x}'))$ is a valid kernel function, if $k(\mathbf{x}, \mathbf{x}')$ is a kernel function.
The Gaussian kernel has infinite dimensionality (the corresponding feature space is infinite-dimensional).

In this section, I'll show you how it fits real data and help you understand why this kernel (Parzen estimation) is so popular.
For simplicity, we start the discussion with the previous linear regression.

Now, to make things simple, let us assume the following binary classification of 2-dimensional vectors $\mathbf{x} = (x_1, x_2)^T$, and consider the possibility of errors.

As you can easily imagine, errors occur with high probability when a point is near the boundary, and with low probability when it's far from the boundary. As you can see below, the probability of errors will follow a 2-dimensional normal distribution (Gaussian distribution) depending on the distance (Euclidean norm) from the boundary.

Note : For simplicity, here we're assuming that the two variables $x_1$ and $x_2$ are independent of each other, so their covariance is equal to zero. And we also assume that the standard deviations of $x_1$ and $x_2$ are both $\sigma$.
(i.e., the covariance matrix of the Gaussian distribution is isotropic.)

On the contrary, let us consider the error probability of the following point.
As you see below, this will be affected by both the upper side's boundary and the lower side's boundary, and it will become the sum of both probabilities.

Note : If you simply add these probabilities (for the upper side and the lower side), the total probability may exceed 1. Thus, strictly speaking, you should normalize the sum of these probabilities.

Eventually the probability of errors will be described as a probability density distribution given by the combination (sum and normalization) of normal distributions at each observed point.

To see this in a brief example, let us assume the following 1-dimensional sine curve, and suppose we have the following 6 observed points exactly on this curve.

Then, by applying the following steps, we can estimate the original sine curve with these 6 points.

  1. Assume a normal distribution (Gaussian distribution) centered at each of the 6 observed points.
  2. Get the weighted ratio of each distribution.
    For instance, pick some position $x_0$ on the x-axis in the above picture. Then the weighted value of the $n$-th distribution at $x_0$ is its density at $x_0$ divided by the sum of the densities of all 6 distributions at $x_0$ (so that the 6 weights sum to 1).
    The following picture shows the weighted plots of each distribution.
  3. Multiply by each observed value.
    For instance, if the t-value of the first observed point is 5 (see the above picture of the sine curve), then the effect of this first value (black-colored line) at $x_0$ will be equal to 5 times its weight. (See the below picture.)
  4. Finally, sum all these values (i.e., these effects) of the 6 elements at each point of the x-axis.

Please remember that the predictive function of linear regression with a basis function $\phi$ can be written as a linear combination of the target values ($t$) and kernel functions. (See the previous section.)
As a result, you can easily estimate the original curve by a Gaussian kernel using the given observed data, as follows.
The Gaussian kernel has a rich representation and can fit a wide variety of shapes.
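Here is a minimal sketch of the steps above (a Nadaraya-Watson style smoother) in Python; the 6 observed points, their target values, and the value of $\sigma$ are assumptions for illustration, not the ones in the original figures.

```python
import numpy as np

x_obs = np.linspace(0.0, 2.0 * np.pi, 6)       # 6 observed points on the sine curve
t_obs = np.sin(x_obs)                          # their observed target values
sigma = 0.8                                    # assumed standard deviation

def gaussian_kernel(x, x_prime):
    return np.exp(-((x - x_prime) ** 2) / (2.0 * sigma ** 2))

def predict(x):
    weights = gaussian_kernel(x, x_obs)        # step 1: Gaussian density at x for each point
    weights = weights / weights.sum()          # step 2: normalize so the 6 weights sum to 1
    return weights @ t_obs                     # steps 3-4: weight each t value and sum

grid = np.linspace(0.0, 2.0 * np.pi, 200)
estimate = np.array([predict(x) for x in grid])
print(np.abs(estimate - np.sin(grid)).max())   # rough deviation from the true sine curve
```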

Note : Here I showed a brief example using a simple 1-dimensional sine curve, but see chapter 6.3.1 in "Pattern Recognition and Machine Learning" (Christopher M. Bishop, Microsoft) for the general steps of Nadaraya-Watson regression (kernel smoother).

$\sigma$ in the Gaussian kernel is determined experimentally.
When $\sigma$ is larger, the model will become smoother. On the contrary, when $\sigma$ is smaller, the model is locally dominated by nearby observed values.

The value of the standard deviation ($\sigma$) is large

The value of the standard deviation ($\sigma$) is small

Note : Kernel density estimation (KDE) is an application that estimates the empirical probability density function (PDF) by applying this kind of Gaussian kernel smoother.
When $\sigma$ differs extremely between points, you can use estimation by the kNN (k-nearest neighbor) method as a non-parametric approach.
See "Mathematical Introduction to Regression" for the estimation application of the parametric approach.

Now let's go back to equations (6) and (7).

In these equations, the formula of the kernel $k(\mathbf{x}_n, \mathbf{x}_m) = \phi(\mathbf{x}_n)^T \phi(\mathbf{x}_m)$ is unknown, but we can expect that it is also well approximated by a Gaussian kernel, and then we can get the optimal Lagrange multipliers under this assumption.
As you saw above, when $\sigma$ is smaller in the Gaussian kernel, the hyperplane will also be increasingly dominated by nearby observed data relative to distant data.
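As a small illustration (using scikit-learn's SVC with toy circular data; all values are assumptions), note that scikit-learn parameterizes the Gaussian kernel as $\exp(-\gamma \|\mathbf{x}-\mathbf{x}'\|^2)$, so $\gamma$ plays the role of $1 / (2\sigma^2)$: a larger $\gamma$ corresponds to a smaller $\sigma$ and a more locally dominated boundary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
t = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # non-linear (circular) classes

for gamma in (0.1, 10.0):                                # roughly: large sigma vs small sigma
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, t)
    print(gamma, clf.n_support_, clf.score(X, t))        # support vector counts and accuracy
```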

Kernel methods are very powerful, since they can be applied to various unknown stochastic distributions. But you should remember that the computational complexity will increase linearly with the size of the training data.

Note : In general, a regression function which forms a linear combination of the kernel over the training inputs and the target values, $y(\mathbf{x}) = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) \, t_n$, is called a linear smoother.
Here we got this form by intuitive thinking, but you can obtain an equivalent regression result by Bayesian inference (algebraic calculation) for a linear basis function model.

Overlapping Class Distributions

Until now, we assumed that all data is exactly separated by the support vector machine (i.e., a hard-margin SVM). However, there will also be observed errors (noise) in practical SVM.
Here I show you soft-margin SVMs (such as C-SVM and ν-SVM), which evaluate these losses.

 

First of all, I introduce new variables $\xi_n \ge 0$ (called slack variables) which satisfy :

$t_n \, y(\mathbf{x}_n) \ge 1 - \xi_n \quad$ for any $n$

where $y(\mathbf{x}_n) = \mathbf{w}^T \phi(\mathbf{x}_n) + b$, subject to :

  • $\xi_n > 1$, when it is on the wrong side of the decision boundary (i.e., misclassified).
  • $\xi_n = 1$, when it's on the decision boundary.
  • $0 < \xi_n < 1$, when it's between the decision boundary and the margin.
  • $\xi_n = 0$, otherwise (when it's correctly outside of the margin, including support vectors on the margin).

In the previous problem (without considering losses), we found $\mathbf{w}$ and $b$ to satisfy the conditions $t_n y(\mathbf{x}_n) \ge 1$.
Instead, in this problem, we find the parameters to minimize the following expression.

$C \sum_{n=1}^{N} \xi_n + \dfrac{1}{2} \|\mathbf{w}\|^2$

Note : The above loss formula can also be written as :

$\sum_{n=1}^{N} \max(0, \, 1 - t_n y(\mathbf{x}_n)) + \lambda \|\mathbf{w}\|^2$

where $\lambda = \dfrac{1}{2C}$.

The part $\max(0, \, 1 - t_n y(\mathbf{x}_n))$ is sometimes called the hinge loss. (The name comes from the shape of this function, which looks like a hinge.)

Here the constant $C$ is a penalty for the loss, and it's determined experimentally, using techniques such as cross-validation. (Later I'll show you how this parameter affects the model.)

This is called C-SVM (C-support vector machine), and it's also solved through a dual representation with Lagrange multipliers. (I don't describe this mathematical solution in this post.)
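As mentioned above, $C$ is usually chosen by cross-validation; the following is a minimal sketch with scikit-learn (the toy data and candidate values of $C$ are assumptions).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
t = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=300) > 0, 1, -1)  # noisy labels

search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X, t)
print(search.best_params_, search.best_score_)   # the C with the best validation accuracy
```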

Let's see another soft-margin SVM, called ν-SVM (ν-support vector machine).
In the previous C-SVM, if the number of inputs $N$ increases, the penalty term $C \sum_n \xi_n$ will also increase linearly. That is, there will be a bias depending on $N$.

In ν-SVM, this bias is suppressed by using the average $\frac{1}{N}\sum_{n=1}^{N} \xi_n$ instead.
The ν-SV classifier is defined as :

  • Find $\mathbf{w}$, $b$, $\rho$, and $\xi_n$ to minimize $\; \dfrac{1}{2}\|\mathbf{w}\|^2 - \nu \rho + \dfrac{1}{N} \sum_{n=1}^{N} \xi_n$
  • Subject to : $t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b) \ge \rho - \xi_n$, $\; \xi_n \ge 0$, $\; \rho \ge 0 \;$ for any $n$

With the condition $t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b) \ge \rho - \xi_n$, the margin becomes $\rho / \|\mathbf{w}\|$. Then if $\rho$ is smaller, the loss will also be smaller. Eventually this classifier might pick up an extremely small margin and cause overfitting.
To prevent this unexpected learning, the term $-\nu\rho$ is introduced in this classifier.
As with $C$ in C-SVM, $\nu$ is also determined experimentally.
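A minimal sketch of a ν-SV classifier with scikit-learn's NuSVC (the toy data and the value $\nu = 0.2$ are assumptions; in practice $\nu$ would be tuned as described above).

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
t = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=300) > 0, 1, -1)  # noisy labels

clf = NuSVC(nu=0.2, kernel="rbf").fit(X, t)
print(clf.score(X, t), clf.n_support_)   # training accuracy and support vector counts
```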

Note : In this post, we're discussing 2-class problems. However, even when only one class of data (i.e., only normal data) is given, you can also obtain the optimal sphere by optimizing $\rho$ (i.e., the distance from the boundary). With this approach, the trained model can then detect what lies outside of the boundary – i.e., outliers or anomalies – using only true (inlier) data in training.
This learner is useful, because it is usually difficult to collect sufficient error (outlier) data. This learner is called one-class SVM. (I note that there exists another type of one-class SVM, but I don't go into details in this post.)
See here for training and prediction by one-class SVM in Python.
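A minimal sketch of the one-class SVM mentioned above, using scikit-learn's OneClassSVM (the inlier data and the parameters $\nu$ and $\gamma$ are assumptions): the model is trained on normal data only and then flags faraway points as outliers (label -1).

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # inlier (normal) data only

detector = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)
X_test = np.array([[0.1, -0.2],   # near the training data    -> expected +1 (inlier)
                   [6.0, 6.0]])   # far from the training data -> expected -1 (outlier)
print(detector.predict(X_test))
```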

Let's consider how this $\nu$ affects the model.

To see this, I now show you the corresponding KKT conditions for this ν-SVM as follows.


KKT condition

For the ν-SV classifier above, the Lagrange function is :

$L = \dfrac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \dfrac{1}{N}\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N} a_n \left\{ t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) - \rho + \xi_n \right\} - \sum_{n=1}^{N} \mu_n \xi_n - \delta\rho$

1. $\dfrac{\partial L}{\partial \mathbf{w}} = 0$, i.e., $\mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n)$
2. $\dfrac{\partial L}{\partial b} = 0$, i.e., $\sum_{n=1}^{N} a_n t_n = 0$
3. $\dfrac{\partial L}{\partial \xi_n} = 0$, i.e., $a_n + \mu_n = \dfrac{1}{N}$
4. $\dfrac{\partial L}{\partial \rho} = 0$, i.e., $\sum_{n=1}^{N} a_n = \nu$ (assuming $\rho > 0$ at the solution, so that $\delta = 0$)
5. $a_n \ge 0$
6. $\mu_n \ge 0$
7. $a_n \left\{ t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) - \rho + \xi_n \right\} = 0$
8. $\mu_n \xi_n = 0$


From conditions 3 and 6, we get $a_n \le \dfrac{1}{N}$.
When we denote the number of support vectors as $N_S$, the number of non-zero $a_n$ is at most $N_S$ from condition 7.
Then, from condition 4, we get :

$\nu = \sum_{n=1}^{N} a_n \le N_S \cdot \dfrac{1}{N} = \dfrac{N_S}{N}$

It means that the ratio of support vectors is equal to or larger than $\nu$.

On the contrary, a vector which satisfies $\xi_n > 0$ (i.e., a noise vector, lying inside the margin or misclassified) is called a bounded support vector, and we denote the number of bounded support vectors as $N_B$.

From condition 8, a bounded support vector (which has $\xi_n > 0$) has $\mu_n = 0$. From condition 3, it means $a_n = \dfrac{1}{N}$.
Then we get :

$\nu = \sum_{n=1}^{N} a_n \ge N_B \cdot \dfrac{1}{N} = \dfrac{N_B}{N}$

It means that the ratio of bounded support vectors is equal to or smaller than $\nu$. (This is the reason for the name "bounded support vector" : the multiplier $a_n$ of such a vector is at its upper bound $\frac{1}{N}$.)
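This bound can be checked empirically; the sketch below (using scikit-learn's NuSVC on hypothetical toy data) prints the fraction of support vectors for several values of $\nu$, which should be at least $\nu$ (approximately). Counting the bounded support vectors exactly would require inspecting the dual coefficients, which is omitted here.

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 2))
t = np.where(X[:, 0] + X[:, 1] + 0.7 * rng.normal(size=400) > 0, 1, -1)

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf").fit(X, t)
    print(nu, clf.n_support_.sum() / len(X))   # support vector ratio, roughly >= nu
```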

The following picture shows how $\nu$ (or $C$ in C-SVM) intuitively affects the model. When $\nu$ increases (or $C$ decreases), both support vectors and bounded support vectors will tend to increase.

With high-dimensional mappings, this parameter will affect the trained model as in the following image.
In general, when $\nu$ is smaller, the model complexity will become higher, and it might become rich enough to represent more complicated problems. However, this might also cause overfitting. (On the contrary, when $\nu$ gets larger, the model becomes more generalized.)
This parameter should be determined experimentally to fit the real problem, considering this trade-off.

Note : For details about overfitting, see my early post "Mathematical Understanding of Overfitting in Machine Learning".

For training support vector machines in real computing, specialized optimization algorithms (such as sequential minimal optimization, SMO) are applied to keep memory usage and scalability manageable.

 

In this post, we have looked at the mathematical fundamentals behind SVMs and kernel tricks.
The idea of margin maximization gives you optimal solutions for high-dimensional separation, and this idea is also applied in a variety of other ML tasks – such as inverse reinforcement learning (IRL).

 

Reference :

"Pattern Recognition and Machine Learning" (Christopher M. Bishop, Microsoft), Chapter 6 and 7
