
18. Optimization for Support Vector Machines (SVM)

Support Vector Machines (SVM) are powerful supervised learning models used for classification and regression tasks. The key idea behind SVM is to find the hyperplane that best separates data points of different classes. In binary classification, the goal is to maximize the margin between the two classes, where the margin is defined as the distance between the hyperplane and the closest points from each class (the support vectors).

The optimization problem in SVM involves finding this optimal hyperplane by solving a convex optimization problem.


18.1 SVM Objective

Given a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^n$ are the feature vectors and $y_i \in \{-1, +1\}$ are the class labels, the objective is to find a hyperplane:

$$
\mathbf{w}^T \mathbf{x} + b = 0
$$

that separates the two classes with the maximum margin. The margin is defined as the perpendicular distance from the hyperplane to the nearest points from either class (the support vectors).

The optimization problem for the SVM is formulated as:

$$
\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2
$$

Subject to the constraints:

$$
y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \quad \forall i
$$

Where:
- $\mathbf{w}$ is the weight vector, normal (perpendicular) to the hyperplane,
- $b$ is the bias term (the offset of the hyperplane from the origin),
- $y_i$ is the class label of the $i$-th point; the constraint $y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1$ ensures that every point is correctly classified and lies on or outside the margin.

The objective function $\frac{1}{2}\|\mathbf{w}\|^2$ corresponds to maximizing the margin: the margin is given by $\frac{1}{\|\mathbf{w}\|}$, so minimizing $\|\mathbf{w}\|^2$ maximizes it.
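
To fill in the step behind that claim (a standard argument, not spelled out above): the perpendicular distance from a point $\mathbf{x}_i$ to the hyperplane is $\frac{|\mathbf{w}^T \mathbf{x}_i + b|}{\|\mathbf{w}\|}$, and for the support vectors the constraint holds with equality, so

$$
y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1
\quad \Longrightarrow \quad
\frac{|\mathbf{w}^T \mathbf{x}_i + b|}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}.
$$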


18.2 Primal Form of SVM Optimization

The optimization problem can be written in its primal form as:

$$
\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2
$$

subject to:

$$
y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \text{for all } i
$$

This is a quadratic optimization problem with linear constraints.
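
As a concrete illustration (not part of the original formulation), the sketch below solves this primal QP numerically, assuming the `cvxpy` package is available and using a small hypothetical, linearly separable dataset:

```python
# Minimal sketch: hard-margin primal SVM as a QP, assuming cvxpy is installed.
import numpy as np
import cvxpy as cp

# Hypothetical toy data: two points per class, linearly separable.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)   # weight vector
b = cp.Variable()    # bias term

# minimize (1/2) ||w||^2   subject to   y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```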


18.3 Lagrange Multipliers and Dual Form

To solve the primal problem, we introduce Lagrange multipliers to incorporate the constraints into the objective function. The Lagrangian for this problem is:

$$
\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right)
$$

Where $\alpha_i \geq 0$ are the Lagrange multipliers corresponding to each constraint.

The next step is to maximize the Lagrangian with respect to $\alpha_i$ and minimize it with respect to $\mathbf{w}$ and $b$. After differentiating and setting the gradients to zero, we obtain the dual form of the SVM optimization problem.
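
Concretely, setting the partial derivatives of the Lagrangian to zero gives the two conditions that are substituted back into $\mathcal{L}$ to eliminate $\mathbf{w}$ and $b$ (a standard step, stated here for completeness):

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i = 0
\quad \Longrightarrow \quad
\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i
$$

$$
\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0
\quad \Longrightarrow \quad
\sum_{i=1}^{N} \alpha_i y_i = 0
$$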


18.4 Dual Form of SVM Optimization

By solving the Lagrangian, we get the dual form of the SVM optimization problem, which is:

$$
\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j
$$

subject to:

$$
\sum_{i=1}^{N} \alpha_i y_i = 0
$$

and

$$
\alpha_i \geq 0, \quad \forall i
$$

Where:
- $\alpha_i$ are the Lagrange multipliers,
- $y_i$ are the class labels,
- $\mathbf{x}_i$ are the feature vectors.

This is a quadratic programming (QP) problem; solving it gives the values of $\alpha_i$, which are then used to compute the weight vector $\mathbf{w}$ and the bias $b$.
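
The dual can be solved the same way. The sketch below (again assuming `cvxpy` and the same hypothetical toy data as in the primal sketch) builds the matrix $G_{ij} = y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ and maximizes the dual objective under the two constraints above:

```python
# Minimal sketch: hard-margin dual SVM as a QP, assuming cvxpy is installed.
import numpy as np
import cvxpy as cp

# Same hypothetical toy data as in the primal sketch.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# G_ij = y_i y_j x_i^T x_j; a tiny ridge keeps it numerically positive semidefinite.
G = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N)

alpha = cp.Variable(N)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, G))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

print("alpha =", alpha.value)
```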


18.5 Support Vectors and Decision Boundary

- The support vectors are the data points that lie exactly on the margin boundaries; they are the only points that determine the position of the decision boundary.
- The weight vector $\mathbf{w}$ can be computed as:

$$
\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i
$$

- The bias $b$ can be computed using any support vector $\mathbf{x}_i$ (any point with $\alpha_i > 0$) as:

$$
b = y_i - \mathbf{w}^T \mathbf{x}_i
$$

The decision boundary is given by:

$$
\mathbf{w}^T \mathbf{x} + b = 0
$$
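
Continuing the dual sketch above (and reusing its `alpha`, `X`, and `y`), the two formulas in this section recover $\mathbf{w}$ and $b$; averaging $b$ over all support vectors, rather than using a single one, is a common numerical refinement:

```python
# Recover w and b from the dual solution (continues the dual sketch above).
import numpy as np

alpha_val = alpha.value          # optimal Lagrange multipliers from cvxpy
sv = alpha_val > 1e-6            # support vectors are the points with alpha_i > 0

# w = sum_i alpha_i y_i x_i
w = (alpha_val * y) @ X

# b = y_i - w^T x_i for any support vector; average over all of them for stability.
b = np.mean(y[sv] - X[sv] @ w)

print("w =", w, "b =", b)
```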

18.6 Soft Margin SVM

In practice, the data may not be perfectly linearly separable. To account for this, we introduce a soft margin that allows some misclassifications. This is done by introducing slack variables $\xi_i$, which represent the amount by which each data point violates the margin constraint.

The soft margin SVM optimization problem becomes:

$$
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i
$$

subject to:

$$
y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \forall i
$$

and

$$
\xi_i \geq 0, \quad \forall i
$$

Where $C > 0$ is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error.
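
In practice this QP is rarely coded by hand. As a hedged illustration, scikit-learn's `SVC` with a linear kernel solves the soft-margin formulation, with its `C` argument playing the role of the parameter above; the data below is hypothetical and slightly overlapping:

```python
# Soft-margin linear SVM via scikit-learn (assumes scikit-learn is installed).
import numpy as np
from sklearn.svm import SVC

# Hypothetical two-class data with some overlap.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# C controls the trade-off between a wide margin and few margin violations.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("w =", clf.coef_, "b =", clf.intercept_)
print("support vectors:\n", clf.support_vectors_)
```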


18.7 Nonlinear SVM (Kernel Trick)

When the data is not linearly separable in its original feature space, we can implicitly map it into a higher-dimensional feature space using a kernel function. The most common kernel functions are:

  1. Polynomial Kernel:
     $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T \mathbf{x}_j + 1)^d$
  2. Radial Basis Function (RBF) Kernel:
     $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right)$
  3. Sigmoid Kernel:
     $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\alpha \, \mathbf{x}_i^T \mathbf{x}_j + c)$

The kernel trick allows us to compute the dot product in the higher-dimensional space without explicitly transforming the data, making it computationally efficient.

In the dual form, the kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ replaces the dot product $\mathbf{x}_i^T \mathbf{x}_j$, and the optimization problem otherwise remains the same.
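
For reference, the three kernels above can be written directly as NumPy functions. The sketch below is illustrative only; the default hyperparameters (`d`, `sigma`, `a`, `c`) are arbitrary choices, not values from the text:

```python
# The kernel functions from this section as plain NumPy functions (sketch).
import numpy as np

def polynomial_kernel(xi, xj, d=3):
    """(x_i^T x_j + 1)^d"""
    return (xi @ xj + 1.0) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    """exp(-||x_i - x_j||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, a=0.1, c=0.0):
    """tanh(a * x_i^T x_j + c)"""
    return np.tanh(a * (xi @ xj) + c)

# In the dual sketch earlier, the kernel trick amounts to replacing
#   G_ij = y_i y_j x_i^T x_j   with   G_ij = y_i y_j K(x_i, x_j).
```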


18.8 SVM Training and Classification

  1. Training: Use the dual form of the SVM optimization problem and solve it with a quadratic programming (QP) solver. This provides the values of the Lagrange multipliers $\alpha_i$.

  2. Classification: Once the parameters $\mathbf{w}$ and $b$ are computed, classify new data points by evaluating the decision function:

$$
f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b
$$

If $f(\mathbf{x}) > 0$, classify the data point as $+1$; if $f(\mathbf{x}) < 0$, classify it as $-1$.
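
A minimal sketch of this classification rule, assuming a weight vector `w` and bias `b` obtained earlier (e.g., from the primal or dual sketches); for a kernel SVM the same sign rule is applied to $f(\mathbf{x}) = \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$ instead:

```python
# Sign-based classification with the linear decision function f(x) = w^T x + b.
import numpy as np

def predict(X_new, w, b):
    """Return +1 / -1 labels according to the sign of f(x) = w^T x + b."""
    return np.sign(X_new @ w + b)

# Hypothetical usage with w, b from the earlier sketches:
# labels = predict(np.array([[2.5, 2.5], [-1.0, -4.0]]), w, b)
```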


18.9 Practice Problems

  1. Primal SVM: Formulate the primal optimization problem for an SVM with the following dataset:

$$
\mathbf{X} = \begin{bmatrix} 2 & 3 \\ 3 & 4 \\ 4 & 5 \\ 5 & 6 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}
$$

Solve it using the quadratic programming approach.

  2. Soft Margin SVM: Given the dataset:

$$
\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 2 & 2 \\ 3 & 3 \\ 4 & 4 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}
$$

Solve the soft margin SVM optimization problem for $C = 1$.


This concludes the overview of Optimization for Support Vector Machines.