Support Vector Machine Python Example

August 12, 2019

Support Vector Machine (SVM) is a supervised machine learning algorithm capable of performing classification, regression and even outlier detection. The linear SVM classifier works by drawing a straight line between two classes. All the data points that fall on one side of the line are labeled as one class and all the points that fall on the other side are labeled as the second. Sounds simple enough, but there's an infinite number of lines to choose from. How do we know which line will do the best job of classifying the data? This is where the linear SVM algorithm comes into play. It selects the line that not only separates the two classes but also stays as far away from the closest samples as possible. In fact, the "support vectors" in "support vector machine" are the position vectors of the samples that lie closest to the decision boundary and therefore dictate where it goes.

Algorithm

Suppose we have a vector w that is always normal to the hyperplane (perpendicular to the line in 2 dimensions). We can determine how far away a sample is from our decision boundary by projecting the position vector of the sample onto the vector w. As a quick refresher, the dot product of two vectors is proportional to the projection of the first vector onto the second.

\vec{a} \cdot \vec{w} = \lVert \vec{a} \rVert \, \lVert \vec{w} \rVert \cos\theta
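To make the refresher concrete, here is a small, purely illustrative NumPy check (the vectors are made up for this example): dividing the dot product by the norm of w gives the length of the projection of a sample's position vector onto w.

import numpy as np

a = np.array([3.0, 4.0])   # position vector of a sample
w = np.array([2.0, 0.0])   # vector normal to the decision boundary

# length of the projection of a onto w
projection_length = np.dot(a, w) / np.linalg.norm(w)
print(projection_length)  # 3.0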

If it's a positive sample, we're going to insist that the following decision function (the dot product of w and the position vector of a given sample plus some constant) returns a value greater than or equal to 1.

Similarly, if it's a negative sample, we're going to insist that the decision function returns a value smaller than or equal to -1.
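Written out (with x_+ denoting the position vector of a positive sample and x_- that of a negative one), the two constraints are:

\vec{w} \cdot \vec{x}_+ + b \ge 1

\vec{w} \cdot \vec{x}_- + b \le -1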

In other words, we won’t consider any samples located between the decision boundary and support vectors.

As stated in the MIT lecture, we introduce an additional variable strictly for convenience. The variable y will be equal to positive one for all positive samples and negative one for all negative samples.

After multiplying by y, the equations for the positive and negative samples are equal to one another.

Meaning, we can simplify the constraints down to a single equation.
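With y_i denoting the label of the i-th sample, that single equation is:

y_i (\vec{w} \cdot \vec{x}_i + b) - 1 \ge 0

where equality holds for the samples that sit exactly on the margin, i.e. the support vectors.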

Next, we need to address the process by which we go about maximizing the margin. To get an equation for the width of the margin, we subtract a negative support vector from a positive one and then take the dot product of the result with the unit vector of w, which is always perpendicular to the decision boundary.
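In symbols, with x_+ and x_- again denoting a positive and a negative support vector:

\text{width} = (\vec{x}_+ - \vec{x}_-) \cdot \frac{\vec{w}}{\lVert \vec{w} \rVert}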

Using the constraints from above and a bit of algebra, we get the following.
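Namely, the constraints tell us that \vec{w} \cdot \vec{x}_+ = 1 - b and \vec{w} \cdot \vec{x}_- = -1 - b, so:

\text{width} = \frac{2}{\lVert \vec{w} \rVert}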

Therefore, in order to select the optimal decision boundary, we must maximize the width we just computed. We apply a few more tricks before proceeding (refer to the MIT lecture).
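Those tricks boil down to the fact that maximizing 2 / \lVert \vec{w} \rVert is the same as minimizing \lVert \vec{w} \rVert, which in turn is the same as minimizing the mathematically convenient quantity:

\frac{1}{2} \lVert \vec{w} \rVert^2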

Now, in most machine learning algorithms, we'd use something like gradient descent to minimize the function in question; however, for support vector machines, we use the Lagrangian. The Lagrangian is beyond the scope of this article, but if you're in need of a quick crash course, I recommend checking out Khan Academy. In essence, using the Lagrangian, we can solve for the constrained minimum much like we'd do in high school level calculus (i.e. take the derivative of the function and set it equal to zero). The method of Lagrange multipliers tells us to subtract from the cost function the summation over all the constraints, where each of those constraints is multiplied by some constant alpha (normally written as lambda for the Lagrangian).
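For our problem, the Lagrangian is therefore:

L = \frac{1}{2} \lVert \vec{w} \rVert^2 - \sum_i \alpha_i \left[ y_i (\vec{w} \cdot \vec{x}_i + b) - 1 \right]

Setting its derivatives with respect to w and b equal to zero gives:

\vec{w} = \sum_i \alpha_i y_i \vec{x}_i \qquad \text{and} \qquad \sum_i \alpha_i y_i = 0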

Then, we perform some more algebra, plugging the expression we just found for w (together with the constraint on the alphas) back into the Lagrangian.
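Doing so leaves an expression that depends only on the Lagrange multipliers and on the dot products of pairs of samples:

L = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j

which we want to maximize subject to \alpha_i \ge 0 and \sum_i \alpha_i y_i = 0.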

Before we can proceed any further, we need to express the problem in terms of matrices instead of summations. The reason being, the qp function from the CVXOPT library, which we'll use to solve the Lagrangian, accepts very specific arguments. Thus, we need to go from the summation form above to the standard form that CVXOPT solves:

\min_{\alpha} \; \frac{1}{2} \alpha^T P \alpha + q^T \alpha \qquad \text{subject to} \quad G \alpha \le h, \quad A \alpha = b

We can achieve this by collecting the pairwise terms into a matrix. Let X' denote the matrix whose i-th row is y_i \vec{x}_i (not to be confused with the original data matrix X). The double summation then becomes \alpha^T (X' X'^T) \alpha, the single summation becomes \mathbf{1}^T \alpha, and flipping the sign turns the maximization into a minimization:

\min_{\alpha} \; \frac{1}{2} \alpha^T (X' X'^T) \alpha - \mathbf{1}^T \alpha

Then, we map the variables to those expected by the CVXOPT library: P = X' X'^T, q is a vector of -1s, G = -I together with h = 0 encodes the constraint \alpha_i \ge 0, and A = y^T together with b = 0 encodes \sum_i \alpha_i y_i = 0.

Python Code

Now, we’re ready to write some code. We’ll start off by importing the necessary libraries.

import numpy as np
import cvxopt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

Then, we define our SVM class. As we mentioned previously, instead of using gradient descent to find the best fitting line as in the case of linear regression, we solve the Lagrangian with CVXOPT's quadratic programming solver and then recover w and b directly from the resulting multipliers.

class SVM:
    def fit(self, X, y):
        n_samples, n_features = X.shape

        # Gram matrix of the training samples: K[i, j] = x_i . x_j
        K = np.zeros((n_samples, n_samples))
        for i in range(n_samples):
            for j in range(n_samples):
                K[i, j] = np.dot(X[i], X[j])

        # P[i, j] = y_i * y_j * (x_i . x_j)
        P = cvxopt.matrix(np.outer(y, y) * K)
        # q = vector of -1s (N x 1)
        q = cvxopt.matrix(np.ones(n_samples) * -1)
        # A = y^T (1 x N) and b = 0 encode sum_i alpha_i y_i = 0
        A = cvxopt.matrix(y, (1, n_samples))
        b = cvxopt.matrix(0.0)
        # G = -I (N x N) and h = 0 (N x 1) encode alpha_i >= 0
        G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
        h = cvxopt.matrix(np.zeros(n_samples))

        # Solve the quadratic program
        solution = cvxopt.solvers.qp(P, q, G, h, A, b)

        # Lagrange multipliers
        a = np.ravel(solution['x'])

        # Support vectors have non-zero Lagrange multipliers
        sv = a > 1e-5
        ind = np.arange(len(a))[sv]
        self.a = a[sv]
        self.sv = X[sv]
        self.sv_y = y[sv]

        # Intercept
        self.b = 0
        for n in range(len(self.a)):
            self.b += self.sv_y[n]
            self.b -= np.sum(self.a * self.sv_y * K[ind[n], sv])
        self.b /= len(self.a)

        # Weights
        self.w = np.zeros(n_features)
        for n in range(len(self.a)):
            self.w += self.a[n] * self.sv_y[n] * self.sv[n]

    def project(self, X):
        return np.dot(X, self.w) + self.b

    def predict(self, X):
        return np.sign(self.project(X))

To keep things simple, we'll use the scikit-learn library to generate linearly separable data. We label the negative samples as -1 instead of 0. CVXOPT expects the labels as double-precision floats, which is why we take the intermediate step of multiplying y by a vector of ones.

X, y = make_blobs(n_samples=250, centers=2,  
                  random_state=0, cluster_std=0.60)
y[y == 0] = -1
tmp = np.ones(len(X))
y = tmp * y

Let’s get a feel for the data by plotting it.

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='winter')

We split the data into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Then, we create and train an instance of our support vector machine class.

svm = SVM()
svm.fit(X_train, y_train)

Next, we plot the decision boundary and support vectors.

def f(x, w, b, c=0):
    # given the first coordinate x, return the second coordinate of the point on the line w . x + b = c
    return (-w[0] * x - b + c) / w[1]
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='winter')
# w.x + b = 0  
a0 = -4; a1 = f(a0, svm.w, svm.b)  
b0 = 4; b1 = f(b0, svm.w, svm.b)  
plt.plot([a0,b0], [a1,b1], 'k')
# w.x + b = 1  
a0 = -4; a1 = f(a0, svm.w, svm.b, 1)  
b0 = 4; b1 = f(b0, svm.w, svm.b, 1)  
plt.plot([a0,b0], [a1,b1], 'k--')
# w.x + b = -1  
a0 = -4; a1 = f(a0, svm.w, svm.b, -1)  
b0 = 4; b1 = f(b0, svm.w, svm.b, -1)  
plt.plot([a0,b0], [a1,b1], 'k--')

We use our model to predict the classes of the samples in the testing set. Since this is a classification problem, we use a confusion matrix to evaluate the model's accuracy.

y_pred = svm.predict(X_test)
confusion_matrix(y_test, y_pred)

Let’s attempt the same thing using the scikit-learn implementation of the support vector classifier.

svc = LinearSVC()
svc.fit(X_train, y_train)

After training our model, we plot the decision boundary and the margins.

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='winter');  
ax = plt.gca()  
xlim = ax.get_xlim()  
w = svc.coef_[0]  
a = -w[0] / w[1]  
xx = np.linspace(xlim[0], xlim[1])
yy = a * xx - svc.intercept_[0] / w[1]  
plt.plot(xx, yy)
yy = a * xx - (svc.intercept_[0] - 1) / w[1]  
plt.plot(xx, yy, 'k--')
yy = a * xx - (svc.intercept_[0] + 1) / w[1]  
plt.plot(xx, yy, 'k--')

Again, we predict which class each sample belongs to based on which side of the line it falls on.

y_pred = svc.predict(X_test)
confusion_matrix(y_test, y_pred)

As we can see, the classifier correctly classified every sample.

Conclusion

We saw how we could go about using the Lagrangian to determine the line that best separates our data. In the real world, most problems are not linearly separable. Thus, we make use of something called the kernel trick to separate the data using something other than a straight line. Stay tuned for an upcoming article where we cover this topic.


Written by Cory Maklin