SVM From Scratch — Python (2024)

What this blog will cover:

1. SVM Introduction
2. Reading the Dataset
3. Feature Engineering
4. Splitting the Dataset
5. Cost Function
6. The Gradient of the Cost Function
7. Train Model Using SGD
8. Stoppage Criterion for SGD
9. Testing the Model
10. Feature Selection
11. Give Me the Code

Before diving into the code or technical details, I would like to mention that while there are many libraries and frameworks available to implement the SVM (Support Vector Machine) algorithm without writing much code, I decided to use as few high-level libraries as possible so that you and I get a good grasp of the components involved in training an SVM model (one that reaches 99% accuracy and 0.98 recall and precision). If you are looking for a quick implementation of SVM, you are better off using packages like scikit-learn or cvxopt. Otherwise, let’s get started!

SVM Introduction

The SVM (Support Vector Machine) is a supervised machine learning algorithm typically used for binary classification problems. It’s trained by feeding a dataset with labeled examples (xᵢ, yᵢ). For instance, if your examples are email messages and your problem is spam detection, then:

  • An example email message xᵢ is defined as an n-dimensional feature vector that can be plotted in n-dimensional space.
  • The feature vector, as the name suggests, contains features (e.g. word count, link count, etc.) of your email message in numerical form.
  • Each feature vector is labeled with a class yᵢ.
  • The class yᵢ can be either +ve or -ve (e.g. spam = 1, not-spam = -1); a toy sketch follows this list.
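Here is a minimal sketch of a single labeled example, with made-up feature values purely to make the idea concrete:

import numpy as np
# a toy labeled example: three numerical features and a class label
x_i = np.array([120.0, 3.0, 1.0])   # e.g. word count, link count, attachment count
y_i = 1                             # spam = 1, not-spam = -1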

Using this dataset the algorithm finds a hyperplane (or decision boundary) which should ideally have the following properties:

  • It creates separation between examples of two classes with a maximum margin
  • Its equation (w.x + b = 0) yields a value ≥ 1 for examples from the +ve class and ≤ -1 for examples from the -ve class

How does it find this hyperplane? By finding the optimal values w* (weights/normal) and b* (intercept) which define this hyperplane. The optimal values are found by minimizing a cost function. Once the algorithm identifies these optimal values, the SVM model f(x) is then defined as shown below:

f(x) = sign(w*.x + b*)
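In code, the trained model is nothing more than this sign function. A minimal sketch, assuming we already have learned values w_star and b_star (both names are made up here):

import numpy as np
def predict(x, w_star, b_star):
    # returns +1 or -1 depending on which side of the hyperplane x falls
    return np.sign(np.dot(w_star, x) + b_star)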

Before we move any further let’s import the required packages for this tutorial and create a skeleton of our program svm.py:

# svm.py
import numpy as np  # for handling multi-dimensional array operations
import pandas as pd  # for reading data from csv
import statsmodels.api as sm  # for finding the p-values
from sklearn.preprocessing import MinMaxScaler  # for normalization
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.utils import shuffle

# >> FEATURE SELECTION << #
def remove_correlated_features(X):
    ...
def remove_less_significant_features(X, Y):
    ...

# >> MODEL TRAINING << #
def compute_cost(W, X, Y):
    ...
def calculate_cost_gradient(W, X_batch, Y_batch):
    ...
def sgd(features, outputs):
    ...
def init():
    ...

We’ll be working with a breast cancer dataset available on Kaggle. The features in the dataset are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image. Based on these features we will train our SVM model to detect if the mass is benign B (generally harmless) or malignant M (cancerous).

Download the dataset and place the data.csv file in the same folder as svm.py. Then add this code inside init() function:

def init():
    data = pd.read_csv('./data.csv')

    # SVM only accepts numerical values.
    # Therefore, we will transform the categories M and B into
    # values 1 and -1 (or -1 and 1), respectively.
    diagnosis_map = {'M': 1, 'B': -1}
    data['diagnosis'] = data['diagnosis'].map(diagnosis_map)

    # drop last column (extra column added by pd)
    # and unnecessary first column (id)
    data.drop(data.columns[[-1, 0]], axis=1, inplace=True)

The read_csv() function of the pandas package reads data from the .csv file and stores it in a DataFrame. Think of a DataFrame as a data structure that looks like a table with labeled columns and rows. Here’s how the data read from data.csv looks inside the DataFrame:
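The easiest way to see this for yourself is to print the first few rows and the shape right after loading (a quick inspection snippet, not part of the final script):

# inside init(), right after read_csv(); purely for inspection
print(data.head())    # first five rows, with column labels preserved
print(data.shape)     # (number of rows, number of columns)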

As you can see, the labels and structure of our dataset are preserved, and that’s what makes a DataFrame intuitive.

Machine learning algorithms operate on a dataset, which is a collection of labeled examples that consist of features and a label. In our case, diagnosis is the label, [radius_mean, texture_mean, …] are the features, and each row is an example.

In most cases, the data you collect at first is raw; it’s either incompatible with your model or hinders its performance. That’s where feature engineering comes to the rescue. It encompasses preprocessing techniques that compile a dataset by extracting features from raw data. These techniques have two characteristics in common:

  • Preparing data that is compatible with the model
  • Improving the performance of the machine learning algorithm

Normalization is one of the many feature engineering techniques that we are going to use. Normalization is the process of converting a range of values into a standard range of values, typically in the interval [−1, 1] or [0, 1]. It’s not a strict requirement, but it improves the speed of learning (e.g. faster convergence in gradient descent) and prevents numerical overflow. Add the following code to the init() function to normalize all of your features:
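For intuition, min-max scaling (which is what MinMaxScaler does with its default range) maps each feature column to [0, 1] using (x − min) / (max − min). A tiny sketch with a made-up column of values:

# what MinMaxScaler does to one column, written out by hand
col = np.array([2.0, 5.0, 9.0])                            # made-up feature values
col_scaled = (col - col.min()) / (col.max() - col.min())   # maps to [0, 1]
print(col_scaled)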

# inside init()
# put features & outputs in different DataFrames for convenience
Y = data.loc[:, 'diagnosis'] # all rows of 'diagnosis'
X = data.iloc[:, 1:] # all rows of column 1 and ahead (features)
# normalize the features using MinMaxScaler from
# sklearn.preprocessing
X_normalized = MinMaxScaler().fit_transform(X.values)
X = pd.DataFrame(X_normalized)

We’ll split the dataset into a train and a test set using the train_test_split() function from sklearn.model_selection. We need a separate dataset for testing because we need to see how our model performs on unseen observations. Add this code to init():

# inside init()
# first insert 1 in every row for intercept b
X.insert(loc=len(X.columns), column='intercept', value=1)
# test_size is the portion of data that will go into test set
# random_state is the seed used by the random number generator
print("splitting dataset into train and test sets...")
X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2, random_state=42)

If you are confused about why we added 1 in every row then don’t worry. You will get the answer in the next section.

Cost Function

Also known as the Objective Function. One of the building blocks of every machine learning algorithm, it’s the function we try to minimize or maximize to achieve our objective.

What’s our objective in SVM? Our objective is to find a hyperplane that separates +ve and -ve examples with the largest margin while keeping the misclassification as low as possible (see Figure 3).

How do we achieve this objective? We will minimize the cost/objective function shown below:

J(w) = (1/2) ∣∣w∣∣² + C [ (1/N) Σᵢ max(0, 1 − yᵢ (w.xᵢ + b)) ]    … (1)

You might have seen another version of a cost function that looks like this:

J(w) = λ ∣∣w∣∣² + (1/N) Σᵢ max(0, 1 − yᵢ (w.xᵢ + b))    … (2)

In this cost function, λ is essentially equal to 1/C and has the opposite effect, i.e. a larger λ gives a wider margin and vice versa. We can use either of the above cost functions, keeping in mind what each regularization parameter (C and λ) does, and then tune them accordingly. Let’s see how we can calculate the total cost as given in (1), and then we will move on to its gradient, which will be used in the training phase to minimize it:

def compute_cost(W, X, Y):
    # calculate hinge loss
    N = X.shape[0]
    distances = 1 - Y * (np.dot(X, W))
    distances[distances < 0] = 0  # equivalent to max(0, distance)
    hinge_loss = reg_strength * (np.sum(distances) / N)

    # calculate cost
    cost = 1 / 2 * np.dot(W, W) + hinge_loss
    return cost
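As a quick sanity check, here is a toy call with made-up numbers (note that compute_cost() reads the global reg_strength, which we only set near the end of the tutorial, so it is defined explicitly here):

# toy sanity check for compute_cost(); all values are made up
reg_strength = 10000                                    # the global C used inside compute_cost()
W_toy = np.array([0.5, -0.25, 0.1])                     # made-up weights (last entry plays the role of b)
X_toy = np.array([[0.2, 0.7, 1.0],
                  [0.9, 0.1, 1.0]])                     # two examples, each ending with the intercept 1
Y_toy = np.array([1, -1])
print(compute_cost(W_toy, X_toy, Y_toy))                # prints a single scalar cost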

As you might have noticed, the intercept term b is missing. That’s because we have pushed it into the weight vector, like this:

w = [w₁, w₂, …, wₙ, b]   and   xᵢ = [xᵢ₁, xᵢ₂, …, xᵢₙ, 1],   so that   w.xᵢ (new) = w.xᵢ + b (old)    … (3)

That’s why we added an extra column with all 1s before splitting our dataset. Keep this in mind for the rest of the tutorial.
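A tiny sketch (with made-up numbers) of why the trick works:

# folding the intercept b into the weight vector
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([0.2, 0.7])
w_aug = np.append(w, b)        # weights with b appended
x_aug = np.append(x, 1.0)      # feature vector with the extra 1
print(np.dot(w, x) + b)        # decision value with an explicit intercept
print(np.dot(w_aug, x_aug))    # same value, intercept folded in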

The Gradient of the Cost Function

With the intercept absorbed into w, the cost can be rewritten as an average over the examples:

J(w) = (1/N) Σᵢ [ (1/2) ∣∣w∣∣² + C max(0, 1 − yᵢ (w.xᵢ)) ]    … (4)

and its (sub)gradient is the average of the per-example terms:

∇w J(w) = (1/N) Σᵢ dᵢ,   where   dᵢ = w   if max(0, 1 − yᵢ (w.xᵢ)) = 0,   and   dᵢ = w − C yᵢ xᵢ   otherwise    … (5)

As you will have noticed, there are some changes to the cost function in equation (4). Don’t worry: if you work through it analytically, it is the same. Now let’s implement the calculate_cost_gradient() function using equation (5):

# I haven't tested it but this same function should work for
# vanilla and mini-batch gradient descent as well
def calculate_cost_gradient(W, X_batch, Y_batch):
    # if only one example is passed (eg. in case of SGD)
    if type(Y_batch) == np.float64:
        Y_batch = np.array([Y_batch])
        X_batch = np.array([X_batch])

    distance = 1 - (Y_batch * np.dot(X_batch, W))
    dw = np.zeros(len(W))

    for ind, d in enumerate(distance):
        if max(0, d) == 0:
            di = W
        else:
            di = W - (reg_strength * Y_batch[ind] * X_batch[ind])
        dw += di

    dw = dw / len(Y_batch)  # average
    return dw

Remember I said above, “To achieve our objective we try to minimize or maximize the cost function”. In the SVM algorithm, we minimize the cost function.

Why do we minimize the cost function? Because the cost function is essentially a measure of how badly our model is doing at achieving the objective. If you look closely at J(w), to find its minimum we have to:

  1. Minimize ∣∣w∣∣², which maximizes the margin (2/∣∣w∣∣)
  2. Minimize the sum of the hinge loss, which minimizes misclassifications

Both of our SVM objectives are achieved by minimizing the cost function; that’s why we minimize it.

How do we minimize it? Well, there are multiple ways but the one we will use is called Stochastic Gradient Descent or SGD. Before diving into SGD, I will briefly explain how Gradient Descent works in the first place.


Gradient descent algorithm works as follows:

  1. Find the gradient of the cost function, i.e. ∇J(w’)
  2. Move opposite to the gradient by a certain rate, i.e. w’ = w’ − α(∇J(w’))
  3. Repeat steps 1–2 until convergence, i.e. until we find w’ where J(w) is smallest

Why does it move opposite to the direction of the gradient? Because the gradient is the direction of the fastest increase of the function, so we need to move in the opposite direction to minimize our function J(w). Hence the word “descent” in Gradient Descent.
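Here is a tiny, self-contained sketch of the update rule on a toy function (the function J(w) = w² and all the numbers are made up purely for illustration; it has nothing to do with our SVM yet):

# minimize J(w) = w**2 with plain gradient descent
toy_learning_rate = 0.1
w = 5.0                              # arbitrary starting point
for _ in range(100):
    gradient = 2 * w                 # dJ/dw
    w = w - toy_learning_rate * gradient
print(w)                             # very close to 0, the minimizer of J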

In typical gradient descent (a.k.a vanilla gradient descent) the step 1 above is calculated using all the examples (1…N). In SGD, however, only 1 example is used at a time. I won’t discuss the benefits of SGD here but you can find some useful links at the end of this blog. Here’s how to implement the SGD in code:

def sgd(features, outputs):
    max_epochs = 5000
    weights = np.zeros(features.shape[1])
    # stochastic gradient descent
    for epoch in range(1, max_epochs):
        # shuffle to prevent repeating update cycles
        X, Y = shuffle(features, outputs)
        for ind, x in enumerate(X):
            ascent = calculate_cost_gradient(weights, x, Y[ind])
            weights = weights - (learning_rate * ascent)

    return weights

Let’s call it inside init() function by adding this code:

# inside init()
# train the model
print("training started...")
W = sgd(X_train.to_numpy(), y_train.to_numpy())
print("training finished.")
print("weights are: {}".format(W))

In the above implementation of sgd() we are running the loop 5000 times (could have been any number). Each iteration costs us time and extra computations. We don’t need to complete all the iterations. We can terminate the loop when our stoppage criterion is met.

What should be the stoppage criterion? There are multiple options, but we will use the simplest one. We will stop the training when the current cost hasn’t decreased much as compared to the previous cost. Here is how we will define sgd() with stoppage criterion:

def sgd(features, outputs):
    max_epochs = 5000
    weights = np.zeros(features.shape[1])
    nth = 0
    prev_cost = float("inf")
    cost_threshold = 0.01  # in percent
    # stochastic gradient descent
    for epoch in range(1, max_epochs):
        # shuffle to prevent repeating update cycles
        X, Y = shuffle(features, outputs)
        for ind, x in enumerate(X):
            ascent = calculate_cost_gradient(weights, x, Y[ind])
            weights = weights - (learning_rate * ascent)

        # convergence check on 2^nth epoch
        if epoch == 2 ** nth or epoch == max_epochs - 1:
            cost = compute_cost(weights, features, outputs)
            print("Epoch is:{} and Cost is: {}".format(epoch, cost))
            # stoppage criterion
            if abs(prev_cost - cost) < cost_threshold * prev_cost:
                return weights
            prev_cost = cost
            nth += 1
    return weights

After training the model using SGD, we finally have the optimal weights w* that define the best possible hyperplane separating the two classes. Let’s test our model using this hyperplane. Add this code to the init() function:

# inside init()
# testing the model on the test set
y_test_predicted = np.array([])
for i in range(X_test.shape[0]):
    yp = np.sign(np.dot(W, X_test.to_numpy()[i]))  # model
    y_test_predicted = np.append(y_test_predicted, yp)

print("accuracy on test dataset: {}".format(accuracy_score(y_test.to_numpy(), y_test_predicted)))
print("recall on test dataset: {}".format(recall_score(y_test.to_numpy(), y_test_predicted)))
print("precision on test dataset: {}".format(precision_score(y_test.to_numpy(), y_test_predicted)))

Now let’s call init() function:

# set hyper-parameters and call init
# hyper-parameters are normally tuned using cross-validation
# but the following work well enough

reg_strength = 10000 # regularization strength
learning_rate = 0.000001
init()

Here’s the output:

# OUTPUT
reading dataset...
applying feature engineering...
splitting dataset into train and test sets...
training started...
Epoch is: 1 and Cost is: 5333.266133501857
Epoch is: 2 and Cost is: 3421.9128432834573
Epoch is: 4 and Cost is: 2437.2790231100216
Epoch is: 8 and Cost is: 1880.2998267933792
Epoch is: 16 and Cost is: 1519.5578612139725
Epoch is: 32 and Cost is: 1234.642324549297
Epoch is: 64 and Cost is: 977.3285621274708
Epoch is: 128 and Cost is: 804.8893546235923
Epoch is: 256 and Cost is: 703.407799431284
Epoch is: 512 and Cost is: 645.8275191300031
Epoch is: 1024 and Cost is: 631.6024252740094
Epoch is: 2048 and Cost is: 615.8378582171482
Epoch is: 4096 and Cost is: 605.0990964730645
Epoch is: 4999 and Cost is: 606.8186618758745
training finished.
weights are: [ 1.32516553 0.83500639 1.12489803 2.16072054 -1.24845441 -3.24246498
3.27876342 6.83028706 -0.46562238 0.10063844 5.68881254 -1.93421932
3.27394523 3.77331751 1.67333278 -2.43170464 -1.7683188 0.84065607
-1.9612766 -1.84996828 2.69876618 5.32541102 1.0380137 3.0787769
2.2140083 -0.61998182 2.66514199 0.02348447 4.64797917 2.17249278
-9.27401088]
testing the model...
accuracy on test dataset: 0.9736842105263158
recall on test dataset: 0.9534883720930233
precision on test dataset: 0.9534883720930233

Note the accuracy, recall, and precision scores of the model and the number of epochs it took to converge. Now let’s try to improve them using feature selection.

Feature selection encompasses statistical techniques that help filter out irrelevant or redundant features. Correlation and p-values are among these techniques. Using them, we will select a subset of relevant and important features from our original set of features.

What is correlation? Correlation is a degree of linear dependence (or linear relationship) between two variables. Two features are said to be correlated if the values of one feature can be explained by some linear relationship of the second feature. The degree of this relationship is given by the correlation coefficient (or “r”). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.
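Here is a quick sketch of what that looks like in code, with made-up values (in our script the same idea is applied through X.corr() inside remove_correlated_features()):

# correlation coefficient r between toy variables
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2 * a + 1                              # perfectly linearly related to a
c = np.array([2.0, -1.0, 4.0, 0.0, 3.0])   # values with no linear relationship to a
print(np.corrcoef(a, b)[0, 1])             # 1.0 -> strongly correlated
print(np.corrcoef(a, c)[0, 1])             # much closer to 0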

Why do we remove one of the correlated features? There are multiple reasons, but the simplest is that correlated features have almost the same effect on the dependent variable. Moreover, correlated features won’t improve our model and will most probably worsen it, so we are better off using only one of them. After all, fewer features mean faster learning and a simpler model.

What are p-values? The topic is too broad to cover in this blog, but in the context of feature selection, p-values help us find the features that are most significant in explaining variation in the dependent variable (y).

A feature with a low p-value has more significance and a feature with a high p-value has less significance in explaining the variation. Usually, we set a Significance Level SL (threshold) and if a feature has a p-value above this level it is discarded. I will leave some links at the end of this blog for an in-depth study on p-values.
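As a small sketch of where those p-values come from (toy data; in our script the p-values are produced the same way by sm.OLS inside remove_less_significant_features()):

# fit an OLS model with statsmodels and read the p-value of each feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 3 * X_toy[:, 0] + rng.normal(size=100)     # only the first feature matters
ols = sm.OLS(y_toy, sm.add_constant(X_toy)).fit()
print(ols.pvalues)                                 # the second feature should get a high p-value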

Why do we remove features with high p-values? Because they don’t tell us much about the behavior of the dependent variable. So why keep them and unnecessarily increase the complexity of our model when they don’t help us predict the result?

We have two functions, remove_correlated_features() and remove_less_significant_features(), for removing highly correlated features and less significant features (using p-values and backward elimination), respectively:

# >> FEATURE SELECTION << #
def remove_correlated_features(X):
    corr_threshold = 0.9
    corr = X.corr()
    drop_columns = np.full(corr.shape[0], False, dtype=bool)
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[0]):
            if corr.iloc[i, j] >= corr_threshold:
                drop_columns[j] = True
    columns_dropped = X.columns[drop_columns]
    X.drop(columns_dropped, axis=1, inplace=True)
    return columns_dropped

def remove_less_significant_features(X, Y):
    sl = 0.05
    regression_ols = None
    columns_dropped = np.array([])
    for itr in range(0, len(X.columns)):
        regression_ols = sm.OLS(Y, X).fit()
        max_col = regression_ols.pvalues.idxmax()
        max_val = regression_ols.pvalues.max()
        if max_val > sl:
            X.drop(max_col, axis='columns', inplace=True)
            columns_dropped = np.append(columns_dropped, [max_col])
        else:
            break
    regression_ols.summary()
    return columns_dropped

Let’s call these functions inside init() before we apply normalization:

# inside init()
# filter features
remove_correlated_features(X)
remove_less_significant_features(X, Y)

Rerun the code and check the output:

# OUTPUT WITH FEATURE SELECTION
reading dataset...
applying feature engineering...
splitting dataset into train and test sets...
training started...
Epoch is: 1 and Cost is: 7198.889722245353
Epoch is: 2 and Cost is: 6546.424590270085
Epoch is: 4 and Cost is: 5448.724593530262
Epoch is: 8 and Cost is: 3839.8660601754004
Epoch is: 16 and Cost is: 2643.2493061396613
Epoch is: 32 and Cost is: 2003.9830891013514
Epoch is: 64 and Cost is: 1595.2499320295813
Epoch is: 128 and Cost is: 1325.7502330505054
Epoch is: 256 and Cost is: 1159.7928936478063
Epoch is: 512 and Cost is: 1077.5846940303365
Epoch is: 1024 and Cost is: 1047.208390340501
Epoch is: 2048 and Cost is: 1040.2241600540974
training finished.
weights are: [ 3.53520254 11.03169318 -2.31444264 -7.89186867 10.14516174 -1.28905488
-6.4397589 2.26113987 -3.87318997 3.23075732 4.94607957 4.81819288
-4.72111236]
testing the model...
accuracy on test dataset: 0.9912280701754386
recall on test dataset: 0.9767441860465116
precision on test dataset: 0.9767441860465116

As you can see, the accuracy (99%), precision (0.98), and recall (0.98) scores have improved. Moreover, SGD converged faster: training ended within 2048 epochs, far fewer than the previous run (5000 epochs).

You can get the complete code in this Github repository:


