In this post we'll see how to use Support Vector Regression (SVR), which extends Support Vector Machines (SVM) to regression tasks. SVR is a regression model, meaning it predicts continuous values, whereas SVM is a classification model that predicts a category or class label.
SVR is particularly useful when the relationship between features and target is non-linear or complex, where traditional linear regression struggles.
How does SVR work
Support Vector Regression works on the concepts of a hyperplane and support vectors. It also has a margin of error, called epsilon (ε). Let's try to understand these concepts.
1. Hyperplane- The SVR model works by finding a function (hyperplane) that fits most data points within a defined margin of error (the ε-tube).
In the case of a linear relationship (Linear SVR), this hyperplane can be thought of as a straight line in 2D space.
In the case of a non-linear relationship, the hyperplane exists in a higher-dimensional space; in the original 2D space it manifests as a best-fitting curved function.
2. Kernel trick- SVR's ability to work in higher-dimensional spaces is achieved by the "kernel trick". A non-linear relationship cannot be represented by a straight line in the original feature space. By mapping the data into a higher-dimensional feature space, the relationship may become linear in that space. This transformation is not done explicitly, as it can be very expensive.
The kernel trick makes this computation cheap: it lets SVR work in high-dimensional spaces implicitly, without ever explicitly constructing the transformed features.
Common kernel functions used in SVR are linear, polynomial, radial basis function (RBF), and sigmoid.
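To see the kernel trick in action, the RBF kernel value between two points can be computed directly from its formula and compared with scikit-learn's implementation; no explicit high-dimensional mapping is ever built. A minimal sketch (the points and gamma value are chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two sample points in 2D feature space (illustrative values)
a = np.array([[1.0, 2.0]])
b = np.array([[2.0, 0.5]])
gamma = 0.5

# RBF kernel value computed by scikit-learn
# (an implicit inner product in a high-dimensional space)
k_sklearn = rbf_kernel(a, b, gamma=gamma)[0, 0]

# Same value computed directly from the formula exp(-gamma * ||a - b||^2)
k_manual = np.exp(-gamma * np.sum((a - b) ** 2))

print(k_sklearn, k_manual)  # both values are identical
```

The two numbers agree exactly, which is the point: the kernel evaluates the high-dimensional similarity using only the original 2D coordinates.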
3. Margin of error- SVR has some tolerance for error. It defines a margin of error (ε), and data points that reside within that margin are considered to have no error. You can think of this margin as a tube running on both sides of the fitted line (or curve). The tube is known as the ε-insensitive tube because it makes the model insensitive to minor fluctuations.
SVR tries to fit a function such that most points lie within an ε-tube around the regression line.
4. Support vectors- These are the points that fall outside the ε-tube or lie on its edge. The support vectors determine the position of the hyperplane; they are the only points that influence the regression function, because points falling within the ε-tube are considered to have no error.
5. Slack variables- Non-negative values that measure how far a point outside the tube deviates from the boundary of the ε-tube. Slack variables are represented using the symbol \(\xi\).
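The ε-insensitive idea behind the tube and the slack variables can be sketched as a loss function: deviations inside the tube cost nothing, while outside the tube the cost equals the slack \(\xi\), i.e. the distance to the tube's edge. A minimal illustration (the values are chosen arbitrarily):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # Errors within the epsilon tube are treated as zero;
    # outside the tube, the loss equals the distance to the
    # tube's edge (the slack variable xi)
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])
print(eps_insensitive_loss(y_true, y_pred))  # [0.  0.4 0.9]
```

The first point lies inside the tube (|error| = 0.05 < ε), so its loss is zero; the other two are penalized only for the portion of the error beyond ε.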
SVR equation
In SVR there is an objective function, and the goal is to minimize it:
$$\frac{1}{2}\| w\| ^2+C\sum _{i=1}^n(\xi _i+\xi _i^*) $$
1. Here w is the weight vector which is computed from the support vectors:
$$w=\sum _{i=1}^n(\alpha _i -\alpha _i^*)x_i$$
where
- \(x_i\) are the support vectors (training points that lie outside the ε-tube or on its edge)
- \(\alpha _i, \alpha _i^*\) are Lagrange multipliers from the optimization problem
Note that Lagrange multipliers are used to find the maximum or minimum of a function when certain conditions (constraints) must be satisfied.
The constraints, for each data point \((x_i, y_i)\), are:
$$y_i-(w\cdot x_i+b)\leq \varepsilon +\xi _i$$ $$(w\cdot x_i+b)-y_i\leq \varepsilon +\xi _i^*$$ $$\xi _i,\xi _i^*\geq 0$$
Note that in linear SVR, w is the weight vector defining the regression hyperplane.
In non-linear SVR, the kernel trick replaces direct inner products \(\langle x_i, x_j \rangle\) with a kernel function \(K(x_i, x_j)\), and w is not computed explicitly.
Minimizing \(\frac{1}{2}\| w\| ^2\) (the norm of the weight vector) ensures the function is as flat as possible. The benefit is improved generalization to new data and increased robustness against outliers.
2. \((\xi _i, \xi _i^*)\) are slack variables that measure deviations outside the ε-insensitive tube.
3. C is the regularization parameter controlling the trade-off between flatness and total deviations outside the ε-insensitive tube. A large C fits the data closely, which risks overfitting; a smaller C means a simpler model, which risks underfitting.
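The effect of C can be sketched by fitting SVR with different C values on synthetic data and comparing the training fit (the data and C values here are illustrative, not from the post's dataset):

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic 1D regression data: noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

scores = {}
for C in (0.01, 1, 100):
    model = SVR(kernel='rbf', C=C, epsilon=0.1).fit(X, y)
    scores[C] = model.score(X, y)
    # Training R^2 rises with C as the model fits the data more closely
    print(f"C={C:>6}: train R^2 = {scores[C]:.3f}")
```

A small C heavily penalizes complexity and underfits; a large C chases the training points, which can overfit on noisier data.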
SVR tries to minimize the objective function during training. It finds the best \(w, b, \alpha _i, \alpha _i^*\) that minimize this objective function.
Once the optimized values are calculated, the regression function is defined as:
$$f(x)=\sum _{i=1}^n(\alpha _i-\alpha _i^*)K(x_i,x)+b$$
This is the actual function you use to make predictions.
Where:
- \(x_i\) = training data points
- \(K(x_i, x)\) = kernel function (linear, polynomial, RBF, etc.)
- \(\alpha _i,\alpha _i^*\) = Lagrange multipliers from optimization
- b = bias term
If you use a linear kernel, the equation simplifies to:
$$f(x)=w\cdot x+b$$
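With a linear kernel, scikit-learn exposes w and b directly as the fitted model's coef_ and intercept_ attributes, so this simplification can be checked numerically. A minimal sketch on synthetic data (the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic linear data: y = 3*x1 - 2*x2 + noise
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, (30, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.05, size=30)

model = SVR(kernel='linear', C=1.0, epsilon=0.1).fit(X, y)

# For a linear kernel, w and b are available directly
w = model.coef_.ravel()
b = model.intercept_[0]

# f(x) = w . x + b reproduces model.predict() exactly
x_new = np.array([0.4, -0.3])
manual = w @ x_new + b
print(np.isclose(manual, model.predict([x_new])[0]))  # True
```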
When the kernel is RBF (Radial Basis Function), the SVR prediction equation takes the form:
$$f(x)=\sum _{i=1}^n(\alpha _i-\alpha _i^*)\, \exp \left( -\gamma \| x-x_i\| ^2\right) +b$$
When SVR is used with a polynomial kernel, the regression function becomes:
$$f(x)=\sum _{i=1}^n(\alpha _i-\alpha _i^*)\, (\gamma \cdot (x_i\cdot x)+r)^d+b$$
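These dual-form prediction equations can be reproduced from a fitted scikit-learn model, which exposes \((\alpha _i-\alpha _i^*)\) as dual_coef_, the support vectors as support_vectors_, and b as intercept_. A sketch with the RBF kernel on synthetic data (the data and gamma value are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic 1D data: clean sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 30)).reshape(-1, 1)
y = np.sin(X).ravel()

gamma = 0.5
model = SVR(kernel='rbf', gamma=gamma, C=1.0, epsilon=0.1).fit(X, y)

# dual_coef_ holds (alpha_i - alpha_i^*) for each support vector
coeffs = model.dual_coef_.ravel()
sv = model.support_vectors_
b = model.intercept_[0]

# f(x) = sum_i (alpha_i - alpha_i^*) * exp(-gamma * ||x - x_i||^2) + b
x_new = np.array([[2.5]])
k = np.exp(-gamma * np.sum((sv - x_new) ** 2, axis=1))
manual = coeffs @ k + b
print(np.isclose(manual, model.predict(x_new)[0]))  # True
```

Note that only the support vectors enter the sum; points inside the ε-tube have \(\alpha _i-\alpha _i^*=0\) and contribute nothing.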
Support Vector Regression using scikit-learn Python library
Dataset used here can be downloaded from- https://www.kaggle.com/datasets/mariospirito/position-salariescsv
Goal is to predict the salary based on the position level.
The implementation code is broken into several smaller units, with explanations of the data pre-processing steps in between.
1. Importing libraries and reading CSV file
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('./Position_Salaries.csv')
Position_Salaries.csv file is in the current directory.
2. Getting info about the data.
print(df.info())
Output
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Position  10 non-null     object
 1   Level     10 non-null     int64
 2   Salary    10 non-null     int64
As you can see, there are only 10 records. Since the dataset is already small, no train/test split is done. It is also evident at a glance that there are no duplicates or null values.
3. Feature and label selection
X = df.iloc[:, 1:-1]
y = df.iloc[:, -1]
Explanation- X = df.iloc[:, 1:-1]
- : means "select all rows."
- 1:-1 means "from column index 1 up to (but not including) the last column"
y = df.iloc[:, -1]
- : means select all rows.
- -1 means "select the last column", using negative indexing
"Position" column has been dropped as "level" column is also signifying the same thing in numerical values.
4. Printing X and y
print(X)
print(y)
On printing these two variables you can see that X is a 2D structure (a DataFrame) whereas y is 1D (a Series). They are printed here to show that y may need conversion to a 2D array, as some functions require a 2D array as a parameter.
5. Scaling data
SVR relies on kernel functions (like RBF, polynomial) that compute distances between points. If features are on different scales, one feature can dominate the distance metric, skewing the model. So, standardizing the features is required.
With SVR, target scaling is also required because the ε-insensitive tube and the penalty parameter C are defined relative to the scale of y. Scaling y ensures that ε and C operate in a normalized space.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_scaled = scaler_X.fit_transform(X)
# y is a pandas Series, so .values is needed before reshape
y_scaled = scaler_y.fit_transform(y.values.reshape(-1, 1)).ravel()
The dependent variable y needs to be changed to a 2D array; reshape() is used for that (called on y.values, since a pandas Series has no reshape() method). Note that with reshape() one of the dimensions can be -1, in which case that dimension is inferred from the length of the array. Here -1 is passed for the rows, so NumPy infers how many rows are needed from the array size.
ravel() is used to flatten the 2D array back to 1D. Otherwise the fit() method will complain, as it expects the dependent variable to be a 1D array.
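The reshape/ravel round trip described above can be illustrated in isolation (the sample values are just a few salaries from the dataset):

```python
import numpy as np

y = np.array([45000, 50000, 60000])

# reshape(-1, 1) turns the 1D array into a column vector (2D);
# -1 lets NumPy infer the number of rows from the array size
y_2d = y.reshape(-1, 1)
print(y_2d.shape)  # (3, 1)

# ravel() flattens it back to 1D, as expected by SVR's fit()
print(y_2d.ravel().shape)  # (3,)
```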
6. Fitting the model
reg = SVR(kernel='rbf')
reg.fit(X_scaled, y_scaled)
An object of the SVR class is created with kernel='rbf', which is also the default. Default values are used for C and epsilon: 1.0 and 0.1 respectively. Note that y_scaled was already flattened to a 1D array using ravel(); if that is not done, you'll get the following error:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
7. Predicting values
First a single value is predicted.
#predicting salary by passing level
y_pred_scaled = reg.predict(scaler_X.transform([[6]]))
# prediction will also be scaled, so need to do inverse transformation
y_pred = scaler_y.inverse_transform(y_pred_scaled.reshape(-1,1))
print("Prediction in original scale:", y_pred.ravel()) #145503.10688572
Note that the X value has to be scaled because the model was trained on scaled values. Also, the predicted value has to be inverse-transformed to bring it back to the original scale.
Predicting for the whole dataset. Since no split was done, there is no separate train and test data, so X_scaled is used to predict values.
y_pred = scaler_y.inverse_transform(reg.predict(X_scaled).reshape(-1,1)).ravel()
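As an aside, this manual scale/inverse-transform bookkeeping can be delegated to scikit-learn: a Pipeline scales X, and TransformedTargetRegressor scales y and inverse-transforms predictions automatically. A sketch using the same salary values as the dataset (the Level column is recreated inline here):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.arange(1, 11).reshape(-1, 1)  # stand-in for the Level column
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000], dtype=float)

# The pipeline scales X; TransformedTargetRegressor scales y before
# fitting and inverse-transforms predictions back automatically
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf')),
    transformer=StandardScaler(),
)
model.fit(X, y)
# Prediction comes back already in the original salary scale
print(model.predict([[6]]))
```

The prediction for level 6 should match the manually scaled approach, without any explicit reshape, ravel, or inverse_transform calls.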
8. Comparing actual and predicted values
A dataframe is created and printed to display original and predicted values side-by-side.
df_results = pd.DataFrame({'Target':y, 'Predictions':y_pred})
print(df_results)
Output
    Target    Predictions
0    45000   73416.856829
1    50000   78362.982831
2    60000   88372.122821
3    80000  108481.435811
4   110000  138403.075511
5   150000  178332.360333
6   200000  225797.711581
7   300000  271569.924155
8   500000  471665.638386
9  1000000  495411.293695
As you can see, the model has lots of room for improvement; lack of sufficient data is one of the main reasons here. For the 1,000,000 salary, the prediction is way off the mark.
9. Viewing model metrics: R-squared, mean squared error and root mean squared error.
#Metrics - R-Squared, MSE, RMSE
print("R2 score", r2_score(y, y_pred))
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error", mse)
print("Root Mean Squared Error", np.sqrt(mse))
Output
R2 score 0.7516001070620797
Mean Squared Error 20036494264.13176
Root Mean Squared Error 141550.3241399742
10. Visualize the result
plt.scatter(X, y, color='red')
plt.plot(X, scaler_y.inverse_transform(reg.predict(X_scaled).reshape(-1,1)), color='blue')
plt.xlabel('Level')
plt.ylabel('Salary')
plt.title("SVR")
That's all for this topic Support Vector Regression With Example. If you have any doubt or any suggestions to make please drop a comment. Thanks!