In this post we'll see how to use polynomial regression. Simple linear regression and multiple linear regression assume a straight-line (linear) relationship between the predictors and the target, but that may not always be the case with real-world data. Also, if a scatterplot of the residuals (y_test - y_pred) versus the predicted values (y_pred) shows curvature or other patterns, it suggests that the relationship between the predictors and the response is non-linear.
In such cases, a simple linear regression is inadequate, and a more flexible model like polynomial regression can often improve the fit.
Polynomial Regression
Polynomial regression is a form of linear regression that lets you model a non-linear relationship between the independent variables (X) and the dependent variable (y) by adding polynomial terms of the independent variable(s).
The polynomial regression model for a single predictor, x, is:
$$ y=\beta _0+\beta _1x+\beta _2x^2+\beta _3x^3+\dots +\beta _nx^n+\epsilon$$
where n is called the degree of the polynomial, so the equation above is an n-th degree polynomial. The relationship is called quadratic if the degree is 2, cubic if the degree is 3, and so on. Here
- y is the dependent variable.
- x is the independent variable.
- \( \beta _0, \beta _1, \dots , \beta _n \) are the coefficients of the polynomial terms.
- \(\epsilon\) is the error term.
If there are multiple predictors (say \(x_1, x_2\)), polynomial regression also includes:
- powers of each feature (e.g. \(x_1^2, x_2^2\))
- interaction terms (e.g. \(x_1 x_2\))
Suppose the predictors are x1,x2,x3. A polynomial regression of degree 2 (quadratic) can be written as:
$$ y=\beta _0+\beta _1x_1+\beta _2x_2+\beta _3x_3+\beta _{11}x_1^2+\beta _{22}x_2^2+\beta _{33}x_3^2+ \\ \beta _{12}x_1x_2+\beta _{13}x_1x_3+\beta _{23}x_2x_3+\epsilon$$
- \(\beta _0\): intercept
- \(\beta _i\): linear coefficients
- \(\beta _{ii}\): quadratic terms (squares of predictors)
- \(\beta _{ij}\): interaction terms (cross-products between predictors)
- \(\epsilon\) : error term
The generalized form of polynomial regression is given below.
For a polynomial of degree d with three predictors:
$$y=\sum _{i+j+k\leq d}\beta _{ijk}\, x_1^i\, x_2^j\, x_3^k+\epsilon $$
One thing to keep in mind about polynomial regression is that, though the features are non-linear transformations of inputs, polynomial regression is still considered linear regression since it is linear in the regression coefficients \(\beta _1, \beta _2, \beta _3 … \beta _n\).
Polynomial linear regression using scikit-learn Python library
The dataset used here can be downloaded from https://www.kaggle.com/datasets/rukenmissonnier/manufacturing-data-for-polynomial-regression/data. The goal is to predict the quality rating from the given features.
In the implementation, the code is broken into several smaller units, with some explanation of the data pre-processing steps in between.
1. Importing libraries and reading CSV file
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df = pd.read_csv('./manufacturing.csv')
The manufacturing.csv file is assumed to be in the current directory.
2. Getting info about the data.
print(df.info())
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3957 entries, 0 to 3956
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Temperature (°C)                3957 non-null   float64
 1   Pressure (kPa)                  3957 non-null   float64
 2   Temperature x Pressure          3957 non-null   float64
 3   Material Fusion Metric          3957 non-null   float64
 4   Material Transformation Metric  3957 non-null   float64
 5   Quality Rating                  3957 non-null   float64
You can also use the following command to get summary statistics like the mean, standard deviation, and min and max values for each column.
print(df.describe())
3. Checking for duplicates
You can check for duplicate rows in order to remove them if required.
# checking for duplicates
print(df.duplicated().sum())  # 0
4. Checking for missing values
# count the number of missing (null, or NaN) values in each column of the DataFrame
print(df.isnull().sum())
Output
Temperature (°C)                  0
Pressure (kPa)                    0
Temperature x Pressure            0
Material Fusion Metric            0
Material Transformation Metric    0
Quality Rating                    0
So, there are no missing values.
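Had there been missing values, two common options are dropping the affected rows or imputing the gaps. A quick sketch on a toy DataFrame (hypothetical here, since this dataset is complete):

```python
import numpy as np
import pandas as pd

# toy frame with one missing value, for illustration only
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

dropped = toy.dropna()          # drop rows containing any NaN
filled = toy.fillna(toy.mean()) # or impute, e.g. with each column's mean

print(len(dropped))                      # 2
print(filled.isnull().sum().sum())       # 0
```

Which option is appropriate depends on how much data is missing and whether the missingness is random.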
5. Checking for multicollinearity
You can also check for multicollinearity by displaying a correlation heatmap, which shows the strength of the relationships between variables.
- Values close to 1 or -1 indicate strong correlations
- Values close to 0 indicate weak or no correlations
# check for multicollinearity
correlation_matrix = df.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
If you want to remove columns because of high multicollinearity, the following code can be used; in this example no column ends up being removed. Note that it operates on the feature matrix X, which is defined in the next step.
# select columns with numerical values
v = X.select_dtypes(include='number')
corr_matrix = v.corr().abs()  # absolute correlations

# keep only the upper triangular part of the correlation matrix
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# get the columns having any correlation value > 0.85
to_drop = [column for column in upper.columns if any(upper[column] > 0.85)]
print(to_drop)
X_reduced = X.drop(columns=to_drop)
6. Feature and label selection
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
Explanation:
In X = df.iloc[:, :-1]
- : means "select all rows."
- :-1 means "select all columns except the last one."
In y = df.iloc[:, -1]
- : means "select all rows."
- -1 means "select the last column" (negative indexing).
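To make the slicing concrete, here is a tiny toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2],
    "b": [3, 4],
    "target": [5, 6],
})

X_toy = toy.iloc[:, :-1]  # all rows, all columns except the last
y_toy = toy.iloc[:, -1]   # all rows, last column only

print(list(X_toy.columns))  # ['a', 'b']
print(y_toy.name)           # 'target'
```

Because the target is the last column in this dataset too, the same two lines split it into features and label.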
7. Plotting predictor-target relationship using scatter plot to show that it is not linear
# plot predictor-target relationship using scatter plots
features = X.columns
fig, axes = plt.subplots(1, len(features), sharey=True, figsize=(15, 4))
for i, col in enumerate(features):
    sns.scatterplot(x=df[col], y=df["Quality Rating"], ax=axes[i])
    axes[i].set_xlabel(col)
    axes[i].set_title(f"{col} \nvs Quality Rating")
plt.show()
8. Splitting and scaling data
Splitting is done using train_test_split where test_size is passed as 0.2, meaning 20% of the data is used as test data whereas 80% of the data is used to train the model.
As seen in the polynomial regression equation, the model creates higher-degree terms (squared, cubic, etc.) from your variables. These terms can grow very large in value, which can skew the results. That is why scaling your features is important; otherwise features with larger numeric ranges can dominate the model.
Note that both fitting and transformation (using fit_transform) are done on the training data, whereas only the transform() method is used on the test data. This prevents information from the test set leaking into the scaling statistics.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# scaling values
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
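To see what the scaler actually learns, here is a minimal sketch with made-up numbers: fit_transform stores the training mean and standard deviation, and transform then reuses those same statistics on new data instead of re-computing them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the train statistics

print(scaler.mean_)  # [2.]
print(test_scaled)   # (10 - 2) / std_train, i.e. scaled with the *train* statistics
```

If you called fit_transform on the test data instead, the test set would be standardized against itself, leaking information and making train and test features incomparable.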
9. Polynomial features
The next step is to choose the degree of the polynomial. With the PolynomialFeatures class in the scikit-learn library it becomes very easy to transform your existing features into higher-degree terms.
poly_reg = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly_reg.fit_transform(X_train_scaled)
The parameter include_bias controls whether a bias (intercept) column of ones is added to the transformed feature matrix. When you use PolynomialFeatures together with LinearRegression, by default LinearRegression(fit_intercept=True) already adds an intercept term to the model. So, if you also set include_bias=True in PolynomialFeatures, you'll end up with a redundant constant column of ones in your design matrix.
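A quick sketch of that redundancy: with include_bias=True the first transformed column is all ones, which LinearRegression's own intercept already accounts for.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [3.0, 4.0]])

with_bias = PolynomialFeatures(degree=2, include_bias=True).fit_transform(X)
without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

print(with_bias[:, 0])        # [1. 1.]  <- redundant constant column
print(with_bias.shape[1])     # 6 terms: 1, x1, x2, x1^2, x1*x2, x2^2
print(without_bias.shape[1])  # 5 terms, the constant column is dropped
```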
10. Fitting the model
lin_reg = LinearRegression()
lin_reg.fit(x_poly, y_train)
You may wonder why LinearRegression is used here. Keep in mind that it is applied to the polynomial features (x_poly); the model is still linear in its coefficients.
Once the model is trained, predictions can be made on the test data, which can then be compared with the actual values (y_test).
# predicting values
y_pred = lin_reg.predict(poly_reg.transform(X_test_scaled))
11. Comparing test and predicted data
# getting the residual percentage
df_results = pd.DataFrame({'Target':y_test, 'Predictions':y_pred})
df_results['Residual'] = df_results['Target'] - df_results['Predictions']
df_results['Difference%'] = np.abs((df_results['Residual'] * 100)/df_results['Target'])
print(df_results.head(10))
Output
Target Predictions Residual Difference%
3256 100.00 102.00 -2.00 2.00
142 100.00 99.54 0.46 0.46
2623 99.58 103.70 -4.12 4.14
3741 100.00 100.79 -0.79 0.79
2858 99.58 103.68 -4.10 4.11
3137 95.87 93.66 2.22 2.31
2672 100.00 99.01 0.99 0.99
1420 100.00 99.08 0.92 0.92
1669 100.00 98.94 1.06 1.06
1606 100.00 99.25 0.75 0.75
12. Checking model metrics such as R-squared, mean squared error (MSE) and root mean squared error (RMSE).
#Metrics - R-Squared, MSE, RMSE
print("R2 score", r2_score(y_test, y_pred))
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error", mse)
print("Root Mean Squared Error", np.sqrt(mse))
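The same metrics can also be computed by hand from their definitions, which is a useful sanity check. A sketch with made-up actual and predicted values (not the model's output):

```python
import numpy as np

# made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 99.5, 95.9, 100.0])
y_hat = np.array([102.0, 99.0, 93.7, 100.8])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # root mean squared error
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(mse, rmse, r2)
```

These hand-computed values match scikit-learn's mean_squared_error and r2_score on the same inputs.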
13. Plotting residuals vs. predicted values
# Residuals = actual - predicted
residuals = y_test - y_pred
# Scatterplot: residuals vs fitted
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red', linestyle='--') # reference line at 0
plt.xlabel("Predicted Values (y_pred)")
plt.ylabel("Residuals (y_test - y_pred)")
plt.title("Residuals vs Fitted")
plt.show()
In polynomial regression, a plot of residuals vs. predicted values should show points randomly and evenly scattered around the horizontal zero line (y = 0). As you can see, the plot above is not evenly scattered, and the residuals form a curved pattern.
If the degree is increased to 5 in PolynomialFeatures
poly_reg = PolynomialFeatures(degree=5, include_bias=False)
and the model is refit, then the plot of residuals vs. fitted values looks as given below.
The curve is less pronounced than in the quadratic case. However, there is still some systematic pattern; the points are not fully randomly scattered.
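Rather than trying degrees one at a time, a small loop can compare test-set RMSE across several degrees. Below is a sketch on synthetic data standing in for the manufacturing dataset (the real X and y would be swapped in, and cross-validation would be a more robust way to pick the degree):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# synthetic non-linear data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = 2 + X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

for degree in (1, 2, 3, 5):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(X_train_s), y_train)
    y_pred = model.predict(poly.transform(X_test_s))
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"degree={degree}: test RMSE={rmse:.3f}")
```

On this synthetic quadratic data, RMSE drops sharply from degree 1 to degree 2 and barely improves after that; higher degrees mainly add variance and risk overfitting.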
That's all for this topic Polynomial Regression With Example. If you have any doubt or any suggestions to make please drop a comment. Thanks!