Monday, February 23, 2026

Polynomial Regression With Example

In this post we'll see how to use polynomial regression. Simple linear regression and multiple linear regression assume a straight-line (linear) relationship between the predictors and the target, but that is not always the case with real-world data. Also, if a scatterplot of the residuals (y_test - y_pred) versus the predicted values (y_pred) shows curvature or other patterns, it suggests that the relationship between the predictors and the response is non-linear.

In such cases, a simple linear regression is inadequate, and a more flexible model like polynomial regression can often improve the fit.

Polynomial Regression

Polynomial regression is a kind of linear regression that allows you to model a non-linear relationship between the independent variables (X) and the dependent variable (y) by adding polynomial terms of the independent variable(s).

The polynomial regression model for a single predictor, x, is:

$$ y=\beta _0+\beta _1x+\beta _2x^2+\beta _3x^3+\dots +\beta _nx^n+\epsilon$$

where n is the degree of the polynomial, so the above equation is an n-th degree polynomial. The relationship is called quadratic if the degree is 2, cubic if the degree is 3, and so on. Here

  • y is the dependent variable.
  • x is the independent variable.
  • \( \beta _0, \beta _1, \dots , \beta _n \) are the coefficients of the polynomial terms.
  • \(\epsilon\) is the error term.

If there are multiple predictors (like \(x_1, x_2\)), polynomial regression also includes-

  • powers of each feature (for example, \(x_1^2, x_2^2\))
  • interaction terms (for example, \(x_1x_2\))

Suppose the predictors are \(x_1, x_2, x_3\). A polynomial regression of degree 2 (quadratic) can be written as:

$$ y=\beta _0+\beta _1x_1+\beta _2x_2+\beta _3x_3+\beta _{11}x_1^2+\beta _{22}x_2^2+\beta _{33}x_3^2+ \\ \beta _{12}x_1x_2+\beta _{13}x_1x_3+\beta _{23}x_2x_3+\epsilon$$

  • \(\beta _0\): intercept
  • \(\beta _i\): linear coefficients
  • \(\beta _{ii}\): quadratic terms (squares of predictors)
  • \(\beta _{ij}\): interaction terms (cross-products between predictors)
  • \(\epsilon\) : error term

The generalized form of polynomial regression is given below.

For a polynomial of degree d with three predictors:

$$y=\sum _{i+j+k\leq d}\beta _{ijk}\, x_1^i\, x_2^j\, x_3^k+\epsilon $$

One thing to keep in mind about polynomial regression is that, though the features are non-linear transformations of the inputs, polynomial regression is still considered linear regression since it is linear in the regression coefficients \(\beta _0, \beta _1, \beta _2, \dots , \beta _n\).
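Because the model is linear in the coefficients, ordinary least squares can recover them once the design matrix of powers of x is built by hand. A small sketch with made-up coefficients (1, 2, 3) and noiseless data:

```python
import numpy as np

# generate noiseless data from y = 1 + 2x + 3x^2
x = np.linspace(-2, 2, 50)
y = 1 + 2 * x + 3 * x ** 2

# design matrix with columns [1, x, x^2]; solving it is plain linear least squares
A = np.column_stack([np.ones_like(x), x, x ** 2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.round(beta, 6))  # [1. 2. 3.]
```

The non-linearity lives entirely in the columns of the design matrix; the solver itself is the same one used for plain linear regression.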

Polynomial regression using the scikit-learn Python library

The dataset used here can be downloaded from https://www.kaggle.com/datasets/rukenmissonnier/manufacturing-data-for-polynomial-regression/data

The goal is to predict the quality rating from the given features.

In the implementation, the code is broken into several smaller units with explanations of the data pre-processing steps in between.

1. Importing libraries and reading CSV file

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df = pd.read_csv('./manufacturing.csv')
The manufacturing.csv file is assumed to be in the current directory.

2. Getting info about the data.

print(df.info())

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3957 entries, 0 to 3956
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Temperature (°C)                3957 non-null   float64
 1   Pressure (kPa)                  3957 non-null   float64
 2   Temperature x Pressure          3957 non-null   float64
 3   Material Fusion Metric          3957 non-null   float64
 4   Material Transformation Metric  3957 non-null   float64
 5   Quality Rating                  3957 non-null   float64

You can also use the following command to get summary statistics such as the mean, standard deviation, and min and max values for each column.

print(df.describe())

3. Checking for duplicate rows

You can check for duplicate rows in order to remove them if required.

#checking for duplicates
print(df.duplicated().sum()) #0
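As an illustration on a tiny made-up DataFrame, duplicated() flags repeated rows and drop_duplicates() removes them:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})

print(toy.duplicated().sum())  # 1 -- row 1 repeats row 0
toy = toy.drop_duplicates()
print(len(toy))                # 2 rows remain
```

In the manufacturing dataset the count is 0, so nothing needs to be dropped.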

4. Checking for missing values

#count the number of missing (null, or NaN) values in each column of a DataFrame
print(df.isnull().sum())

Output

Temperature (°C)                  0
Pressure (kPa)                    0
Temperature x Pressure            0
Material Fusion Metric            0
Material Transformation Metric    0
Quality Rating                    0

So, there are no missing values.
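Had there been missing values, typical options are dropping the affected rows (dropna()) or filling them in, for example with the column mean. A sketch on a toy column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
print(toy.isnull().sum()['a'])   # 1 missing value

filled = toy.fillna(toy.mean())  # impute with the column mean (2.0 here)
print(filled['a'].tolist())      # [1.0, 2.0, 3.0]
```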

5. Checking for multicollinearity

You can also check for multicollinearity by displaying a correlation heatmap, which shows how strongly the variables are related to each other.

  • Values close to 1 or -1 indicate strong correlations
  • Values close to 0 indicate weak or no correlations

# check for multicollinearity
correlation_matrix = df.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

If you want to remove columns because of high multicollinearity, the following code can be used (in this example, no column ends up being removed). Note that it operates on the feature matrix X, which is created in the next step.

# select columns with numerical values
v = X.select_dtypes(include='number')
corr_matrix = v.corr().abs()   # absolute correlations
# keep only the upper triangular part of the correlation matrix
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
# get the columns having any correlation value > 0.85
to_drop = [column for column in upper.columns if any(upper[column] > 0.85)]
print(to_drop)
X_reduced = X.drop(columns=to_drop)
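The drop logic can be checked on a small made-up frame where one column is almost a multiple of another (column names a, b, c are just for the illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.1, 4.0, 6.2, 8.1],   # roughly 2*a, so highly correlated with a
    'c': [5.0, 1.0, 4.0, 2.0],
})

corr_matrix = toy.corr().abs()
# keep only the upper triangle so each pair is counted once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.85)]

print(to_drop)  # ['b']
```

Using only the upper triangle ensures that for a correlated pair only the second column of the pair is dropped, not both.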

6. Feature and label selection

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

Explanation-

In X = df.iloc[:, :-1]

  • : means "select all rows."
  • :-1 means "select all columns except the last one."

In y = df.iloc[:, -1]

  • : means "select all rows."
  • -1 means "select the last column" (using negative indexing).
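On a tiny made-up frame the slicing behaves like this:

```python
import pandas as pd

toy = pd.DataFrame({'f1': [1, 2], 'f2': [3, 4], 'target': [5, 6]})

X = toy.iloc[:, :-1]   # every column except the last
y = toy.iloc[:, -1]    # only the last column

print(list(X.columns))  # ['f1', 'f2']
print(y.tolist())       # [5, 6]
```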

7. Plotting the predictor-target relationships using scatter plots to show that they are not linear

#plot predictor-target relationship using scatter plot
features = X.columns
fig, axes = plt.subplots(1, len(features), sharey=True, figsize=(15, 4))
for i, col in enumerate(features):
    #plt.scatter(df[col], df["Quality Rating"])
    sns.scatterplot(x=df[col], y=df["Quality Rating"], ax=axes[i])

    axes[i].set_xlabel(col)
    axes[i].set_title(f"{col} \nvs Quality Rating")
plt.show()
[Figure: scatter plots of each feature vs Quality Rating]

8. Splitting and scaling data

Splitting is done using train_test_split where test_size is passed as 0.2, meaning 20% of the data is used as test data whereas 80% of the data is used to train the model.

As seen in the polynomial regression equation, higher-degree terms (squared, cubic, etc.) are created from the original variables. The values of these terms can grow very large, which can skew the results. That is why scaling the features is important; otherwise features with larger numeric ranges can dominate the model.

Note that both fitting and transformation (using fit_transform) are done on the training data, whereas only the transform() method is used on the test data. This prevents information from the test set leaking into the training process.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
#Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state = 0)

#scaling values
scaler_X = StandardScaler()

X_train_scaled = scaler_X.fit_transform(X_train)

X_test_scaled = scaler_X.transform(X_test)
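The reason for fitting only on the training data shows up clearly in a tiny sketch: the scaler memorizes the training mean and standard deviation and reuses them on the test data, so the test set contributes nothing to the statistics (the toy values below are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[0.0], [10.0]])   # toy training data
X_te = np.array([[5.0]])           # toy test data

sc = StandardScaler()
sc.fit(X_tr)                       # learns mean=5, std=5 from the training data only

print(sc.mean_, sc.scale_)         # [5.] [5.]
print(sc.transform(X_te))          # [[0.]] -- scaled with the training statistics
```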

9. Polynomial features

The next step is to choose the degree of the polynomial. With the PolynomialFeatures class in the scikit-learn library it becomes very easy to transform the existing features into higher-degree and interaction terms.

poly_reg = PolynomialFeatures(degree=2, include_bias=False)

x_poly = poly_reg.fit_transform(X_train_scaled)

The parameter include_bias controls whether a bias (intercept) column of ones is added to the transformed feature matrix. When you use PolynomialFeatures together with LinearRegression, by default LinearRegression(fit_intercept=True) already adds an intercept term to the model. So, if you also set include_bias=True in PolynomialFeatures, you'll end up with a redundant constant column of ones in your design matrix.
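The redundant column is easy to see on a single toy value:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[3.0]])

with_bias = PolynomialFeatures(degree=2, include_bias=True).fit_transform(X)
without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

print(with_bias)     # [[1. 3. 9.]] -- leading column of ones
print(without_bias)  # [[3. 9.]]
```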

10. Fitting the model

lin_reg = LinearRegression()
lin_reg.fit(x_poly, y_train)

You may wonder why LinearRegression is used here. Keep in mind that it is fitted on the polynomial features (x_poly), so the model is still linear in the coefficients even though the features are non-linear transformations of the inputs.

Once the model is trained, predictions can be made on the test data, which can then be compared with the actual values (y_test).

# predicting values
y_pred = lin_reg.predict(poly_reg.transform(X_test_scaled))

11. Comparing test and predicted data

# getting the residual percentage
df_results = pd.DataFrame({'Target':y_test, 'Predictions':y_pred})
df_results['Residual'] = df_results['Target'] - df_results['Predictions']
df_results['Difference%'] = np.abs((df_results['Residual'] * 100)/df_results['Target'])
print(df_results.head(10))

Output

      Target  Predictions  Residual  Difference%
3256  100.00       102.00     -2.00         2.00
142   100.00        99.54      0.46         0.46
2623   99.58       103.70     -4.12         4.14
3741  100.00       100.79     -0.79         0.79
2858   99.58       103.68     -4.10         4.11
3137   95.87        93.66      2.22         2.31
2672  100.00        99.01      0.99         0.99
1420  100.00        99.08      0.92         0.92
1669  100.00        98.94      1.06         1.06
1606  100.00        99.25      0.75         0.75

12. Viewing the model metrics: R-squared, mean squared error and root mean squared error

#Metrics - R-Squared, MSE, RMSE
print("R2 score", r2_score(y_test, y_pred)) 
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error", mse)
print("Root Mean Squared Error", np.sqrt(mse))
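On made-up toy arrays the three metrics work out like this:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.2])  # errors: -0.1, 0.1, -0.2

mse = mean_squared_error(y_true, y_hat)      # mean of squared errors
print(round(float(mse), 4))                  # 0.02
print(round(float(np.sqrt(mse)), 4))         # 0.1414
print(round(float(r2_score(y_true, y_hat)), 4))  # 0.97
```

RMSE is handy because it is in the same units as the target, unlike MSE.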

13. Plotting residuals vs. predicted values

# Residuals = actual - predicted
residuals = y_test - y_pred

# Scatterplot: residuals vs fitted
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red', linestyle='--')  # reference line at 0
plt.xlabel("Predicted Values (y_pred)")
plt.ylabel("Residuals (y_test - y_pred)")
plt.title("Residuals vs Fitted")
plt.show()
[Figure: residuals vs predicted values for the degree-2 model]

In polynomial regression, the plot of residuals vs. predicted values should show randomly, evenly scattered points around the horizontal zero line (y=0). As you can see, the above plot does not show evenly scattered points, and the residuals form a curved pattern.

If the degree is increased to 5 in PolynomialFeatures

poly_reg = PolynomialFeatures(degree=5, include_bias=False)

and the modelling steps are repeated, the plot of residuals vs. fitted values changes as described below.

The curve is less pronounced than in the quadratic case. However, there is still some systematic pattern; the points are not fully randomly scattered.
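One way to make experimenting with the degree less error-prone is to chain the scaling, feature expansion and regression into a single scikit-learn Pipeline, so changing the degree is a one-line edit and the fit/transform bookkeeping is handled automatically. A sketch on synthetic quadratic data (the data-generating coefficients are made up):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# synthetic noiseless quadratic data: y = 1 + 2x + 0.5x^2
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1 + 2 * X[:, 0] + 0.5 * X[:, 0] ** 2

model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),  # change degree here
    LinearRegression(),
)
model.fit(X, y)
print(round(float(model.score(X, y)), 4))  # 1.0 on this noiseless data
```

With a pipeline, calling fit on the training data and predict on the test data automatically applies fit_transform and transform in the right places.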

That's all for this topic Polynomial Regression With Example. If you have any doubt or any suggestions to make please drop a comment. Thanks!


