Monday, February 9, 2026

Multiple Linear Regression With Example

In the post Simple Linear Regression With Example we saw how to create a simple linear regression model using the scikit-learn library in Python. In this post we'll see how to create a multiple linear regression model using the same library. We'll also go through the steps for data pre-processing and cleaning, feature transformation, and encoding of categorical data.

Multiple Linear Regression

In simple linear regression we model the relationship between one independent variable (predictor) and one dependent variable. Multiple linear regression is a fundamental statistical technique used to model the relationship between one dependent variable and multiple independent variables. In other words, we'll create a model to analyse how multiple features together affect the outcome.

Multiple Linear Regression equation

In the context of machine learning, where we have sample data and use it to fit a regression model, the multiple linear regression equation is as given below.

$$ \hat{y}=b_0 + b_1x_1 + b_2x_2 + b_3x_3 + \dots + b_nx_n $$

Here \(\hat{y}\) is the predicted label (the output).

b0 is the intercept, which tells you where the regression line intercepts the Y-axis. In other words, it is the predicted value of \(\hat{y}\) when all the predictors x1, x2, x3, .., xn are zero.

b1, b2, .., bn are the slopes (coefficients). Each slope tells how much the dependent variable changes for a one-unit change in the corresponding independent variable when all the other independent variables are held constant. For example, b1 represents the estimated change in \(\hat{y}\) per unit increase in x1 when x2, x3, .., xn are held constant. To put it another way, to interpret b1 you imagine increasing x1 by 1 unit while keeping x2, x3, .., xn unchanged; the predicted change in \(\hat{y}\) is then exactly b1. The same logic applies for b2, b3 and so on.

The residual (difference between actual value and predicted value) term is calculated as \(e_i = y_i - \hat{y}_i\).

In the model, the intercept and the slopes (b0, b1, b2, …) are chosen to minimize the mean of the squared residuals, known as the Mean Squared Error.

$$ L=\frac{1}{n}\sum _{i=1}^n(y_i-\hat {y}_i)^2 $$

The goal of the model is to find the best-fit line (a hyperplane, when there is more than one predictor) that has the minimum Mean Squared Error.
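To make these formulas concrete, here is a small sketch with made-up numbers (not the car dataset used later in this post): given an assumed intercept, slopes and a few sample rows, it computes the predictions \(\hat{y}\) and the Mean Squared Error with NumPy.

import numpy as np

# made-up sample: 4 observations with 2 features (x1, x2)
X_sample = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0]])
y_sample = np.array([7.0, 6.0, 13.0, 12.0])

# assumed coefficients: intercept b0 and slopes b1, b2
b0 = 1.0
b = np.array([2.0, 2.0])

# y_hat = b0 + b1*x1 + b2*x2 for every row
y_hat = b0 + X_sample @ b

# Mean Squared Error: L = (1/n) * sum((y_i - y_hat_i)^2)
mse = np.mean((y_sample - y_hat) ** 2)
print(y_hat, mse)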

Multiple linear regression using scikit-learn Python library

Dataset used here can be downloaded from- https://www.kaggle.com/datasets/hellbuoy/car-price-prediction

The goal is to predict the car price based on the given features.

In the implementation, the code is broken into several smaller units with some explanation in between for the data pre-processing steps.

1. Importing libraries and reading CSV file

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('./CarPrice_Assignment.csv')

2. Getting info about the data. The parameter include='all' includes all the columns; otherwise only columns with numerical values are included.

df.describe(include='all')
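Alongside describe(), a couple of standard pandas checks (just a quick sketch) show the column data types, missing values and overall shape before any cleaning is done.

# column data types and non-null counts
df.info()

# count of missing values per column
print(df.isnull().sum())

# number of rows and columns
print(df.shape)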

3. Removing columns

There are 26 columns and 205 rows in the dataset. On analyzing the data, you can observe that "car_ID" is just a unique identifier and therefore inconsequential, while the "CarName" column has 147 unique values; encoding that many unique categorical values would be a problem, so we'll drop these 2 columns. Also, the "enginelocation" column has 'front' as the value for 202 out of 205 rows, making it as good as a constant, so we can drop this column too. These observations can be verified with a couple of quick checks, as shown below, before dropping anything.
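If you want to verify these observations yourself, the following sketch (run before the columns are dropped) shows the unique-value counts that motivate the decision.

# number of unique values per column, highest first
print(df.nunique().sort_values(ascending=False).head())

# enginelocation is almost constant - 'front' for 202 of the 205 rows
print(df['enginelocation'].value_counts())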

# drop car_ID (just a unique id)
# drop CarName as there are 147 unique car names; encoding it would add a lot of columns
df = df.drop(columns=['car_ID', 'CarName'])

# drop enginelocation (202 of the 205 entries have the value 'front')
df = df.drop("enginelocation", axis=1)

4. Removing outliers

We can also check for outliers in the dependent and independent variables because extreme values can disproportionately affect the regression line.

For that we can plot a distribution. Doing it for 'price':

g = sns.displot(df['price'], kde=True, bins=30)
g.set(xlim=(0, None))
Figure: Distribution plot of price

As you can see, the distribution has a positive skew. Let's say we want to drop the top 1% of the prices (treat them as outliers); that can be done using the quantile() function in Pandas to get the 99th percentile value.

# keep only the records below the 99th percentile of price
qnt = df['price'].quantile(0.99)
data_re = df[df['price'] < qnt]

After that, if you run data_re.describe(include='all'), you can see that 3 rows have been removed and the max price is now 37028. That way we have reduced some of the skewness in the price data.

In the same way you can check some of the independent variables and remove outliers if needed.
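For example, a similar check on an independent variable might look like the sketch below; 'horsepower' is used only as an illustration and this trimming is not applied in the rest of the post, so data_trimmed is just an illustrative name.

# inspect the distribution of an independent variable
sns.displot(data_re['horsepower'], kde=True, bins=30)

# optionally trim its top 1% the same way as was done for price
hp_qnt = data_re['horsepower'].quantile(0.99)
data_trimmed = data_re[data_re['horsepower'] < hp_qnt]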

5. Resetting index

If you have removed a few records based on quantile values, the default index of the DataFrame is left with gaps. You can use reset_index() to rearrange the index back to the default integer index (0, 1, 2, …).

#resetting indices
data_processed = data_re.reset_index(drop=True)

6. Checking for a linear relationship between variables

You can also plot the independent variables vs. price and examine the scatter plots; if a scatter plot looks roughly like a straight line, that suggests a likely linear relationship. In case a relationship doesn't look linear, we may have to apply a logarithmic or square root transformation to the data.

f, (p1,p2,p3) = plt.subplots(1,3, sharey=True, figsize=(15,3))
p1.scatter(data_processed['enginesize'], data_processed['price'])
p1.set_title('EngineSize and Price')
p2.scatter(data_processed['horsepower'], data_processed['price'])
p2.set_title('HorsePower and Price')
p3.scatter(data_processed['highwaympg'], data_processed['price'])
p3.set_title('HighwayMpg and Price')
Figure: Scatter plots of enginesize, horsepower and highwaympg vs price

At least for these variables the relationship looks linear.
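If one of these scatter plots had shown a clearly curved pattern instead, a transformation could be tried before modelling. A minimal sketch (not applied in this post; the transformed series are only inspected here):

# log transformation compresses large values and can straighten a curved relationship
log_price = np.log(data_processed['price'])

# square root transformation is a milder alternative
sqrt_price = np.sqrt(data_processed['price'])

# quick visual check of the transformed distribution
sns.displot(log_price, kde=True, bins=30)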

7. Feature and label selection

y = data_processed['price']   # dependent variable
X = data_processed.drop(columns=['price'])   # independent variables

8. Checking for multicollinearity

Multicollinearity in linear regression occurs when two or more independent variables are highly correlated, meaning they provide redundant information, making it difficult to quantify the individual contribution of each independent variable to the dependent variable.

For detecting multicollinearity, two of the most commonly used options are-

Correlation Matrix

A correlation matrix is a matrix displaying correlation coefficients for all the possible pairs of predictors. That helps to find relationships between independent variables. Look for high correlation coefficients (e.g., >0.7 or 0.8) between predictors.

Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) measures how much the variance of an estimated regression coefficient is inflated due to collinearity (correlation) among the predictor variables. A VIF greater than 10 is commonly taken as an indication of multicollinearity.
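For reference, the VIF of the j-th predictor is computed from the R-squared value obtained by regressing that predictor on all the other predictors:

$$ VIF_j=\frac{1}{1-R_j^2} $$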

If you want to use the correlation matrix to find highly correlated features (the threshold here is kept at 0.85), then the following code will drop them.

# select columns with numerical values
v = X.select_dtypes(include='number')
corr_matrix = v.corr().abs()   # absolute correlations
upper = corr_matrix.where(
    # keep only the upper triangular part so each pair is considered once
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
# get the columns having any correlation value > 0.85
to_drop = [column for column in upper.columns if any(upper[column] > 0.85)]
print(to_drop)
X_reduced = X.drop(columns=to_drop)
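If you want to inspect the correlations visually before settling on a threshold, a heatmap helps (a small sketch using the seaborn and matplotlib imports from step 1):

# visualize pairwise correlations; values close to 1 flag redundant predictors
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()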

If you want to use VIF (the upper limit here is kept at 10), then the following code will drop the features. Note that the statsmodels library provides the variance_inflation_factor function to compute VIF for each variable in a regression model.

from statsmodels.stats.outliers_influence import variance_inflation_factor

# select columns with numerical values
v = X.select_dtypes(include='number')
# select categorical features
categorical_features = X.select_dtypes(exclude='number').columns
# create a new dataframe holding the VIF of each numeric feature
vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(v.values, i) for i in range(v.shape[1])]
vif["Features"] = v.columns
# keep the numeric columns where VIF is less than or equal to 10
valid_numeric = vif.loc[vif["VIF"] <= 10, "Features"]
final_features = list(valid_numeric) + list(categorical_features)
X_reduced = X[final_features]

I have used the correlation matrix code in this example, which drops the following columns.

['carlength', 'curbweight', 'enginesize', 'highwaympg']

9. Splitting and encoding data

Splitting is done using train_test_split with test_size=0.2, meaning 20% of the data is used as test data while 80% is used to train the model. OneHotEncoder is used to encode the categorical data. With OneHotEncoder, the drop='first' parameter is used so that new columns are created for only n-1 of the n unique values, which helps in avoiding the dummy variable trap.

Note that both fitting and transformation (using fit_transform) are done on the training data, whereas only the transform() method is used on the test data. That's how it should be done: the encoder learns the categories from the training data only. When you split the data there is a chance that some category appears only in the test data, and such a value is unknown to the encoder when it transforms the test data. That's where the handle_unknown='ignore' parameter helps: it ensures unseen categories don't break the transform step; they are simply encoded as all zeros.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score, mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=0)

ct = ColumnTransformer([
    ('encoder', OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore'),
     X_reduced.select_dtypes(exclude='number').columns)
], remainder='passthrough')

# fit the encoder on the training data and transform it; only transform the test data
X_train_enc = ct.fit_transform(X_train)
X_test_enc = ct.transform(X_test)
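If you later want to map the model's coefficients back to column names, the fitted ColumnTransformer can report the generated feature names. Assuming scikit-learn 1.0 or later (which provides get_feature_names_out()):

# names of the encoded categorical columns followed by the passthrough numeric columns
feature_names = ct.get_feature_names_out()
print(len(feature_names))
print(feature_names[:10])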

10. Training the model

From sklearn you import the LinearRegression class (already imported above). You then create an object of this class and call its fit method to train the model; the parameters passed to fit are the training data (X_train_enc in our case) and the target values (y_train in our case).

reg = LinearRegression()
reg.fit(X_train_enc, y_train)
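The fitted model exposes the intercept b0 and the slopes b1, b2, … from the equation at the start of this post, so you can inspect them directly (the feature names come from the get_feature_names_out() sketch above):

# b0 - the intercept of the regression equation
print("Intercept:", reg.intercept_)

# b1, b2, ... - one coefficient per encoded feature
print("Number of coefficients:", len(reg.coef_))

# pair each coefficient with its feature name
for name, coef in zip(ct.get_feature_names_out(), reg.coef_):
    print(name, round(coef, 2))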

11. Once the model is trained, predictions can be made on the test data, which can then be compared with the actual target values (y_test)

y_pred = reg.predict(X_test_enc)

12. Comparing test and predicted data

df_results = pd.DataFrame({'Target':y_test, 'Predictions':y_pred})
df_results['Residual'] = df_results['Target'] - df_results['Predictions']
df_results['Difference%'] = np.abs((df_results['Residual'] * 100)/df_results['Target'])
print(df_results.head())
	Target   Predictions     Residual  Difference%
18    6295.0   5691.859375   603.140625     9.581265
171  10698.0  12380.265625 -1682.265625    15.725048
107  13860.0  18371.781250 -4511.781250    32.552534
98   13499.0  15935.093750 -2436.093750    18.046476
178  15750.0  22500.046875 -6750.046875    42.857440
print(df_results.describe())
	 Target   	Predictions     Residual  Difference%
count     41.000000     41.000000    41.000000    41.000000
mean   13564.524390  13802.394436  -237.870046    15.610925
std     7463.439157   6884.600274  2839.126221    11.662503
min     6189.000000   5691.859375 -7951.265625     0.230753
25%     8495.000000   9159.765625 -1665.515625     7.594619
50%    11694.000000  12293.171875  -448.171875    13.121822
75%    15750.000000  16523.468750  1010.000000    21.949032
max    37028.000000  33051.968750  5494.531250    48.020689

As you can see, the minimum difference percentage is 0.23 while the maximum is 48.02.
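A quick look at the residual distribution (a small sketch built on the df_results frame above) also shows whether the errors are roughly centred around zero:

# residuals should be roughly centred around zero, without a heavy tail on one side
sns.displot(df_results['Residual'], kde=True, bins=20)
plt.show()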

13. Seeing the model metrics such as R-squared, mean squared error and root mean squared error.

print("R2 score", r2_score(y_test, y_pred)) 
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error", mse)
print("Root Mean Squared Error", np.sqrt(mse))

That's all for this topic Multiple Linear Regression With Example. If you have any doubt or any suggestions to make please drop a comment. Thanks!

>>>Return to Python Tutorial Page


Related Topics

  1. Python Installation on Windows
  2. Encapsulation in Python
  3. Method Overriding in Python
  4. R-squared - Coefficient of Determination
  5. Mean, Median and Mode With Python Examples

You may also like-

  1. Passing Object of The Class as Parameter in Python
  2. Local, Nonlocal And Global Variables in Python
  3. Python count() method - Counting Substrings
  4. Python Functions : Returning Multiple Values
  5. Marker Interface in Java
  6. Functional Interfaces in Java
  7. Difference Between Checked And Unchecked Exceptions in Java
  8. Race Condition in Java Multi-Threading
