Regression analysis is one of the most widely used techniques for prediction. Within regression analysis, linear regression is often considered the starting point of machine learning.
Linear regression is a statistical technique used to find the relationship between variables: a dependent variable (target) and one or more independent variables (predictors). In machine learning terms, you want to model the relationship between features and a label. Linear regression assumes this relationship is linear, meaning it can be represented by a straight line.
Simple Linear Regression
Simple linear regression is a linear regression that models the relationship between exactly two variables.
- One variable, generally denoted as x, is the independent variable or predictor.
- The other variable, generally denoted by y, is the dependent variable or target.
For example, if we want to find the relationship between years of experience and salary, then years of experience is the independent variable and salary is the dependent variable, with the expectation that salary increases as experience increases.
If you have years of experience and salary data, you can use it to fit a simple linear regression model. Once that "learning" is done, you can predict the salary by passing in the years of experience.
Simple Linear Regression equation
In the context of machine learning, where we have sample data and use it to fit a regression model, the simple linear regression equation is given below.
$$ \hat{y} = b_{0} + b_{1}x $$
Here \(\hat{y}\) is the predicted label (output).
\(b_0\) is the intercept, which tells you where the regression line intercepts the Y-axis. In other words, it is the predicted value when the independent variable (x) is 0.
\(b_1\) is the slope. It tells how much the dependent variable changes for a one-unit change in the independent variable.
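As a quick numeric illustration of the equation, here is a minimal sketch with made-up values for \(b_0\) and \(b_1\) (not fitted from any data):

```python
# Hypothetical coefficients for illustration only, not fitted from real data
b0 = 25000.0   # intercept: predicted salary at 0 years of experience
b1 = 9500.0    # slope: predicted salary increase per additional year

def predict_salary(years):
    """Evaluate y_hat = b0 + b1 * x for a given x."""
    return b0 + b1 * years

print(predict_salary(5))  # 25000 + 9500*5 = 72500.0
```

Fitting the model means finding the values of b0 and b1 that best match the data, which is what the rest of this post covers.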
Ordinary Least Squares (OLS) estimation
The regression line is often called the best-fit line. But how do we know that a given line is the best-fit line? Many straight lines can be drawn through the data points. One way to find the best-fit line is ordinary least squares estimation.
Ordinary least squares works by minimizing the sum of the squared differences between the observed values (the actual data points) and the values predicted by the model (lying on the regression line).
If the actual value is \(y_{i}\) and the predicted value is \(\hat{y}_{i}\), then the residual = \(y_{i} - \hat{y}_{i}\).
Squaring these differences ensures that positive and negative residuals are treated equally. So, the best-fit line is the line for which the residual sum of squares (RSS) is minimum.
$$ RSS = \sum_{i=1}^{n} (y_{i}-\hat{y}_{i})^2 $$
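To make the idea concrete, here is a minimal sketch (with toy data and an arbitrary candidate line, not a fitted one) that computes the RSS for a given line:

```python
import numpy as np

# Toy data: actual y values, and predictions from a candidate line y_hat = 2 + 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.2, 7.9, 11.1, 13.8])
y_hat = 2.0 + 3.0 * x          # predictions lying on the candidate line

residuals = y - y_hat          # observed minus predicted
rss = np.sum(residuals ** 2)   # residual sum of squares
print(round(rss, 2))           # 0.1
```

OLS picks, among all possible candidate lines, the intercept and slope that make this RSS as small as possible.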
Formula for slope and intercept
The formula for calculating the slope is:
\(b_{1}=r\frac{s_{y}}{s_{x}}\)
Where: \(r\) = Pearson's correlation coefficient between \(x\) and \(y\).
\(s_{y}\) = Standard deviation of the \(y\) variable.
\(s_{x}\) = Standard deviation of the \(x\) variable.
The formula for the intercept is:
\(b_0=\bar{y} - b_1\bar{x}\) (mean of \(y\) minus slope \(\times\) mean of \(x\))
After substituting the value of \(b_1\):
\( b_0=\bar{y} - r\frac{s_{y}}{s_{x}} \times \bar{x}\)
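These formulas can be sanity-checked against NumPy's own least-squares fit. A small sketch with toy data (hypothetical values, not the salary dataset); `np.polyfit` with degree 1 performs an ordinary least squares line fit:

```python
import numpy as np

# Toy data (hypothetical, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Slope and intercept via the correlation / standard-deviation formulas
r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # b1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # b0 = mean(y) - b1 * mean(x)

# Cross-check against NumPy's least-squares fit of a degree-1 polynomial
np_b1, np_b0 = np.polyfit(x, y, 1)
print(b1, np_b1)  # the two slopes should agree
print(b0, np_b0)  # the two intercepts should agree
```

Both routes give the same line, since \(r\frac{s_y}{s_x}\) is algebraically equal to the OLS slope \(\frac{\mathrm{cov}(x,y)}{\mathrm{var}(x)}\).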
Simple linear regression by manually calculating slope and intercept
Though the scikit-learn library implements ordinary least squares (OLS) linear regression, and that is the usual way to build a simple linear regression model in Python, let's first do it manually using the formulas above. This code still uses other Python libraries such as Pandas, NumPy, and Matplotlib.
Salary dataset used here can be downloaded from this URL- https://www.kaggle.com/datasets/abhishek14398/salary-dataset-simple-linear-regression
1. Importing libraries and reading CSV file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading dataset from current directory
df = pd.read_csv("./Salary_dataset.csv")
print(df.head())
Printing with the head() function displays the first five rows.
   Unnamed: 0  YearsExperience   Salary
0           0              1.2  39344.0
1           1              1.4  46206.0
2           2              1.6  37732.0
3           3              2.1  43526.0
4           4              2.3  39892.0
As you can see, there is also a serial number column named "Unnamed: 0". This column is not needed, so let's drop it.
#Removing serial no column
df = df.drop("Unnamed: 0", axis=1)
2. Calculating the values for the equation
# Calculate mean and standard deviation
mean_year = df['YearsExperience'].mean()
mean_salary = df['Salary'].mean()
std_year = df['YearsExperience'].std()
std_salary = df['Salary'].std()

# correlation coefficient between years of experience and salary
corr = df['YearsExperience'].corr(df['Salary'])
print(corr)  # 0.9782416184887599

# calculate slope
slope = corr * std_salary / std_year
print(slope)

# calculate intercept
intercept = mean_salary - (slope * mean_year)
print(intercept)
3. Predicting values
# get predicted salaries
y_pred = intercept + slope * df['YearsExperience']

# stack actual salaries and predicted salaries side by side
combined_array = np.column_stack((df['Salary'].round(2), y_pred.round(2)))
print(combined_array)

# predict salary for a given number of years of experience
sal_pred = intercept + slope * 11
print(sal_pred)  # 128797.78950252903
4. Plotting the regression line
#Plot regression line
# Scatter plot for actual values
plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Actual')
# Plot the regression line
plt.plot(df['YearsExperience'], y_pred, color='red', label='Regression Line')
plt.xlabel('Years of experience')
plt.ylabel('Salary')
plt.title('Years of experience Vs Salary')
plt.legend()
plt.show()
Simple linear regression using scikit-learn Python library
The above example shows how to calculate the slope and intercept manually for linear regression, but scikit-learn provides built-in support for creating a linear regression model. Let's go through the steps.
1. Importing libraries and data pre-processing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read CSV file
df = pd.read_csv("./Salary_dataset.csv")
# remove serial no. column
df = df.drop("Unnamed: 0", axis=1)
2. As a second step, we do feature selection and split the data into two sets: training data and test data. Scikit-learn has built-in support for splitting.
# Feature and label selection
X = df['YearsExperience']
y = df['Salary']
By convention in ML code, capital X is used for the input data because it represents a matrix of features, while lowercase y is used for the target because it is typically a vector. Splitting is done using train_test_split, where test_size is passed as 0.2, meaning 20% of the data is used as test data while 80% of the data is used to train the model.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# splitting data into training data (80%) and test data (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
3. Training the model
From sklearn you import the LinearRegression class, which is an implementation of ordinary least squares linear regression. You then create an object of this class and call the fit method to train the model. The parameters passed to the fit method are the training data (X_train in our case) and the target values (y_train in our case).
from sklearn.linear_model import LinearRegression
#scikit-learn models require a 2D array input for features (X), even for a single feature
#so reshape(-1, 1) is used to convert to 2D array
X_train_reshaped = X_train.values.reshape(-1, 1)
reg = LinearRegression()
# Train the model on the training data
reg.fit(X_train_reshaped, y_train)
# Print intercept and coefficient
print('Intercept (b0) is', reg.intercept_)
print('Weight (b1) is', reg.coef_[0])
This gives the following output for the intercept and coefficient.
Intercept (b0) is 24380.20147947369
Weight (b1) is 9423.81532303098
4. Once the model is trained, predictions can be made on the test data, which can then be compared with the actual values (y_test).
# predict values for the test data
y_pred = reg.predict(X_test.values.reshape(-1, 1))
combined_data = pd.DataFrame({'Actual Salaries':y_test, 'Predicted Salaries':y_pred})
print(combined_data)
combined_data gives both actual values and predicted values side by side.
Actual Salaries Predicted Salaries
27 112636.0 115791.210113
15 67939.0 71499.278095
23 113813.0 102597.868661
17 83089.0 75268.804224
8 64446.0 55478.792045
9 57190.0 60190.699707
5. You can also predict the salary for a given number of years of experience.
# predict salary for given years of experience
sal_pred = reg.predict([[11]])
print(sal_pred)  # [128042.17003281]
6. Checking the model metrics such as R squared, mean squared error, and root mean squared error.
print("R2 score", r2_score(y_test,y_pred))
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error", mse)
print("Root Mean Squared Error", np.sqrt(mse))
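As a cross-check on what R squared measures, it can also be computed by hand as \(1 - RSS/TSS\), where TSS is the total sum of squares around the mean of the actual values. A small sketch with hypothetical actual and predicted values (not the salary data) showing it agrees with sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.2])

rss = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - rss / tss

print(r2_manual)                # manual R squared
print(r2_score(y_true, y_hat))  # sklearn gives the same value
```

An R squared close to 1 means the regression line explains most of the variation in the target.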
7. Plotting the regression line
# Scatter plot for actual values
plt.scatter(X_test, y_test, color='blue', label='Actual')
# Plot the regression line
plt.plot(X_test, y_pred, color='red', label='Regression Line')
plt.xlabel('Years of experience')
plt.ylabel('Salary')
plt.title('Years of experience Vs Salary')
plt.legend()
plt.show()
That's all for this topic Simple Linear Regression With Example. If you have any doubts or any suggestions to make, please drop a comment. Thanks!