Regression analysis is one of the most widely used techniques for prediction. Within regression analysis, linear regression is often considered the starting point of machine learning.
Linear regression is a statistical technique used to find the relationship between variables: a dependent variable (target) and one or more independent variables (predictors). In machine learning terms, you want to model the relationship between features and a label. Linear regression assumes this relationship is linear, meaning it can be represented by a straight line.
Simple Linear Regression
Simple linear regression is a linear regression that models the relationship between exactly two variables.
- One variable, generally denoted as x, is the independent variable or predictor.
- The other variable, generally denoted by y, is the dependent variable or target.
For example, if we want to find the relationship between years of experience and salary, then years of experience is the independent variable and salary is the dependent variable, with the expectation that salary increases as experience increases.
If you have years of experience and salary data, you can use it to fit a simple linear regression model. Once that "learning" is done, you can predict the salary by passing in the years of experience.
Simple Linear Regression equation
In the context of machine learning, where we have sample data and use it to fit a regression model, the simple linear regression equation is given below.
$$ \hat{y} = b_{0} + b_{1}x $$
Here \(\hat{y}\) is the predicted label (output).
\(b_0\) is the intercept, which tells you where the regression line intercepts the Y-axis. In other words, it is the predicted value when the independent variable (x) is 0.
\(b_1\) is the slope. It tells how much the dependent variable changes for a one-unit change in the independent variable.
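As a quick numeric illustration of the equation, here is a minimal sketch with made-up values for \(b_0\) and \(b_1\) (not fitted from any data):

```python
# Hypothetical coefficients for illustration only, not fitted from real data
b0 = 25000.0   # intercept: predicted salary at 0 years of experience
b1 = 9500.0    # slope: predicted salary increase per additional year

def predict_salary(years):
    """Evaluate y_hat = b0 + b1 * x for a given x."""
    return b0 + b1 * years

print(predict_salary(5))  # 25000 + 9500*5 = 72500.0
```

Fitting the model means finding the values of b0 and b1 that best match the data, which is what the rest of this post covers.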
Ordinary Least Squares (OLS) estimation
The regression line is often called the best-fit line. But how do we know that a given line is the best-fit line? Many straight lines can be drawn through the data points. One way to find the best-fit line is ordinary least squares estimation.
Ordinary least squares works by minimizing the sum of the squared differences between the observed values (the actual data points) and the values predicted by the model (lying on the regression line).
If the actual value is \(y_{i}\) and the predicted value is \(\hat{y}_{i}\), then the residual = \(y_{i} - \hat{y}_{i}\).
Squaring these differences ensures that positive and negative residuals are treated equally. So, the best-fit line is the line for which the residual sum of squares (RSS) is minimum.
$$ RSS = \sum_{i=1}^{n} (y_{i}-\hat{y}_{i})^2 $$
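To make the idea concrete, here is a minimal sketch (with toy data and an arbitrary candidate line, not a fitted one) that computes the RSS for a given line:

```python
import numpy as np

# Toy data: actual y values, and predictions from a candidate line y_hat = 2 + 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.2, 7.9, 11.1, 13.8])
y_hat = 2.0 + 3.0 * x          # predictions lying on the candidate line

residuals = y - y_hat          # observed minus predicted
rss = np.sum(residuals ** 2)   # residual sum of squares
print(round(rss, 2))           # 0.1
```

OLS picks, among all possible candidate lines, the intercept and slope that make this RSS as small as possible.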
Formula for slope and intercept
The formula for calculating the slope is:
\(b_{1}=r\frac{s_{y}}{s_{x}}\)
Where: \(r\) = Pearson's correlation coefficient between \(x\) and \(y\).
\(s_{y}\) = Standard deviation of the \(y\) variable.
\(s_{x}\) = Standard deviation of the \(x\) variable.
The formula for the intercept is:
\(b_0=\bar{y} - b_1\bar{x}\) (mean of \(y\) minus slope \(\times\) mean of \(x\))
After substituting the value of \(b_1\):
\( b_0=\bar{y} - r\frac{s_{y}}{s_{x}} \times \bar{x}\)
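These formulas can be sanity-checked against NumPy's own least-squares fit. A small sketch with toy data (hypothetical values, not the salary dataset); `np.polyfit` with degree 1 performs an ordinary least squares line fit:

```python
import numpy as np

# Toy data (hypothetical, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Slope and intercept via the correlation / standard-deviation formulas
r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # b1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # b0 = mean(y) - b1 * mean(x)

# Cross-check against NumPy's least-squares fit of a degree-1 polynomial
np_b1, np_b0 = np.polyfit(x, y, 1)
print(b1, np_b1)  # the two slopes should agree
print(b0, np_b0)  # the two intercepts should agree
```

Both routes give the same line, since \(r\frac{s_y}{s_x}\) is algebraically equal to the OLS slope \(\frac{\mathrm{cov}(x,y)}{\mathrm{var}(x)}\).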
Simple linear regression by manually calculating slope and intercept
Though the scikit-learn library implements ordinary least squares (OLS) linear regression, and that is the usual way to build a simple linear regression model in Python, let's first do it manually using the formulas above. This code still uses other Python libraries such as Pandas, NumPy, and Matplotlib.
Salary dataset used here can be downloaded from this URL- https://www.kaggle.com/datasets/abhishek14398/salary-dataset-simple-linear-regression
1. Importing libraries and reading CSV file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading dataset from current directory
df = pd.read_csv("./Salary_dataset.csv")
print(df.head())
Printing with the head() function displays the first five rows.
   Unnamed: 0  YearsExperience   Salary
0           0              1.2  39344.0
1           1              1.4  46206.0
2           2              1.6  37732.0
3           3              2.1  43526.0
4           4              2.3  39892.0
As you can see, there is also a serial number column named "Unnamed: 0". This column is not needed, so let's drop it.
#Removing serial no column
df = df.drop("Unnamed: 0", axis=1)
2. Calculating the values for the equation
# Calculate mean and standard deviation
mean_year = df['YearsExperience'].mean()
mean_salary = df['Salary'].mean()
std_year = df['YearsExperience'].std()
std_salary = df['Salary'].std()

# correlation coefficient between years of experience and salary
corr = df['YearsExperience'].corr(df['Salary'])
print(corr)  # 0.9782416184887599

# calculate slope
slope = corr * std_salary / std_year
print(slope)

# calculate intercept
intercept = mean_salary - (slope * mean_year)
print(intercept)
3. Predicting values
# get predicted salaries
y_pred = intercept + slope * df['YearsExperience']

# stack actual salaries and predicted salaries side by side
combined_array = np.column_stack((df['Salary'].round(2), y_pred.round(2)))
print(combined_array)

# predict salary for a given number of years of experience
sal_pred = intercept + slope * 11
print(sal_pred)  # 128797.78950252903
4. Plotting the regression line
#Plot regression line
# Scatter plot for actual values
plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Actual')
# Plot the regression line
plt.plot(df['YearsExperience'], y_pred, color='red', label='Regression Line')
plt.xlabel('Years of experience')
plt.ylabel('Salary')
plt.title('Years of experience Vs Salary')
plt.legend()
plt.show()
Simple linear regression using scikit-learn Python library
The above example shows how to calculate the slope and intercept manually for linear regression, but scikit-learn provides built-in support for creating a linear regression model. Let's go through the steps.
1. Importing libraries and data pre-processing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read CSV file
df = pd.read_csv("./Salary_dataset.csv")
# remove serial no. column
df = df.drop("Unnamed: 0", axis=1)
2. As a second step, we do feature selection and split the data into two sets: training data and test data. Scikit-learn has built-in support for splitting.
# Feature and label selection
X = df['YearsExperience']
y = df['Salary']
By convention in ML code, capital X is used for the input data because it represents a matrix of features, while lowercase y is used for the target because it is typically a vector. Splitting is done using train_test_split, where test_size is passed as 0.2, meaning 20% of the data is used as test data while 80% of the data is used to train the model.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# splitting data into training data (80%) and test data (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
3. Training the model
From sklearn you import the LinearRegression class, which is an implementation of ordinary least squares linear regression. You then create an object of this class and call the fit method to train the model. The parameters passed to the fit method are the training data (X_train in our case) and the target values (y_train in our case).
from sklearn.linear_model import LinearRegression
#scikit-learn models require a 2D array input for features (X), even for a single feature
#so reshape(-1, 1) is used to convert to 2D array
X_train_reshaped = X_train.values.reshape(-1, 1)
reg = LinearRegression()
# Train the model on the training data
reg.fit(X_train_reshaped, y_train)
# Print intercept and coefficient
print('Intercept (b0) is', reg.intercept_)
print('Weight (b1) is', reg.coef_[0])
This gives the following output for the intercept and coefficient.
Intercept (b0) is 24380.20147947369
Weight (b1) is 9423.81532303098
4. Once the model is trained, predictions can be made on the test data, which can then be compared with the actual values (y_test).
# predict values for the test data
y_pred = reg.predict(X_test.values.reshape(-1, 1))
combined_data = pd.DataFrame({'Actual Salaries':y_test, 'Predicted Salaries':y_pred})
print(combined_data)
combined_data gives both actual values and predicted values side by side.
Actual Salaries Predicted Salaries
27 112636.0 115791.210113
15 67939.0 71499.278095
23 113813.0 102597.868661
17 83089.0 75268.804224
8 64446.0 55478.792045
9 57190.0 60190.699707
5. You can also predict the salary for a given number of years of experience.
# predict salary for given years of experience
sal_pred = reg.predict([[11]])
print(sal_pred)  # [128042.17003281]
6. Checking the model metrics such as R squared, mean squared error, and root mean squared error.
print("R2 score", r2_score(y_test,y_pred))
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error", mse)
print("Root Mean Squared Error", np.sqrt(mse))
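As a cross-check on what R squared measures, it can also be computed by hand as \(1 - RSS/TSS\), where TSS is the total sum of squares around the mean of the actual values. A small sketch with hypothetical actual and predicted values (not the salary data) showing it agrees with sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.2])

rss = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - rss / tss

print(r2_manual)                # manual R squared
print(r2_score(y_true, y_hat))  # sklearn gives the same value
```

An R squared close to 1 means the regression line explains most of the variation in the target.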
7. Plotting the regression line
# Scatter plot for actual values
plt.scatter(X_test, y_test, color='blue', label='Actual')
# Plot the regression line
plt.plot(X_test, y_pred, color='red', label='Regression Line')
plt.xlabel('Years of experience')
plt.ylabel('Salary')
plt.title('Years of experience Vs Salary')
plt.legend()
plt.show()
That's all for this topic Simple Linear Regression With Example. If you have any doubts or any suggestions to make, please drop a comment. Thanks!