Monday, February 9, 2026

R-squared - Coefficient of Determination

This post explains one of the metrics used to evaluate regression models: \(R^2\), known as R-squared or the coefficient of determination.

What is R-squared

R-squared is a statistical measure of goodness of fit for regression models. It measures how well the independent variables explain the variability in the dependent variable.

The value of R-squared lies between 0 and 1:

  1. A value of 0 means the model doesn't explain the variability at all.
  2. A value of 1 means the model explains all of the variability (a perfect fit).

Though on real-world data, a value of 1 almost surely suggests overfitting.
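To make the two extremes concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available (neither is required by anything else in this post)-

import numpy as np
from sklearn.metrics import r2_score

y = np.array([10.0, 20.0, 30.0, 40.0])

# Predicting the mean for every observation explains none of the
# variability, so R-squared comes out as 0.
mean_predictions = np.full_like(y, y.mean())
print(r2_score(y, mean_predictions))   # 0.0

# Predicting every value exactly explains all of the variability,
# so R-squared comes out as 1.
print(r2_score(y, y))                  # 1.0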

Then the question is: what is a good \(R^2\) value?

Well, that depends a lot on context. In fields like physics or engineering, where you are fitting a mathematical model or regression equation to experimental or simulation data, values above 0.9 are often expected, while in the social sciences or economics values around 0.3-0.5 can still be considered meaningful. There is no universal cutoff.

For example, if \(R^2 = 0.85\), then 85% of the variability in y is explained by the model, and 15% remains unexplained.

Another question is: what does variability of the data mean? If we take the simple linear regression equation, which is-

\[ \hat{y} = b_{0} + b_{1}x_{1} \]

Then \(R^2\) is the metric that tells us how well the whole simple regression model (the combination of x values and coefficients) explains the variability in y.

Imagine predicting salaries where-

  • x = experience in years
  • b1 = salary increase with each year

Then \(R^2 = 0.80\) means: using "experience in years" as the input, the regression line explains 80% of the variation in salaries, whereas 20% remains unexplained (due to factors not captured by the model).
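As a quick sketch of this idea (the numbers below are made up purely for illustration; they are not the salary dataset used later in this post)-

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary
experience = np.array([[1], [2], [3], [4], [5], [6]])
salary = np.array([35000, 42000, 50000, 61000, 67000, 78000])

model = LinearRegression().fit(experience, salary)

# score() returns the R-squared of the fitted line on the given data
print(model.score(experience, salary))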

Equation for R-squared

The formula for calculating R-squared is

$$ R^2 = 1 - \frac{RSS}{TSS} $$

Where RSS is the residual sum of squares, also called the sum of squared errors (SSE). A residual is the difference between an actual and a predicted value: if the actual value is \(y_i\) and the predicted value is \(\hat{y}_i\), then the residual is \(y_i - \hat{y}_i\). Summing the squared residuals gives-

$$ SSE=\sum_{i=1}^{n} (y_i-\hat {y}_i)^2 $$

RSS measures the unexplained variability in the data.

TSS is the total sum of squares, which refers to how spread out the values of the dependent variable (y) are around their mean.

$$ TSS=\sum_{i=1}^{n} (y_{i}-\bar{y})^2 $$

TSS measures the total variability in the data.
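Putting the two definitions together, R-squared can be computed in a few lines of NumPy; the function and array names below are just placeholders for this sketch-

import numpy as np

def r_squared(y_actual, y_predicted):
    # RSS: unexplained variability (sum of squared residuals)
    rss = np.sum((y_actual - y_predicted) ** 2)
    # TSS: total variability (squared deviations from the mean)
    tss = np.sum((y_actual - y_actual.mean()) ** 2)
    return 1 - rss / tss

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.8, 5.3, 6.9, 9.2])
print(r_squared(y_actual, y_predicted))   # ≈ 0.991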

[Image: R-squared - scatter plot showing the fitted regression line and a horizontal line at the mean of y]

In the above image, a horizontal line is drawn at the mean of y. For a fitted regression line, the residuals are, in aggregate, never larger than the deviations from the mean; that is, RSS ≤ TSS. That is why R-squared has a value between 0 and 1.
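One way to see this more precisely (for a least-squares fit that includes an intercept term, which is what this post assumes): the fitted line can always fall back to the flat line \(\hat{y} = \bar{y}\) by setting \(b_1 = 0\) and \(b_0 = \bar{y}\), so the minimized RSS can never exceed TSS-

$$ RSS = \min_{b_0, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \le \sum_{i=1}^{n} (y_i - \bar{y})^2 = TSS $$

and therefore \(0 \le R^2 \le 1\).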

R-squared example

If we take the same salary dataset used in the simple linear regression example, then the regression equation comes out to-

\[ \hat{y} = 24848 + 9450x \]

and the mean of the y values is \(\bar{y} = 76004\). Computing the sums of squares gives-

$$ TSS = 21794977852 $$

$$ RSS = 938128552 $$

Plugging these into the formula-

$$ R^2 = 1 - \frac{938128552}{21794977852} = 0.9570 $$
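To reproduce these numbers, a sketch along the following lines should work; since the salary dataset itself lives in the earlier simple linear regression post and is not shown here, the arrays x and y below are placeholders for that data-

import numpy as np

# x, y = ...  # years of experience and salaries from the earlier post

def salary_r_squared(x, y, b0=24848, b1=9450):
    y_hat = b0 + b1 * x                 # predictions from the fitted line
    rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return rss, tss, 1 - rss / tss

# rss, tss, r2 = salary_r_squared(x, y)
# print(rss, tss, r2)   # should match 938128552, 21794977852 and 0.9570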

That's all for this topic R-squared - Coefficient of Determination. If you have any doubt or any suggestions to make please drop a comment. Thanks!

>>>Return to Python Tutorial Page


Related Topics

  1. Python Installation on Windows
  2. Encapsulation in Python
  3. Method Overriding in Python
  4. Multiple Inheritance in Python
  5. Mean, Median and Mode With Python Examples

You may also like-

  1. Passing Object of The Class as Parameter in Python
  2. Local, Nonlocal And Global Variables in Python
  3. Python count() method - Counting Substrings
  4. Python Functions : Returning Multiple Values
  5. Marker Interface in Java
  6. Functional Interfaces in Java
  7. Difference Between Checked And Unchecked Exceptions in Java
  8. Race Condition in Java Multi-Threading
