This post explains one of the metrics used to evaluate regression models: R2, known as R-squared or the coefficient of determination.
What is R-squared
R-squared is a statistical measure of the goodness of fit of a regression model. It measures how well the independent variables explain the variability in the dependent variable.
The value of R-squared lies between 0 and 1:
- A value of 0 means the model doesn't explain the variability at all.
- A value of 1 means the model explains all of the variability (a perfect fit).
That said, a value of exactly 1 would almost surely suggest overfitting.
Then the question is: what is a good R2 value?
Well, that depends a lot on context. In fields like physics or engineering, if you are creating a mathematical model or regression equation that fits experimental or simulation data, values above 0.9 are often expected, while in social sciences or economics, values around 0.3-0.5 can still be considered meaningful. There's no universal cutoff.
For example, if R2 = 0.85, then 85% of the variability in y is explained by the model, and 15% remains unexplained.
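If you already have a model's predictions, scikit-learn can compute R-squared directly with r2_score. A minimal sketch, with made-up numbers purely for illustration:

```python
from sklearn.metrics import r2_score

# Made-up observed and predicted values, purely for illustration
y_actual = [88, 95, 70, 65, 80]
y_predicted = [90, 93, 72, 61, 84]

print(f"R-squared: {r2_score(y_actual, y_predicted):.4f}")
```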
Another question is: what does the variability of the data mean? Take the simple regression model equation
$$ \hat{y} = b_{0} + b_{1}X_{1} $$
Then R2 is the metric that tells us how well the whole simple regression model (the combination of the x values and the coefficients) explains the variability in y.
Imagine predicting salaries where:
- x = experience in years
- b1 = salary increase with each year
Then R2 = 0.80 means: using "experience in years" as the input, the regression line explains 80% of why salaries differ, while 20% remains unexplained. A minimal fitted example is sketched below.
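To make this concrete, here is a small sketch that fits a simple linear regression on hypothetical experience/salary numbers (invented for illustration, not the post's actual dataset) and reads off R2 via scikit-learn's score method:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical experience (years) vs. salary data, for illustration only
X = np.array([[1], [2], [3], [4], [5], [6]])   # experience in years
y = np.array([39000, 46000, 52000, 62000, 67000, 78000])

model = LinearRegression().fit(X, y)
print(f"b0 (intercept): {model.intercept_:.0f}")
print(f"b1 (slope): {model.coef_[0]:.0f}")
print(f"R-squared: {model.score(X, y):.4f}")   # score() returns R2 for regressors
```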
Equation for R-squared
The formula for calculating R-squared is
$$ R^2 = 1 - \frac{RSS}{TSS} $$
where RSS is the residual sum of squares, also called the sum of squared errors (SSE). A residual is the difference between an actual and a predicted value: if the actual value is \(y_i\) and the predicted value is \(\hat{y}_i\), then the residual = \(y_{i} - \hat{y}_i\).
$$ SSE=\sum_{i=1}^{n} (y_i-\hat {y}_i)^2 $$
RSS measures the unexplained variability in the data.
TSS is the total sum of squares, which refers to how spread out the values of the dependent variable (y) are around their mean.
$$ TSS=\sum_{i=1}^{n} (y_{i}-\bar{y})^2 $$
TSS measures the total variability in the data.
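These two formulas translate directly into NumPy. The observed values and predictions below are made up purely to show the computation:

```python
import numpy as np

# Made-up observed values and model predictions, for illustration only
y = np.array([10.0, 12.0, 14.5, 16.0, 19.0])
y_hat = np.array([10.5, 11.8, 14.0, 16.5, 18.7])

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares (SSE)
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - rss / tss
print(f"RSS={rss:.2f}, TSS={tss:.2f}, R2={r2:.4f}")
```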
If we draw a horizontal line at the mean of y, the spread of the observed values around that mean (TSS) is always at least as large as the spread of the observed values around the fitted regression line (RSS), for a least-squares fit with an intercept. That is why R-squared takes a value between 0 and 1.
R-squared example
If we take the same salary dataset used in the simple linear regression example, the regression equation comes out to
$$ \hat{y} = 24848 + 9450x $$
and the mean of the y values is 76004.
TSS = 21794977852
RSS = 938128552
After computing the sums of squares, R2 = 1 - (938128552/21794977852) = 0.9570.
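The same calculation in Python, plugging in the sums of squares above:

```python
# Using the sums of squares from the salary example above
tss = 21794977852
rss = 938128552
r2 = 1 - rss / tss
print(f"R2 = {r2:.4f}")   # prints R2 = 0.9570
```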
That's all for this topic R-squared - Coefficient of Determination. If you have any doubts or suggestions, please drop a comment. Thanks!