Measuring Explanatory Power with the R-squared

If you are looking for a widely used measure that describes how powerful a regression is, the R-squared will be your cup of tea. A prerequisite to understanding the math behind the R-squared is the decomposition of the total variability of the observed data into explained and unexplained parts.

A key highlight from that decomposition is that the smaller the regression error, the better the regression.

Now, it’s time to introduce you to the R-squared. The R-squared is an intuitive and practical tool, when in the right hands. It is equal to the variability explained by the regression divided by the total variability of the data.
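In formula terms, using the common shorthand SSR for the sum of squares due to regression, SSE for the error sum of squares, and SST for the total sum of squares (standard textbook abbreviations, not notation taken from this tutorial):

R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}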

What Exactly Is the R-squared?

It is a relative measure and takes values ranging from 0 to 1. An R-squared of zero means our regression line explains none of the variability of the data.

An R-squared of 1 would mean our model explains the entire variability of the data.

Unfortunately, regressions explaining the entire variability are rare. What we usually observe are values ranging from 0.2 to 0.9.

What's the Best Value for an R-squared?

The immediate question you may be asking is: “What is a good R-squared? When do I know, for sure, that my regression is good enough?”

Unfortunately, there is no definite answer to that.

In fields such as physics and chemistry, scientists are usually looking for regressions with R-squared between 0.7 and 0.99. However, in social sciences, such as economics, finance, and psychology the situation is different. There, an R-squared of 0.2, or 20% of the variability explained by the model, would be fantastic.

It depends on the complexity of the topic and how many variables are believed to be in play.

Dealing with Multiple Variables

Take your income, for example. It may depend on your household income (including your parents and spouse), your education, years of experience, the country you live in, and the languages you speak. However, all of this may still account for less than 50% of the variability of income.

Your salary is a very complex issue. But you probably know that.

SAT-GPA Example

Let’s check out an SAT-GPA example. We used it in a previous tutorial, so if you want to keep track of what we are talking about, make sure you check it out. If you don’t want to, here are the key numbers from the regression summary:

The SAT score is considered one of the better determinants of intellectual capacity and capability. The truth, however, is that our regression had an R-squared of only 0.406.

In other words, SAT scores explain 41% of the variability of the college grades for our sample.

Other Factors

An R-squared of 41% is neither good nor bad. But since it is far away from 90%, we may conclude we are missing some important information. Other determinants must be considered. Variables such as gender, income, and marital status could help us understand the full picture a little better.

Now, you probably feel ready to move on. However, you should remember one thing.

Don’t jump into regressing so easily. Critical thinking is crucial. Before agreeing that a factor is significant, you should try to understand why. So, let’s quickly justify that claim.

Gender

First, women are more likely to outperform men in high school.

But then in higher education, more men enter academia.

There are many biases in place here. Without taking a stance on whether female or male candidates are better, scientific research shows that a gender gap exists in education. Gender is, therefore, an important input for any regression on the topic.

Income

The second factor we pointed out is income. If your household income is low, you are more likely to get a part-time job.

Thus, you’ll have less time for studying and probably get lower grades.

If you’ve ever been to college, you will surely remember a friend who underperformed for this reason.

Children

Third, if you get married and have a child, your attendance will almost certainly drop.

Contrary to what most students think while in college, attendance is a significant factor for your GPA. You may think your time is better spent skipping a lecture, but your GPA begs to differ.

When to Include More Factors?

After these clarifications, let’s find the bottom line. The R-squared measures the goodness of fit of our model. The more factors we include in our regression, the higher the R-squared.

So, should we include gender and income in our regression? If this is in line with our research, and their inclusion results in a better model, we should do that.

The Adjusted R-squared

The R-squared seems quite useful, doesn’t it? However, it is not perfect. To be more precise, we’ll have to refine it. Its new version will be called the adjusted R-squared.

What it Adjusts for

Let’s consider the following two statements:

  1. The R-squared measures how much of the total variability is explained by our model.
  2. Multiple regressions always seem better than simple ones, because with each additional variable you add, the explanatory power can only increase or stay the same.

Well, the adjusted R-squared considers exactly that. It measures how much of the total variability our model explains, considering the number of variables.

The adjusted R-squared is always smaller than the R-squared, as it penalizes excessive use of variables.
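In symbols, a standard textbook formula, where n is the number of observations and p the number of explanatory variables (again, a common convention rather than this tutorial’s own notation):

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

The factor (n − 1)/(n − p − 1) grows as p grows, which is exactly how the excessive use of variables gets punished.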

Multiple Regressions

Let’s create our first multiple regression to explain this point.

First, we’ll import all the relevant libraries.

This is what we need to code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn
seaborn.set()

We can load the data from the file ‘1.02. Multiple linear regression.csv’. The way to load it is the following:

data = pd.read_csv('1.02. Multiple linear regression.csv')

An Additional Variable

Let’s type ‘data’ and run the code.

We have data about the SAT and GPA results of students. However, there is one additional variable, called ‘Rand 1,2,3’. We’ve generated a variable that assigns 1, 2, or 3 randomly to each student. We are 100% sure that this variable cannot predict college GPA.
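The file already contains this column, but for illustration, here is a minimal sketch of how such a column could be generated with NumPy (the generation code is my sketch, not taken from the tutorial):

# Hypothetical sketch: draw 1, 2, or 3 uniformly at random for each student.
# np.random.randint samples integers from 1 (inclusive) up to 4 (exclusive).
data['Rand 1,2,3'] = np.random.randint(1, 4, size=len(data))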

So, this is our model: college GPA = b0 + b1 * SAT score + b2 * the random variable.

Now, let’s write ‘data.describe()’ and see the descriptive statistics.

Creating the Regression

In our case, y is GPA and there are 2 explanatory variables – SAT and Rand 1,2,3.

What we can do is declare x1 as a data frame, containing both series. So, the code should look like this:

y = data['GPA']
x1 = data[['SAT', 'Rand 1,2,3']]

Then, we must fit the regression:

x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()

After that, we can check out the regression tables using the appropriate method:

results.summary()

You can compare these with the tables from the previously linked tutorial. Keep in mind that we only had 1 explanatory variable there.

The New R-squared

We notice that the new R-squared is 0.407, so it seems as if we have increased the explanatory power of the model. But then our enthusiasm is dampened by the adjusted R-squared of 0.392.

We were penalized for adding an additional variable that had no strong explanatory power. We have added information but have lost value.
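Rather than reading these numbers off the summary table, you can pull them straight from the fitted statsmodels results object:

# Goodness-of-fit measures exposed by the fitted OLS results
print(results.rsquared)      # about 0.407
print(results.rsquared_adj)  # about 0.392

As a sanity check against the formula above: this dataset has 84 observations and p = 2 explanatory variables, so 1 − (1 − 0.407) · 83/81 ≈ 0.392, matching the table.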

Important: The point is that you should pick your variables carefully and exclude useless information.

However, one would assume regression analysis is smarter than that. Adding an uninformative variable should be flagged by the model in some way.

Well, that’s true. Let’s take a look at the coefficients table. We have determined a coefficient for the Rand 1,2,3 variable, but its P-value is 0.762!

The Null Hypothesis

The null hypothesis of the test is that β = 0. With a P-value of 0.762, we could not reject the null hypothesis even at a whopping 76% significance level!

This is an incredibly high P-value.

Important: For a coefficient to be statistically significant, we usually want a P-value of less than 0.05.

Our conclusion is that the variable Rand 1,2,3 not only worsens the explanatory power of the model, as reflected by the lower adjusted R-squared, but is also insignificant. Therefore, it should be dropped altogether.
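You don’t have to scan the coefficients table by eye, either; the fitted results object exposes the P-values directly:

# P-values for the intercept and each coefficient, indexed by variable name
print(results.pvalues)
# The entry for 'Rand 1,2,3' should come out around 0.76.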

Dropping Useless Variables

Dropping useless variables is important. You can see the original model changed from:

ŷ = 0.275 + 0.0017 * x1

to

ŷ = 0.296 + 0.0017 * x1 - 0.0083 * x2

The choice of the third variable affected the intercept. Whenever one variable is ruining the model, you should not use that model at all, because the bias of this variable is reflected in the coefficients of the other variables. The correct approach is to remove the problematic predictor and run a new regression without it.
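In code, that simply means refitting with SAT alone. A minimal sketch (the names x_simple and results_simple are mine, not the tutorial’s):

# Rerun the regression, omitting the problematic predictor
x_simple = sm.add_constant(data['SAT'])
results_simple = sm.OLS(y, x_simple).fit()
results_simple.summary()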

Simplicity

There is one more consideration concerning the removal of variables from a model. We could add 100 different variables to a model, and the in-sample fit would probably look outstanding. However, this strategy makes regression analysis futile. We are trying to use a few independent variables that approximately predict the result. The trade-off is complex, but simplicity is usually rewarded more than a marginal gain in explanatory power.

How to Compare Regression Models?

Finally, the adjusted R-squared is the basis for comparing regression models. Once again, it only makes sense to compare two models with the same dependent variable, estimated on the same dataset. If we compare two models with two different dependent variables, we are making an apples-to-oranges comparison. If we use different datasets, it is an apples-to-dinosaurs problem.
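Since the two models above share the same dependent variable and dataset, their adjusted R-squareds are directly comparable (results_simple is the SAT-only refit sketched earlier):

# Compare the models on adjusted R-squared: higher is better
print(results.rsquared_adj)         # SAT + Rand 1,2,3: about 0.392
print(results_simple.rsquared_adj)  # SAT only: about 0.399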

As you can see, the adjusted R-squared is a step in the right direction, but it should not be the only measure you trust. Caution is advised, and thorough logic and diligence are mandatory.

What We’ve Learned

To sum up, the R-squared tells us how much of our data’s variability is explained by the regression line. The best value for an R-squared depends on the particular case. When we feel we are missing important information, we can simply add more factors. This is where the adjusted R-squared comes into play. It measures how much of the variability our model explains, but it also considers the number of variables. Therefore, it is always smaller than the R-squared. Moreover, the adjusted R-squared is the basis for comparing regression models.

After reading this, you probably feel like you are ready to dive deeper into the field of linear regressions. Maybe you are keen to find out how to estimate a linear regression equation. Or maybe you want to know what to consider before performing regression analysis.

Either way, exploring the world of the Ordinary Least Squares assumptions will be right up your street.

***

Interested in learning more? You can take your skills from good to great with our statistics tutorials and Statistics course.

Next Tutorial: Exploring the OLS Assumptions

