This is a follow-up to my previous article on r-squareds when modeling with logistic regression. Depending on the context and your purpose, a small r squared might be acceptable.
R squared is certainly a worthwhile tool in linear regression because it has a direct interpretation. It is the proportion of the variance in the y value that the model explains.
That is not the case with logistic regression. As a result, mathematicians have “created” pseudo r-squareds. Let’s look into the components of the r squared formula and try to discern exactly what it conveys.
Let us look at two pictures with their respective r-squareds first.
The first picture had an r-squared of 0.18. What is the y-value? It is the blue dot. What did we expect the blue dot to be? The value on the line. It is all relative, but the two do not look very ‘close’, relatively speaking.
The r-squared below is much higher, in fact, very close to 1. Let’s keep this picture in mind as we look at the formula.
Here is one derivation of the formula.
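Written out in symbols, the formula as described in this article would look like the following, where the notation is my assumption: y_i is an actual value, ŷ_i (y hat) is the value on the regression line, and ȳ is the mean of the y’s.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```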
Breaking it down in words, the fraction is the sum of the squared differences between the y’s and the expected y’s, divided by the sum of the squared differences between the y’s and the mean of the y’s, and r squared is 1 minus that fraction. So, for instance, in the second graph, notice what makes the r squared value close to 1. Since the values are so close to the regression line, the actual y-values are close to the expected y’s (y hat), and the numerator of the fraction is relatively close to zero. Certainly, the fraction will depend on the denominator, but if you run the numbers (the mean is close to 18), you will see the denominator dwarfs the numerator, making the fraction close to 0, and subtracting a value that is close to 0 from 1 will result in a value close to 1.
So, let’s also think about why this formula makes sense. We are trying to get at how well we explain the y-value, correct? Well, if the mean is 18, it makes sense to compare, so to speak, the actual y’s to the predicted y’s in assessing the fit. What if we defaulted to predicting each y value to be the mean? That would not do much, would it? In fact, it does nothing. Anybody could do that.
For example, what if you try to predict the value of each of 100 houses in a neighborhood? You know that the mean value is $200,000. Thus, you predict that each house value is $200,000. What would your r squared be? Since your predicted y value is the SAME as the mean y value, the numerator and denominator will be the same, and therefore the fraction will be 1. Thus, r squared will be 1 minus 1, which is 0 (and it should be).
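The house example can be checked with a few lines of Python. This is only a sketch: the five house values are made up for illustration (the example talks about 100 houses; five keeps it short), and `r_squared` is a helper name I am introducing here, not a library function.

```python
def r_squared(actual, predicted):
    """1 minus (squared errors of the predictions / squared errors of the mean)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Hypothetical house values whose mean is exactly 200,000.
house_values = [150_000, 175_000, 200_000, 225_000, 250_000]

# "Predict" the mean for every house.
mean_guess = [200_000] * len(house_values)

r2_mean = r_squared(house_values, mean_guess)
print(r2_mean)  # 0.0
```

Because every prediction equals the mean, the numerator and denominator are the same sum, so the fraction is 1 and r squared is 0.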
So, you see, this formula conveys exactly what it was intended to convey: the closer the values are to the regression line, the better the fit, and the closer the value will be to 1.
Let us shift over to logistic regression. Here is the problem. Because of the nature of the y-values (either 0 or 1), it is difficult (if not impossible) to get meaning behind an r squared value (or something akin to it). This is not to say there is no meaning to these values.
Let’s assume that we sample 10 4th graders and want to predict whether each will become a lawyer based on certain characteristics (how much they study, their interests, etc.). Let’s also assume that 1% of students will become lawyers. So, you devise a model that tries to predict whether or not a student becomes a lawyer. Even the most likely of kids at that point aren’t going to have more than perhaps a 5% chance, because many things can happen in life from age 10 to, say, the mid 20s, when one might pass the bar exam. Life gets in the way, doesn’t it? So, if a person does ultimately become a lawyer, even the best of models is likely going to miss that (because it will assign a low probability to everyone). Since you know that 99% do not become lawyers, and since you also know that it is very hard to predict who will become a lawyer, your logistic regression model is likely going to project at most a 5 to 10% chance for even the most likely of lawyers.
Let’s go back to the formula.
What happens to the formula is that the numerator will be highly dependent on whether or not one of those 10 actually becomes a lawyer. If somebody does, then the difference between the predicted value (close to 0) and the actual value (which is 1) is relatively high, keeping in mind that most of the differences in the denominator are very close to 0.
Here is the crux of this. We could play around with numbers all day long, but the nature of the predictions (are they close to 0, as with becoming a lawyer, or closer to 50-50, as with, for instance, graduating from college?) will go a long way in determining the r-squareds. It will also matter, at least in the case of becoming a lawyer, whether or not somebody becomes a lawyer (if just one does, that can bring the r squared much closer to 0). Dare I say it, luck matters here (some may take issue with that statement, but we can debate luck at another time 😊).
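To play around with the numbers a little, here is a rough sketch of the lawyer example. The predicted probabilities are invented for illustration, and I am applying the ordinary r squared formula directly to those probabilities, which is only one of several ways to score a logistic model (it is not any particular pseudo r-squared). The helper `r_squared` is my own name, not a library call.

```python
import math


def r_squared(actual, predicted):
    """Ordinary r squared; returns NaN when every actual value is identical."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    if ss_tot == 0:
        # All outcomes equal the mean, so the denominator is 0 and
        # the formula is undefined.
        return math.nan
    return 1 - ss_res / ss_tot


# Hypothetical model output for 10 students: nobody gets more than 5%.
predicted = [0.05, 0.03] + [0.01] * 8

nobody = [0] * 10            # no student becomes a lawyer
one_lawyer = [1] + [0] * 9   # the 5% student becomes a lawyer

r2_nobody = r_squared(nobody, predicted)
r2_one = r_squared(one_lawyer, predicted)

print("nobody becomes a lawyer:", r2_nobody)
print("one becomes a lawyer:   ", round(r2_one, 4))
```

With these made-up numbers, the single lawyer contributes nearly all of the numerator, and the r squared lands very near 0 even though the model ranked that student as the most likely lawyer. When nobody becomes a lawyer, the denominator is 0 and the formula cannot be evaluated at all.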
But, oftentimes, there is a way around this problem. Again, it might depend on your purpose. But generally, we use these models to help us in business, correct? More often than not, we want to make predictions with them. This gets to the importance of a holdout sample.
Assume that you are manufacturing something (a widget) and it must meet a “pass” threshold for it to be sent to a customer. Assume that 90% generally pass. You run a logistic regression and hope to find a model that will identify ranges of the independent variables that increase that 90% pass rate. You end up with a model in which two independent variables are significant, but your r-squared is not very high. Should you trust your model?
One thing you can do is to use a holdout sample and create deciles based on predicted success. Then check to see how well those deciles correlate with the actual success.
Assume your holdout has 1000 observations. You create a table that shows the decile, the predicted success, and the actual success. Perhaps it looks something like this.
| decile | predicted success | actual success |
What can we make of this? Well, the predicted success is reasonably close to the actual success. We can work backwards, so to speak, to see what range of values went into the higher predicted successes. This is not a guarantee, but there is a good chance you can tweak those two independent variables to be in the range of the higher deciles, and perhaps move your success rate to somewhere between 93 and 95%. Even if you move it to 92%, that is a relatively big move. You have moved it 20% of the way to perfection.
So, we might be able to work around small r-squareds with logistic regression. Every predictive analyst has his or her own style. It could be argued that modeling is as much an art as it is a science. Perhaps you have a different way of doing this. But I believe this technique can work well for certain contexts with logistic regression.
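For readers who want to try the decile check, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the 1,000 holdout observations are simulated, the predicted pass probabilities are drawn at random between 80% and 99%, and the actual outcomes are generated from those same probabilities (in other words, the simulation assumes a well-behaved model). In practice you would replace `predicted` and `actual` with your own holdout scores and results. `decile_table` is a helper name of my own.

```python
import random


def decile_table(predicted, actual):
    """Sort by predicted success, split into 10 equal groups, and
    return (decile, mean predicted success, actual success rate) rows."""
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0])
    n = len(ranked)
    rows = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_actual = sum(a for _, a in chunk) / len(chunk)
        rows.append((d + 1, mean_pred, mean_actual))
    return rows


# Simulated holdout sample (NOT real data): 1,000 widgets.
random.seed(0)
predicted = [random.uniform(0.80, 0.99) for _ in range(1000)]
actual = [1 if random.random() < p else 0 for p in predicted]

table = decile_table(predicted, actual)

print("decile  predicted  actual")
for decile, mean_pred, mean_actual in table:
    print(f"{decile:>6}  {mean_pred:>9.3f}  {mean_actual:>6.3f}")
```

Decile 1 holds the lowest predicted success and decile 10 the highest. If the actual success column rises roughly in step with the predicted column, that is some evidence the model ranks observations usefully, whatever its r squared happens to be.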