Here is a good article about r-squareds (or best proxies for them) for logistic regression.
FAQ: What are pseudo R-squareds?
This is a subject that seems to be barely discussed and poorly understood. (Note that I will refer to these proxies as r-squareds for ease of discussion.)
The article gives some excellent insight into the nature of these proxies, which do not work as well as R-squared does for linear regression. There is one part that I slightly disagree with, and that is in regard to the following two sentences.
“In other words, a pseudo R-squared statistic without context has little meaning. A pseudo R-squared only has meaning when compared to another pseudo R-squared of the same type, on the same data, predicting the same outcome.”
So, technically, these statements may be correct, but ‘meaning’ is a relative term, and the fact is that r-squareds do have meaning in that the closer they are to 1, the better the model fits. I think it is a little bit like temperature. While 80 degrees might be hot to some, and not so much to others, we do know that 80 degrees is “pretty hot”. A pseudo r-squared of, say, 0.8 is, more often than not, a pretty reasonable fit, depending upon context. I am not sure if my temperature analogy is a perfect one, but you get the idea.
I would like to elaborate just a bit on one characteristic that may give these metrics a propensity for higher or lower values. I think that they depend somewhat upon the likelihood of success (or failure). I ran some models to gain insight into whether the likelihood of success matters with respect to the tendency toward low (or high) r-squareds. There are a lot of numbers, and this is difficult to convey in a few paragraphs, so pay careful attention!
Specifically, I looked at four models with the same two independent variables, both coming in significant for all four models. I took a variable that had a mean of 80 and ranged anywhere from 0 to 1000 (skewed right). I defined a “success” four different ways; namely, if that value was at least 50, at least 100, at least 200, or at least 400. Of course, it is going to be at least 50 more often than at least 100, etc. So, there will be far fewer successes when the criterion is to be at least 400. I found Somers’ D values (a rank-based measure of association that I am counting among the many pseudo r-squareds here) to go from 0.148 to 0.181 to 0.202 to 0.24 for the four models where success was defined as being at least 50, 100, 200, and 400, respectively.
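For anyone who wants to try something similar, here is a minimal Python sketch of the two mechanical steps described above: turning one variable into four different success definitions, and computing Somers' D for a 0/1 outcome. The data values, the predicted probabilities, and the function name somers_d are all made up for illustration; this is not the data or code behind the numbers I quoted.

```python
def somers_d(actuals, predicted):
    """Somers' D for a 0/1 outcome: (concordant - discordant) / pairs,
    taken over every (success, failure) pair of observations."""
    success_preds = [p for y, p in zip(actuals, predicted) if y == 1]
    failure_preds = [p for y, p in zip(actuals, predicted) if y == 0]
    pairs = len(success_preds) * len(failure_preds)
    if pairs == 0:
        raise ValueError("need at least one success and one failure")
    concordant = discordant = 0
    for s in success_preds:
        for f in failure_preds:
            if s > f:
                concordant += 1   # success was scored higher than failure
            elif s < f:
                discordant += 1   # success was scored lower than failure
    return (concordant - discordant) / pairs


# Hypothetical values of the underlying right-skewed variable.
values = [10, 55, 70, 130, 250, 480]

# One 0/1 outcome per success definition; higher cutoffs give fewer successes.
success_counts = {}
for threshold in (50, 100, 200, 400):
    successes = [int(v >= threshold) for v in values]
    success_counts[threshold] = sum(successes)
print(success_counts)

# Hypothetical actuals and model probabilities for a single model.
toy_actuals = [1, 1, 0, 0]
toy_predicted = [0.9, 0.4, 0.5, 0.1]
toy_d = somers_d(toy_actuals, toy_predicted)
print(toy_d)
```

In a real comparison, each of the four outcomes would get its own fitted logistic model, and somers_d would be applied to that model's predicted probabilities.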
Based on experience, this makes sense, as it seems to me that the more polarized the data set is (meaning extreme percentages of successes or failures), the more it will lend itself to higher values of pseudo r-squareds. So the model where success meant a value of at least 400 (mostly ‘failures’ here) had the highest r-squared.
Keeping in mind that many of these formulas are similar, let us look at Efron’s formula, which is 1 minus a ratio. The numerator of that ratio is the sum of the squared differences between the actuals and the predicteds, while the denominator is the sum of the squared differences between the actuals and the average of the actuals. Let’s assume you have extreme data (i.e., almost always a success, say 99% of the time). I believe the regression process will then almost always model a very high probability of success.
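Written out, Efron's formula as described above looks like the following. The notation is my own shorthand rather than the article's: y_i is the 0/1 actual, \hat{\pi}_i is the predicted probability, \bar{y} is the average of the actuals, and n is the number of observations.

```latex
R^2_{\text{Efron}} = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{\pi}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```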
For instance, what if you know that 99% of students at a particular high school will go to college? Perhaps one of the factors is that it is an affluent area. For ease of discussion, assume that there were exactly 100 people at the high school in 2020, and 99 of them went to college. One of the significant variables is GPA, and you find out that a person had a GPA in the bottom 10% of the class. Well, there is still a very good chance that this person goes to college, and the model, understanding that, will give that person a high probability of going to college. So 99 of the terms in the numerator will be close to zero. Only one term (namely, the person who did not go to college) will contribute a relatively large amount to the numerator (the actual will be zero, with likely a high predicted probability).
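To make the numerator concrete, here is a minimal Python sketch of the 100-student example. The predictions are an assumption of mine: every student is simply given the 99% base rate, which is roughly what a model with little discriminating power would do. The function name efron_r2 is just illustrative.

```python
def efron_r2(actuals, predicted):
    """Efron's pseudo r-squared: 1 - (squared prediction errors) /
    (squared deviations of the actuals from their average)."""
    mean_actual = sum(actuals) / len(actuals)
    numerator = sum((y - p) ** 2 for y, p in zip(actuals, predicted))
    denominator = sum((y - mean_actual) ** 2 for y in actuals)
    return 1 - numerator / denominator


# 100 students: 99 went to college (1), one did not (0).
actuals = [1] * 99 + [0]

# Assumed predictions: everyone gets the 99% base rate.
base_rate_preds = [0.99] * 100

# Split the numerator into the 99 attenders and the one non-attender.
attender_part = sum(
    (y - p) ** 2 for y, p in zip(actuals[:99], base_rate_preds[:99])
)
non_attender_part = (actuals[99] - base_rate_preds[99]) ** 2

print(attender_part, non_attender_part)
print(efron_r2(actuals, base_rate_preds))
```

Under that assumption, the single non-attender supplies nearly all of the numerator, and Efron's measure comes out at about zero, since predicting the base rate for everyone does no better than the average of the actuals.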
Let’s compare two scenarios: the extreme one just described, and a more balanced one where successes happen, say, about half the time. The first one has virtually every observation as a success and a predicted probability of success near 1. That keeps Efron’s numerator relatively low. The second scenario, I suspect, is going to have many predictions in the ballpark of, say, 65%, or perhaps on the other side of 50%, say 35%. That is going to make the numerator much higher. That said, because the average of the actuals is so close to 1 in the extreme case, the denominator will also be lower there. So, on the surface, we have a ‘stalemate’. For instance, start with the fraction ½, and compare it to a fraction with a larger numerator and denominator, such as 2/4. Of course, it is the same.
But I think what generally happens is that, in the extreme case, the numerator is small enough to more than make up for the smaller denominator, so the ratio comes out lower. It is like comparing 3/4 (extreme case) to 12/15 (non-extreme case): one is 0.75 and the other is 0.8, and after subtracting each from 1, you get 0.25 for the extreme case compared to 0.2 for the non-extreme case.
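The denominator side of this can be checked directly. For 0/1 actuals, the sum of squared differences from the average works out to n * p * (1 - p), where p is the share of successes. Here is a small Python sketch, with a hypothetical sample size of 100 and a 50% success rate standing in for the non-extreme case (the function name is mine):

```python
def efron_denominator(n, success_rate):
    """Sum of squared deviations of n 0/1 actuals from their average,
    which simplifies to n * p * (1 - p) for a success share p."""
    return n * success_rate * (1 - success_rate)


extreme = efron_denominator(100, 0.99)   # 99% successes
balanced = efron_denominator(100, 0.50)  # assumed 50% successes
print(extreme, balanced)
```

That is roughly 0.99 versus 25, so the denominator really is far smaller in the extreme case. Whether the r-squared ends up higher then depends on the numerator shrinking by an even larger proportion.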
So, I think what happens the great majority of the time is that “all things being equal”, the more extreme situations (actuals being almost always a success or failure) lend themselves to slightly higher r-squareds.
Hope this makes sense.