Let’s break this down. The article points out the potential for high p-values and gives an example with a false discovery rate of 86%. Without looking at the numbers, it seems counterintuitive that the false discovery rate could be 86% when the test has relatively high specificity (95%) and sensitivity (80%). But let’s look further into why this is the case. What really drives it is the *disparity* in group size between the 9900 who do not have the condition and the 100 who do. From those groups we get 495 false positives (5% of 9900) and 80 true positives (80% of 100), so the false discovery rate is 495/575 = 86%. In other words, because 9900 is so much larger than 100 (99 times bigger), that group has the *potential* to produce far more false positives, which *dwarf* the true positives and result in the very high false discovery rate.
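To make the arithmetic concrete, here is a minimal sketch of the screening calculation. The function name and its parameters are my own for illustration; the inputs (10,000 people, 1% with the condition, 80% sensitivity, 95% specificity) are the numbers from the example above.

```python
def false_discovery_rate(n_total, prevalence, sensitivity, specificity):
    """Return (true_positives, false_positives, fdr) for a screening test."""
    n_with = n_total * prevalence          # people who have the condition
    n_without = n_total - n_with           # people who do not
    true_positives = sensitivity * n_with
    false_positives = (1 - specificity) * n_without
    # Share of all positive results that are wrong
    fdr = false_positives / (true_positives + false_positives)
    return true_positives, false_positives, fdr


tp, fp, fdr = false_discovery_rate(10_000, 0.01, 0.80, 0.95)
print(f"true positives: {tp:.0f}, false positives: {fp:.0f}, FDR: {fdr:.0%}")
# prints: true positives: 80, false positives: 495, FDR: 86%
```

Notice that neither the sensitivity nor the specificity is doing the damage here; it is the 99-to-1 split between the two groups.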

Let’s change these numbers. What if 50% (5000) of the people had the condition and 50% did not? Assume the same 95% specificity and 80% sensitivity. Now we would correctly diagnose 4000 people (80% of 5000) with the condition, and for 250 (5% of 5000) we would believe they have the condition when they do not. So our false discovery rate is now 250/4250 = 6%. What a difference!
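If you want to see how quickly the false discovery rate moves as the split between the two groups changes, here is a rough sketch that sweeps the prevalence while holding the 80% sensitivity and 95% specificity fixed. The function name and the in-between prevalence values are my own choices for illustration; only the 1% and 50% cases come from the examples above.

```python
def fdr_at_prevalence(prevalence, sensitivity=0.80, specificity=0.95):
    """False discovery rate of a screening test at a given prevalence."""
    true_pos_share = sensitivity * prevalence
    false_pos_share = (1 - specificity) * (1 - prevalence)
    return false_pos_share / (true_pos_share + false_pos_share)


# 0.01 and 0.50 are the two cases discussed above; the rest are illustrative.
for prevalence in (0.01, 0.05, 0.10, 0.25, 0.50):
    print(f"prevalence {prevalence:>4.0%} -> FDR {fdr_at_prevalence(prevalence):.1%}")
```

The two endpoints reproduce the 86% and roughly 6% figures, and the values in between show that the drop is steep rather than gradual.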

How might that play out with hypothesis tests that we conduct? I think we can make an analogy to the screening example. Let’s assume that Ford has had a car in production for many years and claims that it gets at least 30 miles to the gallon (assume all driving conditions are the same). A test is done every day for those many years of its existence. Assume that Ford is very good at what they do, and in reality they almost never produce a car that gets less than 30 mpg. What is likely to happen? There will be few times when they really have a car that gets less than 30 mpg, yet we are going to find less than 30 mpg quite *often* just by random chance, aren’t we? In these cases, most of our rejections will be type I errors. Notice again what drove the high false discovery rate in our screening example. It was that we had so *few* (80) true positives, at least relative to the false positives. The same type of thing is going on in the scenario I just described.

Conversely, if Ford is poor at what they do and builds cars with volatile mpg values, then when we do find less than 30 mpg, it is more likely that the car really is getting less than 30 mpg, and thus fewer of our rejections will be type I errors.
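Here is a small simulation sketch of the two Ford scenarios. Every number in it is an assumption of mine, not something from the article: a 5% significance level, 80% power to catch a car that truly is below 30 mpg, and a car that is truly below 30 mpg on 1% of days for the "good" Ford versus 50% of days for the "poor" Ford. The function name is also hypothetical.

```python
import random


def simulate_daily_tests(p_below_30, alpha=0.05, power=0.80,
                         n_days=100_000, seed=0):
    """Simulate daily tests; return the share of rejections that are false."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    false_alarms = 0
    true_alarms = 0
    for _ in range(n_days):
        truly_below = rng.random() < p_below_30   # is the car really under 30 mpg?
        if truly_below:
            flagged = rng.random() < power        # test catches a real problem
        else:
            flagged = rng.random() < alpha        # test rejects by chance alone
        if flagged and truly_below:
            true_alarms += 1
        elif flagged:
            false_alarms += 1
    return false_alarms / (false_alarms + true_alarms)


good_ford = simulate_daily_tests(p_below_30=0.01)  # almost never below 30 mpg
poor_ford = simulate_daily_tests(p_below_30=0.50)  # below 30 mpg half the time
print(f"good Ford: {good_ford:.0%} of rejections are false")
print(f"poor Ford: {poor_ford:.0%} of rejections are false")
```

Under these assumptions the simulation mirrors the screening example: the better Ford is at building cars, the larger the share of our "less than 30 mpg" findings that are just chance.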

Here is the crux of the whole discussion. When we find p-values in any context, much like the Ford example, we don’t know enough about the context (hence the need for conducting a hypothesis test to begin with), and thus we do not know what the real false discovery rate is. But as the article suggests, we should acknowledge that false discovery rates are higher (perhaps much higher) than our p-values give credit for. Perhaps it is a good idea to create a tougher standard (make the critical region harder to hit) to reject the null hypothesis in many instances.
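As a rough illustration of what a tougher standard buys us, the sketch below recomputes the share of false rejections at several significance levels. The assumptions are mine: real effects exist in only 1% of the cases we test (as in the screening example), and power stays at 80% no matter the threshold. That second assumption is a simplification, since in practice a stricter threshold lowers power unless the sample size grows. The function name is hypothetical.

```python
def share_of_false_rejections(alpha, prior_true_effect=0.01, power=0.80):
    """Share of rejected null hypotheses that were actually true."""
    true_rejections = power * prior_true_effect
    false_rejections = alpha * (1 - prior_true_effect)
    return false_rejections / (true_rejections + false_rejections)


# Power is held fixed across thresholds purely for illustration.
for alpha in (0.05, 0.01, 0.005, 0.001):
    share = share_of_false_rejections(alpha)
    print(f"alpha = {alpha:<5} -> {share:.0%} of rejections are false")
```

Even with that generous assumption about power, the false discoveries do not vanish when real effects are rare; they just become a smaller share of what we find.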

There are quite a few other really good articles at royalsocietypublishing. Check them out. And let me know your thoughts here. Eventually, I think we will need to change what we are teaching in statistics books regarding hypothesis testing 😊