The idea that Most Published Research Findings Are False rocked the world of science when it was proposed in 2005. Since then, however, it’s become widely accepted – at least with respect to many kinds of studies in biology, genetics, medicine and psychology.
Now, however, a new analysis from Jager and Leek says things are nowhere near as bad after all: only 14% of the medical literature is wrong, not half of it. Phew!
But is this conclusion… falsely positive?
I’m skeptical of this result for two separate reasons. First off, I have problems with the sample of the literature they used: it seems likely to contain only the ‘best’ results. This is because the authors:
- only considered the creme-de-la-creme of top-ranked medical journals, which may be more reliable than others.
- only looked at the Abstracts of the papers, which generally contain the best results in the paper.
- only included the statistically significant p-values reported in the Abstracts: just over 5000 of them, from the 75,000 Abstracts published. Those papers that put their p-values up front might be more reliable than those that bury them deep in the Results.
In other words, even if it’s true that only 14% of the results in these Abstracts were false, the proportion in the medical literature as a whole might be much higher.
Secondly, I have doubts about the statistics. Jager and Leek estimated the proportion of false-positive p-values by assuming that p-values from true positives tend to be low: not just below the arbitrary 0.05 cutoff, but well below it.
It turns out that p-values in these Abstracts strongly cluster around 0, and the conclusion is that most of them are real.
But this depends on the crucial assumption that false-positive p values are different from real ones, and equally likely to be anywhere from 0 to 0.05.
“if we consider only the P-values that are less than 0.05, the P-values for false positives must be distributed uniformly between 0 and 0.05.”
The statement is true in theory – by definition, p values should behave in that way assuming the null hypothesis is true. In theory.
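To see what the 'in theory' part looks like, here is a minimal simulation sketch. It is my own illustration, not Jager and Leek's code: the function names, the sample size and the choice of a simple two-sided z-test are all assumptions. Every test is run with the null hypothesis true, and only the p-values below 0.05 are kept.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def null_significant_pvalues(n_tests, alpha=0.05, seed=0):
    """Run n_tests z-tests where the null is true; keep p-values below alpha.

    Illustrative helper (not from the paper): each test statistic is a
    single draw from N(0, 1), which is what a z-statistic looks like
    when there is no real effect.
    """
    rng = random.Random(seed)
    pvals = (two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n_tests))
    return [p for p in pvals if p < alpha]

n_tests = 200_000
sig = null_significant_pvalues(n_tests)

# About 5% of null tests should come out "significant" ...
share_significant = len(sig) / n_tests
# ... and, if they are uniform on [0, 0.05), about half fall below 0.025.
share_lower_half = sum(p < 0.025 for p in sig) / len(sig)

print(f"significant: {share_significant:.3f}")
print(f"of those, below 0.025: {share_lower_half:.3f}")
```

If the false positives really are spread evenly between 0 and 0.05, roughly half of them should land below 0.025, and that is what an honest, unselected simulation like this should show.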
But… we have no way of knowing if it’s true in practice. It might well not be.
For example, authors tend to put their best p-values in the Abstract. If they have several significant findings below 0.05, they’ll likely put the lowest one up front. This works for both true and false positives: if you get p=0.01 and p=0.05, you’ll probably highlight the 0.01. Therefore, false positive p values in Abstracts might cluster low, just like true positives.
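A rough sketch of that selection effect, under assumptions of my own: each hypothetical paper has k false-positive p-values, each uniform between 0 and 0.05, and the Abstract reports only the lowest one. The value of k and the function name are made up for illustration.

```python
import random

def lowest_p_share(k, n_papers=50_000, alpha=0.05, seed=1):
    """Share of papers whose *lowest* false-positive p-value is below alpha / 2.

    Assumption: each paper has k significant false positives, each
    uniform on [0, alpha], and only the smallest reaches the Abstract.
    """
    rng = random.Random(seed)
    cutoff = alpha / 2
    below = 0
    for _ in range(n_papers):
        lowest = min(rng.uniform(0.0, alpha) for _ in range(k))
        if lowest < cutoff:
            below += 1
    return below / n_papers

share_one = lowest_p_share(1)    # no selection: one p-value per paper
share_three = lowest_p_share(3)  # best of three goes in the Abstract

print(f"k=1: {share_one:.3f} below 0.025")
print(f"k=3: {share_three:.3f} below 0.025")
```

With uniform p-values the expected share is 1 - 0.5^k, so picking the best of three false positives should put about 87.5% of the reported values in the lower half of the range, compared with 50% when nothing is selected. The reported false positives are no longer uniform, even though every individual one was.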
Alternatively, false p’s could also cluster the other way, just below 0.05. This is because running lots of independent comparisons is not the only way to generate false positives. You can also take almost-significant p’s and fudge them downwards, for example by excluding ‘outliers’, or running slightly different statistical tests. You won’t get p=0.06 down to p=0.001 by doing that, but you can get it down to p=0.04.
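Here is a toy model of that kind of fudging. It is purely an assumption for illustration, not an estimate of what anyone actually does: any null p-value that lands between 0.05 and 0.10 gets multiplied by a random factor between 0.6 and 1.0, standing in for dropping an 'outlier' or switching tests. The factor range and all the names are hypothetical.

```python
import random

def top_bin_share(pvals, alpha=0.05, width=0.01):
    """Among p-values below alpha, the share in the top bin [alpha - width, alpha)."""
    sig = [p for p in pvals if p < alpha]
    return sum(p >= alpha - width for p in sig) / len(sig)

def simulate(n_tests=200_000, alpha=0.05, seed=2):
    """Return (honest, hacked) p-values for tests where the null is true.

    Toy assumption: near-misses in [alpha, 2 * alpha) are nudged down
    by a random factor in [0.6, 1.0]; everything else is left alone.
    """
    rng = random.Random(seed)
    honest, hacked = [], []
    for _ in range(n_tests):
        p = rng.random()  # p-values are uniform on [0, 1) under the null
        honest.append(p)
        if alpha <= p < 2 * alpha:
            p = p * rng.uniform(0.6, 1.0)  # hypothetical "fudge"
        hacked.append(p)
    return honest, hacked

honest, hacked = simulate()
honest_share = top_bin_share(honest)
hacked_share = top_bin_share(hacked)

print(f"honest: {honest_share:.3f} of significant p-values in [0.04, 0.05)")
print(f"hacked: {hacked_share:.3f} of significant p-values in [0.04, 0.05)")
```

Without fudging, about a fifth of the significant p-values should sit between 0.04 and 0.05. Under this toy model the share rises noticeably, which is the bump just below 0.05 that p-hacking studies look for.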
In this dataset, there’s no evidence that p’s just below 0.05 were more common. However, in many other sets of scientific papers, clear evidence of such “p hacking” has been found. That reinforces my suspicion that this is an especially ‘good’ sample.
Anyway, those are just two examples of why false p’s might be unevenly distributed; there are plenty of others: ‘there are more bad scientific practices in heaven and earth, Horatio, than are dreamt of in your model…’
In summary, the idea of modelling the distributions of true and false findings, and using those models to estimate the proportion of each in a sample, is promising. But I think a lot more work is needed before we can be confident in the results of this approach.