Is Medical Science Really 86% True?

By Neuroskeptic | January 24, 2013 6:39 pm

The idea that Most Published Research Findings Are False rocked the world of science when it was proposed in 2005. Since then, however, it’s become widely accepted – at least with respect to many kinds of studies in biology, genetics, medicine and psychology.

Now, however, a new analysis from Jager and Leek says things are nowhere near as bad as that: only 14% of the medical literature is wrong, not half of it. Phew!

But is this conclusion… falsely positive?

I’m skeptical of this result for two separate reasons. First off, I have problems with the sample of the literature they used: it seems likely to contain only the ‘best’ results. This is because the authors:

  • only considered the crème-de-la-crème of top-ranked medical journals, which may be more reliable than others.
  • only looked at the Abstracts of the papers, which generally contain the best results in the paper.
  • only included the just over 5000 statistically significant p-values present in the 75,000 Abstracts published. Those papers that put their p-values up front might be more reliable than those that bury them deep in the Results.

In other words, even if it’s true that only 14% of the results in these Abstracts were false, the proportion in the medical literature as a whole might be much higher.

Secondly, I have doubts about the statistics. Jager and Leek estimated the proportion of false-positive p-values by assuming that true p-values tend to be low: not just below the arbitrary 0.05 cutoff, but well below it.

It turns out that p-values in these Abstracts cluster strongly near zero, and the conclusion is that most of them are real.

But this depends on the crucial assumption that false-positive p-values, unlike real ones, are equally likely to fall anywhere from 0 to 0.05.

“if we consider only the P-values that are less than 0.05, the P-values for false positives must be distributed uniformly between 0 and 0.05.”

The statement is true in theory – by definition, p values should behave in that way assuming the null hypothesis is true. In theory.

But… we have no way of knowing if it’s true in practice. It might well not be.
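To make the "in theory" part concrete, here is a toy simulation (mine, not from the paper): p-values from a well-behaved test run on pure noise are uniform on (0, 1), so the significant ones are uniform on (0, 0.05). The choice of test, sample size and trial count are all arbitrary.

```python
# Toy simulation: two-sample z-test on pure noise. Under the null,
# p-values are uniform on (0, 1), so the significant ones are uniform
# on (0, 0.05). Sample size and trial count are arbitrary choices.
import math
import random

random.seed(1)

def null_p(n=30):
    """Two-sided p-value for a two-sample z-test on two null samples."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = abs(sum(a) / n - sum(b) / n) / math.sqrt(2 / n)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

pvals = [null_p() for _ in range(20000)]
sig = [p for p in pvals if p < 0.05]

print(len(sig) / len(pvals))                   # ~0.05: the false-positive rate
print(sum(p < 0.025 for p in sig) / len(sig))  # ~0.5: uniform within (0, 0.05)
```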

For example, authors tend to put their best p-values in the Abstract. If they have several significant findings below 0.05, they’ll likely put the lowest one up front. This works for both true and false positives: if you get p=0.01 and p=0.05, you’ll probably highlight the 0.01. Therefore, false positive p values in Abstracts might cluster low, just like true positives.
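A toy sketch of that selection effect (the number of significant findings per paper is an invented assumption): even if each individual false-positive p-value is uniform on (0, 0.05), reporting only the smallest one per paper makes the reported values cluster low.

```python
# Selection-effect sketch: each "paper" has k significant false-positive
# p-values, each uniform on (0, 0.05); the Abstract reports only the
# smallest. k = 3 is an invented assumption, not an estimate.
import random

random.seed(2)

k = 3
reported = [min(random.uniform(0, 0.05) for _ in range(k))
            for _ in range(20000)]

# Under uniformity, half of these would fall below 0.025;
# reporting the minimum pushes far more of them down there.
frac_low = sum(p < 0.025 for p in reported) / len(reported)
print(round(frac_low, 2))
```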

Alternatively, false p’s could also cluster the other way, just below 0.05. This is because running lots of independent comparisons is not the only way to generate false positives. You can also take almost-significant p’s and fudge them downwards, for example by excluding ‘outliers’, or running slightly different statistical tests. You won’t get p=0.06 down to p=0.001 by doing that, but you can get it down to p=0.04.
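The fudging story can be sketched the same way (the near-miss window and the 50% success rate below are invented numbers, not estimates): honest null p-values are uniform, but some just above 0.05 get nudged just under it, piling up false positives near the threshold.

```python
# Fudging sketch: honest null p-values are uniform on (0, 1), but a
# near-miss in (0.05, 0.08) sometimes gets nudged just below the cutoff.
# The window and the 50% success rate are invented numbers.
import random

random.seed(3)

def reported_p():
    p = random.uniform(0, 1)
    if 0.05 < p < 0.08 and random.random() < 0.5:
        p = random.uniform(0.03, 0.05)  # 'outliers' excluded, test switched...
    return p

pvals = [reported_p() for _ in range(40000)]
sig = [p for p in pvals if p < 0.05]

# Under uniformity, 40% of significant p's would sit in (0.03, 0.05);
# fudging inflates that share.
print(round(sum(0.03 < p < 0.05 for p in sig) / len(sig), 2))
```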

In this dataset, there’s no evidence that p’s just below 0.05 were more common. However, in many other sets of scientific papers, clear evidence of such “p hacking” has been found. That reinforces my suspicion that this is an especially ‘good’ sample.

Anyway, those are just two examples of why false p’s might be unevenly distributed; there are plenty of others: ‘there are more bad scientific practices in heaven and earth, Horatio, than are dreamt of in your model…’

In summary: the idea of modelling the distributions of true and false findings, and using those models to estimate the proportion of each in a sample, is promising. But I think a lot more work is needed before we can be confident in the results of the approach.
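For readers who want to see what that modelling approach looks like, here is a toy version of the general idea (mine, not Jager and Leek's actual code or model): treat the significant p-values as a mixture of false positives, uniform on (0, 0.05), and true positives shaped like a truncated Beta(a, 1) clustered near zero, then recover the false-positive fraction by maximum likelihood over a grid.

```python
# Toy mixture-model sketch (not the paper's code): significant p-values
# are a mix of false positives (uniform on (0, CUT)) and true positives
# (truncated Beta(a, 1), clustered near 0). We recover the false-positive
# fraction pi0 by grid-search maximum likelihood on simulated data.
import math
import random

random.seed(4)
CUT = 0.05  # significance threshold

def simulate(n, pi0, a):
    """Draw n significant p-values; a fraction pi0 are false positives."""
    out = []
    for _ in range(n):
        if random.random() < pi0:
            out.append(random.random() * CUT)             # false positive
        else:
            out.append(CUT * random.random() ** (1 / a))  # true positive
    return out

def loglik(pvals, pi0, a):
    """Log-likelihood of the uniform + truncated-Beta(a, 1) mixture."""
    return sum(math.log(pi0 / CUT + (1 - pi0) * a * p ** (a - 1) / CUT ** a)
               for p in pvals)

data = simulate(5000, pi0=0.20, a=0.20)
best = max(((p0, a) for p0 in [i / 50 for i in range(51)]
                    for a in [j / 20 for j in range(1, 20)]),
           key=lambda t: loglik(data, *t))
print(best[0])  # estimated false-positive fraction, near the true 0.20
```

Of course, this only works to the extent that the assumed shapes of the two components are right, which is exactly the point at issue above.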

CATEGORIZED UNDER: FixingScience, papers, science, statistics
  • DS

    How did the authors of this paper assess the adequacy of controls in the literature they sampled? It's not all about p-values.

    Systematic error is a biggie!

  • JJJ

    Also, these are mostly clinical trials, which means they had to be pre-registered with the analysis methods firmed up in advance, precluding many forms of p-hacking. The papers that Begley and colleagues at Amgen could not replicate were not clinical trials. So this article may only speak to a narrow subset of the biomedical literature.

  • DS

    So we don't need new data to assess whether previous claims were right or wrong? LOL. This is why I am so disgusted with medical science.

  • Jeff

    Thanks for the comments. I have posted my response to Gelman's criticism of our paper on my blog:

    I would also point out that while the assumption that “there has been some p-value hacking” is one that will have a lot of people nodding their heads, there is no empirical *data* about the rate or form in which that occurs. But just to be sure, in the above post, we did a sensitivity analysis where we assumed everyone was hacking their p-values in the most egregious and non-ethical way (only reporting the minimum p-value). We still get pretty good estimates.

    I'm 100% for open discussion of ideas and this is a hot topic. But the goal of our paper was only partially to introduce our specific method. It was also to call attention to the fact that the data for these claims that research is true/false has not been evaluated/analysed in any previous paper.

    All of our code is available online. If you have a better approach – please go for it and edit/modify/publish a new version of our code. Until then, I'll continue to respond to these comments as I usually do – show me the data backing your claim please.

  • Neuroskeptic

    Hi Jeff, thanks for the reply; I'll comment on the issues over on your blog. But I disagree with the idea that it's my job to 'show you the data' – the burden of proof is on the originators of a new analysis to show that it's valid (e.g. with the kind of sensitivity analysis you're now doing).

  • C

    Isn't it time for the NIH to issue a big RFA for replicating major findings?

  • DS

    Why is this such a controversy? Are the p-values reported in the sampled literature really the p-values for the experiment? If they are not, then the well-known uniformity of the p-value pdf (which the results of this paper hinge upon) is simply not applicable.

    There are plenty of ways to doctor ailing p-values. Some may stand out as plainly doctored while others may not.

  • DS

    Here is where I think some confusion has entered this controversy.

    Neuroskeptic stated:

    “The statement is true in theory – by definition, p values should behave in that way assuming the null hypothesis is true. In theory.

    But… we have no way of knowing if it's true in practice.”

    That statement is simply not correct. By definition the p-values should be uniformly distributed given the null. End of that story. We then take up the new story. Are the reported p-values actually p-values?

  • JTS

    DS – Consider the case of post-hoc analyses. If hypothesis tests are done after the fact specifically because the investigator noticed a difference in outcomes by chance, then p-values for those kinds of post-hoc tests cannot be assumed to be U(0,1).

    I mean, I suppose you could say at that point they're not really p-values, but they're definitely reported as such.
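JTS's post-hoc scenario can be made concrete with a toy simulation (the group count, sample size and z-test are all illustrative assumptions): with five identical groups, testing only the pair whose means happen to differ most yields "p-values" that are far from U(0,1).

```python
# Post-hoc sketch: five identical groups; we test only the pair whose
# observed means differ most. Group count and sample size are arbitrary.
import math
import random

random.seed(5)

def posthoc_p(groups=5, n=20):
    means = [sum(random.gauss(0, 1) for _ in range(n)) / n
             for _ in range(groups)]
    z = (max(means) - min(means)) / math.sqrt(2 / n)  # cherry-picked pair
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

ps = [posthoc_p() for _ in range(5000)]
print(sum(p < 0.05 for p in ps) / len(ps))  # well above the nominal 0.05
```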

  • DS

    Yep. They are not p-values.

  • Nadeem

    Well written and informative article.

    Important to know about abstracts:

    “Significant results in abstracts are common but should generally be disbelieved.” ~ Believability of relative risks and odds ratios in abstracts: cross sectional study


    Nadeem J. Qureshi / @NadeemJQ

  • Neuroskeptic

    DS: I would say they are “p values” but they don't mean what they claim to mean (because they're being hacked).

    I suppose it's semantic whether a hacked p is still a p or just masquerading as one.

  • DS


    I don't think it is semantics. The p-value is defined as the probability of observing the randomly drawn data given that the null is true. If the data are hacked, then what is being reported as a p-value is really the probability of observing the randomly drawn data given that the null is true AND that the data were hacked in some specific way. That isn't a p-value, by definition.

    If one accepts this hacked p-value as a real p-value and applies theorems only applicable to p-values then junk is the expected result.

    I don't think I am saying anything too different from what you are saying, just possibly with a bit more clarity as to where the math breaks down.

  • Deane Alban

    In Dr. Ben Goldacre’s TED Talk Battling Bad Science, he contends that only half of drug studies ever see the light of day and that there is a strong positive bias so the studies that don’t get published are usually anti-pharmaceutical. Here’s a link if anyone would like to watch it:



About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.



@Neuro_Skeptic on Twitter
