Last week, the Open Science Collaboration reported that only 36% of a sample of 100 claims from published psychology studies were successfully replicated: Estimating the reproducibility of psychological science.
A reproducibility rate of 36% seems bad. But what would be a good value? Is it realistic to expect all studies to replicate? If not, where should we set the bar?
In this post I’ll argue that it should be 100%.
First, however, I’ll note that no single replication attempt ever has a 100% chance of success. Even a real effect can, just by chance, fail to reach statistical significance, although with enough statistical power (i.e. by collecting enough data) this chance can be made very low.
Therefore, when I say we should aim for “100% reproducibility”, I don’t mean that 100% of replications should succeed, but rather that the rate of successful replication should equal the statistical power of the replication attempts.
In the Open Science Collaboration’s study, for example, the average power of the 100 replication studies was 0.92. So 100% reproducibility would mean 92 positive results.
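To make the arithmetic concrete, here is a minimal sketch (assuming, for simplicity, that every original effect is real and that each replication has the average power of 0.92) of what “100% reproducibility” would predict for a project like the OSC’s. The constant names are my own, not from the paper:

```python
import random

random.seed(0)

N_STUDIES = 100   # replication attempts, as in the OSC project
POWER = 0.92      # average statistical power of the replications (OSC, 2015)

# If every original effect were real, each replication would succeed with
# probability equal to its statistical power, so the expected number of
# successful replications is simply N_STUDIES * POWER.
expected_successes = N_STUDIES * POWER
print(expected_successes)  # 92.0

# Even at "100% reproducibility" the observed count varies by chance.
# Simulate many hypothetical replication projects to see the spread.
def run_project(n_studies: int, power: float) -> int:
    """Count successes when each study independently succeeds with prob. `power`."""
    return sum(random.random() < power for _ in range(n_studies))

outcomes = [run_project(N_STUDIES, POWER) for _ in range(10_000)]
mean_successes = sum(outcomes) / len(outcomes)
print(round(mean_successes, 1))  # close to 92, e.g. ~91.9-92.1
```

The simulation also shows that the observed 36 successes is far below anything chance alone would produce under this benchmark, which is the gap the rest of this post is about.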
Is this a realistic goal?
Some people argue that if psychologists were only studying highly replicable effects, they would be studying trivial ones, because interesting psychological phenomena are more subtle. As one commenter put it,
Alan Kraut, executive director of the Association for Psychological Science and a board member of the Center for Open Science, noted that even statistically significant “real findings” would “not be expected to replicate over and over again… The only finding that will replicate 100 per cent of the time is likely to be trite, boring, and probably already known.”
I don’t buy this. It may be true that, in psychology, most of the large effects are trivial, but it doesn’t follow that the small, interesting effects are irreplicable. 100% reproducibility, limited only by statistical power, is a valid goal even for small effects.
Another view is that interesting effects in psychology are variable or context-dependent. As Lisa Feldman Barrett put it, if two seemingly-identical experiments report different results, one confirming a phenomenon and the other not,
Does this mean that the phenomenon in question is necessarily illusory? Absolutely not. If the studies were well designed and executed, it is more likely that the phenomenon… is true only under certain conditions.
Now, my problem with this view is that it makes scientific claims essentially unfalsifiable. Faced with a null result, we could always find some contextual variable, however trivial, to ‘explain’ the lack of an effect post hoc.
It’s certainly true that many (perhaps all!) interesting phenomena in psychology are context-dependent. But this doesn’t imply that they’re not reproducible. Reproducibility and generalizability are two different things.
I would like to see a world in which psychologists (and all scientists) don’t just report the existence of effects, but also characterise the context or contexts in which they are reliably seen.
It shouldn’t be enough to say “Phenomenon X happens sometimes, but don’t be surprised if it doesn’t happen in any given case.” Defining when an effect is seen should be part and parcel of researching and reporting it. Under those defined conditions, we should expect effects to be reproducible.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251). PMID: 26315443