Psychology Should Aim For 100% Reproducibility

By Neuroskeptic | September 7, 2015 5:58 am

Last week, the Open Science Collaboration reported that only 36% of a sample of 100 claims from published psychology studies were successfully replicated: Estimating the reproducibility of psychological science.

A reproducibility rate of 36% seems bad. But what would be a good value? Is it realistic to expect all studies to replicate? If not, where should we set the bar?

In this post I’ll argue that it should be 100%.

First off, however, I’ll note that no single replication attempt will ever have a 100% chance of success. A real effect might always, just by chance, fail to reach statistical significance, although with enough statistical power (i.e. by collecting enough data) this chance can be made very low.

Therefore, when I say we should aim for “100% reproducibility”, I don’t mean that 100% of replications should succeed, but rather that the rate of successful replications should be 100% of the statistical power.

In the Open Science Collaboration’s study, for example, the average power of the 100 replication studies was 0.92. So 100% reproducibility would mean 92 positive results.
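
To make this arithmetic concrete, here is a minimal simulation. The effect size, per-group sample size and significance threshold below are illustrative assumptions (not taken from the OSC data), chosen so that each replication runs at roughly 92% power; if every original effect were real, about 92% of such replications should come out significant in the expected direction.

```python
# Illustrative sketch: d, n and alpha are assumptions chosen so that a
# two-sample t-test has roughly 92% power, mirroring the OSC's average.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, alpha = 0.5, 91, 0.05       # standardized effect, per-group n, threshold
n_replications = 10_000           # many runs, to estimate the success rate

successes = 0
for _ in range(n_replications):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)        # the effect is genuinely present
    _, p = stats.ttest_ind(treatment, control)
    successes += (p < alpha) and (treatment.mean() > control.mean())

print(f"replication success rate: {successes / n_replications:.2f}")  # ~0.92
```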

Is this a realistic goal?

Some people argue that if psychologists were only studying highly replicable effects, they would be studying trivial ones, because interesting psychological phenomena are more subtle. As one commenter put it,

Alan Kraut, executive director of the Association for Psychological Science and a board member of the Center for Open Science, noted that even statistically significant “real findings” would “not be expected to replicate over and over again… The only finding that will replicate 100 per cent of the time is likely to be trite, boring, and probably already known.”

I don’t buy this. It may be true that, in psychology, most of the large effects are trivial, but this doesn’t mean that the small, interesting effects are not replicable. 100% reproducibility, limited only by statistical power, is a valid goal even for small effects.

Another view is that interesting effects in psychology are variable or context-dependent. As Lisa Feldman Barrett put it, if two seemingly-identical experiments report different results, one confirming a phenomenon and the other not,

Does this mean that the phenomenon in question is necessarily illusory? Absolutely not. If the studies were well designed and executed, it is more likely that the phenomenon… is true only under certain conditions.

Now, my problem with this view is that it makes scientific claims essentially unfalsifiable. Faced with a null result, we could always find some contextual variable, however trivial, to ‘explain’ the lack of an effect post hoc.

It’s certainly true that many (perhaps all!) interesting phenomena in psychology are context-dependent. But this doesn’t imply that they’re not reproducible. Reproducibility and generalizability are two different things.

I would like to see a world in which psychologists (and all scientists) don’t just report the existence of effects, but also characterise the context or contexts in which they are reliably seen.

It shouldn’t be enough to say “Phenomenon X happens sometimes, but don’t be surprised if it doesn’t happen in any given case.” Defining when an effect is seen should be part and parcel of researching and reporting it. Under those defined conditions, we should expect effects to be reproducible.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251). PMID: 26315443

  • Chris Chambers

    I agree. The challenge, of course, is that to achieve a reproducibility rate defined primarily by the statistical power of the original study we need to ensure that the original study is as unbiased as possible. And as we all know, this is far from the case in psychology — publication bias and QRPs are the norm, leading to the proliferation of Type I and Type M errors. As long as these biases persist the reproducibility rate will always be some (probably quite low) fraction of the original power.

    What I would say is that when the original study is a registered report, which is about as unbiased as we can envisage, the reproducibility rate should hopefully get pretty close to the original power. Once a critical mass of registered reports are published this hypothesis could be put to the test: e.g. comparative reproducibility of 100 RRs vs 100 non-RRs. (For the uninitiated, more on registered reports here: https://osf.io/8mpji/wiki/home/)
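
    To get a rough sense of how decisive the comparison proposed above could be, here is a sketch of the power calculation. The replication rates assumed below (85% for registered reports, 40% for conventional studies) are purely hypothetical, chosen only to illustrate the arithmetic of a 100-vs-100 comparison.

    ```python
    # Hypothetical sketch: power of a two-proportion comparison of replication
    # rates, 100 registered reports vs 100 conventional studies. The assumed
    # rates (0.85 and 0.40) are illustrative guesses, not data.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    rate_rr, rate_non_rr, n_per_arm = 0.85, 0.40, 100
    effect = proportion_effectsize(rate_rr, rate_non_rr)   # Cohen's h
    power = NormalIndPower().power(effect_size=effect, nobs1=n_per_arm,
                                   alpha=0.05, ratio=1.0)
    print(f"power to detect the assumed difference: {power:.3f}")   # ~1.0
    ```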

  • Stepan Bahnik

    I have to disagree. There are cases where it is reasonable to publish findings which are not definitive. For instance, you may study a population that is very hard to access. Or you don’t have resources to conduct more studies on a given topic. Your options in these cases are to either publish preliminary findings, or put your data in the file drawer. I think that the former is the better option, and you get less than 100% reproducibility in that case. You should of course describe your findings as preliminary, not draw far-fetched conclusions from them, etc. But expecting 100% reproducibility would go against scientific progress.

    • Chris Chambers

      Unfortunately, the kind of research you describe pretty much sums up the majority of psychology and cognitive neuroscience: underpowered, preliminary studies, the results of which are oversold in the interests of storytelling. I’d rather see a third option to your two: slow down and, when testing hypotheses, invest in well powered, genuinely prospective experiments, pooling resources as necessary to provide more reliable outcomes.

      • Stepan Bahnik

        I agree with your suggestions. It is better to design studies so that they are definitive. Nevertheless, that is not always possible. And since it is not always possible, you will be in some cases put before the decision to publish preliminary results, or put the data in the file drawer. If you choose the former, you cannot expect 100% reproducibility. I agree that there is a problem with reproducibility, but that doesn’t necessarily mean that we should aim for 100% reproducibility.

        • Chris Chambers

          “And since it is not always possible, you will be in some cases put before the decision to publish preliminary results, or put the data in the file drawer.”

          Or don’t do the study in the first place. Which takes us into the interesting philosophical terrain of whether a partial, biased answer to a question is better or worse than no answer at all.

          Ten years ago I would have argued that a partial answer is better than nothing because we at least have something to build on in generating more definitive studies. Now, though, I realise that in many areas we are waiting for a train that will never come: small preliminary studies are sufficient to achieve publication in the most prestigious journals provided the correct dose of spin is expertly applied, and this sets the upper standard for what the field aspires to. In a careerist system, why would a rational participant do big definitive multi-site studies to answer a question (and get one or two papers out of it) when they can get five equally career-propelling papers out of smaller studies for the same price?

          • Stepan Bahnik

            Why do you assume that the answer is biased? Should you expect 100% reproducibility and definitive answers if you do only registered reports? Sometimes data are messy and people should get used to it. The problem with reproducibility is partly caused by the expectation of clear stories.

          • Chris Chambers

            Because small studies usually present biased estimates of true effects, and the pressure to put a glossy sheen on the outcomes encourages further bias.

            Agree fully that people should get used to messy data – publishing outcomes should ideally be independent of the data altogether (which is what registered reports, and *only* registered reports, currently offer). But equally we need to get used to doing rigorous high powered research or become Bayesians and just accumulate uncertain evidence until it tells us something – that would be good too, but a very long game.

            I’m not sure whether a reproducibility project on RRs would predict 100% reproducibility (according to Neuroskeptic’s definition) but I predict that it would get us as close as possible barring subtle experimenter error. This is a testable prediction.

          • David Lane

            I don’t see why a small study per se would result in a biased effect size. Small studies selectively published because they contain significant effects, however, would greatly overestimate their effect sizes.

          • Chris Chambers

            Yes, that’s what I mean by biased. Publication bias + low power means that effect size estimates are necessarily inflated. Add to the ingredients a good dose of researcher bias (e.g. p-hacking) to make a neat story and the outcome is a field dominated by Type M and Type I errors.
            The answer to this problem, in my view, is as Neuroskeptic outlines: strengthen the primary research base to increase rigor and eliminate as much bias as possible.
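
            A minimal simulation of the inflation described here; the true effect (d = 0.2), the per-group sample size (n = 20) and the number of studies are illustrative assumptions, not estimates from any real literature.

            ```python
            # Illustrative sketch: a small true effect studied with small samples,
            # where only 'significant' positive results pass the publication filter,
            # yields published effect sizes far larger than the true one (Type M error).
            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(0)
            true_d, n, alpha, n_studies = 0.2, 20, 0.05, 20_000

            published = []
            for _ in range(n_studies):
                a = rng.normal(0.0, 1.0, n)           # control
                b = rng.normal(true_d, 1.0, n)        # treatment, true effect = 0.2
                t, p = stats.ttest_ind(b, a)
                if p < alpha and t > 0:               # the publication filter
                    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
                    published.append((b.mean() - a.mean()) / pooled_sd)

            print(f"true d = {true_d}, mean published d = {np.mean(published):.2f}")
            ```

            Under these assumed numbers, the mean published estimate comes out several times larger than the true effect.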

          • David Lane

            My take is that all methodologically-sound studies should be published thus eliminating publication bias. Of course they would have to be published online to reduce publication costs.

          • Chris Chambers

            I agree – the challenge, as always, is how to make this actually happen. Even journals like PLOS ONE, which profess to select papers purely on technical merit, impose publication bias because not all of the thousands of academic editors follow the policy correctly.

            The answer, in my view, is for all journals to offer registered reports because the best way to prevent publication bias is to make editorial decisions before results are even known. Accepting papers in advance also provides the crucial incentive for authors to engage in transparent practices by divorcing the publication process from the results. We can shout “all results should be published” for a hundred years and it won’t happen without changing the incentive structure.

  • Thom Baguley

    I doubt the average power was 92%. Not sure how they did their power calculations but standard ones will overestimate power on average (because they treat the effect as fixed and because small studies produce underestimates of the population SD more often than overestimates).

    • David Lane

      I agree that .92 is totally wrong. The authors apparently used power estimates based on effect sizes in the originally published studies. Because only significant findings are published (maybe with a few exceptions), the effect sizes are way overestimated (differences between means overestimated, SDs underestimated), as are power estimates based on them. Publication bias can explain the results presented in the article entirely or almost entirely.
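
      A sketch of the mechanism described here, with illustrative numbers only: original studies of a true d = 0.3 with n = 30 per group are filtered for significance, nominal replication power is then computed from each published estimate, and that nominal power is compared with the true power of a replication using n = 100 per group.

      ```python
      # Illustrative sketch: nominal replication power computed from published
      # (significance-filtered) effect sizes versus true power at the real effect.
      import numpy as np
      from scipy import stats
      from statsmodels.stats.power import TTestIndPower

      rng = np.random.default_rng(0)
      true_d, n_orig, n_rep, alpha = 0.3, 30, 100, 0.05
      solver = TTestIndPower()

      published_d = []
      for _ in range(5_000):
          a = rng.normal(0.0, 1.0, n_orig)
          b = rng.normal(true_d, 1.0, n_orig)
          t, p = stats.ttest_ind(b, a)
          if p < alpha and t > 0:              # only significant results publish
              sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
              published_d.append((b.mean() - a.mean()) / sd)

      nominal = np.mean([solver.power(d, nobs1=n_rep, alpha=alpha)
                         for d in published_d])
      actual = solver.power(true_d, nobs1=n_rep, alpha=alpha)
      print(f"nominal power from published effect sizes: {nominal:.2f}")
      print(f"true power at d = {true_d}: {actual:.2f}")
      ```

      Under these assumed numbers the nominal figure comes out above .9, while the true power of the replications is only roughly .56.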

      • Thom Baguley

        Indeed, though the .92 would be an overestimate even in the absence of publication bias.

        • David Lane

          Agreed, I think it would be more like 0.70 from publication bias alone.

      • Stuart Buck

        If original effect sizes are all overestimated, that’s rather the point.

        • David Lane

          But the implication of the article is that the overestimates are due to something substantive rather than a necessary and incidental byproduct of publication bias. Otherwise, what’s the point of the article?

      • Steffen

        “Publication bias” and “file drawer effect”? Everyone I know has to get money for their experiments from funding sources, and they want reports on what you did with their money (papers!). And I know no one who is in the position to just put the effort of a few months of a PhD student and perhaps also a Master student in a file drawer, particularly not the students themselves, because they depend on the results for their thesis.
        All projects I hear about have significant results. There is no “file drawer” with non-significant results. If there is, it is with those people who had to leave science because they did not get further funding.

  • storkchen

    Regarding the proper reporting of context that you suggest: many people have noted that it is not possible to report all contextual factors in an experiment. But there is a simple rule (that I suggested in http://centerforopenscience.github.io/osc/2014/05/28/train-wreck-prevention/ ) regarding the reporting of context in empirical claims. That rule is: anything that isn’t claimed explicitly as contextual is assumed to be claimed as context-independent. So if you do an experiment with psychology undergrads, and your claim is about psychology undergrads and not about *Dutch* psychology undergraduates, you are thereby claiming that it is a culture-independent effect. If you don’t want that, you should state in your claim that it is about Dutch students. Same thing if it’s only about males, younger than 28, during a full moon, in winter, or whatever.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      Agreed.

    • matus

      This comment touches on a fine distinction that is at the heart of the replication debate: you don’t wish to replicate methods, experiments or results. You wish to replicate the claims drawn by the original study.

      But then there is another problem. Some researchers have claimed that the distinction between direct and conceptual replication is straightforward. Direct replication copies the methods from the original study. Conceptual replication investigates the claims. However, we have now figured out that as part of direct replication we actually don’t care about the methods. We care about the claims. Does the distinction between direct and conceptual replication make sense after all? Maybe we should drop it, since conceptual replication is all we want to do?

  • R-Index

    “In the Open Science Collaboration’s study, for example, the average power of the 100 replication studies was 0.92.”

    This estimate of power is based on the reported effect sizes in the original studies and it is easy to demonstrate that these effect sizes are inflated. Thus, the true power of replication studies was not 92%. If it had been 92%, we could interpret the success rate of 36% as evidence that the replication studies were not exact replication studies (moderators, etc.). However, it is also possible that the true power was much lower than 92% and that the low success rate of 36% reflects the true power of the original studies.

    To make statements about power in the original and replication studies we need to take publication bias into account. Here is a link to a post that does it for social psychology studies in OSF.

    https://replicationindex.wordpress.com/2015/09/03/comparison-of-php-curve-predictions-and-outcomes-in-the-osf-reproducibility-project-social-psychology-part-1/

    and here is one for cognitive psychology

    https://replicationindex.wordpress.com/2015/09/05/comparison-of-php-curve-predictions-and-outcomes-in-the-osf-reproducibility-project-part-2-cognitive-psychology/

    It is clear that the replication studies in OSF did not have 92% power. Actual power estimates are 35% for social and 75% for cognitive psychology.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      True, although if the original effect size was an overestimate (which, due to QRPs/publication bias, it probably is), maybe the effect will always fail to replicate in one sense – the originally claimed effect size will not be reproduced in the replication studies, even if the effect is significant and in the same direction.

  • feloniousgrammar

    I appreciate the math, really I do, but aren’t most U.S. and Western psychological studies carried out on college students? This is much too narrow a population from which to draw general conclusions about human psychology.

    • Thom Baguley

      Yes and no. It does depend on what the study is about and it usually does make sense to do cheap and quick studies before running very expensive, time-consuming cross-cultural comparisons. (Also many studies do use other samples). I think Neuroskeptic is right that reproducibility and generalizability are distinct facets of the problem.

  • Jacob

    This would mean a result shouldn’t/wouldn’t get published until it’s already been replicated, or worse yet, until 100 replication attempts have been made and the number which succeed matches the power. A better strategy is to have more science be out in the open, publish results as they come in, publish replication attempts, and stop treating “published in a peer reviewed journal” as equivalent to “this effect is real”.

  • http://www.livingwithtn.org Red Lawhern

    It may be that we can be usefully instructed by a quote sometimes attributed to Norbert Wiener [1894-1964], the founder of what we now call cybernetics. To paraphrase, “all truly interesting human behaviors are over-determined.” In more basic terms, for every human outcome, there are multiple plausible “causes” and interactive modifying factors — and we might not even know all of them.

    We should not be surprised that studies in both cognitive and social psychology frequently fail to replicate or generate weaker significance when replicated. Any time a study protocol is imposed on a human population, the natural variability of human attitudes, social behavior, and cognitive outcomes is to some extent artificially altered. This would occur even if study authors did not cherry pick their data or change their protocols at mid-stream to generate the outcomes they desire (which some authors clearly do).

    To the extent that such issues come into play in human behavior and psychological study, we may frequently be asking the wrong questions or failing to ask even more important ones. Most of the time we don’t ask “does cause A produce effect B?” — and that’s mostly a wise omission. But we forget to ask “what is the natural range of variability of outcomes surrounding this complex of behaviors in a representative population and social setting?” Or “which of our observations can we reliably generalize in the real world versus those that appear to be strongly influenced by individual character, habit, or social context?” Or “does this experiment have predictive value, and if so under what circumstances?”

    As Wiener might have pointed out, we should not expect simplified studies to replicate, when the subject of study is complex and multi-factorial. Many human behaviors tend to be that way.

  • D Samuel Schwarzkopf

    There are two issues at play here (both of which are partially discussed already below).

    1. Generalisability:

    I agree (and have frequently argued) that the ‘hidden moderator’ argument is unfalsifiable. Of course there are additional moderators/factors/confounds or other variables that may influence a result. It is totally fine, no, I’d say it is imperative that you postulate them. But if you do, you should carry out experiments to test them. Just saying ‘The reason X didn’t replicate may be because of Y’ and leaving it at that whilst carrying on to believe in X is not acceptable. Instead you should say ‘We tested whether X only replicates under condition Y’ and then report whatever you found – which might be that X just doesn’t replicate. There may always be another reason Z but that goes without saying and unless you have an idea of what Z might be it doesn’t really deserve to be said.

    2. Power:

    You can’t have 100% of the power. You can’t know power posthoc. As Dr R points out below the original effect size estimate is likely to be inflated, especially because of publication bias. Preregistration and generally more sensitive experiments can improve that but they won’t get rid of it. Publication bias will persist even in a world where all research is published – because not all possible research is actually done.

    Moreover, you simply cannot know power posthoc. Power is a pre-experimental concept. It is the proportion of theoretical repeats of the experiment that will correctly detect an effect size you predicted. This implies that you actually predicted any effect size to begin with. In replication studies, you can use the originally reported effect size to calculate power – although you would be wrong to do so (but it could still be a guideline, especially if you correct it for potential bias in the original study).

    However, in your post you’re not talking about replication studies. You are saying that psychology research as a whole should aim for sufficient power. What do you base this calculation on? The only thing the field can do is make better (as in, clearer) predictions of what you might expect.

    I find it funny how the more time I spend talking about these issues, the more of a Bayesian I become… 😛
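
    To illustrate the pre-experimental use of power described above, a minimal sketch; the predicted effect size of d = 0.4 is an arbitrary assumption. The power analysis is done before the study, from a predicted effect, and the second calculation shows how realized power drops if the true effect turns out smaller than predicted.

    ```python
    # Illustrative sketch: prospective power analysis from a predicted effect
    # size (an assumption), rather than post-hoc power from an observed estimate.
    from statsmodels.stats.power import TTestIndPower

    predicted_d = 0.4              # the effect size you commit to in advance
    solver = TTestIndPower()

    n_per_group = solver.solve_power(effect_size=predicted_d, alpha=0.05,
                                     power=0.92, alternative='two-sided')
    print(f"per-group n for 92% power at d = {predicted_d}: {n_per_group:.0f}")

    # If the true effect is smaller than predicted, realized power is lower:
    print(f"power at d = 0.25 with that n: "
          f"{solver.power(0.25, nobs1=n_per_group, alpha=0.05):.2f}")
    ```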

  • Alex Holcombe

    About Kraut’s idea that findings that replicate 100% of the time are probably already known or boring, areas of science with technological innovation that open up new frontiers routinely discover new effects that are large, important and easy to replicate (if you have the new technology). E.g., that humans have multiple retinotopic maps in cortex, or a fusiform face area, is easy to replicate if you have a decent fMRI machine (I don’t have one, but that’s what I have heard). Kraut’s statement seems like a sad commentary on whatever fields he is thinking of. Even old fields, like vision science, occasionally discover new large illusions that always replicate.

    • D Samuel Schwarzkopf

      Occasionally, people actually discover new retinotopic areas. And then they spend decades arguing about whether they exist or not because they do or do not seem to exist in the same way in monkeys 😉

  • Vaughan

    Great post, but the idea that claiming an effect is context-specific makes science unfalsifiable seems a little bizarre. The ‘hidden moderator’ is just context. Science is about working out under what conditions effects are true, almost by definition.

    It is true you can always claim there is a hidden moderator to explain a null effect, but this doesn’t make claims unfalsifiable, because you just cast your specific claim as a hypothesis and test it. This is standard Popper. No positive evidence, just dismissing the nulls.

    It only makes it unfalsifiable if you a) say that the hidden moderator is all contexts except the one in which the original experiment was done (don’t know anyone who’s ever done this); or b) refuse to engage in the scientific process to try and falsify it (much more common).

    If you keep explaining away mounting null findings with increasingly elaborate contextual hypotheses, you may not be convincing as a scientist, but if you’re still testing at each stage, it is still science, even if you’re not being very useful.

    Don’t confuse the impossible for the improper.

    All effects are context dependent to some degree. Science works this out. This is not a bug, it’s a feature.

    • D Samuel Schwarzkopf

      Yes, as I was saying below, I think it’s entirely fine to say there may be a moderator but you can’t just dismiss a failed replication by using this argument and moving on. At the very least this suggests that the effect isn’t as robust as you thought originally.

  • sonia

    Critically appraising a study/review can often flesh out contextual differences and pinpoint the potential sources of bias. For instance, there’s a wide range of critical appraisal tools for assessing validity/reliability etc. in systematic reviews. Same could be used for assessing studies which couldn’t be replicated.

    I’m wondering whether contentious social/administration or other parameters can be virtually simulated and adjusted with existing data. Surely there must be an overall trend between studies that could or couldn’t be replicated. If a certain parameter is suspected of being different, then can’t this be adjusted statistically?

    A point system might work for critically appraising replications. For instance,

    2 points for using the same number of subjects.
    No point because the number of males are different.
    2 points for using the same computer operating system.

    The total points could determine whether the replication is valid or not. A similar system could be used to determine whether a study can be replicated or not and then flesh out why not.

    This replication business sounds like the start of something so much more.
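
    A toy sketch of the checklist scoring described above; the criteria and weights are just the hypothetical examples from this comment, not an established appraisal instrument.

    ```python
    # Toy sketch of a replication-appraisal checklist; criteria and weights are
    # hypothetical examples, not an established instrument.
    def appraisal_score(original: dict, replication: dict) -> int:
        """Award points for each design feature the replication matches."""
        criteria = {              # feature name -> points if matched
            "n_subjects": 2,
            "male_count": 1,
            "operating_system": 2,
        }
        return sum(points for feature, points in criteria.items()
                   if original.get(feature) == replication.get(feature))

    score = appraisal_score(
        {"n_subjects": 40, "male_count": 18, "operating_system": "Windows 7"},
        {"n_subjects": 40, "male_count": 22, "operating_system": "Windows 7"},
    )
    print(f"appraisal score: {score}")   # 4 of a possible 5 here
    ```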

  • Ethan Black

    What’s the reproducibility rate in other sciences? I’ve heard really bad things about it in the world of biology, especially cancer research, but what about neuroscience?

  • polistra24

    The plain fact is that psychology as an academic discipline was never necessary. Advertisers and salesmen have always done a better job of understanding human motivation and behavior. They run controlled experiments in the form of focus groups, and they test their results rigorously by increased or decreased sales.

  • Pingback: Spike activity 11-09-2015 « Mind Hacks()

  • http://jayarava.blogspot.com Jayarava

    “In this post I’ll argue that it should be 100%.”

    This doesn’t seem right. The only way to approach 100% reproducibility is to know that an effect is reproducible before publishing. But without publishing, how will other scientists know to try reproducing it?

    Let’s not forget that Popper argued for conjecture and *refutation* as the engine for progress in knowledge.

    It makes me wonder how neuroscience would fare held to the same standard. What is your reproducibility rate as a discipline, and more to the point, what is your personal rate of reproducibility? How many of your published works have resulted in a replication attempt, and how many have succeeded? I’ll trust you to be honest.

  • Pingback: Open Science and scholarly publishing roundup – September 12, 2015 | Frontiers Blog()

  • Pingback: Weekend reads: Backstabbing; plagiarism irony; preprints to the rescue - Retraction Watch at Retraction Watch()

  • Pingback: Er forskningsfunnene pålitelige? | BT Innsikt()

  • SW

    It’s clear that many psychologists do not understand science. This seems to confirm surveys of those in the field that have suggested as much. Listening to the other comments only reinforces this impression.

  • Pingback: Debunking Advice Debunked | Absolutely Maybe()

  • Pingback: Debunking Advice Debunked | PLOS Blogs Network()

