Reproducibility Crisis: The Plot Thickens

By Neuroskeptic | November 10, 2015 1:36 pm

A new paper from British psychologists David Shanks and colleagues will add to the growing sense of a “reproducibility crisis” in the field of psychology.

The paper is called Romance, Risk, and Replication and it examines the question of whether subtle reminders of ‘mating motives’ (i.e. sex) can make people more willing to spend money and take risks. In ‘romantic priming’ experiments, participants are first ‘primed’ e.g. by reading a story about meeting an attractive member of the opposite sex. Then, they are asked to do an ostensibly unrelated test, e.g. being asked to say how much money they would be willing to spend on a new watch.

There have been many published studies of romantic priming (43 experiments across 15 papers, according to Shanks et al.) and the vast majority have found statistically significant effects. The effect would appear to be reproducible! But in the new paper, Shanks et al. report that they tried to replicate these effects in eight experiments, with a total of over 1600 participants, and they came up with nothing. Romantic priming had no effect.

So what happened? Why do the replication results differ so much from the results of the original studies?

The answer is rather depressing and it lies in a graph plotted by Shanks et al. This is a funnel plot, a two-dimensional scatter plot in which each point represents one previously published study. The graph plots the effect size reported by each study against the standard error of the effect size – essentially, the precision of the results, which is mostly determined by the sample size.


This particular plot is a statistical smoking gun, and suggests that the positive results from the original studies (black dots) were probably the result of p-hacking. They were chance findings, selectively published because they were positive.

Here’s why. In theory, the points in a funnel plot should form a “funnel”, i.e. a triangle, that points straight up. In other words, the more precise studies at the top should have less spread than the noisier estimates, but they should converge on the same effect size that’s also the average of the less precise measures.

In this plot, however, the black dots form a ‘funnel’ which is seriously tilted to the left. The trend line through these points is a diagonal (the red line). In other words, the more precise studies tended to find smaller romantic priming effects. The bigger the study, the smaller the romantic priming.

In fact, the diagonal red trend line closely tracks the line where an effect stops being statistically significant at p < 0.05 – which is marked as the outer edge of the grey triangle on the plot. Another way of expressing this would be to say that p values just below 0.05 are overrepresented. The published results “hug” the p = 0.05 significance line. So each of the studies tended to report an effect just strong enough to be statistically significant. It’s very difficult to see how such a pattern could arise – except through bias.
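To make the geometry of that boundary concrete, here is a minimal sketch (illustrative numbers, not taken from the paper): for a normally distributed estimate, an effect is two-tailed significant at p < 0.05 exactly when it exceeds about 1.96 standard errors, so the boundary is a diagonal line through the origin of the funnel plot.

```python
def min_significant_effect(se, alpha_z=1.959963984540054):
    """Smallest effect size reaching two-tailed p < 0.05 at a given
    standard error; on a funnel plot this traces the diagonal
    significance boundary (effect = 1.96 x SE)."""
    return alpha_z * se

# Noisy (small) studies need big effects to clear p < 0.05;
# precise (large) studies need only small ones.
for se in (0.30, 0.15, 0.05):
    print(f"SE = {se:.2f} -> minimum significant effect = {min_significant_effect(se):.3f}")
```

Studies that “hug” the significance line are studies whose reported effects sit just above this diagonal, whatever their sample size.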

Shanks et al. say that this is evidence of the existence of “either p-hacking in previously published studies or selective publication of results (or both).” These two forms of bias go hand in hand, so the answer is probably both. Publication bias is the tendency of scientists (including peer reviewers and editors) to prefer positive results over negative ones. P-hacking is a process by which scientists can maximize their chances of finding positive results.
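To see how p-hacking inflates positive findings even when no effect exists, here is a small simulation (my illustration, not from the paper): each simulated ‘study’ measures several outcomes under the null and reports only the best one – a simple stand-in for the flexible analytic choices that drive p-hacking.

```python
import math
import random

def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def best_p(n_outcomes, rng):
    """One simulated null study: measure n_outcomes uncorrelated outcomes
    and report only the smallest p-value (a simple form of p-hacking)."""
    return min(two_tailed_p(rng.gauss(0, 1)) for _ in range(n_outcomes))

rng = random.Random(42)
trials = 10_000
honest = sum(best_p(1, rng) < 0.05 for _ in range(trials)) / trials
hacked = sum(best_p(5, rng) < 0.05 for _ in range(trials)) / trials
print(f"false-positive rate, one outcome reported:  {honest:.3f}")  # ~0.05
print(f"false-positive rate, best of five reported: {hacked:.3f}")  # ~0.23
```

Add publication bias on top – only the ‘significant’ studies get written up – and a literature of uniformly positive results can emerge from a true effect of zero.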

I’ve been blogging about these issues for years, yet still I was taken aback by the dramatic nature of the bias in this case. The studies are like a torrent, rolling down the mountain of significance. The image is not so much a funnel plot as an avalanche plot.


Taken together with the negative results of the eight replication studies that Shanks et al. conducted, the funnel plot suggests that romantic priming doesn’t exist, and that the many studies that did report the effect were wrong.

This doesn’t mean that the previous romantic priming researchers were consciously trying to deceive by publishing results that they knew were false. In my view, they were probably led astray by their own cognitive biases, helped along by the culture of ‘positive results or bust’ in science today. This system can produce replicated positive results out of nowhere. I don’t think this is a sustainable way of doing research. Reform is needed.

Shanks DR, Vadillo MA, Riedel B, Clymo A, Govind S, Hickin N, Tamman AJ, & Puhlmann LM (2015). Romance, Risk, and Replication: Can Consumer Choices and Risk-Taking Be Primed by Mating Motives? Journal of Experimental Psychology: General. PMID: 26501730

  • CL

    Gee, studying this in 43 experiments, whereof one with 1600 participants?? It may actually be interesting to look into who funds this; it could be a clue to the nature of the bias

    • Neuroskeptic

      To be clear, there were 43 previous experiments. Then on top of that the new replication study did another 8 experiments which had a total n=1600.

      • CL

        yes, this was not done by a single lab, still I wonder who funds this amount of silly “science”. A very simple explanation is that advertisements typically link attractive individuals with products, so the “romantic priming” effect could just be a carry-over of a well-taught association between sex and material things. Either way, it is a classic chicken-or-egg problem. And then it does not really work anyway when you scrutinize it.

  • Zidane

    “The studies are like a torrent, rolling down the mountain of significance. The image is not so much a funnel plot as an avalanche plot.” -> best quote

    “It may actually be interesting to look in to who funds this, could be a clue to the nature of the bias” -> Nature of the bias? Human, obviously

    • CL

      was more considering reasons for the bias, like wanting to report a positive finding to your financier

  • Rolf Degen

    The sad thing about the original studies was that Evolutionary Psychology hinged its reputation on social priming studies in the first place. I hope this teaches them something. This simply isn’t a theoretically founded research tradition but a sleight-of-hand practice. And it can be easily immunized against refutation. Perhaps the effects materialize in “implicit” measures?

    • Neuroskeptic

      True – although we should note that nebulous operationalization is not required for p-hacking. Even “hard” outcome measures can be p-hacked through exclusion of outliers, choice of statistical covariates etc.

  • ProfessorJericho

    The problems with various social sciences in terms of valid scientific findings (where appropriate) have bothered me for a long time as well. I think many of the issues are fixable, especially with modern technological tools that are more precise. I put down the main arguments on my LinkedIn blog:

    • Thom Baguley

      It is worth noting that none of these problems (e.g., p-hacking, publication bias) are exclusive to social science. Certainly the problems arise in medical research, genetics and neuroscience (or indeed any field that uses similar methods).

      I am also skeptical that big data will solve this (it certainly isn’t immune to p-hacking or publication bias and arguably could be more vulnerable).

      • ProfessorJericho

        I don’t disagree, certainly Big Data methods have their own unique issues (such as overfitting), but at least they can produce testable models or help find hidden variables in a way that current research methods in social sciences don’t. P-values are problematic wherever they’re used, but as I note in my post, the problem is inflated because of the conceptual nature of social science variables. Working with predictive models in learning analytics I’ve seen some remarkable results, but the methodology is still mired in ethical and conceptual problems. What I don’t think can be argued against is that as we become a civilization whose behavior generates massive data, Big Data methods are the only realistic way to turn it into knowledge.

        • Thom Baguley

          I agree there is weak theorising in some areas of social science – but that isn’t reflective of all social science. There is also a separate problem that mathematical modelling of behavioural data is more difficult than for most physical processes.

          I think the claims about Big Data are also over-optimistic – but that is getting off topic.

          • ProfessorJericho

            I agree with your first two points. When it comes to Big Data methods as a research tool I think the proof will be in the pudding. I might well be overly optimistic, but until we see the effectiveness (or lack thereof) of the method over time I have no way of knowing.

          • Thom Baguley

            I think the issue for me is that Big Data is an opportunity, not a method. It has massive potential but also massive challenges.

            For example, once you have a very large data set the measurement error might be inconsequential but that just means all the action is in the bias. In scientific terms handling bias is often harder than handling measurement error.

          • ProfessorJericho

            Relevant to the issue of replication, in a Big Data scenario (under the assumption that we get new data to test against) our specific predictive model can be verified against new data or eventual outcomes. I think that’s a better situation than having unreplicated studies that hit Fisher’s arbitrary standard, have very small effect sizes and are never reproduced. Yes, Big Data methods are an opportunity (as well as being a distinct set of analytical approaches), an opportunity to do better than the status quo. Nonetheless, I believe in using the correct tool for the job. If the correct approach is ANOVA or regression and that gets you where you need to go then use that.

      • TomJohnstone

        True to an extent. But the probability that a p<0.05 result reflects a true effect is dependent on the prior probability that the effect exists (this is true even for well-powered studies). If the prior probability is low, then the odds of a false positive can be very high. And when hypotheses in a given field are based on shaky conceptual theories, rather than more solid mechanistic ones, then the prior probability that the effects they predict are real will be low, making the number of false positives higher.

        So there are likely to be differences between scientific fields in this regard. Those fields that have a more solid grasp of the mechanisms that underlie the phenomena they are examining, for example those that routinely develop quantitative predictive models, will be more immune to these problems than those that rely on purely descriptive theories.
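The prior-probability argument above can be made quantitative with the standard positive-predictive-value formula (the power and alpha figures below are conventional illustrations, not values from this thread): the probability that a significant result reflects a true effect falls sharply as the prior plausibility of the tested hypothesis falls.

```python
def positive_predictive_value(prior, power=0.8, alpha=0.05):
    """Probability that a p < alpha result reflects a true effect,
    given the prior probability that the tested effect is real."""
    true_pos = power * prior          # real effects correctly detected
    false_pos = alpha * (1 - prior)   # null effects wrongly 'detected'
    return true_pos / (true_pos + false_pos)

for prior in (0.5, 0.1, 0.01):
    print(f"prior {prior:.2f} -> PPV {positive_predictive_value(prior):.2f}")
```

With a prior of 0.5 most significant results are real; with a prior of 0.01 – shaky, purely descriptive theorising – most of them are false positives, even before any p-hacking.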

        • Thom Baguley

          I think that is to lump all social science research together in an unreasonable way. It also doesn’t fully explain why you get the same issues in, say, genetics or medicine, where there is good underlying understanding of the physical mechanism. The common problem is the uncertainty arising from interactions of hundreds or thousands of variables.

          • Neuroskeptic

            Yes, the problem is seen across the “p-value sciences”. Although some of them seem to have successfully dealt with it, e.g. physics.

  • Uncle Al

    Psychology and meteorology share the same flaw: After all the scholarly humbug is exhausted, the phenomenon will do what it damn well pleases. Macroeconomics, though apparently within this class, is different. Macroeconomics causes $(USD)trillions/year harm worldwide.

    Now, mix and match. Scientific socialism, Obamacare, human rights, Klimate Kaos – what could possibly go wrong?

    • Ageofwant

      Silly strawman nonsense. Meteorology is hard computable science, Psychology is not.

      The rest of your comment is typical American libertarian crackpot fare: I libertarian, precipitated from raw unadulterated self-actualisation out of the nebulous ether, owing no man nothing. Owner-less fatherless.

      You’re right about Macroeconomics of course.

      • Noumenon72

        As a libertarian, I will never have kids because this “I owe the world something just for living” is a terrible burden to inflict on someone.


  • seethelunatic

    Just demand three sigmas as the standard of significance and much of the problem will solve itself. Positive results will get rarer and draw more scrutiny. Space will be freed up in the journals for more replication studies with fewer positive results getting published. Replication studies will become more attractive to the ambitious if it’s easier to bump off someone else’s published outlier than p-hack your own.
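For reference, a three-sigma threshold is far stricter than the conventional p < 0.05 (which sits just under two sigma); a quick check of the two-tailed normal tail probabilities (a generic calculation, not from the comment):

```python
import math

def two_tailed_p(z):
    """Two-tailed normal p-value for a z-score (sigma) threshold."""
    return math.erfc(z / math.sqrt(2))

print(f"2 sigma: p = {two_tailed_p(2):.4f}")   # ~0.0455
print(f"3 sigma: p = {two_tailed_p(3):.4f}")   # ~0.0027
```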

  • TomJohnstone

    The term “p-hacking” is typically used to refer to bad practice by individual researchers or groups. But as long as academic journals selectively publish only “significant” results, one could say that the *field* as a whole is engaged in p-hacking. Because what the field is doing is hundreds of experiments and selectively excluding the non-significant results.

    There is no justification that I can see for the field to be held to different standards than individual researchers. If it’s wrong for individual researchers to only report significant results, or only report on data or analyses that make results significant, then it is also wrong for academic journals – as the “guardians” of research standards – to do likewise.

    Lowering the required p-value to 0.001, requiring increased power, or requiring 2-3 replicated results will not solve this problem. Pre-registering studies will, as will publishing studies based on the quality of the methods/procedures used, rather than the significance of the results.

    • Neuroskeptic

      Very true. As I said, p-hacking and publication bias go hand in hand. Publication bias is the motive for p-hacking. But publication bias is largely the product of editors and reviewers, not authors.

    • Bonnie Cramond

      I think you are right about this. I especially agree with your last suggestion: publishing studies based on quality of the methods/procedures used, rather than the significance of the results. I would add one more criterion–that the study have a logical theoretical basis, too.

    • monkeyonakeyboard

      And let’s not forget the responsibility of the general public, who only want to read publications with sensational articles!


  • Paul1234

    Some major takeaways.. Journals prefer to publish papers with p < 0.05…. And it's possible to get an improbable result from time to time even if there is no effect…. Indeed there is a crisis…


  • Dr. Paul Marsden

    Psychology’s ‘Sokal affair with statistics’ or our own ‘doping scandal with p-hacking’. Pick your poison.


  • Dmitry

    We should also make sure that the data in this paper were not cherry-picked.

  • Richard Denton

    Could it be said that “If it’s social it ain’t science?”

  • Eliot

    Besides all that statistical stuff and likely “p-hacking”: if the example of a “primer” given in this article is representative, there is a bigger problem than p-hacking. If the implication is that one would spend more for a watch, that assumption could be wrong. My reaction would be to pay less for a watch so that I would have more money to treat her (in my case) better.

  • lyellepalmer

    Note that 0.10 is substituted for 0.01 on the pyramid.

    • Neuroskeptic

      Hi – actually 0.10 is correct. Notice that the 0.10 threshold is closer to the y-axis (the line of effect size zero) than the 0.05 threshold. Thus, some effects are significant at 0.10 but not at 0.05.

  • lyellepalmer

    Editors could assist readers in framing articles as either “indicative” or “definitive” studies. How many of the prior meta-analysis studies were replicas of earlier studies? Replication is not the same as creating studies that move the phenomenon down the path toward “definitive” evidence and conclusions through refinements of procedures with population subsets. In psychology a definitive conclusion must answer the questions of “How much?” and “For whom?”





About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.

