More on False Positive Neuroimaging

By Neuroskeptic | October 14, 2012 11:46 am

Back in June, I warned that the ever-increasing number of clever methods for analyzing brain imaging data could be a double-edged sword:

Recently, psychologists Joseph Simmons, Leif Nelson and Uri Simonsohn made waves when they published a provocative article called False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.

It explained how there are so many possible ways to gather and analyze the results of a simple psychology experiment that, even if there’s nothing interesting really happening, it’ll be possible to find some “significant” positive results purely by chance…

The problem’s not just seen in psychology however, and I’m concerned that it’s especially dangerous in modern neuroimaging research.

In a comment on that post, The Neurocritic pointed out that Michigan PhD student Joshua Carp had put forward the same argument in a conference presentation, several months previously.

Now Carp’s published a paper on the topic: On the plurality of (methodological) worlds: estimating the analytic flexibility of fMRI experiments. It’s free to access, so check it out.

Whereas I just talked the talk by listing lots of possible ways in which you could analyze a given set of data, Carp walked the walk and actually ran loads of analyses. He took a single dataset, the results of a simple experiment, and looked at it in almost 7,000 different ways. Each set of results was then thresholded to correct for multiple comparisons in 5 different ways, for a grand total of about 35,000 outputs.

The variants he considered ranged from how much smoothing to apply, to how to correct for head motion, and many more.
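To get a feel for how a handful of choices multiplies into thousands of pipelines, here is a minimal sketch. The option names and values below are hypothetical stand-ins, not Carp's actual list:

```python
from itertools import product

# Hypothetical preprocessing/analysis choices; each independent option
# multiplies the number of distinct pipelines.
options = {
    "smoothing_fwhm_mm": [4, 6, 8, 12],                # spatial smoothing kernel
    "motion_correction": ["none", "6param", "24param"],
    "temporal_filter": ["none", "highpass"],
    "hrf_model": ["canonical", "canonical_plus_derivs"],
    "autocorrelation": ["none", "ar1"],
}

pipelines = list(product(*options.values()))
print(len(pipelines))  # 4 * 3 * 2 * 2 * 2 = 96 pipelines from just five choices
```

Add a few more choices (slice timing, normalization template, thresholding method) and the count climbs into the thousands, which is exactly the space Carp enumerated.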

What happened? In a nutshell, the different options made a difference – and the variability was the largest in parts of the brain that were most activated (the “blobs” that lit up). In other words, analytic flexibility makes the most difference in the most interesting places. See the picture at the top.

The location of the maximum peak activation also varied. This is not unexpected, and not, in itself, that worrying – the great majority of the peaks clustered in a few small areas. However, it underlines that different options really can make a difference.

Carp concludes:

Nearly every voxel in the brain showed significant activation under at least one analysis pipeline. In other words, a sufficiently persistent researcher determined to find significant activation in virtually any brain region is quite likely to succeed…

If investigators apply several analysis pipelines to an experiment, and only report the analyses that support their hypotheses, then the prevalence of false positive results in the literature may far exceed the nominal rate. However, analytic flexibility only translates into elevated false positive rates when combined with selective analysis reporting. If researchers reported the results of all analysis pipelines used in their studies, then it would not be problematic.

To the author’s knowledge, there is no evidence that fMRI researchers actually engage in selective analysis reporting. But researchers in other fields do appear to pursue this strategy.

In my experience, fMRI researchers are actually fairly conservative in terms of using different analyses. I certainly doubt anyone has ever run thousands of them just to get the result they want, and I’d estimate that most published findings are the product of no more than a handful of ‘attempts’.

However, it’s a serious concern that it could happen, and importantly, it’s getting ever easier to do, with the continuing increase in computing power making running an analysis quicker and cheaper than ever. As to what to do about it, Carp makes several suggestions, and here’s one I made earlier…

Joshua Carp (2012). On the plurality of (methodological) worlds: estimating the analytic flexibility of fMRI experiments. Front. Neurosci. DOI: 10.3389/fnins.2012.00149

  • Ivana Fulli MD

    After that brave young man’s paper, at least Mr petrossa will not dare to write any more that fMRI research studies are a waste of time and money.
    As for omg, I am not so sure, since she worries about more sophisticated heuristic biases… it seems to me.

    Cheers ladies and gentlemen in the field!

  • omg

    Neuroskeptic's a genius. Bravo. People like him change the world.

  • Anonymous

    oh there are people who pursue different analyses to obtain their favorite spot of activation. it is a real problem in some labs where the pressure to affirm the PI's ego / NIH promises is on

  • Mortimer

    Nice post!

    at the risk of saying something stupid: wouldn't the better test be a random contrast (e.g. randomly assigning trials to pseudo-conditions), and then testing all the different analysis pipelines to see whether any of them yields a significant result? If there is nothing in the data, then nothing should come out. Importantly, the raw data themselves don't change. I am skeptical that different procedures substantially change this.

    My impression is that sometimes/often effects are a little bit too weak to survive a correction for multiple comparisons, and the researcher tries different procedures to enhance the power. I am not sure whether this is legitimate. Usually the differences are not so dramatic. Weak effects won't become gigantic.

    Apart from that, I find it much more worrying that you can simply increase the threshold (or do something else) to get rid of unwanted activations.
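The pseudo-condition idea in the comment above is essentially a permutation test. A minimal sketch with purely synthetic data (all names here are illustrative; this is not Carp's pipeline):

```python
import random
import statistics

random.seed(0)

# Synthetic data with NO real effect: 100 fake "voxels", each a list of
# 40 trial responses drawn from pure noise.
n_trials = 40
voxels = [[random.gauss(0.0, 1.0) for _ in range(n_trials)] for _ in range(100)]

def mean_diff(values, labels):
    """Difference of means between the two pseudo-conditions."""
    a = [v for v, lab in zip(values, labels) if lab]
    b = [v for v, lab in zip(values, labels) if not lab]
    return statistics.mean(a) - statistics.mean(b)

def permutation_p(values, labels, n_perm=200):
    """Two-sided permutation p-value for the observed mean difference."""
    observed = abs(mean_diff(values, labels))
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        random.shuffle(shuffled)
        if abs(mean_diff(values, shuffled)) >= observed:
            hits += 1
    return hits / n_perm

# Randomly assign trials to two pseudo-conditions, as suggested.
labels = [True] * (n_trials // 2) + [False] * (n_trials // 2)
random.shuffle(labels)

p_values = [permutation_p(v, labels) for v in voxels]
false_positives = sum(p < 0.05 for p in p_values)
# With no real effect, roughly 5% of the 100 voxels should pass p < 0.05.
print(false_positives)
```

If a given pipeline pushed this count far above the nominal 5%, that would flag the pipeline, not the brain.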


  • DS


    “.. and certainly I doubt anyone has ever run thousands of them just to get the result they want ..”

    To pollute neuroscience with erroneous neuroimaging results, is it necessary that any one researcher run thousands, or even a few, different analysis methods? Isn't it enough that there are lots of researchers running different analysis methods, each yielding different results?

  • Nitpicker

    I think the realistic range of pipelines people truly use is much narrower than this. I don't buy that this is such a major concern. Sure, there will be some findings that are capricious and change under different pipelines. As such running several pipelines with parameters based on reasonable assumptions is a nice way to establish the robustness of a finding. But it's more important that the procedures are reported with sufficient detail to permit replication. This is still not the case in many journals especially the high-impact ones.

  • DS

    Nitpicker wrote:

    “I think the realistic range of pipelines people truly use is much narrower than this.”

    But are they valid? How would one test that?

  • Ritu

    I'm not sure about others' motivations for trying various streams/options, but I can tell you mine: being new to fMRI, I did not (& still do not) clearly understand each parameter to be specified. Take smoothing in SPM – should I use 4, 8, or 12? Who knows. This page has good links, but one tends to get lost in the explanations, and it's just easier to try different values until the results “make sense” (which, of course, biases them). Maybe most of us are just not smart enough to be doing fMRI? Or do there need to be simpler, concrete guidelines? Not sure.

  • DS

    The three biggest problems in fMRI:

    (1) Subject motion
    (2) Movement of subjects
    (3) Subjects not staying still.

  • Anonymous

    just divide your p value criterion by the number of analyses you run
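The adjustment the commenter describes is a Bonferroni correction; a minimal sketch (the numbers are purely illustrative):

```python
# Bonferroni correction: to keep the family-wise false positive rate at
# alpha across k analyses, test each analysis against alpha / k.
alpha = 0.05
n_analyses = 10
corrected_threshold = alpha / n_analyses
# Each of the 10 analyses must now clear a stricter per-test threshold.
print(corrected_threshold)
```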

  • Neuroskeptic

    DS: I'm not sure I agree, but I LOLed.

  • Nitpicker

    DS wrote:

    “But are they valid? How would one test that?”

    You make it sound unnecessarily arcane. A lot of common assumptions and parameters are easily validated: first you need to ask yourself if the assumptions are biologically plausible (e.g. what hemodynamic model is applied) and if they make sense mathematically (such as statistical parametric maps without smoothing).

    And beyond that you can test effects for which the ground truth is relatively well established. Show a flickering checkerboard stimulus in a standard block design and I bet you will see activation in occipital cortex. If you don't (or get it somewhere else, say, in the precuneus) you are doing something wrong.

    Obviously, it becomes increasingly difficult to validate protocols the more complex and poorly understood your experimental question is. But that, too, is normal. Extraordinary claims require extraordinary evidence.

  • DS


    I am concerned with the validation of each of the various processing steps. For example, two big ones: motion correction and temporal interpolation. These two processing steps are usually done separately, yet this is known not to be valid from physical and mathematical arguments.

    Even more problematic is motion correction alone. What is the efficacy of motion correction processing? What is the error associated with motion correction? Even simpler, what is the error associated with the measurement of rigid body motion parameters? To my knowledge this is not known, and there is plenty of reason to assume that it is rather large.

    This error in motion measurement and motion correction also impacts downstream processing such as the regression of motion-related temporal variation from the fMRI time series data.

    These errors must be estimated and the error must be propagated through the analysis. In physics not doing so would be 100% unacceptable. Hiding behind the messiness of biology, as many do, is not justification for continuing this lack of proper analysis.

  • Nitpicker

    @DS: We seem to have this same discussion every few months… You make a good point, although you still haven't pointed us to any paper actually describing this process (or at least I missed it if you did).

    However, what I was saying about empirical validation certainly still holds for these issues too. It's more problematic for complex questions, for which the answer may be more volatile, but you can certainly use known effects as sanity checks.

  • DS


    Papers describing what process?

  • Nitpicker

    “These errors must be estimated and the error must be propagated through the analysis.”

    Has anyone done this? If not, why don't you?

  • DS


    Ah. I am working on doing just that. But the horse is out of the barn and it has been placed before the cart. The functional imaging field has science bassackwards. No measurement should ever be performed in scientific endeavours without giving an associated error for that measurement (unless it is well known what that error is and that it is insignificant). But that is what is done in fMRI and it will be the death of fMRI if this mess is not cleaned up … and maybe if it is.

  • Nitpicker

    Interesting, I am looking forward to reading about this (let's hope NS will cover it on this blog 😉)

    I still stand by my earlier comment however on how systems with (relatively well-)known ground truth can serve as sanity checks. The more complex a question the more sanity checks are required and generally I agree with you that the more sources of error we can characterize the better. But there are many highly robust effects in the fMRI literature that I doubt will be gravely affected by such improved analyses.

  • DS

    All that such gross sanity checks tell you – at best – is whether your methodology is grossly different from the methodology of others. They say nothing about the validity of the results associated with the sanity-checked data.

  • Nitpicker

    That is only true for situations where we have no prior knowledge to work with. But for many questions, say the response to visual stimuli, we have 100+ years of literature to base a prior on, combining a wide range of measurement modalities. It would be foolish to only base your understanding on fMRI alone for any question. And I'd wager most serious fMRI researchers don't in fact do so.

  • DS


    I truly do not understand what your point is. Would you please state it.

  • daedalus2u

    The even bigger problem which you don't mention (and which is insufficiently appreciated) is that fMRI doesn't measure neuronal activation; what it measures is blood oxygen level dependent (BOLD) magnetic susceptibility. What is being measured are differences in magnetic susceptibility due to different volume concentrations of oxyhemoglobin (diamagnetic) and deoxyhemoglobin (paramagnetic). It is purely a hemodynamic parameter.





About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.

