False Positive Neuroscience?

By Neuroskeptic | June 30, 2012 9:32 am

Recently, psychologists Joseph Simmons, Leif Nelson and Uri Simonsohn made waves when they published a provocative article called False-Positive Psychology.

The paper’s subtitle was “Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”. It explained how there are so many possible ways to gather and analyze the results of a (very simple) psychology experiment that even if there’s nothing interesting really happening, it’ll be possible to find some “significant” positive results purely by chance. Then you could publish those ‘findings’ and not mention all the other things you tried.

It’s not a new argument, and the problem has been recognized for a long time as the “file drawer problem”, “p-value fishing”, “outcome reporting bias”, and by many other names. But not much has been done to prevent it.

The problem is not confined to psychology, however, and I’m concerned that it’s especially dangerous in modern neuroimaging research.


Let’s assume a very simple fMRI experiment. The task probes visual responses to facial emotion: volunteers are shown 30-second blocks of Neutral, Fearful and Happy faces during a standard functional EPI scan. We also collect a standard structural MRI, as required to analyze that data.

This is a minimalist study. Most imaging projects have more than one task, commonly two or three and maybe up to half a dozen, as part of one scan. If one task failed to show positive results, it need never be reported at all, so additional tasks would compound the problems here.

Our study is comparing two groups: people with depression, and healthy controls.

How many different ways could you analyze that data? How much flexibility is there?


First, some ground rules. We’ll stick to validated, optimal approaches. There are plenty of commonly used but less favoured approaches, like uncorrected thresholds (and then, which ones?) or voodoo stats, but let’s assume we want to stick to ‘best practice’.

As far as I can see, here’s all the different things you could try. Please suggest more in the comments if you think I’ve missed any:

First off, general points:

  • What’s the sample size? Unless it’s fixed in advance, data peeking – checking whether you’ve got a significant result after each scan, and stopping the study when you get one – gives you multiple bites at the cherry.
  • Do you use parametric, or nonparametric analysis?
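
The data-peeking point is easy to demonstrate numerically. Below is a minimal simulation sketch (the sample sizes, number of looks, and simulation count are illustrative choices, not from any real study): two groups with no true difference, tested after each new batch of subjects, stopping at the first p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_fpr(n_sims=2000, n_start=10, n_max=50, step=5, alpha=0.05):
    """Fraction of null simulations that ever reach p < alpha
    when we re-test after each batch of new subjects."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(size=n_max)  # group 1: no true effect
        b = rng.normal(size=n_max)  # group 2: no true effect
        for n in range(n_start, n_max + 1, step):
            # "peek": run the group comparison on the data so far
            if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
                hits += 1
                break  # stop the study and report the "finding"
        # otherwise: run to n_max, find nothing, file-drawer it
    return hits / n_sims

print(peeking_fpr())  # substantially above the nominal 0.05
```

With nine looks at the data, the realized false positive rate comes out well above the nominal 5%, even though every single t-test is individually valid.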

Now, what do you do with the data?

  • Preprocessing
    • How much smoothing?
    • Do you reject subjects for “too much head movement”?  If so, what’s too much?
  • Straightforward whole-brain general linear model (GLM) analysis followed by a group comparison.
    • What’s the contrast of interest? You could make a case for Fear vs Neutral, Happy vs Neutral, Happy vs Fear, “Emotional” vs Neutral.
    • Fixed effects or random effects group comparison?
    • Do you reject outliers? If so, what’s an ‘outlier’?
    • Do you consider all of the Fear, Happy and Neutral blocks to be equivalent, or do you model the first of each kind of block separately? etc.
  • “Region of Interest” (ROI) GLM analysis. Same options as above, plus:
    • Which ROI(s)?
      • How do you define a given ROI?
  • Functional connectivity analysis
    • Whole-brain analysis, or seed region analysis?
      • If seed region, which region(s)?
    • Functional connectivity in response to which stimuli?
  • Dynamic Causal Modelling?
    • Lots and lots of options here.
  • MVPA?
    • Lots and lots of options here.
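
Multiplying out even a subset of these choices shows how fast the pipeline count grows, because independent choices multiply. A toy enumeration, with option sets loosely based on the bullets above (the specific values are hypothetical, for illustration only):

```python
from math import prod

# Hypothetical option sets, loosely based on the bullets above --
# the specific values are illustrative, not from any real study.
options = {
    "contrast": ["Fear-Neutral", "Happy-Neutral",
                 "Happy-Fear", "Emotional-Neutral"],
    "group_effects": ["fixed", "random"],
    "smoothing_fwhm_mm": [0, 4, 8, 12],
    "motion_exclusion_mm": [1, 2, 3],
    "outlier_rule": ["none", "2sd", "3sd"],
    "block_model": ["all-equivalent", "first-block-separate"],
}

# Each analysis pipeline is one combination of choices
n_pipelines = prod(len(v) for v in options.values())
print(n_pipelines)  # 4*2*4*3*3*2 = 576 pipelines from six choices alone
```

And that ignores ROI definitions, connectivity, DCM, MVPA and the structural analyses below, each of which multiplies the total further.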

But remember, we also collected structural MRIs, and while they may have been intended to help analyze the functional data, you could also examine structural differences between groups. What method?

    • Manual measurement of volume of certain regions.
      • Which region(s)?
    • VBM.
    • Cortical morphometry.
      • What measure? Thickness? Curvature…?

That’s just the imaging data. You’ve almost certainly got some other data on these people as well, if only age and gender, but maybe also depression questionnaire scores, genetics, cognitive test performance…

  • You could try to correlate every variable with every imaging measure discussed above. Plus:
    • Do you only look for correlations in areas where there’s a significant group difference (which would increase your chances of finding a correlation in those areas, as there’d be fewer multiple comparisons)?
  • You could define subgroups based on these variables.

So even a very straightforward experiment could give rise to hundreds or thousands of possible analyses. At p=0.05, 1 in 20 of these would give a statistically significant result by chance alone, and even if you throw out half of those for being “in the wrong direction” (and that’s subjective in most cases), you’ve got plenty of false positives.
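
The arithmetic behind that claim: with k tests at α = 0.05, the expected number of false positives is 0.05·k, and (if the tests were independent) the chance of at least one is 1 − 0.95^k. The counts of analyses below are just illustrative:

```python
alpha = 0.05  # conventional significance threshold

for k in (20, 100, 200, 1000):
    expected = alpha * k              # expected number of false positives
    p_any = 1 - (1 - alpha) ** k      # P(at least one), assuming independence
    print(f"k={k:5d}  expected={expected:6.1f}  P(>=1)={p_any:.4f}")
```

Analyses of the same dataset are of course correlated rather than independent, so the exact probabilities differ, but the qualitative conclusion survives: at a few hundred analyses, at least one “significant” result is close to guaranteed.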

This problem is growing. As computational power continues to expand, running multiple analyses is cheaper and faster than ever, and new methods continue to be invented (DCM and MVPA were very rarely used even 5 years ago).

I want to emphasize that I am not saying that all fMRI studies of this kind are in fact junk. My worry is that it’s hard to be confident that any given published study is sound, given that papers are written only after all the data has been collected and analyzed.

I’ll also point out that some imaging research, especially what might be called “pure” neuroscience investigating brain function per se rather than “clinical” studies looking at differences between groups, has many fewer variables to play with, but still quite a lot.

As to how to solve this problem, the one solution I believe would work in practice is to require pre-approval of study protocols.

    Simmons JP, Nelson LD, & Simonsohn U (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-66. PMID: 22006061

    • Anonymous

      Very interesting. Another option would be to require an exact replication (same N, same techniques/analyses, same criteria for outliers, etc.) to be reported for every experiment (in an Experiment 1A/1B format).

    • http://petrossa.wordpress.com/ petrossa

      Thanks. I feel much better now. Not like shouting in the desert.

      fMRI is finding a signal in a high-noise situation where you don't know what signal you are looking for.

      Psychology is just whatever the researcher finds. There is no objective way of determining if what he found is something relevant.

      Beauty is in the eye of the beholder, or in the case of psychology behavior is a pattern established by the observer.

    • Anonymous

      Some more options:

      Motion correction: Which type?

      Regression of motion: Which model?

      Do you use prescan normalized data to remove to some extent the receive field contrast present in multichannel receiver arrays?

    • http://www.blogger.com/profile/08099485960661603080 Matt Craddock

      Here are a couple of EEG ones:

      Reference: average, linked mastoids, nose?

      Filtering: do you use a high pass filter and/or low pass filter? What frequency? What type of filter?

      With ERPS, do you measure peak amplitude, peak latency, mean amplitude?

      What time window do you want to look at? Which electrodes?

      Do you detrend the data?

      Do you want to try source analysis? sLORETA, beamforming, VARETA, LAURE?

      How about topographical mapping of microstates using Cartool?

      If you're doing time-frequency analysis, do you used fixed time windows or do you make them vary with frequency? How many cycles per frequency should you make the window? What frequency resolution do you want? What frequencies do you want to look at? Do you want to use single tapers? Hanning tapers? How about multitapers? How much frequency smoothing/how many tapers? Do you want to use Morlet wavelets? Do you only want to look at activity time and phase locked to stimulus onset, or *all* activity? How about phase locking or connectivity measures? What connectivity measure?

      Actually, I'm going back under my desk before I start looking at my data again…!

    • Julia

      I don't know much about neuroscience or fMRIs, but I'm really interested in this topic. This type of issue is what makes philosophy important to science. Instead of looking at a bunch of data and seeing what we can find, wouldn't it be better to choose a significant and narrow question and keep looking till we find the answer? If the first dozen experiments fail, it means we weren't looking in the right place, not that we weren't looking at the right question. The questions we ask science to explain are just as important as the science itself. In order to make a small step towards understanding the neuroscience of depression, we need to put just as much effort towards understanding the metaphysics of depression or whatever emotion we are studying. Otherwise, neuroscience just has a blank check; it can say whatever it wants because there's no question operationalizing “depression” to rein it in.

    • Anonymous

      Oh, and how could I (previous anonymous) forget temporal interpolation. Do you do it: yes or no? Which type? Do you do it before or after motion correction? Neither order is legitimate because the two operations are not separable. But one or the other is done ubiquitously in fMRI.

    • Anonymous

      I have “made the mistake” of trying to replicate two very exciting findings from my work lately using holdout samples and model testing. It didn't work out so well…. I should have just gone ahead and published the original results!


    • http://www.blogger.com/profile/06647064768789308157 Neuroskeptic

      Matt Craddock: Thanks. EEG analysis has many potential degrees of freedom as well, and again this is a problem that's rapidly growing as new methods become available.

    • http://www.blogger.com/profile/06647064768789308157 Neuroskeptic

      Anonymous: “Very interesting. Another option would be to require an exact replication (same N, same techniques/analyses, same criteria for outliers, etc.) to be reported for every experiment (in an Experiment 1A/1B format).”

      That would certainly help, yes, although it would mean that all research would cost twice as much (approximately)… the good thing about preapproval is that it would be very easy. In most cases the original protocol is written up as part of the grant application or ethics paperwork; it'd be a small step to make them public.

    • http://www.blogger.com/profile/04277620767760987432 Jo Etzel

      Excellent summary. I listed a few of the MVPA choices on my blog, http://mvpa.blogspot.com/2012/07/many-many-options.html . There are a truly staggering number of options for setting up an MVPA. And it doesn't help that it's often explicitly encouraged to explore multiple analysis strategies and then use the one that produced the 'best' results. This can be sensible in some cases, but is very, very dangerous with fMRI datasets.

    • http://neuroamer.wordpress.com/ neuroamer

      I hadn't thought too much about the many types of processing. I guess it's not standard practice to do Bonferroni corrections for the number of processing settings tried as well?

      Since these studies are expensive, it seems like you could solve a lot of this by requiring people to formalize their plans in the grant, and correct for multiple comparisons, when they deviate from their plans.

    • Anonymous


      I don't see what significant aspect of FMRI chicanery that would solve. The problem is that the field just applies “corrections” to the data without (1) Understanding whether they are mathematically and physically legitimate (time skew correction, motion correction) and (2) they propagate errors in through the corrections.

      Who cares about statistical significance of some effect where the error associated with the effect is not estimated? Sadly, almost everyone doing fMRI. Variance and measurement error are NOT SYNONYMS!

    • Anonymous

      Should have read:

      The problem is that the field applies “corrections” to data: (1) Without understanding whether they are mathematically and physically legitimate (eg. separate time skew and motion correction) and (2) Without propagating measurement errors through the corrections.

    • Anonymous

      A challenge?

      Would somebody please cite a human fMRI study which is believed not to be significantly contaminated by systematic error.

    • http://www.blogger.com/profile/08010555869208208621 The Neurocritic

      Joshua Carp gave a talk on this issue several months ago in a slide session at the Cognitive Neuroscience meeting. I was struck by his very first sentence: How vulnerable is the field of cognitive neuroscience to bias?

      He identified 4,608 unique analysis pipelines, and 34,560 significance maps using a sample paper. No voxels were significant under all pipelines, and 87.7% of all voxels were significant under at least one pipeline.

      What are some solutions?
      – choose your pipeline a priori
      – register intentions in advance
      – report ALL pipelines used

      After all that neuroimaging nihilism, he ended on an optimistic note and mentioned http://openfmri.org/ and http://openscienceframework.org/

    • http://www.blogger.com/profile/06647064768789308157 Neuroskeptic

      Neurocritic: Thanks. I missed that but it looks like he noticed this issue first. However he only considered the fMRI analysis options. It's even worse really because of the possibility of doing structural MRI, brain-behaviour or gene-brain correlations…

    • Chris

      These are all very important issues, but in the context of this discussion I think it is worth differentiating between hypothesis driven and exploratory research.

      The hypothetical study described and the potential pipelines listed imply a search for any differences between the two groups in any measure i.e. a completely open exploratory study. With an absence of any constraints, the analysis options are indeed endless and there is a real concern if the resulting paper describes only the one analysis that worked out of the hundreds that were attempted.

      However, the situation is more constrained for hypothesis driven research and the analysis options are significantly reduced. That’s not to say that there isn’t still a issue with the range of analysis options available, but I think the concerns are reduced in the context of a priori predictions.

      The critical point is that both hypothesis-driven and exploratory studies are valuable, but the distinction is important and must be kept in mind. To some extent exploratory studies help define the hypotheses that will then be tested in subsequent experiments.

    • Nitpicker

      @Chris: Thanks for pointing this out! There is a recent trend to diss neuroimaging by making it all sound like a fishing expedition. The distinction between hypothesis-driven and exploratory work should be emphasized a lot more.

      This is not to say that it isn't good to raise these issues and to discuss them. It certainly would be a good thing to have more widely accepted conventions for how to conduct these analyses, when to stray from conventions, and more generally to report all the details of experimental protocols.

      But this obsession with false positives and spurious findings is getting ridiculous in my mind. The dead salmon was very funny but then again that poster did not actually show anything people didn't already know, it didn't show anything that was widespread practice in the neuroimaging community, and it had little bearing on hypothesis-driven research.

      Or at least I would be very surprised if they had had any hypothesis that the postmortem Atlantic salmon should show any neural processing.

    • http://www.blogger.com/profile/06647064768789308157 Neuroskeptic

      Chris: Absolutely. All I'd add is that exploratory research needs to be clearly labelled as such, so that readers can adjust their confidence in the findings accordingly.

      nitpicker: I don't see it as dissing neuroimaging to point out that imaging (like any other field of science involving very large datasets) can be a fishing expedition. It would be dissing (and just wrong) to say that it always is, but even the most optimistic imager would have to admit that there is a potential for dodgy stats here.

    • Nitpicker

      @Neurocritic: Hmmm, while I'm here might as well talk about this one too. I also heard about this study from a colleague some time ago. Like the dead salmon, I don't think this is particularly surprising:

      “He identified 4,608 unique analysis pipelines, and 34,560 significance maps using a sample paper. No voxels were significant under all pipelines, and 87.7% of all voxels were significant under at least one pipeline.”

      I think it would be more surprising if there had been any voxels that were significant under all pipelines. I don't know the details of his analyses but it's very unlikely that when applying such an enormous parameter space that all permutations would produce consistent results. It is also hard to discern if there were some corners in that parameter space that were totally unrealistic. For example, one dimension of this space is likely to be related to thresholding. All you need to do is apply some extremely conservative thresholds and – voila – you end up with a map without significant voxels. Add to that a few parameters that are completely nonsensical and you will get a majority of voxels being significant in some pipelines.

      As with the salmon, I don't deny that this is interesting and that it's worth thinking about this. But I would challenge the notion that most neuroimagers just randomly apply countless different parameter permutations in their analysis. In fact, I'd wager that most people will hardly even stray from the default parameters of their analysis package of choice.

    • Nitpicker

      @Neuroskeptic: Don't get me wrong, I agree with you here. There is nothing wrong with raising awareness of these issues and improving the way experimental protocols are reported.

      What I perceive as dissing is not really the original authors of these studies (e.g. the dead salmon or the voodoo guys) but the dangerous consequences such alarmist papers have. All of these reports have already resulted in mainstream media reports that neuroimaging is flawed and full of spurious findings.

      I disagree with that notion and we are shooting ourselves in the foot if we allow this to go on unchallenged. I don't doubt that there are some crappy papers out there and I certainly welcome efforts to streamline protocols (for example, your idea of pre-registering protocols is not bad, although I remain a bit unconvinced it's feasible).

      As I said above, the distinction between hypothesis-driven and exploratory work should be emphasized more. You hit the nail on the head when you say that exploratory research should be clearly labeled as such. It should also be supported more. At the moment it is relatively impossible to actually secure grant funding for exploratory research without dressing it up in the guise of a hypothesis. This is not an ideal situation. Both kinds of research are important. Exploratory studies should be supported but to minimize spurious conclusions it must also be conducted with great care, strict statistics, and ideally paired with replication attempts to test the hypotheses it generates.

    • Anonymous

      Every instrument – which includes the physical device and associated algorithms – has an error associated with it. The problem with fMRI, at its core, is that errors in measurement are not estimated or reported. As far as I can tell nobody in “the field” even thinks about it. Lots of room for snake oil there.

      Establishing conventions for analysis will not change the fact that error exists in the instrument.

    • Nitpicker

      @error-Anonymous: This assessment of the situation is a bit bleak. People frequently report measurement errors. It is true though that papers showing only t maps do not. You can partly blame the voodoo correlations controversy for the fact that people are even less inclined to show the data now.

      There are certainly ways this can be improved, though, and some conventions for estimating the error in the system should really be a part of it.

    • Anonymous


      Can you point to a paper that you think does a reasonable job estimating error?

    • Nitpicker

      Depends on what you mean. I'm apparently missing your point but anyone showing raw averaged time courses or activations is “estimating” error. In fact, anyone conducting a GLM is estimating error inherently (but admittedly nobody reports it).

      If what you're talking about is (I assume from comments above) the fact that nobody confirms that applying steps like motion correction or bandpass filtering don't in fact introduce errors, that's a fair point, although I am not sure I agree that this is thoroughly theoretically sound. At least I would find it surprising if skipping motion correction in a group analysis would alter results in any other direction than largely inflating variance and thus masking significant effects.

      If you're talking about the fact that the reliability of findings isn't confirmed enough, there is some truth to that. Any good MRI center will have a regular QA procedure in place to ensure that data quality is consistent, but that doesn't usually extend to human data. There are numerous highly robust fMRI findings out there though that would be suitable to fill such a role. Indirectly, any repetitions of such experiments are already done so it may simply be a matter of summarizing those data.

    • Anonymous


      You wrote:

      “If what you're talking about is (I assume from comments above) the fact that nobody confirms that applying steps like motion correction or bandpass filtering don't in fact introduce errors, that's a fair point, although I am not sure I agree that this is thoroughly theoretically sound.”

      A big HUH!? How could asking for error estimates for two commonly applied “corrections” not be “theoretically sound”?

      And then you write:

      “At least I would find it surprising if skipping motion correction in a group analysis would alter results in any other direction than largely inflating variance and thus masking significant effects.”

      Well then you will be surprised. A lot of people are realizing that motion correction introduces (by various means which I will not elucidate here) very significant systematic error into the measurement of the BOLD effect in human subjects. When systematic error conspires with the positive effects filter, snake oil is the result.

    • Nitpicker

      Care to elaborate on the latter? Who are a lot of people and what kind of systematic errors are introduced? I'm not challenging your point. I'm genuinely interested.

      As for the technically sound comment I may answer that when I'm not using the mobile version of this site.

    • Anonymous

      I think that the wishy-washy nature of fMRI is precisely why fMRI has been so well received by the psychiatric community. If you look at recent or upcoming psychiatric conference agendas you will see that they are often filled with fMRI studies, and fMRI is often touted by psychiatrists as the long awaited advance which finally makes a 'science' out of their endeavour. With so much ostensibly 'objective' data combined with so much leeway as to how to interpret said data, you end up with a psychobabbler's dream- supposedly 'objective' findings which are just as subjective as the armchair theorizing behind them.

    • http://www.blogger.com/profile/06832177812057826894 pj

      “fMRI is often touted by psychiatrists as the long awaited advance which finally makes a 'science' out of their endeavour.”

      Au contraire, most practising psychiatrists recognise that fMRI has very little of relevance to tell them at present.

    • DS


      I think Anon may have been referring to research psychologists rather than clinical psychologists. How much fMRI based research is presented at one of the more popular conferences which targets clinical psychologists?

    • http://www.blogger.com/profile/00856481031749235750 Tom Johnstone

      With respect to the countless options available for preprocessing, filtering, modeling and thresholding fMRI data (or any other complex data set involving signals), I don't think it's necessary, or desirable, to specify everything in advance. To do so ignores the extent to which researchers need to adapt their processing to deal with data quality issues on the fly. There is simply no fully automatic way to choose the optimal parameters for preprocessing data; expertise comes into play.

      This is only a problem if researchers check the results of their hypothesised tests/contrasts/comparisons after each new analysis run, because of course that is “cheating” – it will introduce bias. However, if you choose a data quality check that is blind to your hypotheses, then there is nothing wrong with optimising your analysis with respect to that quality check.

      Concretely, if my hypothesis has to do with differences between responses to fearful and neutral facial expressions, it would be wrong for me to tweak my analysis until I maximised the difference between fearful and neutral responses. But if I tweak my analysis to maximise responses to all facial expressions, keeping myself blind to which responses are to fearful and which to neutral faces, then I have done nothing wrong.

    • Anonymous

      Tom wrote:

      “There is simply no fully automatic way to choose the optimal parameters for preprocessing data; expertise comes into play.”

      Which is why instrument error should be propagated through all fMRI analysis. So that we have an estimate of the range of possible results.

      Motion correction and temporal filtering are commonly used in fMRI. Do you propagate the errors associated with the device (the scanner and chosen imaging parameters, plus the motion correction algorithm and temporal filtering and temporal interpolation algorithms)? If so, then what were your error estimates associated with the violation of the rigid body assumption used in motion correction? What were your error estimates associated with spatial interpolation used in motion correction? With respect to temporal interpolation (assuming you used it), what were your error estimates associated with the fact that temporal interpolation cannot be separated from motion correction (which is what almost all algorithms in use do)? Etcetera etcetera …

    • Anonymous

      FYI, research psychologists, clinical psychologists, and psychiatrists are for the most part non-overlapping groups that don't talk to each other as much as you might assume. If you're going to start chucking rotten tomatoes at those “wishy washy” folks, at least try to get things straight.




    About Neuroskeptic

    Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.

