False Positive fMRI Revisited

By Neuroskeptic | December 7, 2015 5:58 pm

A new paper reports that one of the most popular approaches to analyzing fMRI data is flawed. The article, available as a preprint on arXiv, is from Swedish neuroscientist Anders Eklund and colleagues.


Neuroskeptic readers may recall that I’ve blogged about Eklund et al.’s work before, first in 2012 and again earlier this year. In those two studies, Eklund et al. showed that the standard parametric statistical method for detecting brain activations in fMRI data is prone to false positives. The new arXiv paper has the same message, but it goes beyond the earlier studies: whereas Eklund et al. previously showed problems in single-subject fMRI analysis, they now reveal that the same issues affect group-based analyses of task-based fMRI.

This is really scary, because almost all fMRI research uses group-based analysis. Some people previously downplayed the importance of Eklund et al.’s work, saying that the false positive problem would only affect single-subject analyses, which are rare. They hoped that the errors would “cancel out” at the group level. While that seemed plausible, it has turned out to be false.

Oh dear.

The new scary finding is that “parametric software’s familywise error (FWE) rates for cluster-wise inference far exceed their nominal 5% level” – in other words, the chance of getting at least one false positive result is high, much higher than the 5% level which is expected and considered acceptable.
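To see why exceeding the nominal level matters, here is a back-of-the-envelope illustration (my own, not from the paper): if m independent tests are each run at level α = 0.05, the chance of at least one false positive is 1 − (1 − α)^m.

```python
# Back-of-the-envelope FWE illustration (not from the paper): with m
# independent tests each at per-test level alpha, the probability of
# at least one false positive is 1 - (1 - alpha)**m.
alpha = 0.05
for m in (1, 10, 100):
    fwe = 1 - (1 - alpha) ** m
    print(f"m = {m:3d}  ->  FWE = {fwe:.3f}")
```

Real voxels are spatially correlated rather than independent, which is exactly why cluster-wise corrections exist; the point is only that the 5% is meant to apply to the whole family of tests, and the paper shows the parametric corrections fail to deliver it.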

Eklund et al.’s approach was to take resting-state fMRI data and analyze it as if it were part of a task-based experiment. Because there was no task, no activations should have been detected. They considered hundreds of variant analyses, testing three of the most popular parametric fMRI software packages (FSL, SPM, and AFNI) with numerous different parameters for each one (initial cluster-defining threshold, cluster extent, and so on).

The vast majority of the tested parametric analyses produced too many false positive clusters. A notable exception was the “FLAME1” algorithm from the FSL package, which, if anything, produced too few false positives, i.e. it was too conservative. The cluster-defining threshold also mattered: the false positive problem was especially bad with a threshold of p = 0.01. A p = 0.001 threshold was much better, although still not perfect for most parametric tools, and combined with FLAME1 it was extremely conservative.


So far I’ve been talking about cluster-based analyses. In voxel-based group analyses, Eklund et al. found that false positive rates were much lower; indeed, most approaches were too conservative. However, voxel-based analyses are rarely used on their own, because neuroscientists are more interested in finding clusters (“blobs”). So cluster-level false positives are likely to be the major source of error in most research.

The false positive clusters were not evenly spread throughout the brain. Some areas were hot-spots, with the posterior cingulate cortex being the region most prone to false positives, for all software tools. Eklund et al. say that this is probably because the fMRI images have higher local smoothness in this region, which violates the assumptions of fMRI analysis models.

The authors conclude that much of the fMRI literature may be seriously compromised; the finding “calls into question the validity of countless published fMRI studies based on parametric cluster-wise inference.” To really rub salt into the wound, they point out that all of their analyses were corrected for multiple comparisons, yet “40% of a sample of 241 recent fMRI papers did not report correcting for multiple comparisons”, which would make the problem even worse.

Having said that, in many fMRI experiments the key question is not “is there any brain activity at all?” but “where exactly is the activity, and what modulates it?” In other words, most fMRI studies are not intended to test the null hypothesis that the brain is completely unresponsive (even if, statistically, this is part of the analysis process). To put it another way, in many experiments the existence of a spurious blob in the posterior cingulate would not be considered important, because the focus was elsewhere. Other studies compare the magnitude of an activation across two groups, with the existence of the activation being already known. Not all fMRI studies are “blob fishing expeditions”.

But many are, so this is clearly a major problem. What can we do about it? Eklund et al. say that the answer is non-parametric permutation analysis. They tested permutation methods as well, and found that these were the only analysis methods giving the correct level of false positives (5%):

A non-parametric permutation test, for example, is based on a small number of assumptions, and has here been proven to yield more accurate results than parametric methods. The main drawback of a permutation test is the increase in computational complexity… but the increase in processing time is no longer a problem; an ordinary desktop computer can run a permutation test for neuroimaging data in less than a minute.

In other words, there is no excuse for not using them.
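For readers unfamiliar with permutation tests, the general idea for a one-sample group analysis can be sketched in a few lines of NumPy. This is a minimal illustration of sign-flipping under the null hypothesis, my own sketch rather than the authors’ implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sign_flip_pvalue(contrasts, n_perm=5000):
    """One-sided permutation p-value for a one-sample group test.

    contrasts: per-subject contrast estimates. Under the null of no
    effect, each subject's sign is exchangeable, so the null
    distribution is built by randomly flipping signs.
    """
    observed = contrasts.mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=len(contrasts))
        null[i] = (signs * contrasts).mean()
    # add-one smoothing so the p-value is never exactly zero
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

p_null = sign_flip_pvalue(rng.normal(0.0, 1.0, size=20))    # no true effect
p_effect = sign_flip_pvalue(rng.normal(1.5, 1.0, size=20))  # real effect
print(p_null, p_effect)
```

Repeating the first call on fresh null data would reject at roughly the nominal 5% rate, which is precisely the property Eklund et al. found lacking in the parametric cluster corrections.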

Anders Eklund, Thomas Nichols, & Hans Knutsson (2015). Can parametric statistical methods be trusted for fMRI based group studies? arXiv: 1511.01863v1

  • Xin Di

    The basic assumption of the analysis is that if you use a block-designed task function to analyze resting-state data, there will be no activated clusters. Unfortunately, this assumption is not true. There are BOLD fluctuations going on during resting-state, mainly at frequencies between 0.01 and 0.08 Hz. If you correlate resting-state BOLD time series with a task design as used in the paper, say 30 s on and 30 s off, the task design function will occasionally be phase-locked to the low-frequency fluctuations. This will result in a correlation between the task design and the BOLD time series, i.e. you will detect statistically significant clusters in the brain. In resting-state, the set of regions showing high levels of fluctuation coincides with the default mode network. This is why “…the posterior cingulate cortex [is] the region most prone to false positives”. By carefully designing many box-car functions, i.e. using a Fourier set, we can recover the whole default mode network (Di & Biswal, 2014, doi:10.1016/j.neuroimage.2013.07.071). The fact that the PCC is “most prone to false positives” is not surprising at all; it reflects real resting-state fluctuations.

    • Anders Eklund

      We tested both block based and event related paradigms. The figure showing the spatial distribution of false clusters was generated using a randomized event related design, not a fixed block based design.

      • Xin Di

        The same logic applies. The mean ITI is 8 s, corresponding to a frequency of 0.125 Hz. That is a little higher than the conventional low-frequency fluctuation range (0.01 – 0.1 Hz), but very close. Again, the high rate of false positives in the PCC supports my explanation. My point is, the base rate of detecting “activation” using your method should not be 0, and the base rate varies across brain regions.
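The phase-locking argument in this exchange is easy to illustrate numerically. In the toy sketch below (mine, not from either commenter), a 30 s on / 30 s off box-car is correlated with a pure sinusoidal “fluctuation” at the box-car’s fundamental frequency of 1/60 ≈ 0.017 Hz, inside the 0.01–0.08 Hz band:

```python
import numpy as np

tr = 2.0                                  # assumed repetition time (s)
t = np.arange(0, 600, tr)                 # 10 minutes of "scans"
boxcar = ((t % 60) < 30).astype(float)    # 30 s on / 30 s off task regressor
fluctuation = np.sin(2 * np.pi * t / 60)  # 1/60 Hz "resting-state" signal

# When the fluctuation happens to be in phase with the design, the
# correlation is large even though no task was ever performed.
r = np.corrcoef(boxcar, fluctuation)[0, 1]
print(f"correlation = {r:.2f}")
```

Real resting-state fluctuations drift in and out of phase with the design, so the correlation would usually be smaller, but occasionally this large — hence a non-zero, regionally varying base rate of apparent “activation”.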

  • CLS

    @Xin Di: In other words, the “noise” in task fMRI has strong spatial and temporal structure. A consequence is that occasionally you will get a strong, random association with the stimulus. That does not make it “real”. This issue has long been acknowledged, and SPM/FSL/AFNI are supposed to account for it; only most of these procedures are doing it inappropriately, it seems.

    @wandebob @neuroskeptic One of the conclusions of the paper is that non-parametric (permutation) tests should be preferred to parametric tests. The permutation tests seem to play a critical role only for cluster-level statistics (in the sense of finding the significance of a “blob”). The problem thus does not seem to be the parametric nature of the test, but the model of spatial dependencies (the “random field”). Instead of looking for blobs after generating tests at every voxel, there are plenty of clustering methods available to generate good group functional parcels based on full time series. The resulting brain parcels are more likely to capture the spatial structure of task fluctuations than blobs defined through 3D spatial contiguity in a thresholded t-stats map. A simple group parametric GLM applied at the parcel (instead of voxel) level, combined with a classical FDR procedure, should offer tight control of false positives. Parcel-level statistics would avoid relying on computationally expensive permutation statistics, while offering better functional units for reporting than blobs.
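The “classical FDR procedure” mentioned here is typically Benjamini-Hochberg. As a rough sketch of how it would apply at the parcel level (my illustration, with made-up p-values, assuming one p-value per parcel):

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: return a boolean mask of which
    tests (e.g. parcels) survive at false discovery rate q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # q * i / m for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest rank that passes
        reject[order[:k + 1]] = True           # reject all smaller p too
    return reject

# hypothetical per-parcel p-values
parcel_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.5]
print(bh_reject(parcel_p))
```

With a few hundred parcels instead of ~100,000 voxels, this kind of correction is cheap and the spatial-dependence model that troubles cluster inference never enters the picture.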

  • Bob Cox

    Attached is a snapshot of our results thus far for the resting-state analyses “as if they were task” à la Eklund et al. We analyzed the 198 Beijing datasets with AFNI, then did the 1000 random 2-sample t-tests with 1-sided thresholding (at p=0.01 and 0.001, per voxel), and then found the false positive rate for 2 different types of 3dClustSim runs: the “classic” way (Gaussian ACF estimated via 1st differences, as in the Forman 1995 paper), and the “new” way (using the non-Gaussian ACF estimated via multiple differences). You can see that for p=0.001, the “new” ACF method (on top) is pretty close to the nominal band for p=0.001 and for blurring > 4mm. For p=0.01, there is still room for improvement. On the bottom are the corresponding graphs for the “classic” method. The change from “classic” to “new” for p=0.01 is pretty large. Also notable is the major break between the block designs and the event-related times, visible in all 4 graphs. This break may be due to the effect that Xin Di refers to, or to other effects in the “noise”. There is a difficult-to-answer question: “Is resting state data statistically representative of the noise component in task data?”

    • Neuroskeptic

      Thanks very much for these extremely constructive comments!


  • Anders Eklund

    A nice overview by Jeanette Mumford: https://www.youtube.com/watch?v=liVpwHIpjrU

    • Neuroskeptic

      Thanks! Great summary

  • Kevin Black

    Did the authors use a 2-tailed test?
    If so, that would mean SPM is doing it right. The SPM results at an initial threshold of .001 are all at or just below 10% in the graph, and by default SPM reports t tests as one-tailed. Yes, the community often “cheats” by not arguing with the one-tailed tests, but at least SPM tells you exactly what it’s doing.

    • Bob Cox

      I can’t speak for the authors or for SPM in particular, but for AFNI they used a 1-sided test; we are sure of that because we downloaded their scripts and re-ran the analyses (for the Beijing data). We continue to refine our approach, and I have submitted an abstract, hopefully to be presented at HBM in Geneva in June 2016.

      • Anders Eklund

        Looking forward to reading it.

    • Anders Eklund

      We used one-sided tests. A permutation test can only estimate the null distribution for one side at a time, so using two-sided tests would take twice as long for the non-parametric analyses.

      • Kevin Black

        Thank you


