Inflated False Positives in fMRI: SPM, FSL and AFNI

By Neuroskeptic | May 7, 2015 5:52 am

Back in 2012 I discussed an alarming paper showing very high rates of false positives in single-subject fMRI analyses. Swedish researchers Anders Eklund and colleagues had tested the performance of one popular software tool for the statistical analysis of fMRI data, SPM8.

But what about other analysis packages?

Now, Eklund et al. are back with a new study, which has not been published yet, but was presented last month at the International Symposium on Biomedical Imaging (ISBI). This time around they compared three popular packages, SPM8, FSL 5.0.7, and AFNI – and they show that all three produce too many false positives. Edit: the conference paper is available here.

Broadly speaking, FSL had the highest false positive rate, approaching 100% in some cases. AFNI was slightly better than SPM, but even AFNI gave false positive rates of 10–40%, depending on the parameters. The nominal rate is 5%. As in the 2012 paper, the problem was most serious for block designs.

Here are the results. These graphs show the proportion of single-subject analyses with at least one significant cluster, at a nominal familywise error (FWE) corrected alpha of 0.05. Higher is worse:

[Figure: familywise false positive rates for SPM8, FSL 5.0.7 and AFNI]

Eklund et al.’s data were resting state fMRI scans from two centers, Cambridge and Beijing. They analyzed these data as if they were part of a task design, in which stimuli were presented at certain times. Since there was in fact no task, and no stimuli, no stimulus-triggered activation ‘blobs’ should have been seen.
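To make the logic concrete, here's a minimal sketch of what such a "null" analysis looks like (plain NumPy/SciPy rather than any of the three packages; all acquisition parameters are made up for illustration). A block-design regressor is convolved with a haemodynamic response function and fitted to data in which no task ever occurred, so anything that survives thresholding is by construction a false positive:

```python
import numpy as np
from scipy.stats import gamma

# Hypothetical acquisition parameters, for illustration only
n_vols, tr = 200, 2.0                   # 200 volumes, TR = 2 s
t = np.arange(n_vols) * tr

# A 10 s on / 10 s off block design, even though no task was presented
box = ((t % 20) < 10).astype(float)

# Simplified canonical double-gamma HRF
ht = np.arange(0, 32, tr)
hrf = gamma.pdf(ht, 6) - 0.35 * gamma.pdf(ht, 12)
hrf /= hrf.sum()
regressor = np.convolve(box, hrf)[:n_vols]

# A "resting-state" voxel time series: here just noise, no task signal
rng = np.random.default_rng(0)
y = rng.standard_normal(n_vols)

# Ordinary least squares fit of the null regressor plus an intercept
X = np.column_stack([regressor, np.ones(n_vols)])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = n_vols - X.shape[1]
se = np.sqrt(resid @ resid / dof * np.linalg.inv(X.T @ X)[0, 0])
print("t-statistic for the non-existent task:", beta[0] / se)
```

With truly white noise, this t-statistic crosses the nominal threshold about 5% of the time; Eklund et al.'s point is that real fMRI noise, whose autocorrelation the packages mismodel, pushes that rate far higher.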

The authors remark that

It is clear that all three software packages give unreliable results… It is disappointing that the standard software packages still give unreliable statistical results, more than 20 years after the initial fMRI experiments.

We should note that Eklund et al. only considered single-subject analyses. It’s not clear if group level analyses are also affected.

Why is the false positive rate so high? The authors say that the problem lies in the assumptions made by each package about the statistical properties of the noise found in fMRI data. Each package has its own problem: SPM has a “too simple noise model” while FSL “underestimates the spatial smoothness”, for instance.

Eklund et al. conclude that given the problems with parametric statistics approaches to fMRI – as represented by SPM, FSL and AFNI – it may be time for neuroscientists to embrace nonparametric analysis, which makes fewer assumptions.

Anders Eklund kindly agreed to answer some of my questions about these results and what they could mean. Here’s what he said:

Q: Could there be a way to optimize parametric approaches and make them more valid? Or should we all just move to non-parametric methods?

A: The SPM group is currently working on an improved noise model for SPM12; it would be interesting to test whether it gives lower familywise error rates than SPM8. Even if parametric approaches were optimized, it would still be hard to use them for multivariate statistical methods, which have more complicated null distributions.

A non-parametric approach, like a permutation test, can thereby solve two problems in fMRI. First, to give familywise error rates that are closer to expected values. Second, to enable multivariate statistical methods with complicated null distributions, which may give a higher statistical power.
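To illustrate the permutation idea, here is a rough toy sketch (my own, not the implementation from the paper): re-fit the model many times with a randomly time-shifted regressor, and compare the observed statistic against the empirical null distribution of the maximum statistic across voxels, which controls the familywise error rate without parametric assumptions:

```python
import numpy as np

def max_t_statistic(Y, x):
    """Maximum absolute t-statistic over voxels for one regressor.
    Y: (n_timepoints, n_voxels), x: (n_timepoints,)."""
    X = np.column_stack([x, np.ones(len(x))])
    beta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    dof = Y.shape[0] - X.shape[1]
    var = (resid ** 2).sum(axis=0) / dof * np.linalg.inv(X.T @ X)[0, 0]
    return np.max(np.abs(beta[0] / np.sqrt(var)))

rng = np.random.default_rng(1)
Y = rng.standard_normal((200, 500))   # toy data: 200 volumes, 500 voxels
x = np.sin(np.arange(200) / 5.0)      # toy task regressor

observed = max_t_statistic(Y, x)

# Build the null by circularly shifting the regressor in time; this
# preserves its autocorrelation while breaking any alignment with the data
null = np.array([max_t_statistic(Y, np.roll(x, rng.integers(1, 200)))
                 for _ in range(1000)])

# Familywise-corrected p-value: how often the null maximum beats observed
p_fwe = (null >= observed).mean()
print(f"permutation-based familywise p = {p_fwe:.3f}")
```

Shifting the regressor rather than the data is just one simple resampling scheme; the point is that the null distribution is learned from the data instead of assumed.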

Q: All three packages gave more than 5% false positives, but it seems that FSL had an even higher error rate than the other two, especially for block designs. Do you think that this would hold for other datasets, or might it be specific to this study?

A: Hard to say; we are not sure what the main problem with FSL is, except that the FSL software gives a lower smoothness estimate than SPM and AFNI.

According to some researchers, resting state data is not optimal for testing false positives (since it has different characteristics compared to task data). An alternative approach could be to analyze task data, using a regressor that is orthogonal to the true paradigm.

One could for example analyze task data from the HCP, which were collected with a short TR (0.72 s) and a multiband sequence. According to our 2012 Neuroimage paper, a short TR is very problematic for the SPM software, due to its simple noise model. It would be interesting to see if such a short TR is also problematic for FSL and AFNI.

Q: How does this new analysis differ from your Eklund et al. 2012 Neuroimage paper?

A: Several things differ. We looked at 396 rest datasets instead of 1484. We only considered cluster level inference, not voxel level inference; in the previous paper we looked at both. And we tried two cluster defining thresholds (the threshold that is applied to all voxels to form clusters): p = 0.01 (z-score of 2.3, the default in FSL) and p = 0.001 (z-score of 3.1, the default in SPM).

We noticed that the cluster defining threshold has a very large impact on the familywise error rates; a lower threshold (z = 2.3) gives higher familywise error rates than a higher threshold (z = 3.1). This is consistent with a recent paper (Woo et al. 2014). A possible reason is that a lower threshold is more sensitive to the assumption that the spatial smoothness is constant across the brain.
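To see what the cluster defining threshold does in practice, consider this small synthetic sketch (using scipy.ndimage for cluster labelling; the data are made up). The statistic map is first thresholded voxel-wise, and only the surviving contiguous clusters are then tested by their extent; a lower threshold merges voxels into fewer, larger clusters, which is exactly where the constant-smoothness assumption matters most:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(2)
z_map = rng.standard_normal((40, 40, 40))    # toy z-statistic volume
z_map = ndimage.gaussian_filter(z_map, 2.0)  # give it spatial smoothness
z_map /= z_map.std()                         # re-standardise to unit z

for z_thresh in (2.3, 3.1):                  # FSL and SPM defaults
    mask = z_map > z_thresh
    labels, n_clusters = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, range(1, n_clusters + 1))
    largest = int(sizes.max()) if n_clusters else 0
    print(f"z > {z_thresh}: {n_clusters} clusters, largest = {largest} voxels")
```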

  • practiCalfMRI
    • Neuroskeptic

      That’s another very interesting paper that I’m going to blog soon! There’s a whole surge of them this week!

    • Cyril

That reminds me of the tweet I sent you about floating point during preprocessing, because scanners don’t use all the bits available…


  • Cyril

A quick one to say that the SPM12 noise model is better, with AR(1) (pooled estimates) + local white noise, and it has a dedicated filter for multiband – so of course if the authors tried AR(1) for multiband, that would not work…

Besides this, clearly something is off in the structure of the noise; a few years back @ChrisFiloG tested many procedures for his thesis, but using clean simulated data, and the FWER was well controlled.

  • DS

    Neuroskeptic wrote: “The authors say that the problem lies in the assumptions made by each package about the statistical properties of the noise found in fMRI data.”

    (1) What are the statistical assumptions about the noise?

    (2) What are the noise sources that are being modeled?

    • Anders Eklund

      1)

      SPM uses a global AR(1) model, a single AR parameter for the entire brain (set to 0.2).

      FSL estimates an AR model in each voxel, and then uses a Tukey-Taper to smooth the model in the frequency domain.

      AFNI uses a voxel-wise ARMA(1,1) model.

      2)

Yes, the errors of the general linear model (GLM), Y = XB + e. In the standard GLM the errors e are assumed to be independent, but in fMRI they are temporally correlated, so each package models them using a different approach.
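A minimal sketch of what such a model does in practice, mimicking SPM's global AR(1) approach with a fixed coefficient of 0.2 (an illustration, not the actual SPM code): both data and design are filtered with the AR(1) whitening matrix before the ordinary least squares fit, and if the assumed coefficient is wrong, the "whitened" residuals remain correlated:

```python
import numpy as np

def ar1_prewhiten(y, X, rho=0.2):
    """Filter data and design with the AR(1) whitening matrix W,
    so that W y = W X b + e has (approximately) white errors e."""
    n = len(y)
    W = np.eye(n)
    W[0, 0] = np.sqrt(1 - rho ** 2)   # scale the first sample
    for i in range(1, n):
        W[i, i - 1] = -rho            # e_t - rho * e_{t-1}
    return W @ y, W @ X

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.sin(np.arange(n) / 5.0), np.ones(n)])

# Null data: AR(1) noise with true rho = 0.4, deliberately stronger
# than the fixed rho = 0.2 the model will assume
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.4 * e[i - 1] + rng.standard_normal()
y = e

yw, Xw = ar1_prewhiten(y, X, rho=0.2)
beta, _, _, _ = np.linalg.lstsq(Xw, yw, rcond=None)
print("estimated effect under the null:", beta[0])
```

Because the assumed rho is too small here, prewhitening is incomplete: the residuals stay positively correlated, the effective degrees of freedom are overestimated, and t-statistics come out too large, which is the failure mode attributed to SPM's overly simple noise model.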

      • DS

        Thanks Anders

        AR being an AR(0) model?

        • Anders Eklund

          AR(0) means white noise

          FSL estimates an AR model with a number of parameters that depends on the data, I don’t know how many AR parameters they use.

          I’ve put the full paper here

          https://dl.dropboxusercontent.com/u/4494604/article.pdf

          If you look at figure 3, you see that SPM has the worst noise model (the power spectra of the residuals will be flat if the residuals are uncorrelated). FSL has an intermediate noise model and AFNI seems to have the best noise model.

          In figure 1 you can see that SPM gives a higher estimate for the smoothness of the data, which “saves” SPM compared to FSL, that gives a much lower smoothness estimate.
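The flat-spectrum check is easy to reproduce on any residual time series (a rough sketch, not the paper's code): white residuals give a flat periodogram, while under-modelled autocorrelation shows up as excess low-frequency power:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

# Residuals from a perfect noise model: white, so a flat spectrum
white = rng.standard_normal(n)

# Residuals from an under-modelled AR(1) process: low-frequency excess
ar = np.zeros(n)
for i in range(1, n):
    ar[i] = 0.4 * ar[i - 1] + rng.standard_normal()

for name, resid in [("white", white), ("AR(1)", ar)]:
    power = np.abs(np.fft.rfft(resid - resid.mean())) ** 2 / n
    low, high = power[1:20].mean(), power[-20:].mean()
    print(f"{name}: low-frequency power {low:.2f}, high-frequency {high:.2f}")
```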

  • Martin Hebart

This is definitely a very interesting finding. But I’m not sure whether this overemphasizes the problem a little and points in the wrong direction: how many people actually use the residual variance at the subject level in their analyses? The large majority of studies report results at the group level.

However, by the same argument put forward by Anders, classical group-level or “second-level” random-effects analyses should, if anything, be *less* sensitive when noise is misspecified. Misspecification at the subject level may increase the variance of the parameter estimates (“betas”), but does not bias them away from 0. More variability in parameter estimates reduces statistical power at the group level. Since the majority of studies discards the subject-level variability but uses non-optimal unbiased parameter estimates, I would even say that we commonly have more false *negatives* than expected (provided that there is not another source of bias). The story might be different for mixed-effects analyses that utilize subject-specific noise variance.
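Hebart's argument is easy to check with a toy simulation (my own numbers, purely illustrative): if first-level misspecification inflates the variance of the per-subject betas without biasing them, a second-level one-sample t-test stays at its nominal false positive rate and only loses power:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_subj, n_sims, alpha = 20, 10000, 0.05

# Per-subject betas under the null: unbiased (mean 0) but with their
# variance inflated (x3) by a hypothetically misspecified first level
betas = rng.standard_normal((n_sims, n_subj)) * 3.0

# Second-level one-sample t-test across subjects, for each "study"
_, p = stats.ttest_1samp(betas, 0.0, axis=1)

print(f"group-level false positive rate: {(p < alpha).mean():.3f}")
# Roughly 0.05: the variance inflation costs power, not validity
```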

  • Scirus E

Doing my PhD in the same lab as Dr Eklund, I have to say his work was always rock solid. Kudos to him for finally benchmarking the most popular fMRI software packages in this meaningful way! The next step is to fix these problems, or to actually start using a nonparametric fMRI software package.

    • adam dvorak

I had a handful of coworkers attempt to murder me. I witnessed 3 of them trying to lure me into one of three traps; the other two traps were just surrounded by highly suspicious circumstances. I would love to sign up as a test subject to prove to the police I’m not fabricating any of my account of events. Even better would be to get those three individuals into an fMRI scanner.

  • Euler

    Off topic, but I’m curious about your take on the results of the recent attempt to check reproducibility in psychology:

    http://www.nature.com/news/first-results-from-psychology-s-largest-reproducibility-test-1.17433

It doesn’t seem good at first, but there is a lot to consider that could make it better or worse than it seems. Either way, it’s the sort of thing you would be interested in.

  • Neuroskeptic

    That’s extremely interesting – please do report back! (and don’t worry if your comments get caught in the spam filter, I will approve them asap.)

  • Bob Cox

After running a number of variations on the Beijing datasets, I’ve got a couple of tentative things to report (and puzzle over). First, there are a few datasets — one shown in my previous comment — that have huge “active” clusters in white matter. There are 4-5 such datasets (2% or so of the 198). These can be “tamped” down by using the WM average time signal as a regressor of no interest. This change, however, doesn’t strongly affect other datasets’ results — neither did using the first 2-3 principal components of the WM and CSF masks. I also used the “blur to a fixed FWHM” method in AFNI to try to blur each dataset to a uniform spatial smoothness. This also did not make much difference. I restricted the analysis to GM (where the threshold cluster sizes are of course smaller than for the whole brain mask) — this also didn’t change the percentages much — 15-20% false positives instead of 5%.

After doing some simulations and looking closely at the smoothness estimates as a function of time, I’ve come to the tentative conclusion that the problem lies with longer-than-expected tails in the statistic distribution (as evinced in Eklund’s Fig 4). And I think this arises from some non-stationarity (through time) in the (a) variance [heteroskedasticity], (b) temporal correlation, and/or (c) spatial correlation [smoothness]. These are too hard to model directly — so I’m now leaning towards a resampling approach to get the testing done for single subjects.

    Alas, I’m too busy preparing for my internal review to push this forward now (I’d like to keep my job). I have some ideas, but they’ll have to wait.

Finally, I’d like to guess that the inter-subject averaging inherent in group analysis will wash all the sins of single-subject statistical apostasies away. And I’m still planning on working on this issue if/when I get past the review, since IMHO single-subject analysis for biomarkers is something that may be important some day — and getting a handle on the statistics is a part of that.

