The Reliability of fMRI Revisited

By Neuroskeptic | January 10, 2014 11:57 am

A new paper brings worrying news for neuroscientists using fMRI to study memory:

Across-subject reliabilities were only poor to fair… for novelty encoding paradigms, the interpretation of fMRI results on a single subject level is hampered by its low reliability. More studies are needed to optimize the retest reliability of fMRI activation for memory tasks.

The researchers, David Brandt and colleagues from Marburg, Germany, scanned 15 healthy volunteers twice each.  In order to measure the neural activation associated with memory formation, the subjects were shown word and picture stimuli that they’d never seen before.


In a group-level analysis, when all the volunteers' data were averaged together, several brain areas were activated by the memory task (and these were pretty much the areas one would expect).

However, the degree of activation in different brain areas was not very stable within individuals. Comparing the two activity patterns from scans a month apart, the intraclass correlation coefficient (ICC) for each voxel of the brain was poor (the median ICC was at best 0.35, which is low, and that was in the most favorable of the task's several conditions).

In other words, if an individual shows a huge activation of the hippocampus in one session, it doesn’t mean that will happen whenever they get scanned. Which implies that we shouldn’t ‘read too much into’ a given individual’s degree of activation, as it might be quite different next time around.
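To make the test-retest statistic concrete, here is a minimal Python sketch of the kind of calculation behind such a figure: a two-way random-effects, absolute-agreement ICC(2,1) applied to one voxel's activation estimates from 15 subjects scanned twice. The data here are simulated, not Brandt et al's, and the specific ICC variant is an illustrative assumption.

```python
import numpy as np

def icc_2_1(data):
    """Shrout & Fleiss ICC(2,1): two-way random effects, absolute agreement.
    `data` is an (n_subjects, k_sessions) array of activation estimates."""
    n, k = data.shape
    grand = data.mean()
    subj_means = data.mean(axis=1)
    sess_means = data.mean(axis=0)
    ss_total = ((data - grand) ** 2).sum()
    ss_subj = k * ((subj_means - grand) ** 2).sum()   # between-subject variability
    ss_sess = n * ((sess_means - grand) ** 2).sum()   # between-session variability
    ss_err = ss_total - ss_subj - ss_sess             # residual
    ms_subj = ss_subj / (n - 1)
    ms_sess = ss_sess / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_subj - ms_err) / (
        ms_subj + (k - 1) * ms_err + k * (ms_sess - ms_err) / n
    )

# Simulated single-voxel data for 15 subjects, 2 sessions (made-up numbers):
rng = np.random.default_rng(1)
true_level = rng.normal(0.0, 1.0, (15, 1))            # stable subject differences
stable = true_level + rng.normal(0.0, 0.1, (15, 2))   # little scan-to-scan noise
noisy = rng.normal(0.0, 1.0, (15, 2))                 # no stable subject signal

icc_stable = icc_2_1(stable)  # high: subject differences dominate the noise
icc_noisy = icc_2_1(noisy)    # much lower: scan-to-scan noise dominates
```

An ICC near 1 means subjects who activate strongly in session one also do so in session two; a median around 0.35 means session-to-session noise swamps much of the stable individual signal.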

You might remember that four years ago, I blogged about a review finding that the test-retest reliability of fMRI was modest: Can We Rely On fMRI? That paper reported that the mean test-retest ICC for fMRI is 0.50, which means that these memory activations are well below average.

An interesting discussion followed my tweeting about the new paper.

Simon W Davis pointed out that, four years ago, Miller et al (2009) reported comparable results. The two studies are very similar in some ways, both involving fMRI of volunteers scanned twice, several weeks apart, with memory tasks, although the paradigms were different (novelty encoding in Brandt, vs. episodic retrieval, semantic retrieval, and working memory tasks in Miller).

However, as Kirstie Whitaker observed, there was a difference in interpretation:

Miller et al span it as “brains are different to each other”, Brandt et al as “neuroimaging is noisy”.

To which Davis commented that

This is what 3 years of neurobashing gets you: same study, more conservative interpretation.

Brandt DJ, Sommer J, Krach S, Bedenbender J, Kircher T, Paulus FM, & Jansen A (2013). Test-Retest Reliability of fMRI Brain Activity during Memory Encoding. Frontiers in Psychiatry, 4. PMID: 24367338

  • Xin Di

    Resting-state fMRI has higher reliability than task fMRI. :)

    The resting brain: unconstrained yet reliable.

    • DS

Maybe so, but rs-fMRI suffers from systematic artifacts, and reliably wrong is not much better than unreliably wrong.

  • Rolf Degen

I just read the book “Controversies in Cognitive Neuroscience” by Scott Slotnick, which also centers on the weaknesses of fMRI. According to Slotnick, this method has a poor temporal resolution of about two seconds. “With regard to brain function, this is an eternity.” Many cerebral functions happen within milliseconds. Due to these limitations, fMRI can only identify which regions were active during the entire process, but not interactions that develop over time. One can easily imagine how this might cloud the reliability of the measurements.

    • DS

      This is a shortcoming known to all neuroscientists working in this field. But there are many states of the brain that do evolve on a time-scale of a second or so and those are the states studied.

      • Rolf Degen

        Sounds good. But this would take for granted that for every function you study with fMRI, you already know the local neural activation pattern in the millisecond domain. How can you know this, if you do not have single cell recordings?

  • DS

My guess, as usual, is that this is a motion-related problem. Having said that, someone is likely to rebut that the motion cannot be that different between test and retest. Maybe. Maybe not. But there are many reasons to suspect that the means by which we correct for motion are unreliable, and therefore could deliver different results despite the motion (assumed to be significant) being similar in quality and magnitude for the test and the retest.

So to test my hypothesis, we should go to great lengths to constrain the head, verify that those means are effective by non-MRI methods, and then do the motion-constrained test/retest.

    Is there really any other way?

  • vadim

Worth adding the (in)consistency of functional localizers of visual areas: briefly, while e.g. the FFA persists across scans, there is large variability in which voxels pass the threshold.

  • Guido Biele

I think it makes sense to look at the authors’ measure of reliability.
The ICC compares within- to between-group variability, which I believe is problematic for (at least) two reasons here:
– the within-group variability is calculated from only 2 data points. That strikes me as not very reliable.
– it is easy to get a low ICC in a scenario where all participants have similar t-values in one session and all participants again have similar, but on average higher, t-values in the second session (e.g. because they moved less…).

More generally, it is not obvious that the ICC is the right measure for the task here.
One simple alternative would be to model the group effects of the two sessions together while including session as a dummy variable; then one could check whether there is a systematic effect of session for the whole group (though this analysis obviously would not catch cases where negative and positive deviations cancel each other out. For that, one could take the difference between sessions, take its absolute value, and run a t-test on it).

Having said that, I have to admit that after having seen lots of fMRI data and the noise in it, I intuitively believe that individual-level results of fMRI analyses are not very reliable (unless one collects lots of data in a session).
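Biele's low-ICC scenario is easy to simulate. In this minimal numpy sketch (hypothetical numbers, not data from the paper), every subject's t-value shifts up by roughly the same amount in session two: the rank order of subjects, and hence the between-session correlation, stays high, while an absolute-agreement ICC would count the uniform shift as error. A paired t-test on the session differences, along the lines he suggests, picks the shift up as a systematic session effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 15
subject_effect = rng.normal(0.0, 0.5, n)                # modest true subject differences
session1 = 2.0 + subject_effect + rng.normal(0.0, 0.1, n)
session2 = session1 + 1.0 + rng.normal(0.0, 0.1, n)     # everyone shifts up in session 2

# Subjects keep their rank order, so the between-session correlation is high...
r = np.corrcoef(session1, session2)[0, 1]

# ...but an absolute-agreement ICC would score the uniform shift as disagreement.
# Biele's alternative: test the session differences directly (a paired t-test).
d = session2 - session1
t_session = d.mean() / (d.std(ddof=1) / np.sqrt(n))     # large |t|: systematic session effect
```

The large t-statistic localizes the problem to a whole-group session effect (everyone higher the second time) rather than to subject-level unreliability, which is exactly the distinction an ICC alone cannot make.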


  • Jon Brock

    So how does test-retest reliability compare to split-half reliability? That might give a sense of how much of the variability is due to inherent unreliability of fMRI vs genuine variation in the brain response from session to session.

  • Ben Seymour

    Novelty encoding does seem an odd choice of task about which to make general inferences about test-retest reliability. It would seem rather likely that there could be interactions with session order effects, which would confound any interpretation about test-retest reliability.

    • Neuroskeptic

      Good point. The second session is not as “novel” as the first one.





About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.

