Unreliability of fMRI Emotional Biomarkers

By Neuroskeptic | May 24, 2017 9:21 am

Brain responses to emotional stimuli are highly variable even within the same individual, and this could be a problem for researchers who seek to use these responses as biomarkers to help diagnose and treat disorders such as depression.

That’s according to a new paper in NeuroImage, from University College London neuroscientists Camilla Nord and colleagues.

Nord et al. had 29 volunteers perform three tasks during fMRI scanning. All of the tasks involved pictures of emotional faces, which are commonly used in emotion biomarker studies. Each volunteer performed the fMRI tasks four times: twice on one day, and then twice again on another day roughly two weeks later.

The reliability of the neural responses to emotions was defined using the intraclass correlation coefficient (ICC). Higher values indicate more consistency within individuals.
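To make the ICC concrete, here is a minimal sketch of one common variant, the one-way random-effects ICC(1,1), computed from ANOVA mean squares. It is an illustration only: the post doesn't say which form of the ICC Nord et al. used, and the function name `icc_oneway` and the toy numbers are invented for this example.

```python
import numpy as np


def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    `scores` is an (n_subjects, k_sessions) array, e.g. one activation
    estimate per volunteer per scan. Hypothetical helper, not the
    authors' code.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand_mean = scores.mean()
    subject_means = scores.mean(axis=1)
    # Between-subject mean square: how much volunteers differ from each other.
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    # Within-subject mean square: how much each volunteer varies across scans.
    ms_within = np.sum((scores - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Made-up data: four volunteers, two sessions each.
consistent = np.array([[1.0, 1.1], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2]])
inconsistent = np.array([[1.0, 4.0], [4.0, 1.0], [2.0, 3.0], [3.0, 2.0]])

print(icc_oneway(consistent))    # close to 1: stable ranking of volunteers
print(icc_oneway(inconsistent))  # negative: within-subject variance dominates
```

In the second toy dataset every volunteer has the same mean across sessions, so all of the variance is within-subject and the estimate comes out negative. That is the situation in which a negative ICC is conventionally read as zero reliability.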

The results showed poor reliability of the activation in key ‘emotional’ brain areas, the amygdala and the subgenual anterior cingulate cortex (sgACC). The emotional face tasks did cause activation in these regions, but the strength of the effect wasn’t consistent within individuals.

By contrast, activation in the fusiform face area (FFA), a brain area that responds to faces, rather than emotions, was pretty consistent. Here are the reliabilities between the two scanning days:

[Figure: bar chart of between-day ICCs by region, drawn from Table 3 of the paper]

The reliability scores within each scanning day were little better than those between days. An ICC below +0.4 is generally considered ‘poor’, and most of the ICCs were well below this, except for the FFA.

Nord et al. comment that these results suggest that the amygdala and the sgACC may not make ideal fMRI biomarkers:

We observed surprisingly low within-subject reliability of three putative fMRI biomarkers, the amygdalae and sgACC response to emotional faces… our results suggest that many of the task-evoked responses assumed to be stable within individuals may not in fact be viable fMRI biomarkers, at least not using these three common paradigms.


I asked senior author Jon Roiser for some additional comments on what these results mean. He kindly replied:

Neuroskeptic: You used an optimized fMRI sequence in a 1.5 Tesla MRI scanner. What do you think are the implications of these results for studies at higher field strengths, or using other sequences?

Jon Roiser: I am not sure that we can extrapolate these results to higher field strength scanners, or even predict whether we would expect the reliability results to be better or worse. In part it would depend on the optimisation of the sequence. On the one hand, scanners with higher field strength inherently have greater signal-to-noise ratio, which should improve measurement reliability. On the other hand, echo planar images from higher strength scanners are also more affected by dropout caused by the susceptibility artifact, which is a concern in the specific regions we assessed.

Plichta et al. (2012), working at 3T, reported even lower reliability in the amygdala than we found, for one of the same tasks we used. Sauder et al. (2013) reported similar amygdala ICCs to ours, again at 3T. Based on these results, I would say that moving to higher field strength wouldn’t necessarily result in improved reliability.

Neuroskeptic: What’s driving the low reliability you saw in the amygdala and sgACC? Is it a “real” biological fact that activity in these areas varies within-subject, or is the variability a product of the fMRI measurement in some way?

Jon Roiser: I don’t know. It’s not simply an issue to do with the measurement of haemodynamic responses per se, as we generally saw excellent reliability in the FFA – again similar to Sauder et al. (2013). But on the other hand, the FFA is a superficial region in which measurements will benefit particularly from the 32 channel head coil we used. The amygdala and sgACC are deeper structures than the FFA, and therefore they’re further from the head coil, so it’s possible this would affect the MRI measurement. However, I think it’s also perfectly possible that the underlying neuronal responses are quite variable over time.

Nord CL, Gray A, Charpentier CJ, Robinson OJ, & Roiser JP (2017). Unreliability of putative fMRI biomarkers during emotional face processing. NeuroImage. PMID: 28506872

  • Harrison

    The strong negative ICCs are weird, especially when the negative effects are so strong in some regions.

    From a quick skim, they don’t seem to talk about it much, except to say ‘A negative ICC is usually interpreted as a reliability of zero (Bartko, 1976), since the theoretical limits of the ICC are 0–1 (although negative values can occur, when the within-groups variance exceeds the between group variance, this is outside the theoretical range (Lahey et al., 1983))’

    I don’t see the graph in the paper — might want to double check the numbers, or else there is something off about this.

    • Daniel Ozer

      Negative ICC’s arise when the between group variance is smaller than would be expected given the within group variance, as when the F ratio is less than 1.00. Typically, the negative variance component is set to 0 so the resulting ICC is 0. If this is not sampling error (n=29 is pretty small for talking about reliability estimation, at least in behavioral contexts) something very strange is happening to create data like this.

      • Johan Carlin

        The CI’s for the ‘negative’ bars include zero so I think noise is a pretty compelling explanation.

        If I’m not mistaken, the blue bar for FFA is non-significant as well.

        Good example of how compelling apparently-large effects can be when error bars are omitted from figures.

        • Neuroskeptic

          Blame me for that, not the authors – I made the figure from their Table 3.

    • Neuroskeptic

      The figure is mine, based on Table 3 from the paper. I think I extracted the values correctly.


      • Harrison

        looks right.

  • Yoni Ashar

    Wonder whether the subjective emotional experience was reliable over time. If not (i.e., stimuli evoke a weaker response at second viewing), that could be contributing to the observed results, and we’d need to develop reliable behavioral tasks first.

  • Tom Johnstone

    The Nord et al. article is informative about the probably very low test-retest reliability of amygdala activation when using the types of acquisitions, data extraction and analyses that many researchers are likely to use. However, many of the points we made in our 2005 paper ( https://doi.org/10.1016/j.neuroimage.2004.12.016 ) centred on the fact that studies of treatment effects (and, by extension, the use of amygdala activation as a biomarker) require very careful consideration of task design, fMRI sequence and slice prescription, and the way signal changes are quantified and extracted.

    Some main points:

    i) all the tasks in the Nord et al. paper were active tasks, but there is evidence from a substantial body of research that active tasks can reduce amygdala activation to faces (due to top-down inhibitory effects, for example, which might themselves not be reliable and are likely very state-dependent) – we discussed this in our 2005 article.

    ii) the amygdala habituates, which is why we used brief, masked presentations – all the tasks in Nord et al. were longer duration, unmasked presentations. Amygdala “habituation” is possibly due to top-down regulatory effects that are more likely to affect amygdala signal with longer, explicit duration. If the purpose is to use “bottom-up” amygdala reactivity as a biomarker, then brief, masked faces make more sense.

    iii) 32 channel head coils, as used by Nord et al., while potentially offering better SNR in the cortex, might well reduce SNR in subcortical structures (e.g. http://www.ncbi.nlm.nih.gov/pubmed/21618334). https://practicalfmri.blogspot.co.uk/2013/07/12-channel-versus-32-channel-head-coils.html has a great discussion on 32 versus 12 channel coils in fMRI

    iv) you can optimise the scan protocol for either vmPFC/sgACC or for amygdala, but if you try to do both with one protocol, you are compromising both as well. If the purpose is to use fMRI of the amygdala as a biomarker, you really need to optimise just for the amygdala – that’s why we used tilted, partial brain coronal acquisition and reported SNR and signal coverage

    v) Nord et al. didn’t discuss the issue of whether to use raw parameter estimates (what they used), z-scores or % signal change. We examined and discussed this, with % signal change yielding better reliability

    vi) 8mm smoothing is possibly too much for a small structure (with even smaller sub-regions) such as the amygdala, with the potential for partial volume effects

    vii) our atlas-based anatomical ICCs were substantially lower than those extracted from a functionally-defined ROI, which was obtained using a voxelwise threshold and extracting the mean contrast from all amygdala voxels exceeding that threshold. For functionally-defined ROIs, Nord et al. used a 4mm spherical extraction based on the max voxel position. But the position of the max voxel possibly has little reliability itself, and a 4mm sphere will miss some activated voxels and include some non-activated voxels (as well as potentially non-amygdala voxels). I think our functionally-defined ROI approach was better (and continue to think that spherical ROIs based on max voxel positions should not be used at all), but these days for targeted studies of well-defined subcortical regions like the amygdala I think we should be moving towards individual-subject extractions of activated voxels if interested in test-retest or treatment effects.

    viii) It is worth noting that our *average measure* ICCs were substantially higher than our single measure ICCs – so it might be worthwhile conducting repeated scan sessions to obtain reliable amygdala activation estimates

    Imaging the amygdala is difficult, and if amygdala activation is to be used as a biomarker, there is no real alternative to using a dedicated amygdala acquisition, data extraction and analysis protocol, with very careful consideration of what task to use (which should be based on what it is you’re trying to measure, e.g. bottom-up reactivity to threat-relevant stimuli). Using protocols designed fully or even partially for other purposes risks producing unreliable measures, as the Nord et al. paper nicely demonstrates.

  • Tom Johnstone

    Hmmm. I posted quite a detailed comment this morning but it seems it’s been marked as “spam” – any chance that could be reinstated? Thanks.

    • Neuroskeptic

      Oops, I’ve now approved it!

  • stmccrea

    Don’t know why this is surprising to anyone. Different people obviously react differently to the same thing, even at different times. People aren’t machines, and fMRI is never going to “pinpoint” emotional experiences, because they are not “pinpointable.” If that’s a word…


