# “Voodoo Correlations” in fMRI – Whose voodoo?

It’s the paper that needs little introduction – Ed Vul et al.’s “Voodoo Correlations in Social Neuroscience”. If you haven’t already heard about it, read the Neurocritic’s summary here or the summary at BPS Research Digest here. Ed Vul’s personal page has some interesting further information here. (Probably the most extensive discussion so far, with a very comprehensive collection of links, is here.)

Few neuroscience papers have been discussed so widely, so quickly, as this one. (Nature, New Scientist, Newsweek and Scientific American have all covered it.) Sadly, both new and old media commentators seem to have been more willing to talk about the implications of the controversy than to explain exactly what is going on. This post is a modest attempt to, first and foremost, explain the issues, and then to evaluate some of the strengths and limitations of Vul et al.’s paper.

[Full disclosure: I’m an academic neuroscientist who uses fMRI, but I’ve never performed any of the kind of correlational analyses discussed below. I have no association with Vul et al., nor – to my knowledge – with any of the authors of any of the papers in the firing line. ]

1. Vul et al.’s central argument. Note that this is not their only argument.

The essence of the main argument is quite simple: if you take a set of numbers, then pick out some of the highest ones, and then take the average of the numbers you picked, the average will tend to be high. This should be no surprise, because you specifically picked out the high numbers. However, if for some reason you forgot or overlooked the fact that you had picked out the high numbers, you might think that your high average was an interesting discovery. This would be an error. We can call it the “non-independence error”, as Vul et al. do.
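To make this concrete, here is a minimal sketch in Python. The data are made up (pure random noise centred on zero), and the sample size and the number of values picked are arbitrary choices of mine, not anything from Vul et al.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# 1,000 numbers drawn from pure noise centred on zero (hypothetical data).
numbers = [random.gauss(0, 1) for _ in range(1000)]

# Pick out the 50 highest values, then average only those.
top = sorted(numbers, reverse=True)[:50]

mean_all = statistics.mean(numbers)
mean_top = statistics.mean(top)

print(f"mean of all numbers:    {mean_all:.2f}")
print(f"mean of the picked top: {mean_top:.2f}")
```

The second average comes out well above the first, even though there is nothing in the data except noise. The high value is a product of the picking, not a discovery.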

Vul et al. argue that roughly half of the published scientific papers in a certain field of neuroscience include results which fall prey to this error. The papers in question are those which attempt to correlate activity in certain parts of the brain (measured using fMRI) against behavioural or self-report measures of “social” traits – essentially, personality. Vul et al. call this “social neuroscience”, but it’s important to note that it’s only a small part of that field.

Suppose, for example, that the magnitude of the neural activation in the amygdala caused by seeing a frightening picture was positively correlated with the personality trait of neuroticism – tending to be anxious and worried about things. The more of a worrier a person is, the bigger their amygdala response to the scary image. (I made this example up, but it’s plausible.)

The correlation coefficient, r, is a measure of how strong the relationship is. A coefficient of 1.0 indicates a perfect linear correlation. A coefficient of 0.4 would mean that the link was a lot weaker, although still fairly strong. A coefficient of 0 indicates no correlation at all. This image from Wikipedia shows what linear correlations of different strengths “look like”.
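For readers who prefer code to pictures, the sketch below computes Pearson's r from scratch. The function name `pearson_r` and the two tiny data sets are my own illustrative inventions, not data from any of the papers discussed.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical neuroticism scores and amygdala responses for six people.
neuroticism = [1, 2, 3, 4, 5, 6]
perfect = [2 * x + 1 for x in neuroticism]  # points on an exact straight line
noisy = [3, 1, 4, 2, 6, 5]                  # same upward trend, more scatter

r_perfect = pearson_r(neuroticism, perfect)
r_noisy = pearson_r(neuroticism, noisy)
print(round(r_perfect, 3), round(r_noisy, 3))
```

The straight-line data give an r of 1, while the scattered data give a positive but clearly smaller value.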

Vul’s argument is that many of the correlation coefficients appearing in social neuroscience papers are higher than they ought to be, because they fall prey to the non-independence error discussed above. Many reported correlations were in the range of r=0.7-0.9, which they describe as being implausibly high.

They say that the problem arises when researchers search across the whole brain for any parts where the correlation between activity and some personality measure is statistically significant – that is to say, where it is high – and then work out the average correlation coefficient in only those parts. The reported correlation coefficient will tend to be a high number, because they specifically picked out the high numbers (since only high numbers are likely to be statistically significantly different from zero.)

Suppose that you divided the amygdala into 100 small parts (voxels) and separately worked out the linear correlation between activity and neuroticism for each voxel. Suppose that you then selected those voxels in which the correlation was greater than (say) 0.8, and worked out the average: (say) 0.86. This does not mean that activity across the amygdala as a whole is correlated with neuroticism with r=0.86. The “full” amygdala-neuroticism correlation must be less than this. (Clarification 5.2.09: Since there is random noise in any set of data, it is likely that some of those correlations which reached statistical significance were those which were very high by chance. This does not mean that there weren’t any genuinely correlated voxels. However, it means that the average over the voxels that passed the threshold is not a measure of the average over the genuinely correlated voxels. This is a case of regression to the mean.)
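The following simulation is a rough sketch of that thought experiment, not a reanalysis of anyone's data. Every number in it is an assumption of mine: 16 subjects, 100 voxels, a true correlation of 0.5 in every voxel, and a selection cut-off of 0.7 (a little lower than the 0.8 in the example, so that a decent number of voxels pass).

```python
import math
import random

random.seed(1)

N_SUBJECTS = 16  # assumed sample size
N_VOXELS = 100   # as in the example above
TRUE_R = 0.5     # assumed true correlation, identical in every voxel
THRESHOLD = 0.7  # assumed cut-off on the observed correlation

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def simulate_voxel(trait, true_r):
    """Activity whose population correlation with the trait is true_r."""
    noise_sd = math.sqrt(1 - true_r ** 2)
    return [true_r * t + noise_sd * random.gauss(0, 1) for t in trait]

# One (hypothetical) neuroticism score per subject.
trait = [random.gauss(0, 1) for _ in range(N_SUBJECTS)]

# Observed correlation in each voxel, then the non-independent selection.
voxel_rs = [pearson_r(trait, simulate_voxel(trait, TRUE_R))
            for _ in range(N_VOXELS)]
selected = [r for r in voxel_rs if r > THRESHOLD]

mean_all = sum(voxel_rs) / len(voxel_rs)
mean_selected = sum(selected) / len(selected) if selected else float("nan")

print(f"voxels selected:        {len(selected)} of {N_VOXELS}")
print(f"mean r, all voxels:     {mean_all:.2f}")
print(f"mean r, selected only:  {mean_selected:.2f}")
```

Every voxel here is genuinely correlated with the trait, yet the average over the selected voxels overstates the strength of that correlation, because the voxels that cross the threshold are the ones where noise happened to push the observed r upwards.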

Vul et al. say that out of 52 social neuroscience fMRI papers they considered, 28 (54%) fell prey to this problem. They determined this by writing to the authors of the papers and asking them to answer some multiple-choice questions about their statistical methodology. This chart shows the reported correlation coefficients in the papers which seemed to suffer from the problem (in red) vs. those which didn’t (in green); unsurprisingly, the ones which did tended to give higher coefficients. (Each square is one paper.)

That’s it. It’s quite simple. But… there is a very important question remaining. We’ve said that non-independent analysis leads to “inflated” or “too high” correlations, but too high compared to what? Well, the “inflated” correlation value reported by a non-independent analysis is entirely accurate – in that it’s not just made up – but it only refers to a small and probably unrepresentative collection of voxels. It only becomes wrong if you think that this correlation is representative of the whole amygdala (say).

So you might decide that the “true” correlation might be the mean correlation over all of the voxels in the amygdala. But that’s only one option. There are others. It would be equally valid to take the average correlation over the whole amygdalo-hippocampal complex (a larger region). Or the whole temporal cortex. That would be silly, but not an error – so long as you make it clear what your correlation refers to, any correlation figure is valid. If you say “The voxel in the amygdala with the greatest correlation with neuroticism in this data-set had an r=0.99”, that would be fine, because readers will realize that this r=0.99 figure was probably an outlier. However, if you say, or imply, that “The amygdala was correlated with neuroticism r=0.99” based on the same data, you’re making an error.

My diagram (if you can call it that…) to the left illustrates this point. The ovals represent the brain. The colour of each point in the brain represents the degree of linear correlation between some particular fMRI signal in that spot, and some measure of personality.

Oval 1 represents a brain in which no area is really correlated with personality. So most of the brain is gray, meaning very low correlation. But a few spots are moderately correlated just by chance, so they show up as yellow.

Oval 2 represents a brain in which a large blob of the brain (the “amygdala” let’s call it) is really correlated quite well i.e. yellow. However, some points within this blob are, just by chance, even more correlated, shown in red.

Now, if you took the average correlation over the whole of the “amygdala”, it would be moderate (yellow) – i.e. picture 2a. However, suppose that instead, you picked out those parts of the brain where the correlation was so high that it could not have occurred by chance (statistically significant).

We’ve seen that yellow spots often occur by chance even without any real correlation, but red ones don’t – it’s just too unlikely. So you pick out the red spots. If you average those, the average is obviously going to be very high (red). i.e. picture 2b. But if you then noticed that all of the red spots were in the amygdala, and said that the correlation in the amygdala was extremely high, you’d be making (one form of) the non-independence error.

Some people have taken issue with Vul’s argument, saying that it’s perfectly valid to search for voxels significantly correlated with a behaviour, and then to report on the strength of that correlation. See for example this anonymous commentator:

many papers conducted a whole brain correlation of activation with some behavioral/personality measure. Then they simply reported the magnitude of the correlation or extracted the data for visualization in a scatterplot. That is clearly NOT a second inferential step, it is simply a descriptive step at that point to help visualize the correlation that was ALREADY determined to be significant.

The academic responses to Vul make the same point (but less snappily).

The truth is that while there is technically nothing wrong with doing this, it could easily be misleading in practice. Searching for voxels in the brain where activation is significantly correlated with something is perfectly valid, of course. But the magnitude of the correlation in these voxels will be high by definition. These voxels are not representative because they have been selected for high correlation. In particular, even if these voxels all happen to be located within, say, the amygdala, they are not representative of the average correlation in the amygdala.

A related question is whether this is a “one-step” or a “two-step” analysis. Some have objected that Vul implies it is a two-step analysis in which the second step is “wrong”, whereas in fact it’s just a one-step analysis. That’s a purely semantic issue. There is only one statistical inference step (searching for significantly correlated voxels). But to then calculate and report the average correlation in those voxels is a second, descriptive step. The second step is not strictly wrong but it could be misleading, not because it introduces a new, flawed analysis, but because it would be a misinterpretation of the results of the first step.

2. Vul et al.’s secondary argument

The argument set out above is not the only argument in the Vul et al. paper. There’s an entirely separate one introduced on Page 18 (Section F).

The central argument is limited in scope. If valid it means that some papers, those which used non-independent methods to compute correlations, reported inappropriately high correlation coefficients. But it does not even claim that the true correlation coefficients were zero, or that the correlated parts of the brain were in the wrong places. If one picks out those voxels in the brain which are significantly correlated with a certain measure, it may be wrong to then compute the average correlation, but the fact that the correlation is significantly greater than zero remains. Indeed, the whole argument rests upon the fact that they are!

But… this all assumes that the calculation of statistical significance was done correctly. Such calculations can get very complex when it comes to fMRI data. It can be difficult to correct for the multiple comparisons problem. Vul et al. point out that in some of the papers in question (they only cite one, but say that the same also applies to an unspecified number of others), the calculation of significance seems to have been done wrong. They trace the mistake to a table printed in a paper published in 1995. They accuse some people of having misunderstood this table, leading to completely wrong significance calculations.

The per-voxel false detection probabilities described by E. et al (and others) seem to come from Forman et al.’s Table 2C. Values in Forman et al’s table report the probability of false alarms that cluster within a single 2D slice (a single 128×128 voxel slice, smoothed with a FWHM of 0.6*voxel size). However, the statistics of clusters in 2D (a slice) are very different from those of a 3D volume: there are many more opportunity for spatially clustering false alarm voxels in the 3D case, as compared to the 2D case. Moreover, the smoothing parameter used in the papers in question was much larger than 0.6*voxel size assumed by Forman in Table 2C (in E. et al., this was >2*voxel size). The smoothing, too, increases the chances of false alarms appearing in larger spatial clusters.

If this is true, then it’s a knock-down point. Any results based upon such a flawed significance calculation would be junk, plain and simple. You’d need to read the papers concerned in detail to judge whether it was, in fact, accurate. But this is a completely separate point to Vul et al.’s primary non-independence argument. The primary argument concerns a statistical phenomenon; this secondary argument accuses some people of simply failing to read a paper. The primary argument suggests that some reported correlation coefficients are too high, but only this second argument suggests that some correlation coefficients may in fact be zero. And Vul et al. do not say how many papers they think suffer from this serious flaw.
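To give a feel for why the multiple comparisons problem matters at all, here is a bare-bones simulation of a “brain” in which nothing is correlated with anything. It is emphatically not a reconstruction of Forman et al.’s cluster-size method or of any paper’s analysis; it ignores clustering and smoothing entirely. The 16 subjects, 2,000 voxels and the cut-off of |r| > 0.6 (which for 16 subjects corresponds to an uncorrected two-tailed p of somewhere between 0.01 and 0.02) are all assumptions of mine.

```python
import math
import random

random.seed(2)

N_SUBJECTS = 16  # assumed sample size
N_VOXELS = 2000  # assumed number of voxels tested
THRESHOLD = 0.6  # assumed uncorrected cut-off on |r|

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical personality scores, one per subject.
trait = [random.gauss(0, 1) for _ in range(N_SUBJECTS)]

# Every voxel is pure noise: its true correlation with the trait is zero.
null_rs = [
    pearson_r(trait, [random.gauss(0, 1) for _ in range(N_SUBJECTS)])
    for _ in range(N_VOXELS)
]

# Voxels that nonetheless pass the uncorrected threshold are false alarms.
false_alarms = [r for r in null_rs if abs(r) > THRESHOLD]

print(f"false alarms: {len(false_alarms)} of {N_VOXELS} voxels")
print(f"largest |r| found in pure noise: {max(abs(r) for r in null_rs):.2f}")
```

With thousands of voxels tested, a handful of impressive-looking correlations turn up even though none is real. Correcting for that properly is the hard part, and it is there that Vul et al. say some papers went wrong.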

These two arguments seem to have gotten mixed up in the minds of many people. Responses to the Vul et al. paper have seized upon the secondary accusation that some correlations are completely spurious. The word “voodoo” in the title can’t have helped. But this misses the point of Vul et al.’s central argument, which is entirely separate, and seems almost indisputable so far as it goes.

3. Some Points to Note

- Just to reiterate, there are two arguments about brain-behaviour correlations in Vul et al. The main one – the one everyone’s excited about – purports to show that the correlations reported in 54% of the social neuroscience papers surveyed are weaker than claimed, but cannot be taken to mean that they are zero. The second one claims that some correlations are entirely spurious because they were based on a very serious error stemming from misreading a paper. But at present only one paper has been named as a victim of this error.
- The non-independence error argument is easy to understand and isn’t really about statistics at all. If you’ve read this far, you should understand it as well as I do. There are no “intricacies”. (The secondary argument, about multiple-comparison testing in fMRI, is a lot trickier however.)
- How much the non-independence error inflates correlation sizes is difficult to determine, and it will vary in every different case. Amongst many other things the degree of inflation will depend upon two factors: the strictness of the statistical threshold used to pick the voxels (a stricter threshold = higher correlations picked); and the number of voxels picked (if you pick 99% of the voxels in the amygdala, then that’s nearly as good as averaging over the whole thing; if you pick the one best voxel, then you could inflate the correlation enormously.) Note, however, that many of the papers that avoided the error still reported pretty strong correlations.
- It’s easy to work out brain activity-behaviour correlations while avoiding the non-independence problem. Nearly half of the papers Vul et al. considered in fact did this (the “green” papers). One simply needs to select the voxels in which to calculate the average correlation based on some criterion other than the correlation itself. One could, for example, use an anatomy textbook to select those voxels making up the amygdala. Or, one could select those voxels which are strongly activated by seeing a scary picture. Many of the “green” papers which did this still reported strong correlations (r=0.6 or above).
- Vul et al.’s criticisms apply only to reports of linear correlations between regional fMRI activity and some behavioural or personality measure. Most fMRI studies do not try to do this. In fact, many do not include any behavioural or personality measures at all. At the moment, fMRI researchers are generally seeking to find areas of the brain which are activated during experience of a certain emotion, performance of a cognitive process, etc. Such papers escape entirely unscathed.
- Conversely, although Vul et al. looked at papers from social neuroscience, any paper reporting on brain activity-behaviour linear correlations could suffer from the non-independence problem. The fact that the authors happened to have chosen to focus on social neuroscience is irrelevant.
- Indeed, Vul & Kanwisher have also recently written an excellent book chapter discussing the non-independence problem in a more general sense. Read it and you’ll understand the “voodoo” better.
- Therefore, “social neuroscience” is not under attack (in this paper.) To anyone who’s read & understood the paper, this will be quite obvious.
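The point in the list above about threshold strictness can be sketched with another toy simulation. As before, every number is an assumption of mine (16 subjects, 500 voxels in a hypothetical region, a true correlation of 0.5 everywhere), so treat the output as an illustration of the direction of the effect, not of its real-world size. The “anatomical” figure is simply the average over every voxel in the region, which is the kind of selection that does not depend on the correlation itself.

```python
import math
import random

random.seed(3)

N_SUBJECTS = 16               # assumed sample size
N_VOXELS = 500                # assumed size of the hypothetical region
TRUE_R = 0.5                  # assumed true correlation in every voxel
THRESHOLDS = [0.3, 0.5, 0.7]  # increasingly strict cut-offs on observed r

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

trait = [random.gauss(0, 1) for _ in range(N_SUBJECTS)]
noise_sd = math.sqrt(1 - TRUE_R ** 2)

# Observed correlation in each voxel of the region.
voxel_rs = [
    pearson_r(trait, [TRUE_R * t + noise_sd * random.gauss(0, 1) for t in trait])
    for _ in range(N_VOXELS)
]

# Independent ("anatomical") selection: average over the whole region.
roi_mean = sum(voxel_rs) / len(voxel_rs)
print(f"whole region: {N_VOXELS} voxels, mean r = {roi_mean:.2f}")

# Non-independent selection at each threshold: (cut-off, count, mean r).
results = []
for cut in THRESHOLDS:
    picked = [r for r in voxel_rs if r > cut]
    mean_picked = sum(picked) / len(picked) if picked else float("nan")
    results.append((cut, len(picked), mean_picked))
    print(f"r > {cut}: {len(picked)} voxels, mean r = {mean_picked:.2f}")
```

As the cut-off gets stricter, fewer voxels survive and their average correlation climbs further above the whole-region figure.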

4. Remarks: On the Art of Voodoo Criticism

Vul et al. is a sound warning about a technical problem that can arise with a certain class of fMRI analyses. The central point, although simple, is not obvious – no-one has noticed it before, after all – and we should be very grateful to have it pointed out. I can see no sound defense against the central argument: the correlations reported on the “red list” papers are probably misleadingly high, although we do not know by how much. (The only valid defense would be to say that your paper did not, in fact, use a non-independent analysis.)

Some have criticized Vul et al. for their combative or sensationalist tone. It’s true that they could have written the paper very differently. They could have used a conservative academic style and called it “Activity-behaviour correlations in functional neuroimaging: a methodological note”. But no-one would have read it. Calling their paper “Voodoo correlations” was a very smart move – although there is no real justification for this, it brilliantly served to attract attention. And attention is what papers like this deserve.

But this paper is not an attack on fMRI as a whole, or social neuroscience as a whole, or even the calculation of brain-behaviour correlations as a whole. Those who treat it as such are the real voodoo practitioners in the old-fashioned sense: they see Vul sticking pins into a small part of neuroscience, and believe that this will do harm to the whole of it. This means you, Sharon Begley of Newsweek: “The upcoming paper, which rips apart an entire field: the use of brain imaging in social neuroscience…”. This means you, anyone who read about this paper and thought “I knew it”. No, you didn’t. You may have thought that there was something wrong with all of these social neuroscience fMRI papers, but unless you are Ed Vul, you didn’t know what it was.

There’s certainly much wrong with contemporary cognitive neuroscience and fMRI. Conceptual, mathematical, and technical problems plague the field, just a few of which have been covered previously on Neuroskeptic and on other blogs as well as in a few papers (although surprisingly few). In all honesty, a few inflated correlations rank low on the list of the problems with the field. Vul’s is a fine paper. But its scope is limited. As always, be skeptical of the skeptics.

Edward Vul, Christine Harris, Piotr Winkielman, Harold Pashler (2008). Voodoo Correlations in Social Neuroscience. Perspectives on Psychological Science.