Recently, psychologists Joseph Simmons, Leif Nelson and Uri Simonsohn made waves when they published a provocative article called “False-Positive Psychology”.
The paper’s subtitle was “Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”. It explained how there are so many possible ways to gather and analyze the results of a (very simple) psychology experiment that even if there’s nothing interesting really happening, it’ll be possible to find some “significant” positive results purely by chance. Then you could publish those ‘findings’ and not mention all the other things you tried.
It’s not a new argument, and the problem has been recognized for a long time as the “file drawer problem”, “p-value fishing”, “outcome reporting bias”, and by many other names. But not much has been done to prevent it.
The problem’s not just seen in psychology, however, and I’m concerned that it’s especially dangerous in modern neuroimaging research.
Let’s assume a very simple fMRI experiment. The task is a facial emotion visual response: volunteers are shown 30-second blocks of Neutral, Fearful and Happy faces during standard functional EPI scanning. We also collect a standard structural MRI, as required to analyze the functional data.
This is a minimalist study. Most imaging projects include more than one task, commonly two or three and sometimes up to half a dozen, as part of one scan. If one task failed to show positive results, it need never be reported at all, so additional tasks would compound the problems here.
Our study is comparing two groups: people with depression, and healthy controls.
How many different ways could you analyze that data? How much flexibility is there?
First, some ground rules. We’ll stick to validated, optimal approaches. There are plenty of commonly used but less favoured approaches, like using uncorrected thresholds (and then, which ones?) or voodoo stats, but let’s assume we want to stick to ‘best practice’.
As far as I can see, here are all the different things you could try. Please suggest more in the comments if you think I’ve missed any:
First off, general points:
- What’s the sample size? Unless it’s fixed in advance, data peeking – checking whether you’ve got a significant result after each scan, and stopping the study when you get one – gives you multiple bites at the cherry.
- Do you use parametric, or nonparametric analysis?
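The cost of data peeking is easy to demonstrate with a quick Monte Carlo sketch. This is my own illustration, not from the paper: I assume a one-sample t-test as the per-look analysis and made-up sample sizes (testing after every subject from n=10 to n=60). The null hypothesis is true throughout, so every “significant” result is a false positive.

```python
# Monte Carlo sketch of "data peeking" vs a fixed sample size.
# Illustrative numbers only; assumes a one-sample t-test per look.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(n_sims=1000, n_min=10, n_max=60, alpha=0.05, peek=True):
    hits = 0
    for _ in range(n_sims):
        data = rng.standard_normal(n_max)  # null is true: mean really is 0
        if peek:
            # Test after each added subject; stop at the first p < alpha.
            for n in range(n_min, n_max + 1):
                if stats.ttest_1samp(data[:n], 0).pvalue < alpha:
                    hits += 1
                    break
        else:
            # Sample size fixed in advance: one test only.
            if stats.ttest_1samp(data, 0).pvalue < alpha:
                hits += 1
    return hits / n_sims

print(f"fixed n: {false_positive_rate(peek=False):.3f}")  # close to the nominal 0.05
print(f"peeking: {false_positive_rate(peek=True):.3f}")   # well above 0.05
```

With peeking, the false-positive rate climbs far above the nominal 5%, because each look is another bite at the cherry.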
Now, what do you do with the data?
- How much smoothing?
- Do you reject subjects for “too much head movement”? If so, what’s too much?
- Straightforward whole-brain general linear model (GLM) analysis followed by a group comparison.
  - What’s the contrast of interest? You could make a case for Fear vs Neutral, Happy vs Neutral, Happy vs Fear, “Emotional” vs Neutral.
  - Fixed effects or random effects group comparison?
  - Do you reject outliers? If so, what’s an ‘outlier’?
  - Do you consider all of the Fear, Happy and Neutral blocks to be equivalent, or do you model the first of each kind of block separately? etc.
- “Region of Interest” (ROI) GLM analysis. Same options as above, plus:
  - Which ROI(s)?
  - How do you define a given ROI?
- Functional connectivity analysis
  - Whole-brain analysis, or seed region analysis?
  - If seed region, which region(s)?
  - Functional connectivity in response to which stimuli?
- Dynamic Causal Modelling?
  - Lots and lots of options here.
But remember, we also collected structural MRIs, and while they may have been intended to help analyze the functional data, you could also examine structural differences between groups. What method?
- Manual measurement of volume of certain regions.
  - Which region(s)?
- Cortical morphometry.
  - What measure? Thickness? Curvature…?
That’s just the imaging data. You’ve almost certainly got some other data on these people as well, if only age and gender, but maybe also depression questionnaire scores, genetics, cognitive test performance…
- You could try to correlate every variable with every imaging measure discussed above. Plus:
  - Do you only look for correlations in areas where there’s a significant group difference (which would increase your chances of finding a correlation in those areas, as there’d be fewer multiple comparisons)?
- You could define subgroups based on these variables.
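A rough sense of why all this flexibility matters: if each of k analysis choices carried an independent 5% chance of a false positive, the chance of at least one “significant” result would grow fast. Independence is an idealization here (analyses of the same data are correlated), so treat these as back-of-the-envelope figures, not exact rates.

```python
# Chance of at least one false positive across k independent tests,
# each run at alpha = 0.05. Idealized: real analyses of the same
# dataset are correlated, so these are rough figures only.
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:2d} analyses -> {p_any:.0%} chance of a false positive")
# prints 5%, 23%, 40%, 64%
```

Even a modest menu of ten or twenty plausible analyses pushes the chance of some spurious “finding” toward coin-flip territory.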
This problem is growing. As computing power continues to expand, running multiple analyses is cheaper and faster than ever, and new methods continue to be invented (DCM and MVPA were rarely used even five years ago).
I’ll also point out that some imaging research, especially what might be called “pure” neuroscience investigating brain function per se rather than “clinical” studies looking at differences between groups, has fewer variables to play with, but still quite a lot.
Simmons JP, Nelson LD, and Simonsohn U (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological science, 22 (11), 1359-66 PMID: 22006061