Back in April a paper came out in Nature Reviews Neuroscience that shocked many: Katherine Button et al’s Power failure: why small sample size undermines the reliability of neuroscience
It didn’t shock me, though, skeptic that I am: I had long suspected that much of neuroscience (and science in general) is underpowered – that is, that our sample sizes are too small to give us an acceptable chance of detecting the signals that we claim to be able to isolate out of the noise.
In fact, I was so unsurprised by Button et al that I didn’t even read it, let alone write about it, even though the authors list included such neuro-blog favorites as John Ioannidis, Marcus Munafò and Brian Nosek (I try to avoid obvious favouritism, you see).
However this week I took a belated look at the paper, and I noticed something interesting.
Button et al took 49 meta-analyses and calculated the median observed statistical power of the studies in each analysis. The headline finding was that average power is small.
I was curious to know why it was small. So I correlated the study characteristics (sample size and observed effect size) with the median power of the studies.
(Details: I extracted the data on the 49 meta-analyses from Button et al’s Table 2. There are two kinds of effect size, Cohen’s d and relative risk (RR), corresponding to continuous vs. categorical outcomes; each meta-analysis reports one, not both. As I was unsure how to convert one into the other, I treated these as two separate populations of meta-analyses. Because the direction (sign) of an effect is irrelevant for power calculation, I made all effect sizes ‘positive’ by taking Abs(d) or, in the case of RR, by taking 1/RR for RRs below 1. And because power scales with the square root of sample size, I used Sqrt(N) in all correlations.)
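For the curious, the preprocessing described above can be sketched in a few lines of pandas. This assumes the Table 2 data has been typed into a DataFrame with hypothetical column names 'effect', 'type' ('d' or 'RR') and 'median_n' – the names and layout are my assumption, not Button et al’s:

```python
import numpy as np
import pandas as pd

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Sign-normalize effect sizes and add a sqrt(N) column."""
    out = df.copy()
    is_rr = out["type"] == "RR"
    # Direction of the effect is irrelevant for power:
    # take |d| for the continuous outcomes...
    out.loc[~is_rr, "effect"] = out.loc[~is_rr, "effect"].abs()
    # ...and flip RRs below 1 to 1/RR for the categorical ones.
    flip = is_rr & (out["effect"] < 1)
    out.loc[flip, "effect"] = 1 / out.loc[flip, "effect"]
    # Power scales with the square root of sample size,
    # so correlate on sqrt(median N) rather than raw N.
    out["sqrt_n"] = np.sqrt(out["median_n"])
    return out
```

After this, the correlations are just `normalize(df).groupby("type")` followed by a Pearson correlation of `sqrt_n` (or `effect`) against median power.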
I found that median power in a given meta-analysis was not correlated with the median sample size of those studies (d on the left, RR on the right):
Note also that the “RR” meta-analyses tended to include much larger studies (mean median N=353) than the “d” ones (mean median N=58), yet despite this they had significantly (p=.002) lower median power (RR 0.243 vs d 0.524).
Median power was rather predicted by the summary effect size:
In other words, the many underpowered studies tended to be the ones which estimated smaller effects, but not those with smaller samples.
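To make the puzzle concrete: power really is a function of both effect size and sample size. Here is a textbook normal-approximation power formula for a two-sample t-test at alpha = 0.05 – a standard approximation, not the exact calculation Button et al used:

```python
import math

def approx_power(d: float, n_per_group: float) -> float:
    """Approximate two-sided power of a two-sample t-test at alpha = 0.05,
    using the normal approximation: power ~ Phi(d * sqrt(n/2) - z_crit)."""
    z_crit = 1.959964  # two-sided 5% critical value of the standard normal
    ncp = d * math.sqrt(n_per_group / 2)  # noncentrality parameter
    # Normal CDF via the error function
    return 0.5 * (1 + math.erf((ncp - z_crit) / math.sqrt(2)))
```

For example, `approx_power(0.5, 64)` comes out near the conventional 0.8 threshold, and shrinking either d or N drags power down. So a correlation with effect size but not with sample size is genuinely odd.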
What does this mean? I confess I’m stumped.
Are we neuroscientists to conclude that sample size isn’t the main problem? Is Nature herself the culprit for not providing us with bigger effect sizes? Yet how can size not matter, when power is defined as a function of both sample size and effect size? It’s like saying the area of a rectangle is only correlated with its width, not its height – yet this is what I am saying.
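One possible way out of the rectangle paradox, for what it’s worth: if widths vary far more than heights, width dominates the area and the correlation with height can nearly vanish. A toy simulation with made-up numbers (not the Button et al data) shows the effect:

```python
import random

random.seed(1)
n = 1000
# Widths spread over a 20-fold range (like effect sizes might be?);
# heights barely vary at all (like sqrt(N) might, within a meta-analysis?).
widths  = [random.uniform(0.1, 2.0) for _ in range(n)]
heights = [random.uniform(0.9, 1.1) for _ in range(n)]
areas   = [w * h for w, h in zip(widths, heights)]

def corr(x, y):
    """Pearson correlation, from scratch to keep this self-contained."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / m
    sx = (sum((a - mx) ** 2 for a in x) / m) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / m) ** 0.5
    return cov / (sx * sy)

print(corr(widths, areas))   # close to 1
print(corr(heights, areas))  # close to 0
```

Whether the relative spreads in the real Table 2 data actually look like this is exactly the question, but it at least shows that the pattern is arithmetically possible.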
Or have I just done the math wrong?
If you have any ideas, please let me know in the comments.
Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, & Munafò MR (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376. PMID: 23571845