Is Neuroscience Really Too Small?

By Neuroskeptic | August 10, 2013 5:41 am

Back in April a paper came out in Nature Reviews Neuroscience that shocked many: Katherine Button et al’s “Power failure: why small sample size undermines the reliability of neuroscience”.

It didn’t shock me, though, skeptic that I am: I had long suspected that much of neuroscience (and science in general) is underpowered – that is, that our sample sizes are too small to give us an acceptable chance of detecting the signals that we claim to be able to isolate out of the noise.

In fact, I was so unsurprised by Button et al that I didn’t even read it, let alone write about it, even though the authors list included such neuro-blog favorites as John Ioannidis, Marcus Munafò and Brian Nosek (I try to avoid obvious favouritism, you see).

However this week I took a belated look at the paper, and I noticed something interesting.

Button et al took 49 meta-analyses and calculated the median observed statistical power of the studies in each analysis. The headline finding was that average power is low.

I was curious to know why it was small. So I correlated the study characteristics (sample size and observed effect size) with the median power of the studies.

(Details: I extracted the data on 49 meta-analyses from Button et al’s Table 2. There are two kinds of effect size, Cohen’s d and relative risk (RR), corresponding to continuous vs. categorical outcomes; each meta-analysis has one, not both. As I was unsure how to convert one into the other, I treated these as two separate populations of meta-analyses. Because the direction (sign) of an effect is irrelevant for power calculation, I made all of the effect sizes ‘positive’ by taking Abs(d) or, for RR, by taking 1/RR for those RRs below 1. And because it is the square root of sample size that determines power, I used Sqrt(N) in all correlations.)
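For concreteness, here is roughly how observed power can be computed from a summary effect size d and a per-group sample size n, using a standard normal approximation for a two-sided two-sample test, along with the sign-folding described above. This is my sketch, not necessarily the exact calculation Button et al used:

```python
from math import sqrt
from scipy.stats import norm

def fold_effect(d=None, rr=None):
    """Sign/direction is irrelevant for power: fold d to |d|,
    and fold protective RRs (< 1) to the harmful side (RR 0.5 -> 2.0)."""
    if d is not None:
        return abs(d)
    return 1.0 / rr if rr < 1 else rr

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test of Cohen's d:
    the expected z-statistic is d * sqrt(n/2) (normal approximation)."""
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(d * sqrt(n_per_group / 2) - z_crit)

assert fold_effect(rr=0.5) == fold_effect(rr=2.0)
print(round(power_two_sample(0.5, 64), 2))  # ~0.8: the classic benchmark
```

The sqrt(n) inside the power function is why Sqrt(N), rather than N itself, is the natural predictor to correlate against.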

I found that median power in a given meta-analysis was not correlated with the median sample size of those studies (d on the left, RR on the right):

Note also that the “RR” meta-analyses tended to include much larger studies (mean median N=353) than the “d” ones (mean median N=58) yet despite this, they had significantly (p=.002) less median power (RR 0.243 vs d 0.524).

Median power was rather predicted by the summary effect size:

In other words, the many underpowered studies tended to be the ones which estimated smaller effects, but not those with smaller samples.

What does this mean? I confess I’m stumped.

Are we neuroscientists to conclude that sample size isn’t the main problem? Is Nature herself the culprit for not providing us with bigger effect sizes? Yet how can size not matter, when power is defined as a function of both sample size and effect size? It’s like saying the area of a rectangle is only correlated with its width, not its height – yet this is what I am saying.
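Counterintuitive as the rectangle version sounds, it can happen whenever one dimension varies much more than the other. A quick simulation with arbitrary numbers (nothing here comes from the Button et al data):

```python
import numpy as np

rng = np.random.default_rng(0)
width = rng.uniform(1, 100, size=2000)   # varies a lot
height = rng.uniform(9, 11, size=2000)   # barely varies
area = width * height

print(round(np.corrcoef(area, width)[0, 1], 2))   # close to 1
print(round(np.corrcoef(area, height)[0, 1], 2))  # close to 0
```

Width dominates the variance of the area, so it soaks up nearly all the correlation, even though height enters the formula symmetrically.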

Or have I just done the math wrong?

If you have any ideas, please let me know in the comments.

Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, & Munafò MR (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376. PMID: 23571845

  • Jona Sassenhagen

    I’m not sure I get the math here either, but isn’t this a result of the “significance filter”/Button’s “winner’s curse”? The smaller the sample size of a *published* study, the more it can be expected to overestimate the effect size (Type M error). So sample size should (inversely) predict effect size, and since power (very roughly) scales with sample size times effect size, if sample size is negatively correlated with effect size, you could expect to see what you did.

    After all, do we really believe effect sizes are so vastly different between cognitive and social neuroscience studies? Cohen’s original investigations of low power in psychology simply assumed a flat d ~ .5.

    Does anything change if you use the mean over the median?

    Looking forward to seeing what statisticians have to say here.
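Jona's significance-filter mechanism is easy to simulate: fix a true effect, generate studies of different sizes, and keep only the "publishable" (significant) ones. The true effect, the sample sizes and the publication rule below are all invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d = 0.3  # the true effect, identical for every simulated study

def published_d(n, n_sims=2000):
    """Mean estimated d among simulated studies that reached p < .05
    in the right direction (two-sample t-test, n per group)."""
    kept = []
    for _ in range(n_sims):
        a = rng.normal(true_d, 1, n)
        b = rng.normal(0, 1, n)
        t, p = stats.ttest_ind(a, b)
        if p < 0.05 and t > 0:
            kept.append(a.mean() - b.mean())
    return float(np.mean(kept))

print(round(published_d(15), 2))   # inflated well above the true 0.3
print(round(published_d(200), 2))  # close to the true 0.3
```

Small published studies overestimate the effect badly, because only their luckiest draws clear the significance bar; large studies barely need luck, so their published estimates stay honest.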

  • Vijay

    Is there a possibility the same thing is true in genetic analyses of groups/races? Do sample sizes of 10/100 lead to false estimates of ages-old admixture?

  • Jake Westfall

    Your finding is kind of odd, but not a contradiction. This kind of thing can happen in data with a multilevel structure, as we have here (studies nested within meta-analyses). A relationship that is true at the study level — e.g., larger studies having more power — need not necessarily hold at the meta-analysis level (meta-analyses with higher *average* sample size might not necessarily have higher *average* power). The assumption that such statistical relationships must be the same at different data levels is a type of “ecological fallacy.” If you plug the terms “ecological fallacy” or “Simpson’s paradox” into Google image search, you can find a lot of good examples of scatter plots showing situations like this.

    • Neuroskeptic

      Thanks for the comment. But is this really a case of the ecological fallacy?

      Button et al used the summary effect size of all studies in the meta-analysis, to calculate power. So that was the same for all studies in a given meta.

      The only thing that varied across studies was the sample size. For each meta I plotted the median sample size against the median power – and I think those medians come from the same study, because if a study is the median on one measure, it must be the median on the other (given that effect size is constant within a meta).

      So what my scatterplots show is that for a given set of studies (the “medians”) sample size is not correlated with power even though it is the sample size of those very studies that determines power…

      I think.

      • Jake Westfall

        As far as I can tell, yeah, this is still an ecological fallacy/Simpson’s paradox type of situation. I guess I see your reasoning here with the median thing — the idea as I understand it being that we can consider each meta-analysis to be, in some sense, one individual study (rather than a collection of studies), and in this way we conceptually eliminate the multilevel structure and attendant interpretational concerns entirely — but I really just don’t think it works that way statistically. Choosing simply to use medians instead of means to construct the aggregated x- and y-variables does not make these issues go away. Indeed, such a strategy would be truly remarkable if it really worked.

        Consider a toy example with two meta-analyses. Let’s say in meta A the sample sizes range from 10 to 50, and in meta B they range from 40 to 200. It will certainly be true that, within each meta-analysis, the larger studies will tend to have greater power. This follows from the mathematical definition of statistical power. But now let’s say that meta A was summarizing an area of the literature in which the typical effect size is large, d = 0.8, while meta B was summarizing an area of the literature in which the typical effect size is more moderate, say d = 0.4. So at the level of meta-analyses, in this toy example, typical sample sizes (defined however you like) tend to be *negatively* correlated with typical effect sizes. One could very easily imagine how, due to this inverse relationship, meta A and meta B might end up having the exact same typical estimated power in this situation. So at the level of meta-analyses, we would have *zero* correlation (technically Pearson’s r would not even be defined) between sample sizes and power estimates, even though we had a clear relationship between the two at the study level!

        Now, our choice of one or the other way of defining a “typical” sample size, effect size, or power estimate (e.g., mean, median, mode) would likely push things around somewhat. Maybe for a given dataset, we end up with a slight negative correlation when we choose means, a slight positive correlation when we choose medians, etc. But I don’t think these fluctuations would be expected to happen in any generally predictable way. Certainly I don’t see why simply choosing medians for the “typical” values should guarantee that within-group relationships also hold at the between-group level.
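Jake's toy example can be made numerically concrete. Taking the midpoints of his ranges and a standard normal approximation for two-sample power, meta A (n = 30 per group, d = 0.8) and meta B (n = 120 per group, d = 0.4) land on exactly the same power, because d²·n is equal in the two cases (0.64 × 30 = 0.16 × 120), even though power still rises with n within each meta:

```python
from math import sqrt
from scipy.stats import norm

def power(d, n_per_group, alpha=0.05):
    # normal approximation to two-sided two-sample power
    return norm.cdf(d * sqrt(n_per_group / 2) - norm.ppf(1 - alpha / 2))

pow_a = power(0.8, 30)    # meta A: small studies, large effect
pow_b = power(0.4, 120)   # meta B: large studies, moderate effect
print(round(pow_a, 3), round(pow_b, 3))  # identical

# yet *within* each meta, power still rises with n
assert power(0.8, 50) > power(0.8, 10)
assert power(0.4, 200) > power(0.4, 40)
```

So a between-meta correlation of zero is perfectly compatible with a strong within-meta relationship, which is exactly the ecological-fallacy point.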

        • Guido Biele

          Jake, I agree with what you’ve written and would only add that in a world where null hypothesis significance testing rules the game, a negative correlation between sample size and effect size is actually expected (as Jona pointed out).
          So your toy example is, I think, not simply a toy example, but a good description of how we should expect the data to look.

  • Greig de Zubicaray

    I recommend folks read Ashton’s response to the Button et al. article, also published in Nature Reviews Neuroscience.

    And then for further enlightenment, read the 1967 article by Paul Meehl that Ashton cites:

    Meehl, P. E. (1967). Theory testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.

    Meehl’s article should also be required reading for the current debate about study pre-registration, highlighting as it does the “main problem”.

  • Guido Biele

    A quick thought:
    I think we can only expect sample size to be a strong predictor of power if the different meta-analyses had similar effect sizes. The reason the RR studies have lower power despite their larger samples might be that the RR studies investigate smaller effects.
    One way to look at the data would be to run a multiple regression that uses sample size and effect size to predict power. (The plots already indicate that effect size is the stronger predictor, but we should see that sample size also predicts power after we have “controlled for” effect size.)

    PS: I think the meta analysis data are in table 1 of Button et al, not Table 2.
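Guido's suggested regression can be sketched on simulated meta-analyses. The real inputs would be the medians from Button et al's table; the numbers below are made up, with effect size and sample size drawn independently:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
k = 60  # number of simulated meta-analyses

# Simulated meta-analyses: median n and summary d drawn independently
n_med = rng.integers(20, 200, size=k)
d_sum = rng.uniform(0.05, 0.8, size=k)
power = norm.cdf(d_sum * np.sqrt(n_med / 2) - norm.ppf(0.975))

# OLS of power on sqrt(n) and d
X = np.column_stack([np.ones(k), np.sqrt(n_med), d_sum])
beta, *_ = np.linalg.lstsq(X, power, rcond=None)
print(beta)  # both slopes come out positive once the other is held fixed
```

Even when the raw scatter shows no sample-size/power correlation, the partial slope on Sqrt(N) should be positive, because power is mathematically increasing in both inputs.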

    • Jake Westfall

      Guido, yes, exactly. This is a far simpler and more concise way of saying what I was trying to say.


  • John McIntire

    Great video about p-values and reliability via replication in psychology.

  • Wouter

    “It’s like saying the area of a rectangle is only correlated with its width, not its height”

    Probably completely unrelated, but this reminded me of another phenomenon: the holographic principle (derived from black hole thermodynamics), which states that the total information stored in a black hole scales with the area of the event horizon, i.e. the radius squared, and not the radius cubed, as one would naively expect.

    Anyway, for a more pragmatic solution/interpretation I’d go for Guido Biele’s answer.

  • Neuroskeptic

    Thanks for the comments, everyone.

    Looking at the data in more detail, I found that the pattern of results can be explained by the fact that the effect sizes vary more (relatively speaking) than the sample sizes.

    In particular, the effect sizes come very close to zero, while the median sample sizes do not.

    So because they vary more, they explain more variance. Sounds simple… and I am now kicking myself for finding it so baffling earlier (the ecological fallacy was deep in my mind, messing up my intuitions).

    There is no significant correlation between sample size and effect size. Although there’s a weak trend towards a negative correlation, it is sensitive to outliers.

    So: in this dataset, effect size is independent of sample size, and is the primary determinant of power differences across studies.
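That resolution is straightforward to reproduce: let effect sizes range widely down toward zero while sample sizes vary comparatively little, and power ends up strongly correlated with effect size but barely with Sqrt(N). A simulation with invented numbers:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
k = 50  # simulated meta-analyses

d = rng.uniform(0.02, 1.0, k)   # effect sizes vary a lot, down toward zero
n = rng.integers(40, 80, k)     # sample sizes vary comparatively little
power = norm.cdf(d * np.sqrt(n / 2) - norm.ppf(0.975))

r_d = np.corrcoef(power, d)[0, 1]
r_n = np.corrcoef(power, np.sqrt(n))[0, 1]
print(round(r_d, 2), round(r_n, 2))  # strong vs. weak
```

Both inputs drive power, but the one with the greater relative spread hogs the variance, just as in the rectangle analogy.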

  • Peter

    Looks to me like an effect of incremental data gathering – i.e. keep amassing subjects until something/anything hits p<0.05 and then publish. If you do that, then pretty much by definition all your effect sizes will be at the low end of detectability for your sample size. Turning that round the other way, it's equivalent to saying that the sample size was barely big enough to detect the supposed effect – i.e. the experiment was underpowered.

    The only surprise is that it applies to the large studies as well as the small ones – for a large study you might expect n to have been set in advance. It's plausible that people might be analysing data mid-collection and then stopping gathering data if they've found something significant. Or it could be that the effects just aren't real, and people are fudging them over the magic p<0.05 barrier by (e.g.) not doing proper multiple testing correction.
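Peter's incremental-gathering scenario, usually called optional stopping, is simple to simulate: with a true effect of zero, testing after every batch of subjects and stopping at the first p < 0.05 pushes the false-positive rate well above the nominal 5%. The start size, cap and batch size below are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def false_positive_rate(n_start=10, n_max=100, step=5, n_sims=2000):
    """True effect is zero; test after each batch, stop at p < .05."""
    hits = 0
    for _ in range(n_sims):
        x = list(rng.normal(0, 1, n_start))
        while True:
            if stats.ttest_1samp(x, 0).pvalue < 0.05:
                hits += 1
                break
            if len(x) >= n_max:
                break
            x.extend(rng.normal(0, 1, step))
    return hits / n_sims

print(round(false_positive_rate(), 3))  # well above the nominal 0.05
```

And as Peter notes, a study stopped this way is, almost by construction, one whose sample size was barely big enough to detect its apparent effect.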

  • andrew oh-willeke

    Large effects always get published and have statistical power regardless of sample size. People don’t try to publish small effects involving small samples because they know the concerns are obvious, but do try to data mine for small effects in samples that are as big as they are going to get given practical constraints.


  • ArcusTangens

    I fail to see how these findings are surprising, since they seem to follow from the logic of sample size calculation.
    Consider that power (ex ante power, not ex post) is a function of expected effect size and number of observations. A rule of thumb often adopted in science is to aim for sample sizes giving 80% power at the 5% significance level (that is, an 80% chance of achieving statistical significance given that the actual effect equals the expected one). With this goal being widespread, effect sizes and sample sizes will be inversely related (fix power at 80% and vary the expected effect size, and the required sample size varies in the opposite direction), and estimated power will be essentially uncorrelated with sample size, which seems to match your findings.
    Furthermore, since power is a positive function of effect size, these two will clearly be correlated. The findings match up nicely with what should be expected.

    The problems debated in the Button article are real, though. Bacchetti’s response, bemoaning the demand for larger sample sizes and pointing to stringent adherence to scientific principles as a solution to publication bias, seems to disregard the fact that publication bias is a very real problem. Yes, considered in isolation, a small study that reached a specific p value is _as significant_ as a larger study with the same p value. But this is only true if observed in isolation – that is, assuming that there are no other studies performed for which the results are not reported. Unlike large studies (with narrower confidence intervals and reduced risk of unusual results), small studies are plentiful. Most small studies of small effects find nothing of statistical significance, and are usually left at that by the scientists – they’re not worth the trouble, since publication is highly unlikely. When a significant result is reached, though, publication is much more likely for a myriad of reasons. For large studies, publication is likely despite non-significant results, so the problem is smaller.

    Assume a true effect of 0 with an SD of 1. If we follow the full cohort of small and large studies of whatever phenomenon we are interested in, bearing in mind that the number of studies is a negative function of size, we will see that the probability of spuriously significant findings being published is much greater for small studies than for large. This is a special case of survivorship bias, since for the researcher achieving significant results, there is no way of discerning the graveyard of similar studies observing no effect. It is difficult to see how the problem can be mitigated without demands on sample size.
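ArcusTangens's mechanism falls out of the standard sample-size formula: the per-group n needed for 80% power at α = 0.05 scales as 1/d², so fields studying small effects run proportionally bigger studies and everyone lands near the same power regardless of n. A sketch using the usual normal approximation:

```python
from math import ceil
from scipy.stats import norm

def n_for_power(d, power=0.80, alpha=0.05):
    """Per-group n for a two-sided two-sample test (normal approximation):
    n = 2 * ((z_crit + z_power) / d) ** 2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z / d) ** 2)

for d in (0.2, 0.4, 0.8):
    print(d, n_for_power(d))  # n roughly quadruples each time d halves
```

If everyone sizes studies this way and the observed effects roughly match the expected ones, every study ends up near 80% power, and sample size carries no information about power at all.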

