Failed Replications: A Reality Check for Neuroscience?

By Neuroskeptic | November 19, 2014 1:39 pm

An attempt to replicate the results of some recent neuroscience papers that claimed to find correlations between human brain structure and behavior has drawn a blank.

The new paper is by University of Amsterdam researchers Wouter Boekel and colleagues and it’s in press now at Cortex. You can download it here from the webpage of one of the authors, Eric-Jan Wagenmakers. Neuroskeptic readers will know Wagenmakers as a critic of statistical fallacies in psychology and a leading advocate of preregistration, which is something I never tire of promoting either.

Boekel et al. attempted to replicate five different papers which, together, reported 17 distinct positive results in the form of structural brain-behavior (‘SBB’) correlations. An SBB correlation is an association between the size (usually) of a particular brain area and a particular behavioral trait. For instance, one of the claims was that the amount of grey matter in the amygdala is correlated with the number of Facebook friends you have.

To attempt to reproduce these 17 findings, Boekel et al. took 36 students whose brains were scanned with two methods, structural MRI and DWI. The students then completed a set of questionnaires and psychological tests, identical to ones used in the five papers that were up for replication.

The methods and statistical analyses were fully preregistered (back in June 2012); Boekel et al. therefore had no scope for ‘fishing’ for positive (or negative) results by tinkering with the methodology.
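
Why that matters can be illustrated with a toy simulation (purely hypothetical; it has no connection to Boekel et al.’s actual data or pipeline): if you run many analysis variants on pure noise and report only the best-looking one, you will “find” a sizeable correlation, whereas committing to a single analysis in advance keeps you honest.

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects, n_variants = 36, 200

# Pure noise: behavior and every analysis variant's brain measure
# are independent, so the true correlation is exactly zero.
behavior = rng.standard_normal(n_subjects)
brain = rng.standard_normal((n_subjects, n_variants))

# Pearson r of behavior against each of the 200 analysis variants
b = (behavior - behavior.mean()) / behavior.std()
m = (brain - brain.mean(axis=0)) / brain.std(axis=0)
r = (m * b[:, None]).mean(axis=0)

# A preregistered analysis commits to one variant in advance...
print(f"preregistered variant: r = {r[0]:+.2f}")
# ...whereas 'fishing' reports whichever variant looks best.
print(f"best of {n_variants} variants: r = {np.abs(r).max():+.2f}")
```

With samples of 36 and a couple of hundred analytic choices, the “best” null correlation routinely comes out looking respectable, which is exactly the degree of freedom preregistration removes.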

So what did they find? Nothing much. None of the 17 brain-behavior correlations were significant in the replication sample. Using Bayesian statistics, Boekel et al. say that

For all of the 17 findings under scrutiny, Bayesian hypothesis tests indicated evidence in favor of the null hypothesis [i.e. that there is no correlation.] The extent of this support ranged from anecdotal (Bayes factor less than 3) to strong (Bayes factor greater than 10).
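
For readers keeping score: a Bayes factor of 3 in favor of the null means the data are three times likelier under “no correlation” than under “some correlation”. Boekel et al. used default Bayesian correlation tests; as a rough stand-in, here is the simpler BIC approximation to the Bayes factor (Wagenmakers, 2007) in Python, with illustrative numbers that are not taken from the paper:

```python
import math

def bf01_bic(r, n):
    """Approximate Bayes factor BF01 (evidence for the null over a
    correlation) from a Pearson r and sample size n, via the BIC
    approximation BF01 ~ exp((BIC_alt - BIC_null) / 2), which for a
    one-predictor linear model reduces to the expression below."""
    delta_bic = n * math.log(1 - r ** 2) + math.log(n)
    return math.exp(delta_bic / 2)

# With only n = 36, an observed r of exactly 0 yields merely moderate
# support for the null, and even r = 0.30 is close to uninformative.
print(f"{bf01_bic(0.00, 36):.1f}")  # 6.0 (moderate support for H0)
print(f"{bf01_bic(0.30, 36):.1f}")  # 1.1 (barely any evidence either way)
```

On this scale BF01 above 1 favors the null, with 3 and 10 as the conventional “anecdotal/moderate/strong” cut-points quoted above; the example shows why a sample of 36 struggles to deliver more than anecdotal evidence either way.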

Does that mean that the original five SBB papers were wrong? Boekel et al. are noncommittal on this point, saying only that

From the above discussion, one might be tempted to conclude that the SBB correlations tested here simply may not exist. However, as previously mentioned, a single replication cannot be conclusive in terms of confirmation or refutation of a finding.

They conclude by calling for more pre-registered replications, like the one they just did:

We believe that in order to establish correlations between behavior and structural properties of the brain more firmly, it is desirable for the field to replicate SBB correlations, preferably using preregistration protocols and Bayesian inference methods.

However, not everyone is convinced by Boekel et al.’s negative claims. In particular, I heard from Ryota Kanai of UCL, who was the first author on two of the five papers under scrutiny. He cites limitations of Boekel et al.’s replication attempts:

1. One of the findings was clearly replicated using a method close to our original study.
2. The region-of-interest (ROI) approach adopted by the authors underestimates correlations (because of spatial uncertainty).
3. The suboptimal methods could not be corrected because it was a pre-registered study.
4. There was no stage where the methods were formally reviewed for this study.
5. The study was generally underpowered, and most Bayes factors were less than 3, indicating that the negative results were anecdotal.
6. Given the above, the main claim that none of the results were replicated sounds a bit too strong.

Kanai also says that the publication and review process has room for improvement:

I [was] one of the reviewers for this paper… Originally, I asked the Cortex action editor (Chris Chambers) to let me write a commentary on this paper so that the readers can see both sides of story at once. But now, Wagenmakers already made their paper available online and you are about to write a blog about this study.

It is true that the authors sent me their pre-registration document before they started their project. But at that time, it was not clear that I had to read it critically as I would for formal reviews. This should not be a problem for future work, if we use the pre-registration mechanisms such as Registered Report in the journal Cortex.

I’m still planning to write a full commentary when the paper is officially out. However, I think your blog would be a great place to discuss this topic further. I hope this is going to be a constructive discussion for the neuroimaging field.

Nonetheless, Kanai says that he supports the publication of Boekel et al. because

I recognize the importance of pre-registration in this field to tackle the issues of the small N problem, publication bias and flexible p-hacking by post hoc changes of analysis strategies.

I encountered a few problems when examining Boekel et al.’s manuscript [but] I thought it would be more beneficial for the field to discuss this in a post-publication peer discussion so that we can improve the practice of pre-registered studies, especially in the context of neuroimaging.

I agree with Kanai’s attitude here. I think Boekel et al.’s important paper is proof that truly rigorous, preregistered science is not only possible, but publishable in a major journal. We need more of this. Boekel et al.’s negative findings are certainly concerning.

That said, it would be rash to write off all 17 claims as disproven just yet. Many of the null results were only ‘anecdotal’ in terms of the Bayesian statistical evidence, Boekel et al.’s sample was relatively small, and the methodological limitations noted by Kanai, while not obviously fatal, aren’t easily dismissed.
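
How small is “relatively small”? A back-of-the-envelope power calculation using the standard Fisher z approximation (my illustrative numbers, not the paper’s) makes the point:

```python
import math
from statistics import NormalDist

def correlation_power(rho, n, alpha=0.05):
    """Approximate power of a two-sided test of H0: rho = 0, using
    the Fisher z transform: atanh(r) is roughly normal with mean
    atanh(rho) and standard deviation 1 / sqrt(n - 3)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = math.atanh(rho) * math.sqrt(n - 3)
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)

# If true brain-behavior correlations are around r = 0.25, a sample
# of 36 detects them less than a third of the time; roughly 120+
# participants are needed for the conventional 80% power.
print(f"{correlation_power(0.25, 36):.2f}")   # 0.31
print(f"{correlation_power(0.25, 123):.2f}")  # 0.80
```

A power of 0.31 means roughly a one-in-three chance of detecting a real r = 0.25 at the 5% level, so a string of null results in such a sample is not by itself surprising.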

However, like Kanai I’m confident that an open discussion is the best way forward. Preregistration itself helps facilitate this. Kanai’s commendable approach is a world away from the kind of hostile, defensive reactions that, sadly, sometimes greet failures to replicate.

Boekel W, Wagenmakers E-J, Belay L, Verhagen J, Brown S, & Forstmann BU (2014). A purely confirmatory replication study of structural brain-behavior correlations. Cortex.

  • Lil Kong

    Two things. First, as previous brain-behavior correlation studies with large samples (N > 200) have shown, effect sizes are normally around 0.2~0.3. This study had only 36 participants, which left the analyses badly underpowered. That alone could explain why they found the reported correlations to be absent. Second, they claimed that the effect sizes they found differed significantly from those reported previously. Why? The effect sizes reported previously were non-independent, as is now recognized, so it is unreasonable to compare the two. In sum, the claim made by the authors was too strong.

    • Neuroskeptic

      I agree about the N. I’m not sure what you mean about “the effect sizes reported previously were non-independent”?

      • Lil Kong

        Most of these brain-behavior correlation studies are exploratory and based on whole-brain voxel-wise analyses. They searched the whole brain and found the areas showing the largest effect sizes, so the effect sizes reported in these studies are inflated. The problem has been called ‘voodoo correlations’. It is unreasonable to compare against these effect sizes.

        • Neuroskeptic

          Ah right – got you.

          But surely that’s a problem with the original papers. You can’t hold the replication to blame for the fact that the original studies are reporting exaggerated effects…

          • Lil Kong

            Actually, I was not blaming the replication. In addition to the sample-size problem, I just wanted to say that comparing against these non-independent effect sizes would be unreasonable and the findings unconvincing. For example, even if the effects do exist, the authors would still find differences in effect size when comparing against non-independent reported effect sizes. So how can they claim whether the reported effects were replicated or not?

        • D Samuel Schwarzkopf

          This is not entirely correct (and oddly a good example of how things like voodoo correlations get misinterpreted). I don’t know all the original studies in this replication but the VBM ones used whole brain analysis to identify peak coordinates and then an independent sample to replicate that effect at that coordinate. This isn’t ‘voodoo’. Also note that this aspect is mentioned in the replication paper. Both the whole brain and the replication effect size are shown.

          • D Samuel Schwarzkopf

            Sorry I should be more precise: out of 8 VBM correlations in the paper 5 were independent.

            But another point that is often lost is that even if a whole-brain analysis is used, the inference isn’t voodoo when it is based on a prior hypothesis. I don’t know to what extent this is the case in some of these studies, but it is a general point often misconstrued in these kinds of discussions.

          • Neuroskeptic

            The inference isn’t voodoo in that case but the effect size may be exaggerated if you only consider the significant voxels, even so.

          • D Samuel Schwarzkopf

            Not sure what you mean by ‘only consider the significant voxels’ here? If you have a specific hypothesis about an effect and you test it this can’t exaggerate the effects. Of course, publication bias will if only those significant effects are reported. That’s related but a different problem.

          • Lil Kong

            I agree with @Neuroskeptic:disqus that the effect size is very likely to be exaggerated (which is what I meant by ‘voodoo’) as long as it comes from a whole-brain voxel-wise analysis. This is the case regardless of whether you have a prior hypothesis or not. I should add that the effect sizes (0.2~0.3) reported in the replication study are probably much closer to the true effect sizes in brain-behavior correlation studies. Further replications with large sample sizes (> 200?) are needed.

          • D Samuel Schwarzkopf

            Sorry but this is simply not true. In 5 of the tests reported here the results were replicated in a new sample. This effect size is therefore not exaggerated as it is by definition independent.

            And with a prior hypothesis the effect is also not exaggerated by a whole-brain analysis. Why should it be? It doesn’t matter whether the test is also run on 10 other comparisons or 10 million. This is also not really what the voodoo/double-dipping articles commented on. Their point was that one can’t make statistical inferences on effects after having made statistical inferences using the same data.

            I agree with you about the sample size being too low in this replication.

          • D Samuel Schwarzkopf

            Case in point, in 5 out of the 6 tests reported in this replication study where the original studies included a replication the effect size in the replication was higher than that obtained by the original whole-brain analysis (Figure 8). Of course the CIs are wider because these replications were done on smaller samples than the original (although still larger than the sample in this present replication study if my memory doesn’t fail me).

            (I’m happy to continue this debate next week but I’ll be on leave until Tuesday and will be turning my work email off until then. So, knowing the internet, the discussion will have moved on into a different dimension by then… 😉)

          • Lil Kong

            This is very interesting. Maybe they were very lucky.
            I am not biased toward defending them, but the two points I have raised are aimed at the conclusion drawn by the replication. I think the evidence for the claim might be insufficient and might even be misleading. I think the phenomenon of small effect sizes (0.2~0.3) in brain-behavior correlation studies would be a good point.
            Certainly, I agree with the importance of pre-registered replication.

          • D Samuel Schwarzkopf

            Okay I’m back now so a last reply to this topic from me:

            I don’t know if they were lucky. There are several reasons why an original effect might be stronger than that of a replication, including the possibility that the original finding was a false positive. In the case of these VBM studies that possibility seems relatively unlikely, considering that most of them were replicated in the original studies and for one there is also an independent published replication. The way I see it, the most likely explanation for the discrepancy is that the methods of this replication were inappropriate. But only further replication attempts can reveal this.

            However, it has nothing to do with circular/non-independence. I’m sorry to harp on about this but this is precisely the problem I see with these articles about non-independent analysis, multiple comparisons, etc. You are implying that the use of whole brain analysis is by definition non-independent and this is simply incorrect. I think the original articles about circular inference made a very valid point but it is extremely counterproductive if this gets widely misunderstood. Once someone (a scientist no less) asked me “Aren’t your analyses non-independent because you use retinotopic mapping?” The mind boggles.

            I am happy to discuss this further with you outside of this thread. If you wish, I can also send you Matlab simulations to illustrate why these analysis are not circular. I think we should end this bit here though as it’s detracting from the main discussion.

          • Lil Kong

            I agree with your worry about misunderstanding of non-independent analysis. I may have to correct my misuse of the word ‘non-independent’.
            But it is really hard for me to believe that an effect size larger than 0.3 in a brain-behavior correlation (i.e., ~10% of the variance explained by a single measure and a single region) could be true and not exaggerated.
            As we can see in the replication, the replicated effect sizes fitted the original ones around 0.2 quite well. Do you have any thoughts on whether effect sizes around 0.2 would be acceptable and more likely to replicate?
            Happy to discuss with you and we can contact by email, and here is mine: kongxiangzheng

          • D Samuel Schwarzkopf

            Thanks, I will email you (probably tomorrow). In answer to your question, I agree that one might expect a relatively small effect size for brain-behaviour correlations, although that presumably also depends very much on the behavioural trait being correlated. As far as VBM is concerned, at present it remains pretty unclear what biological parameters it actually reflects (and I’d say the same probably applies to cortical thickness measures), which makes it pretty difficult to know what effect size to expect.

          • Lil Kong

            Sorry for not making it clear. As you may know, there was actually not much solid direct prior evidence for VBM studies. As far as I know, most of the ‘priors’ were based on other imaging modalities (e.g., fMRI) or other related behaviors. I think this is still true, even now. That’s why I said that the reported effect sizes would also very likely be exaggerated even with a ‘prior’. Taken together with the non-independent effect sizes (3/8), even if the effects do exist, the authors would still find differences in effect size when comparing against the exaggerated reported values. So how can they claim whether the reported effects were replicated or not?

            I agree with you that a new sample (if with a large sample size) would give a more accurate estimate of the effect size. Just as in the replication, the reported effect sizes (0.2~0.3) would be closer to the true values.

  • Geraint Rees

    I applaud the replication attempt, which provides a very interesting stimulus to think about why the studies in question did not replicate.

    One factor that immediately springs to mind is the sample. It’s very small, and in several cases smaller than the study it attempts to replicate. This isn’t very consistent with the general advice that replication samples should be (often much) larger than the study they seek to replicate. So the replication attempt may be a very underpowered study, which is consistent with the BF being very small.

    It’s also important to observe that the sample of participants is non-independent in the sense that the same brain scans are used to attempt to replicate all findings. So if there is a systematic issue with either the sample (for example, they are different in age, country of origin, variance in political attitudes and doubtless other measures) or indeed with the measurement approach (different MRI sequence, different processing pipeline as set out by Ryota) then that would quite possibly affect all the replication attempts because of the non-independence.

    The BFs broadly suggest that ‘more data are needed’ to decide in favour of either the null or the alternative hypothesis, so it may be a little unreasonable to claim that this is a non-replication. It may just be that a low-powered replication study needs more data to conclude whether the null or the alternative hypothesis is better supported.

    Nevertheless, I applaud both the replication attempt and the even-handed framing of your blog post which invites us to explore the scientific questions rather than just jump to conclusions.

  • Dave Langers

    Somehow I do get the idea that VBM in particular is especially vulnerable to this replication problem. (See also dois 10.1016/j.neuroimage.2012.12.045, 10.1007/s00429-012-0385-6, and many more, for some stunning examples.)

    We’ve been trying to assess the reliability of VBM findings in the tiny subfield of tinnitus (a very common hearing disorder). Literature has been booming lately. Most individual studies report a plethora of significant effects. These also tend to be consistent within studies from the same lab. But when accumulating all evidence, results are utterly inconsistent and contradictory even.
    We published a review recently in Neuroscience and Biobehavioral Reviews, doi 10.1016/j.neubiorev.2014.05.013. We are about to submit a study on a large number of participants (100+), more or less copying the same methods and regions-of-interest employed by other authors before, similarly finding nothing worth mentioning (except for ageing effects and some isolated foci that we judge spurious).
    This all suggests that tinnitus doesn’t affect brain morphology (much). Perhaps that isn’t strange because it is a somewhat subtle disorder. But that holds for many studies out there (including the mentioned Facebook friends, etc.).
    I think preregistration is great for confirmatory studies in which there is a clear prior hypothesis to test. However, I believe we must keep room for exploratory studies that seek trends which may inform future hypotheses. Sometimes one doesn’t yet know where to look or what to look for exactly. Therefore, I would encourage preregistration mechanisms to be available, but I would strongly discourage them becoming mandatory at any point in the future. We just need to realise that exploratory and confirmatory studies have a different character and a different role to play in science (just as case studies are not the same as clinical trials, etc.).

    • D Samuel Schwarzkopf

      I agree that pre-registration can play a role in the future of our field. In particular, for the purpose of replication and for large-scale multi-site projects, it is probably an excellent idea. However, I don’t believe it should be mandatory and it isn’t the panacea for enhancing reproducibility that it is often made out to be. I will write about this in some form in the future so I will not go into much depth on this now.

      But briefly, I agree with the previous commenter that there must be room for purely exploratory studies and hybrid ones. This replication study, for instance, could have included several more exploratory analyses (some of which were suggested by the original author whose emails are included in the post) and this may well have changed the conclusions. Pre-registered protocols can only be useful if they are actually adequate and include sufficient detail. They should probably be reviewed properly (and even then we can’t necessarily be sure about them, as the general problems with peer review apply).

      In the end I don’t think there is a problem with pre-registration but with replication. When you are replicating a previous result the pre-registration of the protocol already exists in the form of the original publication. It doesn’t require any new infrastructure to compare the methods of a replication to those of the original. As far as this replication is concerned, it appears to me to be a fairly large departure from the original methods, so I don’t think one can draw very strong conclusions from it.

      • ferkan

        Indeed, pre-registration should not prevent exploratory studies. We just have to make sure that they are labeled as such.

        • D Samuel Schwarzkopf

          To be honest I still think that’s not strictly needed. I think exploratory analyses in a pre-registered study should (must?) be labelled clearly (as was done in this replication study). However, I don’t think all hypothesis-driven studies need to be pre-registered. As I said in my previous post, any first publication on a phenomenon will serve as a “pre-registration” of said phenomenon. As such we should then simply accept that there may have been some exploration and flexibility because that is a natural part of science.

          What I believe we should encourage strongly is more replication of initial findings. For that we can then use the original methods as the “design document” and there is no need to pre-register anything (again, I don’t mean to say pre-reg shouldn’t be an option – but for this case it is not necessary).

          What should also be encouraged (or enforced, ideally) is that departures from the original methods are clearly labeled as exploratory or justified properly if not. This is where this present study seems to fall short, because some of the fundamental methods (imaging protocol, coregistration, ROI analysis) were not matched, and two out of three of these examples were -not- labelled as discrepancies in the paper.

          (Quick apology to Dave Langers, my original comment was meant to be a general reply to the blog post)

          • Neuroskeptic

            I see your point. But suppose that someone gets a grant, on the basis of an application which says “It is important that someone test hypothesis X, please give us money and we’ll test X”.

            They then get the money and test X.

            I would say that it’s unethical for them not to preregister that study and explain that they plan to test X and what their original methodology and analysis plans are.

            And I think grant funders ought to require that.

            Of course if halfway through the study it turns out that X is no longer an interesting hypothesis, they could still change their approach. The preregistration wouldn’t force them to do anything, but it would ensure transparency.

          • D Samuel Schwarzkopf

            Isn’t this now a somewhat different motivation than what is normally discussed (questionable research practices, researcher degrees of freedom etc)? Grant applications typically already contain experimental protocols. Probably not in the level of detail one might envision for pre-reg protocols but this is certainly something that could be changed easily. It is true that often grant proposals are a long way from what is actually done. You could probably try to police adherence to the protocol more.

            Although I also hear that it’s common for people to submit grant proposals for research they’ve already done. In fact, I am still unclear on how pre-reg models aim to prevent that happening on a more massive scale.

          • Neuroskeptic

            Well, people could do the work and then ‘preregister’ it, but that would be fraud. Because when you preregister something you’re attesting to the fact that it will happen in the future.

            It would be an outright lie, just as if I submitted a paper with made-up data.

            Now it does happen with grant applications, and personally I consider that to be fraud too; but it’s fraud in private because the only one being defrauded is the grant funder in question.

            If you did the same thing in the case of a public preregistration you’d be defrauding the entire scientific community. And it would be in public – anyone could call you out on it. Anyone who knows that you’ve already done some certain work, could read the preregistration and accuse you of fraud.

          • D Samuel Schwarzkopf

            While some may say corporations are people, most grant funders aren’t corporations and they aren’t standalone entities really. Most of our grants are publicly funded so I wouldn’t say you’re only defrauding the funder but the larger public. You’re also defrauding the scientific community because scientists are usually the ones who make the decisions on grants (at least up to a point).

            To be fair, I think it’s actually acceptable to have preliminary data in grant applications – in fact that can help support the feasibility of the proposal. It’s more of a question of how much is too much to be called a proof of concept.

            As for enforcing people stick to their grant projects I think this is really up to the funder. If they don’t mind you going off on a tangent then that’s their decision (again, in the end a public grant funder could be held accountable for that by the taxpayer though).

            Anyway, I’m not against pre-reg per se. I may just do it myself at some point ;). I just don’t see that any of the pre-reg models really guarantee that design documents are solid and sufficiently detailed, or that they can’t be gamed just as easily as our current system. The only two pre-reg studies I presently know are this one and the “telepathy” study I reviewed recently. Neither fills me with much confidence that pre-registration has improved the science.

          • ferkan

            As I said elsewhere, it’s clearly not beyond the wit of wo/man to create a good system. No system is infallible. However, at the moment, it’s just too easy to slip into slight dishonesty when writing papers, especially given limited word counts and the pressure to publish in ‘respected’ journals etc.

            To my mind, prereg is not to prevent outright fraud, it’s to prevent succumbing to these easy temptations. If somebody is determined to cheat they can, but as neurosceptic says, preregistration would make it easier to confront some types of fraud.

            In my case, my research has always required human participants. If I ‘preregistered’ after I’d done my experiments, it would be pretty obvious to anyone I was working with, or had talked to about my work. With open access, I’d hope that some of my participants would read the studies as well. So I’d be pretty stupid to try and game the system in this way.

          • ferkan

            “I just don’t see that any of the pre-reg models really guarantee that design documents are solid, sufficiently detailed.”

            At the current time, the system is so broken that any improvement would be a leap forward. A simple declaration of primary hypotheses would prevent a lot of the dodgy stuff I’ve seen.

            Deciding on the exact analysis would be even better. However, I can understand why the latter is complicated. In studies such as the ones I’ve conducted it’s not uncommon for data collection to take 5 years, in which time MRI analysis techniques have generally moved on.

            Still, one can always say, we had planned to do analysis X, but it was out of date (or discredited), so we used analysis Y.

    • Dave Langers

      Adding to this useful discussion regarding preregistration:
      The latest Declaration of Helsinki specifies that the design of every research study be preregistered in a publicly accessible place before the first research participant is even enrolled. Previously, this was the case for Randomised Controlled Trials only. The field hasn’t quite crystallised on how this requirement is supposed to be interpreted (e.g. what constitutes a database, and what level of detail is required to be entered). However, if journals are just going to impose this without further thinking, then all exploratory research on humans is essentially forbidden. I would say that leaves behind the dotting of i’s but kills most of the original science. Furthermore, most studies follow after a period of piloting. This is useful because it is informative for further study designs. I think we can all agree that the outcomes of major pilots are often also of interest to other research groups, and would thus be valuable to share. Not so much because of the obtained data, perhaps, but because of the obtained experience. Only allowing preregistered studies to be published prevents that. I believe that would be very bad.

      • ferkan

        What would stop one preregistering an exploratory study?

        As for piloting… there would be no need to register a pilot, surely… unless one wanted to publish it. And if one wanted to publish it, there should be a space to do that, with full acknowledgement that it was an exploratory pilot.

        • Dave Langers

          Q: “What would stop one preregistering an exploratory study?”
          A: Not knowing in advance what exact analyses (including full parameters etc.) are going to be performed.

          The point of preregistration is that it should be sufficiently precise to avoid the author being left with any room to “tweak” the analysis in any way. In exploratory studies, one devises new analyses to investigate apparent patterns one notices in the data, so “tweaking” (without the negative connotation) is essentially all one does.

          When you say there should be room to publish a non-preregistered study with full acknowledgements that it was an exploratory pilot, that is fine with me. I guess that was my point. It is not clear to me whether the Declaration of Helsinki leaves that room.

          • D Samuel Schwarzkopf

            “In exploratory studies, one devises new analyses to investigate apparent patterns one notices in the data, so ‘tweaking’ (without the negative connotation) is essentially all one does.”

            I don’t think this is entirely true. It is certainly true about some exploratory studies but not all. I think (as I mentioned in a previous comment) that some tweaking actually makes sense in hypothesis driven studies. My main point is that those ‘tweaks’ then ought to be copied precisely in replication attempts – or manipulated specifically to show why they matter.

          • ferkan

            Indeed. For me the precise point of preregistration is that one should have no leeway to pass off an exploratory analysis as an a priori analysis.

            I see no reason why we can’t have a system that ensures this and also allows exploratory analyses. They are in no way contradictory… and certainly, we often need to start with exploratory analyses.

            We need to fight for a GOOD system. If the Declaration of Helsinki does not allow for a good system, we should change it!

          • TomJohnstone

            I disagree with this. The point of preregistration is that the author can’t claim to have made analysis decisions a priori when in fact they were made post hoc. So a preregistered study can be completely exploratory, or a mix of both, e.g. “We plan on testing Hypothesis 1 using the following precise protocol… In addition, we will explore whether any relationships exist between A and B…”

            In addition, it is possible to preregister techniques for optimising the methods in a post hoc data-driven way. For example, one might fine-tune data preprocessing after having collected the data, in a way that is blinded to the experimental conditions and designed not to lead to biases. Preregistering such data-driven optimisation procedures allows for flexibility in data processing without biasing results or permitting fishing expeditions. An additional benefit is that such formal methods for optimisation usually yield objective measures of sensitivity, which can aid in comparison with other studies, particularly future replication attempts.

          • Dave Langers

            Likely our “disagreement” has to do with what can be called “exploratory”.
            For me, exploratory research by definition does not allow the approach to be defined beforehand. You gave a nice example, in which the analysis method was chosen “post-hoc” from an array of methods based on predefined unbiased criteria. To me, that is just an example of an analysis that is agreed “pre-hoc”, if that term exists, except that it is a very complicated recipe.
            In my view, in exploratory studies you explicitly leave room beforehand to change your approach to deal with unforeseen circumstances, allowing yourself to try avenues you might never have considered when you were designing the study. This should precisely be disallowed for preregistered studies.
            Of course one could register a study while stating “we will perform an exploratory analysis” (or perhaps allow some outcomes to be predefined and not others), but that defeats the point of preregistration.
            And just to be clear, of course I do not mind if people preregister and predefine approaches, and it should be required for confirmatory studies like clinical trials. My point would be that there should be room for the “pilot-type” exploratory studies to be published as well, and I fear that that will be impossible in the future.

          • TomJohnstone

            Maybe a bit of misunderstanding here, in that my description of unbiased, data-driven optimisation procedures was not intended as an example of exploratory research – such methods are very well suited to hypothesis driven research.

            Yet exploratory research can still be preregistered, simply by stating what the intentions are beforehand, even to the degree of saying “I have no a priori plan for data collection or analysis”. It is unlikely that studies that are completely like that would be accepted as a preregistered report in a journal, but at least part of such a study might be like that. e.g. “In addition, we will ask participants to complete a variety of potentially relevant questionnaires (selection of which will be determined based on time constraints and other factors TBD) and mine the data for potential associations, with appropriate correction for multiple comparisons.” I suppose there wouldn’t be any need for such a statement, except that the inclusion of results from such an analysis might be more acceptable to reviewers if they had been flagged in the first place.

            Either way, I don’t think the move to preregistration is endangering exploratory research. Potentially quite the opposite: if it results in more honesty about research that is, at least in part, exploratory, perhaps it will lead to a much needed change in attitudes towards exploratory research. Parts of the scientific community clearly believe that hypothesis-driven research is the only legitimate type (reflected in the way undergraduate students are taught, as well as the pressure placed on authors to write a nice, clear hypothesis-focussed story in paper introductions). Yet the testing of hypotheses that are based on little prior knowledge, because of a lack of underlying exploratory research, is probably a major factor in why so many studies fail to replicate, or equally why so many researchers feel unable to commit to specific testing procedures.

          • Neuroskeptic

            I agree with TomJohnstone. If you have no hypothesis and just plan to play with the data and see what you find, great! You should be able to preregister that intention and then your exploration would be 100% transparent.

            You might find that someone with a good idea to share reads your preregistration and helps you out.

          • Dave Langers

            Agreed. I would find it acceptable if preregistrations could remain “vague” to some extent to allow exploratory studies, although that would then diminish the value of preregistration as a “stamp of approval” for e.g. clinical trials (because I don’t suspect that every reader would go to the trouble of comparing a paper against its preregistration every time).
            Still, in my view, a better situation would be if studies were divided into a class of “confirmatory, preregistered” ones, which would then provide strong evidence for their findings, and a separate class of “exploratory, unregistered” ones, which would then provide inspiration and ideas but very weak evidence. Of course, that distinction would have to be made clear in journals (akin to the study/review/case-study/meta-analysis/… distinction now). Various journals have started to experiment with a separate class of preregistered papers, which is great. My worry is that the Declaration of Helsinki can be read as not allowing anything else but that.

            But now I am starting to repeat myself, so I shall stop.

    • Geraint Rees

      One clear challenge with VBM is that the details of the methods, and in particular the registration (DARTEL etc.), do matter a lot. This shouldn’t be surprising or seen as a weakness of the approach; we are trying to identify subtle grey matter differences between participants, so it’s not surprising that the accuracy and appropriateness of between-participant registration matters.

      As a consequence I wonder if VBM appears more vulnerable to non-replication because it is easier to change a detail of the approach, introduce noise in the form of spatial variability, and subsequently “fail” to replicate.

      In the study under discussion, the processing pipeline was significantly different from the original study, as was the analytic approach. And as Ryota points out, when the analytic methods for the original study were applied to the ‘failed replication’ data in the case of CFQ, the finding actually replicated.

      Unfortunately it looks like this was an underpowered replication attempt, with methodological concerns that went unaddressed in whatever peer review of the preregistration protocol took place (it’s very unclear to me what methods the journal used – if any – to formally evaluate the preregistration), and possible solutions appear not to have been taken up by the authors during the review process. It particularly concerns me that Ryota reports, in the peer review process, a replication with the authors’ data using methods drawn from the original work, which they then have apparently not mentioned in the final manuscript.

      The positive is that the work can be used as a stimulus to think and discuss more generally about replication and the important issues raised here. But I am not sure it has added much to our understanding of the underlying biology.

  • ferkan

    I completed my PhD in brain imaging. One of the reasons I left the field, was that many of my colleagues seemed to be engaged in fishing trips. While some of them were quite scrupulous about applying multiple comparison corrections for each specific analysis, they did absolutely nothing about the many different and unplanned analyses they ran. They did not get the basic concept that each analysis counts, even if you discount it later. In addition, replication was virtually unheard of. Similar, but not identical, studies were run…. and differences in results were generally explained away, by…well.. the slightly different methods and samples.

    Given that this was the norm at a pretty well respected institution (one I think the author is familiar with), I found it impossible to trust any of the research I was reading. I’d go and talk with the statisticians and physicists… they’d shrug their shoulders in despair, but it was ‘c’est la vie’. And when that is the case, how can one plan one’s own research, which must be based on the research of others? Psychological data is noisy enough without researchers making it worse.

    Thus we must push much much harder for preregistration. I’d go as far as to say, until we have preregistration, we simply can’t trust science. Making it mandatory would level the playing field. I for one, would be very hesitant to return to the field of brain imaging in the absence of preregistration (not perhaps a great loss to the field).
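Ferkan’s point that “each analysis counts, even if you discount it later” is easy to demonstrate with a toy Monte Carlo simulation. This is a hypothetical sketch: the sample size of 36 merely echoes the replication study’s N, and none of the numbers refer to any real dataset. It correlates pure-noise variables and counts how often at least one “significant” result appears when many unplanned analyses are run.

```python
import numpy as np

def familywise_fpr(n_analyses, n_sims=5000, seed=0):
    """Monte Carlo estimate of the chance that at least one of
    `n_analyses` independent correlation tests on pure-noise data
    comes out 'significant' at roughly the 5% level."""
    rng = np.random.default_rng(seed)
    n_subjects = 36  # echoes the replication sample size
    hits = 0
    for _ in range(n_sims):
        significant = False
        for _ in range(n_analyses):
            # Correlate two completely unrelated noise variables
            x = rng.standard_normal(n_subjects)
            y = rng.standard_normal(n_subjects)
            r = np.corrcoef(x, y)[0, 1]
            # t-statistic for a Pearson correlation, df = n - 2
            t = r * np.sqrt((n_subjects - 2) / (1 - r**2))
            if abs(t) > 2.03:  # two-sided critical t for df=34, alpha ~ .05
                significant = True
                break
        hits += significant
    return hits / n_sims

print(familywise_fpr(1))   # one planned analysis: stays near 0.05
print(familywise_fpr(20))  # twenty unplanned looks at the data
```

With a single planned analysis the false-positive rate stays near the nominal 5%, but with twenty unplanned looks it climbs towards 1 − 0.95²⁰ ≈ 64% — which is why corrections applied within one analysis do nothing about the other analyses that were run and quietly discarded.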

    • D Samuel Schwarzkopf

      I was torn whether or not I should respond to that. My instincts tell me no, my emotions tell me yes ;). I’ll keep it brief.

      Neuroimaging gets a bad rap. There are certainly problems of the kind that you describe, but they are by no means specific to the imaging field, nor particularly common in it compared to others. In all honesty, the fact that we are even talking about this so much w.r.t. neuroimaging is because this field is actively discussing statistical issues, whereas many other fields (behavioural psychology, single neuron electrophysiology) are frankly lagging behind. It is of course true that neuroimaging can have a particularly severe multiple comparisons problem due to the many voxels (although this does not apply to all neuroimaging). However, even that is not specific to neuroimaging. Genetics certainly has similar problems, and so can single neuron electrophysiology.

      Neuroimaging also gets repeatedly compared to phrenology, which further adds to the image problem. People are keen to point out that ‘correlation does not imply causation’ but somehow fail to explain why this is a problem for neuroimaging but not for behavioural or physiological studies. No matter how clearly you describe your prior hypothesis, just using the term ‘brain-behaviour correlation’ somehow translates to ‘fishing expedition’ in many people’s minds. I agree that pre-reg could help alleviate this problem but, as I’ve argued below, I don’t think it is required. What we need is more direct replication – and I think this can be part of on-going research and not simply a goal in itself (although that is of course also welcome).

      • Neuroskeptic

        In my experience, everyone thinks that p-fishing is particularly bad in whichever field they happen to work in themselves.

        Because that’s where they’ve seen the fishing up close.

        So I understand ferkan’s feelings. But I don’t think neuroimaging is really any worse than any of the other “p-value sciences”.

        • D Samuel Schwarzkopf

          Yes I think that’s true. But I think we can also say that awareness of this issue has been increasing a lot in recent years, both specifically in imaging and in psychology in general. And while there is more to be done, I personally think this actually puts our field in a good light, as opposed to how this situation is sometimes distorted in the mainstream media. You won’t believe how often I hear people mentioning dead salmons or voodoo w.r.t. imaging when it really is completely irrelevant. Neuroskepticism is important (skepticism is at the heart of science, as I’ve written more than once…) but I do welcome the resurgence of Neurocomplimenting 😉

          • Neuroskeptic

            I agree. I feel responsible for the Dead Salmon because I helped to publicize it when it came out. And I’m glad I did, because it made an important point about multiple comparisons correction… but there’s no doubt that it has been oversold and misinterpreted by the media who missed the point.

            See my post Don’t Throw Out The Baby With The Dead Salmon.

          • D Samuel Schwarzkopf

            Don’t feel guilty about the Dead Salmon. As far as I remember your coverage of this was detailed and accurate. You can’t be held responsible for every misunderstanding of it provided you don’t mislead people.

            And besides, one has to admit that it’s funny. I’d like to see more Dead Salmon studies. In fact, someone should do a Dead Salmon experiment on that EEG telepathy experiment I reviewed recently :)

      • Hellson

        I generally agree with this post (and all your others), but I feel I need to object to pointing out behavioral psychology as a particularly problematic area. To me the problem is that psychology has become less behavioral, and more cognitive, and moved its relevant target variables behind the empirical barrier. I see those areas in psychology that are still actually focused on behaviors and environmental contingencies, and less on assumed underlying processes that can neither be observed nor proven and remain in the realm of metaphors (as of this moment), as less “ill”.

        • D Samuel Schwarzkopf

          I’m not entirely sure what you mean by processes ‘in the realm of metaphors’. I think what you say makes a certain degree of sense but I find it difficult to comment on that specifically. I am a neuroscientist with physiology background and while I study cognition and behaviour I have always viewed it from this angle.

          • Hellson

            Maybe I didn’t phrase it correctly. I certainly did not mean to attack you or anyone else specifically, should that be the impression I gave. What I meant to say is that many areas in psychology have moved their “target variables” behind the empirical barrier, and inappropriately infer from the observable to the nonobservable. The latter is often described in metaphors or similes from observable areas, and I think this is because they make nice stories readers can easily imagine. We saw that in particular in the early stages of the cognitive revolution, where the brain was often compared to computers, although this comparison is neither accurate nor appropriate.

            Or take the area of ego depletion, for example, which I assume you know represents the idea of a limited mental resource or energy that can be depleted or replenished. Here, willpower is often described as a “muscle” that can suffer from fatigue, although there is absolutely no empirical basis for this characterization. But it sure sounds nice.

          • D Samuel Schwarzkopf

            No worries, didn’t take any offense. I was just curious as to what you meant.

          • Christopher Chatham

            “Here, willpower is often described as a “muscle” that can suffer from fatigue, although there is absolutely no empirical basis for this characterization.”

            I don’t know what you mean. There have been many demonstrations that effortful tasks rely on depletable resources, some of which one might call “willpower”. Here’s one among many demonstrations:

          • Hellson

            Sure. Just like someone might call it mana points. This study, like many others, measured performance in the form of accuracy or reaction times in a sequence of tasks with different demands, not the existence of a “mental resource” or “muscle”. I don’t know why these metaphors for something that has not been observed, but is tacitly assumed to underlie the observation, are even necessary to describe the obtained effects. Indeed, one can define the word “willpower” to meet operationalizable constructs. But the words “energy” and in particular “muscle” are used to invoke the image of something that is observable or measurable (in physics and biology, respectively). Which part of this study suggests to you that one brain segment is a willpower muscle?

          • Christopher Chatham

            I wouldn’t have thought mana points would fatigue, but besides that small issue, mana points don’t seem to make useful predictions about what to test next. Those who use these analogies are merely trying to gain traction on a poorly-understood system by extrapolating from other, better-understood systems. Sure, there is a risk of reifying our metaphors; map vs. territory issues can be thorny. But are you rejecting altogether the utility of metaphors for understanding?

          • Hellson

            To communicate complex systems to lay audiences? Maybe. But not to describe scientific theories and processes, which need clearly defined terms and *operationalizable* constructs. In these cases, I find metaphors misleading and distracting from the relevant information, in particular because they can be used to cover up the fact that a system is poorly understood (compared to the better understood system the metaphor is taken from).

          • Christopher Chatham

            I’m not sure yours is a majority or even tenable position, given the demonstrable utility of metaphors for problem solving in general and in science in particular. Regardless, I suppose we’re getting far afield from the other, more on-topic discussions taking place here, so perhaps you and I could take this one up another time.

          • Hellson

            Oh absolutely, I never said or assumed my opinion represented the majority. Even though you expressed doubts that my position is a viable one, I enjoyed this short exchange and would certainly continue it some other time.

  • Ryota Kanai

    As quoted in this post, I think it’s important to consider how we should conduct a pre-registered confirmatory study both in terms of statistical methods and also in terms of peer review procedures. I truly believe that this particular case can be used as a way to refine our practice of conducting preregistered studies.

    I feel I need to elaborate on my points further to spur the discussion. For that purpose, I was preparing a commentary on the Boekel study as a Reply in Cortex, in the hope that my responses would be read simultaneously with the Boekel paper.

    Since the accepted manuscript was made public on the authors’ website, I did not have that chance. I really appreciate that Neuroskeptic gave me a chance to express my opinions on his blog in a timely manner.

    However I still need some more time (a week) to finish writing a full rebuttal so that the difficulties I encountered in reviewing the pre-registered study are shared and discussed within the community.

    While I’m writing the full reply, I would like to share the comments I wrote as a reviewer of this manuscript, because they illustrate the kind of concerns I had about the methods, and the difficulties I encountered in reviewing such a study.

    There were two rounds of reviews – the initial submission and a revision. My comments are copied below. I’m aware it is unusual to share the comments this way. But I think my original comments would be useful for illustrating the points quoted in this blog. Here I share only my own comments and also decided not to include the authors’ response. I hope they post their response letter here as well so that the readers of this blog can also see what kind of exchanges took place in the review process.

    Review of “A purely confirmatory replication study of structural brain-behavior correlations” by Boekel and colleagues.

    Ryota Kanai (signed)

    This study attempted to replicate five previously published neuroimaging studies that identified relationships between brain structure and cognitive traits. I agree with the authors about the importance of such an attempt, considering the current awareness of issues related to the practice of post hoc interpretations in publications. A purely confirmatory study such as this one makes an important example of a pre-registered study for psychological and neuroimaging experiments. I truly believe that replication studies such as the current one should be encouraged.

    As a reviewer, however, the assessment of a pre-registered confirmatory study also revealed a new challenge. It was not clear to me whether the analysis proposed in the pre-registration could be criticised at this stage. In my opinion, the most important aspect to be evaluated by the reviewer is whether the authors have adhered to the protocol in the pre-registration document. However, I noticed the necessity of a reviewing process for evaluating the methods to be used when a study is pre-registered; otherwise, the methodological validity cannot be assessed at any stage. I’m aware that Cortex has a Registered Reports format, which allows the examination of methodological soundness before a study is conducted. In the case of the current study, I do not think the study went through such a process. However, the online documentation of the planned experimental protocols serves that purpose to some extent.

    I wish I had had an opportunity to examine the protocols before the study was conducted, as the ROI approach used in this study lacks sensitivity and will underestimate correlations. In this study, the VBM analyses were based on the grey matter volume within the cluster defined by uncorrected p < 0.001. This approach does not take the spatial uncertainty of significant voxels into account. Instead, the analysis assumes that the results should be reproduced at exactly the identical location within the brain. But this assumption is incorrect, because there is always inherent uncertainty about the exact location of correlated voxels due to smoothing and randomness across samples. From my own experience with replicating results from VBM studies, peak voxels tend to be slightly different in a replication sample, but within a reasonable spatial range (e.g. within a sphere with an 8mm radius). Because peak voxels and the exact shape of a cluster fluctuate across samples, the mean grey matter within the ROI as estimated in the present study likely adds noise and results in underestimation of correlations.

    Outside this review process, I had an opportunity to discuss this concern with the authors, and they kindly provided me with the raw data to test this idea. As I suspected, the CFQ results were replicated when I used a more conventional small volume correction approach, which tested whether there is a significant voxel showing the expected correlation within a small volume (i.e. within a sphere with an 8mm radius) near the original peak coordinate. While I agree the ROI approach provides a good proxy for estimating the grey matter volume at the reported coordinate (and a cluster around it) in the original study, what is being replicated in the current study seems conceptually slightly different from what the statistics in the original studies tested. In the original VBM studies (at least those I was involved in), we tested whether there are significant clusters that show correlations with some trait variable. But the tests did not specify where such voxels should be; therefore reported clusters have spatial uncertainty, and this needs to be taken into consideration in replication studies. Considering this potential caveat of the analysis approach, the emphasis on the failure of replication for all 17 findings seems a bit too strong.

    As a pre-registered study, the authors followed the protocols proposed in the pre-registration document, and exploratory analyses were presented as such. Therefore, my opinion is that the manuscript should be published with only minor corrections.

    The major concern I raised above should be taken as a point to be discussed within the community and considered in future studies. I think it would be beneficial for the future development of the field to discuss this in public. In particular, I believe replication methods for neuroimaging data could be improved by incorporating spatial uncertainty instead of averaging the values within an ROI. Such issues should be addressed in future studies.

    Minor 1: VBM methods were different. In the studies I was involved in, I used SPM's co-registration methods (i.e., DARTEL), whereas FSL's VBM was used in this replication study. I'm not entirely sure how this methodological difference might affect results. Although in an ideal world we would expect to see consistent results, this is not always the case. Since the purpose of the current study is replication, the methods should also be closely matched.

    Minor 2: It is unfair and possibly misleading to claim that none of the original results were replicated, because half of the tests were underpowered (i.e. the evidence was labeled as anecdotal). It is true that none of them was replicated; but as the strength of using BF is its ability to statistically support a null hypothesis, it is slightly disappointing that nearly half of the results were underpowered.

    Minor 3: Since nothing was replicated, there is a chance something went wrong with the image acquisition or analysis procedure (although highly unlikely). It would be more convincing if the authors could replicate at least one classic finding (e.g. effects of age or gender identity) in their own data set.

    My review for the revised manuscript

    Recommendation: Accept
    Overall Manuscript Rating (1-100): 40

    Reviewer Blind Comments to Author:

    My frank opinion is that my previous comments were not addressed at all on the ground that this is a pre-registered study that does not allow analyses not planned at the outset of the study. My personal opinion is that they could have been addressed and labelled as exploratory analyses.

    On one hand, I am frustrated by the claim that replications failed, as I know the analysis methods used here are suboptimal, and I think the lack of a formal review of the analysis method is problematic. However, I do understand the spirit of pre-registered studies, and their necessity. This is one of the earliest pre-registered studies, and as such, I understand that there was no formal mechanism available to implement a review process for the pre-registration document. Still, I do not think sending out the online document to the authors of the original studies should be considered a formal review process, as it was not clearly explained to me that I needed to review the document as I would review a manuscript for a peer-reviewed journal. As such, I am not convinced that this study went through the right procedure to be considered a proper pre-registered study.

    I also believe that my minor points should have been addressed in this revision. The authors' responses simply claimed that they cannot do anything not initially planned. But those minor points are ones that I thought important for scientific merit (otherwise, what is the point of reviewing?). Even if the authors claim that they only followed the procedure described in the pre-registration document, it is scientifically crucial to use the same software package in a replication study. They should have re-run the analysis with the same software and labelled it as an exploratory analysis, as requested, just as they did for the exploratory analysis they included in their original submission.

    In this sense, I feel there is a double standard. On one hand, the authors included their own exploratory analysis; on the other hand, they declined to include the additional reasonable analyses I requested, on the grounds that this is a pre-registered study.

    While I have critical opinions about this particular pre-registered study, as above, I appreciate that the movement for pre-registration is a relatively new endeavour and recognise its importance for our community. Therefore, I feel this study should be published nonetheless, and we should discuss how we can improve on potential problems we encounter in practice (e.g. not being able to address reviewers' concerns when there are real concerns that would be raised if it were scrutinised as a typical peer-reviewed study). Also, some of the points I made in the previous round were specifically related to how we should run a confirmatory study in the context of neuroimaging, which poses the domain-specific problem of spatial uncertainty. Therefore, while I have criticisms, I think it would be more constructive to discuss these issues in post-publication peer discussion.

    • Neuroskeptic

      Thanks very much for posting these reviews! And for agreeing to be quoted in the post.

  • petrossa

    The failure to replicate aside, what is more astounding is the premise that one can single out a part which causes ‘behavior X’ in the same place in all the brains in existence, or even the majority of them.
    It’s analogous to pinpointing which part of the CPU is running ‘application X’ at a specific CPU cycle.

    • D Samuel Schwarzkopf

      Some sweeping assumptions in this comment. By definition, brain-behaviour correlation studies are correlational in nature. Nobody I’ve ever met who did such studies says that they imply that a single part ‘causes behaviour X’. There *is* a worthwhile discussion to be had about the language used to describe correlations, though. Verbs like ‘predict’ and ‘depend’ probably evoke an inaccurate connotation even though they aren’t statistically incorrect.

      Your analogy comparing the brain to a CPU is also misleading because it’s based on circular reasoning. The question of how much the brain is like a computer (and behaviour the software running on it) is an empirical question. Failures to replicate aside, I think the body of evidence strongly suggests that the computer analogy is inappropriate, even if the strict modularity hypothesis turns out to be incorrect.

      • petrossa

        Hit a raw nerve, I see. Strawman response. Using jargon to dissimulate your lack of argument, changing the matter from the factual to the metaphysical. In fact, using a lot of words to say nothing at all, but just conveying the impression of being ‘right’ by implying I am ‘wrong’. ‘The body of evidence’, please… so what does your ‘body of evidence’ suggest? From what I’ve read, nothing much except the notion that we are special, so therefore it’s not possible we’re mere Autonomous Moving Objects.

        • D Samuel Schwarzkopf

          No but I’m not continuing this discussion. It’s off-topic to this blog post anyway.

          • ferkan

            One can understand this point of view. But it perhaps has more to do with academic titles and the mass media reporting of research than the actual content of papers.

            That said, I’ve been pretty pissed off at the way some of my erstwhile colleagues have sold their research to the media (even if they did it to keep their job/get more funding).

  • Eric-Jan Wagenmakers

    I’d like to add a comment about the sample size issue. Yes, we would have loved to test more participants and obtain more compelling results. This was unfortunately impossible because we simply lacked the financial resources. However, it is important to realize that power is a pre-experimental concept. So *on average*, with low power one runs the risk of obtaining results that are less compelling than one would like. However, after the data are in, all that matters is the posterior distribution and the Bayes factor, as these quantify all that has been learned from the data. Considerations of power are therefore irrelevant after the data are in. This point is elaborated upon in the following paper: Wagenmakers, E.-J., Verhagen, A. J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., Rouder, J. N., & Morey, R. D. (in press). A power fallacy. Behavior Research Methods.
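For readers unfamiliar with how a Bayes factor quantifies evidence once the data are in, here is a deliberately simplified sketch. It uses the Fisher-z approximation with a uniform prior on the correlation; this is NOT the default or replication Bayes factor actually computed by Boekel et al., just an illustration of the logic that the inference depends only on the observed data and the prior, not on power.

```python
import numpy as np

def bf01_correlation(r, n, n_grid=20001):
    """Rough Bayes factor BF01 (evidence for the null over the alternative)
    for an observed correlation r with sample size n.

    Simplification: z = atanh(r) is treated as Normal(atanh(rho), 1/sqrt(n-3))
    (Fisher's approximation), and the alternative places a uniform prior on
    rho over (-1, 1). Illustrative only, not the paper's default BF."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)

    def likelihood(rho):
        # Normal density of the observed z given a true correlation rho
        return np.exp(-0.5 * ((z - np.arctanh(rho)) / se) ** 2) / (se * np.sqrt(2 * np.pi))

    # Evidence for H0: likelihood of the data at rho = 0
    like_h0 = likelihood(0.0)
    # Evidence for H1: likelihood averaged over the uniform prior on rho
    rho = np.linspace(-0.999, 0.999, n_grid)
    like_h1 = likelihood(rho).sum() * (rho[1] - rho[0]) / 2.0
    return like_h0 / like_h1

# With n = 36 (the replication sample size), a near-zero observed correlation
# yields modest evidence in favour of the null...
print(bf01_correlation(r=0.05, n=36))
# ...while a large observed correlation yields strong evidence against it.
print(bf01_correlation(r=0.5, n=36))
```

Note that the function never needs to know the study’s power — which is exactly Wagenmakers’ point — although a larger n narrows the likelihood and so produces more decisive Bayes factors in either direction.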

    • Geraint Rees

      Isn’t power relevant to thinking about why your results turned out the way they did? If I have understood correctly, four out of the five replication attempts resulted in results with a BF<3 – sometimes many BF<3. Such 'anecdotal evidence' in favour of H0 is interpreted in the paper as representing a failure to replicate. But could BF<3 instead be reasonably interpreted as meaning 'can't decide, more evidence is needed' – and one of the reasons that more evidence is needed is that the study had too few participants to reach a clear conclusion?

      • Eric-Jan Wagenmakers

        Hi Geraint,
        I hope we were appropriately careful in our conclusions. We not only presented a default BF, but also the replication BF, and we plotted the posteriors under H1. Anyway, of course you are right in the sense that N is an important determinant of what you can expect in terms of the conclusiveness of the results. But thinking about *reasons* for why the inference is the way it is — that’s different from having the inference itself be influenced by power.

        • Ben

          I just had a look at the paper and personally feel the ‘strength of evidence’ point is quite relevant. In the paper this appears not to be discussed far beyond the summary that support for the null ‘ranged from anecdotal (Bayes factor < 3) to strong (Bayes factor > 10)’.

          Looking at the tables of confirmatory BFs reveals quite interesting patterns within this range. The first thing that sticks out is that the Xu et al. replication attempt yielded stronger evidence in favor of the null than those for the other papers. There also is moderate to strong evidence against the amygdala correlations reported by Kanai et al., 2012. I’d take this specific result with a pinch of salt, though. Registration accuracy will be particularly important for a structure that is small relative to the between-subject anatomical variance (such as the amygdala), and therefore it is hard to tell whether the apparent support for the null in this specific case is valid or points to suboptimal registration. Especially since the replication did not use the original method for registration (DARTEL).

          Please have a look at the attached plot, which hopefully will make the point more intuitive: the distribution of BFs is quite uneven, and characterizing it by its range alone seems rather crude. I feel we should avoid lumping such a diverse set of results into a single bucket labeled ‘failure to replicate’, especially since most readers will equate this phrase with meaningful evidence for the null. I don’t think the paper does just that, but it also could do a lot more to avoid this type of misreading.

          • Ben

            hmm, let’s try uploading this plot again…

          • Ben

            So, to bring it back to Geraint’s argument – there is no need to consider N to wonder about the strength of evidence. It’s the *results* that indicate that most tests didn’t yield any more than anecdotal evidence either way. And doesn’t that warrant exactly the type of discussion that asks for the *reasons* for this lack of conclusive evidence? It seems the most obvious candidate is the lack of power.

          • Neuroskeptic

            Thanks for the very handy plot.

            It almost looks like the BFs are bimodal – most are anecdotal, but some are strong.

    • Jim Kennedy

      Here is a simple, common sense guideline: Every confirmatory hypothesis test or analysis should be publicly pre-registered and should include a power analysis. The power analysis (at a minimum) evaluates the probability that the experiment will draw the correct inference if the alternative hypothesis is true for a reasonably expected effect size. Power analysis is particularly important for Bayesian analyses because Bayesian analyses are more flexible than classical analyses and thus have greater potential for bias.

      The operation of a Bayesian hypothesis test depends on the
      selected prior distribution, the planned sample size, the specific statistical models, and the criterion for acceptable evidence. A power analysis evaluates how well these various factors combine to form an overall hypothesis test—the statistical validity of the test. A relatively simple simulation can generate data that has the expected effect size. The planned Bayesian analysis can be applied to see what proportion of simulations produce the correct inference. Kruschke (2011, Doing Bayesian data analysis: A tutorial with R and BUGS) provides a useful discussion of power analysis in a Bayesian context. He also recommends avoiding use of the Bayes factor because of the high potential for bias. As power analysis becomes more widely applied in Bayesian analyses, his position may become more popular. Another useful reference on power analysis for Bayesian analysis is

      An evaluation of power for a Bayesian analysis that found a surprising potential for bias is at

      As a hypothetical (but not unrealistic) example, if a planned hypothesis test has a .4 probability of producing a Bayes factor of 3 or greater supporting the null hypothesis when the alternative hypothesis is true, an experimental outcome supporting the null hypothesis is unconvincing–and that is true whether the power analysis was done before or after the data were collected. Of course, it is in everyone’s best interest to do this evaluation before the data are collected, and that should become the standard for confirmatory research.
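      The kind of simulation described above can be sketched in a few lines. Everything here is illustrative: the approximate Bayes factor uses a Fisher-z normal likelihood with a uniform prior on the correlation (a stand-in for the paper’s exact default test), and the assumed true effect size of 0.3 is hypothetical; n = 36 matches the replication sample.

```python
import numpy as np

def bf10_corr(r, n):
    """Approximate Bayes factor for H1 (correlation exists) over H0 (rho = 0).
    Uses the Fisher-z normal approximation to the likelihood of a sample
    correlation, with a uniform prior on rho under H1. This is an
    illustrative stand-in for the exact default Bayesian test."""
    rho = np.linspace(-0.99, 0.99, 1001)  # grid for the uniform prior
    sd = 1.0 / np.sqrt(n - 3)             # Fisher-z standard error
    like = np.exp(-0.5 * ((np.arctanh(r) - np.arctanh(rho)) / sd) ** 2)
    # Normalizing constants cancel in the ratio of marginal likelihoods.
    return like.mean() / np.exp(-0.5 * (np.arctanh(r) / sd) ** 2)

def misleading_null_rate(true_rho, n, reps=500, seed=0):
    """Proportion of simulated studies of size n that yield BF01 > 3
    ('moderate support for the null') when the effect is actually real."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, true_rho], [true_rho, 1.0]]
    hits = 0
    for _ in range(reps):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        if bf10_corr(np.corrcoef(x, y)[0, 1], n) < 1 / 3:
            hits += 1
    return hits / reps

# Hypothetical true effect of rho = 0.3 with n = 36:
print(misleading_null_rate(0.3, 36))
```

      If this rate comes out substantial, then a BF > 3 in favour of the null is weak grounds for claiming non-replication – which is exactly why the simulation is worth running, whether before or after the data are collected.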

      • Eric-Jan Wagenmakers

        Hi Jim,
        I disagree strongly on all points. For inference, when the data is at hand, power is irrelevant. The criteria you use are frequentist and have no epistemic foundation: we care about the data obtained, not other data sets that could be obtained but were not. The Bayes factor is the only coherent method for model comparison. I know Kruschke and some others do not like it, but there are strong reasons to disagree with them (among them coherence and probability theory).

      • Eric-Jan Wagenmakers

        Well, actually, let me revise that statement. I do like the idea of a Bayesian power analysis *in the planning stage*. Just as long as one uses a coherent evidence assessment method, of course (i.e., the Bayes factor, and not some ad-hoc procedure).

        • Geraint Rees

          Out of interest, did you do a Bayesian power analysis in the planning stage of this study?

          • Eric-Jan Wagenmakers

            There was no point: we tested as many participants as we could. In research less costly, I usually adopt a stopping rule that says “continue sampling until BF exceeds 10 or 1/10, or until having tested at least x participants” (where x is large). Unfortunately this work was very expensive so we could not test more. Nevertheless, I do believe that the results are suggestive (and some are strong); we have not thought about analyzing the results as a whole.
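            For concreteness, that stopping rule could be sketched like this (an illustrative Python sketch; the Bayes factor is a Fisher-z approximation with a uniform prior on the correlation, not the exact default test from the paper, and the assumed true correlation of 0.6 is hypothetical):

```python
import numpy as np

def bf10_corr(r, n):
    """Approximate correlation Bayes factor (Fisher-z likelihood,
    uniform prior on rho under H1); an illustrative stand-in for
    the exact default test."""
    rho = np.linspace(-0.99, 0.99, 1001)
    sd = 1.0 / np.sqrt(n - 3)
    like = np.exp(-0.5 * ((np.arctanh(r) - np.arctanh(rho)) / sd) ** 2)
    return like.mean() / np.exp(-0.5 * (np.arctanh(r) / sd) ** 2)

def sample_until_decisive(true_rho, n_start=10, n_max=500, seed=0):
    """Test one extra participant at a time until BF10 > 10 or < 1/10,
    or until n_max participants have been tested."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, true_rho], [true_rho, 1.0]]
    data = rng.multivariate_normal([0.0, 0.0], cov, size=n_start)
    while True:
        r = np.corrcoef(data[:, 0], data[:, 1])[0, 1]
        bf = bf10_corr(r, len(data))
        if bf > 10 or bf < 1 / 10 or len(data) >= n_max:
            return len(data), bf
        data = np.vstack([data, rng.multivariate_normal([0.0, 0.0], cov, size=1)])

# With a hypothetical strong true effect, the rule usually stops early:
n, bf = sample_until_decisive(true_rho=0.6)
print(n, bf)
```

            On the Bayesian view, monitoring the Bayes factor after every participant does not invalidate the inference – optional stopping is harmless for the BF, unlike for repeated significance testing.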

          • Geraint Rees

            Thanks EJ.

            So to be clear, that means (I think, based on scoping Ben’s helpful histogram above) that 14/17 of your BFs do not meet your standard stopping rule. And 9/17 (the majority) are ‘anecdotal’, meaning barely worth a mention in favour of either the null or the alternative hypothesis.

            I agree the data are suggestive. But what they suggest to me is that you didn’t collect enough data to decide whether the original findings replicated or not, irrespective of whether the methods were optimal!

          • Eric-Jan Wagenmakers

            Dear Geraint,
            I believe that the results (as a whole, including the replication Bayes factors!) suggest that the effects are smaller and more brittle than previously claimed. Surely we agree on this? This study was never meant to yield a definitive yes-no answer. But to ignore these results and argue that all of this is due to small sample size seems to me to be counterproductive and not in line with what the findings show (although I understand it might be in line with your prior beliefs — as a Bayesian, I have no problem with this). The constructive response, imo, would be to engage in a RR for Cortex where the proponents take up the gauntlet and demonstrate the veracity of these effects. Perhaps this can even be done in the context of an adversarial collaboration — in general the interaction with the researchers involved was very pleasant and the overall experience has been a good one. But I assume that at the very least the current report will reduce your estimate of effect size in these studies. We intended this paper to be the start of a conversation, not the end of it.

          • Geraint Rees

            Dear EJ,

            Thanks. I’m not trying to ignore your results; I’m trying to understand them! And I’m not suggesting that all of your results are due to small sample size – quite a few other possibilities are being discussed here, including suboptimal spatial registration, sample characteristics etc – and you mention many of these in your preprint. Of course it remains possible that the original effects do not replicate because they are false positives. It’s just that we can’t tell from the study you conducted where the BFs are inconclusive (the large majority of the effects). Regardless of whether we’re frequentist or Bayesian we surely agree on that?

            I agree that more replication would be wonderful! As you know we are already very keen on replication. For the Kanai et al (2012) result you attempted to replicate, we already reported an independent replication in the original publication. For the Kanai et al (2011) result you attempted to replicate, Kanai reports (above and in your peer review) that the result replicates in your data when using our (previously reported) analysis pipeline. And in addition there is an independent replication of that result in a different sample (Sandberg et al NeuroImage 2014). We’ll carry on trying to replicate our findings and certainly encourage others to do so too. But hopefully this debate has helped inform some of the things to think about when attempting replication, in particular the need for power analyses and careful consideration of methodological pipelines (e.g. DARTEL).

            I look forward to continuing the conversation in the spirit of constructive engagement!

          • Eric-Jan Wagenmakers

            A third possibility is that the effects are non-zero, but much smaller than previously asserted. This possibility receives support from the replication BFs, some of which are very large indeed. Replication is key, I completely agree, but I also believe that preregistration is essential (such as in the RR format for Cortex).

          • D Samuel Schwarzkopf

            This is a part I just don’t get. I have yet to port your R-script to Matlab so I can’t answer this question myself yet ;). But using your replication correlation-BF what would be the BF of the original correlations (the initial results or the original replication ones, where applicable) using the sample size you had?

            To me perhaps the most convincing advantage of Bayesian or precision stopping criteria is that they are orthogonal to the experimental hypothesis. As you say, you could have collected data until the BF was either >10 or <0.1. I can see that this is costly in this case, but surely there is an N that is too small to actually confirm the null?

          • Eric-Jan Wagenmakers

            Interesting question! I cannot recall whether we computed BFs for the original results.

            In general (on average), it is difficult to collect compelling evidence for H0 unless you have many participants. It does help to have a one-sided test though.
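            A quick way to see this is to compute the best case for the null: the Bayes factor in favour of H0 when the observed correlation is exactly zero, at several sample sizes (an illustrative sketch using a Fisher-z approximation with a uniform prior on the correlation, not the paper’s exact default test):

```python
import numpy as np

def bf01_corr(r, n):
    """Approximate Bayes factor for H0 (rho = 0) over H1 (Fisher-z
    likelihood, uniform prior on rho under H1); illustrative only."""
    rho = np.linspace(-0.99, 0.99, 1001)
    sd = 1.0 / np.sqrt(n - 3)
    like = np.exp(-0.5 * ((np.arctanh(r) - np.arctanh(rho)) / sd) ** 2)
    return np.exp(-0.5 * (np.arctanh(r) / sd) ** 2) / like.mean()

# Best case for the null: the sample correlation is exactly zero.
# Even then, support for H0 accumulates only slowly with sample size.
for n in (36, 100, 500):
    print(n, round(bf01_corr(0.0, n), 1))
```

            Under this prior, even a sample correlation of exactly zero at n = 36 falls short of strong evidence for H0. A one-sided test helps because H1 then makes a more specific prediction, which is easier for the data to discredit.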

          • D Samuel Schwarzkopf

            See above for my attempt at answering this question. I thought it was more useful to post this at the top of the thread where it doesn’t get lost…

        • Jim Kennedy

          Hi Eric-Jan,
          Another way to look at it is that a power analysis is simply a simulation to see how effectively a hypothesis test (a decision-making process) works. It is an evaluation of a test method in a specific situation, not an evaluation of a scientific hypothesis. The method is evaluated by feeding it data with known properties and examining the output of the test method. Some of your comments could be interpreted as suggesting that you believe, on philosophical grounds, that your ideas about test methods are above empirical evaluation and should not be questioned. However, my impression is that you are not one of the Bayesians who believe that simulations should not be used to evaluate Bayesian methods. If a simulation finds that the test method tends to give incorrect results with reasonable models of the effect of interest, then findings using that method are not going to be convincing, and philosophical arguments cannot overcome the simulation data. The evaluation of the test method is applicable whether it is done before or after the method is used with actual experimental data. If Bayesians are confident of their methods, I would think they would be advocates of power analysis.

          • Eric-Jan Wagenmakers

            Well, the issue is that, from a Bayesian perspective, there is only a single method, and it is the optimal one, coherently using all information in the observations to update from prior to posterior. On average, before you see data, it yields a distribution of results. This may be informative, but it cannot result in the adoption of a different method. So the “performance” is never questionable: it is optimal by definition. But I do agree that one could do a decision-making analysis on the power results and take the action (conduct the experiment or not) with the highest expected utility. We may have done this, but in a very implicit fashion.

  • storkchen

    I am intrigued by this critical comment by Kanai, cited in the post: “3. The suboptimal methods could not be corrected because it was a pre-registered study.” My instinctive reaction was: isn’t that precisely the point of preregistration, to avoid researchers switching to a more “favorable” analysis after seeing the data? So I think more information on what possible optimizations or corrections of the analysis were meant would be very helpful. This issue could potentially be very important to the discussion about the logic and goals of preregistration.

    • Geraint Rees

      The comment raises two issues. The first is why the suboptimal methods were not picked up at the pre-registration phase. Cortex states that pre-registration protocols are peer reviewed. However, the details of the peer review are not clear and the protocol is not (currently) to my knowledge posted on the Cortex website. So it could be that the peer review of the pre-registration protocol did not pick up these issues, or that the issues were picked up but the peer reviewers were overruled by the editor. A broader issue with pre-registration is therefore that it is not a panacea; if peer review is faulty or a protocol issue missed/disregarded, then suboptimal methods will be used in the subsequent protocol.

      The second issue is why the authors, when told about the suboptimal methods in the review phase of the main manuscript, chose not to pursue an additional exploratory analysis (labelled as such) with optimal methods. This would have given us much greater insight into why some of the findings failed to replicate. Exploratory analyses are perfectly valid in a pre-registered study.

      If, for example as Kanai pointed out, the authors’ own data replicates one of the previous findings when analysed (in an exploratory fashion) using the previous study protocol, but fails to replicate when using the pre-registered protocol, then this tells us something: that the failure to replicate is due to the protocol rather than the data or the biological effect. But, as documented above, the authors apparently chose not to include these details in their published paper.

      Pre-registration does not solve the issue of having good methods and appropriately powered studies in the first place.

      • Neuroskeptic

        “Pre-registration does not solve the issue of having good methods and appropriately powered studies in the first place.”

        It doesn’t guarantee it, but it does ensure transparency. With preregistration poor methods can’t be “brushed under the carpet” nearly so easily.

        Also, public preregistration offers anyone the opportunity to comment on the methods and criticize them, before the results arrive.

        This is constructive criticism because it could lead to the authors adopting better methods (at the risk of making themselves look obtuse if they don’t – again, thanks to transparency).

        Preregistration isn’t a guarantee that good science will be done. Nothing can guarantee that. But preregistration realigns the incentives so that good science (unlike today) is favored.

        • Geraint Rees

          I think you’re being a bit naive here. For most authors and most scientists interested in this area, the first opportunity to publicly comment on the pre-registration protocol came with publication of the manuscript pre-print. This is because the only way to find the pre-registration protocol is to read the manuscript, which gives the URL where the protocol is published (it has attracted zero comments). There’s no signposting on the Cortex website that I could find.

          I’m also a little unclear on the (non-public) peer review of the pre-registration protocol. The only public comments available from Ryota indicate that he was sent the pre-registration protocol (but as information, not for peer review). So presumably Cortex (as indicated in their policy) also conducted single-blind peer review on the pre-registration protocol? The comments on this blog make clear there are quite a few methodological issues that should have been (and perhaps were – we don’t know) picked up in that earlier protocol peer review.

          I think the lesson here is that pre-registration is only as good as pre-registration peer review. If pre-registration peer review fails to pick up issues, then bad science can be done easily, albeit in a transparently bad fashion.

        • Geraint Rees

          Chris Chambers has updated the blog to make clear that although the pre-registration protocol was published, it was not published by Cortex nor was it peer-reviewed. The pre-registration protocol is published on the personal blog of Luam Belay, a Masters student at the University of Amsterdam who is a co-author on the published manuscript. It’s published as an embedded document so is not possible to find easily with Google. So it seems that no peer-review was carried out on the pre-registration protocol that might have avoided these issues.

    • D Samuel Schwarzkopf

      This comment illustrates perfectly the problem I raised in my reply to Chris. I think the discussion of whether researchers switch to favourable analyses detracts from situations where this should be the correct course of action because the pre-reg protocol is inadequate. In this case the “pre-reg” protocol was already available in the original papers and it wasn’t followed.

  • Chris Chambers

    Hi all,
    Great discussion. I edited this paper at Cortex and I’d just like to add a few points:

    1. To be clear, Cortex did not review the pre-registered protocol; we received the manuscript to consider only after the study was completed. Cortex does offer a format of publication called a Registered Report which involves a two-stage peer review of the protocol and final paper — see here for details: However this submission was *not* a Registered Report. It was a standard research report in which the authors pre-registered their protocol on a blog independently of any journal. I agree with Geraint that such protocols should be published as prominently as possible to allow community feedback – there are now some terrific resources for doing so, such as the Open Science Framework ( — we have pre-registered several of our own studies on OSF and have then improved them following community input).

    2. It remains a topic of debate as to whether the method employed by Boekel et al was suboptimal (you will see further discussion of this issue in Kanai’s reply, which will appear in Cortex*) — but even if we suppose it was suboptimal then, as Kanai points out in this thread, this case provides an argument in favour of Registered Reports**, where the authors of the original study can be formally involved in peer review of the protocol.

    3. The manuscript by Boekel et al was received by Cortex on April 9,
    2014. It received three expert reviews: two from neuroscientists
    (including of course Ryota Kanai) and one from a specialist in Bayesian
    statistical methods. All reviewers recommended publication following a round of revision.

    4. There are two traps we must avoid here. The first is seeing the Boekel study as the last word on structural brain-behaviour correlations. It isn’t and it doesn’t mean such effects don’t exist. It is simply another (important) brick in the wall. The second is to set unrealistic standards for what a study must achieve simply *because* it fails to replicate another. If you find yourself questioning whether the Boekel et al study should have been accepted for publication (due to what you consider to be suboptimal methodological aspects), ask yourself whether you would arrive at the same judgement if all the effects *had* successfully replicated. If so, great. If not, you are reinforcing publication bias by setting a higher bar for negative findings than for positive findings.

    5. I think there are some misconceptions in this thread about what pre-registration is and isn’t, and the consequences it does (or doesn’t) have for exploratory research. In short, it doesn’t hinder exploratory analysis and it needn’t hinder science. Equally, I agree with Sam that it shouldn’t be seen as a panacea, and nobody I know has ever argued that it should be mandatory. We have tried our best to emphasize this repeatedly throughout the last 18 months — see here for more

    6. Finally I’d like to thank everyone involved — the authors, the reviewers, the commenters — for the helpful manner of this debate. Ryota, in particular, should be applauded for his transparency and for adopting such a positive and constructive tone. What a great example this discussion sets for other fields (and other areas within psychology) in terms of managing the sociology – as much as the science – of replication.

    * Perhaps the next structural brain-behaviour correlation study could be submitted to Cortex as a Registered Report!

    ** If you would like to submit a comment on the Boekel study (or the wider issues it raises) to Cortex, to appear in the same issue as the main paper, please contact me asap. I would be happy to widen the discussion in a series of Forum articles.

    • Geraint Rees

      Thanks Chris that’s very helpful. I’m not entirely in agreement with your point 4. The standards we are discussing are not just to do with whether the methods are suboptimal or not. Two additional suggestions have arisen in this discussion.

      The first is that the interpretation of the data is overstated. The majority of the Bayes factors provide ‘anecdotal evidence, barely worth a mention’ in favour of non-replication. Ben has pointed out that these BFs are lumped together in the to-be-published paper with the ‘strong evidence’ BFs as a single range that ‘suggests’ non-replication. I ask in return: if you received a paper with 17 positive BF brain-behaviour correlations, but 14 did not meet standard stopping criteria for Bayesian inference, would you be comfortable publishing that paper as ‘suggestive evidence’ for the positive brain-behaviour correlations? We must have similar standards for positive and negative results at the very least.

      The second issue that has arisen is the response of the authors to the peer-review suggestions from Kanai; in particular that if their data are reanalysed using the protocol in the original papers, the findings replicate. This at the very least suggests that there is an issue with the preregistration protocol that has led to the ‘non replication’. Again, if we considered a positive brain behaviour correlation paper and a reviewer pointed out that if the data are analysed in an optimal way the effect disappears I think we would expect the authors/editor to incorporate this information somehow into the discussion (or provide an exploratory analysis confirming this).

      So I agree with you that we should hold both positive and negative findings to the same methodological standards. But I’m worried that we didn’t in this case – although of course, hindsight is a wonderful thing.

      • Chris Chambers

        Thanks Geraint. In response to your first question: generally, yes, had the results gone the other way to the same strength of evidence (and had the reviews been equally favourable), I think I would have reached the same editorial decision. Of course it is very difficult to say for sure, as that paper doesn’t exist, but I do try as much as possible to apply the same standards in assessing the publishability of positive and negative results.

        I don’t agree that the authors’ results are oversold – they are in my reading very clear about the range of BFs obtained – but I will let them defend themselves!

        In terms of Kanai’s point about the re-analysis, I did think hard about this, and I decided that the replication authors shouldn’t be required to include these exploratory analyses in their manuscript for three reasons: (1) because Kanai himself recommended that the paper be published without this re-analysis; (2) because all primary authors had been consulted in advance with the detailed protocol and did not recommend any alternative approach at that time (admittedly this may have been better as part of a Registered Report, where reviewers would know exactly what was expected, though it is likely that RRs did not exist at the time); (3) that the exploratory analyses were likely to be reported anyway through replies to the article.

        All of the above is a judgement call on my part and I’m happy to accept that another editor may have decided differently. I don’t want to defend myself too strongly – I’d rather let others pass judgement on whether I made the right or wrong decision and I’ll take it on board next time around. I’m the first to admit that handling replication studies is challenging. Let’s face it, it’s not something we have nearly enough practice doing!

    • D Samuel Schwarzkopf

      Thanks for that clarification, Chris. Of course, I agree that this case doesn’t spell doom for pre-registration. It is certainly still early days for this publication model. So we can probably regard these concerns as early issues to be ironed out rather than as criticisms of why pre-reg can’t work.

      By now I personally know two pre-registered protocols (or by hearsay, three) in all of which there were clearly issues with the original protocol. It shouldn’t be hard to see that emailing the protocol to the original authors to obtain tacit agreement about the methods is perhaps insufficient for guaranteeing replication protocols are adequate. We all seem to agree that peer-review of the pre-reg protocols should be an improvement. However, we also all know of the problems with peer-review so this can hardly be a perfect safe-guard either.

      I used to say that pre-reg would be ideal for replication attempts. This case has had me revise this opinion. As I’ve argued elsewhere in this thread, for replication the original protocol is already available in the original publication (or at least it should be, if the previous methods contain sufficient detail). Obviously, it may still be a good idea to ask the original authors about the experimental design of the replication.

      But certainly in this case this was not critical. Just from comparison with the original studies it is clear that some of the fundamental methods are not matched (pulse sequence, coregistration, ROI-based analysis approach). This all seems quite essential. By the same logic I could use binoculars to show that findings from radio astronomy fail to replicate.

      Alright, perhaps that’s an exaggeration, but I think it is quite a crucial issue how the dependent variable is measured. A failure to replicate can mean that the effect doesn’t exist – but it can also mean that it doesn’t generalise. (If I may make a shameless plug, I have recently discussed this very issue in a blog post.) I think this is why both conceptual and direct replications are critical, because only together can they actually answer this question.

      This replication attempt is clearly more of a conceptual replication. And the entire discussion about how non-pre-registered methods, such as the analysis Ryota recommended in his review, are “exploratory” (thus implicitly stating that they are not up to the same standard as methods from the registered protocols) is detracting from the more critical issue of whether the effect is real and, if so, what factors control it.

      • Neuroskeptic

        “By now I personally know two pre-registered protocols (or by hearsay, three) in all of which there were clearly issues with the original protocol.”

        Preregistration doesn’t ensure that a study is well-designed, of course, but it does ensure that everyone can see the quality of a design from the outset.

        If you feel that a preregistered protocol is flawed, you should say so and make a prediction as to what the results will look like as a result of these flaws. In this way preregistration makes criticism testable.

        • D Samuel Schwarzkopf

          Yes, in theory. But in practice all three examples were suggested to be flawed only after the fact – nobody saw, let alone commented on, the pre-reg protocols beforehand. So perhaps what this tells us is that if pre-reg is supposed to become more commonplace in the field, we need to make evaluation of the pre-reg protocols much more prominent. They will need peer review. And they must not only be public but also visible, so that other people can comment on them.

          • Neuroskeptic

            I agree – and I think that better structures for communicating and reviewing prereg protocols are needed. I’m confident that this will happen and is already happening; Cortex is leading the way (see Chris’ comment above). The world of clinical trials also offers an example of a very powerful database of protocols.

            However I’d say that even if someone puts their protocol up in some obscure place, and no-one reads it until after the final paper is published, this is still much better than if they didn’t put the protocol up at all.

            The preregistered protocol would help readers of the paper to judge to what extent the paper represents a fishing expedition etc. which is something that we currently can only speculate about.

          • D Samuel Schwarzkopf

            I agree with that. I would also note that at a smaller scale a lot of research is already pre-registered in a way. For every major imaging project we need to present the paradigm to the department first. The protocols are discussed and approved if people agree it is reasonable. And the slides of the presentation (and proposal in some departments) are recorded for posterity.

            This isn’t the same of course. For one thing not the whole community sees them and there is still a lot of flexibility. But it’s a similar principle.

          • Neuroskeptic

            Oh, I completely agree.

            In fact most in vivo studies are already ‘privately preregistered’ in at least three ways:

            1. The grant application
            2. The ethics committee (ERB) protocol
            3. An institutional proposal (as you noted).

            However the crazy thing is that these are rarely made public. Grant applications sometimes are, depending on the funder, but these are rarely easy to locate and link to subsequent papers.

            If these three kinds of existing de facto preregistration records were only made public and (as you’ve said) made easy to find and navigate, it would be a huge step forward!

          • D Samuel Schwarzkopf

            So to me the most sensible way to implement this would be to make grant proposals and internal project registrations more transparent, public and ideally centralised. I don’t know if ethics protocols are as useful for this purpose, because most of them are too general and don’t deal with the scientific issues, for obvious reasons. In fact we have debated in my department whether to make project presentations publicly available. I guess this would not be met with massive enthusiasm because of people’s fear of being scooped (which is generally quite irrational in my opinion). I do think it would need to be centralised to be visible, which I suppose the Open Science Framework does provide in theory, albeit not so much in practice at the moment.

            I can certainly agree that I would have liked it if I had a public record of a certain experiment I did that was frequently met with accusations of “fishing”. It’s really ironic as it is the most hypothesis driven experiment I have ever done with the protocol being pretty much exactly as in the project proposal and even pilot data showing exactly the same result. Anyway sore point 😛 I’ll stop here cause writing Disqus comments on my phone is a major pain in the asymmetry.

          • Eric-Jan Wagenmakers

            The key point about preregistration is the pre-analysis part: eliminating all researcher degrees of freedom in the analysis stage. That is not addressed by publishing proposals, because these usually do not contain the finer details of the analysis plan. We spent a tremendous amount of effort on this replication attempt. We contacted the original authors, used their masks, incorporated their advice, etc., and dealt with all of these things to the best of our ability. Note that we also attempted to replicate our own work. I don’t think this is a conceptual replication at all, and I don’t agree with the insinuation that the work is substandard because the protocol was published on a student’s website (at a point in time when OSF was just starting out, and Cortex did not yet have the RR format).

          • D Samuel Schwarzkopf

            I don’t think you can eliminate all researcher degrees of freedom through pre-reg either – but you are probably right that current grant proposals are not detailed enough. That’s certainly something that could be changed, though. Then again, it’s also somewhat unreasonable to expect people to pre-plan all experiments before their grant applications.

            I appreciate that your replication of five separate studies would have been a lot of work (and expensive). Still, it would obviously have been better if you could have collected more data to obtain more conclusive (non-anecdotal) results.

            Surely most people will agree that carrying out replications like this is critical. Also, regardless of where one stands on the pre-reg issue, this case is a perfect field test for carrying out such replications. It’s sparked this important discussion, which can only be a good thing in my mind.

            I don’t know why your methods differed despite the steps you took to match them, especially as you were communicating with the original authors. Perhaps there was a breakdown in communication about that, as Ryota seems to suggest. Again, I also think you always have to keep in mind that differing methodological details may mean the results don’t generalise. Still, a different pulse sequence and coregistration method sounds to me like a fairly strong methodological difference.

            This issue is something that could have been prevented had the pre-reg protocol been more visible, more public and thus more transparent. I don’t mean to insinuate that any of this has anything to do with it being published on a student’s webpage. I would say (and have said) the same about OSF protocols that aren’t widely known, or about RRs where the peer review is substandard. In order for pre-registration to succeed, we need a much better system for making the protocols known.

          • D Samuel Schwarzkopf

            Instead of sleeping (because who needs sleep?) last night I was thinking a bit more about this issue and what you said.

            First let me clarify one thing. I think my binocular analogy works, even if it’s doubtless exaggerated. This is not really meant as a value judgement however. Binoculars are a pretty good tool for certain measurements. They just aren’t suitable tools for radio astronomy.

            It is also not always clear which experiment had the “binoculars” in this case. I think it’s fair to say that DARTEL would have resulted in superior alignment and that a less restrictive ROI analysis would take spatial uncertainty into account (although it would be justified to ask whether these methods can somehow inflate spurious results). Regarding the structurals, the MDEFT images acquired at 1.5T in Ryota’s studies may have better tissue contrast, and thus higher sensitivity to these kinds of individual differences, than your images acquired with a different sequence at 3T (which we know tends to have more distortion). Of course, the opposite is also theoretically possible – that those scans are just more prone to producing false positive correlations – but considering that most of the original VBM studies already contained replications, I’d say this is less likely.

            I think this actually points to a real problem: if you carry out a replication attempt you need to show that your methods and data are actually of high enough quality to have any hope of successful replication. To me this is an essential part of the scientific process and this is what people should do all the time – and I’d say they do. This is why I get somewhat irate whenever people claim that we as a field “don’t replicate enough findings”. I agree we can do more, especially by encouraging more publication of failed replication attempts to counteract the file drawer problem. But it isn’t strictly true that people never replicate anything.

            I think your study would have been a lot more convincing if it had actually contained some successful replication of effects, ideally of similar magnitude, that we would expect to be replicated. I know this is perhaps easier said than done – but as long as all 17 findings in this study are non-replications it is frankly impossible to know whether it isn’t simply due to a methodological problem.

          • Chris Chambers

            “you need to show that your methods and data are actually of high enough quality to have any hope of successful replication.”

            This, by the way, is why RRs include two specific criteria —

            At Stage 1 (protocol):
            Whether the authors have considered sufficient outcome-neutral conditions (e.g. absence of floor or ceiling effects; positive controls) for ensuring that the results obtained are able to test the stated hypotheses

            At Stage 2 (full manuscript review):
            Whether the data are able to test the authors’ proposed hypotheses by passing the approved outcome-neutral criteria (such as absence of floor and ceiling effects or success of positive controls)


          • D Samuel Schwarzkopf

            Thanks for the info, Chris. This is indeed great and should improve such RRs a lot. Of course, as I said, this should be part of all science not just RRs, wherever applicable. Obviously it is particularly important for replications.

          • Geraint Rees

            Thanks for referencing the pre-reg Cortex criteria, Chris – they are very interesting. It’s ironic that the paper we are currently discussing would have failed to meet the current Cortex pre-reg criteria (specifically ‘For inference by Bayes factors, authors should guarantee testing participants until the Bayes factor is either more than 3 or less than 0.33 to ensure clear conclusions.’)!

          • Chris Chambers

            Yes the criteria for RRs are indeed more stringent, and necessarily so given that the journal provisionally accepts manuscripts prior to study completion. Whether that degree of stringency should be demanded of all papers is an open question – I suspect any such calls would be met with strong opposition. I’d also wager that few papers that any of us have published via conventional article formats, as Boekel et al have done here, would meet the RR criteria for statistical power or BF thresholds. For that reason I wouldn’t feel comfortable singling out their study for comparison against those criteria.

          • Geraint Rees

            I’m not sure about this. Are you really saying that we should apply weaker methodological standards when accepting a paper that is directly submitted than when accepting one that is pre-registered? Isn’t the idea that we should apply the *same* standards for both? Pre-registration simply binds the authors to applying the pre-registered analysis except where specified. Surely we shouldn’t have more stringent criteria simply because a study is pre-registered, or weaker ones because it is not?

          • Chris Chambers

            I wouldn’t frame the contrast as between directly submitted vs pre-registered but rather between RRs and non-RRs. I don’t really think it’s unnatural for a journal to apply stricter methodological standards when agreeing to publish papers before the data exists, at least early in the life of RRs. In assessing standard (non-RR) submissions, it’s also important for all editors of the same journal to work on parity. In this case I judged that the Boekel et al study met those common standards and so did the reviewers. I realise you don’t agree, in which case perhaps you could help us establish RRs more widely and raise the game!

            I do agree with you that in a perfect world we’d have the same standards for all papers but that’s just not realistic – yet. As I said earlier, if we were to apply the standards reflected in RRs to the entire literature and retract everything, we’d end up trash-canning a lot of published science. And how many of your papers, or mine, or even of the studies that Boekel et al sought to replicate, would have achieved 90% statistical power or BF > 3 / < 0.33 for every hypothesis test?

            What I hope is that as RRs increase in popularity we'll see standards increase across the board. In my view the Boekel et al study is already the beginning of this movement; five or ten years ago, who would have even considered conducting a direct replication of this kind, and what hope would it have had of being published? Standards evolve gradually but I see them moving upward.

          • Geraint Rees

            Thanks Chris – we will have to agree to disagree! I think this study illustrates many of the challenges of both pre-registration and of peer review, and a lot of confusion about what constitutes adequate pre-registration and about differing standards of peer review for different types of study.

          • Chris Chambers

            Yes, I guess we shall – still though, it’s been a good discussion and I don’t think we actually disagree about all that much.

            I don’t find any of this particularly confusing. I just see it as a natural stage in the evolution of cognitive neuroscience as we embrace replication and pre-registration, and it’s a healthy debate to be having. And the constructive tone of these discussions stands in contrast to the vitriol we’ve seen in other fields. I’m sure we’ll see more examples of PPPR from Ryota Kanai, and others, in the published rejoinders to the Boekel study.

          • Neuroskeptic

            Today’s grant proposals are not detailed enough to be ideal for preregistration purposes, I agree.

            But they would still be better than nothing.

            I have known studies that failed to find evidence in favor of the original hypothesis but which ended up being published (sometimes much later) as evidence in support of some entirely new (post hoc) hypothesis, with the original null result never being mentioned.

            Publication of grant proposals would prevent this “macro” p-value fishing even if it might not stop “micro” fishing in the form of statistical tinkering.

          • Chris Chambers

            I think E.J’s point above is key – the real value of pre-registration lies in the detailed specification of experimental procedures and pre-planned analysis decisions. Publishing grant proposals and general study plans may well be better than nothing, but it does carry certain risks if busy readers don’t realise what exactly was pre-registered. It is easy enough to “pre-register” something that is sufficiently vague as to permit all the usual QRPs yet can still be called pre-registered. We see this sometimes in clinical medicine — some protocols are so vague as to be quite meaningless (and some are even pre-registered after the study was completed), but the study registration allows researchers to engage in the necessary box-tickery to get their paper into a medical journal. In the long run, I think this could risk eroding the value of pre-registration for boosting reproducibility. This is why I’m such an evangelist for RRs — not because I think peer review is perfect but because it is the most rigorous form of pre-registration we have available to us, given other limitations of the system.

          • D Samuel Schwarzkopf

            I agree that the RR concept is probably the only way that preregistration could really work. I think this is precisely the danger I see (and that DNS is convinced will come true): that the basic prereg concept will eventually be eroded and become completely meaningless.

            And there is also the danger that, just by being preregistered, prereg studies will be put on a pedestal, which will blind people to the true issues that a study may have.

      • Chris Chambers

        I certainly don’t mean to imply that exploratory analyses are sub-standard or second rate – for me the term “exploratory” is just a way of describing an analysis that the researchers didn’t plan to do prior to inspecting their data. I can think of plenty of hypothetical examples where an exploratory analysis could reveal something much more important than a pre-registered one. There is no value judgement here – at least not from my point of view.

        I agree that replications, in theory, shouldn’t require pre-registration to ensure matching of methods, although what pre-registration (via RRs) does avoid is the outcome of a replication attempt ending up in the file drawer. RRs also incentivise authors to attempt large-N (likely expensive) replications in the first place by assuring their publication in advance, independently of the results.

        I suspect that had Cortex RRs been available at the time that Boekel et al prepared their protocol, they would have submitted to the format. Which reaffirms my belief that we should have implemented them years earlier – and that all empirical journals should offer them. I agree of course that peer review of protocols won’t catch everything.

        But the bigger problem is that published Method sections are generally inadequate for the task of direct replication because nobody expects anyone to directly replicate anything, hence authors have no reason to include detail that reviewers and editors will see as superfluous. Indeed the very inclusion of replicable methodological detail can appear amateurish to reviewers.

        Until we decide to nip that in the bud (e.g. using something like Case Report Forms) we are going to run into problems. Another approach that might help is for authors to publish a “replication recipe” as supplementary information, which would state for the record exactly which aspects of their methodology they believe are crucial for the effect to be reproduced. This could help clarify disagreements down the track, and the inevitable moving of goal posts when each side is desperate to prove they are right. (I do not mean to imply that this has happened in the current case – quite the opposite in fact – but we don’t have to look far in psychology to find such cases.)

        • D Samuel Schwarzkopf

          There is no value judgement here – at least not from my point of view.

          I agree and I believe you view it this way. I’m getting the feeling a lot of people don’t.

          But the bigger problem is that published Method sections are generally inadequate for the task of direct replication…

          This is certainly true and it is a problem. We (well, I anyway) learned during my studies that methods sections are supposed to allow others to replicate. We should have much more detail in them (and I think many journals have realised this in recent years). And I also agree that we need more replications, because replications are a cornerstone of science. But to come back to my original point, I think you could pre-register a replication attempt (to ensure it gets published) simply by stating “We will precisely follow the methods laid out in Original Author et al.” If there is a lack of detail (which frankly there wasn’t in Ryota’s studies from what I can tell), you could still declare which details are unclear and how you are planning to address these issues.

  • D Samuel Schwarzkopf

    I think I may now have answered the question I posed to EJ in the discussion below regarding what Bayes factor they could have expected with their sample sizes (31–36), assuming the original effect sizes were true. This suggests to me that for at least 12 of the 17 results the N was too small to achieve a BF01 of <0.1, i.e. strong evidence in favour of the alternative hypothesis:

    • D Samuel Schwarzkopf

      Figure legend: Black dots (and line) indicate the replication BF (pre-reg version) assuming the original effect size and using the same N used in the replication. The dotted black lines indicate the same BFs but for the 95% confidence intervals of the correlation (I know a Bayesian wouldn’t like CIs because they are frequentist, but I’d think it’s still a reasonable indication of the range of BFs to expect).

      The dashed red lines indicate the anecdotal criterion (3, 1/3) and the dashed blue lines are the strong criterion (10, 1/10). The Y-axis is on a logarithmic scale because Forstmann et al has a very different BF than the other studies (unsurprising given the original correlation was r=0.93).

      Most of the other results don’t even reach BF01<1/3 and of those that do only Xu et al would go below 1/10.
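As a rough sanity check on numbers like these, here is a minimal sketch of such a replication Bayes factor calculation. It is not the Jeffreys/Wetzels test used in the paper: it approximates BF10 for H1: rho ~ Uniform(−1, 1) against H0: rho = 0 using the Fisher-z normal approximation to the sampling distribution of a correlation, so exact values will differ somewhat from those in the figure.

```python
import numpy as np

def bf10_correlation(r_obs, n, grid=4001):
    """Approximate BF10 for H1: rho ~ Uniform(-1, 1) versus H0: rho = 0,
    using the Fisher-z normal approximation for an observed correlation."""
    z_obs = np.arctanh(r_obs)      # Fisher z-transform of the observed r
    se = 1.0 / np.sqrt(n - 3)      # approximate standard error of z
    rho = np.linspace(-0.999, 0.999, grid)
    # Likelihood of the observed z under each candidate rho
    lik = np.exp(-0.5 * ((z_obs - np.arctanh(rho)) / se) ** 2) / (se * np.sqrt(2.0 * np.pi))
    # Marginal likelihood under H1: integrate against the uniform prior (density 1/2)
    marginal_h1 = np.sum(lik * 0.5) * (rho[1] - rho[0])
    # Likelihood under H0 (rho = 0)
    lik_h0 = np.exp(-0.5 * (z_obs / se) ** 2) / (se * np.sqrt(2.0 * np.pi))
    return marginal_h1 / lik_h0

# A typical original SBB effect, observed again exactly at the replication N:
print(bf10_correlation(0.3, 36))
# The r = 0.93, n = 9 result is compelling despite the tiny sample:
print(bf10_correlation(0.93, 9))
```

Under this approximation, a replication with N = 36 that reproduced a typical original effect of r ≈ 0.3 exactly would still yield a BF close to 1 – anecdotal in either direction – which is the sample-size concern being raised here.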

    • Eric-Jan Wagenmakers

      Ah, but I thought you were asking a different question, namely: what was the strength of the evidence for the effects as originally reported? We did not look at this, but it would be interesting. As for your figure, I assume you are plotting the log BF. Can you replot it without the Forstmann study? It skews the scaling and makes it difficult to view the results for the others. More fundamentally, though: it does not matter what we could have expected based on an average across the sample space. All that matters, after the data are in, is the inference conditioned on those data. So it is an interesting figure for planning purposes (although it needs distributions rather than single points), but after the data are in all that matters is the BF that was obtained.

      • D Samuel Schwarzkopf

        Yes I realised that was ambiguous. I had done the BFs on the original results too but in many cases it was fairly obvious that they would support H1 (and I see you’ve done that now yourself so no need for me to repeat it).

        Anyway, my concern (and that of many commenters here) was really with the ‘power’ of your analysis. I don’t believe in posthoc power analysis but I still disagree with your statement ‘after the data are in all that matters is the BF that was obtained’. Surely if the sample size you are using has little hope of finding evidence for H1 based on the original effect sizes then doesn’t this mean it is underpowered?

        Regarding distributions, this is why I included the lower and upper bounds of the parametric confidence interval (again based on the original effect but your sample size). There is probably a better way to do that, but it was quick and easy.

        I will upload a scaled version of the figure in a minute to show the BFs without Forstmann result.

        • D Samuel Schwarzkopf

          Here is the figure rescaled (sorry this ain’t pretty – just changed the scale).

          • Eric-Jan Wagenmakers

            One more request :-)
            Can you add our observed Bayes factors to that plot?

          • D Samuel Schwarzkopf

            Done ;). The green curve here plots the BF01 from your replications.

        • Eric-Jan Wagenmakers

          Thanks, the figure is clearer now. About that power: A study can be hopelessly underpowered and provide a decisive result; likewise, a study can have .99 power and provide a completely ambiguous result. The “hope” you refer to is an average across the sample space: relevant for planning, irrelevant for inference. In earlier work, we’ve called this a “power fallacy”:

          • D Samuel Schwarzkopf

            Thanks, I should probably read this first before saying more about it. Still intuitively I don’t quite follow. For one thing my guess is that most correlations will be inflated (due to lack of power) so the prior for replication should probably tend towards the lower end of the original CI. The Forstmann result is a perfect example because r=0.93 at n=9 is probably a vast overestimate and in fact the replication in the original study was already weaker (and of course at 0.93 it is unlikely to be much higher).

          • D Samuel Schwarzkopf

            So I have read this now and agree with the general notion. Of course, high power can produce inconclusive results and low power can be informative. Your urn example is a clear illustration, although that is of course a highly asymmetric situation that hardly exists in actual research data. In common parlance you would probably describe it as “absence of evidence isn’t evidence of absence, while presence of evidence is evidence of presence.” The t-test examples are more applicable, although also not as clear… (But I like them and I might weave them into my bootstrapping manuscript 😉)

            Anyway, I think there is still something amiss. Yes, even though the replication was underpowered (which we seem to agree on?), it could nonetheless yield some informative results. And you shouldn’t use the newly observed data to make an inference about power (this is what annoys me about that “excessive significance test”).

            However, the discussion we are having here is not about making an inference from observed data but about bemoaning that there evidently was no consideration of power in the planning stage. I think this is still a valid criticism. The figure I showed suggests that at this N the uniform-prior BF would not have produced ‘strong’ evidence for H1 even if the effect estimate had been exactly the same as in the original studies. This suggests to me that you should have run a lot more people. I will attach another plot in the following comment comparing the BF10 you get under rho=0.35 and rho=0 across a range of sample sizes.

            As numerous other commenters in this thread have already pointed out, your evidence for H0 is anecdotal (I’d call it inconclusive, personally) for a lot of the results – however, for some it is strong. This is actually consistent with your power-fallacy argument: even though the experiments were underpowered, they sometimes yield strong evidence nonetheless. But the fact that most of them don’t is an indication of the lack of power. Or put another way, it isn’t only an individual case but an ensemble of 17 results.

            So I think there is a logical flaw in your line of reasoning here. While I agree you shouldn’t make inferences based on posthoc power, I think a solid study – especially a replication attempt – should contain considerations of power at the planning stage. You’ve argued the same in the power fallacy article:

            “Thus, when planning an experiment, it is best to think big and collect as many observations as possible. Inference is always based on the information that is available, and better experimental designs and more data usually provide more information.”

            But I don’t think it is appropriate to use the power fallacy as a defense of findings that were underpowered from the start. Using the same rationale I could run a replication with N=3, and (almost) always call this a failure to replicate (the maximally achievable BF10 for this situation is 3.333). Surely this doesn’t make sense.

          • D Samuel Schwarzkopf

            As promised another plot (yay!). Again, y-axis is logarithmic Bayes Factor (BF10 so positive is evidence for H1). The dotted and dashed black lines are the criteria for anecdotal and strong evidence respectively. The red and blue curves are the BF (uniform prior) for rho=0 and rho=0.35 (taken from the Facebook study). The cyan and magenta curves are the same but using the two-tailed BF from Wetzels et al. This shows that while your one-tailed BF is less conservative the difference isn’t huge.

          • Eric-Jan Wagenmakers

            Hi Sam: “However, the discussion we are having here is not about making an inference from observed data but bemoaning that there evidently was no consideration of power in the planning stage. I think this is still a valid criticism.” Maybe – I haven’t made up my mind just yet. Clearly we would have loved to test more participants, but we tested the entire pool that was at our disposal. Also, we are not making decisions – we are quantifying the evidence. But I see there are arguments pro and con here. I hope Cortex invites several people to comment so we get a discussion that is as broad and constructive as the one we are having here. I appreciate the graphs, by the way.

          • D Samuel Schwarzkopf

            Yes, more discussion in actual published form will definitely be welcome, and I feel we’ve probably exhausted what we can do here. I fear that if I upload another graph, Disqus may break 😉

            Anyway, I am looking forward to Ryota’s published commentary and any that may follow. And generally I think wherever one stands on the issue, we can probably all agree that this sparked a very interesting and important discussion!

    • Eric-Jan Wagenmakers

      I quickly ran the Jeffreys’ BF correlation test on the original studies. Many of them are statistically compelling, with the exceptions of: (1) in replication 2 (Kanai et al.), r=.26 with n=65 gives BF10 = 1.31; (2) in replication 5 (Westlye et al.), r=-.21 with n=132 gives BF10 = 1.96, r=-.15 gives BF10 = 0.47 (evidence for H0), r=-.13 gives BF10 = 0.32, and r=-.26 gives BF10 = 9.62. When we conducted our replications we took the effects for granted – in fact, our goal was to test their robustness using a preregistered paradigm.

  • Fred Hasselman

    Fascinating… and a bit confusing too.

    Each study is introduced as a relationship between individual differences in [brain] and individual differences in [behaviour].

    Now, I know it’s a statistical syllogism to move from the ensemble level to the individual, but what do these correlations mean, in terms of the actual behaviour of individuals, if they can only be evidenced in an ensemble of N>200 participants as standardised effect sizes of .2 and .3, but not in a confirmatory sample of 36? (I.e. we know where to look and which contexts must yield the observational constraint.)

    In other words, if we randomly draw one Facebook profile from the lowest and one from the highest 5% in terms of #friends and scan these individuals, wouldn’t we at least expect to observe this effect in a sample of just N=2?

    If it only exists as an ensemble effect, I have no idea how to evaluate its significance for behavioural or cognitive neuroscience – perhaps it’s something for evolutionary psychology.

    • eikofried

      Fred, I assume it means that only a small proportion of variance is explained. This may be similar to some current genetic work. In a recent genome-wide association study by Hek et al. (2013), for instance, no single locus reached genome-wide significance in 35,000 subjects. In the conclusion section of the paper’s abstract, the authors call for an investigation of 50,000 subjects. I wonder: if you need a sample like that to find a significant effect, doesn’t it imply that differences between participants (in this case healthy vs depressed) will be so tiny that they are completely irrelevant for clinical practice? That is, 50.2% of the depressed patients and 49.8% of control participants have a certain allele, which may become significant at some point, but is basically useless information.

      I wonder whether the same holds here (on a smaller scale). In fact, I rarely see neuroimaging studies addressing the magnitude of differences in terms of sensitivity and specificity (the 49.8% and 50.2% I mention above). Group differences are meaningful if they allow us to use the information somehow – for instance, to differentiate between groups based on such neurological markers. In a sample of 200 people I may be able to identify a significant difference between group A and group B in terms of the structural size of brain region Y, but the difference may be so minuscule that Y carries nearly no predictive value about whether a person belongs to group A or B. And that is what I would like neuroimaging studies to address more.
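The sensitivity/specificity point can be made concrete with the standard conversion from a standardised group difference (Cohen’s d) to the probability that a randomly chosen member of one group scores above a randomly chosen member of the other, Φ(d/√2). The numbers below are purely illustrative and not taken from any of the studies discussed.

```python
from math import erf

def superiority_probability(d):
    """P(random member of group A > random member of group B) for two
    equal-variance normal distributions separated by Cohen's d:
    Phi(d / sqrt(2)) = 0.5 * (1 + erf(d / 2))."""
    return 0.5 * (1.0 + erf(d / 2.0))

# A tiny effect that is 'significant' in a huge sample barely beats a coin flip:
print(superiority_probability(0.1))   # ~0.53
# Even a conventionally 'large' effect leaves substantial group overlap:
print(superiority_probability(0.8))   # ~0.71
```

So even effects that replicate reliably at the group level can be nearly useless for classifying individuals – exactly the 50.2% vs 49.8% scenario described above.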

    • D Samuel Schwarzkopf

      Now this is a very different discussion, but one entirely worth having. I agree that effect sizes can be so low that they stop being meaningful. I don’t think you quite have that with r=0.2–0.3. While only a modest part of the variance is explained, this doesn’t mean that it can’t tell us something useful. In the longer run, a more comprehensive model must describe how the different variables interact (these are likely to be nonlinear relationships in many cases). But in order to get there we need to characterise the isolated relationships first.

      • Fred Hasselman

        @eikofried Thanks that is an interesting reference, indeed (I think) about the same issue.

        @S Schwarzkopf

        To be clear, I do not object in principle to exploring relations between micro structure and macro scale states, or more abstract aggregate variables like FB friends. Although I would prefer to focus on micro scale dynamics – before you know it, someone without any recognizable structures at all wants to be your Facebook friend.

        I also think you are right to assume most relationships will be nonlinear. In fact, taking an engineering perspective: If a linear model can explain max. ~10% of the variance (and they usually do a little worse in these contexts similar to longitudinal prediction in developmental psychology) then the other 90% can be explained by a nonlinear model + measurement noise.

        What I disagree about is the idea that we should first exploit all possible linear predictors… maybe it’s just not the right model for studying inherently nonlinear, multiplicative, hierarchical, multiscale interactions.

        • D Samuel Schwarzkopf

          I am not necessarily saying that isolated relationships are likely nonlinear (although undoubtedly some are). Rather, I think that the combination (or interaction) of the various factors may be nonlinear. Either way, obviously I don’t think this line of research can stand on its own. If all we’re doing is finding which behavioural variables correlate with which brain structures, we aren’t going to get very far. We also need to understand what these relationships (the ones that aren’t false positives) actually mean. We still don’t really know how grey matter volume or cortical thickness relate to behaviour (and it may not be the same for all measures). Another thing is causal manipulation, to understand whether these correlates are actually involved in producing the behaviours. Anyway, I could talk about this at great length, but this thread is already enormous as it is, so this is probably neither the right time nor the right place.

          • Geraint Rees

            The key is to see a demonstrated (replicable) brain-behavior correlation as a clue to a mechanism rather than as an end in itself. Such a correlation can then anchor a subsequent line of research that provides complementary evidence, constrains possible mechanisms, and ultimately delivers an integrated understanding of behaviour.

            In the context of the current studies we’re discussing, the CFQ study is one example. Demonstrating a relationship between (attentional) cognitive failures and the structure of parietal cortex led to the hypothesis that this structure is critical for such behaviours, consistent with the prior literature. We then (independently) tested whether this hypothesis might be true by temporarily disrupting the function of this area using TMS, and found that attentional capture was disturbed. Then we independently replicated the brain-behavior relationship in a new sample (Sandberg et al. 2013) and further examined the possible role of GABA (an inhibitory neurotransmitter) in this cortical area.

            Cognitive neuroscience explicitly assumes that no single study is definitive, but requires both replication and converging evidence from multiple methodologies. But I agree with Sam we’re now a bit off-topic!

  • Pingback: NeuroBreak: Bounty for Dementia Dx, Fatigue Model – Medical 24/7 News

  • Eric-Jan Wagenmakers

    For this particular line of research (brain-behavior correlations) I’d like to suggest an exploration-safeguard principle (ESP): after collecting the imaging data, researchers are free to analyze these data however they see fit. Data cleaning, outlier rejection, noise reduction: this is all perfectly legitimate and even desirable. Crucially, however, the behavioral measures are not available until after completion of the neuroimaging data analysis. This can be ensured by collecting the behavioral data in a later session, or by collaborating with a second party that holds the behavioral data in reserve until the imaging analysis is complete. This kind of ESP is something I can believe in.

    • D Samuel Schwarzkopf

      “This kind of ESP is something I can believe in.”

      What about my precognition that somebody would do a replication study on structural brain-behaviour correlations? 😉

      Seriously though, regarding your suggestion: isn’t this how most of these studies are done anyway?

      • Eric-Jan Wagenmakers

        Not so sure (I quickly browsed a few papers, but they don’t mention any ESP procedure); perhaps this does happen initially. But to make sure that there is no additional tinkering, one would need some additional safeguards (a very simple preregistration, involvement of a third party, etc.). In fact, what I see in many papers is transformation and cleaning of the behavioral variable. Clearly, this should also take place *prior* to the computation of the correlation. It would already suffice to have a third party shuffle the subject coding for the behavioral variables and reveal the correct coding only after all preliminary analyses have been completed.
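        The third-party shuffling safeguard is simple to implement; the following is only a hypothetical sketch (the function names, and the use of a stored seed as the third party’s key, are my own illustration, not any published ESP procedure):

```python
import random

def blind(subject_ids, seed):
    """Assign shuffled labels to the behavioural records so the analyst
    cannot link behaviour to imaging; the third party keeps `key` until
    the imaging analysis is frozen."""
    rng = random.Random(seed)
    shuffled = subject_ids[:]               # copy; leave the input untouched
    rng.shuffle(shuffled)
    key = dict(zip(shuffled, subject_ids))  # blinded label -> true label
    return shuffled, key

def unblind(blinded_ids, key):
    """Reveal the true coding once all preliminary analyses are complete."""
    return [key[b] for b in blinded_ids]
```

        The analyst works with the blinded labels throughout cleaning and transformation; only after the analysis pipeline is fixed does the third party hand over `key`.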

        • D Samuel Schwarzkopf

          I can’t really comment on Ryota’s work. But in my previous brain-behaviour correlation work we tried to separate the behavioural and imaging data collection where possible, and in general aimed to make the behavioural data analysis as automatic and as free of experimenter bias as possible. Pre-registration may certainly be an additional safeguard, but I am still not convinced.

          • Geraint Rees

            Agree with Sam. This is a standard approach in the field.

  • Pingback: The Pipedream of Preregistration | The Devil's Neuroscientist

  • Michael Kovari
    • D Samuel Schwarzkopf

      Sorry, this thread is about structural MRI 😛

  • Ben

    This paper just appeared in my inbox: it replicates the amygdala and entorhinal findings from Kanai et al., 2012 that did not replicate in the Boekel et al. attempt. And it does so despite using a small sample, a 3T scanner and an MPRAGE sequence (all of which differ from the original Kanai paper, but are similar to Boekel et al.). However, the one point at which it stays true to the original study (and deviates from Boekel et al.) is the use of DARTEL for image registration. In my previous comment below I already wondered about the importance of this step for structures that are small compared to anatomical variability (like the amygdala). This suggests to me the possibility of a real alignment problem. It would be very interesting if Boekel et al. could check whether the use of DARTEL makes a difference for their dataset…

    • D Samuel Schwarzkopf

      Thanks, this is very interesting. Another difference from Boekel et al. is that they used a spatially uncertain ROI approach, as suggested by Ryota in his reviews. Using the Boekel et al. Bayes factor for replication, I get the following (BF01):

      Left amygdala: 0.0757 (strong H1)
      Right amygdala: 0.0151 (very strong H1)
      Left OFC: 0.0151 (very strong H1)
      Right OFC: 0.2999 (basically anecdotal H1)
      Right entorhinal: 0.4913 (anecdotal H1 – note this didn’t reach significance in NHST either)

      Another difference from both the Kanai and the Boekel studies is that they didn’t seem to take the square root of the number of Facebook friends. I don’t know how much of a difference that would make.
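      For readers who want to compute this kind of number themselves, here is a rough sketch of a BF01 for a correlation. To be clear, this is a simplified stand-in (a Fisher-z normal approximation with a uniform prior on rho), not the replication Bayes factor Boekel et al. actually used, so it will not reproduce the figures above:

```python
import math

def norm_pdf(x, mu, sd):
    """Density of a normal distribution."""
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def bf01_correlation(r, n, grid=2000):
    """Approximate BF01 for a Pearson correlation r observed in n subjects,
    using the Fisher-z approximation and a uniform prior on rho in (-1, 1).
    BF01 > 1 favours the null hypothesis rho = 0."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    like_h0 = norm_pdf(z, 0.0, se)
    # Marginal likelihood under H1: average the likelihood over the prior
    lo, hi = -0.999, 0.999
    step = (hi - lo) / grid
    like_h1 = sum(
        norm_pdf(z, math.atanh(lo + (i + 0.5) * step), se) * 0.5 * step
        for i in range(grid)
    )
    return like_h0 / like_h1
```

      With n = 36 (the Boekel et al. sample size), a weak observed correlation such as r = 0.1 yields a BF01 above 1 (evidence for the null), while a strong one such as r = 0.6 yields a BF01 far below 1 (evidence for H1).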

  • stmccrea

    It is interesting what emotional reactions this seems to evoke. It seems pretty apparent to me that, despite technical discussions of effect sizes, sample sizes and so on, whatever effects are being claimed have not been clearly established by any methodology with any degree of certainty. Accordingly, it is proper to assume the null hypothesis in the absence of clear evidence of an effect. This does not mean there will ultimately be no effect proven, but the burden of proof is on the one claiming the effect.

    It is clear that researchers in this field have been allowed to publish and publicize very preliminary findings as if they were conclusive, and this document puts those preliminary findings into the necessary perspective. The world is very eager to believe any neuro-psychiatric finding these days, so especial care is needed in the direction of caution, since the press and the public are all too likely to overinterpret findings in any case.

    I think it’s best to keep scientific values strong and in control of this and any related research, including the most salient value of all: that a hypothesis continues to be assumed false until there is replicable and reliable evidence of its validity, and even then is only assumed true in the absence of contradictory data. “Explanations” are fuel for other research, but they don’t invalidate the observation that these findings were uniformly not replicated by the authors’ efforts.

    —- Steve

    • D Samuel Schwarzkopf

      I wouldn’t call experiments of this sort, with >120 subjects in the first sample and additional replications of the effect within the original study, “preliminary”. Except in the sense, which I have argued here (and elsewhere), that we should always treat new scientific findings as preliminary until they have been replicated repeatedly and independently, with sufficient scrutiny to determine the factors that control them. Honestly, though, I think most people agree with this notion. I also agree that skepticism is important and necessary in science.

      Anyway, what I think stirs up the emotions in cases like this is that it is very easy to fail to replicate. It is also easy to massage effects into data, but I would actually say it’s usually easier to massage effects out of data than to massage them in. That doesn’t mean the two shouldn’t be held to the same standard, but I do think it’s an asymmetric situation nonetheless.

      As for replicability, for two of the VBM results there are now actually several replications: for both the distractibility and social network size findings, replications have already been published (you could probably argue that the distractibility one wasn’t fully independent, as it shares some authors with the original, although it was done on a different sample, in another lab, on a different scanner, in a different country, and added a different measure). In addition, the reviews posted by Ryota Kanai suggest that if the analysis pipeline of the present failed replication is matched to the original, even these data replicate that finding (and, to be honest, even the present “failed” replication shows effect estimates consistent with the original finding; see Fig. 8, points 12 and 13).

      So is this “reliable evidence of its validity”? No, because this is still only a handful of replications. More importantly, I think there needs to be a better understanding of what these effects mean although that research can go hand in hand with attempts to replicate the effect (or falsify the hypothesis).

      However, I think you are incorrect to categorically state that results can “only [be] assumed true in the absence of contradictory data”. If I try to measure light reflected from the moon but point my measuring device in a different direction, that isn’t “contradictory data”; it is, quite frankly, a bad experiment. Like my previous astronomy example this is an exaggeration, but it is more or less the situation you have when you use a less accurate alignment method for spatial normalisation. Naturally, it should go without saying that other converging evidence is also required to confirm the validity of the purportedly superior alignment procedure.

      • stmccrea

        I can’t argue with anything you said here. Of course, bad “data” don’t count, and study design is critical in evaluating the validity of data. My observation, however, is that the current scientific environment, with pressure to publish, financial incentives to find positive outcomes, and skewed coverage of positive results over negative or neutral ones, creates a situation in which people are encouraged to be overly lenient in interpreting data. As a great example, the MTA study, started back in the 90s, got huge press when it found a small to moderate improvement in reading scores for kids taking stimulant medication. This resulted in worldwide media attention. The three- and eight-year follow-ups showed that this advantage had vanished completely, and other studies in Quebec and Australia have also failed to find any long-term advantage. None of these later follow-ups and studies received any significant media attention, and those whose incomes depend on the validity of this intervention continue to quote the 14-month MTA results as if they were the end of the story. It’s just not profitable to publish the truth, and when science and profit (or egotism) go head to head, science almost always loses.

        I appreciate your very rational and scientific response, and hope your attitude catches on!

        —- Steve

        • D Samuel Schwarzkopf

          Thanks, yes, the publication pressure is probably to blame for many of the problems our field is debating, and that’s why discussions like this are so important. I agree that we need to work harder to ensure that replication attempts are encouraged. It has certainly gotten a lot better, with numerous specific replication attempts having been published in journals recently, but in my opinion it’s even more important that replications that are just part of the natural progression of science are published too, and that there are easier ways to accumulate the replication evidence for a result.

          Of course, there are a lot of ways that communication of scientific findings to the public can be improved, too. Some people (including Chris Chambers who commented on here) have directly looked at how press releases are distorted (and in which way the scientists themselves can be blamed for miscommunication). On the other hand, evaluation of a scientist’s impact now extends to some very shallow markers of mass appeal, which is a scary development.

          • stmccrea

            Scary, indeed, and emblematic of our increasing management of society by marketing rather than science!

  • Pingback: Replication and post-publication review: Four best practice examples | To infinity, and beyond!

  • Pingback: Failed replication or flawed reasoning? | NeuroNeurotic

  • Pingback: Something went wrong … |

  • Pingback: Nassim Taleb on IQ | Ideas and Data



About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.

