The problem of false positives

By Razib Khan | November 10, 2011 7:01 pm

False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant:

In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.
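The paper’s simulations are behind the paywall, but the core result is easy to reproduce. Here is a minimal sketch of my own (not the authors’ code, and with made-up parameters): simulate experiments where there is no true effect, but the researcher measured two dependent variables and is willing to report whichever of the three obvious tests (DV1, DV2, or their sum) comes out significant.

```python
import math
import random

random.seed(1)

def two_sample_p(xs, ys):
    """Two-sided p-value for a difference in means (normal approximation)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))

def min_p_null_experiment(n=50):
    """One null experiment: two DVs, no true effect; return the best-looking p."""
    a1 = [random.gauss(0, 1) for _ in range(n)]  # group A, DV 1
    a2 = [random.gauss(0, 1) for _ in range(n)]  # group A, DV 2
    b1 = [random.gauss(0, 1) for _ in range(n)]  # group B, DV 1
    b2 = [random.gauss(0, 1) for _ in range(n)]  # group B, DV 2
    ps = [
        two_sample_p(a1, b1),  # test DV 1 alone
        two_sample_p(a2, b2),  # test DV 2 alone
        two_sample_p([x + y for x, y in zip(a1, a2)],
                     [x + y for x, y in zip(b1, b2)]),  # test their sum
    ]
    return min(ps)  # the researcher reports whichever test "worked"

trials = 2000
rate = sum(min_p_null_experiment() < 0.05 for _ in range(trials)) / trials
print(f"nominal alpha: 0.05, actual false-positive rate: {rate:.3f}")
```

Even this single degree of freedom roughly doubles the false-positive rate over the nominal .05; the paper shows that stacking several such choices pushes it far higher.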

Since the paper is behind a paywall, I’ve cut & pasted the solutions below:

We propose the following six requirements for authors.

  1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article. Following this requirement may mean reporting the outcome of power calculations or disclosing arbitrary rules, such as “we decided to collect 100 observations” or “we decided to collect as many observations as we could before the end of the semester.” The rule itself is secondary, but it must be determined ex ante and be reported.

  2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification. This requirement offers extra protection for the first requirement. Samples smaller than 20 per cell are simply not powerful enough to detect most effects, and so there is usually no good reason to decide in advance to collect such a small number of observations. Smaller samples, it follows, are much more likely to reflect interim data analysis and a flexible termination rule. In addition, as Figure 1 shows, larger minimum sample sizes can lessen the impact of violating Requirement 1.

  3. Authors must list all variables collected in a study. This requirement prevents researchers from reporting only a convenient subset of the many measures that were collected, allowing readers and reviewers to easily identify possible researcher degrees of freedom. Because authors are required to just list those variables rather than describe them in detail, this requirement increases the length of an article by only a few words per otherwise shrouded variable. We encourage authors to begin the list with “only,” to assure readers that the list is exhaustive (e.g., “participants reported only their age and gender”).

  4. Authors must report all experimental conditions, including failed manipulations. This requirement prevents authors from selectively choosing only to report the condition comparisons that yield results that are consistent with their hypothesis. As with the previous requirement, we encourage authors to include the word “only” (e.g., “participants were randomly assigned to one of only three conditions”).

  5. If observations are eliminated, authors must also report what the statistical results are if those observations are included. This requirement makes transparent the extent to which a finding is reliant on the exclusion of observations, puts appropriate pressure on authors to justify the elimination of data, and encourages reviewers to explicitly consider whether such exclusions are warranted. Correctly interpreting a finding may require some data exclusions; this requirement is merely designed to draw attention to those results that hinge on ex post decisions about which data to exclude.

  6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate. Reporting covariate-free results makes transparent the extent to which a finding is reliant on the presence of a covariate, puts appropriate pressure on authors to justify the use of the covariate, and encourages reviewers to consider whether including it is warranted. Some findings may be persuasive even if covariates are required for their detection, but one should place greater scrutiny on results that do hinge on covariates despite random assignment.
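To see why Requirement 1 matters, here is a hedged sketch (mine, not the paper’s; batch size and stopping points are arbitrary choices) of what repeated "peeking" does under the null: collect 10 observations per cell at a time, test after each batch once n reaches 20, and stop as soon as p < .05 or n hits 100.

```python
import math
import random

random.seed(2)

def two_sample_p(xs, ys):
    """Two-sided p-value for a difference in means (normal approximation)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))

def peeking_finds_effect(batch=10, n_min=20, n_max=100):
    """Null experiment with interim testing: True if any peek hits p < .05."""
    a, b = [], []
    while len(a) < n_max:
        a += [random.gauss(0, 1) for _ in range(batch)]
        b += [random.gauss(0, 1) for _ in range(batch)]
        if len(a) >= n_min and two_sample_p(a, b) < 0.05:
            return True  # researcher stops collecting and reports
    return False

trials = 2000
peek_rate = sum(peeking_finds_effect() for _ in range(trials)) / trials
print(f"false-positive rate with peeking every 10 observations: {peek_rate:.3f}")
```

A single test at a pre-committed n would hold the rate near .05; the inflation comes entirely from the data-dependent stopping rule, which is exactly what disclosing the termination rule ex ante is meant to expose.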

Guidelines for reviewers

We propose the following four guidelines for reviewers.

  1. Reviewers should ensure that authors follow the requirements. Review teams are the gatekeepers of the scientific community, and they should encourage authors not only to rule out alternative explanations, but also to more convincingly demonstrate that their findings are not due to chance alone. This means prioritizing transparency over tidiness; if a wonderful study is partially marred by a peculiar exclusion or an inconsistent condition, those imperfections should be retained. If reviewers require authors to follow these requirements, they will.

  2. Reviewers should be more tolerant of imperfections in results. One reason researchers exploit researcher degrees of freedom is the unreasonable expectation we often impose as reviewers for every data pattern to be (significantly) as predicted. Underpowered studies with perfect results are the ones that should invite extra scrutiny.

  3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions. Even if authors follow all of our guidelines, they will necessarily still face arbitrary decisions. For example, should they subtract the baseline measure of the dependent variable from the final result or should they use the baseline measure as a covariate? When there is no obviously correct way to answer questions like this, the reviewer should ask for alternatives. For example, reviewer reports might include questions such as, “Do the results also hold if the baseline measure is instead used as a covariate?” Similarly, reviewers should ensure that arbitrary decisions are used consistently across studies (e.g., “Do the results hold for Study 3 if gender is entered as a covariate, as was done in Study 2?”). If a result holds only for one arbitrary specification, then everyone involved has learned a great deal about the robustness (or lack thereof) of the effect.

  4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication. If a reviewer is not persuaded by the justifications for a given researcher degree of freedom or the results from a robustness check, the reviewer should ask the author to conduct an exact replication of the study and its analysis. We realize that this is a costly solution, and it should be used selectively; however, “never” is too selective.

To preempt angry and offended psychology professors: this problem is not limited to their discipline. It is probably a bigger problem in medicine, where false positives cost a great deal of money and likely kill people.


Comments (6)

  1. zkkz

    What about the most effective solution: report Bayesian statistics!

  2. Awesome. There are many, many disciplines in the social sciences that should adopt these rules.

  3. xkcd covered this problem a while ago:

  4. Mostly good. The most problematic is the small sample size issue. As this blog itself has illustrated, and as archaeology, linguistics, neuroscience, and medicine amply show, sample sizes of one are frequently useful and sample sizes of a dozen can often be powerful.

    Sometimes statistical significance isn’t the most important issue. In ethnography or any other kind of research method where you get depth (e.g. each whole genome has thousands of data points), smaller samples can work. Likewise, if you have a good model to fit your data to (e.g. a likely recessive gene pattern) a small sample can say a lot. Strategic sampling can also be quite powerful – e.g., the Dow Jones Industrial Average is remarkably good at matching larger market trends with just thirty carefully chosen data points, and losing one or two of those points would still leave an instrument that was almost as good.

    Very frequently, a small number of convincing outliers can make a powerful point. Lots of important neuroscience discoveries are based on samples of one where an outlier individual lacks this or that feature of the brain and the result is described. A single dig can greatly change the dating of an archaeological period.

    More generally, the guidelines’ failure to recognize these issues suggests that the domain of research activities over which they are useful is narrower than claimed.

  5. Dick

    Single subjects can be useful for generating hypotheses, small samples with large effects can be useful for strengthening hypotheses, and very large samples with small differences can mislead by giving clinically meaningless but highly statistically significant differences. In the past I worked with sample sizes of 640,000 to 1,000,000 and was showered with findings at the .00001 level, few of which were meaningful. Therefore sample size in and of itself is not a valid criterion of warranted outcomes.
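The last commenter’s point is easy to check on the back of an envelope. Assuming a true group difference of 0.01 standard deviations (a clinically negligible effect size I picked for illustration), the p-value of a two-sample z-test at the expected test statistic collapses as n grows:

```python
import math

def expected_p(d, n):
    """Two-sided p-value at the *expected* z for a true standardized
    difference d between two groups of n observations each
    (normal approximation)."""
    z = d / math.sqrt(2.0 / n)          # expected z-statistic
    return math.erfc(abs(z) / math.sqrt(2))

d = 0.01  # a trivially small effect: one hundredth of a standard deviation
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,} per group -> p ~ {expected_p(d, n):.2e}")
```

At n = 100 this effect is nowhere near significant; at a million observations per group even this negligible difference comes out far below the .00001 level, exactly the pattern the commenter describes, which is why effect sizes matter more than p-values at that scale.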

