The F Problem With The P-Value Sciences

By Neuroskeptic | October 16, 2013 5:00 pm

There is a problem in science today. I’ve written a lot about how to cure it, but in this post I want to outline the nature of the disease as I see it.

The problem goes by many names – fishing, hidden flexibility, forking paths – so I’m going to call it the f problem for short.

I like to visualize f as a forking path. Given any particular set of raw data, a researcher faces a series of choices about how to turn it into a ‘result’.

There are choices over which statistical tests to run, on which variables, after excluding which outliers, and applying which preprocessing… and so on:

[Image: f visualized as a forking path of analytical choices]

The f problem is that researchers can try multiple approaches in private, and select for publication the most desirable ones.

Most often, it’s statistically significant effects, that match with prior hypotheses, that are desired. Even if there are no real effects of interest in the data, some comparisons will be ‘positive’ just by chance.
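
To make this concrete, here is a toy simulation (a sketch only: two groups of pure noise, and a handful of invented ‘analytical choices’ in the form of different outlier cutoffs – nothing here refers to a real study):

    # A toy illustration of the f problem: the data are pure noise, but the
    # analyst may choose among several 'defensible' outlier rules and report
    # whichever one 'works'. All numbers are made up for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def analyse(x, y, cutoff):
        """One analysis pipeline: optionally drop outliers, then a t-test."""
        if cutoff is not None:
            x = x[np.abs(x - x.mean()) < cutoff * x.std()]
            y = y[np.abs(y - y.mean()) < cutoff * y.std()]
        return stats.ttest_ind(x, y, equal_var=False).pvalue

    cutoffs = [None, 3.0, 2.5, 2.0, 1.5]   # five plausible outlier rules
    n_sims, n = 2000, 20
    fixed = flexible = 0

    for _ in range(n_sims):
        x, y = rng.normal(size=n), rng.normal(size=n)   # no real effect exists
        pvals = [analyse(x, y, c) for c in cutoffs]
        fixed += pvals[0] < 0.05        # one pre-specified analysis
        flexible += min(pvals) < 0.05   # quietly pick the 'best' of the five

    print(f"pre-specified analysis: {fixed / n_sims:.1%} significant")
    print(f"best of {len(cutoffs)} private tries: {flexible / n_sims:.1%} significant")

The single pre-specified analysis comes out ‘significant’ about 5% of the time, as advertised; picking the best of five variants in private pushes the rate above that, and real analyses offer far more than five forks.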

Researchers today face pressure to publish ‘good results’ and are rewarded for doing so – this is what turns f from a theoretical concern into a real one.

Yet f is not a problem for all of science. Broadly, f only affects research in which the results take the form of p-values. Thus fields like mathematics, where results are proofs, are immune.

But even some p-value sciences manage to escape f. This is because in these fields, the nature of the enterprise means that ‘everyone knows’ what experiments are being carried out, and how the data ought to be analyzed – in advance.

In particle physics, for example, it was public knowledge that CERN was looking for the Higgs Boson, and it was also known exactly how they planned to look for it in the data. Openness meant there was no room for f, because f is all about the scope for hidden flexibility.

  • Nick Brown

    I think that in fields like psychology, the problem is often that apart from the authors of the paper (and even then, I’m not always sure), nobody actually *cares* about what the authors were trying to find. If 100 or 1000 people, or half a dozen international governments, agreed that it was really important to see whether, say, education level moderates the relationship between unemployment and depression, or whatever the ostensible topic of the study is, then the experiment and the criteria by which it would be evaluated would be pretty much collectively designed in advance by the community, and it would just be a question of deciding who got to run it.

    It seems to me, as a recent arrival and still comparative outsider, that many papers in psychology either report some “discovery” that is of no consequence to anybody, or else their hypotheses are transparently post hoc in order to salvage *something* that can be published from a study. I think that reviewers and journal editors know this, but hey, we all have to eat.

    • storkchen

      Brilliant point.

    • Dave Nussbaum

      I’m not going to argue the basic point that science is fraught with problems and it’s an ongoing struggle to overcome them, nor that this is more of a challenge in some fields than others. But as soon as the plan is to let international governments decide which questions are worth pursuing, and not the scientists themselves, then we will have taken a huge step backwards. I’m pretty sure Galileo and Copernicus would have had a lot of trouble getting federal grant money.

      • Nick Brown

        Sorry, I should perhaps have been clearer. I wasn’t making a serious suggestion to hand this sort of thing over to governments. My point is that the quality of science is at least partly a function of how many parties (multiplied by some function of how “important” they are, for any value of “important” you choose) are interested in the outcome.

        When CERN goes looking for the Higgs boson, you can bet that they’re not going to get away with a badly-designed experiment that relies on p-hacking. When some psychologist “shows” that people who sat in a neat office versus an untidy one become “much more likely” to adopt a healthy lifestyle (as evidenced, exclusively, by their choice of an apple rather than a chocolate bar on leaving the lab), about the only person interested is their supervisor… and of course, the university press office (“U of Bull scientists prove that…”).

        • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

          I agree with the argument – but I don’t think the “no-one cares” problem is inherent to psychology.

          To the average government or the average man in the street, the results of even the more niche psychology studies are more obviously ‘relevant’ than the existence of the Higgs Boson. The prediction and detection of the Higgs was a triumph of the human intellect but it might not be a useful one.

          Whereas imagine if someone really proved, to particle physics standards, that tidy offices reduce obesity. Governments would go bananas! Suddenly, they’d have a new tool to reduce obesity, and increase productivity, save money etc. And individuals struggling to lose weight would also pay attention.

          Obviously, this doesn’t happen, but (only?) because deep down no-one really believes that the psychology paper claiming tidy -> healthy is really true.

          So it’s a vicious circle. No-one cares, so no-one takes it seriously enough to prevent f, so no-one believes, so no-one cares…

          But it is perhaps a breakable cycle.

          • Nick Brown

            We-e-e-ll, except that the UK government has its “Nudge Unit”, which seems to justify its existence on all those amazing studies done by those really clever social psychologists, which brought you this scandal http://www.theguardian.com/society/2013/apr/30/jobseekers-bogus-psychometric-tests-unemployed for which someone is probably going to end up getting whatever the BPS equivalent of struck off is. And apparently their results are so impressive [rolls eyes] that the U.S. is interested as well (http://swampland.time.com/2013/08/09/nudge-back-in-fashion-at-white-house/). So it seems that some fairly influential people do believe that a lot of this stuff is true. But hey, maybe it is.

          • Kevin Denny

            Why should you dismiss the Nudge unit on the basis of one idea? Talk about fishing for results…

          • Artem Kaznatcheev

            I don’t think this is where “no one cares” stems from. No one cares in psychology because there is no precise overarching theoretical framework, thus every experiment is almost completely independent of every other experiment, and you would only worry about the result of any given experiment if you happen to work on a very similar one yourself. In physics (also chemistry and parts of biology), experiments are done within a rich theory, and the results have repercussions for various parts of that theory that lots of other scientists (working on different experiments) also use. Hence, they care.

            The reason that the tidying -> healthy paper doesn’t matter is because the conclusion (even if it isn’t a statistical fluke) is only relevant and predictive in the context of that specific experiment, and can’t with any degree of rigour be applied elsewhere in psychology or the practical world (unless we really care about a tidy office at U. Bull and a marginally larger selection of apples). This is why there is no application of psychology experiments years down the road to completely unprecedented problems, unlike physics, where, say, an idle worry about observer-independence of physical laws ended up resulting in GPSes that could correctly position you to within 1 meter anywhere in the world.

      • Artem Kaznatcheev

        Although I agree with your general sentiment, I don’t appreciate the unnecessary reference to Galileo and Copernicus. Who do you think funded them? I am pretty sure that Copernicus was funded for most of his life by the Prince-Bishop of Warmia (the equivalent of federal funding in those days, since Warmia was semi-autonomous and minted its own money). I know less about Galileo, but since he was employed by Italian universities, and those were funded by the church (and sometimes the state), I also highly doubt your remark.

  • William Idsardi

    Is mathematics really immune? Can’t mathematicians choose to do some monster-barring (Lakatos 1976) to save their proofs?

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      But in that case, they will still end up with a proof of some truth. It might not be the theorem that they set out to prove, but they have still proven a truth, not just found a statistic (which is only a guide to truth when interpreted properly e.g. with multiple comparisons corrections, which is the whole f problem.)

      • William Idsardi

        I’m afraid it’s not that simple. Some monsters (cases for which the general statement does not hold) really are refutations of the supposed “proof”. In that case you work to find the incorrect lemma and reformulate (i.e. change) the general statement. But, if you want, you can try to bar the counterexamples (a ridiculous example would be “no true prime is even”). The question is whether it makes sense to exclude those cases. This seems remarkably similar to the problem of outlier detection to me; sometimes excluding outliers is exactly what we want to do as they are a result of apparatus failure.

        • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

          I stand corrected – that is indeed like outlier rejection!

          • Helene Hoegsbro Thygesen

            Well, the problem with outlier rejection in the context of the f-problem is that whether the claim acknowledges the outlier rejection (“women who drink less than 10 cups of coffee per day drink less coffee than men”) or not (“women drink less coffee than men”), the conclusion is invalid because it was found as a result of a fishing expedition. To replicate the result you are likely to have to omit the outlier rejection (but do some other fishing instead).
            “All primes are odd, except for 2” is simply correct. Whether it is interesting or not is a different question, but in any case it is not based on *statistical* inference and is therefore not related to the f-problem.

        • Artem Kaznatcheev

          The problem is the level of transparency. In mathematics, after all the monster-barring, you understand WHY you barred those monsters (usually, when some example breaks your proof, you understand why it does so). A psychology experiment doesn’t yield any understanding of why some outliers were excluded.

          • William Idsardi

            Actually, you bar the monsters to save the proof. Whether the monsters form some kind of coherent class is, at that point, a question left for the future. In the case of experiments (psychology or physics or …) some outliers should be excluded (for apparatus failure, for failure to follow protocol, etc.). Whether the outliers you have in some particular case should or should not be excluded is equivalent to the monster-barring question. Lakatos was apparently trying to tie these strands together at the end of his life (i.e. synthesizing _Proofs and Refutations_ with _The Methodology of Scientific Research Programmes_).

  • http://petrossa.me/ petrossa

    The Higgs boson had another problem though: nobody knew the mass of the particle they were looking for. Obviously that hinders discovery. If you know it’s, for example, 50 GeV and you find a particle of that energy, you can conclude definitively that you found it. In this case they found a particle with a mass in the overall ballpark and declared it found. Which to my mind is no different from the problem you describe.

    • Rob Hooft

      Indeed, but that is why physicists look for “six sigma” deviations. The p=0.05 used in a lot of p-value science is only “two sigma”. Two sigma comes up as an accidental result about once every 20 tries; six sigma only about once in hundreds of millions.
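
      For reference, a rough sigma-to-p conversion (two-sided normal tail areas; conventions differ, so treat these as approximate):

          # Rough sigma-to-p conversion (two-sided normal tail probability).
          from scipy import stats

          for sigma in (2, 3, 5, 6):
              p = 2 * stats.norm.sf(sigma)   # P(|Z| > sigma)
              print(f"{sigma} sigma -> p ~ {p:.1e} (about 1 in {1 / p:,.0f})")

      Roughly: 1 in 22 for two sigma, 1 in 370 for three, 1 in 1.7 million for five, and 1 in 500 million for six.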

      • http://petrossa.me/ petrossa

        Sure. But even once in a million can happen ten times in a row if the scale is sufficiently compressed. As in particle physics where the scale is that small. This whole universe only exists for a tiny fragment in a random flow of energy. To us it seems long, billions of years, but that doesn’t mean it is objectively long. Here one tries to project human timescale on events that happen objectively in a fleeting moment. And what to us is a fixed law of chance isn’t anymore on the true timescale. It’s only our lack of the capacity to fully comprehend the scales involved that makes us see patterns. We can’t distance ourselves far enough to have the full overview. Hubris in optima forma.

      • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

        True, but datasets in experimental physics are big. The LHC produces petabytes of data per year – with sufficient fishing, it would be possible to prove any theory to six sigma from that dataset. Luckily, we know that the evidence for the Higgs Boson was not fished (as it was naturally “pre-registered”).

      • andrew oh-willeke

        The actual standard in the physics field is “five sigma” rather than six, but your analysis is otherwise correct.

  • Daniel Herman

    Yes. But what about the allied problem of simply relying on null hypothesis testing to determine the “value” of findings? I am not a statistician but I have found the critiques of NHST to be pretty compelling. Moving away from this approach in psychological research might in itself help to diminish the post hoc fishing for p values problem you identify.

    • Sander Greenland

      Exactly: Much of the problem arises from dichotomizing a complex or nearly continuous reality into “null”/”not null”, and treating each study as if it alone is a basis for deciding between these two artificial choices.

      Even if we do not question this artificial testing approach, we still have a problem that a researcher can fish for a “null” (P>0.05) result if that is preferable – perhaps rare in psychology, but not uncommon in medical research when the outcome under study is a serious side effect of a product of the research sponsor; see for example http://www.badscience.net/

  • http://blogs.discovermagazine.com Jonathan Tracey

    Why not just call it the function?
    Anyway, to be serious, I feel that full publication will not be enough to fix the problem. Results can be chosen from multiple tests, and variables can be set up with the specific purpose of fixing results.

  • Rob Hooft

    Even assuming completely open reasoning, one in twenty p=0.05 results reported in the literature is due to chance alone.

    • Dave Langers

      Fie! That is precisely not what a p-value means!
      “Given that there is no effect, there is a 5% chance of nevertheless finding one” IS NOT THE SAME AS “Given that an effect is found, there is a 5% chance that there is none”

      • Rob Hooft

        You’re right of course, but these two are quite comparable if the number of thinkable “false hypotheses” is much larger than the number of “true hypotheses”. I feel this is the case in practice.

        • Dave Langers

          Double fie!
          If the null-hypothesis were indeed true in a majority of studies, then *much more than* one in twenty p=0.05 results reported in the literature would be due to chance alone. Perhaps only one in twenty might not just be due to chance.
          I guess that reinforces your point further, but the math was still off ;-)
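
          To put rough (entirely invented) numbers on it: suppose only 10% of tested hypotheses are true, tests have 80% power, and everything reaching p < 0.05 gets reported.

              # A back-of-the-envelope sketch with made-up round numbers.
              n_tested = 1000                                 # hypotheses put to the test
              true_frac, power, alpha = 0.10, 0.80, 0.05

              true_pos = n_tested * true_frac * power         # real effects detected: 80
              false_pos = n_tested * (1 - true_frac) * alpha  # nulls passing p < 0.05: 45

              share_false = false_pos / (true_pos + false_pos)
              print(f"share of significant results that are false: {share_false:.0%}")  # 36%

          Under those assumptions more than a third of the ‘significant’ findings are false even with honest, pre-specified analyses; add f on top and it only gets worse.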

          • Rob Hooft

            Right again… I keep being inaccurate in how I express myself. In practice, much more than 5% of all published p=0.05 studies are not reproducible. In general, the effect size wears off when a p-value finding is reproduced, in any p-value science; this has been called the “decline effect”: http://en.wikipedia.org/wiki/Decline_effect

  • Liisa Galea

    While I understand the sentiment of this post I think it is a flawed plan to only go in with one outcome in mind. Data is never that easy to interpret – for example it may be that the treatment only worked in a subgroup and you will miss this important piece of data. The idea of going in with the understanding that your treatment will be beneficial to a specific cognitive test is reminiscent of an RCT. If you don’t find a positive effect on that specific test you have p>0.05 and you are done. However, it may well be that your treatment is only effective in young women (or only in men that exercise regularly) for example and you will miss that finding without thoroughly examining your data.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      I completely agree – which is why I said that “The f problem is that researchers can try multiple approaches in private, and select for publication the most desirable ones.”

      There’s nothing wrong with trying lots of approaches so long as you do so openly, allowing readers to judge for themselves what results they consider interesting given the number of degrees of freedom.

  • Bill Skaggs

    The real problem, I think — or at least a major part of the problem — is that people often have wrong expectations about the scientific literature. The criteria for publishing should be that a result is (a) interesting and (b) strong enough to be worth reading, not that it is absolutely unequivocally proved.

    Of course it’s very nice when hypotheses are clearly framed and all the data analysis is mapped out ahead of time, but many interesting findings have emerged from exploration — “fishing”, if you will. Ideally the observations made while fishing are then tested by specific followup experiments, but that can take a long time and more resources than are available — it’s sometimes better to let the community know what you’ve seen, even if it isn’t fully solid.

    My own approach here is that when I’m fishing, I don’t pay attention to marginal p values — generally nothing weaker than p < 0.001 will get me interested.

  • andrew oh-willeke

    I would suggest that there was plenty of “f” in the Higgs boson search. This is because only a small percentage of the total data produced in a particular run of collisions is recorded and analyzed, with the remainder excluded from analysis according to certain criteria (sometimes, but not always, “blind”). The process that goes into these technical data “cuts” is so esoteric that only a tiny number of people with a firm PhD-level high energy physics background and access to unpublished information can meaningfully evaluate their impact.

    The disciplinary fix for this problem is a kludge. Physics, based on past experience, treats only five-standard-deviation results (a far stricter threshold than statistical significance in other fields) as discoveries, whereas, if there were no “f” problem and significance were evaluated with the “look elsewhere effect” fully and accurately accounted for, two or three standard deviation effects would be far more notable than they actually are in practice.
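
    A back-of-the-envelope sketch of the look-elsewhere effect, assuming (purely for illustration) N independent places one could have ‘looked’ – real searches are correlated, so this overstates the inflation somewhat, but it shows how quickly a local three-sigma excess stops being impressive:

        # Look-elsewhere inflation under the simplifying assumption of N
        # independent 'places to look'.
        from scipy import stats

        local_p = stats.norm.sf(3)                 # a local three-sigma excess
        for n_looks in (1, 10, 100, 1000):
            global_p = 1 - (1 - local_p) ** n_looks
            print(f"{n_looks:5d} looks -> global p = {global_p:.3f}")

    With a thousand independent places to look, a local three-sigma fluctuation somewhere in the data is more likely than not.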

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      Very interesting. Thanks.

  • Dave Langers

    I feel that we should make a clear distinction between exploratory work and confirmatory work. One cannot live without the other:
    - Exploratory studies don’t prove a thing, but are great ways to discover new unexpected science and arrive at plausible hypotheses;
    - Confirmatory studies require a lot of luck and prior insight to hit a nail on the head, but if you hit it then it will stick.
    The problem is that we seem to have become unable to make the distinction.
    I see the issue, and it is real, but my problem with forcing *every* study to be preregistered is that it takes the exploration away, and that is going to make science move a lot slower than it could.

  • matus

    The problems seem to boil down to Frequentist statistics. Maybe we should start to look to the B solution, no?

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      Bayesian stats would solve some problems, but it wouldn’t cure F. It would still be possible to select the most positive comparisons / tests and only publish those, however ‘positive’ is defined. And with Bayesian stats there’s the additional degree of freedom of choosing priors.
