# P-Hacking: A Talk and Further Thoughts

A week ago I gave a talk to Marcus Munafo’s group at the University of Bristol on the subject of P-hacking.

The presentation is now available online. If you don’t have time to watch the whole thing, I’d recommend the section from 14:50 to 22:00 – that’s my live demonstration of how, with a few minutes of P-hacking, you can find a nicely significant correlation (p = 0.006) between two variables that have no relationship whatsoever.

Here are a few further thoughts on this topic.

As I noted in my talk, there are various proposed methods of detecting p-hacking within a set of studies or a whole literature. These methods tend to rely on looking at the distribution of published p-values, or p-curve. If there really is an effect, p-values should be clustered near 0, with very small p-values more common than those just under 0.05.

Under the null hypothesis of no effect, p-values are uniformly distributed between 0 and 1, so lots of significant p-values spread evenly between 0 and 0.05 indicate selective publication of chance findings. Finally, a ‘bump’ in p-values just below 0.05 is a sign of heavy p-hacking.
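To make the two 'honest' p-curve shapes concrete, here is a minimal simulation sketch. It is not taken from the talk. The two-sided z-test with known unit variance, the sample size of 20, the true mean of 0.5 and all the function names are my own illustrative assumptions.

```python
import math
import random


def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known variance of 1."""
    n = len(sample)
    z = sum(sample) / n * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_values(true_mean, n_studies=20000, n_per_study=20, seed=0):
    """Run many hypothetical studies and return one p-value per study."""
    rng = random.Random(seed)
    return [
        z_test_p([rng.gauss(true_mean, 1.0) for _ in range(n_per_study)])
        for _ in range(n_studies)
    ]


def p_curve(p_values, n_bins=5, alpha=0.05):
    """Share of the significant p-values in each equal-width bin below alpha."""
    significant = [p for p in p_values if p < alpha]
    counts = [0] * n_bins
    for p in significant:
        counts[min(int(p / alpha * n_bins), n_bins - 1)] += 1
    return [c / len(significant) for c in counts]


# Bins are [0, 0.01), [0.01, 0.02), ..., [0.04, 0.05).
null_curve = p_curve(simulate_p_values(true_mean=0.0))    # no effect
effect_curve = p_curve(simulate_p_values(true_mean=0.5))  # assumed true effect

print("no effect  :", [round(share, 2) for share in null_curve])
print("true effect:", [round(share, 2) for share in effect_curve])
```

Under these assumptions the no-effect curve should come out roughly flat, while the true-effect curve should pile up in the lowest bin. The exact shares depend on the assumed effect size and sample size.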

These methods are powerful. However, I worry that in some cases selective publication and/or p-hacking could produce p-values clustered around 0, not 0.05, and therefore go undetected. Here are three scenarios in which this might happen:

**1. “Overhacking”:** Researchers might not stop with a significant result, but might keep hacking their p-values to be as low as possible, to make their results look more compelling. There is no reason why p-hacking has to stop once p < 0.05. Indeed, as awareness of p-hacking grows, researchers may become increasingly suspicious of marginally significant p-values, and the *de facto* desired significance threshold may shift downwards. Perhaps this is already happening.

**2. Selection bias:** We tend to prefer the lowest p-value given a choice. If you find two results (or two variants of the same result), with p-values of (say) 0.04 and 0.07, it would be easy to report only the 0.04. But if you have three options, 0.01, 0.04 and 0.07, you’d probably report 0.01, *not* 0.01 and 0.04. In other words the pressure is not “publish all p-values below 0.05” but “publish the lowest possible p-value, if this is below 0.05”. Importantly, this doesn’t even require ‘active’ p-hacking (i.e. trying different variants of the same analysis). It just requires the running of multiple analyses (maybe on different variables) and selective publication.

**3. Selective debugging:** Sometimes researchers use inappropriate statistical tests, or there are artifacts or data coding errors. Researchers may be more likely to spot and fix ‘bugs’ that create non-significant p-values than those that produce significant ones. If an experiment ‘doesn’t work’, I look at my analysis procedures. If I spot a bug, I fix it, and run the analysis again. And so on. Once I do get a significant result, it is very tempting to stop looking for more bugs. What this means is that I’m selecting in favor of bugs that produce false positives. I consider selective debugging a form of p-hacking. But bugs can’t be assumed to produce evenly distributed p-values. Some bugs produce p-values clustered around 0, because there ‘really’ is a deviation from the null hypothesis – albeit because of a bug, not because of the true data.
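The selection-bias scenario (2) is the easiest of the three to sketch in code. The simulation below rests on assumptions of mine: every analysis is run on pure noise, the k analyses are independent (so each p-value is a uniform random number), and only the lowest p-value is reported, and only if it is below 0.05. The function names are hypothetical.

```python
import random


def reported_p_curve(k, n_studies=50000, n_bins=5, alpha=0.05, seed=1):
    """Shape of the 'published' p-curve when only the lowest of k null
    p-values is reported, and only if it falls below alpha."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_studies):
        best = min(rng.random() for _ in range(k))  # lowest of k uniform p-values
        if best < alpha:
            counts[min(int(best / alpha * n_bins), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def expected_share(k, lo, hi, alpha=0.05):
    """Exact share of reported p-values in [lo, hi) under the same assumptions,
    using P(min of k uniforms > p) = (1 - p) ** k."""
    return ((1 - lo) ** k - (1 - hi) ** k) / (1 - (1 - alpha) ** k)


# Bins are [0, 0.01), [0.01, 0.02), ..., [0.04, 0.05).
curves = {k: reported_p_curve(k) for k in (1, 3, 20)}
for k, curve in curves.items():
    print(f"k = {k:2d}:", [round(share, 2) for share in curve])
```

With a single analysis the reported p-values are flat across the bins, as expected under the null. As k grows, the reported p-values tilt towards 0 even though no effect exists. Under the independence assumption the tilt is mild for small k and only becomes marked when many analyses are run. Correlated analyses, which are closer to what happens in real p-hacking, would behave differently.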
