The presentation is now available online. If you don’t have time to watch the whole thing, I’d recommend the section from 14:50 to 22:00 – that’s my live demonstration of how, with a few minutes of p-hacking, you can find a nicely significant correlation (p = 0.006) between two variables that have no relationship whatsoever.
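Not the exact demo from the talk, but a minimal sketch of the general trick: take two variables with no real relationship, exploit an analytic degree of freedom (here, dropping single "outliers", which is my illustrative choice, as is the permutation test), and report only the smallest p-value found.

```python
import random
random.seed(1)

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def perm_p(x, y, n_perm=499):
    """Two-sided permutation p-value for a correlation."""
    obs = abs(pearson_r(x, y))
    y = list(y)
    hits = 0
    for _ in range(n_perm):
        random.shuffle(y)
        if abs(pearson_r(x, y)) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Two variables with no relationship whatsoever.
x = [random.gauss(0, 1) for _ in range(40)]
y = [random.gauss(0, 1) for _ in range(40)]

# The 'hack': also try dropping each single point as an outlier,
# then keep only the smallest p-value found.
candidates = [perm_p(x, y)]
for i in range(len(x)):
    candidates.append(perm_p(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))

print(f"honest p: {candidates[0]:.3f}, best p after hacking: {min(candidates):.3f}")
```

With more degrees of freedom (subgroups, covariates, alternative outcome measures), the "best" p-value only gets smaller.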
Here are a few further thoughts on this topic.
As I noted in my talk, there are various proposed methods of detecting p-hacking within a set of studies or a whole literature. These methods tend to rely on looking at the distribution of published p-values, or p-curve. If there really is an effect, the p-values should be clustered close to 0.
Under the null hypothesis of no effect, p-values are evenly distributed between 0 and 1, so lots of significant p-values evenly distributed between 0 and 0.05 indicates selective publication of chance findings. Finally, a ‘bump’ in p-values just below 0.05 is a sign of heavy p-hacking.
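The two p-curve shapes above are easy to simulate. A sketch, under my own illustrative assumptions (each "study" is a two-sided z-test on n = 30 normal observations with known sd = 1; the true effect is a mean of 0.5): under the null, p-values are uniform on [0, 1], while with a real effect the significant ones pile up near 0.

```python
import math
import random
random.seed(0)

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, given known sd = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def p_curve(true_mean, n_studies=2000, n=30):
    """Among significant results (p < 0.05), what share fall below 0.01?"""
    ps = [z_test_p([random.gauss(true_mean, 1) for _ in range(n)])
          for _ in range(n_studies)]
    sig = [p for p in ps if p < 0.05]
    return sum(p < 0.01 for p in sig) / len(sig)

# Under the null, p is uniform, so about 20% of significant p-values
# land below 0.01; under a real effect, far more of them do.
print("null effect:", round(p_curve(0.0), 2))
print("real effect:", round(p_curve(0.5), 2))
```

That difference in shape is exactly what p-curve methods exploit.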
These methods are powerful. However, I worry that in some cases, selective publication and/or p-hacking could produce p-values clustered around 0 rather than just below 0.05, and therefore go undetected. Here are three scenarios in which this might happen:
1. “Overhacking”: researchers might not stop with a significant result, but might keep hacking their p-values to be as low as possible, to make their results look more compelling. There is no reason why p-hacking has to stop once p < 0.05. Indeed, as awareness of p-hacking grows, researchers may become increasingly suspicious of marginally significant p-values, and the de facto desired significance threshold may shift downwards. Perhaps this is already happening.
2. Selection bias: given a choice, we tend to prefer the lowest p-value. If you find two results (or two variants of the same result), with p-values of (say) 0.04 and 0.07, it would be easy to report only the 0.04. But if you have three options, 0.01, 0.04 and 0.07, you’d probably report 0.01, not 0.01 and 0.04. In other words the pressure is not “publish all p-values below 0.05” but “publish the lowest possible p-value, if this is below 0.05”. Importantly, this doesn’t even require ‘active’ p-hacking, i.e. trying different variants of the same analysis; it just requires running multiple analyses (maybe on different variables) and publishing selectively.
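This selection effect can be simulated directly. A sketch, assuming (my illustrative choice) that each "study" runs k = 20 independent tests on pure noise and publishes only the smallest p-value, and only if it is below 0.05: the published p-curve leans toward 0 rather than bunching just under 0.05, mimicking a genuine effect.

```python
import random
random.seed(2)

def reported_p(k=20):
    """Smallest of k null p-values; 'published' only if significant."""
    # Under the null, each p-value is uniform on [0, 1].
    p = min(random.random() for _ in range(k))
    return p if p < 0.05 else None

published = [p for p in (reported_p() for _ in range(20000)) if p is not None]

# If the published p-curve were flat on [0, 0.05], about half of the
# p-values would fall below 0.025; min-selection pushes this well above half.
lower_half = sum(p < 0.025 for p in published) / len(published)
print(f"{len(published)} studies published; "
      f"{lower_half:.0%} of their p-values fall below 0.025")
```

The skew grows with k: the more analyses you run before picking the winner, the closer the reported p-values sit to 0.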
3. Selective debugging: Sometimes researchers use inappropriate statistical tests, or there are artifacts or data coding errors. Researchers may be more likely to spot and fix ‘bugs’ that create non-significant p-values than those that produce significant ones. If an experiment ‘doesn’t work’, I look at my analysis procedures. If I spot a bug, I fix it, and run the analysis again. And so on. Once I do get a significant result, it is very tempting to stop looking for more bugs. What this means is that I’m selecting in favor of bugs that produce false positives. I consider selective debugging a form of p-hacking. But bugs can’t be assumed to produce evenly distributed p-values. Some bugs produce p-values clustered around 0, because there ‘really’ is a deviation from the null hypothesis – albeit because of a bug, not because of a true effect in the data.
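Here is a toy illustration of such a bug (my hypothetical example, not a real case): if the data are accidentally sorted before being split into two 'groups', the group comparison comes out dramatically significant even though the data are pure noise; crucially, the resulting p-value lands near 0 rather than being uniform.

```python
import random
random.seed(3)

def perm_p_diff(a, b, n_perm=999):
    """Two-sided permutation p-value for a difference in group means."""
    obs = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

data = [random.gauss(0, 1) for _ in range(40)]  # no real group structure

# Correct analysis: split in the original (random) order.
p_correct = perm_p_diff(data[:20], data[20:])

# Buggy analysis: the data were accidentally sorted before the split,
# so 'group 1' holds the 20 smallest values and 'group 2' the 20 largest.
ranked = sorted(data)
p_buggy = perm_p_diff(ranked[:20], ranked[20:])

print(f"correct split: p = {p_correct:.3f}; buggy (sorted) split: p = {p_buggy:.3f}")
```

A bug like this makes the null hypothesis 'really' false for the analyzed data, so the p-values it generates cluster near 0 and look exactly like strong evidence.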