P-Values and Exploratory Research

By Neuroskeptic | March 31, 2015 3:18 pm

Lately I’ve been talking a lot about the question of whether scientists should preregister their research protocols.


One question that often arises in these discussions is: “what about exploratory research?”

The argument goes like this: sure, preregistration is good for confirmatory research – research designed to test a particular hypothesis. However, some research (perhaps most) is exploratory, meaning that it’s about collecting data and seeing where it leads. Exploratory studies have no prior hypothesis or set protocol. Preregistration would hamper or stigmatize such open-ended hypothesis-generating research, which would be a bad thing.

Now, I don’t think that preregistration would hurt exploratory research, but in this post I want to ask: what exactly makes research ‘exploratory’? In particular, I’m going to explore the question: can research be called exploratory if it uses p-values?

P-values are everywhere. In neuroscience, psychology and many other fields, the great majority of published empirical research uses them. Now a p-value is “the probability, under the assumption of the null hypothesis, of obtaining a result equal to or more extreme than what was actually observed.” So every p-value implies the existence of a hypothesis – the null hypothesis.
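
To make that definition concrete, here is a minimal sketch (Python, standard library only, with purely illustrative numbers) of a p-value under a simple null hypothesis:

```python
from math import comb

def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: the probability, under the null hypothesis that
    each flip lands heads with probability p_null, of observing at least
    as many heads as we actually did."""
    return sum(comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)
               for k in range(heads, flips + 1))

# e.g. 8 heads out of 10 flips of a supposedly fair coin:
print(binomial_p_value(8, 10))  # ~0.055
```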

How, then, can any study that results in p-values be considered purely hypothesis-generating? Surely every p-value represents a hypothesis being tested?

One answer would be as follows: maybe a study is only confirmatory if it involves a positive hypothesis, as opposed to a null hypothesis. The null ‘hypothesis’ in an exploratory study might be “that there is nothing new and interesting going on here.” We might decide that this doesn’t count as a hypothesis for the sake of deciding whether a study is confirmatory. My impression is that this is the assumption behind many discussions of exploratory science.

My concern is that this approach makes ‘exploration’ purely a matter of the researcher’s intentions. The very same analyses on the very same data could be either exploratory, or confirmatory, depending on what is going on in the researcher’s mind when they do it. This is unsatisfactory to me.

So what if we bite the bullet and declare that anything involving a p-value is a confirmatory study? Taken to its logical conclusion, this could mean that all confirmatory (p-value) analyses should ideally be preregistered, while non-preregistered analyses could use descriptive statistics, but not inferential ones.

A “no preregistration, no p-values” rule also ensures that p-values can be taken at face value. A p-value is the chance of finding a result as extreme as the observed result, under the null hypothesis. But what if you run lots of different statistical tests to address the same hypothesis? Then your chance of finding an extreme result in at least one test is higher than the p-values indicate. (Multiple comparisons correction solves this problem, but only if it’s applied over all of the tests that were ever tried, not just all of the tests that are published, and preregistration is the only way to ensure this.)
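
A quick simulation makes the inflation visible. This is only a sketch, assuming twenty tests run on pure-noise data (so the null hypothesis is true for every test) and a normal approximation; the exact numbers are arbitrary:

```python
import random
from math import erf, sqrt
from statistics import mean, stdev

def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def one_sample_p(xs):
    """Normal-approximation test of 'the true mean is zero'."""
    return two_sided_p_from_z(mean(xs) / (stdev(xs) / sqrt(len(xs))))

random.seed(0)
n_sims, n_tests, alpha = 2000, 20, 0.05
hits = 0
for _ in range(n_sims):
    # twenty different "tests" on pure noise, all addressing a true null
    ps = [one_sample_p([random.gauss(0, 1) for _ in range(100)])
          for _ in range(n_tests)]
    if min(ps) < alpha:          # at least one nominally 'significant' result?
        hits += 1
print(hits / n_sims)             # roughly 0.6-0.7, not 0.05
# Bonferroni correction: compare each p to alpha / n_tests instead,
# which only works if n_tests counts every test that was actually tried.
```

In other words, well over half of these pure-noise “studies” produce at least one p < 0.05, which is why the correction has to count every test that was tried, not just the ones that get reported.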

However, it would be impractical to “ban” exploratory research from using p-values altogether. Even if we could, doing so would undermine the principle that exploratory research is open and free-form. But maybe we could re-brand exploratory p-values so that they can’t be mistaken for (preregistered) confirmatory ones. We could call them p(e) or p* or p’ values. We could even call them o-values, because they are “not quite p’s”.

  • Sturla Molden

    Or we could stick to descriptive statistics and focus on graphing data. This could be supplemented with Bayes factors where needed, instead of classical hypothesis testing. In a Bayesian context the difference between exploratory and inferential/confirmatory analysis does not exist.

    • PsyoSkeptic

      Except with exploratory you’d pretty much not have meaningful priors by definition so Bayes would be moot. It would be the confirmatory followup studies that would use your study to generate priors.

      • Sturla Molden

        In the natural sciences we can generate priors from common sense. If I am counting virus particles, I know there cannot be fewer than 0 particles in a sample. If I also know the size and the mass of the virus, I can give an upper limit given the size and volume of the sample. If I know the disease is very rare, I can use 0 as the prior mean. And now I can fit a rough prior. Bayes factors are robust against the exact choice of prior, so it doesn’t really matter. You don’t need data to generate a prior, you just need to use common sense and your knowledge of whatever you have measured.
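
For illustration, a rough sketch of the kind of calculation described above – assuming a Poisson counting model, a half-normal prior truncated at a physical upper bound, and entirely made-up numbers – just to show that a "common sense" prior is enough to compute a Bayes factor:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing a count of k given a Poisson rate lam."""
    return exp(-lam) * lam ** k / factorial(k)

def bayes_factor(count, lam_null, lam_max, prior_sd, n_grid=2000):
    """BF10 for an observed count: H0 fixes the rate at lam_null;
    H1 puts a rough half-normal prior (centred at 0, scale prior_sd) on the
    rate, truncated to the physically possible range [0, lam_max]."""
    step = lam_max / n_grid
    grid = [(i + 0.5) * step for i in range(n_grid)]
    prior = [exp(-0.5 * (lam / prior_sd) ** 2) for lam in grid]
    norm = sum(prior) * step                 # normalise the truncated prior
    marginal_h1 = sum(poisson_pmf(count, lam) * p / norm
                      for lam, p in zip(grid, prior)) * step
    return marginal_h1 / poisson_pmf(count, lam_null)

# hypothetical numbers: 7 particles counted, background rate 1, physical ceiling 50
print(bayes_factor(7, lam_null=1.0, lam_max=50.0, prior_sd=5.0))
# prints a large Bayes factor: the observed count is well above the background rate
```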

  • GCBill

    I do think the same data analysis can function as exploratory or confirmatory, but I don’t think its status entirely depends on the researchers’ intent even if we allow p-values. It also matters what competing hypotheses have been corroborated or contradicted in recent literature. Research with p-values can be “exploratory” if the hypothesis-space for “something interesting” hasn’t been significantly narrowed. The process of homing in on the correct theory is, as always, a collaborative one.

  • Wintermute

    Sounds like it would split a larger study into two separate papers: the hypothesis-generating experiment analysed with descriptive statistics, *then* a preregistered study to validate your hypothesis with inferential statistics.

    • D Samuel Schwarzkopf

      I actually think the exact opposite would be a great leap forward for our field. We should allow just that to be submitted in the first stage. It is essentially a more formalised pilot study, quantifying and describing the findings, which then gets reviewed. The journal can decide if this finding is theoretically of interest to their lofty goals. Together with reviewers you then finalise the design for stage 2. It could be the same design, but ideally it can be improved, power calculations can be made, etc. This then becomes a preregistered self-replication which you carry out and then present as the final product. The journal must publish the outcome regardless of what you find.

    • practiCalfMRI

      Yes, more replication. I’d argue it’s more important than a particular p-value, even a very good one (which ought to make replication a breeze, right?).

      • D Samuel Schwarzkopf

        Does it? (Or is this yet another AF joke…? 😉

  • D Samuel Schwarzkopf

    I think there is a confusion as to what people mean by ‘exploratory research’. In the context of the preregistration discussion, a lot of exploratory research is still hypothesis-driven. It’s just that the hypotheses weren’t necessarily defined prior to data collection. The guidelines about Registered Reports for instance state that this is the kind of ‘exploratory analyses’ that you would declare.

    It -is- possible to formulate a hypothesis after you have collected the data. Hypotheses aren’t contingent on when you collect data. Say, you flipped a coin 10 times and it fell on heads 10 times, but you neglected to preregister your hypothesis test because I took you to the pub the night before to continue our discussion from the other week… You’d be perfectly right to run a hypothesis test to check if the coin is biased. I wouldn’t call that exploratory. If it were, anything for which the data are already present, e.g. in archaeology or astronomy or reanalysing a big neuroimaging data set, would also be exploratory.

    So what I think this shows is that the distinction between exploration and confirmation isn’t actually as clear cut as people would like us to believe. Instead I think we should call a spade a spade. What this discussion is usually about is preregistered vs non-registered protocols.

    There is also a different type of exploratory study which actually requires statistical control, such as whole-brain imaging. If you have no prior hypothesis of where you expect to see an activation to a particular condition, you need strict control of false positives. It’s not exploratory only because it’s just about the null hypothesis. The alternative hypothesis is very clear: you expect activation to condition X. However, since you don’t have a defined hypothesis of where this activation should be or (even better) how large it should be, or whether it even exists at all, this makes it exploratory (note that this sort of experiment could certainly be preregistered though!).

    Now I am with you that there could be purely exploratory research. A lot of studies report hypothesis tests that they don’t really need. I recall reading a nice blog post about this recently (can’t recall where though) and that last Gigerenzer paper discussed this too. For those things you could abandon the use of p-values – although in that case you should also abandon confidence intervals, because they aren’t really all that different from p-values.

    I definitely disagree with your point here and what you said at the UCL event: it really makes no sense to correct for the multiple statistical tests you do over the course of your career. I’m not sure if you’re being tongue in cheek there but if one did that you would essentially make it impossible to ever find scientific discoveries. This cannot be the answer…

    Last but not least, I think this discussion dances around the true problem: Whatever we do, we should abandon the use of p-values altogether – with or without preregistration 😛

    • Alexander Etz

      “Whatever we do, we should abandon the use of p-values altogether – with or without preregistration :P”

      Amen.

      • D Samuel Schwarzkopf

        Yeah I should just have posted that and gone to bed 😉

    • PsyoSkeptic

      Your coin example is very bad. It’s exactly the kind of thing that you *cannot* run a hypothesis test on because the probability of those heads given no prior hypothesis is 1. You can post hoc see what the probability would be if the null were true but you can’t reject the null. You could only use the found probability to guide future research where you could reject the null.

      (a hypothesis test and p-value are not the same thing)

      There are several other issues with your comment as well. For example, Type I error correction in FMRI research should be set to very different values for exploratory and confirmatory research.

      Exploratory research is much more about being open about what you’re doing than any particular analysis. Admit that you looked at lots of things and this was the interesting thing you found. But don’t call it a hypothesis test post hoc because that’s completely nonsensical.

      • D Samuel Schwarzkopf

        I never said anything about having no prior hypotheses. Of course you can have a prior hypothesis in the coin example (I’m afraid I apparently wrongly assumed that was obvious). Your null hypothesis is a fair coin, and in this case you presumably want to test the alternative that this is a two-headed coin. You can set about calculating the probability of obtaining the observed result under the null just as if you had decided to do so before flipping the coin.

        What you’re saying about fMRI is also wrong. I *specifically* said that for exploratory, whole-brain localisation analysis you need to use very different control of false positives than you would for testing a specific hypothesis.

        Either way, what is called ‘exploratory’ in the context of the prereg discussion usually means ‘non-registered’. In many situations that may be just as you said: run many analyses and report what you found to be interesting. But that may not be the same thing. In the preregistered studies I’ve seen, most ‘exploratory’ analyses were very hypothesis-driven.

        • PsyoSkeptic

          You never said anything about having or not having a prior hypothesis. Did I incorrectly interpret the example as really being, “you could have some data that seem surprising and test it after the surprise”? That seemed to be what you are getting at and in that case the test is inappropriate. I conceded that for coin tosses the null might be a default going in but that’s definitely not the case generally and in either case you can’t decide to test it after seeing surprising data. Your false positive rate for tests would become greatly inflated.

          Sorry I missed the FMRI distinction, that was good to point out.

          • D Samuel Schwarzkopf

            No. My point is that the data can come in before you formulate your hypothesis. Preregistration is a separate issue. It would prevent HARKing but hypothesising after the data have been collected is not the same as hypothesising after the results are known. Although I grant you that this probably happens a lot.

            In the scenario you describe (and which I was probably unclear about) you form the hypothesis because you saw surprising results – so it is outcome dependent. I agree that then you should formulate a hypothesis that you can test explicitly in the next experiment.

            However, I disagree that it would be *wrong* to calculate a probability on the 10 initial coinflips. The probability of obtaining that result under the null would be pretty low (<0.001) and this can flag up interesting results to follow up. So, the answer to NS's question is no, probability values should not be excluded from the 'exploratory' non-registered results. (Of course in practice, if you don't include p-values, some reader will just calculate them anyway :P)
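
(For reference, the figure quoted above is simply the chance of ten heads in ten flips of a fair coin:)

```python
print(0.5 ** 10)  # 0.0009765625, i.e. just under 0.001 (one-sided)
```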

          • PsyoSkeptic

            You’re making a distinction that doesn’t exist. Formulating a hypothesis after data come in is the same as hypothesizing after the results are known.

            Say I receive some data, perhaps 5 samples from levels of a treatment, and I then look at the means and decide, “oh dear, the most powerful treatment effect is that between B and D, I’d better check that with a t-test.” That one statistical test of B and D with the typical cutoff has an alpha of about 0.27, NOT 0.05. And it’s not an easy solution to decide to just correct it, because the correction for all tests then makes alpha much, much smaller than desired.
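
That inflation is easy to check by simulation – a sketch assuming five groups of pure-noise data, a normal approximation to the t-test, and the "pick the biggest difference, then test it" procedure described above:

```python
import random
from math import erf, sqrt
from statistics import mean, variance

def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def welch_p(a, b):
    """Normal-approximation Welch test for a difference in means."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return two_sided_p_from_z((mean(a) - mean(b)) / se)

random.seed(1)
n_sims, n_groups, n_per_group, alpha = 2000, 5, 50, 0.05
false_positives = 0
for _ in range(n_sims):
    groups = [[random.gauss(0, 1) for _ in range(n_per_group)]
              for _ in range(n_groups)]
    # choose the pair with the largest observed mean difference, then "test" it
    pairs = [(a, b) for i, a in enumerate(groups) for b in groups[i + 1:]]
    a, b = max(pairs, key=lambda ab: abs(mean(ab[0]) - mean(ab[1])))
    if welch_p(a, b) < alpha:
        false_positives += 1
print(false_positives / n_sims)   # roughly 0.25-0.30, nowhere near 0.05
```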

          • Guest

            Of course there is a distinction! One is a questionable research practice, the other is how the scientific method works.

          • D Samuel Schwarzkopf

            Of course there is a distinction here! I never said “look at the means and decide”. That would make the inference contingent on the outcome which is a different story.

          • PsyoSkeptic

            Hmmm, I see you’re trying to make a distinction between a hypothesis formed after the data come in and a hypothesis formed after the results are known, but you’re not making it clearly.

            You commented earlier that it would be OK to do a test if the coin flips all came out a certain way that was surprising. You commented that the data could guide the hypothesis generation. So this isn’t just getting data in a box, but data you’ve looked at and generated hypotheses from.

            You’re saying that’s not HARKing.

            So, perhaps you mean by known results the results of a statistical test as opposed to everything else about the data?

          • D Samuel Schwarzkopf

            As I said before, I wasn’t being clear in my initial example. I don’t know if I can be any clearer now though. I am not talking about “surprising” data. For it to be surprising you have to look at it first – which is a form of inference. NS’s post was about preregistration. The whole point of my example was that you can do hypothesis tests without that because not being preregistered doesn’t automatically make it “exploratory”.

            Now there is a separate issue to be discussed about surprising data. You need to be able to make inferences about those too. I might write a blog post about this in the future because it’s interesting – but that wasn’t what I was discussing above.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      “Say, you flipped a coin 10 times and it falls on heads 10 times but you neglected to preregister your hypothesis test… You’d be perfectly right to run a hypothesis test to check if the coin is biased.”

      Hmmm. But what if you flip one thousand coins ten times each? Would you be justified in running a test on whether the coin that gave the most heads is biased?

      • D Samuel Schwarzkopf

        No but that’s not what I said 😛

        • D Samuel Schwarzkopf

          LOL I deleted that because I didn’t want to clutter this thread with redundant comments. That backfired :P. Backfired even more by me now commenting on my redundant comment 😉

      • D Samuel Schwarzkopf

        Actually scratch that. Yes, you would be justified in doing so provided that you also declare that this is what you did and you corrected your statistical test accordingly…

        This is actually a perfect example for what I mean. It’s quite clearly exploratory but you nonetheless calculate probability values.

        • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

          “Provided that you also declare that this is what you did and you corrected your statistical test accordingly…”

          Ah, but how would you correct it? Would you correct it for (say) 1000 multiple comparisons?

          Or would you have to correct for 2000 comparisons because you might also have decided to look at the coin that gave the most tails?

          Or 3000 if you also decided to look at the pair of coins that were most correlated with each other (coin-coin connectivity analysis)?

          My point being, if you are free to adopt any number of analysis strategies, how do you make sense of the p-values that result from any one strategy?

          • D Samuel Schwarzkopf

            I think the answer to this question is inherent in what I said: “Provided you declare what you did”

            If you decide your Bonferroni correction factor based on the outcome then this is a form of p-hacking. In this example you should correct for the number of coins (unless there is a random field theory application taking into account the correlatedness of the coins…? ;)).

            But, as I said, you shouldn’t be using p-values at all ;). The problem is that I haven’t quite understood how to deal with multiple tests in the BF world, and as far as my bootstrapped evidence is concerned I haven’t yet decided on the best way to handle them. The good news is that false positives are generally reduced with greater power, so that already makes the problem less severe, but it alone doesn’t fix it.
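
For what it's worth, a back-of-the-envelope sketch of the thousand-coin scenario discussed above (fair coins, ten flips each; the numbers are the hypothetical ones from the thread):

```python
p_single = 0.5 ** 10                # 10/10 heads for one fair coin: ~0.001
p_any = 1 - (1 - p_single) ** 1000  # chance at least one of 1000 fair coins does it
print(p_single, p_any)              # ~0.001 vs ~0.62
# Bonferroni over 1000 coins: the per-coin threshold becomes 0.05 / 1000 = 5e-5,
# so even a 10/10 run (p ~ 0.001) would no longer count as significant.
```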

          • D Samuel Schwarzkopf

            Actually since the evidence reflects the power it may simply be wise to consider the evidence. If everything is barely above inconclusive range you probably should reserve judgement. This is something that you can’t just do with classical stats. But it’s largely an empirical question that I may tackle one day if I actually continue this work… 😛

  • PsyoSkeptic

    The critical issue here is not the p-value but what is done with it. It’s just the probability of the data given the null hypothesis. The p-value is *not* a test. A test is typically comparing that p-value to a predefined cutoff to decide that the null hypothesis itself is false. But true / false is a dichotomy made in a decision process and not a p-value. When you have a pre-existing hypothesis you can test it this way and it’s not so bad an idea. However, if the data are exploratory you don’t do the test because, among other reasons, the p-value logic has nowhere to go when the null is rejected without an alternative.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      OK, but it’s hard to see something labelled as a “p-value” and not, at least mentally, classify it as “significant”, “not significant”, “highly significant”.

  • Mitch_Ocean

    Anyone good at exploratory research is someone who can recognize a set of conditions where something interesting might happen. So, the first-order part is to identify an area that might have something interesting to study.

    Then, one might make a preliminary hypothesis that given x, y or z might happen within this space. If that does happen, boring–already explained by what you know. Often, however, result w shows up, and then that is when you have struck something interesting. So, you might think of the null hypothesis in exploratory research as “I can predict what the result will be”, and if so, no new science…

  • PsyoSkeptic

    Let’s not confuse p-value and hypothesis test any further in this thread. They’re not the same thing. One is merely a probability given some assumptions. The other is a, frequently logically awkward, decision process.

    • D Samuel Schwarzkopf

      I think they are being conflated because that’s the logic of NS’s blog post. Perhaps he wants to clarify whether he means there shouldn’t be hypothesis tests at all in non-registered studies? As I said below, you would still end up with informal hypothesis-testing if people start looking at confidence intervals instead.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      A p-value is a probability given the assumptions that constitute the null hypothesis. True, a p-value itself is not a test, but it becomes a test once we apply a rule such as “p < 0.05 means reject the null hypothesis", and that final step is so common that I didn't bother to spell it out.

      Edit: and even if someone does calculate a p-value and doesn't explicitly threshold it, the readers will do that for themselves, because p < 0.05 is so universal.

      • PsyoSkeptic

        Thanks for the clarification. What readers do with a p-value might be important but it’s also important to make the distinction because p-values and tests are farther from each other than people think. For example, the p-value has a stronger correlation with the strength of evidence against the null than the test does. In another example, data peeking wherein one collects further data to pass a test causes the p-value at a particular stopping N to become more accurate at the same time as it is generating test errors about that p-value. It’s the cherry picking of the random variation in the p-value that’s the problem in that case and not the p-value that’s becoming more likely to be erroneous.

        I think a lot of confusion arises about the p-value and the intertwining of hypothesis tests. Worse yet, we often do things like modify the calculated p-value when multiple-testing rather than do what’s really going on and modify the cutoff value. This makes it seem like they’re interchangeable, and they’re not, insofar as the meaning of the p-value becomes corrupted. They’re only interchangeable in terms of test outcomes. (And paradoxically, people who focus on test outcomes tend to focus on the importance of the p-value and then corrupt its meaning.)

        • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

          Fair enough – I could have made the distinction between p-values and hypothesis tests clearer!

        • MoRaw

          “But what if you run lots of different statistical tests to address the same hypothesis? Then your chance of finding an extreme result in at least one test is higher than the p-values indicate. (Multiple comparisons correction solves this problem”

          I could be wrong, but multiple comparisons correction is done when one uses the same statistical test but with different entities/comparisons (for example on different subjects).

          • PsyoSkeptic

            Multiple comparison correction has to be done on any fishing expedition. That can come up in either of your scenarios. I’m not sure how this is related to my comment but the issue of multiple comparisons isn’t cut and dried. The basic argument for them means you’re supposed to do it across variables in a factorial ANOVA. Ever see anyone do that?

      • http://www.facebook.com/felonious.grammar Felonious Grammar

        I’ve read recently that the p-value is rather arbitrary. Can you explain it briefly in layman’s terms?

        • MoRaw

          Of course it is arbitrary (i.e., the p-value is a random variable) with a certain mean and variance. For example, if one uses permutation testing (random scrambling) to find the p-value, then this randomization will be reflected in the p-values. If the variance of the p-value distribution is high, then we might have a problem. Moreover, it will be normally distributed if the conditions of the central limit theorem are fulfilled.

          • PsyoSkeptic

            I’m not sure where this comes from. The p-value in permutation generation is uniformly distributed. The permuted statistic is normally distributed. Generate a permuted statistic and then calculate p for each one. The resulting distribution of p-values will be flat while the statistic is normal.

          • Epicurus

            I was talking about the randomization and not the distribution (and how it will be reflected on the p-values). Plus, the permuted statistic is not always normally distributed, for example in MVPA studies.
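
The uniform-under-the-null behaviour mentioned above is easy to check directly – a sketch assuming a simple two-group permutation test of a mean difference applied to pure-noise data:

```python
import random
from statistics import mean

def perm_p(a, b, n_perm=999):
    """Two-sided permutation p-value for a difference in group means."""
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

random.seed(2)
# many datasets where the null is true: the p-values come out roughly uniform on (0, 1)
ps = [perm_p([random.gauss(0, 1) for _ in range(15)],
             [random.gauss(0, 1) for _ in range(15)]) for _ in range(500)]
print(sum(p < 0.05 for p in ps) / len(ps))  # ~0.05
print(sum(p < 0.50 for p in ps) / len(ps))  # ~0.5
```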

        • PsyoSkeptic

          One would need to know the context in which you heard it to know exactly what explanation you need.

          My guess is that they mean it’s arbitrary in terms of hypothesis testing. If you adopt a Neyman-Pearson framework of strict alpha cutoff and the commensurate feature of possible prior power calculations then the actual value of the p-value is irrelevant. It only passes or does not.

          Another way it’s arbitrary is when it is used as a stand-in for effect size. It’s highly dependent upon N and not a direct indicator of effect size. Very small effect sizes can also have very small p-values.
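
A small sketch of that N-dependence (normal approximation, made-up effect size): the same tiny standardized effect gives wildly different p-values as the sample grows.

```python
from math import erf, sqrt

def p_from_effect(d, n):
    """Two-sided p for a one-sample mean with standardized effect size d
    and sample size n (normal approximation, for illustration only)."""
    z = d * sqrt(n)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

for n in (20, 200, 20000):
    print(n, p_from_effect(0.05, n))  # p shrinks from ~0.8 to ~1e-12
```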

  • storkchen

    Excellent points! Not only in interpreting, but even in computing (!) p-values, the intention of the researcher plays a – numerically relevant – role. Think of corrections for multiple comparisons, for example. Preregistration is probably the only really solid way to protect us from HARKing.

    • D Samuel Schwarzkopf

      You’re probably right that only prereg can stop HARKing, although what bothers me about this solution (and this may be one of my emotional rebellions against it) is that it is so restrictive. I want people to be able to be honest about what they are doing. It is a myth that only HARKed studies can be published. I have always tried to be explicit in papers when our prior hypotheses weren’t correct. In all honesty, to some degree this is the case in almost any study. It is actually very satisfying to write a paper that reads like this:

      Intro: If I do A, will B happen?
      Methods: We did A.
      Result: C happened.
      Discussion: Looks like we were wrong. Things are more interesting than we thought. Or possibly less. Until next time 😛

      • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

        Preregistration doesn’t restrict you from HARKing, it just means that your HARKing will be clearly labelled as such and distinct from your a priori hypotheses. I would say that this is useful even for researchers who are perfectly honest and would never knowingly pass HARKing off as anything else. Because even honest people forget what their original hunches and hypotheses were!

        • D Samuel Schwarzkopf

          What I mean is that prereg is usually discussed as a restrictive measure: “stops HARKing, stops p-hacking, stops QRPs, etc.” Just as you say now, prereg can instead be a tool for good rather than a weapon against evil. It can make people’s thinking more coherent and prevent you from fooling yourself (as Feynman would put it).

          Anyway, HARKing is a tricky term. The way I see it used, it is generally treated as an evil. I think post-hoc hypothesising is perfectly fine, but that’s not what HARKing implies.

          • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

            I agree that prereg is often sold as a way of “stopping” those things. But this is a lazy rhetorical shorthand for “stopping those things being a problem by making them transparent, and also removing the incentive to do them”.

            Advocates of prereg perhaps ought to be more clear about exactly what we mean!

          • D Samuel Schwarzkopf

            In general I think, focusing more on positives will help. This applies to the wider context too, not just the prereg discussion. Of course this doesn’t mean you can’t criticise negatives but at the moment I feel the field is in a terrible mood that is worth breaking.

            Speaking of prereg, I am really liking that hybrid prereg idea I mentioned. I think this would suit the high impact journals particularly well. Basically, you carry out your “exploratory” study that produced sensational results. You then submit it for review, which determines if the journal cares enough to publish it at all. Then you preregister the protocol to replicate it – plus potential improvements agreed in the review process, but not allowing too much alteration unless it’s necessary – and then you carry that out in a confirmatory way. The only thing that still bothers me is what happens if the finding doesn’t replicate. I don’t think high impact journals would publish that. They could publish it in a different outlet of course, but that puts weight on positive replication, which in turn incentivises QRPs that fly under the radar of prereg. Still, it’s better than the status quo anyway as you’d have a guaranteed publication.

            Obviously (and as Chris Chambers confirmed) there is nothing stopping you from doing this with RRs as they exist now. I’m just saying that this should perhaps be the way all flashy high impact journals should operate. They have a particular role to play but at the same time they must have an interest in having more solid results.

  • MoRaw

    Nice. The question is, how much exploratory research is out there? I would say 90%.

