The Trolley Problem With Science

By Neuroskeptic | June 20, 2013 2:33 pm

Imagine a scientist who does an experiment, and doesn’t like the results. Perhaps the scientist had hoped to see a certain pattern of findings and is disappointed that it’s not there.

Suppose this scientist therefore decides to manipulate the data. She goes into the spreadsheet and adds new, made-up data points until she obtains a statistically significant result she likes, and publishes it.

That’s bad.

Now, start this scenario over. Suppose that rather than making up data, the scientist throws it out. She runs the experiment again and again (without changing it), throwing out the results every time they’re wrong, until eventually, by chance, she obtains a statistically significant result she likes, and publishes it.

Is that bad?

Yes, but isn’t it less obviously bad than data fabrication? I’m talking on an intuitive level. We feel that fabrication is clearly outrageous, fraudulent. Cherry-picking is bad, no one denies it, but it doesn’t provoke feelings to the same extent.

Cherry-picking goes on in science, and I don’t know a scientist who doubts that it’s more common than fraud. Yet we treat fraud much more harshly, regardless of the extent of the manipulation or the amount of money and prestige at stake.

One fabricated point of data is misconduct; a thousand unpublished points of data is merely a ‘questionable practice’.
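To see just how reliable the rerun-and-discard strategy is, here is a minimal simulation (illustrative code, not anyone’s actual analysis; it assumes a true null hypothesis, under which p-values are uniformly distributed):

```python
import random

random.seed(1)  # reproducible sketch

ALPHA = 0.05

def run_null_experiment():
    # Under a true null hypothesis the p-value is uniform on [0, 1],
    # so one "experiment" can be simulated as a single uniform draw.
    return random.random()

def runs_until_significant():
    # Re-run (and quietly discard) until p < ALPHA.
    runs = 1
    while run_null_experiment() >= ALPHA:
        runs += 1
    return runs

trials = [runs_until_significant() for _ in range(10_000)]
mean_runs = sum(trials) / len(trials)
print(mean_runs)  # close to 1 / 0.05 = 20 runs on average
```

On average it takes about twenty runs, but it never fails: patience alone turns a true null into a ‘significant’ finding.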


I think it’s a trolley problem.

Imagine an out-of-control trolley is going to smash into five people and kill them. You happen to be standing by a lever that would divert the trolley onto another track, where it would hit one person and kill him. Do you pull it?

Many people say that they would not pull the lever, even though not pulling it means four more deaths. They wouldn’t pull it because they don’t like the idea of committing the act of killing someone. Even though the decision not to act causes more harm, it doesn’t feel as bad, because it’s (in some intuitive sense) an act of omission.

Even people who do pull the lever feel conflicted about it.

I think the mentality is the same in science. Data fabrication is like pulling the lever – it’s committing deception. Cherry-picking is a sin of omission – you wait for the right data to come to you, and report that, omitting to report the rest of the data that points to a different conclusion.

The end result is the same – misleading results. But they feel different.

  • Dave Nussbaum

    I think there’s an additional layer of self-deception that can come in — that is, each iteration is not identical; the researcher changes something that was (presumably) wrong with the previous version of the experiment. So the new version is, in the researcher’s mind, improved, and the flaws that had derailed it before are now fixed.

    When it comes out significant on the 3rd or 8th attempt, it’s perfectly valid in the researcher’s eyes to dismiss the null results that came previously, because they were the result of a flawed study. It’s this new one that’s got it right. It’s pretty easy to see how someone could fall into this trap, especially when they have a strong confirmatory bias in favor of their hypothesis.

    I don’t think replication can solve all of the problems science is facing, but this is one where it helps a lot. If you get the significant result on your 6th try at perfecting the experiment, then great — that’s exploratory, and you should label it as such. Then replicate it — if you get the same result again, then you can have some confidence in your result.

    • Neuroskeptic

      Good points; replication is important but by itself it wouldn’t solve the problem, only kick the can down the road somewhat.

      Which results are most promising to try to replicate? And which replications can we trust not to be, themselves, cherry-picked? If questionable means can give you a result, they can give you a result twice (it would take longer, but it would be quite possible).

      This is why as well as replication, we need preregistration.

      • Dave Nussbaum

        I think there are a couple of ways to look at this problem — external vs. internal replication (and an optimistic vs. a cynical view).

        The external replication point, which you raise in your response, is right — if people will publish exploratory results as confirmatory, then we will need a way to motivate other researchers to replicate those results and that’s a huge challenge. Pre-registration helps there.

        However, I was thinking more along the lines of internal replication — that is, the researcher themselves replicating the significant sixth experiment (after five failed ones) rather than jumping to the conclusion that it must be right. In other words, it’s incumbent upon the researcher themselves to replicate in order not to be fooled into believing what they want to believe. Pre-registration can help here also, but less — it helps by highlighting the fact that this was the sixth attempt to get the same outcome with (slightly) different experiments. But you could have pre-registered all six experiments successively, and if you truly did hone the experiment to the point that it tested what you wanted it to and the result was significant on the sixth try, pre-registration or not, it should still be replicated.

        This brings me to a final note, on the cynical vs. the optimistic view. Your trolley dilemma illustrates, at least implicitly, that people do not want to cheat, but that they can deceive themselves into poor practices under the right circumstances. The cynical view is that people will push the envelope as far as they can get away with — and that may be true (at least of some subset of researchers). The optimistic view, which I cling to, is that researchers are truly out to discover the truth, and that to the extent that they make errors (or cheat), they are errors and cheating that they do not recognize as errors (again, sometimes self-deceptively).

        If you buy into the optimistic view, then one of the most important things we can be doing is educating people about the reality of poor practices, as you do by explaining the problem with running multiple studies and only reporting the significant ones. The more people know, the harder it is to self-deceive. And if people are well-intentioned, then eliminating self-deception should improve research practices. Anecdotally, I know this is happening for many people — for example determining sample sizes in advance is something that I know many people have started doing in response to the “False positives” paper.

        Perhaps the optimistic view is naive, or insufficient, which would suggest some sort of hybrid solution. I’m not one to quote Ronald Reagan very often, but as he said, maybe we should “trust but verify.”

        • andrew oh-willeke

          The point about internal pre-replication is a good one. But, often, a scientist will simply not have the funding on his or her own to replicate all of his or her promising results in house – particularly if the duration of the study is long or the subjects are hard to locate (e.g. a study of how transgender individuals feel about gender reassignment surgery twenty years later).

      • Wouter

        I was wondering, what’s your point of view on pilot studies? I think you’d agree that pilot studies are useful for effect size estimation and sample size calculation. However, quite often one can cherry-pick the effects (s)he wants to pilot. (S)he can abort the pilot and put it in the file drawer if nothing comes out. (S)he can tweak the experimental settings to increase the effect size. Up until the point where the researcher is confident that the actual experiment will end up with the hypothesized results, and then preregisters the experiment, et voilà: a valid preregistered study with nice results.

        Good practice, questionable practice, bad practice?

        • Neuroskeptic

          Good question. I think before doing a pilot, you should decide whether or not the results might be publishable.

          If so, it should be preregistered (it could be minimal). Otherwise, as you say, there is bias because you might only reveal positive pilots.

          If you decide the pilot is of the kind (e.g. a technical dry-run) that won’t give publishable results, then don’t register it. Of course you could later decide to publish anyway but readers will see that it was unregistered and hence they’ll know that the results might be unrepresentative.

      • andrew oh-willeke

        Universal preregistration can be quite burdensome. This is particularly so because a study that is, for example, conceived to be about age and cognition may, when the results are in, turn out to tell you nothing about age and cognition but a lot about gender and cognition, or culture and cognition (e.g. lots of game theory studies have turned out that way).
        But replication is such a well-defined and exact kind of activity that pre-registering replication attempts can be done very fruitfully. It furthers the preregistration agenda, and it lets scholars weigh the risk of being, say, the fourth person to attempt a replication that two other concurrent studies are already working on, which really isn’t worth the effort.
        If the industry standard becomes (1) trust nothing that isn’t independently replicated as science, and (2) preregister all replication efforts, you get a lot of safeguards with a minimum of preregistration red tape on relatively unstructured or flexible preliminary research.

    • Zachary Stansfield

      Well, also recognize that to some extent this is what a good scientist does in many areas. If it didn’t work the first time, then tweak the experiment to fix the problem!

      This becomes bad practice when the “tweaking” phase remains indistinct from the “confirmation” phase. Once everything is working well, a good scientist needs to step back and conduct a proper and thorough test of the hypothesis, possibly followed by multiple independent replications (sans additional tweaks), all of which should be published.

      But it’s just so much easier to collect lots of data and throw out the “bad” stuff which doesn’t agree with what you are looking for. It’s also easy to delude yourself about how poor this practice really is.

      • Dave Nussbaum

        That’s exactly it.

  • Steve Curtis

    I am skeptical of the neuroskeptic’s skepticism, and Dave hit on it. What competent scientist stops after ONE good result? That would be easily as immoral as fabricating data, and the scenario feels like a straw man. One linchpin of science is replicating the results you expect to see. If a result can’t be replicated, it means nothing.

  • Buddy199

    Both are lying. One is lying to the public. The other is lying to yourself and to the public. All results and methods should be reported; let the chips fall where they may. Science always gives the correct answer. If you don’t understand it or don’t like it, you’re asking the wrong question.

  • highly_adequate

    I frankly don’t think that cherry picking data is half the problem that it is made out to be. If one’s mark of statistical significance is .05, then on average one would have to conduct 20 experiments before one found one significant, assuming the null hypothesis were actually true. I simply doubt that any scientist does the same experiment over 20 times. Now one might argue that instead it’s publication bias — that 20 scientists conduct the experiment, but only 1 gets the significant result, and they are the ones who publish. But is it really plausible that there are 19 scientists in the same narrow field, all of whom have conducted the experiment in question, have failed to replicate, and simply never communicate amongst each other their failures? I find that implausible. Moreover, in many of these narrow fields there are so few researchers that I don’t see how one could put together 19 times as many who might have failed for any one who has published significant results.
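    The arithmetic behind this can be checked directly. A quick sketch (only the alpha = 0.05 threshold is taken from the comment) confirms that the expected number of runs is 20, though the chance of at least one false positive already passes 50% at around 14 runs:

```python
ALPHA = 0.05

# Probability of at least one "significant" result in k runs of an
# experiment whose null hypothesis is true: 1 - (1 - alpha)^k.
for k in (1, 5, 10, 14, 20):
    print(k, round(1 - (1 - ALPHA) ** k, 2))  # k = 20 gives about 0.64
```

    So even 20 repetitions give only about a 64% chance of a hit, which supports the commenter’s doubt that straightforward repeat-until-significant cherry-picking is the whole story.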

    And the most troubling of all are the many cases in which the researchers who find significant results find them many times over in additional experiments. There simply can’t be any explanation of such results by means of cherry picking.

    And yet this last category includes the best-known names in a given field, because it is they who have established and repeatedly demonstrated the supposed phenomenon, and thereby “own” the category.

    So when someone declares he has found a certain phenomenon repeatedly, and others have great difficulty replicating it, there is probably no benign explanation for it. The most charitable explanation most likely lies in a serious defect in the experimental methodology, which allows distorted results to turn up that have no basis in reality.

    The less charitable interpretation is, of course, some kind of conscious fraud.

    • Neuroskeptic

      Few researchers would actually collect 20 sets of data and pick the ‘best’ one, but the trouble is that it’s possible to get 20 different p-values from the very same set of (sufficiently rich) data: if you have a few variables from a few groups, you can correlate the variables and compare the groups in a range of different ways.

      In my post I used the most clear-cut example of cherry-picking, which is probably quite rare, but there’s a worryingly large range of questionable practices that end up being equivalent.
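      A rough simulation of this point (pure noise, tested with a simple stdlib permutation test; all the setup here is illustrative) shows how several unrelated measures in one dataset multiply the chances of a spurious hit:

```python
import random

random.seed(0)  # reproducible sketch

def permutation_p(a, b, n_perm=100):
    # Two-sided permutation test on the difference of group means.
    obs = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= obs:
            hits += 1
    return hits / n_perm

N_VARS, N_DATASETS, N = 5, 300, 20
datasets_with_a_hit = 0
for _ in range(N_DATASETS):
    # Five unrelated noise variables measured in the same two groups:
    # every null hypothesis is true, yet each variable is one more
    # chance at p < 0.05.
    found = False
    for _ in range(N_VARS):
        a = [random.gauss(0, 1) for _ in range(N)]
        b = [random.gauss(0, 1) for _ in range(N)]
        if permutation_p(a, b) < 0.05:
            found = True
    datasets_with_a_hit += found
rate = datasets_with_a_hit / N_DATASETS
print(rate)  # typically near 1 - 0.95**5, i.e. roughly 0.23
```

      With five independent tests per dataset, the family-wise false positive rate climbs from 5% toward roughly 23%, without any single test being abused.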

      • Dave Nussbaum

        If you factor in questionable methods that increase researcher degrees of freedom — or “p-hacking” as it has come to be known — then you also don’t need to run 20 experiments before you get a significant result, three or four may be plenty.

    • andrew oh-willeke

      What happens a lot is that a statistically significant result appears in a small sample study of a potential treatment for some intractable mental health condition (e.g. autism or schizophrenia), which is one of scores of treatment modalities being examined in similarly small samples for the same condition. Then, the paper and the associated PR from the study claim that the good outcome from the treatment approach is “statistically significant” before it has been replicated even once.
      If you have a hundred different schizophrenia treatment methods being explored at any one time in one-year studies, you are going to get five false positives a year with p=0.05 even if none of them actually work. But the problem would largely vanish if papers and PR instead used terminology like “statistically significant if independently replicated,” and reserved language that seems to imply scientific confirmation until a result had been replicated (replication should rule out 95% of false positives while ruling out very few true positives, so usually only two data sets would be needed to be sure).
      Indeed, it might be appropriate to have two different journals: one called “The Journal of Promising Results” for un-replicated scholarship, and “The Journal of Psychiatry” for replicated scholarship. This would also strongly encourage people to devote time and energy to doing replications, because it would be the first independent replication that got the top scientific publication billing. The policy would also spare big-name journals and institutions the reputation-tarnishing, after-the-fact revelations that follow big claims that don’t pan out.
      A conclusion about alien biology, or a cure that doesn’t pan out, or cold fusion looks far more benign in “The Journal of Promising Results.” Notable, but not making a definitive claim of truth.
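      The back-of-envelope numbers in this comment are easy to make explicit (alpha = 0.05 is the only figure carried over; everything else is arithmetic):

```python
ALPHA = 0.05

# 100 treatments under study, none of which actually works:
expected_false_positives = 100 * ALPHA
print(expected_false_positives)  # 5.0 spurious "significant" results a year

# A null effect must reach significance twice to survive one
# independent replication:
chance_null_passes_twice = ALPHA ** 2
print(round(chance_null_passes_twice, 4))  # 0.0025
```

      A single pre-registered replication cuts the false positive rate from 1 in 20 to 1 in 400, which is the sense in which it “rules out 95% of false positives.”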

  • John McIntire

    The “file-drawer” effect is certainly a problem and recognized as such within at least some areas of science. There are even statistical remedies to correct for the estimated size of this effect when doing meta-analysis and reviews and such.

    If you have never seen the video “dance of the p-values,” you are missing out.

    • Wouter

      I loved that video. Thanks for sharing.

  • John McIntire

    Oh, forgot to mention: very interesting post! Agree about the relation to the Trolley Problem, and strongly agree about pre-registration!

  • Terrell Gibbs

    I think there’s a little more to it than that. The “drop data” lever is one that we pull all the time, for perfectly good reasons. There’s nothing wrong with dropping data—if you do it for perfectly good reasons that are statistically independent of the result you are measuring. Ideally, the decision to drop the data should be made before you see the data. “I put the wrong solution in the tube” is an excellent reason to drop the data. Second best is to use a criterion that is statistically independent of whatever you are measuring. But this is tricky. I once was on the point of publication when I realized that a data exclusion criterion that I thought was statistically independent really wasn’t (the data being excluded and the exclusion criterion were both being normalized to the same control), and I had to put all the excluded data back in.

    So it is easy to fool yourself. “Now that I see the data, I’m pretty sure that I put the wrong solution into that tube. I surely would have excluded that data point if I’d remembered to note it down.” I guess this is a bit like pulling the switch tracks lever when you are “pretty sure” that nobody is standing on the other track…

  • andrew oh-willeke

    It is worth noting at a blog like this one that one of the most famous cases of probably faked data (because it was just too perfect to have any meaningful chance of actually being random) was that of Gregor Mendel and his peas.

  • David Marcos

    Great article BTW!!! But what if it’s a school paper, and the information in it was cherry-picked? Would the person who compiled the information then be at fault for the grade you received?

  • PeteInBarrie

    I see attempts at cherry-picking a lot in my line of work (advertising and marketing research). If you want research, let the data stand. If you want a PR puff-piece, knock yourself out!



  • John Kraemer

    How did you manage to write this without mentioning Marc Hauser?


  • chatpaltam o

    happens all the time when scientists try to disprove the bible
    pick pick pick..

  • David Bump

    The worst kind of “cherry picking” omission is the data that never even comes up, because “everybody knows” there’s no use in studying THAT, or “the scientific consensus” says THIS must be true so THAT is just plain IMPOSSIBLE. The history of science is rife with things that came to light anyway, because somebody looked despite all of that and there was lots of good hard data to prove it. However, we must wonder how much delay there was before some maverick came along, and we can observe that even after they presented their data there was a delay as the old order resisted. Finally, we have to wonder what we’re missing now because people aren’t even thinking about investigating certain possibilities. Just recently we’ve seen important new information coming to light as researchers looked at “junk DNA” and found it had functions people hadn’t even thought of. Another area is the discovery of organic material in fossils that hadn’t been looked at, because it was assumed nothing organic could have survived so long, or so well.
    Here are some samples…
    “For centuries sailors have been telling stories of encountering monstrous ocean waves which tower over one hundred feet in the air and toss ships about like corks. Historically oceanographers have discounted these reports as tall tales– the embellished stories of mariners with too much time at sea.”
    “Despite nearly a century of anecdotal reports from airline pilots, most scientists didn’t really believe in sprites until about ten years ago” (10 years before 2001)
    “J Harlen Bretz endured decades of scorn as the laughingstock of the geology world. His crime was to insist that enormous amounts of evidence showed that, in Eastern Washington state, the ‘scabland’ desert landscape had endured an ancient catastrophe: a flood of staggering proportions. This was outright heresy, since the geology community of the time had a dogmatic belief in a ‘uniformitarian’ position, where all changes must take place slowly and incrementally over vast time scales. Bretz’s ideas were entirely vindicated by the 1950s. Quote: ‘All my enemies are dead, so I have no one to gloat over.’”

    Now, consider areas of science such as paleontology, where the data points are widely scattered, and the actual pattern of bunching and disappearing is largely ignored because “everyone knows” that there must have been OTHER data points “something like” (i.e. “sister groups”) at DIFFERENT locations on the charts that weren’t preserved, but formed continuous lines … because “everyone knows” that they must have — indeed to suggest anything more than relatively slight variations is considered a threat to the very nature of science, or at least a departure from science which cannot be considered or tolerated.

  • andrewkewley

    Unfortunately, omission of data and not following the planned protocol are the norm and not the exception. Wish I was joking…




About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.

