Why a new case of misconduct in psychology heralds interesting times for the field

By Ed Yong | June 26, 2012 6:04 am

[Update: The mysterious avenger has been revealed as Uri Simonsohn. He is one of the co-authors on the Simmons paper that I wrote about below.]

Social psychology is not having the best time of it. After last year’s scandal in which rising star Diederik Stapel was found guilty of scientific fraud, Dirk Smeesters from Erasmus University is also facing charges of misconduct. Here’s Ivan Oransky, writing in Retraction Watch:

“According to an Erasmus press release, a scientific integrity committee found that the results in two of Smeesters’ papers were statistically highly unlikely. Smeesters could not produce the raw data behind the findings, and told the committee that he cherry-picked the data to produce a statistically significant result. Those two papers are being retracted, and the university accepted Smeesters’ resignation on June 21.”

The notable thing about this particular instance of misconduct is that it wasn’t uncovered by internal whistleblowers, as were psychology’s three big fraud cases – Diederik Stapel (exposed in 2011), Marc Hauser (2010) and Karen Ruggiero (2001). Instead, Smeesters was found out because someone external did some data-sleuthing and deemed one of his papers “too good to be true”. Reporting for ScienceInsider, Martin Enserink has more details:

“The whistleblower contacted Smeesters himself last year, the report says; Smeesters sent him a data file, which didn’t convince his accuser…. In its report sent to ScienceInsider, the whistleblower’s name is redacted, as are most details about his method and names of Smeesters’s collaborators and others who were involved. (Even the panel members’ names are blacked out, but a university spokesperson says that was a mistake.) The whistleblower, a U.S. scientist, used a new and unpublished statistical method to search for suspicious patterns in the data, the spokesperson says, and agreed to share details about it provided that the method and his identity remain under wraps.”

This might seem like a trivial difference, but I don’t think it could be more important. If you can root out misconduct in this way, through the simple application of a statistical method, we’re likely to see many more such cases.

Greg Francis from Purdue University has already published three analyses of previous papers (with more to follow), in which he used statistical techniques to show that published results were too good to be true. His test looks for an overabundance of positive results given the nature of the experiments – a sign that researchers have deliberately omitted negative results that didn’t support their conclusion, or massaged their data in a way that produces positive results. When I spoke to Francis about an earlier story, he told me: “For the field in general, if somebody just gives me a study and says here’s a result, I’m inclined to believe that it might be contaminated by publication bias.”
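
The logic of such a test can be sketched in miniature: if each study in a set has estimated power w, the probability that all n independent studies come out statistically significant is the product of those powers, and a very small product flags the set as suspiciously clean. Here is a rough illustration in Python; the function, the example numbers and the 0.1 threshold are my own simplification for exposition, not Francis’s actual procedure.

```python
from math import prod

def excess_significance(powers, threshold=0.1):
    """Chance that every study in a set comes out significant,
    given each study's estimated power; a value below the
    threshold marks the set as 'too good to be true'."""
    p_all = prod(powers)
    return p_all, p_all < threshold

# Ten studies, each with power 0.5, all reported as significant:
p_all, flagged = excess_significance([0.5] * 10)
# p_all = 0.5**10, i.e. under 0.001 -- an unbiased literature
# would rarely deliver ten significant results in ten attempts
```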

Francis has reason to be suspicious, because such behaviour is surprisingly common. This is another notable point about the Smeesters case. He didn’t fabricate data entirely in the way that Stapel did. As one of his co-authors writes, “Unlike Stapel, Dirk actually ran studies.” Instead, he was busted for behaviour that many of his peers wouldn’t consider to be that unusual. He even says as much. Again, from Enserink’s report:

“According to the report, Smeesters said this type of massaging was nothing out of the ordinary. He “repeatedly indicates that the culture in his field and his department is such that he does not feel personally responsible, and is convinced that in the area of marketing and (to a lesser extent) social psychology, many consciously leave out data to reach significance without saying so.”

He’s not wrong. Here’s what I wrote about this in my feature on psychology’s bias and replication problems for Nature:

“[Joseph Simmons] recently published a tongue-in-cheek paper in Psychological Science ‘showing’ that listening to the song When I’m Sixty-four by the Beatles can actually reduce a listener’s age by 1.5 years [7]. Simmons designed the experiments to show how “unacceptably easy” it can be to find statistically significant results to support a hypothesis. Many psychologists make on-the-fly decisions about key aspects of their studies, including how many volunteers to recruit, which variables to measure and how to analyse the results. These choices could be innocently made, but they give researchers the freedom to torture experiments and data until they produce positive results. [Note: one of the co-authors behind this study, Uri Simonsohn, has now been revealed as the whistleblower in the Smeesters case – Ed, 28/07/12, 1400 GMT]

In a survey of more than 2,000 psychologists, Leslie John, a consumer psychologist from Harvard Business School in Boston, Massachusetts, showed that more than 50% had waited to decide whether to collect more data until they had checked the significance of their results, thereby allowing them to hold out until positive results materialize. More than 40% had selectively reported studies that “worked” [8]. On average, most respondents felt that these practices were defensible. “Many people continue to use these approaches because that is how they were taught,” says Brent Roberts, a psychologist at the University of Illinois at Urbana–Champaign.”

I look at the Smeesters case and wonder if it’s just the first flake of the avalanche. If psychologists are developing the methodological tools to root out poor practices that are reportedly commonplace, and if it is clear that such behaviour is worthy of retraction and resignation, there may be very interesting times ahead.


Comments (26)

  1. Tony Mach

    “If you can root out misconduct in this way, through the simple application of a statistical method, we’re likely to see many more such cases.”

    You should read Steve McIntyre’s blog – but of course he’s been publishing his statistical analysis for over 10 years now and (almost) nobody gives a $#*&.

  2. A. Bikh-I-Ni

    The report about this alleged misconduct remains shaky, to say the least, and may well be nothing more than a settling of scores between rancorous individuals, or some kind of witch hunt. No details are given about what this ‘massaging’ of data means, and it is suggested that performing data manipulation equals scientific misconduct. Not only is that misleading, it is simply untrue. For example, a log-log transformation is an accepted technique applied to raw data to make certain differences come out better.

    It is also suggested that the fact that the researcher’s raw data are no longer available is part of some conspiracy, or proof of his misconduct. That too is misleading and simply untrue. Scientific journals require raw data to be available for only 5 years. The article does not mention that, and simply adopts the implicit suggestion that such data should be available indefinitely, which is quite absurd. Researchers change employers, computers do crash, and sometimes not everything is backed up, so none of what the researcher has said in that context is proof of any wrongdoing.

    Thirdly, as far as the information given here goes, the inconsistencies or even likely impossibility of certain statistical outcomes may well simply be the result of poor research. If you choose the wrong statistical procedures (as simple as assuming that data are normally distributed when in fact they are not), you can arrive at wrong results. There is quite a difference between poor research and research misconduct. If the whistleblower is correct in his/her suggestion that the outcomes are likely wrong, then all it says is that the peer reviewers of the journals in which the studies appeared either did not do their job or simply failed to notice the discrepancies. However, publicly humiliating a researcher and destroying his scientific career on the basis of such shaky suggestions is in itself professional misconduct.

  3. Faye

    used a new and unpublished statistical method to search for suspicious patterns in the data,

    Does anyone else find this detail a tad ironic?

    Is this person essentially searching for Type I error?

    “Massaging data” is such a vague term I don’t even know what to say. Does that mean omitting some data? There are many legit reasons to do that.

  4. KAL

    @A. Bikh-I-Ni Says

    You make some interesting points, but appear to have missed the following sentence – “[Smeesters] told the committee that he cherry-picked the data to produce a statistically significant result.” That’s not poor peer review or poor science – although it doesn’t preclude them – that is deliberate.

    Of course a follow-up, which has probably been done before, might be the why behind the purportedly common manipulation of data. It doesn’t excuse it, but it would provide context.

  5. Hi Ed, I appreciate the attention to this issue, it’s an important one, but I think you’re getting a little ahead of yourself. A few important points to consider:

    1. The notion that his colleagues wouldn’t consider what Smeesters did unusual is a huge leap. I think there is considerable evidence that there are major shortcomings in many researchers’ methods, and this is a big problem. But there’s a very big difference between that and intentionally messing with your data to produce the intended result. Simmons et al. go to great pains to emphasize that many of the problematic behaviors in question occur with the best of intentions. That doesn’t make them unproblematic — this is an issue the field needs to address, and I believe has begun to. But that’s a world apart from Smeesters.

    2. Don’t jump the gun on Greg Francis. His reviews are highly problematic — they essentially engage in exactly the same kind of cherry picking he is trying to root out. If you review the statistics to hundreds of papers then inevitably some of them will appear unusual, the same way if you flip a coin hundreds of times, sometimes you will get unusual streaks like 6 tails in a row. He’s essentially flipping a coin hundreds of times and crying foul every time he gets 6 tails in a row. Uri Simonsohn will be publishing something to this effect in the next little while.

    So is there a problem? Yes. Is the Smeesters case a sign of something that’s prevalent in the field? I don’t know, but you’re certainly not in a position to say that it is at this point — and certainly not using a quote from the offending party saying that everyone else is doing it.

  6. That’s probably fair, Dave. Thanks for commenting as always.

    1) The bit about practices being common is a reflection of Leslie John’s paper that I referenced later on which, to my knowledge, does suggest a reasonably high prevalence of statistical fiddling.

    2) I mentioned Greg Francis not to say that he’s absolutely right, but more as an example that some people are interested in doing this sort of statistical post-publication watchdogging. A version of Simonsohn’s critique of Francis is online for those who are interested: http://opim.wharton.upenn.edu/~uws/papers/it_does_not_follow.pdf

  7. Daniel Simons also has some interesting thoughts on this issue over at Google+ https://plus.google.com/u/0/107191542129310486499/posts/Qhrn5nP82tz

  8. Eskimo

    Ed, you picked up on something very interesting. While cleaning up research that has been using questionable practices is appealing, what’s disturbing to me is deciding how to deploy these data analysis/review tools. How does someone decide which papers to look at first, and insulate that decision from personal biases?

  9. Justin Tungate

    So, why not make this new top secret statistical method the standard BEFORE publication?

  10. Yeah, there’s definitely a lack of transparency here that’s troubling. I can understand why the whistleblower might want to remain anonymous but keeping the actual method under wraps is unfortunate.

    @Eskimo – this is a good point. One might also ask whether the ends justify the means? Obviously, it’s not great if people are pursuing some sort of personal vendetta but how much does that matter if the application of data-sleuthing roots out actual misconduct? I don’t know.

  11. Me

    If you reveal the method you render it useless. People will learn how to game it.

  12. @Eskimo In my investigations I try to pick papers that I think people will care about. A psychology paper published in a top journal gets a lot of attention, so that seems like a good place to start. Of course, if an individual happens to care a lot about a specific topic, they should apply the investigation there as well. In general, the investigation could be applied to all findings, but the reality is that for findings that few people care about it may not be worth the effort.

    A lot of scientific investigation is based on personal vendettas, so I’m sure that will happen. As Ed noted, if the analysis is valid, then I guess that is just what we have to live with. One thing I’ve tried to stress is that the problems are likely endemic, so we should not be too harsh on researchers when evidence of questionable practices is found. In the long run, teaching researchers how to properly consider their empirical findings is the only way to solve these problems.

    Like others, I am very curious to learn how this new kind of analysis works.

  13. @Greg Francis, in your back-and-forth with Balcetis and Dunning, they suggested that your string of reports of publication bias might itself be a product of selective sampling or selective reporting. In your rejoinder you said that you were not trying to draw generalizations from your analyses and that each report should be read in isolation as a case study. In your comment above, you again state that your choice of papers to analyze is nonrepresentative and based on your personal judgment.

    But in the main article above you are quoted drawing a generalization about the prevalence of publication bias (“For the field in general…”), and you do so again in your comment (“the problems are likely endemic”).

    To be fair, you haven’t come out and said that your generalizations are based on the statistical analyses of publication bias discussed in the main article. But the juxtaposition, and the context in which Ed Yong quotes you, is likely to lead people to see a connection. So can you clarify what evidence you are using to make your generalizations?

    (I posted a similar question elsewhere but I’m not sure if you saw it there.)

  14. @Ed Yong:

    On (1) I appreciate that you were discussing the John paper — and I think it’s worth discussing — I just think it’s a matter of not lumping everything together and calling it the same thing.

    In the John paper, over 50% of researchers report that they have failed to report all of a study’s dependent measures. It would be better if they did, but this is basically scientific jaywalking in a lot of instances. The researcher has nothing to hide, although in the absence of space constraints, they should be fully transparent whenever they can be. It’s not necessarily a harmless offense, and it should be eliminated, but it hardly shakes the edifices of the science. When it comes to falsifying data, the rate is much, much, much lower.

    I think this is important on two grounds. One is that we obviously don’t want to paint everyone with one brush when there are people who intentionally falsify data and others who inadvertently engage in practices that should be improved.

    But perhaps more importantly, it is vital that the field improve its shortcomings. The projects you’ve written about like Simmons et al., John et al., Brian Nosek’s replication project, psych file drawer, etc., are very important and need to be taken seriously. When we start accusing people and the field indiscriminately, I believe that it makes these projects less likely, and it raises resistance to them. For example, I believe that Simmons et al.’s recommendations should be adopted by the journals as part of the review process, but this isn’t likely if the project is perceived as a witch hunt. I think Simmons et al. have done everything they can to create the opposite sentiment, but there is only so much that is under their control.

    I’m worried that our eagerness to catch wrong doers can end up doing more harm than good to efforts to effectively reform the field. For example, I worry about the Francis efforts for precisely this reason. If Francis is running statistical analyses on more than just the papers he is reporting to be problematic, but not reporting that fact, then his analyses are flawed, and ironically so. The Simonsohn paper you link to makes this case very well.

    @Greg Francis @Eskimo:

    I think a systematic approach needs to be taken to selecting articles to review; they can’t be chosen based on popularity or any other such criterion.

    If it’s impossible to review all papers, then we should pick a subset randomly. We should be transparent about which papers have been reviewed and apply the appropriate corrections for having run multiple tests. We should certainly not be running tests on multiple studies then reporting our results as if we had only run tests on one.

    Alternatively, it would also be appropriate to test a paper if we have an a priori hypothesis that it may be flawed. In that case, we should run tests on that paper alone.

    While, as I’ve said, I think efforts to reform the field and to improve research practices are important, it’s also important to avoid false positives in our search for misconduct. Incorrectly identifying papers as flawed not only carries great costs for the individuals and research directly affected, but it also hurts the broader effort to reform the field by undermining the credibility of other reform efforts.

  15. rob stowell

    Of course it’s not just psychology-

  16. SP

    Very scary that some commentators here appear to be defending these practices. The issues regarding the secret nature of the methodology and its possible selective application are a clear irrelevance. Here the misconduct is admitted; the statistical methodology is not the proof or even evidence of misconduct – the admission is the evidence and the proof. If papers are bad they need to be eliminated from the literature. Which papers are eliminated first is not important.

    Ed is not tainting a whole field or lumping everything together, he is raising very legitimate concerns. The existence of three high profile retractions in recent years should be enough to raise concerns, but in fact there seems to be more evidence than this of widespread poor practice. This is a very serious concern. Bad science is not science at all.

  17. MClean

    @Greg Francis, apparently you are not the anonymous whistleblower? Actually, you would be my first guess in light of your recent stream of papers…you might consider that as a compliment 😉

  18. Daniel

    Tomorrow they will share the method and the name of the researcher who developed it. http://www.erasmusmagazine.nl/nieuws/detail/article/5056

  19. I’ve been trying to reconstruct Simonsohn’s method from the scanty details at Erasmus university’s press release (and the now uncensored copy of the report of their committee on scientific integrity). So I may well be quite wrong … but it looks interesting.


  20. Bayesian Bouffant, FCD

    I read the “What others say” box in the right column of your blog. I suspect those quotes may have been specially selected. Ed Yong caught cherry picking his data!

  21. @Sanjay: I’ve been traveling, so I did not see your comment. I’ll address your concerns in reverse order.

    My opinion about the problems with bias being endemic is not based on any statistical analysis but on observations about the field. I’ve talked with a lot of people about these issues, and almost everyone admits to doing something like data peeking or optional stopping (I’ve been guilty of it myself). Moreover, I think given the current attitude of journals toward publishing null findings, many people have (perhaps unwillingly) had some relevant findings put in a file drawer. In my papers I raise these points, in part, because I do not think the field should too harshly judge the authors of papers with publication bias. Those authors are not practicing science much different from the rest of us. We all need to improve.

    Regarding the possibility that my publication bias studies themselves demonstrate publication bias: it is a big part of the critique by Simonsohn, and it’s a curious criticism. In one sense it is true. There are studies I have looked at that do not show publication bias (this, fortunately, includes my own work). I have not published those analyses (nor could I, I suspect).

    However, Simonsohn’s implication is false. The implication Simonsohn makes (which reflects what Balcetis & Dunning tried to say as well) is that my analyses are invalid because there is bias. But this is only a valid criticism if one tries to infer from my findings a statement about bias across the field in general (e.g., to say that X% of findings in psychology are biased). I do not make such an inference, and I have cautioned others not to make this inference.

    In thinking about these issues, I realized that not all publication biases are necessarily bad. For example, the field clearly does believe that significant effects are more important than non-significant effects. That will tend to lead to a bias toward reporting significant findings more often than non-significant findings. By itself this is a kind of bias, but it’s a (mostly) harmless one that simply defines what topics people investigate and report. If I look into the relationship between afterimages and schizophrenia and I find zip, I can validly choose to not publish those findings. There is a bit of harm here in that people do not learn about this null finding, but you cannot have publication bias if you do not publish.

    What is not harmless is to selectively report significant findings that are all related to the same topic. If I investigate the relationship between afterimages and schizophrenia and get five experiments that reject the null and 4 experiments that do not, it is improper for me to publish one set and not the other. That presents a mischaracterization of a phenomenon.

    My analyses of publication bias are not a mischaracterization of a phenomenon. There is one investigation, with one conclusion, based on one set of findings that were identified by the original authors as contributing to their argument.

    Simonsohn points out that using a criterion of 0.1 for Type I error means that out of 10 such tests, there is a 65% chance of making at least one Type I error. That’s certainly true, but it’s just part of how we make decisions under uncertainty. 65% sounds a lot bigger than 10%, but these are really just two ways of saying the same thing. We can control the frequency of making Type I errors, but that does not mean we will never make them. Note that the 65% figure requires that the null really be true (otherwise it was not a Type I error). Usually we do not know whether the null (no bias) is true or not. In the Balcetis & Dunning case, their reply suggests that their report was biased.
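
    The 65% figure is simply the complement of ten clean outcomes at the 0.1 criterion, as a quick check shows:

    ```python
    alpha = 0.1    # criterion for each individual bias test
    n_tests = 10
    # chance of at least one false positive across 10 independent
    # tests, assuming the null (no bias) is true in every case
    family_wise = 1 - (1 - alpha) ** n_tests
    print(round(family_wise, 3))   # 0.651, i.e. about 65%
    ```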

    Simonsohn has a section header titled “p-values are meaningful only if replications or file-drawers are reported”, which is total nonsense. Nothing in the calculation of the p-value requires a replication. It also is completely unaffected by other independent experiments. A single experiment may not convince you about the validity of the result, but the p-value is defined by the data in that experiment, and not by data in other experiments. Even in a biased set of experiments, the p-value of an individual experiment is what it is (as long as the experiment was run properly). When bias in a set of experiments is found, what is in doubt is the accumulated evidence across the set of experiments, which should be more definitive than any single experiment from the set.

    Any decision making process under uncertainty will sometimes make Type I errors. The best we can do is set a rate for making such errors, but we will never know whether or not any specific finding is a Type I error or a correct rejection of the null. To argue otherwise, which Simonsohn somewhat does, is to suggest that we should never make decisions at all. The extension of his view is that we should throw out essentially all findings that are based on hypothesis testing. His position is a radical one, and I think almost everyone in psychology would reject it.

    To look at it another way, if a set of findings really does _not_ have publication bias but appears to have publication bias, then I think the prudent thing for a scientist to do is to consider the set of findings to be biased. That’s what we do all the time with our experimental studies. We never know the TRUTH, but we make decisions based on the empirical evidence. It’s inherent in the stochastic nature of the topics we investigate.

    Simonsohn and I had a series of conversations on these issues. I raised these (and other) points, and he just ignored them. Simonsohn has done some good work, so I’m rather puzzled (and slightly embarrassed for him) that he does not seem to understand these basic issues of hypothesis testing.

  22. @Greg Francis

    I think you’re missing the point of the Simonsohn critique, but maybe I’m wrong, so perhaps you could explain it to me. Imagine the following example: I gather up coins from a large set of currencies from around the world, let’s say 100 in all. I flip each coin 6 times to see if any of the coins is biased. Note that I’m not testing to see whether coins in general are biased, only whether any particular coin is biased.

    Now let’s say I find that when I flip a US nickel, I flip 6 tails in a row. It seems to me that you’re suggesting that I could publish that result as a finding, independent of the other 99 coins that I flipped. If I ignore the 99 other tests, then yes, there is a very low likelihood of flipping tails 6 times in a row (less than 5%). But when you consider that you’ve tested 99 other coins, the chance of getting 6 tails in a row somewhere is actually much, much higher. The US nickel probably isn’t biased, it just happened to land on tails 6 times by chance.

    To me this is the problem with the approach you’re taking. If you’re testing multiple papers but reporting only on the “unusual” ones — the ones that appear to be biased, then that’s the same thing as saying the nickel is biased. You need to correct for all the other tests that you did, you can’t pretend that the test was independent when in reality it was not. And note that at no point are we making statements about coins in general.
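
    The numbers behind this example are easy to verify, assuming fair coins and exactly six flips each:

    ```python
    p_streak = 0.5 ** 6                 # one coin: six tails in a row
    p_any = 1 - (1 - p_streak) ** 100   # at least one of 100 fair coins
    print(round(p_streak, 4))   # 0.0156 -- 'significant' in isolation
    print(round(p_any, 2))      # 0.79  -- but very likely given 100 tries
    ```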

    So maybe I’m missing something that you can explain to me, but as far as I can tell the analyses you’re doing don’t actually say what you’re saying they say. That’s not an unproblematic point, since it would mean that you’re not only unfairly singling out people who don’t deserve to be singled out, but you may also be undermining other important critiques by making the whole reform enterprise appear problematic.

  23. @ Dave Nussbaum

    The case you describe will make a Type I error with a probability of 0.5^6=0.015625. If you make decisions in an uncertain environment you will sometimes make Type I errors. The only way to avoid this is to remove uncertainty or to not make decisions.

    You can control the rate of making Type I errors by adjusting the criterion for a decision, but the method I use is already quite conservative. If there really is no bias, then a set of experiments similar to the Balcetis & Dunning results would appear to show bias with a probability of 0.00895; for experiments like those in Piff et al, the results would appear to show bias with a probability of 0.01211.

    http://i-perception.perceptionweb.com/journal/I/volume/3/article/i0519ic (See “Francis response to author”)


    Consider it another way. If what you are saying is true, and I really have to consider other experiments that are being run, then when I run a study on afterimages and schizophrenia, I have to adjust my p-value in order to take into account the study run by my daughter in her statistics class (comparing GPA between athletes and non-athletes, p=0.15). But this is nonsense, and no one does it. It’s not required because the data sets are independent, and the experimental outcomes are unrelated. Maybe in some distant future there will be such complete theories of psychology that there is a theoretical connection between afterimages, schizophrenia, and athletes’ GPA. If my daughter and I were running these experiments to test some aspect of the theory, then I might have to worry about these issues. Until then, we are fine to treat each case independently.

    By the way, you do make a statement about coins in general. You finished the second paragraph with, “The US nickel probably isn’t biased, it just happened to land on tails 6 times by chance.” That’s a statement about prior probabilities, which is about coins (or at least US nickels) in general. That statement is influencing your interpretation of a single experiment. That’s a valid approach, if you use Bayesian methods. It plays no role in the calculation of the p value.

    Even though you were paraphrasing me, I disagree with your statement: “…the ones that appear to be biased, then that’s the same thing as saying the nickel is biased.” I would state it as, “the ones that appear to be biased, appear to be biased.” One should always keep in mind that the decision you make is itself stochastic and subject to sampling error.

    To test multiple experiments and report only the “unusual” ones is an accurate description of the field of psychology. Your proposed remedy is to adjust for multiple tests. But given the hundreds of thousands (millions?) of tests that are (or can be, do we have to do it retroactively?) run, what you advocate is essentially the halting of all experimental investigations in psychology. No one will ever be able to meet such a criterion. We don’t follow this advice because it is not necessary, and it also is not necessary for my analyses.

    If you really believe what you wrote about multiple tests, then I think you should retract essentially all of your experimental papers, because given the millions of statistical tests that have been run by thousands of scientists, I very much doubt that your p-values reach an appropriate level of significance. It’s been surprising to me that people worry about these issues when statistics are used to demonstrate that experimental findings are not valid, but they happily go along with the standard practice when using statistics to show significant effects.

    It’s also been surprising to me that some aspects of hypothesis testing have not been fully thought out. A year ago, I assumed that all of these issues had been settled decades ago. It’s more complicated than it appears.

  24. @Greg Francis

    Hi Greg, thanks for the response. I see where you’re coming from, and I have some sympathy for your position, but I still think it’s problematic because the lack of transparency does not let your audience get an accurate understanding of the test you’re conducting. As a result, we don’t know what the rate of false positives is likely to be, and these false positives are very costly to real people.

    Here’s an excerpt from a recent article that I think captures the concepts well:

    Even when “significance” is properly defined and P values are carefully calculated, statistical inference is plagued by many other problems. Chief among them is the “multiplicity” issue — the testing of many hypotheses simultaneously. When several drugs are tested at once, or a single drug is tested on several groups, chances of getting a statistically significant but false result rise rapidly.

    Experiments on altered gene activity in diseases may test 20,000 genes at once, for instance. Using a P value of .05, such studies could find 1,000 genes that appear to differ even if none are actually involved in the disease. Setting a higher threshold of statistical significance will eliminate some of those flukes, but only at the cost of eliminating truly changed genes from the list.

    In metabolic diseases such as diabetes, for example, many genes truly differ in activity, but the changes are so small that statistical tests will dismiss most as mere fluctuations. Of hundreds of genes that misbehave, standard stats might identify only one or two. Altering the threshold to nab 80 percent of the true culprits might produce a list of 13,000 genes — of which over 12,000 are actually innocent.
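
    The gene numbers in that excerpt follow directly from the definitions; the Bonferroni correction sketched below is the simplest standard remedy for multiplicity, not necessarily the one the excerpt’s author had in mind:

    ```python
    n_genes = 20_000
    alpha = 0.05
    # expected false positives if no gene is truly involved
    expected_false = n_genes * alpha     # 1000, as in the excerpt
    # Bonferroni: divide the per-test threshold by the number of
    # tests to cap the family-wise error rate at alpha
    per_test_alpha = alpha / n_genes     # 2.5e-06 per gene
    ```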

    My claim is that you are running multiple tests and not disclosing that fact, which makes you liable to get a very large number of false positives. These false positives are very costly to the researchers involved because an accusation of bias could hurt their careers, and we should be very careful before making these sorts of accusations. I understand that correcting for multiple tests in this way will lead to fewer significant findings, but it also means many fewer false accusations.

    Your defense, it appears to me, is that to properly correct for all the tests you’ve run, you would have to include all tests ever run by anybody. I don’t see why this should be the case – your only obligation is to be transparent about all the tests you’ve conducted. If you need to test 10 studies or 1000 studies before you’re able to find one that appears to have publication bias, that’s important information for your reader to have.

    I will certainly grant you that the field of psychology (among others) also doesn’t appropriately correct for multiple tests — this is exactly the file drawer problem you’re trying to address. I commend you for trying to address this problem, I just think that in trying to address it you’re committing the same sort of error as you’re trying to redress. My “worry” is not intended to protect papers that have publication bias from being found out, it is focused on making sure that we identify the right papers.

    Thanks again for the response.

  25. @Dave Nussbaum

    I understand your concerns, but I think they are unfounded.

    You say that if readers of my analyses do not know how many studies have been investigated, then “we don’t know what the rate of false positives is likely to be”. But that’s not true. The rate of false positives is no more than 0.1, which is the probability criterion for judging the experiment set to be “rare” if there actually is no bias. That’s exactly what hypothesis testing computes. (Actually, the rate is typically much lower than this criterion because the test is very conservative, but there is some rate of false positives, and it can be computed for any given situation.)

    Now, what remains unknown is the _number_ of studies that are false positives. This is unknown for two reasons. First, you do not know how many studies I have investigated. Second, no one knows how many studies actually do not have bias.

    I can give you some approximate information on the first number. I keep a folder with cases I have looked at (with varying levels of analytical detail). There are 25 experiment sets there. My estimate is that 12 of those sets show evidence of publication bias. Of course, this list is not entirely complete because there are lots of experiments I read and quickly dismiss as not being worth investigating (e.g., because it looks like there will be no evidence of bias or because there are technical difficulties that make the analysis impossible). So, it’s at least 25 experiment sets; maybe more depending on how you want to define things.

    Of course, maybe other people are also applying this technique and they have their own folders of experiment sets. In a sense the number of investigated experiment sets is not well defined, and it constantly changes. Whatever this number might be (if it even exists) does not change the conclusion of a particular finding.

    Regarding the second number, some people think the number of studies without bias is close to zero. I do not believe the problem is nearly that bad, but there is really no data on the issue. In any case, even though the rate of false positives is known, the number of studies that produce false positive reports of publication bias is unknown. That’s just the nature of hypothesis testing. It’s also true for all of the empirical studies in every psychological journal.
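    The distinction between a known rate and an unknown number comes down to one line of arithmetic (the counts of unbiased sets below are purely illustrative, since, as noted, nobody knows the true count):

```python
fp_rate = 0.1   # the criterion used in the bias analyses

# Expected number of false accusations = rate * number of unbiased sets
# examined. The rate is fixed by the criterion; the number scales with
# an unknown quantity.
expected_false_alarms = {n: n * fp_rate for n in (10, 100, 1000)}
print(expected_false_alarms)   # {10: 1.0, 100: 10.0, 1000: 100.0}
```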

    I am not saying that there is no bias in my reports, but it’s a bias about what is worth sharing with the field. As I noted in the reply to Sanjay, this kind of bias does not invalidate the reported findings. It’s similar to a bias I have to investigate properties of visual afterimages rather than aspects of embodied cognition. These choices are certainly a bias, but it’s a bias that does not invalidate the results of my investigations. Everyone has this kind of bias because we cannot study everything and we do not pick our topics at random.

    Regarding transparency, I would argue that my analyses are much more transparent than the original studies. If you think it is worthwhile to tell the field about experiment sets that do not have bias, feel free. The data are all published and available for you. I am not sure why you want me to mention them.

    You finish by saying “the field of psychology (among others) also doesn’t appropriately correct for multiple tests.” There are a lot of problems with hypothesis testing, but at least in the specific instances we have been discussing, I do not think this concern is valid. Balcetis & Dunning do not need to correct their analyses because Piff et al. run some other experiments on a different topic. Likewise, my analysis of Balcetis & Dunning does not need to consider my analysis of Piff et al.

    What you (and Simonsohn) seem to be asking for is control of the family-wise Type I error rate across my analyses. To do so requires justification of what constitutes a family. I would argue that the appropriate family is the set of experiments identified by the original author as supporting their claim. Thus, I use the 0.1 criterion to investigate the claim of Balcetis & Dunning that their five experiments support their ideas about wishful seeing. I do not have to adjust that criterion when I later investigate Piff et al.’s claims about social class and ethical behavior. (Note, this approach is consistent with the views of the original authors, who also treated their studies as different families.) If you think a different family is appropriate, you need to identify it and justify the choice.
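    Nussbaum’s worry, applied to the 25 experiment sets Francis mentions, can be made concrete; whether those 25 analyses really form one “family” is of course exactly the point in dispute, so this is only a sketch:

```python
def familywise_rate(alpha: float, k: int) -> float:
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni(alpha: float, k: int) -> float:
    """Per-test criterion that keeps the family-wise rate below alpha."""
    return alpha / k

# Treating the 25 analyses as a single family at the 0.1 criterion:
print(familywise_rate(0.1, 25))   # ~0.93: at least one false alarm is near-certain
print(bonferroni(0.1, 25))        # 0.004 per-test criterion would be needed instead
```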

    I would argue that the analysis I’ve described should be commonly used by everyone when they have a set of experiments that are related to a common topic. Just as we (are supposed to) test to ensure homogeneity of variance when using ANOVA with unequal sample sizes, we should run the power analysis to determine whether the set of findings is believable. If it is not believable, then I feel researchers should not publish the result and should think about their experimental methods to identify sources of bias. Sometimes people will have methods that are perfectly fine, but their data will occasionally indicate a problem. That’s just the nature of making decisions in an uncertain environment.
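    The core of that power analysis can be sketched in a few lines. Francis’s test estimates each experiment’s power from the reported effect sizes; the 0.55 figures below are simply assumed for illustration, not taken from any real paper:

```python
from math import prod

def prob_all_significant(powers):
    """If the experiments are independent, the chance that all of them
    reach significance is the product of their individual powers."""
    return prod(powers)

# Hypothetical set: five experiments, each with estimated power 0.55,
# all reported as significant.
p_success = prob_all_significant([0.55] * 5)
print(round(p_success, 3))   # about 0.05, below the 0.1 criterion: "too good to be true"
```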

    You noted that false positives that report publication bias are very costly to real people. This is true, but biased studies are also costly. Someone reading the Piff et al. paper might conclude that rich people really are unethical. This might lead them to change their behavior in lots of important ways (e.g., who to vote for in elections, whether to start a revolution). Psychology investigates a lot of topics that I think are very important. As a science we need to be able to demonstrate to people that we can answer important questions about human behavior, but bias prevents that from happening. Moreover, I know of several real people who spent years investigating a phenomenon and were unable to make progress because they could not get a key finding to successfully replicate. These are real people who dropped out of psychological research (possibly) because a published finding was actually false.

    I’ve always liked this quote from Carl Sagan, “At the heart of science is an essential balance between two seemingly contradictory attitudes–an openness to new ideas, no matter how bizarre or counterintuitive they may be, and the most ruthless skeptical scrutiny of all ideas, old and new. This is how deep truths are winnowed from deep nonsense.” Psychology has been open to the bizarre ideas, but, painful though it may be, we need to engage in the ruthless skeptical scrutiny.

  26. I have had enough people contact me about Simonsohn’s critique (comment #6) that I decided to write up a formal rebuttal. A copy can be found at



