On the “Suspicion of Scientific Misconduct by Jens Förster”

By Neuroskeptic | May 6, 2014 4:42 pm

One week ago, the news broke that the University of Amsterdam is recommending the retraction of a 2012 paper by one of its professors, social psychologist Prof Jens Förster, due to suspected data manipulation. The next day, Förster denied any wrongdoing.

Shortly afterwards, the Retraction Watch blog posted a (leaked?) copy of an internal report that set out the accusations against Förster.

The report, titled Suspicion of Scientific Misconduct by Dr. Jens Förster, is anonymous and dated September 2012. Reportedly it came from one or more statisticians at Förster’s own university. It relates to three of Förster’s papers: the one that the University says should be retracted, plus two others.

A vigorous discussion of the allegations has been taking place in this Retraction Watch comment thread. The identity and motives of the unknown accuser(s) are one main topic of debate; another is whether Förster’s inability to produce raw data and records relating to the studies is suspicious or not.

The actual accusations have been less discussed, and there’s a perception that they are based on complex statistics that ordinary psychologists have no hope of understanding. But as far as I can see, they are really very simple – if poorly explained in the report – so here’s my attempt to clarify the accusations.

First a bit of background.

The Experiments

In the three papers in question, Förster reported a large number of separate experiments. In each experiment, participants (undergraduate students) were randomly assigned to one of three groups, and each group was given a different ‘intervention’. All participants were then tested on some outcome measure.

In each case, Förster’s theory predicted that one of the intervention groups would test low on the outcome measure, another would be medium, and another would be high (Low < Med < High).

Generally the interventions were various tasks designed to make the participants pay attention to either the ‘local’ or the ‘global’ (gestalt) properties of some visual, auditory, smell or taste stimulus. Local and global formed the low and high groups (though not always in that order). The Medium group either got no intervention, or a balanced intervention with neither a local nor global emphasis. The outcome measures were tests of creative thinking, and others.

The Accusation

The headline accusation is that the results of these experiments were too linear: that the mean outcome scores of the three groups, Low, Medium, and High, tended to be almost evenly spaced. That is to say, the difference between the Low and Medium group means tended to be almost exactly the same as the difference between the Medium and High means.

The report includes six montages, each showing graphs from one batch of the experiments. Here’s my meta-montage of all of the graphs:

This montage is the main accusation in a nutshell: those lines just seem too good to be true. The trends are too linear, too ‘neat’, to be real data. Therefore, they are… well, the report doesn’t spell it out, but the accusation is pretty clear: they were made up.

The super-linearity is especially stark when you compare Förster’s data to the accuser’s ‘control’ sample of 21 recently published, comparable results from the same field of psychology:

It doesn’t look good. But is that just a matter of opinion, or can we quantify how ‘too good’ they are?

The Evidence

Using a method they call delta-F, the accusers calculated the odds of seeing such linear trends, even assuming that the real psychological effects were perfectly linear. These odds came out as 1 in 179 million, 1 in 128 million, and 1 in 2.35 million for the three papers respectively.

Combined across all three papers, the odds were one out of 508 quintillion: 508,000,000,000,000,000,000. (The report, using the long scale, says 508 ‘trillion’ but in modern English ‘trillion’ refers to a much smaller number.)

So the accusers say:

Thus, the results reported in the three papers by Dr. Förster deviate strongly from what is to be expected from randomness in actual psychological data.

How so?

The Statistics

Unless the sample size is huge, a perfectly linear observed result is unlikely, even assuming that the true means of the three groups are linearly spaced. This is because there is randomness (‘noise’) in each observation. This noise is measurable as the variance in the scores within each of the three groups.

For a given level of within-group variance, and a given sample size, we can calculate the odds of seeing a given level of linearity in the following way.

delta-F is defined as the difference in the sum of squares accounted for by a linear model (linear regression) and a nonlinear model (one-way ANOVA), divided by the mean squared error (within-group variance). The killer equation from the report, in essence:

delta-F = (SS_nonlinear − SS_linear) / MSE
If this difference is small, it means that a nonlinear model can’t fit the data any better than a linear one – which is pretty much the definition of ‘linear’.

Assuming that the underlying reality is perfectly linear (independent samples from three distributions with evenly spaced means), this delta-F metric should follow what’s known as an F distribution. We can work out how likely a given delta-F score is to occur, by chance, given this assumption, i.e. we can convert delta-F scores to p-values.

Remember, this is assuming that the underlying psychology is always linear. This is almost certainly implausible, but it’s the most favourable possible assumption for Förster. If the reality were nonlinear, low delta-F scores would be even less likely.
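To make the recipe above concrete, here’s a minimal sketch of the calculation in Python (my own checks used MATLAB; the group scores in the usage example are invented):

```python
import numpy as np
from scipy import stats

def delta_F(low, med, high):
    """delta-F: extra sum of squares explained by the full one-way ANOVA
    model over the linear-trend model, divided by the mean squared error.
    Under a perfectly linear truth it follows F(1, N - 3)."""
    groups = [np.asarray(g, float) for g in (low, med, high)]
    y = np.concatenate(groups)
    n = len(y)
    # Sum of squares explained by the full model (a separate mean per group)
    ss_anova = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in groups)
    # Sum of squares explained by a linear trend in group order (-1, 0, +1)
    x = np.concatenate([np.full(len(g), c)
                        for g, c in zip(groups, (-1.0, 0.0, 1.0))])
    sxx = ((x - x.mean()) ** 2).sum()
    slope = ((x - x.mean()) * (y - y.mean())).sum() / sxx
    ss_linear = slope ** 2 * sxx
    # Mean squared error = pooled within-group variance
    mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - 3)
    dF = (ss_anova - ss_linear) / mse
    p = stats.f.sf(dF, 1, n - 3)  # right tail: p near 1 means 'very linear'
    return dF, p
```

Perfectly evenly spaced group means give delta-F = 0 and a p-value of 1; the further the middle mean sits from the midpoint of the other two, the larger delta-F and the smaller p.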

The delta-F metric is not new, but the application of it is (I think). Delta-F is a case of the well-known use of F-tests to compare the fit of two statistical models. People normally use this method to see whether some ‘complex’ model fits the data significantly better than a ‘simple’ model (the null hypothesis). In that case, they are looking to see if Delta-F is high enough to be unlikely given the null hypothesis.

But here the whole thing is turned on its head. Random noise means that a complex model will sometimes fit the data better than a simple one, even if the simple model describes reality. In a conventional use of F-tests, that would be regarded as a false positive. But in this case it’s the absence of those false positives that’s unusual.

The Questions

I’m not a statistician but I think I understand the method (and have bashed together some MATLAB simulations). I find the method convincing. My impression is that delta-F is a valid test of non-linearity and ‘super-linearity’ in three-group designs.

I have been trying to think up a ‘benign’ scenario that could generate abnormally low delta-F scores in a series of studies. I haven’t managed it yet.

But there is one thing that troubles me. All of the statistics above operate on the assumption that the data are continuously distributed. However, most of the data in Förster’s studies were categorical, i.e. outcome scores were constrained to be (say) 1, 2, 3, 4 or 5, but never 4.5 or any other number.

Now if you simulate categorical data (by rounding all numbers to the nearest integer), the delta-F distribution starts behaving oddly. For example, under the null hypothesis the p-curve should be flat, as it is in the graph on the right. But with rounding, it looks like the graph on the left:


The p-values at the upper end of the range (i.e. at the end of the range corresponding to super-linearity) start to ‘clump’.

The authors of the accusation note this as well (when I replicated the effect, I knew my simulations were working!). They say that it’s irrelevant because the clumping doesn’t make the p-values either higher or lower on average. The high and low clumps average out. My simulations also bear this out: rounding to integers doesn’t introduce bias.
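Here’s a rough Python sketch of that kind of simulation (the means, scale and sample sizes are invented, not taken from the report):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def delta_f_p(groups):
    # Right-tail p of the delta-F statistic (see the definition above):
    # p near 1 means the three group means are suspiciously linear.
    y = np.concatenate(groups)
    n = len(y)
    x = np.concatenate([np.full(len(g), c)
                        for g, c in zip(groups, (-1.0, 0.0, 1.0))])
    ss_anova = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in groups)
    sxx = ((x - x.mean()) ** 2).sum()
    slope = ((x - x.mean()) * (y - y.mean())).sum() / sxx
    mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - 3)
    return stats.f.sf((ss_anova - slope ** 2 * sxx) / mse, 1, n - 3)

def simulate(n_sims=2000, n=20, rounded=False):
    ps = []
    for _ in range(n_sims):
        # Truth is perfectly linear: means 2, 3, 4 on a 1-to-5 scale
        gs = [rng.normal(m, 1.0, n) for m in (2.0, 3.0, 4.0)]
        if rounded:  # force categorical 1-5 ratings
            gs = [np.clip(np.round(g), 1, 5) for g in gs]
        ps.append(delta_f_p(gs))
    return np.array(ps)

continuous = simulate(rounded=False)   # p-values roughly uniform on [0, 1]
categorical = simulate(rounded=True)   # p-values 'clump', but the mean stays ~0.5
```

The rounded p-values cluster at particular spots instead of spreading evenly, but their average does not shift: clumping without bias, just as the report claims.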

However, a p-value distribution just shouldn’t look like that, so it’s still a bit worrying. Perhaps, if some additional constraints and assumptions are added to the simulations, delta-F might become not just clumped, but also biased – in which case the accusations would fall apart.

Perhaps. Or perhaps the method is never biased. But in my view, if Förster and his defenders want to challenge the statistics of the accusations, this is the only weak spot I can see. Förster’s career might depend on finding a set of conditions that skew those curves.

UPDATE 8th May 2014: The findings of the Dutch scientific integrity commission, LOWI, on Förster have been released. English translation here. As was already known, LOWI recommended the retraction of the 2012 paper, on the grounds that the consistent linearity was so unlikely to have occurred by chance that misconduct seems likely. What’s new in the report, however, is the finding that the superlinearity was not present when male and female participants were analysed separately. This is probably the nail in the coffin for Förster, because it shows that there is nothing inherent in the data that creates superlinearity (i.e. it is not a side effect of the categorical data, as I speculated it might be). Rather, both male and female data show random variation, but they always seem to ‘cancel out’ to produce a linear mean. This is very hard to explain in a benign way.

  • Chris Hartgerink

    Proper dissection of the linearity method and some of its limitations. Thanks!

    The application of traditional methods to these kinds of use cases is interesting. However, you do overlook a main point when saying the entire accusation is based on the data being too linear (which they are nonetheless): the data are *consistently* too linear. With the Fisher method, the accuser clearly shows that the linearity is so consistent across the large set of studies, that it is highly unlikely that this consistency could arise in a situation where data are sampled from a population. The random element in random variables seems to be absent, which is a red flag. Note that this report was only the immediate cause for the investigation, not the final report upon which the conclusion of the (independent) committee is based. This seems to be forgotten by many people.

    I figure LOWI (the institution which ordered the investigation by the committee after the initial decision by CWI was objected to by the accuser) will make its report public soon and many of these details about the investigation will become clear. This will show why they came to this decision, and I consider it very plausible the report shows there was more evidence than just that presented in the accuser’s report. Most likely, the committee has heard people involved, has inspected the data (if possible), and possibly has involved one or multiple third parties for advice on the matter.

I do not think the nail in the coffin was the report alone, but rather a synthesis of it and the results of the investigation itself.

  • Pingback: Counterresponses | Pearltrees()

  • matus

Forster’s data are just silly. One can see this just from looking at the linearity of the plots. The clumps in your simulations arise because at the upper end the variance is so low that only very few patterns match the linear trend. For some delta-F scores there are actually no patterns matching the linear trend. This could be used to argue not only that Forster’s data are unlikely, but that they are downright impossible.

If someone reports an average value of 2.50 with variance 0.2 in a group where 40 subjects gave ratings on a 5-point scale, then these reported values are certainly incorrect. The pattern with the smallest variance that gives a mean of 2.5 is that 20 subjects gave response 2 and 20 subjects gave response 3. In this case the variance is (20*(-0.5)^2+20*(0.5)^2)/39 = 40*0.25/39 = 0.26, which is considerably more than 0.2.
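This bound can be verified by brute force over every possible distribution of 40 ratings (a quick Python sketch):

```python
# Brute-force check: the smallest sample variance achievable by 40
# ratings on a 1-5 scale whose mean is exactly 2.50.
n, target = 40, 100  # a mean of 2.5 means the ratings must sum to 100
min_var = float("inf")
for c1 in range(n + 1):                      # count of '1' ratings
    for c2 in range(n + 1 - c1):             # count of '2' ratings
        for c3 in range(n + 1 - c1 - c2):    # count of '3' ratings
            for c4 in range(n + 1 - c1 - c2 - c3):
                c5 = n - c1 - c2 - c3 - c4   # count of '5' ratings
                if c1 + 2 * c2 + 3 * c3 + 4 * c4 + 5 * c5 != target:
                    continue
                ss = sum(c * (v - 2.5) ** 2
                         for v, c in zip((1, 2, 3, 4, 5),
                                         (c1, c2, c3, c4, c5)))
                min_var = min(min_var, ss / (n - 1))

# min_var comes out as 10/39 ~= 0.256 (20 twos and 20 threes), so a
# reported variance of 0.2 is indeed impossible for these summary stats.
```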

    • http://www.math.leidenuniv.nl/~gill Richard D. Gill

      The data of the paper Förster and Denzler (if I remember correctly) are available, or rather, were available to UvA and LOWI. They ought to be published. The data of the other two papers were not available.

      By “the data” I mean the “final data set”, ie basically a spreadsheet which, if you feed it into SPSS and press the right buttons, will give you the averages, standard deviations, F statistics, and p-values which you need for your scientific publication.

  • Richard Morey

    But…the results were selected *precisely* for their linear look; that is, the data drove the test. The control sample was not selected for their linearity. If you compare a sample selected as suspicious for its linearity against another sample not selected for its linearity, of course you’re going to find differences. This makes the test a post hoc test, and the computed “naive” p value a meaningless number. I agree, it looks mighty suspicious, but any null hypothesis significance test without a correction for the post hoc nature of the test is completely invalid.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      The results themselves weren’t selected, they represent all of the 3-group results in the papers concerned. The only thing the papers have in common is that Förster wrote them.

      It’s true that the analysis is post-hoc in the sense that Förster was investigated over data linearity because his data are linear. So there is selection at the level of researchers and yes there is a multiple comparison problem.

But even if we grant that there’s a million researchers like Förster publishing these kinds of results, the chances of just one of them having a corpus with such consistent linearity come out as 1 in 508,000,000,000,000 (i.e. Bonferroni correction × 1 million)…

      • Richard Morey

        I don’t think you want to use Bonferroni here. If you use Tukey HSD-sort of logic, we would ask what p value would be greater than 95% of null p values, if we had done N tests (or, rather, peeked at N papers). If the p value is greater than that, it would be “significant.” Since p has a uniform distribution, the maximum p has a beta(N,1) distribution. The 95% quantile (that is, assuming an alpha of .05) of this distribution is (.95)^(1/N).

If we take the p value from the report’s analysis of the 2012 paper — p = 0.999999994 — we can ask how many papers we’d have to have examined for the corrected p value to NOT be less than .95, and it would be N = log(.95)/log(0.999999994) = 8,548,882. I can’t say I’ve read that many papers :) If we require the more stringent 0.005 recently suggested by Val Johnson as sufficient evidence to publish, that number would be 835,424. If we choose an (arbitrarily) more stringent p value to account for the fact that we’d probably want more evidence than normal to accuse someone of misconduct – say .0001 – the number of studies required to make the post hoc multiple comparisons non-significant is 16,668. That’s still a lot of studies.

        This is all quickly put together (so I don’t guarantee the math), but I think this paints a better picture than the (meaningless) p values given in the report.
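The arithmetic in this comment is easy to reproduce (a Python sketch of the calculation described; only the p-value is taken from the report):

```python
import math

# If N papers are 'peeked at' and only the most linear one is singled out,
# the maximum of N uniform p-values is Beta(N, 1) distributed, so the
# critical value at level alpha is (1 - alpha)**(1/N) on the max-p scale.
# Inverting: the N at which an observed p stops being 'significant' is
# N = log(1 - alpha) / log(p).
p_2012 = 0.999999994  # the report's delta-F p-value for the 2012 paper

def papers_needed(alpha, p=p_2012):
    return math.log(1 - alpha) / math.log(p)

n_05 = papers_needed(0.05)      # ~8.5 million papers at alpha = .05
n_005 = papers_needed(0.005)    # ~835,000 at the stricter .005
n_0001 = papers_needed(0.0001)  # ~16,700 at a stringent .0001
```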

        • http://www.math.leidenuniv.nl/~gill Richard D. Gill

          In this case, the “victim” was a high profile researcher in the same faculty as the whistle blower, who reputedly had come second on the list of candidates for his UvA chair to Stapel, who had preferred an offer from Tilburg … I suspect that the whistleblower did not have to read many papers before he found what he was clearly highly trained to see, if it was there.

    • matus

      I don’t see your point. Visual inspection is part of good data analysis. Even the APA manual recommends it.

Do you mean the linearity was selected post hoc as a criterion for suspicious data? In that case we can ask: what is the set of plausible contrasts from which the accuser might choose? Well, they might have come up with f(x)=b0+x^2+x^7+b1 exp(x), where b0, b1 are free parameters. Obviously, we wouldn’t take seriously allegations that some researcher made up data because they fit f(x) closely. Would we consider f(x)=b0+b1x^2? Probably not. We may consider f(x)=b0 if ‘no difference’ was the researcher’s target hypothesis.

When someone naively fabricates data that are supposed to show a certain pattern of differences between groups, it is most straightforward (in terms of the number of parameters and the functional complexity of f(x), but also common-sense intuition) to think of a linear trend. There is not much for the accuser to choose from. No correction needed.

      • Richard Morey

        I’m not arguing against visual inspection. I’m making the well-known and accepted argument that testing a hypothesis that was suggested by the data itself leads to error rate inflation in null hypothesis significance tests, or an inflated measure of evidence in Bayesian tests. This is the whole point of post hoc tests. You need to correct for the fact that the data suggested the hypothesis, because if you went around testing all hypotheses that were suggested by the data, you’d be overstating the evidence for your hypotheses. This is not a controversial statement.

        • matus

I’m aware of the multiple comparison/double dipping argument. It is just not clear how it works in this context. Do we know that the “data suggested the hypothesis”? Did the investigators first look at the data? Or did they only look at the averages in the plots? (which are, strictly speaking, not data). Maybe they just gave a student the task of looking for linear trends in recent publications by UvA researchers…

          What is the set of all possible hypotheses that could be “suggested by the data”? If this is small we don’t need to be concerned with multiple comparisons. The most obvious patterns have been reported by Forster. What is left? As I wrote above there is not much left besides the linear trend if we look at the data with a goal of checking the elementary statistical sanity of the data. This is especially true, if the accusers are restricted to work with the summary statistics reported in the data.

          • Richard Morey

            Are you suggesting that the selection mechanism for looking at the first of Foerster’s papers was “select a random paper from the population, and compute the linearity statistic?” Because that’s what would be required for the null hypothesis significance test to work properly. And…the authors of the report know this, on some level. That’s why they took a sample of papers that were not selected by linearity, as a “control” group.

            Regarding whether you need to post hoc correct when the number of possible hypotheses suggested by the data are small, I suggest you do a simulation. Choose a mean, and form an opinion about how “far away” the mean would have to be from the true mean to “suggest” that the means were different. We now have two hypotheses (that is, a small set): means are different and means are the same. Sample 10000 means from a normal distribution with the mean you picked, and divide them into “suggests same” vs “suggests different”. Look at the distribution of p values for the “suggests same” group. They will not be uniform as required by a Type I error controlled significance test, because the selection mechanism was a function of the same statistic as drives the size of the p value.

Note that all of this argumentation only applies to the articles that suggested the linearity test. I suspect that probably represents only a subset of the studies. Any paper that was tested before seeing its data is fair game. Of course, we don’t know which ones these were.

(As an aside, observed means represented in a plot *are* data, strictly speaking… just because they are represented graphically rather than with digits doesn’t make them not data.)

          • matus

You don’t need full randomization; it suffices if the selection criterion is independent of the outcome variable. I would guess that what happened is something rather trivial – Förster’s student came to the statistics department for a consultation, or someone from stats saw Förster’s poster hanging on a wall in the psych department. These selection criteria are independent of the outcome, hence irrelevant.

            I don’t see your point of your remaining paragraphs. Let’s make the analysis crystal clear. It has two steps:

1. Look at the plots; if no suspicion, declare the study OK; if suspicion, go to step 2.
2. Compute the F statistic; if p is almost 1, declare the study suspicious; if not, declare it OK.

Each study gets a verdict. There is no double dipping. You can omit the second step when the plotted averages do not look suspicious, because you already know that p is considerably smaller than 1.

          • Richard Morey

            Your steps only work without post hoc corrections if the reason why you ended up performing the steps in the first place has nothing to do with whatever statistic you’re computing. Checking for excess linearity is not standard practice. There was a reason *why* they chose to do it, and I find it implausible that it had nothing to do with the *observed* linearity.

  • http://www.math.leidenuniv.nl/~gill Richard D. Gill

The methodology here is not new. It goes back to Fisher (founder of modern statistics) in the 1930s. Many statistics textbooks give as an illustration Fisher’s re-analysis (one could even say: meta-analysis) of Mendel’s data on peas. The tests of goodness of fit were, again and again, too good. There are two ingredients here: (1) the use of the left-tail probability as p-value instead of the right-tail probability. (2) combination of results from a number of independent experiments using a trick invented by Fisher for the purpose, and well known to all statisticians.
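The combination trick referred to here is Fisher’s method; a minimal sketch (the p-values below are made up for illustration):

```python
import numpy as np
from scipy import stats

def fisher_combine(pvals):
    """Fisher's method: under the joint null hypothesis, -2 * sum(log p_i)
    follows a chi-square distribution with 2k degrees of freedom."""
    pvals = np.asarray(pvals, float)
    chi2 = -2.0 * np.log(pvals).sum()
    return stats.chi2.sf(chi2, 2 * len(pvals))

# For 'too good to be true' tests, one combines the LEFT-tail
# probabilities (1 - p for a conventional right-tail p). Three
# hypothetical studies that each look suspiciously linear:
left_tails = [1 - 0.999, 1 - 0.995, 1 - 0.99]
combined = fisher_combine(left_tails)   # far smaller than any single one
```

Three individually suspicious results combine into a far more extreme overall probability, which is how the per-paper odds in the report become the astronomical combined figure.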

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      Oh, of course. The peas! I stand corrected, it’s not new.

    • notPICNIC

Parallel with the technical issues (if there are any), there will develop a discussion about why the whole German (science) journalism mise-en-scène snapped into a state of total media blackout during one full week!:


I will be trying to share the media coverage of the many thorny roads concerning this Förster case via:


which is just a small partial study of the many issues involved, as covered in:

      University Inc: http://pearltrees.com/p/2qZ1

      • http://www.math.leidenuniv.nl/~gill Richard D. Gill

Yes, that is interesting. Probably connected with a lot of important people’s and important institutions’ reputations, and the date of May 8, when Förster was due to receive his Humboldt Foundation award.

  • http://www.math.leidenuniv.nl/~gill Richard D. Gill

The LOWI report is now published: https://www.knaw.nl/shared/resources/thematisch/bestanden/LOWIadvies2014nr5.pdf
What is needed now: (1) a translation, (2) the data sets posted on the internet, (3) the accuser’s R scripts posted on the internet. Then there can be an independent and open scientific discussion. Notice: the accuser’s analyses are not entirely post hoc: a pattern was noticed in one paper, and investigation of two further papers gave confirmation. All kinds of further analyses and further information gave confirmation.

    • Wouter

      I’m afraid I don’t have time to translate the entire report, but I can tell you LOWI’s conclusions and recommendations.

After consulting an expert (statistician), LOWI concluded that it is inevitable that the data have been manipulated or deliberately adjusted, for the following reasons:
1. Not linearity itself (which could exist in real life), but the lack of variance in control group averages.
2. The suspicious linearity is found for the total investigated groups, but not for embedded subgroups; especially, the men and women groups appear to compensate each other beyond what could reasonably be expected.
3. The results and improbable patterns could not have been obtained through questionable research practices (QRPs), thereby agreeing with the accuser. (They conclude this mainly because of how the available results were presented and reports on how experiments were carried out, which did not show signs of “sloppy science”.)

      LOWI’s judgements & recommendations:
1. LOWI thinks it’s inevitable that the data have been manipulated. This is a violation of scientific integrity.
2. LOWI advises the university board to reconsider its initial decision of putting out an “expression of concern”, since the expert analysis indicates questionable behavior beyond QRPs alone.
3. LOWI advises the board to request a withdrawal of the publication.

      • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

Thanks Wouter. I wonder if you (or anyone else) can clarify something. Förster is quoted, in an email to his supporters, as saying that (translated into English):

        “My data files were sent to the commissions and have been re analyzed and tested in detail. The results of the re analysis and the investigations were: * the data is real …”

Does this statement appear in the final report? It seems inconsistent with the expert’s other statements and the panel’s conclusions…

        • http://www.math.leidenuniv.nl/~gill Richard D. Gill

          I believe that the “final” data files only became available for one of the three papers – the one joint with Denzler – but everything else was “lost” by a crash of a hard disk. Sad. And this top researcher who was about to get 5 million Euro research money and a prestigious professorship seems never to have heard of the concept “back-up”, USB stick, dropbox, or whatever. Nor of the standard rule that data should be kept for at least five years *after publication*. Is that incompetence or is that incompetence? Or is that the state-of-the art professional behaviour within social psychology?

        • Wouter

The report mentions the data a couple of times. What’s important is that the raw data files no longer exist, nor was Förster able to justify this absence to LOWI’s satisfaction. For one study, only the final SPSS files were available, and these were used for the analysis. Research assistants have acknowledged that raw data files were saved automatically and converted to SPSS format. Whether the conversion was also done automatically remains ambiguous in the text.

        • http://dave.langers.nl/ Dave Langers

          Regarding “the data is real”, I think the closest thing to that in the report is the answer of the consulted expert statistician to the formulated question 1: “are the data, as displayed in the paper …, in agreement with the raw dataset, in other words could one arrive at the correlations in the tables on the basis of the raw data?”, to which the translated answer is (ad 1, page 8 of the report):
          “It proved possible to repeat the statistical analysis of … et al. using the data on the USB-stick. The results agree exactly with those in the paper. The data files provide reasonably comprehensive information about the experiments, in the usual format of one line per subject per study, and a column per variable. The files contain substantially more information
          per subject than used in the above analysis in … et al. The data files give the impression to belong to a solidly performed experiment.”
          So the data are “real” in the sense that they “really” produce the reported outcomes. Whether they are “real” in the sense that they were “really” acquired like that remains an open question, that the LOWI later in the report answers negatively (and so does the consultant in response to the other questions 2-4, in the sense that he finds it highly likely that the data are manipulated).

      • http://www.math.leidenuniv.nl/~gill Richard D. Gill

Altogether, to my knowledge, UvA’s CWI and KNAW’s LOWI consulted at least three *mathematical statisticians* and no doubt a similar number of psychometricians – *applied statisticians* working in “methodology” departments in social science or psychology faculties. As far as I can see from the available material, all these external experts, from both mathematical and applied statistics, came to essentially the same conclusions.

      • http://www.math.leidenuniv.nl/~gill Richard D. Gill

        LOWI also found another curious anomaly. The linearity disappears if one splits the data (available for just one of the three papers) into male and female subjects. So the incredible, almost impossible linearity, is the result of an incredible compensation between women going one way and men going another way.

        A monkey can type Shakespeare, sure. A million monkeys will do it faster still. Förster’s defence is that his data is “possible”. Nobody can deny that.

        • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

          This is a very odd thing…could it hold the key to the mystery?

I wonder. The delta-F test is based on the assumption that there are three populations with evenly spaced means. What if we instead assume six populations, i.e. 3 male and 3 female, and that the true underlying trend is linear within each gender but with a different ‘slope’… could that bias delta-F scores? Someone needs to simulate this… and I don’t have a copy of MATLAB handy…

          Edit: But just as a thought experiment I think it could produce bias. Consider if the male & female lines have very different slopes. And consider that there is zero within-gender variance. Now in the Low and High conditions, pooling across genders you’d get a mean halfway between male and female, but you’d also get a high within group variance because it’s a bimodal population (male vs female). But in fact the means would always be perfectly linear (so long as the % of males and females was equal.)
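Here’s a rough Python simulation of that thought experiment (all numbers invented; small within-gender noise rather than exactly zero):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def delta_f_p(groups):
    # Right-tail delta-F p-value: near 1 means 'suspiciously linear'
    y = np.concatenate(groups)
    n = len(y)
    x = np.concatenate([np.full(len(g), c)
                        for g, c in zip(groups, (-1.0, 0.0, 1.0))])
    ss_anova = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in groups)
    sxx = ((x - x.mean()) ** 2).sum()
    slope = ((x - x.mean()) * (y - y.mean())).sum() / sxx
    mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - 3)
    return stats.f.sf((ss_anova - slope ** 2 * sxx) / mse, 1, n - 3)

def one_study(n_per_gender=10, sd=0.1):
    # Males follow one linear trend, females a steeper one; each condition
    # pools equal numbers of both. The pooled means (the average of two
    # linear trends) stay linear, while the male/female gap inflates the
    # within-group variance - exactly the thought experiment above.
    male, female = (1.0, 2.0, 3.0), (4.0, 6.0, 8.0)
    return delta_f_p([
        np.concatenate([rng.normal(m, sd, n_per_gender),
                        rng.normal(f, sd, n_per_gender)])
        for m, f in zip(male, female)
    ])

ps = np.array([one_study() for _ in range(500)])
# ps piles up near 1: the pooled data look 'super-linear' by the delta-F
# test even though nothing was faked in this simulated scenario.
```

In this contrived setup the p-values do pile up near 1, so the thought experiment is at least coherent; whether anything like it fits Förster’s actual data is another matter (and the LOWI finding described above suggests not).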

          • http://www.math.leidenuniv.nl/~gill Richard D. Gill

            The underlying trend within each gender was over-abundantly clearly *not* linear, and moreover completely different in different studies. “Curiouser and curiouser”?

            The only obvious way to “make” data look like this (starting with some initial data which may or may not be real, it doesn’t matter) is to alter individual measurements (without taking any notice of gender) in order to reduce the within group variance, *and* to push the group means to a desired (linear!) pattern. Overdo your efforts to get the means right, and this is what is going to come out of it all …

  • Bryan Sim

I’m probably misunderstanding something. The way they are using the delta-F test is to test for a null effect. That is, they are implying that a “highly non-significant” result implies “too perfect” linearity. However, p-values reflect the extremeness of the data ASSUMING that the null hypothesis is true, and NOT the probability of the null hypothesis being true (see: http://en.wikipedia.org/wiki/P-value). Thus, in this case, the p-value would NOT be the probability that the data “are too linear”.

Put differently, when using null hypothesis significance testing, a low p-value implies that there may be a population difference between whatever (two groups, etc.), but a high p-value is not meaningful on its own, without considering something like statistical power.

    To use the same test to answer the question of “is the data too linear”, in my mind, one would have to consider beta, and not alpha levels. I don’t know what those would be in this case. Hopefully someone can enlighten me!

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      Under the null hypothesis, p-values are evenly distributed between 0 and 1 (by definition). In this case, the accuser set the null hypothesis to be perfect linearity of the underlying means (NB this is not the same null hypothesis as Forster used in his studies).

      Now if the null hypothesis reflected reality, the p-values of the delta-F test would be uniformly spread between 0 and 1.

      If the real effects were nonlinear, p-values would peak at 0.

However, there is no ‘normal’ scenario in which the p-values would peak at 1. I think this is what you meant when you said “a high p-value is not meaningful on its own”. Because under the null hypothesis, p-values are not going to be systematically high: they are equally likely to be low, medium or high.

      However in Forster’s data the p-values peaked around 1. Within a classical statistical framework this should never happen. Thus it is suspicious.

      • Bryan Sim

        Yes, thank you! This makes things much clearer.

  • Pingback: Volgende sociale psycholoog struikelt - de zaak Förster - Kloptdatwel?()

  • Pingback: Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again) | Error Statistics Philosophy()

  • Hellson

    Response by Jens Förster to the LOWI Report. Not sure if he did himself a favor with that one.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      An unimpressive response that doesn’t even address the core statistical claims, but simply declares them wrong e.g. “the 1 in a trillion value is rhetorical” – actually it comes from a transparent and detailed calculation. It might be miscalculated or the assumptions behind it might be in error. But the burden of proof is on him to show that, and he doesn’t even try.

      I will write a post on what his reply should have been, soon.

  • Richard Jacobs

    I’m not a statistician, and not familiar with delta-F, but I notice that delta-F only takes group variances into account. Intuitively, I’d think that a test for reliable linearity should be based on within-subject differences between conditions, which could be much smaller than the between and within-group differences. Why are within-participant differences not taken into account?

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      This is true but all of these studies used a between-subjects design. A within-subject design would indeed raise different issues.

  • Pingback: Explaining The Oddness of Jens Förster's Data - Neuroskeptic | DiscoverMagazine.com()

  • Steve Spencer

I can think of a QRP that would easily lead to the super-linearity. One QRP that has been discussed a fair bit is p-hacking by adding a small number of participants at a time and then reanalyzing the data each time until the p-value drops below .05. Now this specific QRP would not likely produce super-linearity, but a similar stopping rule (as in a decision about when to stop adding participants) clearly would. What if only a few participants were added at a time, the data were checked, and the experiment was stopped only when the pattern of means looked good? This would produce pretty strong super-linearity, and it could be something done by an eager research assistant without thinking about the consequences.

  • Pingback: Power Analysis and Non-Replicability: If bad statistics is prevalent in your field, does it follow you can’t be guilty of scientific fraud? | Error Statistics Philosophy()


