Machine Learning: Exceeding Chance Level By Chance

By Neuroskeptic | January 18, 2015 7:29 am

A simple statistical misunderstanding is leading many neuroscientists astray in their use of machine learning tools, according to a new paper in the Journal of Neuroscience Methods: Exceeding chance level by chance.

As the authors, French neuroscientists Etienne Combrisson and Karim Jerbi, describe the issue:

Machine learning techniques are increasingly used in neuroscience to classify brain signals. Decoding performance is reflected by how much the classification results depart from the rate achieved by purely random classification.

Suppose you record activity from my brain while I am looking at a series of images of people. Some of the people are male, some are female. You want to determine whether there is something about my brain activity (a feature or pattern) that’s different between those two classes of stimuli (male and female). Now suppose you find a pattern that allows you to ‘read my mind’ and determine whether I’m looking at a male or a female image, with 70% accuracy. Is that a good performance? Well, you might think: guessing at random, flipping the proverbial coin, we would only be right 50% of the time. 70% is much higher than 50%, so the method works!

Not so fast, say Combrisson and Jerbi:

 In a two-class or four-class classification problem, the chance levels are thus 50% or 25% respectively. However, such thresholds hold for an infinite number of data samples but not for small data sets. While this limitation is widely recognized in the machine learning field, it is unfortunately sometimes still overlooked or ignored in the emerging field of brain signal classification […] while it will not come to anyone as a surprise that no study to date was able to acquire infinite data, it is intriguing how rarely brain signal classification studies acknowledge this limitation or take it into account.

The problem is that, intuitively, we expect random chance to pick the correct choice, out of two, 50% of the time; but this assumption only holds in machine learning with an infinite sample size, which we never have. The smaller the sample size, the more likely the observed chance performance is to deviate from the 'theoretical' chance level, e.g. 50%.
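To see this concretely, consider a classifier that literally guesses at random on a balanced two-class problem. A quick simulation (a Python sketch of the general point; the paper's own scripts are in MATLAB) shows how far "chance" can stray from 50% when samples are few:

```python
import random
import statistics

def chance_accuracy(n_samples, n_experiments=2000, seed=0):
    """Simulate a classifier guessing at random on a two-class problem;
    return the observed 'chance' accuracy from each simulated experiment."""
    rng = random.Random(seed)
    accs = []
    for _ in range(n_experiments):
        correct = sum(rng.random() < 0.5 for _ in range(n_samples))
        accs.append(correct / n_samples)
    return accs

for n in (10, 100, 1000):
    accs = chance_accuracy(n)
    # the spread of 'chance' accuracies shrinks as the sample size grows
    print(n, max(accs), round(statistics.stdev(accs), 4))
```

With only 10 samples, some purely random runs classify 90-100% "correctly"; with 1000 samples, chance stays pinned near 50%.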

Combrisson and Jerbi note that this problem is well known to statisticians and computer scientists. However, they say, it is often overlooked in neuroscience, especially among researchers using neuroimaging methods such as fMRI, EEG and MEG.

So how serious is this problem? To find out, the authors generated samples of random 'brain activity' data, arbitrarily split the samples into two 'classes', and used three popular machine learning tools to try to decode the classification. The methods were Linear Discriminant Analysis (LDA), the Naive Bayes (NB) classifier, and the Support Vector Machine (SVM). The MATLAB scripts for this are made available here.

By design, there was no real signal in these data. It was all just noise – so the classifiers were working at chance performance.

However, Combrisson and Jerbi show that the observed chance performance regularly exceeds the theoretical level of 50% when the sample size is small. Essentially, the variability (standard deviation) of the observed correct classification rate shrinks roughly with the square root of the sample size. Therefore, with smaller sample sizes, the chance that the chance performance level is (by chance) high increases. This was true of LDA, NB and SVM alike, and regardless of the type of cross-validation performed.

The only solution, Combrisson and Jerbi say, is to forget theoretical chance performance, and instead evaluate machine learning results for statistical significance against sample-size specific thresholds. They provide a helpful "look-up table" revealing the minimum performance that a classifier needs to achieve in order to statistically significantly exceed chance. This table offers both a yardstick by which to judge previous studies, and a guide for the future. Some neuroscientists who use machine learning may cringe at how high these figures are.
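The logic behind such a table can be reproduced from the binomial distribution: find the smallest number of correct classifications whose one-sided p-value under random guessing falls below the significance level. A stdlib Python sketch (the paper's exact thresholds may differ depending on the alpha level and any corrections applied):

```python
from math import comb

def binom_sf(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value of
    getting k or more correct out of n by guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_significant_accuracy(n, alpha=0.05, p_chance=0.5):
    """Smallest accuracy whose one-sided binomial p-value is below alpha."""
    for k in range(n + 1):
        if binom_sf(k, n, p_chance) < alpha:
            return k / n
    return None

for n in (20, 50, 100):
    print(n, min_significant_accuracy(n))
```

For 20 trials, a two-class decoder needs 75% accuracy to beat chance at p < 0.05; for 100 trials, 59% suffices.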

Combrisson E, & Jerbi K (2015). Exceeding chance level by chance: The caveat of theoretical chance levels in brain signal classification and statistical assessment of decoding accuracy. Journal of Neuroscience Methods. PMID: 25596422

  • Edden Gerber

    Isn’t all this just saying that people do not use statistical significance testing? From the abstract: “we illustrate the use of analytical and empirical solutions (binomial formula and permutation tests) that tackle the problem by providing statistical significance levels (p-values) for the decoding accuracy”… Really? I’m surprised we would still need a paper in JNS methods just to tell us to use a statistical test.

    • Neuroskeptic

      In essence yes. Perhaps what confuses researchers is the need to apply significance thresholds on top of running cross-validation. Maybe some people have been thinking that cross-validation is itself a form of hypothesis testing…?

      • D Samuel Schwarzkopf

        Haven’t read the paper yet but this would strike me as odd. Obviously they must perform a t-test vs chance. The problem is that this approach is testing against a point chance level which isn’t what the true chance distribution looks like. I assume this is what they must mean that the assumption holds for infinite samples but not smaller ones.

        So the answer to that question is, no, people do use statistical tests but they are not appropriate to test for real effects.

        • Neuroskeptic

          Combrisson and Jerbi say that in “numerous” papers, researchers use significance testing only for testing differences in classification accuracy e.g. across different classifiers – and never actually test the hypothesis that the accuracy is greater than chance. They then cite a list of papers which presumably are the offenders they have in mind…!

          • D Samuel Schwarzkopf

            If this is true, this is indeed shocking. I can’t recall any study I read where this was the case. But that doesn’t mean it isn’t common. I need to read this when I’m back at work.

          • Neuroskeptic

            This paper doesn’t seem to do any hypothesis testing, nor does this one (both of these were cited by C & J).

            Although I wonder whether p values could be inferred from other statistics provided. It’s also possible that the authors did run hypothesis tests and didn’t bother to include them.

          • D Samuel Schwarzkopf

            As I was pointing out on Twitter, I wonder if this has anything to do with the obsession about testing interactions. I’ve encountered this in reviews before as well. Sure, testing the difference between conditions A and B is important if your critical claim is about whether the magnitude of your dependent variable is different in the two conditions. But this is not what classification is about.

            Rather you are essentially testing two coins as to whether or not they are fair (i.e. at chance). If coin A is only slightly biased but coin B is not, you wouldn’t get a significant difference between them. But testing each vs chance would reveal that difference.

            That said, I think the tests most studies (including mine) have used to test vs chance are perhaps not conservative enough. I have in the past simulated chance distributions under the conditions of my experiments but I don’t think I ever included those in the published papers.

  • Jespersen

Whoa. Are there neuroscientists out there who actually use a sample size of *twenty*, or is this just a theoretical scenario? Because that’s terrifying. Genuinely terrifying.

    In NLP, for a basic 2-category classification task using Bayesian classification (such as distinguishing between positive/negative reviews, or spam/non-spam messages), a dataset of at least 1000 documents is considered a bare minimum, though most actual training datasets range between 2,000 and 10,000 documents. And even in such cases, only the top 10% features learned can actually be considered informative.

    I suppose that’s another reason #WhyWeNeedTheBRAINInitiative

    • Neuroskeptic

      Well, it costs a lot more to acquire an fMRI scan than it does to download a document. But then again, EEG is fairly cheap, and sample sizes tend to be small there as well…

      • andrewthesmart1

        Phase III clinical trials use 1,000 – 3,000 patients typically and will run you only about 10-20 million dollars. Is being able to predict whether a subject was looking at man or woman from fMRI data worth that much money?

        • Neuroskeptic

          Maybe, if doing so was a step on the path to building a mind-controlled exoskeleton for paralyzed people etc.

          Probably not, if it was purely for academic interest.

          • andrewthesmart1

            I know a certain US government agency who would gladly pay millions for a mind-controlled exoskeleton.

    • D Samuel Schwarzkopf

      If you follow the logic in this paper, the n should be lower than 20… 😛

      Anyway, as NS points out sample sizes in neuroimaging are quite expensive and time-intensive. In the end it depends on what kinds of effects you want to investigate. But yes, I would agree that it would help to maximise power if possible.

    • Anonymouse

Well. In some scenarios the sample size is naturally limited, like when the subjects have a rare brain lesion.

    • Ondřej Havlíček

      Just some clarification questions, the “n” in the table does not refer to a “sample size” as the number of subjects in the study but to the number of classified samples in each of the participants, right? Therefore, these are the critical (binomial dist.) values that allow us to say that classification in a given subject was above chance, right? To say that over all subjects the classification was above chance, a normal t-test would be appropriate (as Martin Hebart says) and the required mean accuracy would be quite lower than in that table, depending on the number of subjects?

      • Matthias Guggenmos

        Yes, the whole paper is about classifying data from a single participant. In my view this is a huge downside of this paper, as this is (1) a very rare scenario (an exception may be BCI with rare locked-in patients) and (2) many of the alleged offenders (e.g. Bode & Haynes, as mentioned above) are only interested in and REPORTING group-level results. The whole paper is in strange neglect of this fact. The bottom line is that neuroscientists don’t have to “cringe at how high these figures are”. These figures have no relevance for the majority of neuroimaging studies. E.g., there are many examples where decoding accuracies as low as 52% are significantly different from 50% chance in standard neuroimaging paradigms with around 20-30 participants.

    • Felonious Grammar

      Aside from sample size, in the example given, what do they think they’re reading? Are the subjects all American undergrads? Since there are not universal archetypes for “men” and “women”, what are the images being shown? Are there varieties of ages and races in those images? In the subjects? Was sexual preference taken into account, or even considered?

It seems too many studies with MRIs, like bad sociobiology, affirm stereotypes and wishful thinking.

  • D Samuel Schwarzkopf

This sounds very interesting. Did they look at permutation tests? I think this is a much preferable approach to a look-up table.

    • Neuroskeptic

      They did, and they found that the permutation tests matched with the result of the table (which is based on the binomial theorem) very closely, in the cases they examined. However, permutation testing might give different results in other cases, depending on the nature of the data or the design.
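A permutation test of this kind can be sketched in Python (not the authors' MATLAB code): shuffle the class labels many times, recompute the accuracy under each shuffle, and report the fraction of shuffles that do as well as the real labels. This simple version permutes labels against fixed predictions, which sidesteps the cross-validation subtleties raised elsewhere in this thread:

```python
import random

def permutation_pvalue(predictions, labels, n_perm=10000, seed=0):
    """One-sided p-value: fraction of label shufflings whose accuracy
    matches or exceeds the observed accuracy."""
    rng = random.Random(seed)
    n = len(labels)
    observed = sum(p == y for p, y in zip(predictions, labels)) / n
    count = 0
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        acc = sum(p == y for p, y in zip(predictions, shuffled)) / n
        if acc >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one rule avoids p = 0

# made-up example: 14 correct out of 20 trials (70% accuracy)
preds  = [0] * 10 + [1] * 10
labels = [0] * 7 + [1] * 3 + [0] * 3 + [1] * 7
print(permutation_pvalue(preds, labels))
```

With these made-up data, 70% correct on 20 trials comes out around p ≈ 0.09 — not significant, echoing the post's point that 70% is no guarantee of a real effect.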

  • Maria

    My understanding is that statistical significance is not the indisputable threshold for confirmation of a hypothesis. In conjunction with other data it may lead to certain conclusions. P value on its own is quite meaningless. I think it is reasonable to review significance in relation to the sample size.

  • Rolf Degen

    From the description, this sounds like the method that was used in the studies on lie detection with fMRI. I wonder whether this critique bears any relevance to this approach.

  • shora

I think it is very important here to note that the IMMENSE majority of EEG decoding papers use permutation tests. Three recent examples:

It is also important to be careful here; the situation is obviously not perfect, but I think that having most people (i.e. reviewers) say “you only get 60% correct in your 2-way classification so it is not significant according to my look-up table” is even worse. Binomial tests are based on assumptions, which might not be true for most EEG experiments.

    • Neuroskeptic

      That’s true – and I’m sure that the authors of this paper would agree!

  • Martin Hebart

    The problem is a little more complicated: We need to think of decoding-level or “second”-level statistics. For decoding level stats (e.g. in leave-one-subject-out or leave-one-run-out), binomial tests or t-tests are wrong when using cross-validation, because accuracy estimates are not independent, i.e. there is a bias in the variance. We are soon submitting a paper on that issue (for a superficial treatment, see Noirhomme et al., 2014 in Neuroimage Clinical). This should also apply to the comparison of classifiers.

    The only solution is the correct permutation test, where labels are permuted within all cross-validation chunks before cross-validation is repeated. This is the second problem, because many permutation procedures are wrong (e.g. when permuting only training data or when mixing label permutations across original cross-validation chunks).
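Hebart's prescription — permute labels only within each cross-validation chunk, so the chunk structure is preserved — can be sketched as follows (the function name is mine, not from the forthcoming paper):

```python
import random

def permute_within_chunks(labels, chunks, rng):
    """Shuffle labels separately within each cross-validation chunk.
    Each chunk keeps exactly the labels it started with, only reordered,
    so e.g. a chunk holding labels [1, 2] can become [2, 1] but never [1, 1]."""
    permuted = list(labels)
    for chunk in set(chunks):
        idx = [i for i, c in enumerate(chunks) if c == chunk]
        vals = [permuted[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            permuted[i] = v
    return permuted

rng = random.Random(0)
labels = [1, 2, 1, 2]   # two chunks (runs) of two trials each
chunks = [0, 0, 1, 1]
perm = permute_within_chunks(labels, chunks, rng)
print(perm)  # each chunk still contains exactly one '1' and one '2'
```

Permuting across chunks instead (a plain shuffle of all labels) would also produce assignments like [1, 1] within a run, which is the incorrect procedure Hebart describes.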

    Third, these problems don’t apply to the usual “second”-level stats, because to our current knowledge the mean accuracy of each subject is an unbiased estimate. At the group level, large variability in accuracy would kill our significance. To our current knowledge, t-tests are appropriate at the group level. There are exceptions to this, but they are probably not severe problems.

    Take-home message: Do the right stats and everything is ok.

    • Martin Hebart

      I just checked the paper and I believe their permutation procedure is also wrong: They first permute the labels, then set up cross-validation chunks and run cross-validation. This destroys potential systematicities in the original chunking of data. The correct order is to use the same chunks as for the original classification and only permute within run. That leads to a strong deviation from the binomial distribution.

      • Neuroskeptic

        Wow, this is potentially a serious issue. But would this mistake lead to substantially different results given the data that was considered in this study? i.e. were there systematicities in the chunking of these data?

        Either way, it’s important to get the methods right in a methods paper!

        • Martin Hebart

          Ok, I checked the code and indeed, the permutation procedure is not really correct. But to be fair, maybe my definition of exchangeability is rather strict. My reasoning is that if the chunking of data itself has an influence on classification results, then this chunking should be preserved for permutations. To understand this, assume that there are only 2 labels per chunk, 1 and 2. Then a correct permutation procedure would have only two possible permutations: [1 2] and [2 1]. A random permutation across runs would allow [1 1] and [2 2], as well. If noise in one run just by chance drives results away from chance (which of course happens!), then equating both labels eliminates this possibility. Hence the permutation will deem it less likely that above chance happened just by chance, i.e. the result would rather become significant.

          How serious is the “problem”? I can’t really tell. It depends on how many samples are in each chunk and how structured the noise is. Kai Görgen who is the first author of the to-be-published paper ran lots of simulations, but I don’t know if he tested the impact of retaining chunks before permuting. Also Jo Etzel (mvpa meanderings) and I discussed this issue a while ago.

    • Alberto González

This was actually the topic of my master’s thesis in biomedical engineering last year. In “Why machines cannot guess how we feel yet” I studied two papers [1], [2], which test several machine learning algorithms for determining levels of valence and arousal from EEG under controlled experimental conditions. In these two experiments, subjects watch videos and rate their valence and arousal while their EEG signals are recorded. I did not study other, similar experiments on detecting valence and arousal, since they do not take the experimental conditions into consideration, use far less data to draw conclusions, or record EEG with devices that cannot be considered rigorous, such as the Emotiv. Nevertheless, those studies employed similar tools and methods in their analyses and conclusions.

During my analysis, I compared the problem of emotion recognition from EEG with the famous overfitting example of Banko and Brill (2001) in natural language processing. To make this comparison I applied learning-curve analysis, a very basic machine learning technique: measuring the accuracy of the algorithm as the number of training examples increases. With this technique one can diagnose whether an algorithm fails because it suffers from bias or from overfitting, the most common problems in machine learning, and ones that no one considers in brain signal classification. I wanted to know whether the low prediction accuracy arises because emotions are so personal that it is impossible to generalize a pattern of EEG activations to detect high/low levels of valence/arousal (in which case any features extracted from EEG would be useless for this task, and the learning curve would show a bias problem), or whether emotion recognition is so complex that leave-one-out cross-validation applied to 40 examples as in [1], or 16 as in [2], is simply not enough to test an algorithm for classifying emotions more universally.

I used the dataset recorded in [1], which is publicly available. The learning curve passed through the accuracy ratios (also compared with F1-scores, since the classes are unbalanced) reported in the two papers: 33.51% for 16 examples (leave-one-out cross-validation), less than 60% with 40 examples (leave-one-out cross-validation), and 72% with the arrangement of the dataset made in my thesis, with 1024 examples for training and 256 for testing (80%-20%). This shows logarithmic growth, like that obtained by Banko and Brill (2001). Lastly, I suggest there might be another explanation besides the one commonly offered for emotion recognition from EEG (that we are so unique that a universal emotion detector is impossible). In natural language processing, nearly every word can be considered a new training example, and around a billion words are needed to train an algorithm to interpret a text correctly (around 95% accuracy). In emotion recognition it might be that the brain speaks a language in which not just 40 examples but millions are needed to get a general idea of the activation patterns in EEG.

      For more information:

      – My thesis,

      And the papers:

      [1] – “DEAP: A Database for Emotion Analysis using Physiological Signals”, S. Koelstra et al. IEEE Transaction on Affective Computing 2011.

      [2] – “Individual Classification of Emotions Using EEG”, S. Valenzi et al. J. Biomedical Science and Engineering, 2014, 7, 604-620

      Banko and Brill, 2001, “Scaling to Very Very Large Corpora for Natural Language Disambiguation”. Michele Banko and Eric Brill.

      • Neuroskeptic

        Thanks for the great comment!


  • Clemens Brunner

The finding that (cross-validated) results should always be compared with the practical chance level (which depends on the number of trials) is not new (even in the neuroscience community). It was already published in 2008, and in a review book chapter.

  • Nikolaus Kriegeskorte

    The error described here (mistaking chance level for a statistical threshold) is a very naive one, which I have never seen committed in a published neuroscience paper. I hope very much that Combrisson & Jerbi cite papers that commit this error, otherwise it’s just a strawman. The points made above are widely understood among the decoding community. Moreover the binomial test they seem to advocate is known to be inappropriate in the context of crossvalidation, because the results from different folds are not independent.

    • Neuroskeptic

      Here’s how Combrisson & Jerbi see the problem:

      “[…] Not all the previous brain decoding reports suffer from the caveat of using theoretical chance-level as reference. However, numerous studies only apply statistical assessment when testing for significant differences between the performance of multiple classifiers, or when comparing decoding across experimental conditions, but unfortunately neglect to provide a statistical assessment of decoding that accounts for sample size (e.g. Felton et al., 2007, Haynes et al., 2007, Bode and Haynes, 2009, Knops et al., 2009, Kellis et al., 2010, Hosseini et al., 2011, Sitaram et al., 2011, Hill et al., 2006, Wang et al., 2010, Bleichner et al., 2014, Babiloni et al., 2000, Ahn et al., 2013 and Morash et al., 2008; Neuper et al., 2009; Kayikcioglu and Aydemir, 2010 and Momennejad and Haynes, 2012). A number of such studies use theoretical percent chance-levels (e.g. 50% in a 2-class classification) as a reference against which classifier decoding performance is assessed. By doing so, such studies fail to account for the effect of finite sample size. This may have little effect in the case of large sample size or when extremely high decoding results are obtained, however, the bias and erroneous impact of such omissions can be critical for smaller sample sizes or when the decoding accuracies are barely above the theoretical chance levels.”

      • Nikolaus Kriegeskorte

        The question is whether the claims made in these studies required a test against chance level. In general, it is harder (requiring more data) to detect that one decoding accuracy is higher than another (since both are noisy estimates). So this is in fact a *higher* statistical bar. If these tests support the claims made, I see no problem. Imagine someone wrote a paper pointing out this error at an even more general level. The paper could be called “Exceeding zero by chance” and the abstract could include this passage “When some estimate A is greater than 0, this does not mean that the difference is significant. Although t tests are widely used, some studies only rely on two-sample t tests to compare estimates A and B acquired under different experimental conditions.” I’m sure we could find lots of examples of this. Whether it’s a problem depends on the claims made in those papers.

      • Jo Etzel

        This strikes me as a strawman argument as well; while we may argue about how to properly establish significance (eg how *exactly* to do a permutation test), no one is claiming that we just need to say the accuracy is above 0.5.

        I glanced at Bode & Haynes 2009, simply because it’s the listed paper I’ve read most recently. I agree with Combrisson & Jerbi that the second-level statistical test was not adequately explained, but they do refer to “… subjected to second level statistical analyses across subjects to find regions that were predictive significantly above chance (50%).”, and putting the searchlight accuracy maps into SPM and using FWE correction. So, I assume they did a voxelwise across-subjects t-test on the accuracies, then a FWE on the t-values, which presumably Combrisson & Jerbi would be ok with.

  • Mohammed Al-Rawi

Getting 70% accuracy when the theoretical chance level is 50% is not a problem if one could also get 70% in a few permutations. In fact, there is a chance of getting even 100% accuracy in a few permutations (per the infinite monkey theorem). But the proportion of permutations that reach 70%, or even 100%, by chance is what makes the difference: it tells us how likely the accuracy we got arose by chance (the p-value).
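This point can be made concrete with a one-sided binomial test (assuming independent trials; as Martin Hebart notes above, cross-validated folds are not strictly independent): the same 70% accuracy can be non-significant or highly significant depending on how many trials it is based on.

```python
from math import comb

def binom_pvalue(n_correct, n_trials, p_chance=0.5):
    """One-sided P(X >= n_correct) under random guessing at p_chance."""
    return sum(comb(n_trials, k) * p_chance**k * (1 - p_chance)**(n_trials - k)
               for k in range(n_correct, n_trials + 1))

print(binom_pvalue(14, 20))    # 70% of 20 trials: ~0.058, not significant
print(binom_pvalue(140, 200))  # 70% of 200 trials: far below 0.001
```

So the headline accuracy alone tells us nothing; the sample size behind it determines whether it exceeds chance.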


  • Amber

Just read the article. I’m not an expert in signal classification, so for me it was eye-opening. They show in real brain signals (which they arbitrarily assigned to different classes) how low sample sizes lead to decoding accuracies well above the expected chance level. This shows it’s not just a theoretical limitation; it actually happens. Thanks for posting!


  • soundray

    Small sample sizes are a fact of life in many lines of neuroscientific inquiry. This paper is important because it shows exactly how difficult it is to show a true effect with a small sample. Numerical simulations using real data are the way to go.


  • stormchaser1983

I don’t understand this post at all. Anyone who is going to use linear classifiers by default has to follow those steps. How do you report classifier accuracy without a p-value? If anything, the post should address a common mistake people make in fMRI MVPA: using cross-validation accuracy as the true accuracy. In reality, data should be randomly partitioned into training and testing datasets, feature scaling from the training dataset should carry over to the testing dataset, the training dataset HAS to be balanced (the testing dataset need not be), etc.
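The scaling point above can be sketched: estimate the scaling parameters from the training set only, then apply those same parameters to the test set (a minimal stand-in for what a library preprocessing pipeline would do; fitting the scaler on all the data would leak test-set information into training):

```python
import statistics

def zscore_params(train_features):
    """Estimate scaling parameters from the training data only."""
    mean = statistics.mean(train_features)
    sd = statistics.stdev(train_features)
    return mean, sd

def apply_zscore(features, mean, sd):
    """Apply training-set scaling to any data, including the test set."""
    return [(x - mean) / sd for x in features]

train = [1.0, 2.0, 3.0, 4.0]
test = [2.5, 10.0]

mean, sd = zscore_params(train)             # fit on training data only...
scaled_test = apply_zscore(test, mean, sd)  # ...then carried over to test
print(scaled_test)
```

The test set never contributes to the mean or standard deviation, so out-of-range test values (like 10.0 here) are scaled exactly as unseen data would be at deployment.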




About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.

