Deceiving with data

By Razib Khan | August 7, 2012 11:11 pm

Matt Yglesias on the enthusiasm for data mining in economics:

Betsey Stevenson and Justin Wolfers hail the way increases in computing power are opening vast new horizons of empirical economics.

I have no doubt that this is, on the whole, change for the better. But I do worry sometimes that social sciences are becoming an arena in which number crunching sometimes trumps sound analysis. Given a nice big dataset and a good computer, you can come up with any number of correlations that hold up at a 95 percent confidence interval, about 1 in 20 of which will be completely spurious. But those spurious ones might be the most interesting findings in the batch, so you end up publishing them!

Those in genomics won’t be surprised at this caution. I think in some ways social psychology and areas of medicine suffered a related problem, where a massive number of studies were “mined” for confirming results. And we see this more informally all the time. In domains where I’m rather familiar with the literature and distribution of ideas it is often easy to infer exactly which Google query the individual entered to fetch back the result they wanted. More worryingly I’ve noticed the same trend whenever people find the historian or economist who is willing to buttress their own perspective. Sometimes I know enough to see exactly how the scholars are shading their responses to satisfy their audience.

With great possibilities comes great peril. I think the era of big data is an improvement on abstruse debates about theory which can’t ultimately be resolved. But you can do a great deal of harm as well as good.
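The 1-in-20 figure from the quoted passage is easy to demonstrate by simulation. The sketch below is my own illustration, not anything from the post: it runs two-sample comparisons on pure noise, where by construction no real effect exists, and counts how often a result clears the conventional 0.05 threshold anyway (the sample sizes and trial count are arbitrary choices).

```python
import math
import random

random.seed(42)

def two_sample_p(x, y):
    """Approximate two-sided p-value for a difference in means (z-test)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

trials = 2000
hits = 0
for _ in range(trials):
    # Both samples are pure noise: the null hypothesis is true every time.
    x = [random.gauss(0, 1) for _ in range(50)]
    y = [random.gauss(0, 1) for _ in range(50)]
    if two_sample_p(x, y) < 0.05:
        hits += 1

print(f"{hits} of {trials} null comparisons came out 'significant' "
      f"({hits / trials:.1%})")
```

Run enough trials and roughly one comparison in twenty is "significant" despite there being nothing to find — which is exactly why mining a big dataset for its most interesting correlations is hazardous.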


Comments (7)

  1. Chad

    The solution to this is twofold:

    1) All raw datasets (regardless of field) once published should be made open access and deposited in a databank somewhere.

    2) Journals must demand more stringent and detailed methodology reports. Far too often journals allow people to get away with saying "program X was used to analyze data" and nothing more. The authors should be explicit about all aspects of the analysis: the settings/arguments, the code, everything. You give me a set of data and I can create all kinds of spurious results, and if all I write in the methods section is "program X was used," then it's very easy for me to pull a fast one on people.

  2. Siod

    This drives me insane. People don't realize it's intellectually dishonest to cherry-pick, and they don't realize that drawing conclusions is interpretation. We have to make this behavior as taboo as just making shit up.

  3. Chad

    I have absolutely no problem with drawing conclusions. That’s what you do. You get data, you analyze, and you try and draw conclusions. The way you overcome cherry picking is to expose it. The way you expose it is by demanding absolute clarity in methods.

  4. Sean and I recently published a paper on a related problem in linguistics, "Constructing Knowledge: Nomothetic Approaches to Language Evolution." We argue that these sorts of statistical studies should be confined to hypothesis generation, with other methods (e.g. experiments) then being used to test these assumptions and provide greater explanatory power.

  5. #3, perhaps the example is not valid, but I often put up my GSS variables, etc. Very few commenters actually extend/critique my analysis through replication. That's one reason I get lazy and just put up the link. I'll provide variables if asked, but that rarely happens.

  6. Siod

    #3, I don’t think there’s anything necessarily wrong with interpretation. What I meant is that people will look at a body of evidence in support of their intuitions and then draw a conclusion from that evidence. This is interpretation. And then other people (experts, typically) will have reviewed *all or most of* the literature or evidence and drawn a sound conclusion (that is still interpretation, but it’s interpretation given all the evidence).

    The problem comes in when 1) the people that are drawing conclusions don’t realize they’re interpreting a body of evidence and 2) the audience doesn’t realize how the person drawing conclusions interpreted the body of evidence or what evidence they decided to include in that body.

    Of course, there are some conclusions that are just painfully obvious and undeniable, but these are typically in the harder sciences like physics.

  7. Divalent

    “… about 1 in 20 of which will be completely spurious …”

    Er, not exactly. All of them could be spurious. Roughly 1 out of 20 of the things you test for at the 0.05 significance level will come out "spurious." What fraction of the pool of things you find to be statistically significant is spurious depends on how many of the things you test for actually are real correlations.
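Divalent's point — that the spurious share of your *significant* results depends on how many of the tested hypotheses are real — can be made concrete with a quick simulation. This is my illustration, not the commenter's, and the 10% base rate and 80% power are assumed numbers chosen purely for the example:

```python
import random

random.seed(7)

n_tests = 10_000
frac_real = 0.10   # assumed: only 10% of tested hypotheses are real effects
alpha = 0.05       # significance threshold for a true null to slip through
power = 0.80       # assumed chance a real effect reaches significance

false_pos = true_pos = 0
for _ in range(n_tests):
    if random.random() < frac_real:
        true_pos += random.random() < power    # real effect, detected?
    else:
        false_pos += random.random() < alpha   # true null, spurious hit?

fdp = false_pos / (false_pos + true_pos)
print(f"{false_pos + true_pos} significant results, "
      f"{fdp:.0%} of them spurious")
```

Even though each individual test holds its false-positive rate at 5 percent, with these assumptions over a third of the "findings" are noise — because most of what was tested had no real effect to find. That is exactly the gap between "1 in 20 of everything tested" and "1 in 20 of what gets published."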

