You can't scoop someone else's brain

By Razib Khan | July 21, 2012 11:07 pm

arXiving our papers:

I, and I’m sure other people, have worried about being scooped and beaten to publication due to our arXived papers. But really this is silly, as we’ve usually given talks, posters, etc. on them at big conferences, so the idea that people somehow don’t know about our work before it appears in print is ridiculous. It is far better to get work out, once you consider it worthy of publication, so it can be read and cited by others.

This is in reference to the paper The Geography of Recent Genetic Ancestry across Europe. Go and read the materials and methods. I’m sure that a substantial minority of the readers of this weblog have used every single piece of software listed therein. Phasing and such requires a bit of computational muscle, but that’s not an insurmountable hurdle. Additionally, many readers with academic affiliations could get their hands on the POPRES data set. But the generation of a paper, from methods to results to discussion, is not simply a robotic sequence of running data through software or algorithms. You need a first-rate statistical geneticist (e.g., the authors) to assemble the pieces coherently and with insight, even granting access to all the fundamental components.

Then there are sections of the methods dense with mathematical explication.

You could cut and paste such material, but you’d come off as a fool if you didn’t know what you were talking about. The Coop lab has put a substantial number of their quantitative biology papers up on arXiv, and I’m skeptical that that’s resulted in other groups cheating off them. On the contrary, in an idealized scientific environment the spread of insight has spillover effects, positive externalities. The scientific community is one where there should be greater returns to scale due to the synergistic power of cross-fertilization.

On the other hand, there is the flip side of this: the recent rash of data fraud and fudging affecting some of the more ‘empirical’ sciences. The community of science is based on trust, and sometimes I wonder how it persists. When the juice is in the collection and publication of data, rather than in clever or deep analysis of data already in common circulation, one can see the margin in cheating the system, or in hoarding your cache.

I don’t have any clever solutions for preventing cheating in medical or psychological science. But I can hope that in the future genomic data sets will be routinely liberated, so that everyone is working from the same general script. And faking genomic data well enough to pass muster probably isn’t worth the time and energy; if you can manage that, there’s a much better angle in going to Wall Street and screwing others for profit than in chasing small-time scientific fame.


Comments (1)

  1. Chad

    Your comments on “trust” and the liberation of genomic data sets remind me of one of my pet peeves when reviewing any paper for publication: “Data Not Shown.”

    This instantly makes me suspicious that they are trying to pull a fast one on me. It’s also interesting what is often not shown. For instance, a qPCR validation of a microarray experiment: why are they unable to take 10 minutes to make a bar graph or Excel sheet to slip in as supplemental data? Or, just recently, a paper using RNA-seq data: they claim a certain number of differentially expressed genes, but present only the subset that supports their hypothesis. Meanwhile, they make brief reference to certain classes of genes, but do not show the data…

    While I’m on this rant (forgive me), let me add another pet peeve: vagueness in the methodology, in particular vagueness in the informatics. I cannot tell you how many sequencing papers I have read where they briefly say “reads were mapped with bowtie/tophat/bwa/etc.” and that’s it. No mention is made of the parameters used, any of which could alter the results in subtle ways. Unfortunately, a lot of bench scientists venturing into informatics simply cut out these details, considering them minor; even the bioinformaticians are guilty of it at times. How could I reproduce their results, or catch a mistake, without knowing the parameters and commands used in the analysis? At least the statistical genetics papers have equations, so you know (if you understand them) how the results are calculated.

    Fortunately, a lot of journals now require that any new genomics data be deposited in public databases (GEO, SRA, etc.), which is a good first step, but it’s not enough. Typically the researchers dump just the raw data, which takes days, even weeks, to reanalyze. And if you do not know the exact informatics methods used, it can be hard to reproduce the analysis exactly (almost impossible if they don’t tell you the version numbers of the programs).

    The next step in openness will be not only making the data publicly available, but also setting accepted guidelines for reporting results and methodologies, so that everything is clear and open to everyone.
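The commenter’s complaint about unreported mapping parameters and program versions can be addressed with a very small habit: log the exact command line and the tool’s reported version alongside the analysis. A minimal sketch of that idea, assuming nothing from the post itself (the `record_run` helper, the `provenance.json` filename, and the bowtie2 flags are all illustrative):

```python
import json
import shlex
import subprocess


def record_run(tool, args, log_path="provenance.json"):
    """Append the exact command line and the tool's reported version to a
    provenance log, so a mapping step can be rerun parameter-for-parameter."""
    cmd = [tool] + list(args)
    try:
        # Many mappers print a version string with --version; fall back
        # gracefully when the tool is absent or reports nothing.
        out = subprocess.run([tool, "--version"], capture_output=True, text=True)
        version = out.stdout.strip() or out.stderr.strip() or "unknown"
    except (FileNotFoundError, OSError):
        version = "unknown (tool not on PATH)"
    entry = {
        "command": " ".join(shlex.quote(c) for c in cmd),
        "version": version,
    }
    with open(log_path, "a") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


# Hypothetical usage: record a bowtie2 alignment invocation before running it.
entry = record_run("bowtie2", ["-x", "hg19", "-U", "reads.fq", "--very-sensitive"])
print(entry["command"])
```

A one-line JSON record per step is enough for a reader to rerun the pipeline with the same parameters, or to spot exactly which flag differs when results do not replicate.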

