Science Without Open Data Isn’t Science

By Neuroskeptic | August 16, 2016 12:57 pm

A new position paper published in the New England Journal of Medicine (NEJM) has generated a lot of controversy among some scientists: Toward Fairness in Data Sharing.

It’s not hard to see why: the piece criticizes the concept of data sharing in the context of clinical trials. Data sharing is the much-discussed idea that researchers should make their raw data available to anyone who wants to access it. While the NEJM piece is specifically framed as a rebuttal to this recent pro-data sharing NEJM article, the arguments advanced apply to science more generally.


Here’s my take.

There is a strong prima facie case that raw scientific data should be made freely available. It is widely recognized that nullius in verba – “on the word of no-one” or “take no-one’s word for it” – is one of the fundamental principles of the scientific endeavor. Scientists do not believe something just because someone (or even everyone) claims that it is so. Evidence, not opinion, is what science is about.

Without open data, a scientific paper is little more than a statement that, in the author’s opinion, some evidence supports a certain set of claims. Without access to the raw data, a reader of a paper has no way of checking whether the results really do support the conclusions. So, without access to the raw data, the reader is asked to take the results essentially on faith.

It might be said that nullius in verba is an impossible standard. After all, even with open data, readers will still need to take the authors at their word that the data were collected in a certain way as described in the paper, and that the results were not manipulated, cherry-picked or otherwise compromised.

I agree that we will never be able to achieve perfect transparency in scientific communication – there will always be an element of trust. But if we’re serious about nullius in verba, we should strive to minimize the degree to which readers are expected to just trust the authors – and this means data sharing.

As a result, in my view, we should hold any attempts to limit the scope or effectiveness of data sharing to a very high standard, because open data is (or should be) a fundamental principle of science. “Toward Fairness in Data Sharing” doesn’t discuss such fundamentals, but focusses on practical objections to data sharing, such as the concern that it will incur financial costs for the producers of raw data, or will put them at risk of being “scooped” by other researchers who analyze their data before they have a chance to. In short, the problem with data sharing, according to the NEJM piece, is that it risks being unfair to scientists.

These may be real concerns, but even if they are, if we allow such concerns to determine our policy, we are effectively saying that fairness to scientists is more important than science itself.

  • non_sig

    I agree 100%. (The only concern I’d have is the anonymity of the participants, because some data (like images or DNA) may make individuals identifiable to some other people, even if an attempt is made to anonymise them.)

    • Neuroskeptic

      Yes, anonymity must be protected, but in most cases this can be ensured by removing the identifiable variables.

    • m242424

      You can identify anyone for anything, if you so wished.
      There is no such thing as anonymity in scientific research for a start.
      You MUST be able to identify the data to the result and you MUST be able to repeat the experiment with the same data points.

      Therefore you must know the data source.

      Most scientific research does not require even the slightest bit of protection.

      It becomes clouded when political or religious things are involved and can be resolved with a scientific mind.

    • Jacob

      Databases like dbGaP exist for handling genomic data; they manage it by access control. The crucial difference is that to access data in dbGaP one has to jump through a lot of hoops, agree to keep data securely, agree not to try to re-identify anybody, and actually apply to get access to data (which will typically only be granted to academics at respected institutions) rather than letting just anybody download it.

      Having had to jump through these hoops I can say it’s a pain, but it’s a small price to pay for patient privacy.

  • A. Tasso

    The issue of wanting to publish more LPUs from the data is a reasonable concern to express given how medicine and public health academia work. But why not just shift the field to more of an economics model where no such incentives exist? Economists generally publish one paper from a dataset, post the data online, and move on.

  • FSE

    > if we allow such concerns to determine our policy, we are effectively saying that fairness to scientists is more important than science itself.

    Clever, but hollow. Because you could equally have said that the problem with data sharing is that it *undermines* scientists. And from a realistic perspective, what is the difference between undermining scientists and undermining science itself?

    • Neuroskeptic

      I don’t think data sharing undermines scientists as a whole. As I see it, it may shift the balance in favor of certain types of scientists (“data parasites”) over others (“data hoarders”) which is arguably unfair to the ones who lose out, but it doesn’t undermine scientists in toto in the same way that e.g. a blanket funding cut would.

      As A. Tasso points out in his comment, in some fields, open datasets are the norm. So the open system can work. It may be unfair to impose this system on a group of scientists who aren’t ready for it, but it’s not undermining anyone.

  • m242424

    Yes the data needs to be released when the result is released. If no result is released then the data should be released after a set period of say 1 year from the end of collection.

  • OWilson

    All data collected and methodology used in studies that are ultimately paid for by the taxpayer, or used as a basis for directing public policy, MUST be openly shared with all, because the intellectual property belongs to the taxpayer.

    A special case may be made for security reasons, but that is what Intelligence Committees are for.

    Remove the “smoke”, and you will eliminate 85% of “conspiracy theories”, and the associated political infighting, not only between the major Parties, but between the government agencies themselves who are often suing each other.

    (Simultaneous hard drive “crashes”, while the Director of the IRS Pleads the Fifth, or the State Department claiming that it will take “75” years to deal with FOI Requests for Clinton emails, should immediately be the subject of a Special Prosecutor Investigation, the perps brought to justice and their supervisors summarily fired.)

    Or we can just accept that 75% of the population know that the government is corrupt, through and through!

  • Jamie

    I’m not sure I agree with you here. You have a problem with the idea that ‘fairness to scientists is more important than science itself’, but does science exist in a vacuum in which there are no scientists? If people are going to spend years and millions of dollars collecting data from intricate experimental designs that they have come up with, why should someone who is not even involved in the inception of the ideas behind the study have equal access to its fruits? It is important that the people who actually conduct research are given the opportunity to get everything out of the data that they have put the effort, financial cost, and opportunity cost into collecting.

    I agree that ultimately the data should be open, but (and perhaps this is what is suggested in some of these articles) if the data is going to be available there should be some kind of cool down period in which the people who collected the data are given ample opportunity to get everything they want out of it – I’m not sure exactly what this would be or how the start of it would be determined though…

    • Flyover_sci

      I second this. Given the current incentive structure for research, isn’t it also unfairly rewarding certain types of science in favor of others? You will de-incentivise certain research efforts (e.g. work with hard to sample clinical populations, large multi-site collaborations, longitudinal studies) that require years on the data collection side, since the strong open data expectation is that on the first paper, the whole data set is released. This is fine if you’re a big established group with lots of funding and can wait until you can submit 5 papers at once, but if I’m a junior collaborator, or smaller group, then why would I bother collecting that data or studying that question?

      Open data advocates like to say that you’ll get more citations, but until there is hard evidence that a lower # of high-citation open data publications (where you might be one of several middle authors) is = to more lower cited first/last author papers in the eyes of granting agencies and hiring/tenure committees, I wonder if this would unfairly push science away from clinical populations, multi-site studies, or longitudinal datasets, or at least severely limit the number of datasets available for analysis.

      • Boris Barbour

        The above two comments argue that nobody will do the work if the data must be shared alongside the first publication. I suggest an experiment where such sharing is a condition of getting the grant. Your prediction is that nobody would apply. My prediction is that plenty of researchers would be happy to accept that condition.

  • Jacob

    One thing people gloss over is the ‘unintended consequence’ effect that the article brought up. If the data needs to be released alongside the paper, it just means scientists are going to delay publication of the first paper until they’ve written followup papers as well. With the net result being a delay of publication and data, instead of just the data.

    I don’t know if I agree quantitatively with their timetable but I think the concept is fair. Maybe a publication embargo (where the data is available but people aren’t allowed to publish on it until after a certain date) is more appropriate, as it allows for checking published results while removing the fear of getting scooped.

  • Neurosiscientist

    Designing experiments is a creative and genuinely scientific work. So no one should be able to publish papers out of my experiments without me as a co-author. Everything else is stealing.

    I absolutely agree that my data has to be available to be reanalyzed and checked by anyone. However, any new publication must be with my name on the paper, or, if I do not agree to that, without it but accompanied by a commentary from my part. Thus I cannot prevent publication of a different opinion or view.

    • Neuroskeptic

      I’d be happy with that system. Credit where credit’s due, but we should work out a system that assigns credit fairly without getting in the way of scientific principles (i.e. in the way of data sharing).

    • Boris Barbour

      I don’t think it is unreasonable for data-gatherers to be given the option of authorship (for instance with a contribution statement explaining the situation), but there should be no influence on access or on publication, even in the case where a reanalysis invalidates the original conclusions. Basically we agree.

  • Sven

    Global Warming ….cough cough

    • LincolnX

      CO2 choking you?

  • David Jarrett

    My memory may be off, but I seem to recall that in the old “Cold Fusion” incident of the late 80’s, the electrochemists involved in making their claims of cold fusion were very reluctant to release their data. That whole debacle should be considered a case study on what happens when data isn’t freely shared…

  • Dan Ashley

    Does this work for NOAA and global warming too?

  • Robert Davidson

    Does anyone know who “The International Consortium of Investigators for Fairness in Trial Data Sharing” is anyway? Their arguments are quite frankly myopic, illogical and inconsistent.

    • Neuroskeptic

      I’ve never heard of them before, and they only seem to exist in the context of this paper.

    • bobroehr

      There is a link with the article (supplementary material?) to a list of those who have signed on. I took some comfort in the fact that so few Americans were on it.

  • ncgh

    Sharing data will not eliminate outright fraud, but it will make a big step in revealing mathematical errors, overlooked alternative hypotheses, or even spotting statistical skewing that the original author may have overlooked.

  • bobroehr

    As I commented elsewhere:

    Perhaps the most depressing aspect of their views is the belief that they are actually the sole proprietors of the data generated by a study. The reality is that the role of other stakeholders, including funders and patients who participated in the study, is just as important in generating that data and should be taken into account.

    “Justice delayed is justice denied” is a central maxim in the area of law. A parallel maxim is equally applicable to the delay of release of information generated in clinical trials. And all for the “convenience” of a senior investigator. A well designed and implemented clinical study should be easy to tabulate at its end and will confirm a hypothesis posed in the protocol. The pressure to publish quickly should be seen as an opportunity to involve less senior investigators in parallel drafting of secondary papers generated by the study so that they are published in a timely manner.

    One encouraging aspect of the effort to dawdle is that relatively few American investigators seem to have signed on to it. Another is that pharma has largely gone along with, and often has led the effort to share clinical data. They realize that it is the only reasonable approach to the exploding volume of information generated by modern “omics.”

    In the end, it comes down to the golden rule; if the investigators do not like the rules requiring data sharing that most major funders are imposing, then they should not take the gold. Let them conduct a study with their own wallets and bodies.

  • daingelt

    You have two systems that countervail one another: a philosophical system that seeks truth and replicability and thus requires some level of altruism (i.e., science), and a status system that passes out rewards based on the innovation/novelty and quantity of published research. Both are absolutely necessary, but they are inherently incommensurable. And of course you have the familiar “free-rider” problems with altruistic systems.

    I don’t think there is any easy solution for this contradiction. At best, I think you will have to steeply reward original data collection…something to compensate researchers who put lots of effort into collecting data only to see it handed over to those who are, essentially, competitors.

    That’s my 2 cents.

