Should the Data be Public?

By JoAnne Hewett | June 23, 2006 2:04 am

This was the hot topic at a panel discussion last week at SUSY06. There were two evening panel sessions at the conference; the first was on the Anthropic Principle and was reported on by Clifford. The second was entitled “Getting Ready for the LHC” and was infinitely more rancorous than the Anthropic session, which was tame by comparison! Who would have guessed that?

The LHC panel members were: Gordy Kane (Michigan), Giacomo Polesello (CERN/Pisa), Maria Spiropulu (CERN), Konstantin Matchev (Florida), Howie Baer (Florida State), Tao Han (Wisconsin) and Tilman Plehn (Edinburgh), with Joe Lykken (Fermilab) serving as moderator. Each panel member spoke for a few minutes, then the floor was turned over for general discussion. However, Tao Han brought up a push-your-buttons topic during his presentation: he proposed that the LHC data should be made available to the community, as maximal openness would only benefit the physics. He admitted that those of us who are not LHC experimenters could not comprehend the raw data, so he proposed that the LHC experimenters store their data in ASCII and make it available to the public. First a gasp and then audible silence swept the audience, as this has been a controversial topic for years.

(Off-topic, but I have to mention another statement of Tao Han’s that I really liked. He asked: How are we prepared for the LHC? He then noted that he himself has been working on this physics since 1987, and after these long years he declared, “I am ready for the LHC!” I could not have empathized more.)

Han’s public data proposal completely dominated the lively and sometimes heated discussion afterwards. Joe Lykken called Maria Spiropulu up to the podium to defend the bastion of the secret data experimental world, noting that the astrophysics community does make its data public (although I could not find a site while looking tonight – anybody know a URL?). Maria stood silent for a minute, then turned directly towards Tao and said a single word: “ASCII?” It brought the house down. Then she started on the usual diatribe on how their data would be useless because we theorists don’t understand the detectors, their data format, blah blah blah. Frankly, I think she (like experimenters in general) misunderstands the point and underestimates us. Tao Han did not ask for raw data – nobody without the proper background or code can comprehend that – he asked for the 4-vectors (the energy and momentum read-outs) in ASCII. In other words, he asked for the data after it had been processed and sifted, and churned into a useable format. It is the form of data that us particle theorists deal with in our Monte Carlo codes and is what the experimenter works with in the end. It is a reasonable request, but not likely to happen.
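To make the request concrete, here is a minimal sketch of what working with 4-vectors in ASCII could look like. The file layout (event number, particle label, then E, px, py, pz in GeV), the sample numbers and the function names `read_events` and `invariant_mass` are all assumptions made up for illustration; no experiment's actual format is implied.

```python
import io
import math

# Hypothetical ASCII layout (an assumption, not any experiment's real format):
#   event_id  particle  E  px  py  pz      (GeV, natural units with c = 1)
# The numbers below are invented so that the arithmetic is easy to check.
SAMPLE = """\
# event particle E px py pz
1 mu+ 50.0  30.0  40.0   0.0
1 mu- 50.0 -30.0 -40.0   0.0
2 e+  45.0   0.0  27.0  36.0
2 e-  45.0   0.0 -27.0 -36.0
"""


def read_events(stream):
    """Group the 4-vectors in an ASCII stream by event number."""
    events = {}
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        event_id, _particle, *components = line.split()
        e, px, py, pz = (float(x) for x in components)
        events.setdefault(int(event_id), []).append((e, px, py, pz))
    return events


def invariant_mass(four_vectors):
    """Invariant mass of a set of particles: m^2 = E^2 - |p|^2 of the summed 4-vector."""
    e = sum(v[0] for v in four_vectors)
    px = sum(v[1] for v in four_vectors)
    py = sum(v[2] for v in four_vectors)
    pz = sum(v[3] for v in four_vectors)
    m2 = e * e - (px * px + py * py + pz * pz)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(m2, 0.0))


events = read_events(io.StringIO(SAMPLE))
masses = {event_id: invariant_mass(vectors) for event_id, vectors in events.items()}

for event_id, mass in sorted(masses.items()):
    print(f"event {event_id}: invariant mass = {mass:.1f} GeV")
```

A sketch like this says nothing about triggers, detector response or backgrounds, which is exactly the part the experimenters argue cannot be handed over in a text file.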

So, just who “owns” this data anyway? The experimenters feel that they worked hard and suffered to build the detector (and they have indeed), so the data and any discoveries are theirs. But, who came up with the theories that are being tested? Who did the calculations to see what type of machine should be built? Who convinced the politicians to build the machine? And last, but by no means least, who footed the bill to pay for the machine? So who really owns this data and why is it kept under lock and key?

(Photos courtesy of Bob Yen.)

  • Brett

    Not all observatories make their data public, but I think it is increasingly common. Here’s one example of an astronomical data archive (more here). The next stage is to combine all the data from various observatories and wavelengths into virtual observatories. It’s certainly understandable that people would want exclusive access to the data they’ve obtained (particularly in particle physics, perhaps, given the years of effort they put into building their detectors, etc.), but some embargo period (say 2 years) should be enough to give them a chance to publish first.

  • Troublemaker
  • franjesus

    Since you asked, useful data archives in astro:

    VizieR: with all kinds of tables already published in papers available in electronic format.


    MAST: many satellites in one place. Includes HST.

    ESO archive

  • chimpanzee

    This reminds me of the Human Genome Project, where there was a HUGE rift between Francis Collins (public funded effort @NIH, “Standard Model” freely available) & Craig Venter (Celera, “Alternative Model”, based on radical new approach using probabilistic algorithms..“selling the data” as part of their business-model). It got really personal & heated, & finally when the HGP was nearly completed..President Clinton got wind of this rift. He ordered a compromise, & both sides came in for arbitration/mediation..get this: over Pizza at someone’s house! “Search for Common Ground” (BTW, the name of an Int’l effort by astronomers. I personally corresponded with some Iranian physics students on eclipse photography), was achieved over “tasty food”..something CVJ would relate to. “The way to a Man’s Heart is thru His Stomach” (the old-fashioned stereotypical role of a wife), & in this case it (kinda) “married” 2 opposing parties.

    So, you get my analogy. Get the 2 parties over some “common ground”, a place/time over some good food, with a mediator. “Get over It [ differences]”, & let’s move on. It reminds me of my own Technology Initiative (where my competition has gone to my website & stolen information/images):

    “I want a CLEAN solution, without any infighting..we want a United Front”
    “United we Stand, Divided we Fall”

    [ my Market is notorious for people who are disorganized: “It’s like trying to herd CATS..they scatter all over the place!” ]

    I think it’s a Universal Problem in any group, there are factions (in this case Experimentalists & Theorists) & someone is being “territorial”. I mentioned this in my conversation to L. Susskind/Stanford, about the whole rift between String Theorists VS opponents.

    BTW, at the dinner cruise I was sitting next to Dr. Han, & he was saying “this data is a treasure for Mankind”. I.e., it’s Knowledge..which should be available to Open Dissemination. It reminds me of a quote from Discovering Women/ “Silicon Vision”:

    “I’m interested in KNOWLEDGE, not Product”
    — Dr. Misha Mahowald

    There is a preliminary 2 hr video of the LHC panel here. It’s a 480 MB file, I’m still waiting for clearance by a party (guess who?) before it’s officially blogged & podcasted to the General Public. I have 2 other video-cameras that shot the LHC panel.

  • Jennifer

    Hi JoAnne, Maria sounds very funny, and Tao very brave, this is a great post, makes me feel like I’m right there. I can understand reasons for and against making the data public. For me it boils down to the format of the public data. Is it in a format that the non-expert could work with but without understanding detectors and the whole system could get results – very wrong results – which would then be published and have to be refuted? Then forget it. But if the data can go through some kind of standardized pipeline processing such that the end products can be mis-analyzed but not in a catastrophic way, then I’m all for it being public.

    My experience is with the Chandra x-ray telescope, a bit with the XMM-Newton x-ray telescope, and the data is private for one year and then goes public. This is because scientists compete for time and if you win time you get one year to publish without being scooped. The exception is a chunk of time that the director has to give out, DDT (director’s discretionary time), so that if something awesome and unexpected happens, I believe recently a white dwarf exploded real time, and scientists let him know they want Chandra to observe this object, he can give them the time but the data is public right away. I think that’s fair since they didn’t go through the peer review process.

    Anyway, Chandra has a great public database, and easy step-by-step data analysis “threads” that even your grandmother could do, I’m serious:

    That one is for the data, this one is for analyzing it:

    Usually the standardized processing I’m talking about takes a while to set up though. I think for Chandra it was 2 years, and for the latest infrared telescope, the beautiful Spitzer, it was the same. In fact now that I’m thinking about it, I remember a friend who is a calibration scientist on Spitzer, thinking he found a planetary disk around a star, very unexpected for this star, and was writing the paper, before recognizing it as a glow from a previous exposure. If this happens once or twice, papers coming out which are absolutely wrong because the experiment is not well enough understood, that is understandable, but if everyone and their mother had access to that data it would be a hep-ph catastrophe. I’m starting to agree with Maria now, but I don’t know enough about the LHC to have a strong opinion….

  • Science


    If it is tax-payer funded, the public owns the data.

    The raw data should be made public. The argument that knowledge is dangerous in large doses belongs to the time of the Inquisition.

    If the experiments are so costly that others can’t readily replicate the data independently to check it, secrecy will be a ready-made cloak for fraud and conspiracy.

    Conspiracy naturally arises when a group of people hold the right to censor the data which is released. This is actually the definition of a “conspiracy”.

    If the data somehow shows a way anyone can destroy the universe or something of that kind, they would have a reason to suppress data. Otherwise they should not set out trying to be control freaks. It leads to the “we’re right and everyone else is wrong because we have secret data you don’t have access to” stuff.

    If you want science based on arbitrary data censorship, go ahead. The devil is always in the detail, at least in physics. This is why the details must be made public when people have to trust the data reduction and processing techniques. There are always alternative theories, and often they require different forms of analysing and reducing raw data.

  • Tony Smith

    JoAnne said, about LHC data:
    “… Tao Han … asked for the 4-vectors (the energy and momentum read-outs) in ASCII. In other words, he asked for the data after it had been processed and sifted, and churned into a useable format. It is the form of data that us particle theorists deal with in our Monte Carlo codes and is what the experimenter works with in the end.
    It is a reasonable request, but not likely to happen. …”.

    I think that it is not only a reasonable request, but a realistic look at the experience of Fermilab shows that it is needed to avoid serious problems with respect to physical interpretation of data. Here (probably in more detail than is normal for a comment, but I feel such detail is necessary to describe this very important issue) is my view of what happened with Fermilab T-quark data:

    Around 1992, there was a consensus in the particle physics community, particularly in Fermilab, that:
    1 – the T-quark was a simple quark with a single simple ground state, just like the other 5 (much lighter) quarks; and
    2 – the T-quark mass was around 160 GeV, based on LEP reports of its best data fit.

    Therefore, when
    Kondo produced an event analysis indicating T-quark mass around 130 GeV, and
    Dalitz-Goldstein-Sliwa produced an event analysis indicating T-quark mass around 120 GeV,
    the Fermilab community reacted with hostility, on two grounds:
    1 – the analyses were done outside the Fermilab structure; and
    2 – the 120-130 GeV range was lighter than the 160 GeV range expected by Fermilab’s consensus.

    Kondo avoided confrontation, withdrawing plans to publish in PRL and restricting publication to the Journal of the Physical Society of Japan, which is less well read outside Japan.
    The Dalitz-Goldstein-Sliwa result was written up by The New Scientist in a very sensationalist way, leading to a public controversy and much ill will and hard feelings, despite the fact (emphasized by Richard Dalitz; see my PS to this comment) that NO data was “stolen” from Fermilab, because the event described had already been published by Fermilab.

    The Fermilab consensus view of a single simple T-quark ground state with a T-quark mass around 160 GeV prevailed and hardened,
    leading to the present situation in which the consensus view is that the T-quark has a single simple ground state at around 173 GeV,
    with all event data indicating other mass/energy levels (especially those around 130 GeV and 225 GeV) being considered spurious and/or background and being ignored.

    Unfortunately (in my opinion) the hardened consensus has prevented consideration of alternative points of view, one of which is:
    The T-quark, due to its much stronger connection with Higgs, should be considered as part of a Standard Model (nonsupersymmetric) Higgs – T-quark – Vacuum system for the non-supersymmetric Standard Model with a high-energy cut-off at the Planck energy of 10^19 GeV
    which, on a plot of Higgs mass m_H v. T-quark mass m_T
    with Triviality Bound for high m_H and Vacuum Stability for high m_T,
    shows 3 important physical points ( m_H, m_T ) (in GeV):

    at the intersection of Triviality and Vacuum Stability Bounds ( 225, 225 )

    decaying by running down the Vacuum Stability Bound to ( 145, 173 )

    decaying by running at fixed Higgs mass of 145 GeV to ( 145, 130 )

    Note that this picture accounts for all 3 reasonable candidate values for m_T observed in Fermilab events: 130 GeV ; 173 GeV ; and 225 GeV.

    If a similar myopic consensus of a simple 173 GeV T-quark mass and a simple single value of a Standard Model Higgs mass were to be applied by LHC with respect to Higgs and T-quark event interpretation,
    not only would T-quark events around 130 GeV and 225 GeV be missed by LHC data collection and/or analysis,
    but only one value (most likely 145 GeV) would become a hardened consensus value for the Standard Model Higgs, with the 225 GeV Higgs state being missed,
    and some interesting interrelationships among the Higgs, the T-quark and the Vacuum would be missed and excluded from the canon of consensus particle physics.
    Such interesting interrelationships might be describable in terms of models related to Nambu-Jona-Lasinio, as constructed by Yamawaki et al and Bardeen et al.

    Tony Smith

    PS – Since Richard Dalitz is now deceased, and since he told me that he felt that insufficient attention had been paid to his true point of view, I hereby quote for the record his letter to The New Scientist (15 August 1992, page 47) regarding the matter, captioned “Top quark”:

    “With regard to William Bown’s article on the so-called discovery of a top quark (This Week, 27 June), when I spoke with him I did not claim to have found the top quark. That is a job for an experimenter, whereas I am a theoretical physicist. The “earlier paper” he mentions gave a speculative analysis of an event already published by the collider detector (CDF) group at Fermilab, but there was no claim that this event was due to top-antitop production and decay.

    We were completely open and told Bown the current situation in this research, and even sent him copies of our three papers on top-antitop event analysis.

    The CDF group at Fermilab is not blocking the publication of any paper of ours. I should note here that we would never publish data from any group, unless it has given us formal permission to do so or has already published it itself. We have never done so in the past, and will not do so in the future.

    When Bown asked me what, supposing that a top quark were found now, would be the effect on the Tevatron Main Injector project, I told him that this upgrading programme would then have the highest priority, since the Tevatron would be the only accelerator capable of top quark studies before the next century. His statement in the last paragraph that money spent on the Tevatron upgrade would be wasted is opposite to what I said.

    Richard Dalitz
    Department of theoretical physics
    University of Oxford”

    PPS – The 1992 LEP indirect electroweak value of a T-quark mass around 160 GeV became a hard consensus value despite the facts that:
    1 – it was based on the (probably unrealistic) assumption of a 300 GeV Higgs mass; and
    2 – the bulk of ALL indirect T-quark mass indications, at the times around 1992, was centered around 130 GeV ( see figure 6 of Chris Quigg’s paper at hep-ph/0001145 which shows that although one 1992 indirect value data point was around 180 GeV, values for 1991 and earlier, and for later in 1992 to 1993, were mostly centered below 150 GeV ).

    PPPS – The Higgs – T-quark – Vacuum system has been described by Froggatt in hep-ph/0307138.
    Alternative analysis of particular Fermilab events was made possible for an outsider like me by publication of:
    details of some D0 events in the 1997 UC Berkeley PhD thesis of Erich Ward Varnes at
    details of some CDF events in hep-ex/9802017 and hep-ex/9810029

  • Peter Erwin

    Essentially all space-based telescopes (Chandra, HST, Spitzer, etc.) make their data publicly available after one year, and often have very well organized interfaces and support, such as pre-processing the data to remove instrumental effects.

    More and more ground-based telescopes are offering publicly available archives, though it can be very hit and miss. The best in terms of availability and depth are probably the UK-Dutch telescopes of the Isaac Newton Group in the Canary Islands, with observations going back to 1987:
    ING archive

    and the European Southern Observatory telescopes located in Chile:
    ESO Archive

    A couple of people have already pointed to the Sloan Digital Sky Survey, which makes both data (images and spectra) and detailed catalogs based on their analyses of the data available.

    The main exceptions, curiously, are US ground-based telescopes. The 8-meter Gemini telescopes (one in Hawaii, one in Chile) do have a public archive, but none of the other US national telescopes, such as those at Kitt Peak in Arizona, do. This is even more true of private telescopes run by universities, such as the 10-meter Keck telescopes in Hawaii. (The US national radio telescopes, on the other hand, do have public archives.)

    (Caveat: I don’t follow solar astronomy, so I don’t know the status of data availability in that field.)

    As for the argument that the instrument builders deserve rewards for their hard work — in astronomy this is usually handled by giving the instrument team extra observing time with the telescope, outside the usual competitive application process. But this may not be a good analogy, since CERN isn’t really like a telescope where you can add new instruments with (relative) ease, or where outsiders apply to use the facility instruments for lots of different short-term projects.

    In practice, if the instrument is sufficiently new and complex, the instrument team are probably the only ones with the knowledge and skills (and software) to properly reduce and analyze the data, which gives them a practical monopoly.

  • Alejandro Rivero

    It is an issue of responsibility. Think of the Pioneer anomaly, with the interesting early data already lost. If the data is secret, the responsibility for keeping it intact increases, as one could want to review it many years later due to new models (or disregarded old ones) or simply due to more powerful methods for background calculation and subtraction. I hope that LEP data, with all these small two and three sigma bumps here and there, will be carefully kept.

  • Adrian Burd

    I find the debate concerning access to data a fascinating one. I’m a cosmologist turned oceanographer. In oceanography it seems to be standard practice that data collected during large programs (do a web search for things like JGOFS – Joint Global Ocean Flux Study – and WOCE) are made public after a reasonable amount of time – usually a couple of years. Yes, even cosmologists and theoretical physicists can access these data over the web. Many of us make versions of our computer codes (including forcing data etc) available as well. Data access policies are often required in the proposal and have to be adhered to.

    Generally, I think this is a good thing. Simulations can be tested using different models but the same forcing set; data can be used for other projects; global data sets can be compiled and cross referenced against each other. There are also disadvantages – QC of the data and metadata, commenting codes so that they are comprehensible, having people email you to ask how to do something with your code, and the embarrassment of having someone point out an error in the data or code (though fortunately this doesn’t happen too often).

    So I would encourage those in favour of public release of the data to keep pressing. The benefit to the community is generally a positive one.


  • Ponderer of Things

    But for all practical purposes, who is going to look at the data and why?
    It’s a lot of data, and I can only imagine that the people interested in it are the people who are doing the analysis anyways. Maria’s point about uselessness of raw data to non-specialists is a good one. However, if you presented “sifted and analyzed” data, then it is pre-selected and modified in some way, which reduces the value that raw, “unbiased” data has. As to the argument that data belongs to the public, does it mean anyone is free to publish analysis of the data, even if they do not belong to the collaboration? That sounds chaotic…

    Bottom line – I don’t think it makes any difference in the end. If the public feels better knowing the data is out there, they should do it. However, I also fail to see why this is an important question. What about other fields of science (or physics) that use taxpayers’ money? Should they make all their unpublished and raw results open to the public, and why?

  • Count Iblis

    Ponderer, it can make a difference if you have some great idea and want to test it using the LHC data. E.g. the DAMA dark matter search project did not make their raw data publicly available. Some people wanted to look for evidence of DM streams which can be extracted from their data.

    Ice Core Data in ASCII Format :)

  • quasar9

    Science, the more I read your comments the more I like your thinking:
    “we’re right and everyone else is wrong because we have secret data you don’t have access to” stuff.

    Technical or ‘knowledge’ advantage is still sought by individuals and nations, whether the race is to develop the first nuclear weapon, the first cure for an illness or disease, or the first whatever. Capitalism, the pursuit of advantage, and personal ‘egos’ all contribute to: (1) Making people try harder (2) Play dirtier (3) Buy knowledge or Minds (body & soul)

    But I agree with you: if the data is owned by the public it is publicly owned, and should be available to the public. And if it is for the ‘common’ good it removes most of these competitive streaks. Surely the real reward for a discovery is the discovery; the material or pecuniary rewards come just the same whichever political ‘government’ you serve. Hence Russia was actually able to keep ahead in certain ‘specific’ fields in weapon development and space research. Proof that competitive capitalism is not the only way to drive or motivate people to try harder or keep the ‘edge’.

    Ask Beckham, from the next post >>> whether he tries to score for money or for the pleasure of scoring. The money he already has. What keeps Microsoft and Apple competitive, money? or the desire to be better or bigger or faster or …

    Another thing I would add, data which can only be interpreted by a ‘few’ can be misused and falsified as much as data only available to a ‘few’ – But of course this doesn’t happen in the real world of ‘altruism’ and the pursuit of excellence per se. – or does it? Q.

  • Peter Woit

    From what I’ve seen of how data is analyzed in a huge experiment like the LHC ones will be, the data is not readily comparable to astronomy data, which I gather is often the output of some conceptually rather simple device at a focal plane. HEP collider data is extremely complex; there are good reasons these experiments will involve 1000 physicists, one to two orders of magnitude more than is typically the case in astronomy, and it’s not just because they need hands to solder stuff. The issues of triggering and backgrounds can be very complex. Honestly, if a theorist published a paper based on his or her analysis of raw LHC data, uninformed by actually working on the experiment, I wouldn’t believe the conclusions.

    After these experiments have been under way for a few years and the experimentalists completely understand their detector and backgrounds, maybe they should release suitably massaged data for theorists to see what they can do with it. But I think, completely independent of the issue of who deserves first access to this data, at least initially it is only going to be the experimentalists who will be able to extract anything reliable out of it.

  • chimpanzee

    [ The correct link for the LHC panel video is here ]

    Professional Ethics

    There are some really interesting cases of “accessible data” in the Astronomy community, being used unethically by outside researchers..Scientific Misconduct:

    Ortiz/CSIC & Brown/Caltech

    G. Marcy/Berkeley & D. Charbonneau/Caltech (at the time)

    In both of these cases, the “accessible data” was used for personal-gain by an outside party, which infringed on the original researcher’s discovery. This is an issue that concerns me, because during my PhD research my ideas were taken (by an egotistical “scoundrel”, without any credit to me), used for a major discovery. (however, their work was sloppy w/errors, & I was given credit in a major journal for finding this. To this day, my case has never been resolved, & I still haven’t been given proper credit..this will change soon, however)

    Here’s what’s going on:

    Competitive Model VS Collaborative/Cooperative Model

    In the zeal for “being 1st” (as part of the Competitive Model), people have resorted to unethical/illegal activity (Scientific Misconduct) to “get ahead”. Nobel Prizes are given to “physicists who make Discoveries, not necessarily the best physicist” (as pointed out to my Dad, by a Nobel Physics Laureate). In that NOVA program on String Theory, I clearly remember a woman researcher (Canada) who emphasized w/waving-arms… the extremely *competitive* nature of Science research.

    I think CERN knows well about this “-” aspect of publicly available data (& other recent cases of Scientific Misconduct that have rocked the Scientific Community, faked-data by Schon/Lucent & Liburdy/Lawrence-Berkeley), & is concerned about “scoundrels” (a minority in the community) who might use the data unethically for self-gain. Myself, I’m in agreement with many of the above posters, of the “Open Architecture” approach to Science: freely available data. However, that only works in a “Perfect World”. As we all know, we live in an Imperfect World (one only has to look at the clowns/morons/fools in the White House) & people often behave according to “their own self-interests” (how my Caltech/CS prof friend phrases it diplomatically):

    “In an “Equal World” [ sarcasm ], some people are more equal than others”

    So the issue is “gray” (not clear-cut in favor of 1 side).

    This issue has affected me personally, since the issue of “data availability” has led to a case of Scientific Misconduct in Pro(fessional)-Am(ateur) collaboration in Astronomy

    [ 2 JPL “scientists” (non-PhD) in charge of the Pro-Am component of the IHW (Int’l Halley Watch back in ’85/’86, which I participated & my comet-photos were published alongside professional astronomers) received an amateur astronomer’s observation of Comet Halley in “outburst”, scoffed at it as being crackpot, went out & confirmed it (“God, we missed it!”), realized they were about to get scooped by an amateur (1st naked-eye recovery of Halley since 1910), STOLE it, reported it as their OWN observation, got written up in the press (& a major Comet book by Dr. XXX/JPL on Comet Halley). Became famous, are in the history books. This was widely known among the amateur-astronomy community, but never pursued further because of the “Power” issue. I myself (& other scientists) have run into issues with the Pro-Am issue (“insulted by amateurs”), & have given up on any further Cooperative/Collaborative activity in Pro-Am collaboration. Those JPL imposters actually had the nerve to call the police, fabricate a LIE about me, had me thrown out of a pro-am “conference” (they have professional astronomers as keynote speakers & talks, incl Dr. Sallie Baliunas/Harvard, Dr. Harrison Schmitt/geologist/Apollo astronaut). I guarantee you, you will shortly hear about ANOTHER Scientific Misconduct scandal, this time involving Caltech/JPL ]

    I end this post with a few statements:

    “The Integrity of the Data”
    [ who made the observation (Experimentalists), who analyzed it (Theorists), who reported it (Accountability) ]

    “Nothing is as Sacred, as the Integrity of the Mind”
    — Emerson
    [ this was Frank Lloyd Wright’s favorite quote ]

    Scientific Method:

    – Open Eyes [ observation ]
    – Open Ears [ observation ]
    ..& above all
    – Open Mind [ analysis, reducing the data ]

    “It’s all about TRUST”
    — Mario Andretti
    [ referring to teamwork among various entities in a Team: “Win as a Team, Lose as a Team” ]

    That recent scandal involving the S. Korean Biotech researcher (faked data), led to an appearance on CNN by one of the journal editors. He stated that “the whole submission/review process is based on TRUST”, & that the peer-reviewers simply DON’T HAVE TIME to go checking for faked-data. True. So, this issue of “Trust” has to be addressed. Possibly, put in *writing* the “guidelines for using the Data”. I guarantee you, all of the above problems came about from the LACK of 1) Rules 2) Enforcement.

    I’m not SURPRISED these issues come about, with the lack of 1) & 2). See above article on Scientific Misconduct by David Goodstein/Caltech, he had to create a course on Science Ethics (by necessity). Same thing happened at my university (UIUC), where M. Louie/EE did the same thing. Happened to me, with my unfortunate experience in Amateur Astronomy (yuck!). I unwittingly became an expert on Scandals (in general), with special focus on Scientific Misconduct.

  • Jeff

    It’s an interesting suggestion, but I have to come down on the side of temporary secrecy here (disclosure: I’m an experimentalist, but not one involved in collider physics).

    I agree with Peter Woit’s point: the data are very complex, and an enormous amount of work goes into truly understanding the relevant systematics, backgrounds, and statistics. It’s not just a photograph (though I’m sure an astronomer could describe how complex interpreting a photograph can be). The LHC could eventually publish the data in the suggested 4-vector form, but I think it should only be done after the physics analysis is mature (i.e. after many publications) and these caveats and simulations can be provided in a useful form. This isn’t something that will happen fast, though I agree it should happen eventually.

    This feeds into another practical issue: having experimentalists who know the machine interpret their own data with internal peer review prevents spurious discovery claims. No insult intended to theorists, but there are a lot of them out there who are pretty data-starved (waiting since 1987, as Tao noted). Most theorists would probably do excellent analyses, but I’d guess that the signal-to-noise ratio of preprints would go down a bit.

    Finally, there’s a sociological point which I’ll make overly-bluntly: experimentalists don’t like to act as trained monkeys for theorists. One motivation for working on a cutting-edge experiment is to be among the first to learn something truly new about nature. Imagine you are an experimental grad student. You already have to compete with hundreds of other students to get an analysis project for your thesis. The data comes in and you’re ready to go, but you’re still an experimentalist – much of your time will still be spent working on the machine itself. Suddenly you see a preprint one morning doing your thesis analysis for you. It turns out the data were released publicly and a theorist who has more time to do analysis than you do got a result first. It isn’t crazy for said student to feel that they were treated a bit unfairly.

    Anyway, I’m generally in favor of public data and I think the LHC data should eventually be made public – just not immediately, and not until the caveats are very mature.

  • graviton383

    I am a phenomenologist & have been waiting for this data even longer, at least since 1983! But to make it available in a raw form can invite problems if people do not know how to analyze it…and it may lead to outrageous claims of discoveries that are not really there. This can be damaging to our field. There is at least one well-known embarrassing example of this in the last 5 years. I DO believe the experimenters themselves should keep the data BUT they should be very open with us theorists about what they have. Let them look for excesses or resonances or what have you in every possible channel or combination thereof. They have the software for this & have had years to grease it…this is something we phenomenologists cannot do no matter what some people may claim and I have been doing this kind of physics for 30 years. Then they should come to ‘us’ when they find something..not just me but everyone they can. Then we all can argue interpretation OPENLY..maybe on an LHCPheno Blog..and propose ways to test our various hypotheses. I cannot imagine a more open approach than this.

    PS..thanks for the blog JoAnne

  • Richard E.


    The Lambda archive at Goddard has vast swathes of CMB data, including both “raw” data that can be used for mapmaking or searches for non-Gaussianities, and “derived” products, such as the actual steps in the Markov chains used for parameter extraction.

    There are differences between the data from an accelerator experiment, and an astronomical image. In an accelerator there will be all sorts of triggers and the “events” are — to some extent — a function of these triggers, as well as the design of the detectors, and the question of *what* counts as “the data” is a little hazy here.

    However, as someone who works primarily with astrophysical data (when I work with data at all, that is!) I think the openness within the astro community has paid a huge dividend.

    The one problem with providing open access to LHC data will be a slew of papers making bogus claims to have found evidence for a variety of bizarre and arcane theories. If 100 people fit 100 models to the same set of data, we can be pretty sure that at least one of them will find support for their ideas at the 3sigma level, purely by chance :-)
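
    As a rough check on the arithmetic behind this point, the sketch below computes the chance that at least one of N independent fits shows a fluctuation of a given size purely by chance. It assumes a two-sided Gaussian tail probability and fully independent tests, which fits to a shared dataset would not really be; the function names are illustrative only.

```python
import math


def two_sided_p_value(n_sigma: float) -> float:
    """Probability of a Gaussian fluctuation at least n_sigma from the mean, in either direction."""
    return math.erfc(n_sigma / math.sqrt(2.0))


def chance_of_any_false_signal(n_tests: int, n_sigma: float) -> float:
    """Probability that at least one of n_tests independent tests fluctuates past n_sigma."""
    p = two_sided_p_value(n_sigma)
    return 1.0 - (1.0 - p) ** n_tests


# Assumed scenario from the comment: 100 independent fits to the same data.
for n_sigma in (2.0, 3.0):
    prob = chance_of_any_false_signal(100, n_sigma)
    print(f"{n_sigma:.0f} sigma, 100 fits: {prob:.2f}")
```

    Under these assumptions, 100 fits give roughly a one-in-four chance of at least one spurious 3sigma result, and a near-certain spurious 2sigma result, so the worry scales quickly with the number of models tried.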

  • Brad Holden

    I am going to ask a practical question. There are two reasons why some astronomy data archives have been very productive. First, there are teams of scientists who do nothing but create and maintain the archives. Second, there are funding programs that will pay for outside scientists to do stuff with the data. When one of the commenters above mentioned that many American private observatories do not have good archives, the simple reason why is money. One of the observatories I frequently use can only afford half a person to work as an archive scientist, as opposed to the teams of people the space based observatories or the Sloan Digital Sky Survey has.

    A serious effort to provide a useful data archive would require the LHC to plan, from the beginning, to have a team of people whose job is to provide a variety of data products ranging from raw streams to some sort of calibrated final output. This team would have to update and maintain not only the archive itself, but also the way users interact with it and the software for utilizing the data. There are a lot of underutilized astronomical data archives in the community because the observatory just sticks raw frames in a database and lets the users work out all of the calibration issues themselves. A bad LHC archive would, in the end, basically require the original experimentalists to use it, unless there is a serious effort in the beginning to build a quality data archive and support it over the lifetime of the LHC. So, does anyone want to do this and pay for it?

  • Rien

    I agree with Peter Woit and Jeff (and I am a particle phenomenologist), the data is simply too complex to analyze for outsiders and I would probably not believe an experimental paper by theorists.

    This was said during the discussion session at SUSY06 by one physicist in the audience who is also editor of Physical Review Letters – it is very hard to judge the quality of an experimental analysis from the outside, but with a collaboration you have a very tough internal review and you can also assume that they know what they are doing with their detector. But with a bunch of theorists…

    Also, you would need a pretty big hard drive to download the raw data. And then, what would you do with it? Write your own analysis code and detector simulations? How would you include all the knowledge of the characteristics and deficiencies of different parts of the detector (such as energy resolution, misidentifications, cracks…)? Good luck.

    As regards four-vectors: the processed data isn’t a collection of particles with well-defined momenta, energies or even identities. It’s a collection of tracks recorded in various parts of the detector.

    My opinion is that if you want to get your hands on the data you have to work with experimentalists. I also think we theorists should get our hands on the data. Isn’t that contradictory?

  • Rien

    I see graviton said the same thing while I was typing…

  • Sean

    In principle I’m in favor of releasing the data, in practice I doubt that it would work. Without an intimate knowledge of the idiosyncrasies of the detector, too many spurious results would be hard to resist.

    In fact I think that people tend to underestimate the extent to which the experimental collaborations will sweep up all the low-hanging fruit, when it comes to interpreting surprises in the data. These folks know about all the major models, and they’ll definitely put a lot of work into matching the data to the theories before they ever release any results. Which is okay — theorists, instead of ambulance-chasing, will be left to do the hard work of puzzling out the results that don’t fit into any of the popular models lying around.

    One related (and, I would think, simpler to solve) problem is the closed nature of the collaboration-based publication process. Given all the blessing and godparenting and so on that must take place before an analysis sees the light of day, it’s hard-to-impossible for an experimentalist to actually collaborate directly with a theorist on attacking some particularly interesting puzzle. And that is just a shame.

  • Troublemaker

    Peter Woit said: …the data is not readily comparable to astronomy data, which I gather is often the output of some conceptually rather simple device at a focal plane.

    This is not always true. Radio interferometry is conceptually quite intricate and requires that a fair amount of thinking be done in Fourier space. GLAST, scheduled to be launched next year, is a stack of particle detectors and calorimeters.

  • anonymous

    “As regards four-vectors: the processed data isn’t a collection of particles with well-defined momenta, energies or even identities. It’s a collection of tracks recorded in various parts of the detector.”

    Not really. The final production data consists of photons, electrons, muons, taus, and jets, constructed out of all of the things recorded in the detector.

    Of course, this means some set of tracks and energy deposits and so on that have passed certain identification criteria, and one must understand fake rates and so on.

    Nonetheless, the final objects of an analysis are, at some approximate level, just four-vectors labelled as a certain particle type, with extra supplementary information.

    Of course, outsiders analyzing data are always troublesome: see e.g. de Boer’s claims of dark matter discovery in EGRET data. On the other hand, after some reasonable length of time, it seems wise to make the data public. (How many interesting LEP analyses remain that aren’t happening because experimenters have [reasonably] moved on?)
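
    To make concrete what “four-vectors labelled as a certain particle type” in ASCII could look like, here is a minimal sketch. The file layout (one particle per line: label, E, px, py, pz in GeV), the sample event, and all names are hypothetical, not any experiment’s actual format, and the sketch ignores the identification criteria, fake rates, and supplementary information discussed above.

```python
import math
from typing import List, NamedTuple


class Particle(NamedTuple):
    label: str   # particle type as assigned by reconstruction, e.g. "e-"
    e: float     # energy (GeV)
    px: float    # momentum components (GeV)
    py: float
    pz: float


def parse_event(text: str) -> List[Particle]:
    """Parse a hypothetical ASCII event: 'label E px py pz' per line, '#' starts a comment line."""
    particles = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        label, e, px, py, pz = line.split()
        particles.append(Particle(label, float(e), float(px), float(py), float(pz)))
    return particles


def invariant_mass(particles: List[Particle]) -> float:
    """Invariant mass of the summed four-vectors, m^2 = E^2 - |p|^2 (natural units)."""
    e = sum(p.e for p in particles)
    px = sum(p.px for p in particles)
    py = sum(p.py for p in particles)
    pz = sum(p.pz for p in particles)
    m2 = e * e - px * px - py * py - pz * pz
    return math.sqrt(max(m2, 0.0))  # clamp tiny negative values from rounding


# Made-up event: two back-to-back, effectively massless electrons.
EVENT = """
# label  E     px     py   pz
e-       45.6  45.6   0.0  0.0
e+       45.6  -45.6  0.0  0.0
"""

print(invariant_mass(parse_event(EVENT)))
```

    The point of the sketch is only that the arithmetic on such a list is simple; everything contentious in this thread lives in how the labels and numbers were produced.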

  • Thomas Dent

    I spoke to one theorist at the conference (Dermisek… DE Kaplan is doing similar things) who has a model with a light Higgs decaying in non-standard channels and can match up to the largest excess of ‘Higgs-like’ events from LEP II … now to really compare the model with data you need to look in more detail at what these events are. But that is still locked up inside what is left of LEP collaborations. So a model is de facto “untestable” if the data are being kept secret and none of the experimentalists feels like making an analysis of it. Maybe LEP data are being ‘kept safe’, but it makes no difference if they will never be seen or used again.

    A few more points. “Experimental papers written by theorists will never be believed” – maybe true, but this doesn’t prove that data should be secret. In fact the more true it is, the bigger disincentive there is for theorists to encounter raw data and the more likely that experimenters will have the field to themselves.

    “Theorists might scoop experimental grad students” … this seems in contradiction with the previous point. If experimentalists are so much better at dealing properly with detectors, backgrounds, etc. then there is no way they can get scooped because you need to control those aspects before making any credible claim of discovery. Something is a bit odd if a major scientific discovery has to be delayed until one student has the time to write up a thesis. But I don’t think experiments really work like that.

    There is no real theory-experiment conflict. The experimental work is for example designing and implementing a trigger and doing detector simulations for some type of signal and calibrating the detector and background once the machine is running, and without this there would be nothing at all for theorists to use. Morally, any ‘discovery’ paper should credit those experimentalists who were crucial to the existence of the data. But traditionally, individuals are *not* credited, the experimental collaboration publishes collectively and is cited collectively and no-one knows exactly who did what. The question is, how is the experiment run (democracy? benevolent dictator?), who decides when and how and by whom data analysis is done, how do individual experimenters get credit for their own work apart from word-of-mouth? These are not questions which involve theorists unless they are really competent to do data analysis. I don’t think experimental secrecy will solve any problems.

    How about if the equations of GR or supersymmetry were kept secret from anyone who wasn’t a theorist…

  • collin

    Fun topic… There are HEP experimentalists who believe the data should be made public. But it isn’t as easy for the experimentalists as just writing out an ascii file of four vectors of the objects in an event.

    The main problem is that in order to make all the data public, you have to understand all the data in a very global way. This isn’t the way particle physics of this scale is traditionally done. On any given analysis, you have some touchstones to ensure your sanity and some control regions to ensure your methods and some signal regions to measure or search for something. And all this is on just a small portion of the data. Generally, any given analysis will have some object ID requirements and some fudge factors, such as the probability for a jet to fake an electron or the k-factor applied to the leading order Z+3jet x-section. While these are somewhat standardized across analyses, not everybody working on every analysis will use the same thing. For example, not everybody on an experiment will agree on what objects are in a given event. If I’m measuring the W mass in the W->e nu channel, my definition of an electron is going to be much tighter (rightly so) than if I’m looking for ZZ->eeee where not only do I have four electrons, but I also have two mass constraints.

    Another problem is that a list of four vectors from ATLAS will not mean the same thing as a list of four vectors from CMS. So then, not only do you have to have the data for each experiment, but also the monte carlo.

    Finally, this is all a moving target. Algorithms change. Object ID changes as parts of the detector are better understood. At what point do you publish the data? When the experiment is over and no new changes are going to be made? As soon as it’s well understood? Once you publish it, do you get to go back and change it? What happens if someone produces a bogus result on old data?

  • Peter Erwin

    One of the things that makes astronomical archives so useful is the fact that many observations made for one purpose will contain data useful for other purposes. A simple example would be a long-exposure image made to study a quasar, which will automatically also have data on galaxies and foreground stars in the same field of view. These are irrelevant for the quasar study, but may turn out to be quite useful later for other projects.

    So an interesting question is whether something like this might be true for LHC datasets. Not being a high-energy physicist in any way, shape, or form, I have no idea. (Thomas Dent’s comment suggests that perhaps there could be such cases.)

  • Peter Erwin

    “Finally, this is all a moving target. Algorithms change. Object ID changes as parts of the detector are better understood. At what point do you publish the data? When the experiment is over and no new changes are going to be made? As soon as it’s well understood? Once you publish it, do you get to go back and change it? What happens if someone produces a bogus result on old data?”

    Some of these issues exist for astronomical archives. For example, some of the instrumental idiosyncrasies of the Hubble Space Telescope become better understood over time, and the post-processing and corrections get updated (or new correction stages are introduced). The archive implements an “on-the-fly” recalibration system, so if you request data from the archive, they are always processed with the latest approved algorithms and calibration files. In a sense, the data are continually “re-published”, and it’s up to the archive users to make sure that the data they retrieved a year ago haven’t been made obsolete by significant improvements in the calibration since then (this is quite rare in practice).

    Of course, the fact that the HST data come in small, discrete chunks (be they images or spectra) makes it easier to implement such a system. But “publishing” data from an instrument does not have to be a one-time, can’t-go-back-and-fix-it affair.
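
    The “on-the-fly” idea can be sketched in a few lines: the archive keeps only the raw data plus a list of approved calibrations, and applies the newest one at request time. Everything here (class names, the linear gain/offset calibration, the version numbers) is a hypothetical toy, not a description of how the HST archive is actually implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Calibration:
    version: int
    gain: float    # toy model: calibrated = gain * raw + offset
    offset: float


class Archive:
    """Toy archive that stores raw data and recalibrates on every request."""

    def __init__(self) -> None:
        self._raw: Dict[str, List[float]] = {}
        self._calibrations: List[Calibration] = []

    def store_raw(self, name: str, counts: List[float]) -> None:
        self._raw[name] = list(counts)

    def approve_calibration(self, calibration: Calibration) -> None:
        self._calibrations.append(calibration)

    def retrieve(self, name: str) -> Tuple[int, List[float]]:
        """Return (calibration version used, calibrated values) using the latest approved calibration."""
        latest = max(self._calibrations, key=lambda c: c.version)
        values = [latest.gain * c + latest.offset for c in self._raw[name]]
        return latest.version, values


archive = Archive()
archive.store_raw("frame-001", [1.0, 2.0])
archive.approve_calibration(Calibration(version=1, gain=2.0, offset=0.0))
first = archive.retrieve("frame-001")

# A later, improved calibration changes what the same request returns.
archive.approve_calibration(Calibration(version=2, gain=2.0, offset=1.0))
second = archive.retrieve("frame-001")
print(first, second)
```

    Returning the calibration version alongside the values is one way to let users notice that data they retrieved a year ago were processed with an older calibration.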

  • lmot

    The fact is that theorists who cultivate relationships with the right experimentalists will often get a heads up on upcoming discoveries months before they are made public. This seems unfair and opens possibilities for corruption, but it is understandable why experimentalists would want to hold on to this power.

  • anonymous

    Collin writes:

    The main problem is that in order to make all the data public, you have to understand all the data in a very global way.

    Of course, you know at least one experimentalist who claims to take such a global, um, vista.

  • superweak

    There’s a catch here: by the time a dataset is understood enough to be ready for release, the collaborations will, as Sean puts it, have swept up the low-hanging fruit. It takes years for complex detectors to be fully understood, and the calibrations, systematics checks, and corrections are on the whole done by people using that information to do an analysis. (Even the Quaero public interface for testing hypotheses against D0’s data restricts you to a few well-understood samples.) Any early public release of all the data would most likely result in lots of junk preprints as people saw badly-understood detector effects and called them new physics — if CDF II had just gone and immediately published the four-vectors rolling out of its reconstruction software, I’m sure someone would have noticed a huge excess of monojet+missing energy events. Certainly there are theorists who are conversant with issues of triggers, fake rate, and such. However they are not paid to sit around all day thinking about how the information from the detector could be wrong, and experimentalists are.

    In my experience experimentalists are extremely suspicious of results, since they know what actually goes into making them — hence the tradition of requiring confirmation from an independent experiment for discovery claims. Without some kind of (at least short-term) data encapsulation, problems will arise: imagine experiments looking at each other’s data! (Related reasons are behind the rise of “blind analyses,” where a collaboration hides its data from itself, for fear that it will find what it wants to find.) Even if only a small fraction of a collaboration reads a paper thoroughly, that’s still an awful lot of experience-years.

    And finally, a point that vaguely amuses me: we are used to a feedback system where either (a) theorists predict something, experiment finds it, theorists claim vindication because the prediction was ante hoc instead of post hoc, or (b) experiment finds something unexpected, everyone scrambles to see how models could accommodate this result, some things can’t and are excluded. What happens to this waltz if the theorists get to look at the data at the same time the experimentalists do?

    [Note: none of this implies that I don’t think processed HEP data should be released after a (longish) while, or that short-term data release might not be a good thing if we find ourselves with a one-detector ILC.]

  • adam

    I’m on the side of the ‘proprietary period then fully open’ model for data distribution (so that the team get to use the data in the short term, then all the raw data and data products get released; the problem here is serving potentially large amounts of raw data, of course, so there might be some fees for getting the raw data).

    The data belongs to taxpayers, so far as I’m concerned.

  • Richard E.

    I have been thinking more about this, and I think the argument that “theorists will always make a hash of data analysis” is bogus.

    Again turning to the analogy with cosmology/astrophysics, I suspect many theorists in cosmology (myself included) are learning more about Bayesian statistics, priors, Markov Chains and all the rest of it than we would have ever dreamed. And I can certainly point to some deeply flawed papers in the literature that might never have seen the light if the raw data was not freely available. However, theorists *can* learn this stuff, and don’t like to look silly in public, so they have plenty of motivation for doing so.

    In the end the theorists will either learn enough of the subtleties to do it themselves, or work with experimentalists who know how to perform the relevant analyses. Many (most?) papers are flawed in some way, and the community would react to the flurry of theorist-written data-driven papers by hiking its overall level of skepticism a notch or two. Just as it did with the arrival of the Arxiv, which does an end-run around peer review (for what that is worth, but don’t get me started)

    As Sean pointed out above, one side-effect of the present system is that it is very hard for particle theorists to collaborate with experimentalists, if the whole collaboration needs to sign off on papers that any single member writes (and this is *after* the data is in the public domain). Again speaking from my own experience, my foray into the world of data-analysis has largely been conducted in collaboration with someone who understood the issues involved at the outset (although not an “experimentalist” in the strict sense of the term), and it is a singularly productive mode of collaboration. To the extent that the “rules” of experimental particle physics discourage this sort of collaboration they are clearly counter-productive.

    Secondly, the cosmological community has benefitted greatly from the development of the Cosmomc package which greatly simplifies the Monte Carlo Markov Chain analyses of cosmological data. (It is not theorist-proof however, as I have seen several publicly displayed figures that showed chains which, to my now practiced eye, were clearly unconverged). My guess is that if more experimental particle physics data was made publicly available it would seed a small industry in the development of software tools that facilitated its analysis.

  • Amara

    No one has yet mentioned the Mars Rover data (link for Spirit). That is one of the most visible and successful open planetary science databases that exists now.

  • Paolo Bizzarri


    let me comment on this phrase of yours:

    “In principle I’m in favor of releasing the data, in practice I doubt that it would work. Without an intimate knowledge of the idiosyncrasies of the detector, too many spurious results would be hard to resist.”

    My idea is that the release of data is really similar to release of software code in open source projects (my professional field).

    For example, for software products like Netscape/Mozilla/Firefox, the code was originally proprietary and secret; then, there was the decision to release the code of the product itself. The idea was to create a large community of developers, able to contribute to the improvement of the product.

    However, for more than one year after the release of the code, the contribution from the developers outside the original development team was minimal. There was a lot of interest from other developers, but they were not able to provide any significant change to the code.

    The reason was understood shortly after. The code itself was only part of the knowledge that had built the product. Each line of code was the result of several decisions made by the developers, and contained assumptions that were not easy to make explicit.

    In short, the code was the result of a long and complex process, but in order to contribute to the code, you first had to become part of the process. Only after the assumptions became clearer was it possible for other people to make significant contributions.

    What is the relation I see with LHC data?

    Data are the result of complex processes, where there is a lot of hidden knowledge that is necessary in order to understand what a number really means in a certain context. People outside the process cannot understand what the raw data really mean without a proper understanding of the process itself.

    However, if the parallel I have made is at all significant, IT IS useful to make data available, as long as you understand that you have to make clear the process through which they are produced and elaborated.

    Then, other people can make useful proposals on how to improve the understanding of data. In fact, making the process public has significantly improved the process itself.

  • David Heffernan

    On Belle we do make small amounts of data available on request, but only a fraction of the total data set. Students use it for high school science projects, for example. Are there any other HEP experiments that do this?

    I think the biggest problem with releasing data from the LHC experiments would be the sheer volume. How much would CMS or ATLAS record in a day? What kind of background reduction are people expecting here?

  • Nathaniel

    I too, am an experimentalist (neutrinos) and I disagree with public data. Technical issues have already been discussed, but here’s the rub:

    After working for six years on MINOS, I will get ONE (count ’em) paper. OK, I’ll be fair. Three papers. Out of two hundred authors. If the data were made available publicly, then this paper wouldn’t even get cited… some theorist would come along, do a slightly more sophisticated analysis, and the paper wouldn’t even get cited.

    Even worse, to make the data public we now have to publish the methods and documentation on how to use the data (which will NEVER just be a list of 4-vectors; there are correlations and resolution functions on every experiment) and that will take the experimentalists even more work.

    Don’t get me wrong.. I love what I do. But I slave over computer code, measure crosstalk, invent calibration sources, crawl under dusty machines, travel, travel, travel, sit on interminable phone calls (every day) so that I can get those few weeks of analysing the data before everyone else. Now I can’t even do that?

    Happily, there’s a simple solution that should satisfy you theorists nicely: JOIN THE EXPERIMENT! I need three more people in my calibration group to measure attenuation curves. I need two more to get automated processing running and document things. We need people to think deeply about statistics, and to make sure our MC models are good. We need people who understand the theory well to suggest what fits to make and the best way of presenting the data. But, of course, that’s a lot of work, so not many of you take us up on the offer.


  • Tony Smith

    Nathaniel, an “experimentalist”, said that he “… disagree[s] with public data. …” because he will only get “… After working for six years, Three papers. Out of two hundred authors …”.

    Nathaniel goes on to say that “… there’s a simple solution that should satisfy you theorists nicely: JOIN THE EXPERIMENT! …”.

    A flaw in Nathaniel’s solution is that not every theorist/analyst will get to be affiliated with the experiment collaboration.

    It seems to me that a more comprehensive, even simpler, solution would be to make the data public, in a format that is the work-product of Nathaniel and his fellow experimenters, by a paper authored by Nathaniel and his fellow experimenters.
    Then, any theorist/analyst (whether or not affiliated) should cite that paper, so that Nathaniel et al would have a very high citation rating.

    Further, if any theorist/analyst might ask Nathaniel et al for help in understanding the data, Nathaniel et al should be listed as coauthors for providing such help.

    I have tried to follow that spirit in stuff that I have written. For example, in my writings about Fermilab T-quark data, I give explicit credit to Erich Ward Varnes whose 1997 UC Berkeley PhD thesis contained data that I found very useful.

    Tony Smith
