Spreadsheet Risks in Science

By Neuroskeptic | August 6, 2017 6:08 am

Errors in the use of spreadsheets such as Microsoft Excel could pose risks for science.

That’s according to a preprint posted on arXiv from Ghada AlTarawneh and Simon Thorne of Cardiff Metropolitan University.

AlTarawneh and Thorne conducted a survey of 17 researchers from the University of Newcastle neuroscience research centre, ranging from PhD students to senior researchers. None of the respondants had any formal, certified training in spreadsheet use, with most (71%) being self-taught. Despite this, the majority of the researchers rated themselves as “intermediate” users of Excel.

Worryingly, only 20% reported that they had their spreadsheets checked for errors by colleagues (‘peer reviewed’). Most said they did all the testing themselves, or even that they did none at all.

spreadsheet_risk_scienceAs AlTarawneh and Thorne comment, these results raise the possibility that researchers may have a tendency towards overconfidence in their use of spreadsheets:

Superficially, spreadsheets seem to be simple and straightforward tools but error rates show they carry significant risk… overconfidence tends to increase specially in people who are highly educated.

Although this was a very small study, the results absolutely ring true to me. I’ve been using Excel since the first year of my PhD, and the vast majority of my colleagues also use it heavily. Very few of us have any formal training. In fact, people come to me for Excel tips, even though I’m entirely self-taught, and I have caught myself making numerous mistakes in my own spreadsheets (and how many are there that I don’t know about?)

I previously wrote about my first major Excel blooper, which generated a spurious positive result by chance, during my PhD:

In my data, as in most people’s, each row was one sample (i.e. a participant) and each column was a variable. What had happened was that at some point I’d tried to take all the data, which was in no particular order, and reorder (sort) the rows alphabetically by subject name to make it easier to read.

How could I screw that up? Well, by trying to select “all the data” but actually only selecting some of the columns. I must have reordered them, but not the others, so all the rows became mixed up.

I’ve been more careful since then, but just the other week I found out about another error I had made in a sheet I created three years ago. This Excel sheet was set up to calculate the total score on a questionnaire based on the raw items. The formulas in the sheet were all working as designed, but I had misunderstood how the questionnaire was meant to be scored, so my design was wrong.  I only spotted the error after revisiting the old spreadsheet for an unrelated reason.

Fortunately, I caught both of the errors described above before the data in question had been published, so I didn’t need to correct or retract any papers. But if I had submitted the miscalculated data for publication, I doubt the peer reviewers would have spotted it. In neuroscience, it is very rare for reviewers to inspect the raw datasheets for errors, or even to have access to them. A worrying situation, indeed.

CATEGORIZED UNDER: papers, science, select, Top Posts
ADVERTISEMENT
  • smut clyde

    Those diagrams were created in Excel, weren’t they? For options that aren’t mutually exclusive and needn’t add up to 100%?
    Just because Excel offers pie charts, doesn’t mean people have to get carried away with over-confidence and self-teaching, and actually use them.

  • Nick

    A colleague and I found that a published article in what I believe is known as a “Tier 1 journal” was based on analyses done in Excel where the wrong rows were selected in the formulas. One of the more amusing consequence of this was that the mean and SD of Condition 1 were included as cases in Condition 2.

    We contacted the author, who promised that a correction would be issued. 18 months later, it still hasn’t been.

  • Nick

    By the way, I have a partial solution to the “only sorted some columns” problem.

    In column A, place the numbers 1-9999 (or however many rows you have).
    In column X (assuming your last column is W), do the same.
    In column B, put the formula =(A1-X1) and drag it to every cell.
    Set conditional formatting in column B so that it goes red if not zero.
    You can make columns A and X zero-width, and B only 15 pixels or so.
    Now, if you do a partial sort, the formula in column B will make it go red. Hopefully you will spot this reasonably soon after the mistake is made.

    • Ruth Dixon

      Another workaround is to shade all the cells in some rows a different colour. After sorting, those rows should still show continuous bands of colour.

  • http://www.mazepath.com/uncleal/qz4.htm Uncle Al

    One baby playing with razor blades is a tragedy. A million babies playing with razor blades demands government standards. Not giving babies razor blades is heinous discrimination.

    • http://www.facebook.com/Kieseyhow KieSeyHow

      LOL, Well said!

  • Cliff Clavin

    No one ever questions the data. They just believe whatever they’re told.

    • Not_that_anyone_cares, but…

      Or, perhaps, don’t believe anything they are told. The few who are in between these two extremes get ridiculed from both sides.

    • http://www.facebook.com/Kieseyhow KieSeyHow

      I question EVERYTHING.

      • Cliff Clavin

        You are rare.

        • http://www.facebook.com/Kieseyhow KieSeyHow

          I have learned that the NUMBER one thing I need to question. constantly, is what assume I know.

  • RMacCoun

    Search on “spreadsheet errors” in Google Scholar for a vast
    literature on this.

    • Systems Modelling

      Many published by Eusprig conferences, where that paper appeared.

  • Jacob

    Read and weep: “Gene name errors are widespread in the scientific literature” https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

    >The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.

    Emphasis mine. Genes with names like “SEPT4” get turned into “32493”. Everybody I work with knows about this problem but it’s just so easy to slip up and make a mistake by accident, particularly in a spreadsheet with thousands of rows

    • CHEMST

      Autocorrect is the enemy.

    • http://www.facebook.com/Kieseyhow KieSeyHow

      I have found that the more a person knows in one field, the less they are likely to know in another. Academics are far too arrogant to spot their own foolish mistakes.

  • Dennis Moore

    I hope real scientists don’t use Excel. I hope they have the good sense to use Gnumeric. Much better open source spread sheet for scientific or any other use.

    • Daniel Rymes

      Most errors described in the article are primarily user mistakes not addresseable by tools (a wrong expression could be written in Google spreadsheet as easy as in R). The sorting problem is an exception – that one could have been prevented by Excel data model (something not available in Gnumeric)…

      • jrkrideau

        Yes but it’s easier to audit a Python or R script, methinks. And either keep the program and the data separate.

        I’ve been reading about spreadsheet errors for years and have come to the conclusion they are too dangerous anything other than a Saturday shopping list and even that’ dicey.

        The Reinhart and Kenneth Rogoff fiasco, alone, was to convine me to that real researchers should never use a spreadsheet.

        http://www.newyorker.com/news/john-cassidy/the-reinhart-and-rogoff-controversy-a-summing-up

        Here, btw, is one Excel cell I found a couple of years ago somewhere. I have no idea any more what it did but there are 1307 characters in one cell!

        =(1/(B1*E2*E4))*(B1*E2*(-0.784314*B5+0.392157*B2*E4)+(B1*E2*(-3.92114*10^13*B5*E3^2*E4+B1*E2*(7.85864*10^9*B5^2-7.85864*10^9*B2*B5*E4+1.96466*10^9*B2^2*E4^2)))/(B1^2*E2^2*E3^2*E4*(7.52884*10^33* B5^2+1.03522*10^33*B2*B5*E4+(2.39982*10^33*B2^2-9.59927*10^33*B2*B4+9.59927*10^33*B4^2)*E4^2)+B1^3*E2^3*(-1.00594*10^30*B5^3+1.50891*10^30*B2*B5^2*E4-7.54455*10^29*B2^2*B5*E4^2+1.25743*10^29*B2^3*E4^3)+SQRT(4*(3.15568*10^23*B1*B5*E2* E3^2*E4-6.32453*10^19*(1*B1*B5*E2-0.5*B1*B2*E2*E4)^2)^3+B1^4*E2^4*(E3^2*E4*(7.52884*10^33*B5^2+1.03522*10^33*B2*B5*E4+(2.39982*10^33*B2^2-9.59927*10^33*B2*B4+9.59927*10^33*B4^2)*E4^2)+B1*E2*(-1.00594*10^30*B5^3+1.50891*10^30*B2*B5^2*E4-7.54455*10^29*B2^2*B5*E4^2+1.25743*10^29*B2^3*E4^3))^2))^(1/3)+7.82767*10^-11*(B1^2*E2^2*E3^2*E4*(7.52884*10^33*B5^2+1.03522*10^33*B2*B5*E4+(2.39982*10^33*B2^2-9.59927*10^33*B2*B4+9.59927*10^33*B4^2)*E4^2)+B1^3*E2^3*(-1.00594*10^30*B5^3+1.50891*10^30*B2*B5^2*E4-7.54455*10^29*B2^2*B5*E4^2+1.25743*10^29*B2^3*E4^3)+SQRT(4*(3.15568*10^23*B1*B5*E2* E3^2*E4-6.32453*10^19*(1*B1*B5*E2-0.5*B1*B2*E2*E4)^2)^3+B1^4*E2^4*(E3^2*E4*(7.52884*10^33*B5^2+1.03522*10^33*B2*B5*E4+(2.39982*10^33*B2^2-9.59927*10^33*B2*B4+9.59927*10^33*B4^2)*E4^2)+B1*E2*(-1.00594*10^30*B5^3+1.50891*10^30*B2*B5^2*E4-7.54455*10^29*B2^2*B5*E4^2+1.25743*10^29*B2^3*E4^3))^2))^(1/3))

        • rameshpmenon

          Any body uses Wolfram alpha, mathematica?
          Its pretty much user friendly and “less complicated” than Excel. I’ve never been able to even “Self-tutor” Excel functions. I am a biologist by the way

          • http://www.facebook.com/Kieseyhow KieSeyHow

            In the old days, there were all sorts of publications available for Excel and Lotus mastery, but sadly access to viable information is gradually eroding to be be replaced by mindless pop-culture and stupidity from the likes of Google and others.

            Try Google Scholar, or Google Verbatim, in addition to the ones you mentioned.

    • neale_blackwood

      In the same way a chain saw is a powerful tool in trained hands and a lethal weapon in untrained and inexperienced hands it only takes training and correct structures to get reliable.robust spreadsheets.

      The study found most people were self taught – self taught chainsaw users are in the grave yard.

      • CHEMST

        There is nothing wrong with self-taught spreadsheet use, The key is to be willing to spend the time to test built-in functions before and after using them. The failures cited are either carelessness or insufficient investments spend in self-learning, neither is the fault of the tool.

      • http://www.facebook.com/Kieseyhow KieSeyHow

        Good comment. The problem is HUMANS and how the brain works, not the software. Most people blame themselves only after looking everywhere else for the problem. My direct experience with humans is that people should start with themselves when looking for issues.

  • polistra24

    Excel is a terrible design that should have been replaced 20 years ago. Its default behavior makes it easy to do lots of things that nobody ever wants to do. Why would you want a separate formula for each cell? If you want to apply a formula to one cell, a CALCULATOR will do it.

    Everything you WANT to do requires special tricks that aren’t listed in the Help file.

    I’ll sometimes use Excel to generate a graph, but when I really need to analyze or tabulate data I’ll write a Python routine. Faster, far more certain, easy to check.

    • http://www.facebook.com/Kieseyhow KieSeyHow

      I disagree. It is excellent software that just requires skill and determination to learn. I had no problem with Lotus123 either. Also excellent software.

  • CHEMST

    One lesson for everyone from this article. Save your data repeatedly with version numbers appended to the name. I’ve done the same stupid thing sorting and am sure almost everyone using Excel has. If you different versions of your work you won’t have to type it all in again. More importantly, used correctly a spreadsheet can be very useful. We do not dispense with all chainsaws because some guy lost a finger somewhere.

  • http://www.dreamgrow.com/ Priit Kallas

    This is basic literacy on that level. Mistakes happen, you have to find ways to catch the errors. If you can’t get Excel right what are the chances the equations you use are right in the first place.

    • http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

      I disagree. It is possible to know enough Excel to almost correctly implement a function, yet make a critical mistake, and lack the Excel skills to spot the mistake.

      You could say that such a person lacks basic Excel literacy but they would disagree. After all, they implemented the function. And they don’t know it’s wrong.

      • http://www.dreamgrow.com/ Priit Kallas

        Yes. I get it. But the same thing is abiut math, programming, and grammar. Your equasions are wrong or you create a programm to run an algorithm and mess that up. Or you express yourself ambiquously.

  • Charles Carter

    Okay, this is almost a settled issue in my experience. The fundamental issue is NOT Excel or spreadsheet errors, but opacity. Errors exist because the work is opaque. The solution is “reproducible research.”

    What I mean, specifically, is using a tool where the raw data is completely open, the calculations (programming) are completely open, and the analysis is completely open. Thus, another researcher can take the exact data, run the exact code, and obtain the exact results, or if the results seem questionable, he can examine the code (all of it) line by line for errors.

    Two very good tools widely accepted are the R programming language with the knitr package, and the Python programming language using Jupyter notebooks. These don’t insure against mistakes and errors, they just make the mistakes and errors obvious to somebody.

    • http://www.facebook.com/Kieseyhow KieSeyHow

      I wonder if any scientists have considered using code repositories or IT solutions like Slack, or Github for their data? This would allow a truly open data policy and provide a path for provisions and suggestions. Why reinvent the damn wheel???

  • joseph2237

    Microsoft has and answer for the troubled PhD #userproblem

  • Pingback: Weekend reads: Predatory fraud; risky spreadsheets; how to report issues in a paper - Retraction Watch at Retraction Watch()

  • Nick Danger

    All of these “solutions” offered here avoid the actual solution: Do not use Excel for data analysis. Use SAS, SPSS, R.

NEW ON DISCOVER
OPEN
CITIZEN SCIENCE
ADVERTISEMENT

Neuroskeptic

No brain. No gain.

About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.

ADVERTISEMENT

See More

@Neuro_Skeptic on Twitter

ADVERTISEMENT

Discover's Newsletter

Sign up to get the latest science news delivered weekly right to your inbox!

Collapse bottom bar
+