Can Google Books Really Tell Us About Cultural Evolution?

By Neuroskeptic | October 10, 2015 6:34 am

In 2009, Google made available Google Books (also known as the Ngram corpus), a database that now includes over 8 million books from libraries around the world. The books comprise a collection of words and phrases (over 500 billion English words), and this dataset is freely available for research use. The Books corpus allows researchers to examine changes in the frequency of word use in books over time, dating back to 1800.


This has led to a lot of striking findings. For instance, it has been shown that “individualistic words and phrases” increased between 1960 and 2008 in American books; that “books average the previous decade of economic misery”; and that “male and female pronoun use reflects the status of women”, among many other claims, some published in the highest-ranked journals.

However, a new paper just published in PLoS ONE could throw a spanner in the works of the thriving Google Books research paradigm.

According to Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds of the University of Vermont:

Our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.

What Pechenick et al. found is that over the course of the 20th century, the Books corpus seems to contain an increasing proportion of scientific, medical and technical publications. This is only an inference, because the nature of the corpus means that it doesn’t contain the titles or identities of the books – it’s just a ‘bag of words’. However, the evidence Pechenick et al. put forward is compelling.

For instance, the word Figure (capitalized) has become much more popular over time while figure has not.


Figure, capitalized, is a word used heavily in technical publications, as a caption or a reference to an image. It would only occur in normal text if a sentence started with “figure” and I can’t see that being common. So the rise of Figure is evidence that the corpus is becoming increasingly full of technical texts.
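
To make the Figure vs. figure comparison concrete, here is a minimal Python sketch of how a case-sensitive relative frequency could be computed per year. The counts and yearly totals are made-up numbers for illustration only (they are not real Ngram data), and `relative_frequency` is a hypothetical helper, not something from the paper.

```python
def relative_frequency(counts_by_year, totals_by_year):
    """Return {year: count / total} for years that have a non-zero total."""
    return {
        year: counts_by_year[year] / totals_by_year[year]
        for year in sorted(counts_by_year)
        if totals_by_year.get(year)
    }


# Made-up numbers for illustration only, NOT real Ngram data.
# totals: all words in the corpus per year; counts: occurrences of each token.
totals = {1900: 1_000_000, 1950: 2_000_000, 2000: 4_000_000}
counts = {
    "Figure": {1900: 10, 1950: 100, 2000: 800},   # capitalized, caption-style
    "figure": {1900: 200, 1950: 400, 2000: 800},  # lowercase, ordinary usage
}

# Tokens are kept case-sensitive, so "Figure" and "figure" are separate series.
freqs = {word: relative_frequency(c, totals) for word, c in counts.items()}

for word, series in freqs.items():
    print(word, {year: f"{value:.6f}" for year, value in series.items()})
```

In this toy data the raw count of figure grows, but only in step with the size of the corpus, so its relative frequency stays flat, while the relative frequency of Figure climbs. That is the kind of pattern described above.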

Pechenick et al. present other evidence of this. Between the 1950s and the 1980s, the fastest rising ‘words’ were ( and ) – brackets, most common in science. Other top risers were model, data, [, ] (more brackets!), percent, % and al (as in et al.).

Overall, it seems that the composition of the Google Books dataset has changed, making it difficult to interpret any changes in word frequencies. There is one ray of hope, however. Pechenick et al. say that the most recent (second) version of the “English Fiction” sub-corpus, released in 2012, seems to be sufficiently filtered that it may be free of technical texts. The first release, from 2009, was contaminated with scientific words, however. The authors conclude:

When examining these data sets in the future, it will therefore be necessary to first identify and distinguish the popular and scientific components in order to form a picture of the corpus that is informative about cultural and linguistic evolution. For instance, one should ask how much of any observed gender shift in language reflects word choice in popular works and how much is due to changes in scientific norms…

The Google Books corpus’s beguiling power to immediately quantify a vast range of linguistic trends warrants a very cautious approach to any effort to extract scientifically meaningful results. Our analysis provides a possible framework for improvements to previous and future works which, if performed on English data, ought to focus solely on the second version of the English Fiction data set, or otherwise properly account for the biases of the unfiltered corpus.
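
The composition problem can be shown with a little arithmetic. The Python sketch below is an illustration under stated assumptions, not the authors' analysis: it treats the corpus as a simple mix of "popular" and "technical" text, and the per-million-word rates and technical shares are hypothetical numbers chosen to make the point.

```python
def overall_frequency(share_technical, freq_popular, freq_technical):
    """Frequency of a word in a corpus that mixes popular and technical text.

    share_technical is the fraction of the corpus (0 to 1) that is technical.
    """
    return (1 - share_technical) * freq_popular + share_technical * freq_technical


# Hypothetical rates (occurrences per million words), held constant over time
# within each genre. Nobody's word choice changes in this toy example.
FREQ_POPULAR = 50.0
FREQ_TECHNICAL = 400.0

# Only the share of technical text in the corpus changes.
for share in (0.1, 0.3, 0.5):
    rate = overall_frequency(share, FREQ_POPULAR, FREQ_TECHNICAL)
    print(f"technical share {share:.0%}: {rate:.0f} per million words")
```

Here the word's overall rate rises from 85 to 225 per million words even though writers in both genres use it exactly as often as before. A trend in the unfiltered corpus could therefore reflect what was scanned rather than how language changed.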

Pechenick EA, Danforth CM, & Dodds PS (2015). Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution. PLoS ONE, 10 (10) PMID: 26445406

  • Uncle Al

    Druids gained eternal knowledge by eating hazel nuts. Psychology claims to grasp at Dilberts rather than filberts. Psychology embraces Druid craop paradigms: sweeper, harvester, nut cart, and forklift.

  • dconklin

    Headline: “Can Google Books Really Tell Us About Cultural Evolution?”

    Of course, the correct answer is yes and no. I’m not surprised that they found a trend in literature towards the sciences. For one, we are learning more and more in the sciences and since scholars have a ‘publish or perish’ mentality, they publish! Secondly, in conversation in Google books I have suggested certain books which are no longer under copyright and they have outright lied to me and stated that they don’t scan books that are under copyright–either that, or they are abysmally stupid about how long copyrights last–given that the book was in the field of religion, I’ll go with my first hunch.

  • polistra24

    I noticed this recently when checking the word ‘louche’, used by Blumenthal to describe Boo-Hoo-Ner. The Ngram shows lots of ‘louche’ before 1850, as you’d expect, then a quick fade …. and then a new peak of usage after 1980.

    Obviously the second peak is NOT coming from a renaissance of novelists imitating Trollope; it has to be from academic lit crit discussing or quoting the Victorian novelists.

    • Neuroskeptic

      Agreed – although there does seem to be a (smaller) recent increase in “louche” in the English Fiction dataset.

    • lump1

      I’m only familiar with the word “louche” as describing Absinthe turning milky/cloudy when mixed with cold water.

  • Dr Mark R Baker

    May I humbly suggest to Pechenick, Danforth, and Dodds a new methodology – that such claims are weighted by the citation index of the book. This should serve to mitigate the “unread influencer” syndrome.

    • Neuroskeptic

      That would be a great plan – however (afaik) it is impossible because the Books dataset doesn’t include any data on the authorship, citations or title of each book.

      It’s just a collection of words contained in various books.

      So unless a future version of the dataset includes such additional info, your approach would not be possible.

      • Dr Mark R Baker

        Thanks for that observation. You are quite right. The primary data are, of course, the books themselves rather than the Books dataset, and, as is often the case, we have to go back to primary data rather than a flawed subset of the data. BTW, Neuroskeptic is a very familiar tag, but if you have a name I’d like to reference you on that observation.

        • Neuroskeptic

          “Neuroskeptic” or “the Neuroskeptic blog” is absolutely fine! Thanks





About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.

