In 2009, Google made available Google Books (also known as the Ngram corpus), a database that now includes over 8 million books from libraries around the world. The corpus contains over 500 billion English words, along with phrases, and is freely available for research use. The Books corpus allows researchers to examine changes in the frequency of word use in books over time, dating back to 1800.
This has led to a lot of striking findings. For instance, it has been shown that “individualistic words and phrases” increased between 1960 and 2008 in American books; that “books average the previous decade of economic misery”; and that “male and female pronoun use reflects the status of women”, among many other claims, some published in the highest-ranked journals.
However, a new paper just published in PLoS ONE could throw a spanner in the works of the thriving Google Books research paradigm.
According to Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds of the University of Vermont:
Our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
What Pechenick et al. found is that over the course of the 20th century, the Books corpus seems to contain an increasing proportion of scientific, medical and technical publications. This is only an inference, because the nature of the corpus means that it doesn’t contain the titles or identities of the books – it’s just a ‘bag of words’. However, the evidence Pechenick et al. put forward is compelling.
For instance, the word Figure (capitalized) has become much more popular over time while figure has not.
Figure, capitalized, is a word used heavily in technical publications, as a caption or a reference to an image. It would only occur in normal text if a sentence started with “figure” and I can’t see that being common. So the rise of Figure is evidence that the corpus is becoming increasingly full of technical texts.
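To make the Figure versus figure comparison concrete, here is a minimal sketch of how a case-sensitive relative frequency could be computed per year. It assumes records shaped like (word, year, count) plus a total word count per year; the names RECORDS, TOTALS and relative_frequency are mine, and the numbers are invented for illustration, not taken from Google Books or from the paper.

```python
from collections import defaultdict

# Toy records in the shape (word, year, match_count).
# These numbers are invented for illustration, NOT real Google Books counts.
RECORDS = [
    ("Figure", 1900, 10), ("figure", 1900, 90),
    ("Figure", 2000, 60), ("figure", 2000, 90),
]

# Hypothetical total number of words in the corpus for each year (also invented).
TOTALS = {1900: 1000, 2000: 1000}


def relative_frequency(records, totals):
    """Return {word: {year: count / total words that year}}.

    Words are kept case-sensitive, so "Figure" and "figure" stay separate.
    """
    freq = defaultdict(dict)
    for word, year, count in records:
        freq[word][year] = count / totals[year]
    return freq


freq = relative_frequency(RECORDS, TOTALS)
for word in ("Figure", "figure"):
    print(word, freq[word])
```

In this toy data the capitalized form rises sixfold while the lowercase form stays flat, which is the kind of pattern being described. Dividing by the yearly total matters because the number of books in the corpus changes from year to year.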
Pechenick et al. present other evidence of this. Between the 1950s and the 1980s, the fastest-rising ‘words’ were ( and ) – brackets, most common in science. Other top risers were model, data, [, ] (more brackets!), percent, % and al (as in et al.).
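As a rough illustration of what “fastest rising” can mean, the sketch below ranks words by the change in their share of all words between two periods. This is a simple stand-in and not necessarily the measure Pechenick et al. used; the function top_risers and the counts in FIFTIES and EIGHTIES are hypothetical and invented for the example.

```python
def top_risers(counts_then, counts_now, n=3):
    """Rank words by the increase in their share of all words between two periods.

    counts_then / counts_now map each word (or symbol) to its raw count.
    Ties are broken alphabetically so the result is deterministic.
    """
    total_then = sum(counts_then.values())
    total_now = sum(counts_now.values())
    words = set(counts_then) | set(counts_now)
    change = {
        w: counts_now.get(w, 0) / total_now - counts_then.get(w, 0) / total_then
        for w in words
    }
    return sorted(words, key=lambda w: (-change[w], w))[:n]


# Invented toy counts for two decades, NOT real corpus data.
FIFTIES = {"the": 700, "figure": 100, "(": 50, ")": 50, "data": 100}
EIGHTIES = {"the": 600, "figure": 100, "(": 100, ")": 100, "data": 100}

print(top_risers(FIFTIES, EIGHTIES, n=2))
```

With these made-up counts the two brackets come out on top, because their share of the text doubles while everything else stays flat or falls.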
Overall, it seems that the composition of the Google Books dataset has changed, making it difficult to interpret any changes in word frequencies. There is one ray of hope, however. Pechenick et al. say that the most recent (second) version of the “English Fiction” sub-corpus, released in 2012, seems to be sufficiently filtered that it may be free of technical texts. The first release, from 2009, was contaminated with scientific words, however. The authors conclude:
When examining these data sets in the future, it will therefore be necessary to first identify and distinguish the popular and scientific components in order to form a picture of the corpus that is informative about cultural and linguistic evolution. For instance, one should ask how much of any observed gender shift in language reflects word choice in popular works and how much is due to changes in scientific norms…
The Google Books corpus’s beguiling power to immediately quantify a vast range of linguistic trends warrants a very cautious approach to any effort to extract scientifically meaningful results. Our analysis provides a possible framework for improvements to previous and future works which, if performed on English data, ought to focus solely on the second version of the English Fiction data set, or otherwise properly account for the biases of the unfiltered corpus.
Pechenick EA, Danforth CM, & Dodds PS (2015). Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution. PLoS ONE, 10 (10). PMID: 26445406