Google Books Helps Reveal How Words Come and Go

By Sarah Zhang | March 21, 2012 12:05 pm

Who thought a paper on the history of words could have so many graphs? Enter “culturomics,” an emerging field that drops data-crunching into the laps of humanities professors. Armed with the scanned corpus of Google Books, researchers published in 2011 the first culturomics paper, which examined the changing popularity of words over time. The paper hinted at all sorts of possibilities: tracking the evolution of irregular verbs, mapping a politician’s rise to fame, identifying censorship when a name suddenly drops in popularity, etc.

A group of physicists has taken up culturomics with a new study that models the birth and death of words in three languages: Spanish, Hebrew, and English. While they’re crunching serious math, they also have an eye on history. Here are a few of their findings:

Them’s Fighting Words

War has a dramatic effect on the birth and death of words. The figure above depicts variability in how fast words change in popularity: a high variability over a short period of time is likely due to an influx of new words. Comparing the English and Spanish language corpuses during WWII, the researchers found English shakes up while Spanish remains relatively stable. The pattern reflects the relative importance of the war in English and Spanish-speaking parts of the world. Analyses of English through the 19th and 20th centuries also revealed high variability during the Civil War, WWI, and the Vietnam War.

Other historical events also show up in the data. In Hebrew, for example, there was a five-fold increase in word births around 1917, when the Balfour Declaration laid the foundations for modern Israel and revived Hebrew as a spoken language.

Synonym v. Synonym 

The researchers also looked at how synonyms battle it out in print. “Xray” eventually wins out over “radiogram” and “Roentgenogram” in this graph, which is a figure from the paper that we’ve recreated in Google’s ngram viewer to show their changing popularities over time. The changeover seems to happen around 1980.

Update: See the comments for a more complete analysis of this trend using Google’s ngram viewer. 

Tipping Point

30 to 50 years after they’re introduced, words get sorted into the ones that go and the ones that stay. Variability in the growth of word popularity, depicted in this graph, peaks in that 30 to 50 year period, during which words either die a slow death or steadily become more popular. Data like this from culturomics can provide fodder for sociologists or linguists, who might be interested in why a universal tipping point happens at three to five decades: Is it because that’s the length of a generation? Or the lifecycle of events and technologies? (How long will words like “VCR” or “Walkman” still be part of our language?)
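For the curious, here is a rough sketch of how "variability in growth" could be measured. It assumes the growth rate is the change in the logarithm of a word's relative frequency from one year to the next, and that variability is the standard deviation of those rates across words of the same age. The function names and the toy numbers are made up for illustration; this is not the researchers' code or data.

```python
import math
from statistics import pstdev

def growth_rates(freqs):
    """Log growth rate between consecutive years: ln f(t+1) - ln f(t)."""
    return [math.log(b / a) for a, b in zip(freqs, freqs[1:])]

def variability_by_age(series):
    """Standard deviation of growth rates across words, grouped by word age."""
    rates = [growth_rates(f) for f in series.values()]
    max_age = max(len(r) for r in rates)
    result = []
    for age in range(max_age):
        # Collect the growth rate of every word that has data at this age.
        at_age = [r[age] for r in rates if len(r) > age]
        result.append(pstdev(at_age) if len(at_age) > 1 else 0.0)
    return result

# Hypothetical relative frequencies, one value per year since each word appeared.
toy = {
    "word_a": [1.0, 2.0, 4.0, 4.0],  # doubles twice, then levels off
    "word_b": [1.0, 1.0, 1.0, 1.0],  # never changes
}
sigma = variability_by_age(toy)
print(sigma)
```

In this toy case the spread is largest while one word is still taking off and drops to zero once both have settled, which is the kind of peak the graph describes on a much larger scale.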

Overall, the paper concludes that the birth rate of words is decreasing and the death rate increasing as the languages become saturated with all necessary words. Linguist Mark Liberman at the blog Language Log, however, casts an intrigued but cautious eye on the paper’s conclusion about the long-term evolution of words:

One critical consideration, however, is that this paper is not really about words at all — it’s about contiguous letter-strings in optical-character-reader output for scanned printed books. Different inflected forms of a word are different “words”; different word spellings are different “words”; word-fragments split typographically across lines are different “words”; typos are different “words”; OCR errors are different “words”.

Liberman’s critique is well worth a read (especially if you like math and language history!), but it has less of an impact on the findings that depend on more recent word data. As he details in the rest of his post, many problems arise from irregular spelling and use of the long s that may have skewed data from the early 19th century. But scanning and OCR technology will surely get better and eliminate those problems as culturomics moves forward.

All of these data have been sitting in libraries for hundreds of years, but technology has only just let scientists start exploring this searchable database. Heck, why don’t you go explore it yourself on Google’s ngram viewer? That the data are available to any curious person is part of what makes them wonderful.

[via WSJ]

Images courtesy of Petersen et al, Scientific Reports

  • Peter Ellis

    There are three big issues with the roentgenogram/radiogram/xray plot. First, Ngram viewer is case sensitive. Second, it splits hyphenated words into component parts. Thirdly and finally, by far the dominant term is in any case “radiograph”.
    (capitalised form is about 1/5 of the total)
    (again, capitalised = 1/4 to 1/5 of the total)
    (Capitalised form is ~1/5 of the total up till about 1960, thereafter negligible)
    (complex, but in recent years all are significant. Nice to see that the recent slight decline of “xray” and synonyms is paralleled by the rise in “MRI scan”.)

    Putting them all together, we get something like this:

    Bearing in mind that you need to add together all the “x ray” permutations, and increase roentgenogram/radiogram by about 1/5, the real story is that “roentgenogram” got chased out by “radiograph” some time in the 60’s, but fought a rear guard action until further overtaken by the various permutations of “x ray” in the mid 80’s.

  • Woody Tanaka

    “around 1917, when the Balfour Declaration created a Jewish state”

    Historical correction: the Balfour Declaration did not create a Jewish state and, in fact, did not express interest in a state but a “national home” (whatever that is supposed to be).

  • Spike Lenox

    Growing up in the 1950s and 1960s, the only term I ever heard in spoken language for that radiological process was “x-ray”. This leads me to consider two possibilities; 1) that books are laggardly in adopting new terms in comparison to speech, or, 2) (perhaps more likely) the difference in spelling might have thrown the research results off.

  • Charles

    I did a search for Houdini,Cayce. Very interesting result. Looks like in about 1965, their popularity switched, though Houdini later came back.

    This is a valuable tool! Thanks!

  • Sarah Zhang

    @Peter Ellis — Thanks for doing that thorough analysis of synonyms for X-ray — I’ve updated the post to point to your comment. I think that your analysis is a pretty good argument for why having open data is a good thing.

    @Woody Tanaka — Tweaked the wording in the post.

  • Margaret Bartley

    Another big factor in these analyses is the books themselves. In considering the roentgenogram v. xray v. radiogram discussion, as a kid growing up in the 50s and 60s, I never heard the latter two terms, yet I am sure they were preferred terminology in the written material.

    Until recently, authors were very careful to avoid vernacular (slang, or common language) in their writings. If this is carefully phrased in the headlines, then it is not a problem, but if you look at only the words used in the books scanned, this does not really reflect “language”, which most people assume includes the spoken word, as well as the written word.



80beats is DISCOVER's news aggregator, weaving together the choicest tidbits from the best articles covering the day's most compelling topics.