Google Books Helps Reveal How Words Come and Go
Who thought a paper on the history of words could have so many graphs? Enter “culturomics,” an emerging field that drops data-crunching into the laps of humanities professors. Armed with the scanned corpus of Google Books, researchers published the first culturomics paper in 2011, which examined the changing popularity of words over time. The paper hinted at all sorts of possibilities: tracking the evolution of irregular verbs, mapping a politician’s rise to fame, identifying censorship when a name suddenly drops in popularity, etc.
A group of physicists has taken up culturomics with a new study that models the birth and death of words in three languages: Spanish, Hebrew, and English. While they’re crunching serious math, they also have an eye on history. Here are a few of their findings:
Them’s Fighting Words
War has a dramatic effect on the birth and death of words. The figure above depicts variability in how fast words change in popularity: a high variability over a short period of time is likely due to an influx of new words. Comparing the English and Spanish language corpuses during WWII, the researchers found English shakes up while Spanish remains relatively stable. The pattern reflects the relative importance of the war in English and Spanish-speaking parts of the world. Analyses of English through the 19th and 20th centuries also revealed high variability during the Civil War, WWI, and the Vietnam War.
Other historical events also show up in the data. In Hebrew, for example, there was a five-fold increase in word births around 1917, when the Balfour Declaration laid the foundations for modern Israel and revived Hebrew as a spoken language.
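For readers who want to see what “variability in how fast words change in popularity” might look like as a calculation, here is a minimal sketch. It assumes the measure is the standard deviation of year-over-year log growth rates taken across words, which is one common way to quantify this kind of fluctuation and not necessarily the exact procedure in the paper. The function names and the toy frequencies are made up for illustration.

```python
import math
from statistics import pstdev


def log_growth_rates(freqs):
    """Year-over-year log growth rates: r(t) = ln f(t+1) - ln f(t)."""
    return [math.log(later / earlier) for earlier, later in zip(freqs, freqs[1:])]


def growth_variability(series_by_word):
    """Standard deviation of growth rates across words, one value per year step."""
    rates = [log_growth_rates(freqs) for freqs in series_by_word.values()]
    # zip(*rates) groups together the growth rates of all words for the same year step
    return [pstdev(step) for step in zip(*rates)]


# Hypothetical relative frequencies over four consecutive years (not real data)
toy = {
    "stable": [1.0, 1.0, 1.0, 1.0],
    "rising": [1.0, 2.0, 4.0, 4.0],
    "fading": [4.0, 2.0, 1.0, 1.0],
}

variability = growth_variability(toy)
print(variability)
```

In this toy example the first two year steps show high variability because one word is doubling while another is halving, and the last step drops to zero once all three words settle down. A spike in a real corpus would be read the same way: many words gaining or losing ground at once.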
Synonym v. Synonym
The researchers also looked at how synonyms battle it out in print. “Xray” eventually wins out over “radiogram” and “Roentgenogram” in this graph, which is a figure from the paper that we’ve recreated in Google’s ngram viewer to show their changing popularities over time. The changeover seems to happen around 1980.
Update: See the comments for a more complete analysis of this trend using Google’s ngram viewer.
Thirty to 50 years after they’re introduced, words get sorted into the ones that go and the ones that stay. Variability in the growth of word popularity, depicted in this graph, peaks in that 30 to 50 year period, during which words either die a slow death or steadily become more popular. Data like this from culturomics can provide fodder for sociologists or linguists, who might be interested in why a universal tipping point happens at three to five decades: Is it because that’s the length of a generation? Or the lifecycle of events and technologies? (How long will words like “VCR” or “Walkman” still be part of our language?)
Overall, the paper concludes that the birth rate of words is decreasing and the death rate increasing as the languages become saturated with all necessary words. Linguist Mark Liberman at the blog Language Log, however, casts an intrigued but cautious eye on the paper’s conclusion about the long-term evolution of words:
One critical consideration, however, is that this paper is not really about words at all — it’s about contiguous letter-strings in optical-character-reader output for scanned printed books. Different inflected forms of a word are different “words”; different word spellings are different “words”; word-fragments split typographically across lines are different “words”; typos are different “words”; OCR errors are different “words”.
Liberman’s critique is well worth a read (especially if you like math and language history!), but it has less of an impact on the findings that depend on more recent word data. As he details in the rest of his post, many problems arise from irregular spelling and use of the long s, which may have skewed data from the early 19th century. But scanning and OCR technology will surely get better and reduce those problems as culturomics moves forward.
All of these data have been sitting in libraries for hundreds of years, but technology has only just let scientists start exploring them as a searchable database. Heck, why don’t you go explore for yourself on Google’s ngram viewer? That the data are available to any curious person is part of what makes them wonderful.
[via WSJ]
Images courtesy of Petersen et al, Scientific Reports