Just as petrified fossils tell us about the evolution of life on earth, the words written in books narrate the history of humanity. These words tell a story, not just through the sentences they form, but in how often they occur. Uncovering those tales isn’t easy – you’d need to convert books into a digital format so that their text can be analysed and compared. And you’d need to do that for millions of books.
Fortunately, that’s exactly what Google have been doing since 2004. Together with over 40 university libraries, the internet titan has thus far scanned over 15 million books, creating a massive electronic library that represents 12% of all the books ever published. All the while, a team from Harvard University, led by Jean-Baptiste Michel and Erez Lieberman Aiden, have been analysing the flood of data.
Their first report is available today. Although it barely scratches the surface, it’s already a tantalising glimpse into the power of the Google Books corpus. It’s a record of human culture, spanning six centuries and seven languages. It shows vocabularies expanding and grammar evolving. It contains stories about our adoption of technology, our quest for fame, and our battle for equality. And it hides the traces of tragedy, including signs of political suppression, records of past plagues, and a fading connection with our own history.
As the team says, the corpus “will furnish a great cache of bones from which to reconstruct the skeleton of a new science.” There are strong parallels to the completion of the human genome. Just as that provided an invaluable resource for biologists, Google’s corpus will allow social scientists and humanities scholars to study human culture in a rigorous way. There’s a good reason that the team are calling this field “culturomics”.
The project began back in 2007, when the duo published a paper showing that verbs become more regular over time. From Beowulf to Harry Potter, the past forms of many irregular verbs have taken on the standard “-ed” suffix, in a way that fits a startlingly simple mathematical formula. On wrapping up the project, they marvelled at how hard it was to collect the data in the first place. Michel says, “We realized that the study of the evolution of culture needed something like a genome, a dataset so powerful that it would enable such analyses to be done rapidly, on any topic, not just irregular verbs. And we noticed that some of those really obscure books we used… had meanwhile popped up on Google Books. We put two and two together.”
Getting Google involved wasn’t hard. Lieberman-Aiden reminisces, “From the earliest stage, they realized that this had a lot of potential. We talked with them, the door was opened to advance the project some, we showed results, the door opened further. Eventually the door was just open.”
The team eventually worked with a third of the full corpus, selecting those books that were dated most accurately and scanned most crisply. They ended up with over 5 million books published in English, French, Spanish, German, Chinese, Russian and Hebrew, and dating back to the 1500s. Together, the texts include 500 billion words (represented in the word cloud above with the most common ones like ‘a’ and ‘the’ removed). As Michel writes:
“The corpus cannot be read by a human. If you tried to read only the entries from the year 2000 alone, at the reasonable pace of 200 words per minute, without interruptions for food or sleep, it would take eighty years. The sequence of letters is one thousand times longer than the human genome.”
Rather than expose the full texts to the public (and themselves to copyright infringement), the team have simply tracked and stored the frequencies of billions of words, or sets of words, over time. The result is a “big table” that you can download and explore at www.culturomics.org. Otherwise, you can play around with Google’s real-time browser. In the meantime, here are some of the best results:
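The idea behind that “big table” is simple to sketch: for every word or phrase (an “n-gram”), record how often it appears in books from each year. Here’s a minimal illustration of the approach, assuming a toy corpus of (year, text) pairs – the field names and format are mine, not Google’s actual schema:

```python
from collections import Counter, defaultdict

def build_ngram_table(records, n=1):
    """Count n-gram occurrences per year from (year, text) pairs.

    records: iterable of (year, text) tuples -- a stand-in for the
    scanned books; the real corpus is vastly larger.
    """
    table = defaultdict(Counter)  # year -> Counter of n-grams
    for year, text in records:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            table[year][ngram] += 1
    return table

# A toy corpus standing in for millions of books
books = [
    (1900, "the horse drew the carriage"),
    (1950, "the car passed the horse"),
    (1950, "the television was on"),
]
counts = build_ngram_table(books, n=1)
print(counts[1950]["the"])  # → 3
```

In the real dataset the counts are normalised by the total number of words published each year, so that a term’s trajectory isn’t swamped by the sheer growth in book publishing.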
Contrary to warnings about its imminent demise at the hands of teenagers and Americans, English is booming. In the last 50 years, its vocabulary has expanded by over 70% and around 8,500 words are being added every year. The team worked this out by scanning the corpus for solo words that turned up at least once per billion words. They took random samples and culled any non-words (“l8r”), typos and foreign words. By the end, they estimated that English had 544,000 words in 1900, rising to 1,022,000 in 2000 (see above left).
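The cutoff-and-cull method can be mimicked on any corpus: keep only word forms whose relative frequency clears one part per billion, then discard obvious non-words. The threshold and the idea of culling come from the paper; the cleaning rule below (alphabetic characters only) is a crude stand-in for the team’s manual checks:

```python
from collections import Counter

def estimate_vocabulary(tokens, threshold=1e-9):
    """Count distinct word forms above a relative-frequency cutoff.

    tokens: list of lowercase word forms from the corpus.
    threshold: minimum relative frequency, e.g. once per billion words.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    min_count = max(1, int(threshold * total))
    # Keep only alphabetic forms -- a rough proxy for the team's
    # culling of typos, numbers and non-words like 'l8r'.
    return sum(1 for word, c in counts.items()
               if c >= min_count and word.isalpha())

words = "the cat sat on the mat the cat l8r 42".split()
print(estimate_vocabulary(words))  # → 5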
Dictionaries aren’t keeping pace with this rapid change. Over half of the words added to the American Heritage Dictionary in 2000 were already part of the English language a century ago. There are plenty of missing words too. The current Oxford English Dictionary only has 615,000 solo words, and even proper nouns and compound words can’t explain the gulf between that and the million-plus count from the corpus.
Instead, it seems that modern dictionaries aren’t very good at including rarer words. Both the OED and Merriam-Webster comprehensively list words that are found once in every hundred thousand words, but they only had a quarter of one-in-a-billion words (see above right). Missing words include technical ones like ‘aridification’ (the process by which a geographic region becomes dry) and obscure ones like ‘slenthem’ (a musical instrument). This hidden lexicon may be rare but it’s also massive, accounting for around 52% of English words. The majority of our vocabulary isn’t documented in the big dictionaries.
All of this started with verbs, and the new corpus allowed Michel to study their evolution on a grand scale. He found that over the last 200 years, 16% of irregular verbs have become more regular. In the past, ‘chide’ would have become ‘chid’ or ‘chode’; now it simply turns into ‘chided’. Common verbs are more resistant to change. Michel writes, “We found ‘found’ 200,000 times more often than we finded ‘finded’.” For comparison, ‘dwelt’, which is 10 times rarer, is only 60 times more common than ‘dwelled’.
Some groups of verbs are ripe for regularisation, especially those with past forms that end in –t (burn/burnt, smell/smelt, spoil/spoilt). The US has led the charge in this area and while the –t versions have more staying power in Britain, they are losing ground there too. Every year, “a population the size of Cambridge adopts ‘burned’ in lieu of ‘burnt’”.
Some verbs, however, buck the trend. In 1800, you would have ‘lit’ your candle as you ‘woke’, just as a modern person would, but in the intervening years, people have ‘lighted’ their lamps when they ‘waked’. Meanwhile, ‘snuck’ has appropriately snuck into the language since the 1920s. Around 1% of the English-speaking world makes the switch from ‘sneaked’ to ‘snuck’ every year. Again, the US is leading the way; as Michel dryly notes, “America is the world’s leading exporter of both regular and irregular verbs.”
When the team looked at the frequency of individual years, they found a consistent pattern. In their own words: “’1951’ was rarely discussed until the years immediately preceding 1951. Its frequency soared in 1951, remained high for three years, and then underwent a rapid decay, dropping by half over the next fifteen years.” But the shape of these graphs is changing. The peak gets higher with every year and we are forgetting our past with greater speed. The half-life of ‘1880’ was 32 years, but that of ‘1973’ was a mere 10 years.
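The half-life figures above can be computed mechanically: find the peak frequency of a year’s mentions, then count how long it takes to fall below half of that. A sketch with invented numbers (the real curves come from the corpus):

```python
def half_life(freqs, peak_year):
    """Years until frequency first drops below half its peak value.

    freqs: dict mapping year -> relative frequency of the term.
    """
    peak = freqs[peak_year]
    for year in sorted(y for y in freqs if y > peak_year):
        if freqs[year] < peak / 2:
            return year - peak_year
    return None  # never decayed below half within the data

# Invented frequencies for mentions of '1950', per million words
mentions = {1950: 100, 1955: 80, 1960: 60, 1965: 45, 1970: 30}
print(half_life(mentions, 1950))  # → 15
```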
The future, however, is becoming ever more easily ingrained. The team found that new technology permeates through our culture with growing speed. By scanning the corpus for 154 inventions created between 1800 and 1960, from microwave ovens to electroencephalographs, they found that more recent ones took far less time to become widely discussed.
The corpus allows you to chart the rise and fall of people, as well as verbs and dates. Michel and Lieberman-Aiden found that today’s stars, at the height of their celebrity, are more famous than their historical predecessors, but they’re being forgotten more quickly. The team took every one of the 740,000 people with their own Wikipedia pages, removed those who share a name, and sorted the rest by birth date.
They found that in the early 19th century, celebrities started rising to fame at the age of 43 and it took 8 years for their prominence in books to double. By the mid-20th century, they were starting at 29 and doubling in just over 3 years. However, while the spotlight upon them is more intense, their time in it is briefer. Celebrities tend to peak in fame at the ripe old age of 75 (remember, this is measured by their mentions in books). A century ago, it took 120 years after that for their fame to halve; now, it takes just 71.
The absence of words can be just as informative as their presence – they can represent the cultural fingerprints of censorship and suppression. There’s no shortage of examples. Tiananmen Square became massively more common in English books following 1989, but the frequency of the equivalent characters in Chinese texts remained stable. The names of the Hollywood Ten – a group of alleged Communist sympathisers – were mentioned far less often in English texts after 1947.
This repression was never clearer than in Nazi Germany. The corpus carries the traces of the Nazis’ politics, as they banned and burned the works of authors and artists. The Jewish artist Marc Chagall became increasingly prominent in English books from 1910 onwards, but his presence in German ones plummeted after the mid-1920s. He was almost entirely absent between 1936 and 1944. He wasn’t alone.
The names of artists, writers, political academics, historians and philosophers all became increasingly rare among German texts during the Third Reich, while the names of Nazi party members became six times more common (above middle). None of this is surprising in a historical context, but in the future, the corpus could help to identify victims of censorship in a rapid way, for current or recent events.
This is but the beginning. “The paper is kind of a zoo,” says Michel, “whose exhibits are meant to show the variety of themes that one can explore and the variety of methods that prove useful.” Many of the analyses depended on starting lists (e.g. of verbs or famous people) that were themselves tough to put together in the first place. Others relied on far simpler searches.
‘Men’ and ‘women’ are converging in frequency, especially since 1960. ‘The Great War’ fell after 1939 as it took on a new name: ‘World War I’. Peaks in ‘influenza’ match those of previous pandemics. ‘Steak’ has been part of English literature since before the 1800s but ‘ice cream’ only started becoming common after 1850, and ‘hamburgers’ rose after 1920. And finally, in Michel’s own words, “‘God’ is not dead; but needs a new publicist” (see the slideshow at the bottom).
‘Evolution’ became increasingly common after 1859 when Darwin published On the Origin of Species, but it started to decline after 1910… that is, until ‘DNA’ started its meteoric rise in the 1950s. Did the double helix reignite public interest in this most important of concepts? Meanwhile, even though Galileo, Darwin and Einstein have all stayed at respectable levels, it’s Freud who has truly stormed into public recognition.
As more books are added to Google’s corpus, from different periods of history and in different languages, the value of the data will grow, ironing out any possible biases in the books included within. However, as Lieberman-Aiden says, “Books are not representative of culture as a whole, even if our corpus contained 100% of all books ever published. Only certain types of people write books and get them published, and that subclass has changed over time, with the advent of things like public literacy.” Eventually, they’ll have to digitise “newspapers, manuscripts, maps, artwork, and a myriad of other human creations.”
Mark Liberman from the University of Pennsylvania (and author of the esteemed Language Log blog) thinks that the corpus certainly opens new avenues for research, but comparisons to the Human Genome Project aren’t quite apt. He says, “The ‘culturome’ is not a very well-defined entity. There isn’t any overall consensus about how to identify the elements that make up a ‘culture’, or how to inter-relate them, or how to determine whether a piece of text describes one of them.”
There’s also an issue with attitudes among people in the field. “Biologists were already convinced that genes and genomic variation were key to understanding problems in their field,” he adds. “Social scientists and humanists do not now work with large digital text collections, and relatively few of them now believe that they should do so.”
Finally, Liberman says, “Few questions come down to things that can be measured in terms of simple word frequencies,” although he notes that the authors have found some interesting ones. But he says that the real value will come from analysing how words fit into the meaning of their sentences – a far harder task. “You can do this sort of thing by combining sampling techniques with human annotation, or by developing automated ‘taggers’. [And] the whole collection will be available for researchers to process as they please, which will allow others to attempt more difficult sorts of analysis.”
Reference: Science http://dx.doi.org/10.1126/science.1199644
A note on language: This was one of the best-written papers I’ve ever had the pleasure to read, full of wit and flair. I’ve highlighted papers from these researchers before for exactly the same reason and they haven’t disappointed this time round. It’s vexing for a science writer, really – when you could just as well edit a paper down for length rather than translating it, it makes you question the future of your profession!
More on language:
- New languages evolve in rapid bursts
- New Nicaraguan sign language shows how language affects thought
- Guerrilla reading – what former revolutionaries tell us about the neuroscience of literacy
- The evolution of the past tense – how verbs change over time
- Bacteria and languages reveal how people spread through the Pacific
- Language evolution witnessed in lab experiments
More interesting/amusing/silly searches:
- Sex and love
- Aluminum and aluminium
- Television, radio, internet and newspaper
- Pirate, ninja, robot and zombie
- Schizophrenia and hysteria
- Correlation and causation
- Faith and reason
- Various dinosaurs
- Dude and bro
- Swear words
- Utopia and apocalypse
- Tortoise and hare
- U2 and Beatles
- Awesome and cats
- ‘It all went wrong’ and ‘We can fix it’
- War and peace
- Capitalism and communism
- Global warming, climate change, acid rain, ozone layer, deforestation