The cultural genome: Google Books reveals traces of fame, censorship and changing languages

By Ed Yong | December 16, 2010 2:00 pm

Just as petrified fossils tell us about the evolution of life on earth, the words written in books narrate the history of humanity. These words tell a story, not just through the sentences they form, but in how often they occur. Uncovering those tales isn’t easy – you’d need to convert books into a digital format so that their text can be analysed and compared. And you’d need to do that for millions of books.

Fortunately, that’s exactly what Google have been doing since 2004. Together with over 40 university libraries, the internet titan has thus far scanned over 15 million books, creating a massive electronic library that represents 12% of all the books ever published. All the while, a team from Harvard University, led by Jean-Baptiste Michel and Erez Lieberman Aiden, have been analysing the flood of data.

Their first report is available today. Although it barely scratches the surface, it’s already a tantalising glimpse into the power of the Google Books corpus. It’s a record of human culture, spanning six centuries and seven languages. It shows vocabularies expanding and grammar evolving. It contains stories about our adoption of technology, our quest for fame, and our battle for equality. And it hides the traces of tragedy, including traces of political suppression, records of past plagues, and a fading connection with our own history.

As the team says, the corpus “will furnish a great cache of bones from which to reconstruct the skeleton of a new science.” There are strong parallels to the completion of the human genome. Just as that provided an invaluable resource for biologists, Google’s corpus will allow social scientists and humanities scholars to study human culture in a rigorous way. There’s a good reason that the team are calling this field “culturomics”.

The project began back in 2007, when the duo published a paper showing that verbs become more regular over time. From Beowulf to Harry Potter, the past forms of many irregular verbs have taken on the standard “-ed” suffix, in a way that fits a startlingly simple mathematical formula. On wrapping up the project, they marvelled at how hard it was to collect the data in the first place. Michel says, “We realized that the study of the evolution of culture needed something like a genome, a dataset so powerful that it would enable such analyses to be done rapidly, on any topic, not just irregular verbs. And we noticed that some of those really obscure books we used… had meanwhile popped up on Google Books. We put two and two together.”

Getting Google involved wasn’t hard. Lieberman-Aiden reminisces, “From the earliest stage, they realized that this had a lot of potential. We talked with them, the door was opened to advance the project some, we showed results, the door opened further. Eventually the door was just open.”

The team eventually worked with a third of the full corpus, selecting those books that were dated most accurately and scanned most crisply. They ended up with over 5 million books published in English, French, Spanish, German, Chinese, Russian and Hebrew, and dating back to the 1500s. Together, the texts include 500 billion words (represented in the word cloud above with the most common ones like ‘a’ and ‘the’ removed). As Michel writes:

“The corpus cannot be read by a human. If you tried to read only the entries from the year 2000 alone, at the reasonable pace of 200 words per minute, without interruptions for food or sleep, it would take eighty years. The sequence of letters is one thousand times longer than the human genome.”

Rather than expose the full texts to the public (and themselves to copyright infringement), the team have simply tracked and stored the frequency of billions of words, or sets of words, over time. The result is a “big table” that you can download and explore at www.culturomics.org. Otherwise, you can play around with Google’s real-time browser. In the meantime, here are some of the best results:
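To make the idea of a “big table” a little more concrete, here is a minimal Python sketch of counting how often each word, or run of words, appears in each year, as a share of that year's total. The function name, the toy sentences and the naive whitespace tokeniser are all assumptions made for illustration; this is not the team's actual pipeline.

```python
from collections import Counter, defaultdict


def ngram_frequencies(books, n=1):
    """Return {year: {ngram: relative frequency}} for (year, text) pairs.

    Hypothetical helper: tokenisation is a naive lowercase whitespace split,
    and n-grams never span two different books.
    """
    counts = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[year][" ".join(tokens[i:i + n])] += 1

    table = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        table[year] = {gram: c / total for gram, c in counter.items()}
    return table


# Toy corpus, invented for illustration only.
toy_books = [
    (1900, "the candle was lighted"),
    (1900, "the lamp was lighted"),
    (2000, "the candle was lit"),
]

table = ngram_frequencies(toy_books, n=1)
print(table[1900]["lighted"])  # 2 of the 8 words from 1900
```

Storing only these per-year frequencies, rather than the sentences themselves, is what lets a table like this be shared without exposing the full texts.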

English expands

Contrary to warnings about its imminent demise at the hands of teenagers and Americans, English is booming. In the last 50 years, its vocabulary has expanded by over 70% and around 8500 words are being added every year. The team worked this out by scanning the corpus for solo words that turned up at least once per billion. They took random samples and culled any non-words (“l8r”), typos and foreign words. By the end, they estimated that English had 544,000 words in 1900, rising to 1,022,000 in 2000 (see above left).
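The once-per-billion cut-off amounts to a simple frequency filter. A rough sketch, assuming you already have raw counts per word and the total size of the corpus; the function name and the counts below are invented for illustration, and the team's later culling of typos and foreign words is not modelled.

```python
def common_enough(counts, total_words, threshold=1e-9):
    """Return the words whose relative frequency is at least `threshold`."""
    return {word for word, c in counts.items() if c / total_words >= threshold}


# Invented counts, for illustration only.
counts = {"the": 5_000_000_000, "slenthem": 150, "xqzt": 20}
total = 100_000_000_000  # pretend corpus of 100 billion words

print(sorted(common_enough(counts, total)))  # ['slenthem', 'the']
```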

Dictionaries aren’t keeping pace with this rapid change. Over half of the words added to the American Heritage Dictionary in 2000 were already part of the English language a century ago. There are plenty of missing words too. The current Oxford English Dictionary only has 615,000 solo words, and even proper nouns and compound words can’t explain the gulf between that and the million-plus count from the corpus.

Instead, it seems that modern dictionaries aren’t very good at including rarer words. Both the OED and Merriam-Webster comprehensively list words that are found once in every hundred thousand words, but they only had a quarter of one-in-a-billion words (see above right). Missing words include technical words like ‘aridification’ (the process by which a geographic region becomes dry) and obscure ones like ‘slenthem’ (a musical instrument). This hidden lexicon may be rare but it’s also massive, accounting for around 52% of the English words. The majority of our vocabulary isn’t documented in the big dictionaries.

Grammar evolves

All of this started with verbs, and the new corpus allowed Michel to study their evolution on a grand scale. He found that over the last 200 years, 16% of irregular verbs have become more regular. In the past, ‘chide’ would have become ‘chid’ or ‘chode’; it now simply turns into ‘chided’. Common verbs are more resistant to change. Michel writes, “We found ‘found’ 200,000 times more often than we finded ‘finded’.” For comparison, ‘dwelt’, which is 10 times rarer, is only 60 times more common than ‘dwelled’.

Some groups of verbs are ripe for regularisation, especially those with past forms that end in –t (burn/burnt, smell/smelt, spoil/spoilt). The US has led the charge in this area and while the –t versions have more staying power in Britain, they are losing ground there too. Every year, “a population the size of Cambridge adopts ‘burned’ in lieu of ‘burnt’”.

Some verbs, however, buck the trend. In 1800, you would have ‘lit’ your candle as you ‘woke’, just as a modern person would, but in the intervening years, people would have ‘lighted’ their lamps when they ‘waked’. Meanwhile, ‘snuck’ has appropriately snuck into the language since the 1920s. Around 1% of the English-speaking world makes the switch from ‘sneaked’ to ‘snuck’ every year. Again, the US is leading the way; as Michel dryly notes, “America is the world’s leading exporter of both regular and irregular verbs.”

Farewell history, we hardly knew ye

When the team looked at the frequency of individual years, they found a consistent pattern. In their own words: “‘1951’ was rarely discussed until the years immediately preceding 1951. Its frequency soared in 1951, remained high for three years, and then underwent a rapid decay, dropping by half over the next fifteen years.” But the shape of these graphs is changing. The peak gets higher with every year and we are forgetting our past with greater speed. The half-life of ‘1880’ was 32 years, but that of ‘1973’ was a mere 10 years.
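A “half-life” of this kind can be read as the time it takes for a year's frequency to fall to half of some reference level. The sketch below measures from the peak, which is an assumption; the paper's own starting point and method may differ, and the numbers are invented purely to show the calculation.

```python
def half_life(series):
    """Years from the peak until frequency first falls to half the peak or below.

    `series` maps year -> frequency. Returns None if it never halves.
    Measuring from the peak is an assumption made for this sketch.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return year - peak_year
    return None


# Invented numbers, for illustration only: a peak in 1951 followed by decay.
mentions_1951 = {1950: 2.0, 1951: 10.0, 1955: 8.0, 1960: 6.5, 1969: 5.0, 1980: 3.0}

print(half_life(mentions_1951))  # 18
```

Running the same function over the series for ‘1880’ and ‘1973’ is the kind of comparison that would show the forgetting curve getting steeper.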

The future, however, is becoming ever more easily ingrained. The team found that new technology permeates through our culture with growing speed. By scanning the corpus for 154 inventions created between 1800 and 1960, from microwave ovens to electroencephalographs, they found that more recent ones took far less time to become widely discussed.

15 seconds of fame, 14 seconds, 13…

The corpus allows you to chart the rise and fall of people, as well as verbs and dates. Michel and Lieberman-Aiden found that today’s stars, at the height of their celebrity, are more famous than their historical predecessors, but they’re being forgotten more quickly. The team took every one of the 740,000 people with their own Wikipedia pages, removed those who share a name, and sorted the rest by birth date. They did the same thing with the

They found that in the early 19th century, celebrities started rising to fame at the age of 43 and it took 8 years for their prominence in books to double. By the mid-20th century, they were starting at 29 and doubling in just over 3 years. However, while the spotlight upon them is more intense, their time in it is briefer. Celebrities tend to peak in fame at the ripe old age of 75 (remember, this is measured by their mentions in books). A century ago, it took 120 years after that for their fame to halve; now, it takes just 71.

Help, help, I’m being suppressed…

The absence of words can be just as informative as their presence – they can represent the cultural fingerprints of censorship and suppression. There’s no shortage of examples. Tiananmen Square became massively more common in English books following 1989, but the frequency of the equivalent characters in Chinese texts remained stable. The names of the Hollywood Ten – a group of alleged Communist sympathisers – were mentioned far less often in English texts after 1947.

This repression was never clearer than in Nazi Germany. The corpus carries the traces of the Nazis’ politics, as they banned and burned the works of authors and artists. Jewish artist Marc Chagall became increasingly prominent in English books from 1910 onwards but his presence in German ones plummeted after the mid-20s. He was almost entirely absent between 1936 and 1944. He wasn’t alone.

The names of artists, writers, political academics, historians and philosophers all became increasingly rare among German texts during the Third Reich, while the names of Nazi party members became six times more common (above middle). None of this is surprising in a historical context, but in the future, the corpus could help to identify victims of censorship in a rapid way, for current or recent events.

What next?

This is but the beginning. “The paper is kind of a zoo,” says Michel, “whose exhibits are meant to show the variety of themes that one can explore and the variety of methods that prove useful.” Many of the analyses depended on a starting list (e.g. of verbs or famous people) that was equally tough to put together in the first place. Others relied on far simpler searches.

‘Men’ and ‘women’ are converging in frequency, especially since 1960. ‘The Great War’ fell after 1939 as it took on a new name: ‘World War I’. Peaks in ‘influenza’ match those of previous pandemics. ‘Steak’ has been part of English literature since before 1800 but ‘ice cream’ only started becoming common after 1850, and ‘hamburgers’ rose after 1920. And finally, in Michel’s own words, “‘God’ is not dead; but needs a new publicist” (see the slideshow at the bottom).

‘Evolution’ became increasingly common after 1859 when Darwin published On the Origin of Species, but it started to decline after 1910… that is, until ‘DNA’ started its meteoric rise in the 1950s. Did the double helix reignite public interest in this most important of concepts? Meanwhile, even though Galileo, Darwin and Einstein have all stayed at respectable levels, it’s Freud who has truly stormed into public recognition.

As more books are added to Google’s corpus, from different periods of history and in different languages, the value of the data will grow, ironing out any possible biases in the books included within. However, as Lieberman-Aiden says, “Books are not representative of culture as a whole, even if our corpus contained 100% of all books ever published. Only certain types of people write books and get them published, and that subclass has changed over time, with the advent of things like public literacy.” Eventually, they’ll have to digitise “newspapers, manuscripts, maps, artwork, and a myriad of other human creations.”

Mark Liberman from the University of Pennsylvania (and author of the esteemed Language Log blog) thinks that the corpus certainly opens new avenues for research, but comparisons to the Human Genome Project aren’t quite apt. He says, “The “culturome” is not a very well-defined entity. There isn’t any overall consensus about how to identify the elements that make up a “culture”, or how to inter-relate them, or how to determine whether a piece of text describes one of them.”

There’s also an issue with attitudes among people in the field. “Biologists were already convinced that genes and genomic variation were key to understanding problems in their field,” he adds. “Social scientists and humanists do not now work with large digital text collections, and relatively few of them now believe that they should do so.”

Finally, Liberman says, “Few questions come down to things that can be measured in terms of simple word frequencies,” although he notes that the authors have found some interesting ones. But he says that the real value will come from analysing how words fit into the meaning of their sentences – a far harder task. “You can do this sort of thing by combining sampling techniques with human annotation, or by developing automated “taggers”. [And] the whole collection will be available for researchers to process as they please, which will allow others to attempt more difficult sorts of analysis.”

[Slideshow: God, evolution, flu, food, men and women, scientists, slavery, wars]

Reference: Science http://dx.doi.org/10.1126/science.1199644

A note on language: This was one of the most well-written papers I’ve ever had the pleasure to read, full of wit and flair. I’ve highlighted papers from these researchers before for exactly the same reason and they haven’t disappointed this time round. It’s vexing for a science writer, really – when you could just as well edit a paper down for length rather than translating it, it makes you question the future of your profession!

Comments (35)

  1. By the way, I’ll try and collate examples of some of the more interesting searches that people are doing. For example, I really wanted to see this

  2. Bas

    In the Tiananmen graph, you see a big surge in the Chinese use of Tiananmen after 1976, the year of the April 5th Tiananmen Incident, which apparently got very little western press. With the 1989 Tiananmen square protests, exactly the opposite happened.

  3. pconroy

    Ed,

    What do you make of the small spike in “Freud” in about 1825 – before the time of Freud surely??

  4. Matt B.

    I would like to see the patterns of words being paired with other (redundant) words. Examples: refer back, permeate through, overall consensus, ATM machine, where [is it] at, left-hand turn, etc.

    Then there are some redundancies that arise as retronyms, like “conventional oven”, and, in science fiction, “ground car”.

  5. I’m slowly adding a collection of some of the more interesting/amusing/silly searches to the bottom of the post, which I and others have tried on Twitter. Remember, you can run your own searches now. Some tips: for more recent trends, it pays to either restrict searches to 2000, or to set the smoothing function to 0 or 1.

  6. Matt B.

    They should also compare the regularization of noun forms to those of verbs (antennas vs. antennae, for example).

  7. DES

    Looks like you got the DOI wrong.

  8. Great post!… will have to re-read several times to take it all in… don’t know how you found the time to bring this all together so quickly and so well!

  9. A short FAQ given some of the stuff I’ve seen on Twitter:

    1) Why is there a weird blip around 1910 if I search for internet/gene/etc.?

    Some of the books probably have the wrong date attached to them. Lieberman-Aiden told me that getting the metadata right was one of the hardest challenges. He says, “Many books are misdated, especially back then, and the collection of books is too large to correct this by hand. So we needed to write algorithms to “weed out” books with dubious dates.” There are probably still some errors though.

    2) Why does the trend climb/dip wildly in the last 5 years or so?

    There are probably some sampling biases for more recent books, in terms of which ones have been digitised – bear in mind that the initiative started in 2004. The fact that so many trends show this pattern suggests something like this.

    Also the trends are presented as moving averages of 3 years by default. So the value for 2000 is the average of the values for 1999, 2000 and 2001. You can manipulate this on the ngram browser to set the value to 0. This makes the trends more precise. To see what I mean, try searching for “terrorism” with different smoothing values and see what happens to the curve around 2001.

  10. sam gerrits

    try fight,talk,sit,sleep,eat since 1700. interesting….

  11. pconroy

    This one is interesting:
    http://ngrams.googlelabs.com/graph?content=AOL,Online,Gopher,FTP,Google&year_start=1990&year_end=2008&corpus=5&smoothing=0

    I compare AOL, Online, Gopher, FTP, Google and see that Gopher peaked in 1995, while AOL and FTP in 2003, and a sharp takeoff of Google in 2002

  12. Cathy White

    This data has to be viewed in its proper context. We’re talking about published words, which is just a slice of the entire picture of the language. So, for example, when we’re talking about a word explosion–the addition of 8500 words a year–remember that we’re seeing 8500 new words appearing IN PRINT a year. It doesn’t mean we’re inventing new words at that rate. Could it not be that what we’re seeing is a cultural leveling effect in what is published? In 1900, the published word was quite narrow, as were the audiences. Today we have a wide scope of published works for consumption across social strata: academic works, reference, popular fiction, how to, self-help, graphic novels, a booming children’s market, and of course People Magazine.

  13. Charles Munoz

    3. pconroy Says:
    December 16th, 2010 at 5:18 pm
    Ed,
    What do you make of the small spike in “Freud” in about 1825 – before the time of Freud surely??

    Beethoven’s Ode to Joy was composed in 1823. Is not “Freude” part of its name in German. But was it famous enough for even a small spike so soon?

    Charlezzzzz

  14. Charles Munoz

    Beethoven’s Ode to Joy was composed in 1823. Is not “Freude” part of its name in German? But was it famous enough for even a small spike so soon?

  15. gawp

    There are similar patterns in the medical literature (titles and abstracts), particularly with dates.
    http://www.ogic.ca/mltrends/
    Interestingly, women are uniformly mentioned more than men.

  16. MT-LA

    Ed, this post is made of win! I’m going to have some fun with this. I can’t wait until they add newspapers to the corpus.

    You are fastly becoming one of my favorite bloggers. Keep up the great work.

    (PS…google spell check doesn’t recognize “bloggers” but it does recognize “blogger”?!)

  17. oldbilbo

    That’s fascinating – and was pointed my way by an acquaintance in Science Marketing with Cambridge University.

    I’d thought to subscribe – but couldn’t. I live in England – the natural and spiritual home of English. So do I take it that English is now only for the ‘Merkins? That’s ‘cultural colonialism’…..

  18. E-FLA

    Try “all your base”

  19. MT-LA

    Try “John, Paul, George, Ringo”

    Poor Ringo…never gets any love

    (sorry…don’t know how to post URL’s)

  20. Io

    The u2/beatles one is clearly wrong (what, did nobody mention The Beatles during Beatlemania?), and I realized this is because of case sensitivity. See this for the correct (and vastly different) results.

  21. JLT

    I found this interesting;
    http://ngrams.googlelabs.com/graph?content=vaccination%2Csmall+pox%2Cpolio%2Cmeasles&year_start=1800&year_end=2002&corpus=0&smoothing=3
    It compares vaccination, small pox, measles, and polio. From it you can guess when outbreaks of e.g. measles or polio happened.

  22. Steve

    Interesting that “evolution” increased after publication of On the Origin of Species, because that work never uses the word (search the text in Project Gutenberg). It was freighted with connotations Darwin wanted to avoid, even if he did use words like “evolve”.

  23. Hi, I’m Melanie, PhD Biochemistry, always wanted to be a science writer, but never did well with deadlines. More power to you! I found your blog in 2008 when I left my post doc to start working in science ed/learning technologies, and I was really happy to see that explaining science to non scientists was being done so well. And frankly very happy that I could keep up with real science through your blog without slogging through the data myself. Your post on ballerinas’ movements becoming more extreme was the first one I remember–and after 10 years of fruit flies and TB, it was really refreshing!

    I am working on Immune Attack 2.0. Our hypothesis is that 9th graders and the public can learn complex molecular biology by playing a video game. It’s free to download immuneattack.org. And a lot of great science games are at mygameiq.com.

    I am also concerned, as you are, that actually doing science is such a hard row to hoe… like the tweet you just sent out about the lead author of the breakthrough of the year paper going into finance instead of physics. I’ve been telling all the science ed people who will listen: It may be true that 9th graders steer clear of science because they think its “hard”, but also because it doesn’t make money. Sure, some scientists get rich, but generally, scientists are 32 years old, post doc-ing for $40,000 and looking at a 1 in 9 chance of becoming a professor. Is there any doubt why our high school students don’t go into science? And, influence in our society requires money. How will scientists influence society if we are in labs 12 hours/day, broke and are less likely to be invited to be on boards of trustees, or to be able to afford to run for public office… etc. This is why your blog and other science blogs are so important: you influence people’s thinking about science.

    I think the only way you could improve this blog is by actually showing up in my office with coffee while I read it! Thank you for giving science a good name. I also love that you’ve included a dedication to your wife.
    Thank you!

    Melanie

  24. Aleksandar Kuktin

    @ sam (#10): try adding “fuck” and “sex” to the mix. :)

    I wonder what happened from 1800 to 1810? Apart from better sampling on the Ngram’s side, especially for American English.

  25. Aleksandar Kuktin
  26. What happened was that f started looking like s. http://goo.gl/mEb4m ;-)

  27. Aleksandar Kuktin

    WOW!

    You blink for a second, and off goes your non-bias.

    I wonder if they have a list of all the books that made it into the dataset somewhere? They should, it’s not too useful otherwise. :/

    But you can’t say there aren’t various fun anomalies. :)

    http://ngrams.googlelabs.com/graph?content=computer&year_start=1610&year_end=1620&corpus=0&smoothing=0

  28. This new tool can be used by any company, group or agency. Some discussion from a Navy point of view on the Navy Reads blog about the Navy Professional Reading Program and related books: Navy Reads. “Pacific Ocean, Atlantic Ocean” reveals dominance shift after Battle of 1812; Navy Core Values: “Honor, Courage, Commitment”; etc. Historians, researchers, planners and communicators rejoice… despite the anomalies.

  29. I’ve been wondering about the post-2000 trends as well. After doing a *lot* of ngram searching, I’ve found quite a few which have unusual rising trends, as well as those that have unusual falling trends. While sampling bias is still likely an issue, I’m not wholly convinced it’s not cultural. Let’s not forget that the entire world order rearranged itself in the 1990s :-)

    Anyway, for your viewing pleasure, here are some nifty visualizations of the ngram peak year & frequency of a number of “Good Things”. I’m dubbing them ‘memeograms’. Enjoy!

    http://brainoids.wordpress.com/2010/12/29/introducing-the-memeogram/

  30. Great post, and a much better overview of the research than that written by the NYT. I’m having a lot of fun checking out some of the ones you’ve listed, as well as formulating my own. It’s actually becoming somewhat addictive, so I’ve started a side blog to collect a few.

    Your interpretation of what this project signifies in terms of the story of our culture and collective imagination is right on.

    Thanks :)
    http://www.theexaminingroom.com/2010/12/google-ngrams-are-pithy/
    http://pithyngrams.blogspot.com

  31. Great article, and like drcharles mentions, much more representative of the interesting research than other articles I’ve read.

    At the risk of throwing any possibility of self effacement to the wind….I wrote a thought piece on Ngram Viewer, how some data can be interestingly manipulated, and suggestion of what the tool could mean for the scientific method.

    I’d be interested to hear people’s thoughts! :)

    http://wp.me/p155km-44
