DISCOVER Magazine. Science, Technology and The Future
Not Exactly Rocket Science

Using our powers for good – how web security software can help to transcribe old books

What would you do if someone asked you to help transcribe an old book onto a website? Chances are, you’d say no on the basis that you have other things to do, or simply that it just doesn’t sound very interesting. And yet, millions of people every day are helping with precisely this task, and most are completely unaware that they’re helping out.

It’s all thanks to a computer program developed by Luis von Ahn and colleagues at Carnegie Mellon University. Their goal was to slightly alter a simple task that all web users encounter and convert it from wasted time into something productive. That task – and you will all have done this before – is to look at an image of a distorted word and type what it is in a box. It often turns up when you’re trying to post on a blog or sign up for an account.

The distorted word is called a CAPTCHA and, playing fast and loose with the spirit of acronyms, it stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”. The point of CAPTCHAs is to make users prove that they are human, because modern computer programs cannot discern the distorted letters as well as humans can. They are visual sentinels that protect against automated programs that would otherwise overbuy tickets for sale at inflated prices, set up millions of fake email accounts for spamming or inundate polls, forums and blogs with comments.

They have become so commonplace that von Ahn estimates that people type in over 100 million CAPTCHAs every day. And even though the goal of improving web security is a worthwhile one, these efforts add up to hundreds of thousands of hours that are effectively wasted on a daily basis. Now, von Ahn’s team have found a way of tapping this effort and putting it to better use – to help decipher scanned words, and usher old printed books into the digital age.

Reverse-Turing tests

As von Ahn writes, the goal of these projects is to “preserve human knowledge and to make information more accessible to the world.” Digitising books makes them simpler to search and store, but doing so is easier said than done. Books can be scanned and their words decoded by optical character recognition software, but these programmes are still far from perfect. And any weaknesses they have are exacerbated by the faded ink and yellowing paper of the very texts we are most interested in preserving.


So recognition software is automated but only about 80% accurate. Humans are far more accurate; if two fleshy scribes work independently and check any discrepancies in their transcripts, they can achieve an accuracy of over 99%. We, however, are far from automated and usually quite expensive to hire.

The new system, aptly named reCAPTCHA, combines the best of both worlds by asking people to decipher words that software cannot, while solving CAPTCHAs. Instead of random words or characters, it creates CAPTCHAs using words from scanned texts that recognition software has struggled to read.

Two different recognition programmes scour the texts in question and if their readings differ, words are classified as “suspicious”. Each suspicious word is placed alongside a “control” word that is already known. The pair is distorted even further, and used to make a CAPTCHA. The user has to solve both words to prove their humanity – if they get the control word right, the system assumes that they are genuine and gains a bit of confidence that their guess for the suspicious word is also right.
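To make that step concrete, here is a minimal Python sketch of the idea rather than the team's actual code. It assumes the two recognition programmes have already produced word lists of equal length, and the names `find_suspicious`, `Challenge` and `check_response` are invented for illustration.

```python
from dataclasses import dataclass


def find_suspicious(ocr_a, ocr_b):
    """Return the positions where two OCR readings of the same text disagree."""
    if len(ocr_a) != len(ocr_b):
        # Assumption: the readings are already aligned word by word.
        raise ValueError("readings must be aligned word by word")
    return [i for i, (a, b) in enumerate(zip(ocr_a, ocr_b)) if a != b]


@dataclass
class Challenge:
    control_answer: str  # word whose transcription is already known
    suspicious_id: int   # position of the unknown word in the scanned text


def check_response(challenge, control_guess, suspicious_guess):
    """Return (is_human, vote).

    The vote for the suspicious word only counts if the control word
    was typed correctly; otherwise it is None.
    """
    is_human = control_guess.strip().lower() == challenge.control_answer.lower()
    vote = suspicious_guess.strip().lower() if is_human else None
    return is_human, vote
```

A response that gets the control word wrong fails the test and contributes nothing towards the suspicious word.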

Every suspicious word is sent to multiple users and if the first three people to see it all provide the same guess, it is shunted over to the pool of control words. If the humans disagree, a voting system kicks in and the most popular answer is taken as the right one. Users have an option to discard the word if it’s illegible, and if this happens six times without any guesses being made, the word is marked as “unreadable” and discarded.
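Those rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published algorithm: the article doesn't say how many votes are needed before the most popular answer wins, so `min_votes` is a made-up parameter, ties are simply left pending, and a discard is represented by `None`.

```python
from collections import Counter

DISCARD = None  # hypothetical marker: the user flagged the word as illegible


def resolve(responses, unanimous_after=3, discard_limit=6, min_votes=5):
    """Decide the fate of a suspicious word from the responses so far.

    Returns ("control", word), ("accepted", word), ("unreadable", None)
    or ("pending", None).
    """
    guesses = [r for r in responses if r is not DISCARD]
    discards = len(responses) - len(guesses)

    # Six discards without a single guess: give up on the word.
    if not guesses:
        if discards >= discard_limit:
            return ("unreadable", None)
        return ("pending", None)

    # First three guesses identical: promote to the control pool.
    first = guesses[:unanimous_after]
    if len(first) == unanimous_after and len(set(first)) == 1:
        return ("control", first[0])

    # Otherwise fall back to a vote (assumed threshold: min_votes guesses).
    if len(guesses) >= min_votes:
        (top, top_count), *rest = Counter(guesses).most_common()
        if not rest or top_count > rest[0][1]:
            return ("accepted", top)

    return ("pending", None)
```

For example, `resolve(["morning", "mourning", "morning", "morning", "moaning"])` lets the majority reading win, while six discards in a row mark the word as unreadable.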

At first, von Ahn’s team tested the reCAPTCHA system using 50 scanned articles from the New York Times archive taken as far back as 1860 and totalling just over 24,000 words. The reCAPTCHA system achieved an excellent accuracy of 99.1%, getting only 216 words wrong and far outstripping the meagre 83.5% rate managed by standard recognition software.

Human transcription services guarantee an accuracy of 99% or better, so reCAPTCHA certainly lives up to that exacting standard. Indeed, when humans were asked to do the same task, they made 189 errors, just 27 fewer than the programme. The neck-and-neck nature of the two scores is all the more impressive because unlike a human reader, reCAPTCHA cannot make use of context to decode a word’s identity.
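The percentages are easy to check from the error counts. The quick Python calculation below assumes a round total of 24,000 words (the article only says “just over”), so the figures are approximate.

```python
TOTAL_WORDS = 24_000  # assumption: the article says "just over 24,000"


def accuracy(errors, total=TOTAL_WORDS):
    """Percentage of words transcribed correctly."""
    return 100 * (1 - errors / total)


recaptcha_accuracy = accuracy(216)  # about 99.1%
human_accuracy = accuracy(189)      # about 99.2%

# Working backwards, an 83.5% accuracy implies roughly this many OCR errors.
ocr_errors = round(TOTAL_WORDS * (1 - 0.835))

print(f"reCAPTCHA: {recaptcha_accuracy:.1f}%")
print(f"Humans:    {human_accuracy:.1f}%")
print(f"OCR errors: about {ocr_errors}")
```

On these numbers, the standard software would have got nearly 4,000 words wrong where reCAPTCHA missed 216.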

Virtual security

That’s all well and good, but are there selfish reasons for a website to use reCAPTCHA, if its goal of preserving its own security (quite understandably) outweighs any interest in text conservation? Certainly, according to the researchers. Because the new system only uses words that are unrecognisable to current optical character recognition software, it’s actually more secure than current CAPTCHAs are.

Conventional CAPTCHAs use a small number of predictable rules to distort a set of characters and various groups have developed learning programmes that can solve them with over 90% accuracy. But the same techniques always fail to solve reCAPTCHAs because on top of the usual twists, this system has two extra levels of ‘encryption’ – the random fading of the underlying text and ‘noisy’ distortion caused by the scanning process. There’s a certain irony in making something state-of-the-art out of the old and the inaccurate.

It’s an interesting advance – von Ahn was in fact the person responsible for developing CAPTCHAs in their current form, so it’s perhaps unsurprising that his team have developed the next escalation of this technology.

Some might suggest that CAPTCHAs are a bit annoying anyway, so having to fill two out would seem like too onerous a task for today’s short attention spans. Not so – most CAPTCHAs are strings of random characters and these take just as long to solve as two actual English words.

Recycling effort

These guarantees, along with the prospect of doing something worthy, have already turned reCAPTCHA into a bit of an online hit. It’s being used by over 40,000 websites and it’s already making an impact. In its first year, web users solved over 1.2 billion reCAPTCHAs and deciphered over 440 million words – the equivalent of 17,600 books. At the moment, the programme is deciphering over 4 million suspicious words (about 160 books) every day. For human scribes to do the same task in the same time-frame, you’d need a workforce of over 1,500 people working 40-hour weeks.
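A quick back-of-the-envelope check in Python shows that these figures hang together. It uses only the numbers quoted above, and the variable names are mine.

```python
# Figures quoted in the article.
solved_first_year = 1_200_000_000  # reCAPTCHAs solved in the first year
words_first_year = 440_000_000     # words deciphered in the first year
books_first_year = 17_600
words_per_day = 4_000_000
books_per_day = 160

# Both book counts imply the same average book length.
words_per_book_year = words_first_year / books_first_year
words_per_book_day = words_per_day / books_per_day

# Each deciphered word took several solved reCAPTCHAs, as you'd expect
# when every word is shown to multiple users.
solves_per_word = solved_first_year / words_first_year

print(words_per_book_year, words_per_book_day)
print(round(solves_per_word, 2))
```

Both comparisons assume a book of 25,000 words, and each deciphered word took a little under three solved reCAPTCHAs on average.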

It’s a fantastic idea – turning web users into unwitting satellite processors, and making constructive use of a necessary but ultimately unproductive activity. This ethos, of treating human processing power as a resource to be conserved just as electricity or gas should be, underlies a lot of the team’s other work. They have developed online games that can analyse photos and audio recordings, and their work has inspired another group to create Fold It, a game in which people compete to work out the ideal structure of a protein.

Even pictures of cats can be put to good use. A Microsoft programme called ASIRRA uses images of cats and dogs as CAPTCHAs. Users have to select all the images of one or the other, but the twist is that all the photos come from animal shelters and users who take a liking to one of the animals can adopt it.

Now if only someone could harness the countless hours of effort wasted on trolling or posting comments on YouTube, we’d all be laughing.

Reference: Science doi: 10.1126/science.1160379


August 14th, 2008 by Ed Yong in Computer science

5 Responses to “Using our powers for good – how web security software can help to transcribe old books”

  1. EricJuve Says:
    August 14th, 2008 at 5:17 pm

    Thank you for this post. I did my first two words today, I look forward to many more.

  2. jj Says:
    August 14th, 2008 at 5:47 pm

    Nice post, I’d heard a few months ago that CAPTCHA had essentially been beaten by some program. I never trust OCR, we use it in my office to scan to word docs (works mostly) and excel (never gets the correct formatting, and I mean never). I always tell my employees, if they want a document digital and don’t need to edit it – scan it to a PDF, no need for OCR.

  3. R N B Says:
    August 15th, 2008 at 6:31 am

    So far all these uses are good, perfect examples of the power of collaboration. I thoroughly agree. If I may slightly misquote Tom Paine, it reflects all that is best about society, working together to achieve what is impossible individually. But there is a slight suspicion that sometime somewhere a powerful individual in charge of a large collaborative network will secretly use these techniques, and before we know it, we have all together helped to build a new deadly virus?

  4. speedwell Says:
    August 15th, 2008 at 7:45 am

    “But there is a slight suspicion that sometime somewhere a powerful individual in charge of a large collaborative network will secretly use these techniques, and before we know it, we have all together helped to build a new deadly virus?”
    The most important weakness of deadly viruses is that they kill the host (computer). To be sure a virus in development is deadly, you have to watch it kill hosts (testers’ computers). Those hosts are then presumably unavailable for future tests.
    A major strength of distributed computing is that viruses will get caught early by the people most qualified to recognize them, power users. Most power users have access to more than one computer. People would compare notes and realize what was going on. This assumes, by the way, that existing anti-virus programs don’t already flag the program as malicious based on its behavior.
    If the virus was particularly complex, then it is tempting to think that different users in the distribution would test small parts of the program in isolation. That may be possible. But at some point you have to put the pieces together to test larger subprocesses, and any that cause harm to users’ computers would undoubtedly be detected long before the full virus could be constructed.
    In order to effectively fool users and bypass existing defense mechanisms, you’d need an innovative virus almost as complicated as the one you envision producing in the distributed environment.
    In other words, good scary sci-fi, but not plausible.

  5. melior Says:
    August 17th, 2008 at 6:11 am

    It shouldn’t take long until bad guys are using a variation of this system to defeat CAPTCHA.
    Example: An owner of a porn site requires users to solve a CAPTCHA each time they download a file. The words to be solved are actually redirected in realtime from a queue being used by an automated program that is trying to get around a CAPTCHA protecting another site.

    Copyright © 2012, Kalmbach Publishing Co.
