DISCOVER Magazine. Science, Technology and The Future
80beats

Computers Exploit Human Brainpower to Decipher Faded Texts


In a neat example of Internet-enabled “crowdsourcing,” the method of distributing a large task to many contributors, researchers are using an anti-spam program to get people to decipher damaged or faded texts, one word at a time. Chances are that if you’ve solved one of those distorted-word tests to secure an account with Facebook, Craigslist, or Ticketmaster, you’ve helped The New York Times inch a little closer to digitizing its entire print newspaper archive from 1851 to 1980 [CNET].

The program, known as reCAPTCHA, is widely used to ensure that humans, rather than spam bots, are commenting on blogs (including some of DISCOVER’s) and signing up for free email accounts. “More web sites are adopting reCAPTCHAs each day, so the rate of transcription keeps growing,” said [lead researcher Luis] von Ahn. “More than 4 million words are being transcribed every day. It would take more than 1,500 people working 40 hours a week at a rate of 60 words a minute to match our weekly output” [Telegraph]. The service is available for free to any site.

Von Ahn’s lab uses two different optical character recognition (OCR) software programs to scan an old book or newspaper article and convert it into a digital, searchable file. But when the programs disagree on the reading of a word, that word is added to the reCAPTCHA database and used as part of an anti-spam puzzle. According to a report published in the journal Science [subscription required], humans decipher such words with 99 percent accuracy.
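
For readers who like to see the idea in code, here is a minimal sketch of the "two OCR programs disagree" step. It is not the lab's actual software: the function name, the sample sentences, and the assumption that both OCR outputs are already split into the same number of aligned words are all made up for illustration.

```python
def words_needing_humans(ocr_a, ocr_b):
    """Return (position, reading_a, reading_b) for every word the two OCR passes disagree on.

    Assumes (hypothetically) that both outputs are word lists of equal length
    and already aligned position by position.
    """
    disagreements = []
    for position, (word_a, word_b) in enumerate(zip(ocr_a, ocr_b)):
        if word_a != word_b:
            disagreements.append((position, word_a, word_b))
    return disagreements


# Made-up outputs from two hypothetical OCR engines reading the same line.
engine_a = "this interface was faded".split()
engine_b = "thin interface was faded".split()

flagged = words_needing_humans(engine_a, engine_b)
print(flagged)  # [(0, 'this', 'thin')]
```

Only the flagged words would go on to human readers; words both programs agree on are kept as they are.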

In 2000, von Ahn helped invent the first “CAPTCHA,” which stands for “Completely Automated Public Turing test to tell Computers and Humans Apart,” with a nod to the early computer scientist Alan Turing. The new reCAPTCHA cleverly slips a useful task into what has already become a mundane Internet activity. Says von Ahn: “We are demonstrating that we can take human effort — human processing power — that would otherwise be wasted and redirect it to accomplish tasks that computers cannot yet solve” [Wired News].

Last year DISCOVER saw how humans could act as artificial artificial intelligence at the Amazon Mechanical Turk, another fine example of crowdsourcing.

Image: Science/AAAS

August 14th, 2008 6:58 PM Tags: computers, crowdsourcing, spam
by Eliza Strickland in Technology | 28 comments

28 Responses to “Computers Exploit Human Brainpower to Decipher Faded Texts”

  1. Jeremiah Says:
    August 15th, 2008 at 7:36 am

    Um… shouldn’t that be “This aged portion of society was”? Haha.

  2. john powell Says:
    August 16th, 2008 at 11:33 am

    A Mental Blockage

    In the current is often found
    Unknown particles of sky and ground.
    Oft they appear as phantasms or as dreams
    Or oft illusions of what is or only seems.

    Nonetheless they do appear as real or imagined fear
    Or as unknowns, unnaturals, torments to eye and ear.
    Look what the fresh new breeze doth bring–
    With its mysterious voice, it doth sing.

    Soft on the air with voice or visual treat,
    It lays its bearing or bounty at your feet.
    Now it is yours, this new thought;
    By this new wind, it is brought.

    Up from the abyss or down from heaven,
    In a current, air now is given.
    It’s oft a creature of what we ingest
    That gives unto us this worst or best.

    Oh, the hazards of seeing or hearing
    That soon become our reasons for fearing!
    The things accepted without investigation
    Causes the brain its mental constipation.

    120205

  3. Sir Mildred Pierce Says:
    August 17th, 2008 at 5:06 pm

    “Um… shouldn’t that be “This aged portion of society was”? Haha.”

    Common mistake. “Society” is a plurality, and is treated as such in the grammar. As another example, one might say “Queen is Freddie Mercury, Brian May…” but the proper way to say it would be “Queen are Freddie Mercury, Brian May…” etc. The brain thinks otherwise because the previous word doesn’t end in “s”, but nevertheless it’s a plurality and thus treated as such.

  4. Sir Mildred Pierce Says:
    August 17th, 2008 at 5:10 pm

    Or rather “This aged portion of society” as a whole is a plurality, not just “society”…

    I would like to see the famous “Roswell Memo” given the treatment, as it seems that previously the only ones interpreting it have been those biased toward the answer that the memo really does talk about aliens and discs.

  5. Duck Says:
    August 17th, 2008 at 5:13 pm

    Hm, how then does the system verify if the typed-in word is correct? Wouldn’t someone have to physically write out the correct answer so the CAPTCHA would know later on if someone entered the correct word or something else? I could just write ‘poop’ and it wouldn’t catch it.

  6. Ash Says:
    August 17th, 2008 at 5:19 pm

    I’m all for typing inane responses to articles if it means the furthering of literacy.
    Imagine if Youtube incorporated it.

  7. @MildredPierce Says:
    August 17th, 2008 at 5:22 pm

    Actually, that depends on whether you’re speaking British English or American English. In British English, collective nouns are treated as plural, “The class were…”, “The team were…”, “U2 are…”, but in American English they are treated as singular nouns.

    Furthermore, in the example above it should be “was” no matter what side of the Atlantic you’re on. The “was” refers to “this [aged] portion”, which is clearly singular because of the “this”. If the quote were “The aged portion of society…” then it would depend on B.E. vs. A.E.

    I’m guessing the quote is an archaic formulation.

  8. @Duck Says:
    August 17th, 2008 at 5:24 pm

    The system gives the same words to multiple people. If they agree on what the word should be, then the word is accepted as correct. If some of the writers disagree, then the word is given to more people.

  9. Grimmygrim Says:
    August 17th, 2008 at 6:01 pm

    Portion is singular so “was” would be correct. Using “was” or “were” would depend on the context (are they talking about the portion or the society). I’m leaning towards “was”.

  10. ayeroxor Says:
    August 17th, 2008 at 6:07 pm

    “Um… shouldn’t that be “This aged portion of society was”? Haha.”

    It can be either. Haha.

  11. Jmar Says:
    August 17th, 2008 at 6:12 pm

    I do not understand how this would work for “new words” yet to be deciphered. Above, someone suggested it sent the word to multiple people… does the first person have to wait until enough people verify? Haha. All my experience with this CAPTCHA has been an instant result, either correct or incorrect. From my understanding, it’s asking me to verify, not decipher. Am I just not getting a “new word” or what?

  12. rprebel Says:
    August 17th, 2008 at 7:17 pm

    It sounds like CAPTCHAs, for the commenter, aren’t new words at all. When I type ‘suffolk’ and ‘chiffon’ into the little box below this bigger box, I’m not helping to decipher anything. I’m placing a vote in an election that’s already been decided. They’re also annoying, but spam is more so.

  13. Ron Delta Says:
    August 17th, 2008 at 8:41 pm

    Wow dude, those folks are pretty amazing, aren’t they? Very smart bunch.

    RD
    http://www.anondo.alturl.com

  14. Fabrizio Says:
    August 17th, 2008 at 8:47 pm

    Andrei Broder, not Luis von Ahn, was the first to invent a CAPTCHA, when he was at AltaVista.

  15. Hank Roberts Says:
    August 17th, 2008 at 9:09 pm

    When all else fails, read the fine manual:

    http://recaptcha.net/learnmore.html

    “how does the system know the correct answer to the puzzle? Here’s how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.”

    See also: http://web.sbu.edu/history/tschaeper/Hist101/101wwwfbacon.html

  16. Jerome Says:
    August 18th, 2008 at 2:09 am

    Yes, that’s not clear to me either… if I’m deciphering the word, how does the program know what is correct?

  17. thomas Says:
    August 18th, 2008 at 3:21 am

    Here’s how they do it (From the website):

    “But if a computer can’t read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here’s how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.”

    Very cool idea.

  18. komatzu Says:
    August 18th, 2008 at 10:59 am

    @thomas: thanks for the answer!
    I think it should have been mentioned in the article.

  19. Fat Jolly Penguin Says:
    August 20th, 2008 at 6:39 pm

    ““Um… shouldn’t that be “This aged portion of society was”? Haha.”

    It can be either. Haha.”

    Actually, it should be “was.” The subject of the sentence is “portion.”

  20. Rich Says:
    August 27th, 2008 at 10:28 am

    If I’d known I was helping the NYT I would have lied!

  21. Kevin Says:
    August 30th, 2008 at 6:46 am

    “Um… shouldn’t that be “This aged portion of society was”? Haha.”

    In most cases, since the subject of the sentence would be portion, then the correctly conjugated form would be “was” as that would agree in number with the subject. However, one thing that seems to have escaped attention would be the use of the subjunctive instead of the indicative. For example, when positing “If I were a grammar-nazi,” “were” is the correct form and not “was” even though the subject (“I”) is singular. I am not saying that this is the particular case here, but that it is a possibility…another would be that the author was bereft of grammar knowledge in the first place.

  22. ov3rcl0ck Says:
    May 18th, 2009 at 9:17 pm

    Shouldn’t they have it run characters that are 75% or so readable into a dictionary to find the most logical words then out of the list take the less readable characters and find the most suitable word, so you don’t end up with words like “niss” and “pntkm”?

  23. Brian Says:
    July 29th, 2009 at 7:22 pm

    @ov3rclock,

    If you’re saying what I think you’re saying, then that doesn’t work. If you accept the OCR engine’s result of a valid word, you miss the possibility that it made a mistake that still comes out as a real word. You might say that’s an acceptable risk, but these are known dirty documents with high error rates anyways.

    Even running the result through a grammar checker doesn’t always fix problems, though it certainly does help. Every additional cross-check helps. However, nothing replaces the human being for best quality.

    I just performed an OCR job and the result was OK but not great. It consistently mis-recognized “S” as “N”. For instance, the word “this” was repeatedly recognized as “thin”. You wouldn’t think the letters could be confused, but they were and the result was still a valid word. Even at the phrase level the language still made some type of sense, as “this interface” became “thin interface”.

  24. Frollard Says:
    January 21st, 2010 at 6:18 am

    Then how does reCaptcha know if you typed it right (there goes the Turing test) – if you typed it wrong, it wouldn’t know because it doesn’t know what the text is.

  25. Dan Says:
    September 5th, 2010 at 5:15 pm

    Adobe needs to work on the OCR junk. You all know OCR has been around for ages right?

    This is why no one talks about it very much anymore… It was LAME. It has never worked right.

    Got me as to why they can’t separate black text from white paper most of the time. I can do it in Photoshop most often. It’s called: select with certain threshold, Contract the selection, inverse and hit the del key. THEN try to OCR this crap. Oh wait Photoshop has actions, and I bet a smart guy would be able to batch those actions…. hmmmm.

    It’s faster to just type it in….

  26. Thaddeus Kissick Says:
    April 26th, 2011 at 2:28 am

    Awesome read. I just passed this onto a buddy who was doing a little research on that. He just bought me lunch since I found it for him! Therefore let me rephrase: Thanks for lunch!

  27. Brandon Pleet Says:
    June 24th, 2011 at 8:15 am

    Appreciation for your exceptionally insightful posting, most of us could use far more sites similar to this on the net. Could you expand more about the 2nd paragraph please? I am a little bit perplexed as well as uncertain whether or not I understand your position entirely. Thank you.

  28. Olivia Clossin Says:
    July 30th, 2011 at 5:06 pm

    This particular publish generally seems to get a lot of site visitors. How can you get traffic to this? That offers a nice unique perspective on points. I assume possessing some thing genuine or significant to provide facts about is an essential element.
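
Several commenters above asked how the system can grade a word it cannot read itself. The explanation quoted from the reCAPTCHA site in the thread (pair each unknown word with a known one, trust the unknown answer only if the known one is right, then collect answers from more people) can be sketched roughly as follows. This is an illustration under stated assumptions, not reCAPTCHA's real code: the function names, the case-insensitive matching, and the threshold of three matching answers are all hypothetical, since the quoted text only says "a number of other people."

```python
from collections import Counter


def check_submission(control_answer, typed_control, typed_unknown, votes):
    """Pass or fail a user on the known (control) word.

    The user's reading of the unknown word is recorded as a vote only when the
    control word was typed correctly.
    """
    passed = typed_control.strip().lower() == control_answer.lower()
    if passed:
        votes[typed_unknown.strip().lower()] += 1
    return passed


def accepted_reading(votes, min_votes=3):
    """Return the leading reading once it has at least min_votes votes, else None.

    min_votes=3 is an assumed threshold chosen for this example.
    """
    if not votes:
        return None
    reading, count = votes.most_common(1)[0]
    return reading if count >= min_votes else None


votes = Counter()
# Each pair is (what the user typed for the control word, what they typed for the unknown word).
submissions = [
    ("suffolk", "chiffon"),
    ("sufolk", "poop"),      # control word wrong, so this vote is ignored
    ("suffolk", "chiffon"),
    ("Suffolk", "chiffon"),
]
results = [check_submission("suffolk", control, unknown, votes)
           for control, unknown in submissions]

print(results)                  # [True, False, True, True]
print(accepted_reading(votes))  # chiffon
```

In this sketch, typing 'poop' for the unknown word only counts if the control word is also right, and even then it would be outvoted by other readers, which is the point the quoted explanation makes.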

  • Kalmbach Publishing Co.

    Copyright © 2012, Kalmbach Publishing Co.
