Computers Exploit Human Brainpower to Decipher Faded Texts

By Eliza Strickland | August 14, 2008 6:58 pm

In a neat example of Internet-enabled “crowdsourcing,” the method of distributing a large task to many contributors, researchers are using an anti-spam program to get people to decipher damaged or faded texts, one word at a time. Chances are that if you’ve solved one of those distorted-word tests to secure an account with Facebook, Craigslist, or Ticketmaster, you’ve helped The New York Times inch a little closer to digitizing its entire print newspaper archive from 1851 to 1980 [CNET].

The program, known as reCAPTCHA, is widely used to ensure that humans, rather than spam bots, are commenting on blogs (including some of DISCOVER’s) and signing up for free email accounts. “More web sites are adopting reCAPTCHAs each day, so the rate of transcription keeps growing,” said [lead researcher Luis] von Ahn. “More than 4 million words are being transcribed every day. It would take more than 1,500 people working 40 hours a week at a rate of 60 words a minute to match our weekly output” [Telegraph]. The service is available for free to any site.

Von Ahn’s lab uses two different optical character recognition (OCR) software programs to scan an old book or newspaper article and convert it into a digital, searchable file. But when the programs disagree on the reading of a word, that word is added to the reCAPTCHA database, and used as part of an anti-spam puzzle. According to a report published in the journal Science [subscription required], humans decipher such words with 99 percent accuracy.
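
As a rough illustration of that disagreement check, here is a minimal Python sketch. It assumes both OCR passes return the same number of whitespace-separated words, so they can be compared position by position; real OCR output would need to be aligned first. The function name and the sample strings are made up for the example and are not taken from the reCAPTCHA system.

```python
def find_suspicious_words(ocr_a, ocr_b):
    """Return (position, reading_a, reading_b) for each word the two OCR passes disagree on."""
    words_a = ocr_a.split()
    words_b = ocr_b.split()
    # Simplifying assumption: both engines produced the same number of words.
    if len(words_a) != len(words_b):
        raise ValueError("this sketch assumes both OCR outputs have the same word count")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(words_a, words_b))
        if a != b
    ]


# Hypothetical outputs from two OCR engines reading the same scanned line.
engine_a = "this aged portion of society were distinguished"
engine_b = "thin aged portion of society were distingnished"

# Words the engines disagree on would be queued for human readers.
suspicious = find_suspicious_words(engine_a, engine_b)
for position, reading_a, reading_b in suspicious:
    print(position, reading_a, reading_b)
```

Only the words the two engines disagree on would go to human readers; words they agree on are kept as read.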

In 2000, von Ahn helped invent the first “CAPTCHA,” which stands for “Completely Automated Public Turing test to tell Computers and Humans Apart,” with a nod to the early computer scientist Alan Turing. The new reCAPTCHA cleverly slips a useful task into what has already become a mundane Internet activity. Says von Ahn: “We are demonstrating that we can take human effort, human processing power, that would otherwise be wasted and redirect it to accomplish tasks that computers cannot yet solve” [Wired News].

Last year DISCOVER saw how humans could act as artificial artificial intelligence at the Amazon Mechanical Turk, another fine example of crowdsourcing.

Image: Science/AAAS

CATEGORIZED UNDER: Technology
  • Jeremiah

    Um… shouldn’t that be “This aged portion of society was”? Haha.

  • john powell

    A Mental Blockage

    In the current is often found
    Unknown particles of sky and ground.
    Oft they appear as phantasms or as dreams
    Or oft illusions of what is or only seems.

    Nonetheless they do appear as real or imagined fear
    Or as unknowns, unnaturals, torments to eye and ear.
    Look what the fresh new breeze doth bring–
    With its mysterious voice, it doth sing.

    Soft on the air with voice or visual treat,
    It lays its bearing or bounty at your feet.
    Now it is yours, this new thought;
    By this new wind, it is brought.

    Up from the abyss or down from heaven,
    In a current, air now is given.
    It’s oft a creature of what we ingest
    That gives unto us this worst or best.

    Oh, the hazards of seeing or hearing
    That soon become our reasons for fearing!
    The things accepted without investigation
    Causes the brain its mental constipation.

    120205

  • http://smp.popamericana.com Sir Mildred Pierce

    “Um… shouldn’t that be “This aged portion of society was”? Haha.”

    Common mistake. “Society” is a plurality and is treated as such in the grammar. Another good example: one might say “Queen is Freddie Mercury, Brian May…” but the proper way to say it would be “Queen are Freddie Mercury, Brian May…” etc. The brain thinks otherwise because the previous word doesn’t end in “s”, but nevertheless it’s a plurality and thus treated as such.

  • http://smp.popamericana.com Sir Mildred Pierce

    Or rather “This aged portion of society” as a whole is a plurality, not just “society”…

    I would like to see the famous “Roswell Memo” given the treatment, as it seems that previously only those biased toward the answer that the memo really does talk about aliens and discs have been the ones interpreting it.

  • Duck

    Hm, how then does the system verify if the typed-in word is correct? Wouldn’t someone have to physically write out the correct answer so the CAPTCHA would know later on if someone entered the correct word, or something else. I could just write ‘poop’ and it wouldn’t catch it.

  • Ash

    I’m all for typing inane responses to articles if it means the furthering of literacy.
    Imagine if Youtube incorporated it.

  • @MildredPierce

    Actually, that depends on whether you’re speaking British English or American English. In British English, collective nouns are treated as plural, “The class were…”, “The team were…”, “U2 are…”, but in American English they are treated as singular nouns.

    Furthermore, in the example above it should be “was” no matter what side of the Atlantic you’re on. The “was” refers to “this [aged] portion”, which is clearly singular because of the “this”. If the quote were “The aged portion of society…” then it would depend on B.E. vs. A.E.

    I’m guessing the quote is an archaic formulation.

  • @Duck

    The system gives the same words to multiple people. If they agree on what the word should be, then the word is accepted as correct. If some of the writers disagree, then the word is given to more people.

  • Grimmygrim

    Portion is singular so “was” would be correct. Using “was” or “were” would depend on the context (are they talking about the portion or the society). I’m leaning towards “was”.

  • ayeroxor

    “Um… shouldn’t that be “This aged portion of society was”? Haha.”

    It can be either. Haha.

  • Jmar

    I do not understand how this would work for “new words” that have yet to be deciphered. Above, someone suggested it sends the word to multiple people… does the first person have to wait until enough people verify? Haha. All my experience with this CAPTCHA has been an instant correct or incorrect. From my understanding, it’s asking me to verify, not decipher. Am I just not getting a “new word” or what?

  • rprebel

    It sounds like CAPTCHAs, for the commenter, aren’t new words at all. When I type ‘suffolk’ and ‘chiffon’ into the little box below this bigger box, I’m not helping to decipher anything. I’m placing a vote in an election that’s already been decided. They’re also annoying, but spam is more so.

  • http://hpc.isti.cnr.it/~silvestr Fabrizio

    Andrei Broder, while at AltaVista, was the first to invent a CAPTCHA, not Luis von Ahn.

  • Hank Roberts

    When all else fails, read the fine manual:

    http://recaptcha.net/learnmore.html

    “how does the system know the correct answer to the puzzle? Here’s how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.”

    See also: http://web.sbu.edu/history/tschaeper/Hist101/101wwwfbacon.html

  • Jerome

    Yes, that’s not clear to me either… if I’m deciphering the word, how does the program know what is correct?

  • thomas

    Here’s how they do it (From the website):

    “But if a computer can’t read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here’s how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.”

    Very cool idea.

  • komatzu

    @thomas: thanks for the answer!
    I think it should have been mentioned in the article.

  • Fat Jolly Penguin

    ““Um… shouldn’t that be “This aged portion of society was”? Haha.”

    It can be either. Haha.”

    Actually, it should be “was.” The subject of the sentence is “portion.”

  • Rich

    If I’d known I was helping the NYT I would have lied!

  • Kevin

    “Um… shouldn’t that be “This aged portion of society was”? Haha.”

    In most cases, since the subject of the sentence would be portion, then the correctly conjugated form would be “was” as that would agree in number with the subject. However, one thing that seems to have escaped attention would be the use of the subjunctive instead of the indicative. For example, when positing “If I were a grammar-nazi,” “were” is the correct form and not “was” even though the subject (“I”) is singular. I am not saying that this is the particular case here, but that it is a possibility…another would be that the author was bereft of grammar knowledge in the first place.

  • ov3rcl0ck

    Shouldn’t they run words whose characters are 75% or so readable through a dictionary to find the most plausible candidates, then use the less readable characters to pick the most suitable word from that list, so you don’t end up with words like “niss” and “pntkm”?

  • Brian

    @ov3rcl0ck,

    If you’re saying what I think you’re saying, then that doesn’t work. If you accept the OCR engine’s result of a valid word, you miss the possibility that it made a mistake that still comes out as a real word. You might say that’s an acceptable risk, but these are known dirty documents with high error rates anyways.

    Even running the result through a grammar checker doesn’t always fix problems, though it certainly does help. Every additional cross-check helps. However, nothing replaces the human being for best quality.

    I just performed an OCR job and the result was OK but not great. It consistently mis-recognized “S” as “N”. For instance, the word “this” was repeatedly recognized as “thin”. You wouldn’t think the letters could be confused, but they were and the result was still a valid word. Even at the phrase level the language still made some type of sense, as “this interface” became “thin interface”.

  • Frollard

    Then how does reCAPTCHA know if you typed it right (there goes the Turing test)? If you typed it wrong, it wouldn’t know, because it doesn’t know what the text is.

  • Dan

    Adobe needs to work on the OCR junk. You all know OCR has been around for ages right?

    This is why no one talks about it very much anymore… It was LAME. It has never worked right.

    Got me as to why they can’t separate black text from white paper most of the time. I can do it in Photoshop most often. It’s called: select with certain threshold, Contract the selection, inverse and hit the del key. THEN try to OCR this crap. Oh wait Photoshop has actions, and I bet a smart guy would be able to batch those actions…. hmmmm.

    It’s faster to just type it in….

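
The verification scheme quoted in the comments above from the reCAPTCHA site, in which each unreadable word is paired with a control word whose answer is already known, can be sketched in Python as follows. This is only an illustration: the class name, the `votes_needed` threshold, and the case-insensitive matching are assumptions, since the quoted description says only that the new image is given to “a number of other people.”

```python
from collections import Counter


class WordPuzzleStore:
    """Toy model of pairing a known control word with an unknown word."""

    def __init__(self, votes_needed=3):
        self.votes_needed = votes_needed  # assumed agreement threshold
        self.votes = {}     # image id -> Counter of submitted readings
        self.accepted = {}  # image id -> reading that reached the threshold

    def submit(self, control_answer, control_guess, unknown_id, unknown_guess):
        """Return True if the user passes the human test."""
        if control_guess.strip().lower() != control_answer.lower():
            # Failed the known word: reject, and ignore the other guess.
            return False
        tally = self.votes.setdefault(unknown_id, Counter())
        tally[unknown_guess.strip().lower()] += 1
        reading, count = tally.most_common(1)[0]
        if count >= self.votes_needed:
            self.accepted[unknown_id] = reading
        return True


store = WordPuzzleStore(votes_needed=2)
first = store.submit("chiffon", "chiffon", "img-1", "suffolk")
second = store.submit("chiffon", "chifon", "img-1", "poop")  # control word wrong
third = store.submit("chiffon", "Chiffon", "img-1", "Suffolk")
print(first, second, third, store.accepted)
```

In this sketch, a user who gets the control word right but types nonsense for the unknown word still casts one vote, which is outvoted once enough other readers agree on a reading.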
