reCAPTCHA

By Sean Carroll | November 12, 2007 4:29 pm

We’ve all seen CAPTCHAs: those distorted words that function as a cut-rate Turing test, separating humans from spambots on any number of websites.

image.jpg

This weekend I was at a Kavli Frontiers of Science meeting at the National Academies of Science office in Irvine, and one of the participants was Luis von Ahn — the guy who was responsible for inventing the CAPTCHA idea. He gave a great one-minute talk, in which he traced his personal feelings about being responsible for something that is so useful, yet so annoying.

CAPTCHA, you will not be surprised to hear, is ubiquitous. Luis figured out that the little buggers are filled out about sixty million times per day by someone on the web. So, as the inventor, he first felt a certain amount of pride at having exerted such a palpable influence on modern life. But after a bit of reflection, and multiplying sixty million by the five seconds it might take to fill in the form, he became depressed at the enormous number of person-hours that were essentially wasted on this task.
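To put a number on that, here is a quick back-of-the-envelope calculation. It uses only the two figures quoted above (sixty million per day, five seconds each); the constant and function names are made up for illustration.

```python
# Figures quoted in the post: about sixty million CAPTCHAs per day,
# at roughly five seconds each (both are rough estimates).
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 5


def person_hours_per_day(count: int, seconds_each: float) -> float:
    """Total human time spent per day, converted from seconds to hours."""
    return count * seconds_each / 3600


hours = person_hours_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} person-hours per day")  # prints "83,333 person-hours per day"
```

That is more than eighty thousand hours of human attention every single day, under those rough assumptions.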

Being a clever guy, Luis decided to make lemonade. What we have here is a huge number of people who are recognizing words that a computer can’t make out. Luis realized that there was a separate circumstance in which you would want the computer to recognize the words, even though it wasn’t quite up to the task — optical character recognition, and in particular the problem of digitizing old texts. Apparently, before the advent of the Internet, people would store information by binding together pieces of paper with words printed on them, forming compact volumes known as “books.” In the interest of preserving the products of this outmoded technology, various efforts around the world are attempting to scan in all of those books and store the results digitally. But often the text is not so clear, and the computers don’t do such a great job at translating the images into words.

sample-ocr.gif

Thus, reCAPTCHA was born. At this point you should be able to guess what it does: takes scanned images from actual books, with which optical character recognition software is struggling, and uses them as the source material for CAPTCHAs. The project is up and running, and can be implemented anywhere the ordinary CAPTCHAs are used. Now, when you get annoyed at having to make out those squiggly words with lines slashed through them, you can take some solace in knowing that you’re making the world a better place. Or at least saving some books from the trash bin of history.

CATEGORIZED UNDER: Computing
  • http://tristram.squarespace.com Tristram Brelstaff

    Apparently spammers are already using a variant of this idea to automate the breaking of captchas.

  • archgoon

    Ah, there seems to be a bit of missing information. How do they determine that the answer has been correctly entered? If you are using the CAPTCHA to figure out what the CAPTCHA says, how do you know that they got the right answer?

  • archgoon

    Ah, that’s the reason for two. Gotcha.

  • dibyadeep

    I don’t really understand how it might help decipher old writing from a book. If you use these, then people might as well type anything to get through; how would the computer know in the first place what the actual word is?

  • archgoon

    dibyadeep, note in the image, you’ve got two words. One word is a known, generated CAPTCHA, which ensures that the input is coming from a human, the other is an unknown word, which is the one we want to decipher. The user doesn’t know which is the CAPTCHA, and which is the generated source (actually, this only matters if the user is pathological), so answers both.

  • archgoon

    Oh, and to guarantee that the user didn’t make a mistake, multiple people can be given the same unknown one for confirmation.
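The scheme described in the last two comments (one known control word, one unknown word, and agreement across several users) might be sketched roughly as follows. This is only an illustration of the idea as the commenters describe it, not the actual reCAPTCHA implementation; the function names, the example words, and the three-vote threshold are all assumptions.

```python
from collections import Counter
from typing import Optional


def passes_challenge(control_answer: str, typed_control: str) -> bool:
    """Treat the user as human only if the known control word matches."""
    return typed_control.strip().lower() == control_answer.strip().lower()


def record_guess(votes: Counter, control_answer: str,
                 typed_control: str, typed_unknown: str) -> bool:
    """Count a guess for the unknown word only when the control word is right."""
    if not passes_challenge(control_answer, typed_control):
        return False
    votes[typed_unknown.strip().lower()] += 1
    return True


def accepted_reading(votes: Counter, min_votes: int = 3) -> Optional[str]:
    """Return a reading once enough users agree on it (threshold is assumed)."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


# Hypothetical example: "morning" is the control word, "upon" the scanned word.
votes = Counter()
record_guess(votes, "morning", "morning", "upon")
record_guess(votes, "morning", "mourning", "xyz")  # control word wrong: ignored
record_guess(votes, "morning", "Morning", "upon")
record_guess(votes, "morning", "morning", "upon")
print(accepted_reading(votes))  # prints "upon"
```

Because the user cannot tell which word is the control, typing nonsense for either one risks failing the test, and a stray mistake on the unknown word is outvoted by other users.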

  • http://lablemminglounge.blogspot.com/ Lab Lemming

    So how do we get Google or (insert favorite mega commercial blog host here) to use this?


  • http://backreaction.blogspot.com/ B

    Needless to say, the actual problem is spam, which is – either way you turn it – an enormous waste of time and energy. It is a mystery to me why it is still legal, given the inconveniences it causes for servers and IT staff all around the world.

  • Moshe

    B., spam is illegal but enforcement is a serious issue. Look at

    http://www.newyorker.com/reporting/2007/08/06/070806fa_fact_specter

    for an interesting take on the issue.


  • http://backreaction.blogspot.com/ B

    Hi Moshe,

    Thanks, that’s a nice article indeed. (I had no clue where the word spam comes from!) Well, I guess I’d just take all sites that are advertised in spam mails off the name servers, until they’ve proven it was a mistake. End of problem. There’s a slight chance one or the other site might temporarily be unavailable accidentally, but this seems to me like a price I’d be willing to pay. Just the mere existence of such a procedure would make a big difference.

    Best,

    B.


  • http://www.stevepepple.wordpress.com steve

    The idea of the reCAPTCHA is compelling.

    Yet, a problem I have with CAPTCHAs in general is their burden on users of websites. So I’m interested in ways of increasing security without burdening people with tasks such as filling in a CAPTCHA form.

    One such technique is a simple client honeypot (a spammer trap) that creates a CAPTCHA, or other form field, that is invisible to the website user. The spam bot, however, “sees” and tries to fill in the honeypot field. If the invisible field is filled in, then the website knows that it’s a spammer or other bot hacker.
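A minimal server-side sketch of the honeypot check described in this comment might look like the following. It assumes the form includes an extra field that is hidden from human visitors (for example with CSS); the field name `website` and the sample submissions are hypothetical.

```python
HONEYPOT_FIELD = "website"  # hypothetical field name, hidden from human visitors


def looks_like_bot(form_data: dict) -> bool:
    """Flag a submission if the hidden honeypot field came back non-empty."""
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())


# Made-up submissions: a human leaves the invisible field blank, a bot fills it.
human = {"name": "steve", "comment": "Nice post", "website": ""}
bot = {"name": "x", "comment": "buy now", "website": "http://spam.example"}
print(looks_like_bot(human), looks_like_bot(bot))  # prints "False True"
```

A check like this costs the human user nothing, though it only catches bots that fill in every field they find.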

  • Thomas D

    Hmm… the portion of aged text is ungrammatical. It has a singular subject (portion) and a plural verb (were).

    Otherwise, excellent idea!


Cosmic Variance

Random samplings from a universe of ideas.

About Sean Carroll

Sean Carroll is a Senior Research Associate in the Department of Physics at the California Institute of Technology. His research interests include theoretical aspects of cosmology, field theory, and gravitation. His most recent book is The Particle at the End of the Universe, about the Large Hadron Collider and the search for the Higgs boson. Here are some of his favorite blog posts, home page, and email: carroll [at] cosmicvariance.com .
