Skeptikal.org

Wednesday, April 8, 2009

Solving Semantic CAPTCHAs with Google

This came up in a discussion with a developer quite a while ago, and I've been meaning to write a proof-of-concept ever since. I haven't had a chance to yet, but I think the idea is interesting enough to discuss on its own.

The idea of using semantic CAPTCHAs has been discussed many times, and even implemented in a few cases. So far, these implementations have held up reasonably well, but only because breaking them isn't worth the trouble yet.

One example of a semantic CAPTCHA is found on php.net, the online bible for PHP developers. Each PHP function's manual page has a "comments" section where people can leave helpful tips for using the function in question. This comments form is a particularly poor implementation of a semantic CAPTCHA (among other things, it's vulnerable to replay attacks and has a very limited number of possible solutions), and it also contains an open redirect hole. Let's focus on solving the CAPTCHA without exploiting those flaws.

PHP.net semantic captcha

This CAPTCHA asks the user to solve a math problem, but substitutes words for mathematical symbols. In this example, the question is "two minus one". Anybody who has ever used the Google calculator can solve that one with a single request to http://www.google.com/search?q=two+minus+one:

Google Calculator breaking Semantic Captcha
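As a rough sketch of that request, the question text only needs to be URL-encoded into the search URL shown above. The post doesn't show the format of Google's response, so parsing it is left out here; instead, `solve_locally` is a hypothetical stand-in that evaluates the same kind of question offline. The word lists are assumptions, since the only question shown is "two minus one".

```python
import operator
from urllib.parse import urlencode

# Assumed vocabulary: the only question shown in the post is "two minus one".
NUMBERS = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten".split())}
OPERATORS = {"plus": operator.add, "minus": operator.sub, "times": operator.mul}


def build_query_url(question):
    """Build the Google search URL for a question such as 'two minus one'."""
    return "http://www.google.com/search?" + urlencode({"q": question})


def solve_locally(question):
    """Hypothetical offline solver for '<number> <operator> <number>' questions."""
    left, op, right = question.lower().split()
    return OPERATORS[op](NUMBERS[left], NUMBERS[right])


print(build_query_url("two minus one"))  # http://www.google.com/search?q=two+minus+one
print(solve_locally("two minus one"))    # 1
```

With a solution space this small, the offline lookup would probably do the job on its own. The Google request matters more for questions phrased in ways a fixed word list doesn't anticipate.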

It's a simple matter of plugging the data into Google and parsing the output. It could be argued that this isn't really a semantic CAPTCHA, just a mathematical one with a semantic output format. Other developers have suggested asking pop culture or trivia questions instead. The example that was given to me was "Is Paris Hilton A Slut?". The obvious flaw with this question is that it has a boolean solution, so a bot guessing at random gets through 50% of the time, which is far too high for a CAPTCHA. However, if we wanted to improve our chances, we could turn to Google again:

Google Results: Is Paris Hilton a Slut

Just by parsing the titles of the returned result pages, we get two titles containing negations. If we parsed the result descriptions, we'd get a few more, but statistically, it's pretty clear that Google thinks Paris Hilton is a slut. The AI to do this type of language processing has been around since 1966, when ELIZA was written.
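The counting described above could be sketched roughly as follows. This is a naive illustration, not a tested attack: the negation word list, the function names, and the sample titles are all made up for the example, and fetching or scraping the actual result titles is not shown.

```python
import re

# Assumed list of negation words; a real attempt would need a much better one.
NEGATIONS = {"no", "not", "never", "isn't", "isnt", "false"}


def title_is_negative(title):
    """Return True if the title contains any negation word."""
    words = re.findall(r"[a-z']+", title.lower())
    return any(word in NEGATIONS for word in words)


def guess_answer(titles):
    """Answer 'no' only if more than half of the titles contain a negation."""
    negative = sum(title_is_negative(title) for title in titles)
    return "no" if negative > len(titles) / 2 else "yes"


# Made-up titles standing in for the titles of a results page.
sample_titles = [
    "Is the sky blue? - Some Answers Site",
    "The sky is not blue, says local contrarian",
    "Why the sky is blue",
    "Blue sky explained",
    "Sky: definitely blue",
]
print(guess_answer(sample_titles))
```

Counting negations this way ignores what each negation actually applies to, so it only improves on a coin flip when the results lean heavily one way.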


4 Comments:

  • SPAM sucks. CAPTCHA sucks. So someone added this ~3 years ago to see what would happen... it works alright so change hasn't been on anyone's mind. Yet.

    By Anonymous Philip Olson, At April 15, 2009 7:45 PM  

  • The CAPTCHA on the note pages is indeed crap, although not many spammers have got through it.

    The "CAPTCHA" on http://php.net/cvs-php however has still not been broken :)

    By Anonymous bjori, At April 16, 2009 12:06 AM  

  • The last part of your analysis is unfair -- you're just lucky to have found some easily parsable Google results. Try 'Is Paris Hilton rich?' and 'Is Paris Hilton a model?' and you'll have a very tough time.

    By Anonymous Michael, At June 1, 2009 1:51 PM  

  • I agree with Michael. Also, the parsing required would be much more complex than that used in Eliza.

    By Anonymous Thomas, At June 7, 2009 3:30 PM  
