Scott on Writing

Musings on technical writing...

The Worth(lessness) of CAPTCHAs

In a previous entry on stopping comment spam, I discussed a gaggle of techniques that can be used to fight the scourge of comment spam.  The list offered five techniques:

  1. Moderation
  2. Use of a CAPTCHA
  3. Banning certain substrings from the comments
  4. Munging URLs in comments to remove the impetus for comment spamming (rel="nofollow" being the latest iteration of this)
  5. Requiring authentication to post comments

After thinking about the topic of stopping comment spam in more detail, along with perusing the flood of blog entries around the Web discussing rel="nofollow" and other comment spam-stopping approaches, I want to prune the list and remove options #2 and #4.  Don't get me wrong, CAPTCHAs and munging URLs are better options than doing nothing at all, but I think both provide a false sense of security.

Munging URLs Just Plain Doesn't Work at Stopping Spam
The thought behind munging URLs is that by removing the benefit of comment spam - having a comment spammer's URLs improve in PageRank on Google and other search engines - comment spammers will collectively throw up their hands, sigh, and say, “All right, you win, we'll stop spamming in your blog.”  And for sane, rational folks, that would seem like a sensible reaction.  But to spammers, whose hearts are three sizes too small (and black, and festered), what's the cost of spamming?  Virtually nil.  So why bother stopping?  From the spammer's point of view, in the best case, the blog being spammed doesn't use URL munging techniques; in the worst case, it does, but the links will still be viewed by people who visit the blog, even if the search engines don't spider them.

Consider email spam.  Over the years ISPs have developed better filters; email clients like Outlook have added Junk Email filters; tools like SpamBayes apply Bayesian filtering; tools like Cloudmark allow the task of identifying spam to be distributed across millions; and challenge/response systems succeed in annoying valid senders and losing messages from the non-technologically savvy.  BUT, despite all of that, do you get more or less spam today than you did, say, five years ago?  Exactly.  (The graph to the right shows spam as a percentage of global email.  More than 80% of email received today is spam, up from 25% or so two years ago...)

For a good discussion on the utility of the rel="nofollow" hyperlink attribute, be sure to read: A Step Toward Solving Comment Spam?
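
To make the munging idea concrete, here is a minimal sketch, in Python, of adding rel="nofollow" to the links in a comment's HTML.  The function name add_nofollow and the regex-based approach are assumptions made purely for illustration; a real blog engine would more likely use a proper HTML parser.

```python
import re

# Matches the opening tag of a hyperlink, e.g. <a href="...">.
# Hypothetical helper for illustration only; regexes are fragile on real-world HTML.
_ANCHOR_OPEN = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
_REL_ATTR = re.compile(r"\brel\s*=", re.IGNORECASE)


def add_nofollow(comment_html):
    """Add rel="nofollow" to every <a> tag that has no rel attribute yet."""

    def munge(match):
        attrs = match.group(1)
        if _REL_ATTR.search(attrs):
            return match.group(0)  # leave tags that already declare rel untouched
        return f'<a{attrs} rel="nofollow">'

    return _ANCHOR_OPEN.sub(munge, comment_html)
```

Note that this only changes what search engines do with the link; the link itself still appears on the page for human visitors to click, which is exactly the weakness described above.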

You Know What?  CAPTCHAs Don't Work Either
A CAPTCHA - short for Completely Automated Public Turing test to tell Computers and Humans Apart - is, as its name implies, a test that a human can solve easily, but a computer cannot.  Typically CAPTCHAs take advantage of the fact that the human brain is great at recognizing visual images and patterns, while a computer has a much harder time identifying objects in images.  Tons of sites use CAPTCHAs to stop automated computer programs from creating accounts, posting comments, and so on.  For example, sites like Yahoo and Hotmail display a string of twisted text over a grainy background that you have to enter into a textbox in order to create your new email account.

CAPTCHAs, though, are becoming less useful as humans find more intelligent ways to write computer programs to solve these little puzzles that were designed to be solvable only by humans.  For example, on www.CAPTCHA.net, the site indicates that:

Greg Mori and Jitendra Malik of the University of California at Berkeley have written a program that can solve ez-gimpy with accuracy 83%.... Thayananthan, Stenger, Torr, and Cipolla of the Cambridge vision group have written a program that can achieve 93% correct recognition rate..., and Malik and Mori have matched their accuracy. Their programs represent significant advancements to the field of computer vision.

The latest CAPTCHA-breaking programmer is Casey Chestnut, a smart, determined guy with, perhaps, a bit too much time on his hands.  Recently Casey decided that he was going to write an AI program that would break CAPTCHAs on a particular blog site, and did so with alarming success.  Casey's program is quite impressive and extensive, using a neural network to do character recognition... you can read about all the gory details over at this blog entry.  With his program, Casey observed around a 66% success rate at identifying the CAPTCHA, and posted over 90 spam comments on various blogs.  An impressive amount of work obviously went into this project, so kudos to Casey.  (There's also currently a follow-up FAQ on the CAPTCHA AI post available on the homepage of Casey's blog...)

So CAPTCHAs are pretty useless, but not just because computer programs can solve them with alarmingly high success rates.  Imagine that a CAPTCHA existed that a human could easily solve, but not even the most cleverly programmed computer application could.  Spammers could still circumvent this CAPTCHA.  How?  Just have humans solve the CAPTCHA.  This has been demonstrated in the past - a nefarious hacker needs to solve a CAPTCHA, so he shows it to some unsuspecting visitor on his porn site and says, “Solve this CAPTCHA for me to view free porn!”  The visitor obliges, providing the answer to the CAPTCHA, which the hacker's program then uses to complete some signup or comment spam process started moments earlier.

The Best Way to Stop Comment Spam
The only real way to 100% guarantee a stop to comment spam is to moderate all comments.  Of course that can be a lot of work for the blogger and can be frustrating for the commenter, whose comments might not appear for minutes, hours, days, weeks, or... ever.  The second best approach is to have some sort of user account model and only allow authenticated users to post comments.  This has the benefit that if someone does start comment spamming, their account can be banned.  (Similarly, a well-maintained global blacklist of comment spam URLs would be equivalent to requiring authentication; see this entry for more on blacklists.)  The downsides of requiring authentication are two-fold:

  1. It requires that posters create some sort of account.  Unless there's a global authentication store, like Passport, users will have accounts on various blog systems, which is less than ideal.
  2. Comment spam can still get through.  Granted, it's less likely since the comment spammer has to expend more effort to create the account, and the effects of comment spam are amortized over all those bloggers who use the same authentication store, but comment spam can still creep in, whereas with moderation it should never appear.

The main advantage of requiring authentication, though, is that it does not take much effort on the blogger's side.  Their only responsibility is to notify the authentication store managers when some authenticated user has posted comment spam.

The best commenting experience I've seen is in blogs that have married these two concepts.  They allow anyone - both anonymous and authenticated users - to post comments, but comments made by anonymous users require moderation.

posted on Wednesday, February 02, 2005 9:27 AM

Feedback

# re: The Worth(lessness) of CAPTCHAs 2/2/2005 12:09 PM Walt Lounsbery

A great article! Here are a couple of thoughts on the subject:

I think it is a great shame to spend CPU cycles generating fuzzy images to fool comment spamming robots. And according to your article, people are using optical recognition technologies to read them anyway. That makes it technically worthless, too.

One way to rescue the idea of "human validation" is to require someone to type in a character sequence in one of several special ways. Randomize them. Examples are: reverse character order, descending numeric order, alphabetic order, reverse words. All of these are easy on cpu cycles and humans. Sorry, no fancy or neat graphics are needed.

Lastly, if we want to see more comment spam, just outlaw it in CAN SPAM II! ;-)
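
A minimal sketch, in Python, of the kind of text-only challenge described in this comment: pick one of the four transformations at random, show the commenter some text, and compare the reply with the expected answer.  The names make_challenge and check_answer, the word list, and the choice of five characters are all assumptions made for illustration, not an existing implementation.

```python
import random

LETTERS = "abcdefghjkmnpqrstuvwxyz"  # skips i, l and o to avoid look-alike characters
DIGITS = "0123456789"
WORDS = ["blue", "river", "stone", "apple", "cloud", "tiger"]  # hypothetical word list


def reverse_chars(text):
    return text[::-1]


def alphabetic(text):
    return "".join(sorted(text))


def descending_digits(text):
    return "".join(sorted(text, reverse=True))


def reverse_words(text):
    return " ".join(reversed(text.split()))


# Each entry: (instruction shown to the commenter, text generator, answer function).
CHALLENGES = [
    ("Type these characters in reverse order",
     lambda rng: "".join(rng.sample(LETTERS, 5)), reverse_chars),
    ("Type these letters in alphabetical order",
     lambda rng: "".join(rng.sample(LETTERS, 5)), alphabetic),
    ("Type these digits from largest to smallest",
     lambda rng: "".join(rng.sample(DIGITS, 5)), descending_digits),
    ("Type these words in reverse order",
     lambda rng: " ".join(rng.sample(WORDS, 3)), reverse_words),
]


def make_challenge(rng=random):
    """Return (instruction, displayed text, expected answer) for a random challenge."""
    instruction, generate, solve = rng.choice(CHALLENGES)
    text = generate(rng)
    return instruction, text, solve(text)


def check_answer(expected, submitted):
    """Compare the commenter's reply with the expected answer, ignoring case and padding."""
    return submitted.strip().lower() == expected
```

The expected answer would have to stay on the server (for example in session state) rather than in the page, otherwise a bot could simply read it back.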

# re: The Worth(lessness) of CAPTCHAs 2/2/2005 5:56 PM Milan Negovan

Guys, I disagree with the claim that CAPTCHAs are completely useless. Not completely. Blog/link spammers still have to spend their CPU cycles to break a captcha. Their end game is volume, and slowdowns make it inefficient. Correction: not as efficient.

As to embedded CAPTCHAs on porn sites---that can be solved, too. Scott, you wrote an article on URL rewriting and pointed out how to sniff which domain an image comes from and ban it if needed. I do that on my site as well.

Black-listing spammers is pointless. They use others as spam cannons, such as anonymous proxies, therefore often times those are IPs of innocent (clueless) people. Check this out: http://www.theregister.co.uk/2005/01/31/link_spamer_interview/.

I think CAN SPAM is too liberal. It should be ruthless to anyone who tries to peddle ANY online scam. After all, we're talking about a lot of money lost on bandwidth, hardware and software maintenance, productivity of individuals cleaning blogs or in-boxes, etc. Most spam is generated in the US (ironically), so this law can be enforced efficiently, IMHO.

I don't have the world's best answer to combatting spam, but the fight is on!

Thanks for food for thought, Scott. ;)

# re: The Worth(lessness) of CAPTCHAs 2/2/2005 6:16 PM Scott Mitchell

Milan, yes, the URL rewriting would work if the spammer is just copying the URL of the CAPTCHA image directly to an IMG tag on the porn site, but what if he's actually downloading the GIF from Yahoo (or Hotmail, or whatever), and then uploading it to his porn site (or some proxy site)?

Regarding black-listing, the concept with comment spam is not to black-list the IP of the spammer, but black-list particular domain names. I talk about how I use this technique on this very blog here:
http://scottonwriting.net/sowblog/posts/3083.aspx
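
A minimal sketch, in Python, of blacklisting by domain name rather than by IP: pull the URLs out of a comment and check each host against a set of banned domains.  The blacklist entries and the function name blacklisted_domains_in are placeholders made up for illustration; this is not the code the blog itself runs.

```python
import re
from urllib.parse import urlparse

# Hypothetical blacklist entries; the ".example" names are placeholders.
BLACKLISTED_DOMAINS = {"cheap-pills.example", "casino-spam.example"}

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def blacklisted_domains_in(comment, blacklist=BLACKLISTED_DOMAINS):
    """Return the set of blacklisted domains that the comment links to."""
    hits = set()
    for url in URL_PATTERN.findall(comment):
        host = (urlparse(url).hostname or "").lower()
        for domain in blacklist:
            # Match the domain itself and any subdomain of it.
            if host == domain or host.endswith("." + domain):
                hits.add(domain)
    return hits
```

A comment for which this returns a non-empty set could be rejected outright or routed to moderation.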

# re: Fighting Comment SPAM with CAPTCHA 2/3/2005 8:14 AM Giddy Up!

# re: The Worth(lessness) of CAPTCHAs 2/3/2005 8:44 AM Milan Negovan

Yep, I remember reading this post about triggers. Take link farms as an example---spammers register domains via scripts anyway, thus setting up hundreds (if not more) domains with meaningless URLs. The idea was to point them at each other. That'd be a looooong list to ban, and then they keep registering them still. That's one step behind all the time.

I think a proactive thing would be to create a web request to whatever scumbag URL they post and analyze what comes back. As opposed to a gazillion of domain names, their "vocabulary" is quite limited, and THAT is easy to blacklist.

Spammers mostly serve as gateways to the sites that actually offer porn and drugs (or simply rip people off), and live off of being paid a cut. If you open up a web request and analyze headers and content, you don't have to follow further (unless I badly misinterpret something in the workings of the HTTP protocol). This way you counter them AND they don't get paid.

I think once you weigh in offensive content your code can make an educated decision what it's dealing with.

What do you think?
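
A minimal sketch, in Python, of the "limited vocabulary" idea from this comment: score the text that comes back from a posted URL by the share of its words that appear on a term blacklist.  The term list, the 5% threshold, and the function names are all assumptions chosen for illustration; fetching the page (for instance with urllib.request.urlopen) is deliberately left out.

```python
import re

# Hypothetical vocabulary and threshold; a real list would be tuned on actual spam.
SPAM_TERMS = {"viagra", "casino", "poker", "mortgage", "pills"}
SPAM_THRESHOLD = 0.05  # flag pages where more than 5% of the words are spam terms


def spam_score(page_text, terms=SPAM_TERMS):
    """Return the fraction of words in page_text that appear in terms."""
    words = re.findall(r"[a-z']+", page_text.lower())
    if not words:
        return 0.0
    return sum(word in terms for word in words) / len(words)


def looks_like_spam(page_text, threshold=SPAM_THRESHOLD):
    """True when the page's share of blacklisted terms exceeds the threshold."""
    return spam_score(page_text) > threshold
```

For example, spam_score("Cheap pills and casino poker") counts three blacklisted words out of five.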

# The Cost of Comments 2/3/2005 10:46 AM Bryant Likes's Blog

# re: The Worth(lessness) of CAPTCHAs 2/3/2005 1:55 PM Dick K.

I don't consider cutting 20-40% of comment spam worthless at all. The point of having a list is that there's no panacea. I don't protect my computer with just AV. I also use a firewall, adaware, spyware stuff. CAPTCHA can be a useful part of a multi-pronged approach. If Casey can only get 60%, then 1/5 of your list is wiping out 2/5 of the spam. Sounds like a good ROI to me.

# re: The Worth(lessness) of CAPTCHAs 2/4/2005 7:48 AM Kiliman

I think it would be helpful if you could set a threshold on the number of URLs contained in a comment. For example, if the comment contains 3 or more URLs, then it will require moderation.

And if the spammer tries to cheat by posting many comments in a row with fewer than 3 URLs each, you can defeat them by limiting the number of comments from an IP during a certain time span. Like if 3 or more comments are posted in a 5 minute time span, then moderate all of those comments.

Granted, it might affect people behind proxies, but how often are you going to get multiple people behind the same proxy posting on the same blog at the same time?

Basically the goal is to allow legitimate comments while moderating those that exhibit spammy behavior.

Kiliman
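
A minimal sketch, in Python, of the two thresholds proposed in this comment: hold a comment for moderation when it contains 3 or more URLs, or when it is the 3rd or later comment from the same IP within 5 minutes.  The class name ModerationFilter and the in-memory bookkeeping are assumptions for illustration; a real blog engine would persist this state.

```python
import re
import time
from collections import defaultdict, deque

MAX_URLS = 3             # 3 or more URLs in one comment triggers moderation
MAX_COMMENTS = 3         # 3 or more comments from one IP ...
WINDOW_SECONDS = 5 * 60  # ... within a 5 minute span triggers moderation

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


class ModerationFilter:
    """Decides whether a comment is published directly or held for moderation."""

    def __init__(self):
        # Maps an IP address to the timestamps of its recent comments.
        self._recent = defaultdict(deque)

    def needs_moderation(self, ip, comment, now=None):
        now = time.time() if now is None else now
        stamps = self._recent[ip]
        stamps.append(now)
        # Drop timestamps that have fallen out of the time window.
        while stamps and now - stamps[0] > WINDOW_SECONDS:
            stamps.popleft()
        too_many_urls = len(URL_PATTERN.findall(comment)) >= MAX_URLS
        too_frequent = len(stamps) >= MAX_COMMENTS
        return too_many_urls or too_frequent
```

One difference from the proposal: this sketch flags only the comment that crosses the limit and the ones after it; retroactively pulling the earlier comments from the same burst into moderation would need an extra step.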

# re: The Worth(lessness) of CAPTCHAs 2/23/2005 8:51 AM Jeff Atwood

> Milan, yes, the URL rewriting would work if the spammer is just copying the URL of the CAPTCHA image directly to an IMG tag on the porn site, but what if he's actually donwloading the GIF from Yahoo (or Hotmail, or whatever), and then uploading it to his porn site (or some proxy site)?

Think of all the work we've just created for the spammer. They won't bother-- there are so many easier targets out there. CAPTCHAs *absolutely* work, although they shouldn't be your only layer of defense.

Also, the CAPTCHA that whats-his-name cracked had kind of a goofball design: the characters are all continuous, unbroken areas of contrast.

# Giving a CAPTCHA a Whirl 7/11/2006 6:21 PM Community Blogs

Comment spam is evil. I've been getting to the tune of 25-50 comment spams per day the past several weeks.

# re: The Worth(lessness) of CAPTCHAs 7/31/2006 5:50 AM Stuart Moncrieff

A while back, I wrote a similar entry on combatting comment spam.

Here are the pertinent bits...

Captchas aren’t foolproof, but I think they are the best solution at this time. No blogging tools have included a captcha implementation in the main source tree, but they all seem to have this option as a plug-in or mod. The main objection to captchas is that they disadvantage the blind - this is coming from the same people who have no problem blacklisting all comments from a class C IP address range. Surely they couldn’t object to captchas if users were given the alternative of logging in if they can’t read it.

...as with any anti-spam method, the important thing is not to be totally immune to spam, but to be hard enough that spammers go elsewhere...

While it is simple to write a bot that will handle this, until more people do this the spammers may not bother. It’s like the joke about the two men who stumble upon the hungry mountain lion. “I don’t have to outrun the lion, I just have to outrun you.”

Cheers,
Stuart.

# Not Much on My Mind Right Now 2/10/2007 10:06 PM George V. Reilly's Technical BLog

I have two blogs, my personal blog and my technical blog . The technical blog is a small subset of the
