In a previous entry on stopping comment spam, I discussed a gaggle of techniques that can be used to fight the scourge of comment spam. The list offered five techniques:
- Moderation
- Use of a CAPTCHA
- Banning certain substrings from the comments
- Munging URLs in comments to remove the impetus for comment spamming (rel="nofollow" being the latest iteration of this)
- Require authentication to post comments
After thinking about the topic of stopping comment spam in more detail, along with perusing the flood of blog entries around the Web discussing rel="nofollow" and other comment spam-stopping approaches, I want to prune the list and remove the second and fourth options: CAPTCHAs and URL munging. Don't get me wrong, CAPTCHAs and munging URLs are better options than doing nothing at all, but I think both provide a false sense of security.
Munging URLs Just Plain Doesn't Work at Stopping Spam
The thought behind munging URLs is that removing the benefit of comment spam - having a comment spammer's URLs improve in PageRank on Google and other search engines - will make comment spammers collectively throw up their hands, sigh, and say, "All right, you win, we'll stop spamming your blog." And for sane, rational folks, that would seem like a sensible reaction. But to spammers, whose hearts are three sizes too small (and black, and festered), what's the cost of spamming? Virtually nil. So why bother stopping? From the spammer's point of view, in the best case, the blog being spammed doesn't use URL munging techniques; in the worst case, it does, but the links will still be viewed by people who visit the blog, even if the search engines don't spider them.
Consider email spam. Over the years, ISPs have developed better filters; email clients like Outlook have added Junk Email filters; tools like SpamBayes apply Bayesian filtering; tools like Cloudmark distribute the task of identifying spam across millions of users; and challenge/response systems succeed in annoying valid senders and losing messages from the non-technologically savvy. BUT, despite all of that, do you get more or less spam today than you did, say, five years ago? Exactly. (The graph to the right shows spam as a percentage of global email. More than 80% of email received today is spam, up from 25% or so two years ago...)
For a good discussion on the utility of the rel="nofollow" hyperlink attribute, be sure to read: A Step Toward Solving Comment Spam?
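For readers who haven't seen URL munging in practice, here is a minimal sketch in Python of the rel="nofollow" flavor of it. The function name `add_nofollow` is made up for illustration, and the regex approach is an assumption chosen to keep the example short; a real blog engine would more likely run comments through a proper HTML parser or sanitizer.

```python
import re

# Matches an opening anchor tag such as <a href="...">, capturing its attributes.
# The \b keeps tags like <abbr> from matching.
_ANCHOR_OPEN = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)

# Matches an existing rel attribute (double-quoted, single-quoted, or unquoted).
_REL_ATTR = re.compile(r"""\srel\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Return comment_html with rel="nofollow" forced onto every link.

    Hypothetical helper: any rel value supplied by the commenter is
    dropped and replaced, so a spammer cannot override it.
    """
    def rewrite(match: re.Match) -> str:
        attrs = _REL_ATTR.sub("", match.group(1))
        return f'<a{attrs} rel="nofollow">'

    return _ANCHOR_OPEN.sub(rewrite, comment_html)


if __name__ == "__main__":
    print(add_nofollow('Visit <a href="http://example.com">my site</a>'))
    # prints: Visit <a href="http://example.com" rel="nofollow">my site</a>
```

Note that the link is still right there on the page for human visitors to click, which is exactly the problem described above.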
You Know What? CAPTCHAs Don't Work Either
A CAPTCHA - short for Completely Automated Public Turing test to tell Computers and Humans Apart - is, as its name implies, a test that a human can solve easily, but a computer cannot. Typically CAPTCHAs take advantage of the fact that the human brain is great at recognizing visual images and patterns, while a computer has a much harder time identifying objects in images. Tons of sites use CAPTCHAs to stop automated computer programs from creating accounts, posting comments, and so on. For example, sites like Yahoo and Hotmail display a string of twisted text over a grainy background that you have to enter into a textbox in order to create your new email account.
CAPTCHAs, though, are becoming less useful as humans find more intelligent ways to write computer programs to solve these little puzzles that were designed to be solvable only by humans. For example, on www.CAPTCHA.net, the site indicates that:
Greg Mori and Jitendra Malik of the University of California at Berkeley have written a program that can solve ez-gimpy with accuracy 83%.... Thayananthan, Stenger, Torr, and Cipolla of the Cambridge vision group have written a program that can achieve 93% correct recognition rate..., and Malik and Mori have matched their accuracy. Their programs represent significant advancements to the field of computer vision.
The latest CAPTCHA-breaking programmer is Casey Chestnut, a smart, determined guy with, perhaps, a bit too much time on his hands. Recently Casey decided that he was going to write an AI program that would break CAPTCHAs on a particular blog site, and did so with alarming success. Casey's program is quite impressive and extensive, using a neural network to do character recognition... you can read about all the gory details over at this blog entry. With his program, Casey observed around a 66% success rate at identifying the CAPTCHA, and posted over 90 spam comments on various blogs. An impressive amount of work obviously went into this project, so kudos to Casey. (There's also currently a follow-up FAQ on the CAPTCHA AI post available on the homepage of Casey's blog...)
So CAPTCHAs are pretty useless, but not just because computer programs can solve them with alarmingly high success rates. Imagine a CAPTCHA that a human could easily solve, but that not even the most cleverly programmed computer application could. Spammers could still circumvent this CAPTCHA. How? Just have humans solve it. This has been demonstrated in the past - a nefarious hacker needs to solve a CAPTCHA, so he shows it to some unsuspecting visitor on his porn site and says, "Solve this CAPTCHA for me to view free porn!" The visitor obliges, providing the answer to the CAPTCHA, which the program then uses to complete some signup or comment spam process it started moments earlier.
The Best Way to Stop Comment Spam
The only real way to 100% guarantee a stop to comment spam is to moderate all comments. Of course that can be a lot of work for the blogger and can be frustrating for the commenter, whose comments might not appear for minutes, hours, days, weeks, or... ever. The second best approach is to have some sort of user account model and only allow authenticated users to post comments. (Similarly, a well-maintained global blacklist of comment spam URLs would be roughly equivalent to requiring authentication; see this entry for more on blacklists.) Requiring authentication has the benefit that if someone does start comment spamming, their account can be banned. The downsides here are two-fold:
- It requires that posters create some sort of account. Unless there's a global authentication store, like Passport, users will have accounts on various blog systems, which is less than ideal.
- Comment spam can still get through. Granted, it's less likely since the comment spammer has to expend more effort to create the account, and the effects of comment spam are amortized over all those bloggers who use the same authentication store, but comment spam can still creep in, whereas with moderation it should never appear.
The main advantage of requiring authentication, though, is that it does not take much effort on the blogger's side. Their only responsibility is to notify the authentication store managers when some authenticated user has posted comment spam.
The best commenting experience I've seen is in blogs that have married these two concepts. They allow anyone - both anonymous and authenticated users - to post comments, but comments made by anonymous users require moderation.
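The combined policy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CommentSystem` class, its method names, and the in-memory lists are all hypothetical stand-ins for whatever storage and authentication store a real blog engine uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class CommentSystem:
    """Hypothetical in-memory model of the hybrid commenting policy."""
    banned_users: Set[str] = field(default_factory=set)
    published: List[Tuple[Optional[str], str]] = field(default_factory=list)
    moderation_queue: List[str] = field(default_factory=list)

    def submit(self, text: str, user: Optional[str] = None) -> str:
        # Anonymous comments always wait for the blogger's approval.
        if user is None:
            self.moderation_queue.append(text)
            return "pending"
        # Accounts that have been caught spamming are turned away.
        if user in self.banned_users:
            return "rejected"
        # Authenticated users in good standing are published immediately.
        self.published.append((user, text))
        return "published"

    def approve(self, index: int) -> None:
        # The blogger approves a queued anonymous comment.
        text = self.moderation_queue.pop(index)
        self.published.append((None, text))

    def ban(self, user: str) -> None:
        # Ban the account and take down everything it has posted.
        self.banned_users.add(user)
        self.published = [(u, t) for (u, t) in self.published if u != user]


if __name__ == "__main__":
    blog = CommentSystem()
    print(blog.submit("Great entry!", user="alice"))  # prints: published
    print(blog.submit("Cheap pills at example.com"))  # prints: pending
```

The blogger's workload stays small in this arrangement: only the anonymous queue needs reviewing, and a spamming account can be shut off with a single call to `ban`.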