Scott on Writing

Musings on technical writing...

The End of Comment Spam?

What is Comment Spam?
Comment spam is an evil and real problem for blogs.  The premise goes as follows: evil, vile spammers use old blog entries to post comments that are littered with links to their porn/gambling/diploma/pharmaceutical sites, so that when Google/MSN/Yahoo! spider the blog they find these links, add them to their indexes, spider them, and improve the spammers' page rank.  Technologies that are designed to make posting easier, such as CommentAPI, just help these ne'er-do-wells automate their comment spam posting.

Past Techniques for Stopping Comment Spam
Until recently, the main approaches for stopping comment spam have been:

  • Moderation - a post doesn't appear on a blog until the blog owner reviews and approves it.  The advantage of this is that only on-topic, non-spam/non-inflammatory posts are displayed; the disadvantage is that the blog owner must now take the time to micro-manage approval of messages.
  • Use of a Captcha - a captcha is a test that most humans can pass, but current computer programs cannot.  We've all seen these: typically a sequence of wavy letters that you must type into a textbox before proceeding.  The downside to captchas is that, to my knowledge, the CommentAPI specification does not support them, so you can only utilize captchas for comments entered through the Web interface.  (There's a Captcha control for .Text blogs, as discussed here.)
  • Banning Certain Substrings from Comments - another approach, which is the one I use here on ScottOnWriting.NET, is to simply restrict certain substrings from appearing in the comment.  There are varying degrees of complexity that can be applied here.  I simply have a set of static strings I search for and add to them when a particularly nasty comment spammer starts causing trouble.  Other solutions actually utilize a global blacklist of URLs used by comment spammers, such as http://www.jayallen.org/comment_spam/blacklist.txt.
  • Munging the URLs in Comments - since comment spammers post their URLs to improve their rank in the search engines, one can remove the impetus for a spammer by removing their desired benefit.  One way to accomplish this is to munge the URLs in a comment from something like http://www.somesite.com/BuyViagra.htm to redirect.aspx?http://www.somesite.com/BuyViagra.htm, or to utilize Google's redirect link (which doesn't impact PageRank): http://www.google.com/url?sa=D&q=URL, as discussed here.
  • Require Authentication to Post Comments - many online forums use this technique, requiring that a user have an account before being able to post.  The theory here is that if someone starts posting spam or off-topic, inflammatory posts, they can be banned and their obnoxious posts deleted.  Sure, a motivated spammer can create a new account, but they have to go through the process of using a new email address, filling out an account creation form, and verifying their account by clicking on a link received in an email.  The major downsides to this are (1) that CommentAPI (to my knowledge) doesn't support any sort of authentication, and (2) that those who want to post to your blog need to create an account.  And if another blogger takes the same approach, commenters will need to create another account over there, and so on for every blog that requires authentication.
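
To make two of the approaches above concrete, here is a minimal Python sketch of substring banning and URL munging.  This is not the code running on this blog (which does its check in a stored procedure); the banned strings, the function names, and the regular expression are all assumptions made for illustration, and the redirect prefix simply mirrors the redirect.aspx example above.

```python
import re

# Hypothetical blocklist; a real one would grow as new spammers show up.
BANNED_SUBSTRINGS = ["buyviagra", "online-casino"]

# Hypothetical redirect page, modeled on the redirect.aspx example in the text.
REDIRECT_PREFIX = "redirect.aspx?"

# Rough pattern for http/https URLs: runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def is_spam(comment: str) -> bool:
    """Return True if the comment contains any banned substring (case-insensitive)."""
    lowered = comment.lower()
    return any(banned in lowered for banned in BANNED_SUBSTRINGS)


def munge_urls(comment: str) -> str:
    """Prefix every URL with the redirect page so the link no longer points straight at the target."""
    return URL_PATTERN.sub(lambda match: REDIRECT_PREFIX + match.group(0), comment)
```

A blog engine might call is_spam() before saving a comment and munge_urls() when rendering it.  Note that the URL pattern is deliberately crude: it will, for instance, swallow a period that immediately follows a URL at the end of a sentence.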

None of these solutions are really panaceas; the true fix for comment spam is to have some centralized user store and to have blogs require folks to authenticate against this store in order to post.  I blabbed on more about this idea in a past blog entry, Improving the Blog Commenting Experience.

A New Alternative to Fighting Comment Spam
Yesterday Google announced a new value for the rel attribute of anchor tags that, if present, indicates that its spiders won't follow the URL, thereby negating the benefits of comment spamming (much like URL munging removes the benefits, except this approach, IMO, is simpler).  Basically, if you add rel="nofollow" to a link, Google won't spider it (e.g., <a href="Blah.aspx" rel="nofollow">This won't be spidered!</a>).
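
As a rough illustration of what a blog engine would need to do, here is a small Python sketch that adds rel="nofollow" to every anchor tag in a comment's HTML.  The function name and the regular expressions are my own assumptions, not code from any actual blog engine; it drops any rel value the commenter supplied so that a spammer can't opt out, and a production version would be better off using a real HTML parser.

```python
import re

# Matches an opening <a ...> tag and captures its attribute text.
ANCHOR_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)

# Matches an existing rel attribute (double-quoted, single-quoted, or unquoted value).
EXISTING_REL = re.compile(r"""\s+rel\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def add_nofollow(html: str) -> str:
    """Return the HTML with rel="nofollow" set on every opening <a> tag."""

    def rewrite(match: re.Match) -> str:
        # Remove whatever rel the commenter supplied, then append our own.
        attrs = EXISTING_REL.sub("", match.group(1))
        return f'<a{attrs} rel="nofollow">'

    return ANCHOR_TAG.sub(rewrite, html)


print(add_nofollow('<a href="Blah.aspx">This won\'t be spidered!</a>'))
```

Only links that come from commenters would be passed through add_nofollow(); links in the blog post itself would be left alone.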

Will this measure stop comment spam?  It depends, primarily, on how many search engines support this and, more importantly, how many blog engines support this.  The good news is that not only will Google respect the rel="nofollow" attribute, but so will MSN Search and Yahoo!  Also, a large number of blog engines have promised to utilize this technique, including:

  • LiveJournal
  • SixApart
  • Blogger
  • MSN Spaces
  • Community Server (the evolution of .Text)

Even if the vast majority of blog engines start using the rel="nofollow" attribute, comment spam may still run rampant in the hope that some blogs won't support it.  Think of it this way - how much stuff have you purchased from a spammer, yet how many spams a day do you get?  In the end, I think Google/MSN Search/Yahoo!'s addition of the rel="nofollow" attribute is a very positive step in the right direction, but one would have to be a bit naive to think that it spells the end of comment spam.  We'll still need to use one or more of the techniques I discussed previously until we finally have some global authentication/user store available that everyone agrees to use...

posted on Wednesday, January 19, 2005 9:24 AM

Feedback

# RE: Preventing Comment Spam 1/19/2005 10:07 AM .Avery Blog

# re: The End of Comment Spam? 1/19/2005 11:23 AM Artem Saveliev

This is great :) Do you have some info on how to patch .Text to add that? Or to add your "Banning Certain Substrings from Comments"? I get 2-3 blog-spams a day, and it's a mess to clean them up. I know that Google will lower the rating of my site if it finds those links in comments...

# re: The End of Comment Spam? 1/19/2005 11:26 AM Scott Mitchell

Artem, Scott Watermasysk posted on how to modify the .Text sources to add the rel="nofollow" bit to comment URLs:
http://scottwater.com/blog/archive/2005/01/19/rel_nofollow_quickchange

As for banning certain substrings I simply do a check at the stored procedure level. Specifically, blog_InsertEntry has a line of code that exits the sproc without INSERTing if certain substrings are found within the comment.

# re: The End of Comment Spam? 1/19/2005 12:54 PM Scott

Doesn't just adding the nofollow attribute to all of the links in your comments section also hurt legitimate commenters?

For example, you linked to Scott W's blog because it had useful information. Wouldn't you want his pagerank to go up for that very fact? With the no follow link in it he doesn't gain anything.

I think the best way for now is spam catchers with captcha and the ability to mass remove the nofollow attribute from comments (e.g. comments that you verify). I think the ability to selectively remove/apply the nofollow attribute would be best.

# Implementing CAPTCHA, and my thoughts on 1/19/2005 1:03 PM Bob.Yexley.Blog

# re: The End of Comment Spam? 1/19/2005 1:35 PM Scott Mitchell

Scott, yeah, that's one disadvantage of nofollow: it robs the Google juice from those links that deserve it. I doubt many blog engines will implement what you suggest (per-comment nofollow attribution), but who knows.

Personally, I don't think it's that bad of an idea. I mean, a link in a comment shouldn't, in theory, carry as much weight as, say, a link in the main blog post.

Another concern, mentioned by James Avery in this blog entry - http://dotavery.com/blog/archive/2005/01/19/2324.aspx - is that people can massage the Google juice w/nofollow. He states: "Let's say someone is posting a link to a competitor's press release, now they can throw a little attribute in and deprive that competitor of well deserved Google juice. Google is giving us a little bit of control over the most important part of their search engine, something I did not think they would want to do."

I don't know how valid a concern this is, since one can already use redirection techniques to thwart giving Google juice, but he does have a good point about Google giving up more control over its PageRanking. I think, though, that this is a good idea - a page developer should be able to express whether a link is really "worthy" or not in a boolean sort of manner, and that's what nofollow gives us.

# re: Fighting with Blog Spam 1/19/2005 3:11 PM Kent J. Chen's Weblog

# Cool Icons and Images in .Text Skin 1/20/2005 7:20 AM Giddy Up!

# re: The End of Comment Spam? 1/21/2005 8:49 AM Shawn B.

I'm thinking that the feature in the blogs can be enabled by default, and then an admin can just remove the rel attribute from a comment once it's proven to be "safe". No different, I suppose, than moderation in the end.

Thanks,
Shawn

# Stopping Comment Spam in .Text Using Triggers 1/24/2005 11:22 AM Scott on Writing

# re: The End of Comment Spam? 1/24/2005 12:05 PM Ann Elisabeth

How about hopefully catching the spammer, somehow figure out a way to shut the outfit down?

Not sure how to do that, but I may have found the spammer.

# The Worth(lessness) of CAPTCHAs 2/2/2005 9:27 AM Scott on Writing

# re: The End of Comment Spam? 12/11/2006 2:40 AM Doc

I have a blogspot blog and often comment on my own blog to link to older related posts, and now I just found out I'm wasting my time. Is there an option on blogspot to shut off this nofollow thing? If not, then blogspot should add a checkbox in settings!!


I am a Microsoft MVP for ASP.NET.
I am an ASPInsider.