Stopping Comment Spam in .Text Using Triggers
With Google's recent rel=”nofollow” initiative designed to reduce blog comment spam, I've been reading up more on comment spam, techniques to fight it, and so on. One great resource is the Comment Spam blog, which contains a blacklist of comment spammer URLs and a plugin for blocking comment spam for those with a Movable Type blog.
As discussed earlier, there are a number of techniques for fighting comment spam - moderation, filters, captchas, comment URL munging, and so on. Being a typical developer, one of my main motivations in any endeavor is choosing a tactic that will require the least amount of work from yours truly. Clearly this rules out moderation. I have doubts that URL munging, which Google's initiative is fostering, will stop comment spam. It will reduce its effectiveness, granted, but spammers are a stubborn and evil bunch, and I don't see them just throwing in the towel.
My preferred technique for stopping comment spam before it starts is to use filters. If you have access to your .Text blog's SQL Server, you can setup filtering fairly simple. My first attempt at this was to modify the blog_InsertEntry stored procedure and add hard-coded IF statements that would short circuit the INSERT into the database table if the comment text contained any offending URLs. Basically this came down to something like:
IF CHARINDEX(bannedURL1, @Text) = 0 OR CHARINDEX(bannedURL2, @Text) = 0 ... OR CHARINDEX(bannedURLN, @Text) = 0
BEGIN
... run stored procedure ...
END
Basically the logic here was if the text did not contain an instance of any of the bannedURLs, then run the stored procedure statements. Otherwise, none of the sproc's instructions were executed, resulting in nothing being commited to the database.
The problem with this approach was that the bannedURLs were hard-coded in the sproc, so if a new spammer came along, I'd have to add another OR clause to that ever-growing IF statement. This violated my “do as little work as possible” creed. In reading up on comment spam, I came across the blog entry Preventing .Text Blog Spam Using Triggers. This technique did three things that I liked:
- It used triggers rather than mucking with the stored procedure. The benefit of this, as I see, is that it makes the comment spam stopping piece orthogonal to the .Text application. That is, it's a component that could be added or removed without requiring any change to the core .Text sprocs, tables, etc.
- Rather than using hard-coded checks, the trigger queried a table that contained offending substrings. This way, updating the list of banned URLs could be done through a simple Web-based interface.
- In addition to checking for banned substrings, the trigger approach also counted up the number of links in the text body and blocked comments that had 15 or more links. (Many times comment spammers will leave dozens of links in one post.
I made a few changes to the script provided. The script provided had a table that would allow one to specify banned substrings. That is, you could remove comments with the word Viagra, for example. I decided this was more flexibility than needed, so I changed the table and logic to allow just URLs being specified. Additionally, I had to change the INSERT statements in the trigger since I use a more antiquated version of .Text.
The one downside of this approach (or any filter approach of this manner) is that each blogger has his own list of banned URLs. If a new comment spammer Web site comes about and this comment spammer starts spreading around his or her links, each blogger will have to waste their own time, adding this URL to the blacklist. What we need is a distributed model, where everyone can contribute to the blacklist globally, and .Text blogs will automatically update their table of banned URLs periodically, as well as update the global table. I don't know of a manner to automate this process, but it wouldn't be difficult to create a WinForms desktop client that would allow a user to add new entries to the blacklist and, optionally, update/download the data from the global blacklist. Do you think there would be a demand for this? Would enough people use this to make it worthwhile?