In an earlier post of mine, Building a Challenge/Response Spam Blocking System, I examined a simple challenge/response anti-spam system I built to halt the deluge of spam that clogs my inbox each day. In fact, I wrote a short article on it detailing the inner workings, observations, and commentary: Stopping Spam.
While the challenge/response system was effective in reducing my spam intake from about 100 messages a day to around 1 or 2 messages a day, the approach, in my estimation, was not ideal. One big disadvantage was that fewer people took the time to respond to the challenge email than I had anticipated. The reasons for this, I deduce, were two-fold. Some people don't want to take the time to follow instructions for a challenge email - maybe their message wasn't that important after all, maybe they're busy, or maybe they just don't like being told what to do. These people's messages, I reckoned, weren't that vital. I mean, if you can't take two seconds to respond to the challenge, then just how important is that email you're sending me?
What worried me, and led me to suspend my C/R anti-spam system, is that I noticed some people weren't responding to the challenge email because they never received it! This unfortunate circumstance could happen if their own spam blocking solution halted my challenge email. (A couple folks informed me that Outlook 2003 categorized my challenge emails as spam. Others using a similar challenge/response anti-spam system would never get my challenge as my challenge would generate a challenge on their side.)
I've been using the Spambayes Outlook Plugin the last couple of days and have been impressed with the results. Spambayes identifies spam using Bayesian techniques, which essentially means it relies on Bayes' Theorem to determine if a message is spam or not. Bayes' Theorem was postulated by Reverend Thomas Bayes in the 18th century, and gives a formula for determining conditional probability. It's useful for answering questions like, "If we know that someone voted for Bush in 2000, what is the probability that he lived in Texas?" Let T be the set of people who live in Texas and let B be the set of people who voted for George Bush. We define the probability that our voter lives in Texas if he voted for Bush, denoted P(T | B), as P(T ^ B) / P(B). Here, P(T ^ B) is the probability that a random US citizen both lives in Texas and voted for Bush, and P(B) is the probability that a random US citizen voted for Bush.
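To make the formula concrete, here is a minimal Python sketch of P(T | B) = P(T ^ B) / P(B). The counts are made up purely for illustration (they are not real election figures), and the function name is my own.

```python
from fractions import Fraction

def conditional_probability(both, given, population):
    """Return P(T | B) = P(T ^ B) / P(B) from raw head counts."""
    if given == 0:
        raise ValueError("P(B) is zero, so P(T | B) is undefined")
    p_t_and_b = Fraction(both, population)   # P(T ^ B)
    p_b = Fraction(given, population)        # P(B)
    return p_t_and_b / p_b

# Hypothetical numbers, NOT real data: 1000 citizens, 300 of whom voted
# for Bush, and 30 of whom both live in Texas and voted for Bush.
p_t_given_b = conditional_probability(both=30, given=300, population=1000)
print(p_t_given_b)  # 1/10
```

Notice that the population size cancels out: all that matters is what share of the Bush voters are Texans.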
Now, how can this be used to help stop spam? The way Spambayes works is by tokenizing each and every incoming email. It then looks up each token in an internal database and determines how likely it is that the token belongs to spam or ham (ham being non-spam). Spambayes parses each incoming email (and its headers) and asks, "If an email has token x, what is the probability that it is spam?" So, I might get a piece of Viagra spam and Spambayes would find tokens like 'V1agra', 'all night!', 'pleasure her', and so on, and, based on its knowledge from past spams and hams, it would deduce mathematically that there was a high probability that this was a spam message itself. Pretty cool.
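Here is a rough Python sketch of the general idea: look up a per-token spam probability, then combine the tokens into a single score. This is a simplified naive-Bayes-style combination with equal priors, not Spambayes's actual code; its real tokenizer and scoring are more involved. The whitespace tokenizer, the probability table, and the function name are all hypothetical.

```python
import math

# Hypothetical P(spam | token) values, standing in for the trained database.
TOKEN_SPAM_PROB = {
    "v1agra": 0.99,
    "pleasure": 0.90,
    "datagrid": 0.01,
    "dim": 0.02,
    "function": 0.03,
}

def spam_score(message, table=TOKEN_SPAM_PROB, unknown=0.5):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    tokens = set(message.lower().split())  # naive whitespace tokenizer
    log_spam = 0.0
    log_ham = 0.0
    for token in tokens:
        p = table.get(token, unknown)      # unseen tokens are neutral
        p = min(max(p, 0.01), 0.99)        # keep log() away from 0 and 1
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Equivalent to prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    diff = log_spam - log_ham
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

print(spam_score("V1agra pleasure") > 0.9)        # True
print(spam_score("Dim DataGrid function") < 0.1)  # True
```

Working in log space avoids underflow when a long email multiplies together hundreds of small probabilities.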
Spambayes allows you to view the "score" for each email - namely, the probability that the email is spam. It shows how the email was tokenized and how often each token appears in spam vs. in ham. Can you guess what words correlate highly to ham for me? Since I receive a lot of email from listservs, where people post code, an email full of code syntax would be marked as ham. For example, the token 'DataGrid' has appeared in over 50 ham messages, but no spam messages, so there's a strong correlation between that token and ham. The token 'dim' has appeared in over 180 hams, and only one spam; 'dynamic' in 99 hams, 3 spams; 'template' in 63 hams, no spams; 'dataset' in 61 hams, 0 spams; 'ddl' in 392 hams and 13 spams; 'function' in 160 hams and not a single spam.
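To see just how lopsided those counts are, here is a small Python sketch that turns them into a crude "spamminess" ratio, spam / (spam + ham). The counts are the ones quoted above (with "over 50" and "over 180" rounded down to 50 and 180). The plain ratio is my own simplification: it ignores how many hams and spams were trained on in total, so it is not the number Spambayes itself would report.

```python
# (ham_count, spam_count) per token, as quoted in the post.
TOKEN_COUNTS = {
    "datagrid": (50, 0),
    "dim": (180, 1),
    "dynamic": (99, 3),
    "template": (63, 0),
    "dataset": (61, 0),
    "ddl": (392, 13),
    "function": (160, 0),
}

def spam_ratio(ham, spam):
    """Fraction of a token's appearances that were in spam (0.5 if unseen)."""
    total = ham + spam
    if total == 0:
        return 0.5
    return spam / total

# List the tokens from most hammy to least hammy.
for token, (ham, spam) in sorted(TOKEN_COUNTS.items(),
                                 key=lambda item: spam_ratio(*item[1])):
    print(f"{token:10s} ham={ham:4d} spam={spam:3d} "
          f"ratio={spam_ratio(ham, spam):.3f}")
```

Even 'ddl', the "spammiest" of the bunch, shows up in spam only about 3% of the time.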
If you use Spambayes, your hammy tokens would, of course, be different than mine, since it adapts to the emails you receive. So now spammers know how to break my filters - flood their spams with words like function, datagrid, dataset, dim, and so on! :-)