Scott on Writing

Musings on technical writing...

Giving SpamBayes a Try

In an earlier post of mine, Building a Challenge/Response Spam Blocking System, I examined a simple challenge/response anti-spam system I built to halt the deluge of spam that clogs my inbox each day. In fact, I wrote a short article detailing its inner workings, along with some observations and commentary: Stopping Spam.

While the challenge/response system was effective in reducing my spam intake from about 100 messages a day to around 1 or 2 messages a day, the approach, in my estimation, was not ideal. One big disadvantage was that fewer people took the time to respond to the challenge email than I had anticipated. The reasons for this, I deduce, were two-fold. Some people don't want to take the time to follow instructions for a challenge email - maybe their message wasn't that important after all, maybe they're busy, or maybe they just don't like being told what to do. These people's messages, I reckoned, weren't that vital. I mean, if you can't take two seconds to respond to the challenge, then just how important is that email you're sending me?

What worried me, and led me to suspend my C/R anti-spam system, is that I noticed some people weren't responding to the challenge email because they never received it! This unfortunate circumstance could happen if their own spam blocking solution halted my challenge email. (A couple folks informed me that Outlook 2003 categorized my challenge emails as spam. Others using a similar challenge/response anti-spam system would never get my challenge as my challenge would generate a challenge on their side.)

I've been using the Spambayes Outlook Plugin for the last couple of days and have been impressed with the results. Spambayes identifies spam using Bayesian techniques, which essentially means it relies on Bayes' Theorem to determine if a message is spam or not. Bayes' Theorem was postulated by Reverend Thomas Bayes in the 18th century, and gives a formula for determining conditional probability. It's useful for answering questions like, "If we know that someone voted for Bush in 2000, what is the probability that he lived in Texas?" Let T be the set of people who lived in Texas and let B be the set of people who voted for George Bush in 2000. We define the probability that our voter lived in Texas given that he voted for Bush, denoted P(T | B), as P(T ^ B) / P(B). Here, P(T ^ B) is the probability that a random US citizen both lived in Texas and voted for Bush, and P(B) is the probability that a random US citizen voted for Bush.
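Written out as math, the definition above looks like the first formula below. The second formula is the form usually quoted as Bayes' Theorem; it follows from the first because P(T ^ B) can be rewritten as P(B | T) P(T). T and B are the same sets as in the paragraph above, and I'm not plugging in any real vote counts here.

```latex
P(T \mid B) = \frac{P(T \cap B)}{P(B)}
\qquad\Longrightarrow\qquad
P(T \mid B) = \frac{P(B \mid T)\,P(T)}{P(B)}
```

For spam filtering, read T as "this message is spam" and B as "this message contains token x".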

Now, how can this be used to help stop spam? Spambayes works by tokenizing each and every incoming email. It then looks up each token in an internal database to determine how likely it is that the token belongs to spam or ham (ham being non-spam). Spambayes parses each incoming email (and its headers) and asks, "If an email has token x, what is the probability that it is spam?" So, I might get a piece of Viagra spam and Spambayes would find tokens like 'V1agra', 'all night!', 'pleasure her', and so on, and, based on its knowledge from past spams and hams, it would deduce mathematically that there was a high probability that this message was itself spam. Pretty cool.
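To get a feel for the idea, here is a minimal sketch of a Bayesian filter in Python. It is a toy naive Bayes classifier, not Spambayes's actual algorithm (Spambayes has its own tokenizer and combines token evidence differently). The class name `TinyBayesFilter`, the `tokenize` helper, and the four training messages are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyBayesFilter:
    """Toy naive Bayes filter (hypothetical; not Spambayes's real method)."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each token at most once per message.
        self.token_counts[label].update(set(tokenize(text)))
        self.message_counts[label] += 1

    def token_likelihood(self, token, label):
        # Add-one smoothing keeps unseen tokens away from 0 and 1.
        seen = self.token_counts[label][token]
        return (seen + 1) / (self.message_counts[label] + 2)

    def spam_probability(self, text):
        # Start from the smoothed prior odds of spam vs. ham, then add
        # each distinct token's log-likelihood ratio.
        log_odds = math.log(
            (self.message_counts["spam"] + 1) / (self.message_counts["ham"] + 1)
        )
        for token in set(tokenize(text)):
            log_odds += math.log(
                self.token_likelihood(token, "spam")
                / self.token_likelihood(token, "ham")
            )
        return 1 / (1 + math.exp(-log_odds))


# Made-up training data, loosely based on the tokens mentioned in this post.
spam_filter = TinyBayesFilter()
spam_filter.train("cheap v1agra all night", "spam")
spam_filter.train("v1agra pleasure her tonight", "spam")
spam_filter.train("the DataGrid template binds to a dataset", "ham")
spam_filter.train("dim x as function result", "ham")

print(spam_filter.spam_probability("v1agra all night"))
print(spam_filter.spam_probability("how do I bind a DataGrid to a dataset"))
```

The first message should score close to 1 and the second close to 0. With only four training messages the numbers mean little, but the mechanics are the same: the more often a token has shown up in spam relative to ham, the harder it pushes the score toward spam.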

Spambayes allows you to view the "score" for each email - namely, the probability that the email is spam. It shows how the email was tokenized and how often each token appears in spam vs. in ham. Can you guess what words correlate highly to ham for me? Since I receive a lot of email from listservs, where people post code, an email full of code syntax would be marked as ham. For example, the token 'DataGrid' has appeared in over 50 ham messages, but no spam messages, so there's a strong correlation between that token and ham. The token 'dim' has appeared in over 180 hams, and only one spam; 'dynamic' in 99 hams, 3 spams; 'template' in 63 hams, no spams; 'dataset' in 61 hams, 0 spams; 'ddl' in 392 hams and 13 spams; 'function' in 160 hams and not a single spam.
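As a rough illustration of how hammy these tokens are, the sketch below turns the counts above into a smoothed "spam share" for each token. This is not the formula Spambayes uses; it's a simple add-one estimate, and it only makes sense as a spam probability if my training set holds roughly equal numbers of spams and hams, which is an assumption. "Over 50" and "over 180" are entered as 50 and 180, so those two are lower bounds. The `spam_share` function is my own name, not part of Spambayes.

```python
# (ham_count, spam_count) for each token, taken from the figures above.
token_counts = {
    "datagrid": (50, 0),
    "dim": (180, 1),
    "dynamic": (99, 3),
    "template": (63, 0),
    "dataset": (61, 0),
    "ddl": (392, 13),
    "function": (160, 0),
}


def spam_share(ham_count, spam_count):
    """Smoothed fraction of a token's appearances that were in spam."""
    return (spam_count + 1) / (ham_count + spam_count + 2)


# Print the tokens from most hammy to least hammy.
for token, (ham, spam) in sorted(
    token_counts.items(), key=lambda item: spam_share(*item[1])
):
    share = spam_share(ham, spam)
    print(f"{token:10s} ham={ham:4d} spam={spam:3d} spam share={share:.3f}")
```

Every one of these tokens ends up with a spam share under 5%, which is why an email full of code sails through as ham.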

If you use Spambayes, your hammy tokens would, of course, be different from mine, since it adapts to the emails you receive. So now spammers know how to break my filters - flood their spams with words like 'function', 'datagrid', 'dataset', 'dim', and so on! :-)

posted on Tuesday, January 27, 2004 2:52 PM

Feedback

# Bayesian Statistics and Spam 4/22/2004 12:40 PM Keith 'StarPilot' Barrows (Tech Blog)

# Bayesian Statistics and Spam 4/22/2004 1:06 PM just Keith Barrows

# re: Giving SpamBayes a Try 5/11/2004 12:04 PM Scott Mitchell

For the record, I use SpamBayes regularly now... and am loving it! :-)

# re: Giving SpamBayes a Try 6/18/2004 3:30 PM Fred Martin

The thing that will kill Bayesian filtering, though, is that I am getting an increasing amount of spam with paragraphs of words (not sentences) at the bottom of the email. The words are benign and relate to anything -except- viagra and the like, so the bulk of the incoming message is more ham-like than spam-like. There's an obvious attempt by spammers to get a high ham-like Bayesian score.

# A Neat Idea, but a Poor Implementation 10/20/2005 8:57 AM Scott on Writing

# Some Spam for Thought 4/17/2006 5:31 PM Scott On Life

