Wednesday, May 25, 2005

Fighting SPAM: Bayesian Filtering

SPAM has been the biggest nuisance for the Internet community of late. An estimated 74 percent of the email transferred over the Internet in 2004 was identified as junk. Governments and organizations around the globe have been busy estimating the losses caused by SPAM and figuring out ways to counter it. On the technology front, researchers and software companies have been looking for ways to identify and isolate or eliminate SPAM. Over time this has turned into a never-ending tussle between the spammers and the anti-spam software developers.

First came simple software that filtered out mails based on a word list. The administrator would prepare a list of words that would cause an email to be considered SPAM. This worked for some time, but not for long. Spammers soon found ways to dodge these filters. They would write v1agra instead of viagra and p3n1s instead of penis to get past the filter. Clearly, the technique could not be very successful: word-list filters are easy to dodge, and a word list that is too restrictive causes false positives, resulting in the loss of legitimate mails.

Another technology that has been widely used is based on authentication. Such software works on a simple challenge-response method: the sender receives an automatic message asking him/her to reply before the original mail is delivered. While this method practically guarantees protection from SPAM, it is not feasible to use beyond a limited scale, which rules it out as a general solution. There have been other methods to identify SPAM, and most of them work on some kind of "whitelist" or "blacklist". All such methods require continuous refinement of the list and are not very accurate.

Recently, more and more email products have been adopting a technology called Bayesian filtering to fight SPAM. This technology is based on the work of Paul Graham [http://www.paulgraham.com/].
During the summer of 2002, Graham stumbled upon the idea of using Bayes' Theorem (the work of the 18th-century English mathematician Thomas Bayes, www.fsu.edu/~geog/elsner/bayesian/post/Bellhouse2003.pdf) to eliminate SPAM. The rest, as they say, is history. Bayesian filters are being incorporated into more and more email products because of their accuracy. What sets Bayesian filters apart from the rest is their ability to learn by themselves: the filter is trained on mails already marked as SPAM or legitimate, and it keeps adjusting its word statistics as new mail arrives. According to Paul Graham, a "well taught" Bayesian filter can be as accurate as 99.5 percent, without any false positives. Some of the email products that use Bayesian filtering to date:
  • Thunderbird
  • Mozilla
  • GFI MailEssentials
  • TrustedMail
  • SpamBully
  • SpamAssassin
  • SPAM Shredder
  • PlexMailer
  • Safe Express
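
To make the "learning" part more concrete, here is a minimal sketch in Python of how a word-based Bayesian filter could be trained and queried. It is only an illustration under simplifying assumptions: the class name BayesianFilter, its methods and the sample messages are hypothetical, the tokenizer is deliberately crude, and the scoring is a plain naive Bayes calculation rather than Paul Graham's exact formula or the code of any product listed above.

```python
import math
import re
from collections import Counter


class BayesianFilter:
    """Toy word-based Bayesian spam filter (illustrative only)."""

    def __init__(self):
        # For each class: how many training messages contained each word.
        self.counts = {"spam": Counter(), "ham": Counter()}
        # For each class: how many training messages were seen.
        self.totals = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase alphanumeric tokens; repeats within one message count once.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def train(self, text, label):
        # label must be "spam" or "ham" (legitimate mail).
        self.totals[label] += 1
        self.counts[label].update(self.tokenize(text))

    def _word_prob(self, word, label):
        # P(word | label) with add-one smoothing, so unseen words never give 0.
        return (self.counts[label][word] + 1) / (self.totals[label] + 2)

    def spam_probability(self, text):
        # Bayes' theorem in log space: class prior times per-word likelihoods,
        # treating words as independent given the class (the "naive" part).
        n = self.totals["spam"] + self.totals["ham"]
        log_spam = math.log((self.totals["spam"] + 1) / (n + 2))
        log_ham = math.log((self.totals["ham"] + 1) / (n + 2))
        for word in self.tokenize(text):
            log_spam += math.log(self._word_prob(word, "spam"))
            log_ham += math.log(self._word_prob(word, "ham"))
        # P(spam | words) = 1 / (1 + P(ham, words) / P(spam, words));
        # the exponent is capped to avoid overflow on very long messages.
        return 1.0 / (1.0 + math.exp(min(log_ham - log_spam, 700.0)))


if __name__ == "__main__":
    f = BayesianFilter()
    f.train("cheap v1agra buy now", "spam")
    f.train("buy cheap pills now", "spam")
    f.train("meeting notes for the project", "ham")
    f.train("lunch at noon tomorrow", "ham")
    print(f.spam_probability("buy v1agra now"))            # scores high
    print(f.spam_probability("project meeting tomorrow"))  # scores low
```

Note that a disguised word such as v1agra needs no special handling here. Once it has appeared in a few mails marked as SPAM, it simply becomes one more token with a high spam score, which is why this approach is harder to dodge than a fixed word list.
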
Some interesting links:
