B8 Spam Filter
B8 Spam Filter Overview
Maian Support uses the open source B8 Spam Filter, created by Tobias Leupold.
The information here was provided by Tobias to help you better understand the spam filter, so please read it carefully. If you don't understand the spam filter, the default settings (as recommended by Tobias) will probably be fine.
An overview of the spam filter options can be found here.
What is B8?
B8 is a spam filter implemented in PHP. It is intended to keep your weblog or guestbook spam-free. The filter can be used anywhere in your PHP code and tells you whether a text is spam or not, using statistical text analysis. You give B8 a text and it returns a value between 0 and 1: a value near 0 means ham, a value near 1 means spam. More details follow in the next section.
To be able to do this, B8 first has to learn some spam and some ham (non-spam) texts. If it makes mistakes when classifying unknown texts or the result is not distinct enough, B8 can be told what the text actually is, getting better with each learned text.
B8 is a statistical spam filter. I'm not a mathematician, but as far as I can grasp it, the math used in B8 does not have much to do with Bayes' theorem itself. So I call it a statistical spam filter, not a Bayesian one. In principle, it's a program like Bogofilter or SpamBayes, but it is not intended to classify emails. Therefore, the way B8 works is slightly different from email spam filters.
An example of what we're talking about here:
At the time of writing (November 2012), B8 has, since December 2006, classified 26869 guestbook entries and weblog comments on my homepage. 145 were ham. 76 spam texts (0.28 %) were falsely rated as ham (false negatives) and I had to remove them manually. Only a single ham message was falsely classified as spam (a false positive), back in June 2010, but, in defense of B8, this was the very first English ham text I got. Previously, every one of the 15024 English texts posted had been spam. Texts with Chinese, Japanese or Cyrillic content (all of them spam as well) did not appear until 2011.
This results in a sensitivity of 99.72 % (the probability that a spam text will actually be rated as spam) and a specificity of 99.31 % (the probability that a ham text will actually be rated as ham) for my homepage. Before the one false positive, of course, the specificity was 100 %.
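The two percentages follow directly from the counts quoted above. As a quick check, the arithmetic can be reproduced in a few lines of Python (used here purely for illustration, since B8 itself is PHP; the variable names are made up for this sketch):

```python
# Counts quoted in the text above
total_texts = 26869
ham_texts = 145
false_negatives = 76  # spam texts that were rated as ham
false_positives = 1   # ham texts that were rated as spam

# Everything that was not ham was spam
spam_texts = total_texts - ham_texts

# Sensitivity: share of spam texts that were actually rated as spam
sensitivity = (spam_texts - false_negatives) / spam_texts

# Specificity: share of ham texts that were actually rated as ham
specificity = (ham_texts - false_positives) / ham_texts

print(f"sensitivity: {sensitivity:.2%}")
print(f"specificity: {specificity:.2%}")
```

The false-negative rate of 0.28 % is 76 divided by the number of spam texts (26724), not by the total number of texts.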
Text: Tobias Leupold
How does it work?
In principle, B8 uses the math and technique described in Gary Robinson's articles "A Statistical Approach to the Spam Problem" and "Spam Detection". The "degeneration" method Paul Graham proposed in "Better Bayesian Filtering" has also been implemented.
B8 splits the text to be classified into pieces, extracting things such as email addresses, links and HTML tags and, of course, normal words. For each such token, it calculates a single probability that a text containing it is spam, based on what the filter has learned so far. When the token has not been seen before, B8 tries to find similar ones using "degeneration" and uses the most relevant value found. If nothing at all is found, B8 assumes a default rating for this token in the further calculations.
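To make the per-token step more concrete, here is a minimal Python sketch (for illustration only; B8 itself is PHP and this is not its code). It assumes the smoothed rating from Gary Robinson's article, f(w) = (s·x + n·p(w)) / (s + n), where x is the default rating for unknown tokens and s is how strongly that default is trusted. The function names, the `counts` layout, the values `s=0.3` and `x=0.5`, and the very simple degeneration rules (case changes and stripped trailing punctuation) are all assumptions made for this example.

```python
def smoothed_rating(spam_count, ham_count, total_spam, total_ham, s=0.3, x=0.5):
    """Robinson-style rating that a text containing the token is spam."""
    n = spam_count + ham_count
    spam_ratio = spam_count / total_spam if total_spam else 0.0
    ham_ratio = ham_count / total_ham if total_ham else 0.0
    if n == 0 or spam_ratio + ham_ratio == 0:
        return x  # nothing known about this token: use the default rating
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw ratio towards x when the token has rarely been seen
    return (s * x + n * p) / (s + n)


def degenerate(token):
    """Return simpler variants of a token (hypothetical, simplified rules)."""
    base = token.rstrip("!?.")
    variants = []
    for form in (token, base):
        for variant in (form.lower(), form.upper(), form.capitalize()):
            if variant and variant != token and variant not in variants:
                variants.append(variant)
    return variants


def rate_token(token, counts, total_spam, total_ham, x=0.5):
    """Rate a token, falling back to degenerated variants, then to x."""
    if token in counts:
        return smoothed_rating(*counts[token], total_spam, total_ham, x=x)
    candidates = [
        smoothed_rating(*counts[variant], total_spam, total_ham, x=x)
        for variant in degenerate(token)
        if variant in counts
    ]
    if candidates:
        # "Most relevant" is taken to mean farthest away from 0.5
        return max(candidates, key=lambda rating: abs(rating - 0.5))
    return x


# token -> (times seen in spam, times seen in ham)
counts = {"free": (20, 0), "hello": (1, 9)}
print(rate_token("FREE!!!", counts, total_spam=20, total_ham=20))
print(rate_token("hello", counts, total_spam=20, total_ham=20))
print(rate_token("zebra", counts, total_spam=20, total_ham=20))
```

In this example "FREE!!!" has never been seen, but its degenerated form "free" has, so it receives a high rating; "zebra" is unknown in every form and gets the default.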
Then, B8 takes the most relevant values (those with a rating far from 0.5; a rating of 0.5 would mean we don't know what the token is) and calculates the combined probability that the whole text is spam.
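The combination step can be sketched in the same way. The Python below (illustrative only, not B8's PHP code) assumes the geometric-mean combination described in Robinson's article: P = 1 − (∏(1 − f))^(1/n), Q = 1 − (∏ f)^(1/n), S = (P − Q) / (P + Q), with the final score (1 + S) / 2. The parameter names `max_tokens` and `min_deviation` and their default values are assumptions for this sketch.

```python
import math


def combined_spam_probability(ratings, max_tokens=15, min_deviation=0.2):
    """Combine per-token ratings into one score between 0 (ham) and 1 (spam)."""
    # Keep only ratings that are far enough from the "don't know" value 0.5,
    # most relevant first, and use at most max_tokens of them
    relevant = sorted(
        (r for r in ratings if abs(r - 0.5) >= min_deviation),
        key=lambda r: abs(r - 0.5),
        reverse=True,
    )[:max_tokens]
    if not relevant:
        return 0.5  # nothing informative: the text stays undecided

    n = len(relevant)
    # P grows when the ratings are high (spam-like)
    p = 1.0 - math.prod(1.0 - r for r in relevant) ** (1.0 / n)
    # Q grows when the ratings are low (ham-like)
    q = 1.0 - math.prod(relevant) ** (1.0 / n)
    s = (p - q) / (p + q)
    return (1.0 + s) / 2.0


print(combined_spam_probability([0.99, 0.95, 0.5, 0.52]))  # mostly spam-like tokens
print(combined_spam_probability([0.02, 0.1, 0.45]))        # mostly ham-like tokens
print(combined_spam_probability([0.5, 0.55]))              # nothing relevant
```

Tokens rated close to 0.5 are dropped before combining, so a long text full of neutral words does not dilute a few strong indicators.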
Text: Tobias Leupold
Learning Filters & Getting Started in Maian Support
Before B8 can decide whether a text is spam or ham, you have to tell it what you consider spam or ham. At least one learned spam text or one learned ham text is needed to calculate anything. With nothing learned, B8 will rate everything at your "Spam Score Deviation" value (or whatever "Gary Robinsons X Constant" has been set to). To get good ratings, you need both learned ham and learned spam texts; the more, the better.
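The effect of learning can be illustrated with a toy model in Python (illustration only; this is neither B8's PHP code nor Maian Support's). The class `TinyLearner`, its deliberately crude rating (the average spam share of the known tokens) and the default rating of 0.5 are assumptions for this sketch. It only shows why an untrained filter returns the same score for everything.

```python
import re
from collections import Counter


class TinyLearner:
    """Toy filter that counts in how many spam and ham texts each token occurred."""

    def __init__(self, default_rating=0.5):
        self.default_rating = default_rating
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.learned_texts = 0

    @staticmethod
    def tokenize(text):
        # Very rough tokenizer: lower-case words and numbers only
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, is_spam):
        target = self.spam_tokens if is_spam else self.ham_tokens
        target.update(self.tokenize(text))
        self.learned_texts += 1

    def rate(self, text):
        if self.learned_texts == 0:
            return self.default_rating  # nothing learned: always the default
        shares = []
        for token in self.tokenize(text):
            spam, ham = self.spam_tokens[token], self.ham_tokens[token]
            if spam + ham:
                shares.append(spam / (spam + ham))
        if not shares:
            return self.default_rating  # only unknown tokens
        return sum(shares) / len(shares)


learner = TinyLearner()
print(learner.rate("cheap pills online"))  # before anything has been learned

learner.learn("cheap pills online now", is_spam=True)
learner.learn("my login page shows an error", is_spam=False)
print(learner.rate("cheap pills"))
print(learner.rate("login error"))
```

Before the two `learn` calls every text gets the default rating; afterwards, texts made of known tokens move towards 0 or 1.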
Learning Options > Add to Learning Filters
To start classifying spam in Maian Support, enter some keywords or an email message body into the learning filter and process accordingly.
As mentioned above, the first time a message comes through it will always match your allowed score, because nothing has been learned yet. Once you start entering keywords or text, the filter will start to learn and will classify incoming spam better.
Alternatively, once you start accepting or rejecting tickets via spam tickets, the learning filters will start learning if they are enabled.
Skip Filters
This has been implemented in Maian Support, but is NOT a feature of B8. If any header (name, subject, message, etc.) matches a skip filter, the message is flagged and deleted. Use this with caution.
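Why caution is needed can be shown with a small Python sketch. How Maian Support actually matches skip filters is not described here, so the case-insensitive substring match, the function name and the field names below are all assumptions made for illustration.

```python
def matches_skip_filter(message, skip_terms):
    """Return True if any checked field contains any skip term (assumed rule)."""
    for field in ("name", "subject", "message"):
        value = message.get(field, "").lower()
        # Empty terms are ignored so they cannot match everything
        if any(term.lower() in value for term in skip_terms if term):
            return True
    return False


ticket = {
    "name": "Jane Customer",
    "subject": "Is the upgrade free?",
    "message": "I bought the software last year and would like to upgrade.",
}

print(matches_skip_filter(ticket, ["casino"]))  # narrow term
print(matches_skip_filter(ticket, ["free"]))    # broad term
```

Under this assumed matching rule, a broad term such as "free" would also catch the genuine ticket above, and because matched messages are deleted rather than scored, nothing would be learned from the mistake.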
Imap Logs
All operations of the IMAP filters are logged if logging is enabled in the settings (Settings > General > Imap Settings). If you find that something has been caught by the spam filters, view the logs for more information.