A Training Script for OSBF-Lua

Idea

The OSBF-Lua filter written by Fidelis Assis is an amazing spam filter, which is, however, somewhat hard to train. The filter requires feedback when it has misclassified an email or when it is uncertain about classification ("reinforcement training"). The "normal" way to train is to reply to yourself with a re-classification command in the subject line, which is a somewhat complicated and error-prone process.

This program makes the process easier by scanning two Mbox folders for messages to train as spam and as ham, respectively. (It should also work for Maildir folders, but I haven't tested this.) The user moves or copies messages that have been misclassified or require reinforcement into the appropriate folder, and a cron entry runs this program periodically to train the filter on all messages it finds in these folders.
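The scanning step can be illustrated with a minimal Python sketch. This is not the actual training script: the function name train_from_folders and the train callback are hypothetical, and the callback merely stands in for whatever command hands a message to OSBF-Lua. Only the standard-library mailbox module is used.

```python
import mailbox


def train_from_folders(spam_path, ham_path, train):
    """Pass every message in two mbox files to the `train` callback.

    `train(raw_message, label)` is a hypothetical hook; a real setup would
    forward the message to the filter's training command there.
    """
    counts = {"spam": 0, "ham": 0}
    for label, path in (("spam", spam_path), ("ham", ham_path)):
        # create=False: fail loudly instead of silently creating an empty folder
        box = mailbox.mbox(path, create=False)
        try:
            for key in box.keys():
                train(box.get_bytes(key), label)
                counts[label] += 1
        finally:
            box.close()
    return counts
```

A real script would also have to make sure that a message is not trained twice on the next cron run, for example by emptying the folders after a successful pass; the sketch leaves that out.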

This program is loosely based on training scripts written by John Johnston and Michael J. Chudobiak for Bill Yerazunis' CRM114. This version was created by Christian Siefkes.

Installation

Installation is a bit complicated, I'm sorry to say. But it might be worth the trouble – after installation, everything will be very easy! :-)

Clean-up script

OSBF-Lua logs all incoming messages in its local directory. Messages need to be available for a few days for training, but there is no need to store them there forever. I have written a small shell script that can be used to clean up this directory from time to time. If you want to use this script, download and install it just like the other one. The script assumes $HOME/.osbf-lua/ as your local OSBF-Lua directory; if you use a different one, you'll have to adapt it.

To execute this script once a month, add this entry or a similar one to your crontab:

 0 3 15 * * <PATH-TO-SCRIPT>/spamfilter-cleanup.sh

(This will run the script at 3 am on the 15th of each month.)
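The clean-up script itself is a shell script and is not reproduced here. As a rough idea of what such an age-based clean-up does, here is a small Python sketch. The function name remove_old_files, the directory argument and the retention period are all assumptions; the sketch does not know which subdirectory OSBF-Lua actually logs to.

```python
import os
import time


def remove_old_files(directory, max_age_days, now=None):
    """Delete regular files in `directory` not modified for `max_age_days` days.

    Returns the sorted names of the removed files. `now` can be passed in
    to make the cut-off reproducible; by default the current time is used.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot first, then delete
    removed = []
    for entry in entries:
        if not entry.is_file(follow_symlinks=False):
            continue  # leave subdirectories and symlinks alone
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Point it only at the directory that holds the logged messages, never at a directory containing the filter's databases.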

Tips and recommendations

Usage

For regular usage, 3 kinds of user actions are required:

Experience Report

I enabled the filter for the first time on 2006-08-17. For initial training I used a tiny, tiny "training set" of 4 ham and 5 spam mails (the 5th of which was already reported as "Training not necessary"). Actually, these were meant less for serious training than for testing that my filter setup actually worked.

At the same time, I turned off my old spam filters (a chain of SpamAssassin and the Naive Bayes filter built into Thunderbird). Since I had hardly trained the new filter and I routinely get 1000 or more spam mails per week (people keep telling me I should be glad not to get more!), I expected a huge flood of spam to hit my INBOX. Amazingly, that was not what happened. Instead, the filter, trained from just 8 messages, happily started filing away most of the spam that came in. During the first day of usage, I only had to (reinforcement) train 1 spam and 4 ham messages, while the filter delivered 98 messages as "certainly spam" into the SupposedSpam folder and 23 messages as "certainly ham" into my INBOX.

Of course, in the following days, some more training was necessary – but far less than I would have expected.

In the first month of usage, the filter classified 5985 messages:

Classified as    Spam    Ham
Certain          4472   1183
Uncertain         198    132
Sum              4670   1315

In 330 cases the classifier was uncertain of the classification, i.e. I had to check and reinforce the decision.

Obviously, some of the classifications turned out to be wrong:

Misclassified as    Spam    Ham
Certain                4      6
Uncertain             24     17
Sum                   28     23

Most of the misclassified messages had been in the reinforcement zone, where the spam filter had been uncertain about its decision. The number that's slightly worrisome here is that 4 ham messages had been classified as certain spam, i.e. they ended up in the SupposedSpam folder. They all occurred in the first 1-2 weeks when the spam filter hadn't had much training and they were all borderline/problematic cases (a forwarded HTML mail with a link and very little other text; a digest from a list I hadn't read before; a commercial inquiry from a company I had been in contact with before; a very short mail from a list which had previously been abused as a spam relay).

After the first two weeks, I didn't have any such problems – but it really shows that you should check your spam folder regularly, at least until the classifier is properly trained! (Note that I used a small reinforcement threshold of 10 from the very start. Fidelis now recommends starting with a larger threshold of 20 and only reducing the threshold to 10 after the filter is well-trained. This would have meant more initial training effort, but it might have avoided such false positives.)
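To illustrate the trade-off between a threshold of 10 and one of 20, here is a small Python sketch. It rests on assumptions not stated above: that the filter yields a signed score, that negative scores mean spam and non-negative ones ham, and that a message counts as "uncertain" when the absolute score is below the threshold. The function classify is hypothetical and not part of OSBF-Lua.

```python
def classify(score, threshold=10):
    """Map a signed score to (label, certainty).

    Assumed convention, for illustration only: score < 0 means spam,
    score >= 0 means ham, and |score| < threshold is the reinforcement zone.
    """
    label = "ham" if score >= 0 else "spam"
    certainty = "uncertain" if abs(score) < threshold else "certain"
    return label, certainty


# The same borderline message under both thresholds:
print(classify(-15, threshold=10))  # ('spam', 'certain')
print(classify(-15, threshold=20))  # ('spam', 'uncertain')
```

With the larger threshold, the borderline message lands in the reinforcement zone and gets checked by hand instead of being filed silently as certain spam, at the price of more messages to review.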

To judge the spam filtering performance, we can calculate the frequently used metrics precision (percentage of supposed spam which actually was spam) and recall (percentage of spam that was caught). For this we need the counts:

True positives (tp, correctly identified spam): 4642
False positives (fp, ham misclassified as spam): 28
False negatives (fn, spam misclassified as ham): 23

Which gets us:

Precision: 99.4% (tp / (tp+fp))
Recall: 99.5% (tp / (tp+fn))

Since 51 of 5985 messages had been misclassified, the overall accuracy is 99.1%.
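The three figures can be rechecked from the counts given above with a few lines of Python. The function name spam_metrics is only for illustration; the formulas are the ones stated in the text.

```python
def spam_metrics(tp, fp, fn, total):
    """Return (precision, recall, accuracy) as fractions between 0 and 1."""
    precision = tp / (tp + fp)            # share of supposed spam that was spam
    recall = tp / (tp + fn)               # share of actual spam that was caught
    accuracy = (total - fp - fn) / total  # share of all messages classified correctly
    return precision, recall, accuracy


precision, recall, accuracy = spam_metrics(tp=4642, fp=28, fn=23, total=5985)
print(f"precision {precision:.1%}, recall {recall:.1%}, accuracy {accuracy:.1%}")
# prints: precision 99.4%, recall 99.5%, accuracy 99.1%
```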

Considering that the filter had only been pretrained with 9 (actually, 8) messages, I feel it has done quite well :-)

To Do


[Last generated: 2024-09-21]