A Training Script for OSBF-Lua

Idea

The OSBF-Lua filter written by Fidelis Assis is an amazing spam filter, which is, however, somewhat hard to train. The filter requires feedback when it has misclassified an email or when it is uncertain about classification ("reinforcement training"). The "normal" way to train is to reply to yourself with a re-classification command in the subject line, which is a somewhat complicated and error-prone process.

This program makes the process easier by scanning two Mbox folders for messages to train as spam and as ham, respectively. (It should also work for Maildir folders, but I haven't tested this.) The user moves or copies messages that have been misclassified or require reinforcement into the appropriate folder, and a cron entry runs this program periodically to train the filter from all messages it finds in these folders.
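The scan-train-delete cycle described above can be sketched in a few lines. This is an illustration in Python using the standard `mailbox` module, not the actual script (which is written in Perl); the `train` callback is a hypothetical stand-in for whatever sends the training command to the filter.

```python
import mailbox

def train_folder(path, train):
    """Pass every message in the mbox at `path` to `train`, then delete it.

    `train` is a hypothetical callback; the real script would send a
    training command to OSBF-Lua here. Returns the number of messages seen.
    """
    box = mailbox.mbox(path)
    box.lock()  # keep the mail client from writing while we modify the file
    try:
        count = 0
        for key in list(box.keys()):
            train(box[key])
            box.remove(key)  # the folder is emptied after training
            count += 1
        return count
    finally:
        box.close()  # writes the deletions to disk and releases the lock
```

Calling `train_folder` once for TrainAsHam and once for TrainAsSpam, each with its own callback, would mirror what the cron job does on every run.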

This program is loosely based on training scripts written by John Johnston and Michael J. Chudobiak for Bill Yerazunis' CRM114. This version was created by Christian Siefkes.

Installation

Installation is a bit complicated, I'm sorry to say. But it might be worth the trouble – after installation, everything will be very easy! :-)

Install the nice little scripting language Lua if it is not yet present on your system (see the INSTALL file in the tarball). If you don't want to install to /usr/local/ (e.g. if you can't get root access), you have to edit the Makefile accordingly.
Install OSBF-Lua, following the instructions in the "Installation" section. If you didn't install Lua in /usr/local/, you have to edit the "config" file accordingly and set the LUA_CPATH environment variable as described; you also need to correct the "#!" interpreter line in the Lua scripts. I used $HOME/.osbf-lua/ as local directory instead of $HOME/osbf-lua/ so as not to clutter my home directory.
Fidelis recommends setting the initial osbf.cfg_threshold to 20 in the config file, since the interval [-20, +20] is best for reinforcement training. You can reduce the threshold back to 10 after the databases are well trained. (Personally, I used 10 since I didn't know this recommendation when I installed the filter.)
Create two mail folders: TrainAsHam, TrainAsSpam. You can also create a third folder SupposedSpam for the procmail recipe below.
I added two additional recipes to my .procmailrc, after the recipe invoking the spam filter (remove initial spaces when copying and be careful not to introduce additional line breaks). Neither of them is strictly necessary, but you might find them useful too. The first recipe writes the OSBF-Lua score + label to a log file ("classify.log" in the local OSBF directory). The second recipe delivers those spam messages where the filter is confident about the classification to a special folder (SupposedSpam) instead of to my INBOX.
```
 # Write OSBF-Lua label + score to classify.log file (for statistics):
 :0Ach:
 | perl -ne 'if (s/^X-OSBF-Lua-Score:\s*([^\]]*\]).*/\1/i) {print localtime(time()) . ": $_"};' >> $OSBF_LUA_USER_DIR/classify.log


 # Move mails tagged by the spam filter as probably spam into a special folder
 :0AD:
 * ^X-OSBF-Lua-Score:.*\[S\]
 $MAILDIR/SupposedSpam
```
After each training operation, OSBF-Lua sends you either a new, correctly classified copy of the trained message or a report on the result of the training, depending on how you configure it. If you want neither, you can comment out the osbf.cfg_output = "message" line in the spamfilter_config.lua file (to get the training report) and then add the following procmail recipe after the other ones. The recipe identifies all reports on successful trainings and deletes them, appending only the subject line to a log file (train-results.log in the local OSBF directory). Note that this recipe will destroy all matching messages, so be careful if you want to use it!
```
 # Delete OSBF-Lua training reports (unless there was an error), logging
 # only training result and original subject line in train-results.log.
 # (These messages have a "X-OSBF-Lua-Version" header instead of
 # "X-OSBF-Lua-Score", Subject starts with "Train" unless there was an error.)
 # Note that this recipe will DESTROY all matching messages so be careful!
 :0A:
 * ^X-OSBF-Lua-Version:
 * !^X-OSBF-Lua-Score:
 * ^Subject: Train
 | perl -e 'print "" . localtime(time()); while (<>) { print ": $1" if /^Subject:\s*(.*)/}; print "\n";' >> $OSBF_LUA_USER_DIR/train-results.log
```
Install the Mail::Box module if it's not present in your Perl installation.
Download the script and rename it to trainspamfilter.pl.
Edit the Configuration section of the script to provide your e-mail address and the password you specified for controlling OSBF-Lua, and to correct other settings (such as the maildir) where necessary.
Make the script executable and move it into the directory where you installed spamfilter.lua (or any other suitable location).
Add a new entry to your crontab ("crontab -e") to execute the script every ten minutes:
```
 */10 * * * * <PATH-TO-SCRIPT>/trainspamfilter.pl
```
Congratulations, the spam filter should now be working and you are ready to train it as described below!

Clean-up script

OSBF-Lua logs all incoming messages in its local directory. Messages need to be available for a few days for training, but there is no need to store them there forever. I have written a small shell script that can be used to clean up this directory from time to time. If you want to use this script, download and install it just like the other one. The script assumes $HOME/.osbf-lua/ as your local OSBF-Lua directory; if you use a different one, you'll have to adapt it.

To execute this script once a month, add this entry or a similar one to your crontab:

 0 3 15 * * <PATH-TO-SCRIPT>/spamfilter-cleanup.sh

(This will run the script at 3 am on the 15th of each month.)
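For readers who want to see what such a clean-up amounts to, here is a minimal Python sketch of the idea (the actual script is a shell script and may work differently). The 30-day retention period is an assumption, and so is the idea that the logged messages sit in a directory of their own.

```python
import os
import time

def remove_old_files(directory, max_age_days=30):
    """Delete regular files in `directory` that were last modified more
    than `max_age_days` days ago. Returns the names of the removed files."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for entry in os.scandir(directory):
        # Skip subdirectories and symlinks; only plain files are candidates.
        if entry.is_file(follow_symlinks=False) and entry.stat().st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Point a function like this only at the directory that holds the logged messages. It removes every sufficiently old file it finds, so running it on a directory that also contains the filter's databases or configuration would delete those as well.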

Tips and recommendations

You should disable any other spam filter (e.g. the simple Bayes filter built into Thunderbird – open "Tools / Junk Mail Controls / Adaptive Filter tab" to disable it).
If you configure your mail client to sort your INBOX alphabetically by subject, messages in the reinforcement zone will be grouped together for easier training (marked with "[-]" if probably spam, marked with "[+]" if probably ham). If you happen to use Thunderbird, select "Sort by: Subject" from the "View" menu.

Usage

For regular usage, 3 kinds of user actions are required:

Move any missed spam messages (that end up in your INBOX) into the TrainAsSpam folder.
If the classifier is not quite sure of its decision, it marks messages by prepending a "[+]" (probably ham) or "[-]" (probably spam). These messages are in the "reinforcement zone" and should be trained in any case. Move them into the TrainAsSpam folder if they are spam; copy them into the TrainAsHam folder if they aren't. Note that the training script deletes all messages it finds in these folders (after training), so you shouldn't move ham messages into the TrainAsHam folder (unless you don't need them any more). If you happen to use Thunderbird, messages will be copied instead of moved if you press the Ctrl key while dragging them.
You should check the SupposedSpam folder regularly (once a week or so if the filter is running fine, but once a day for the first week(s) after installation) and delete all the collected spam. If you happen to find any misclassified ham messages there (this can happen, especially in the first days), copy them to the TrainAsHam folder. Of course, you must also move them to a suitable storage folder or to your INBOX if you want to preserve them.

Experience Report

I enabled the filter for the first time on 2006-08-17. For initial training I used a tiny, tiny "training set" of 4 ham and 5 spam mails (the 5th of which was already reported as "Training not necessary"). Actually, these were meant less for serious training than for testing that my filter setup actually worked.

At the same time, I turned off my old spam filters (a chain of SpamAssassin and the Naive Bayes filter built into Thunderbird). Since I had hardly trained the new filter and I routinely get 1000 or more spam mails per week (people keep telling me I should be happy not to get more!), I expected a huge flood of spam to hit my INBOX. Amazingly, that was not what happened. Instead, the filter, trained from just 8 messages, happily started filing away most of the spam that came in. During the first day of usage, I only had to (reinforcement) train 1 spam and 4 ham messages, while the filter delivered 98 messages as "certainly spam" into the SupposedSpam folder and 23 messages as "certainly ham" into my INBOX.

Of course, in the following days, some more training was necessary – but far less than I would have expected.

In the first month of usage, the filter classified 5985 messages:

Classified as	Spam	Ham
Certain	4472	1183
Uncertain	198	132
Sum	4670	1315

In 330 cases the classifier was uncertain of the classification, i.e. I had to check and reinforce the decision.
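Counts of this kind could be tallied from the classify.log file written by the first procmail recipe in the Installation section. The following Python sketch assumes only that each log line ends with the bracketed label, which is what the recipe's regular expression implies. The sample lines, including the score values, are invented for illustration.

```python
import re
from collections import Counter

# Matches the bracketed label that ends a log line, e.g. "[S]" or "[+]".
LABEL = re.compile(r"\[([^\]]*)\]\s*$")

def tally_labels(lines):
    """Count how often each classification label occurs in the log."""
    counts = Counter()
    for line in lines:
        match = LABEL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Invented sample lines in the shape the recipe would produce.
sample = [
    "Thu Aug 17 10:02:11 2006: -31.50 [S]",
    "Thu Aug 17 10:05:40 2006: -4.20 [-]",
    "Thu Aug 17 10:09:03 2006: 6.10 [+]",
    "Thu Aug 17 10:12:27 2006: -28.70 [S]",
]
print(tally_labels(sample))
```

To run it on real data, pass an open file handle for classify.log instead of `sample`.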

Obviously, some of the classifications turned out to be wrong:

Misclassified as	Spam	Ham
Certain	4	6
Uncertain	24	17
Sum	28	23

Most of the misclassified messages had been in the reinforcement zone, where the spam filter had been uncertain about its decision. The number that's slightly worrisome here is that 4 ham messages had been classified as certain spam, i.e. they ended up in the SupposedSpam folder. They all occurred in the first 1-2 weeks when the spam filter hadn't had much training and they were all borderline/problematic cases (a forwarded HTML mail with a link and very little other text; a digest from a list I hadn't read before; a commercial inquiry from a company I had been in contact with before; a very short mail from a list which had previously been abused as a spam relay).

After the first two weeks, I didn't have any such problems – but it really shows that you should check your spam folder regularly, at least until the classifier is properly trained! (Note that I used a small reinforcement threshold of 10 from the very start. Fidelis now recommends starting with a larger threshold of 20 and only reducing the threshold to 10 after the filter is well-trained. This would have meant more initial training effort, but it might have avoided such false positives.)

To judge the spam filtering performance, we can calculate the frequently used metrics precision (percentage of supposed spam which actually was spam) and recall (percentage of spam that was caught). For this we need the counts:

True positives	(tp, correctly identified spam)	4642
False positives	(fp, ham misclassified as spam)	28
False negatives	(fn, spam misclassified as ham)	23

Which gets us:

Precision	99.4%	(tp / (tp+fp))
Recall	99.5%	(tp / (tp+fn))

Since 51 of 5985 messages had been misclassified, the overall accuracy is 99.1%.
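The arithmetic can be checked with a few lines of Python, using only the counts from the tables above:

```python
# Counts taken from the tables above.
classified_spam = 4670   # messages the filter labelled as spam
classified_ham = 1315    # messages the filter labelled as ham
fp = 28                  # ham misclassified as spam
fn = 23                  # spam misclassified as ham

tp = classified_spam - fp              # correctly identified spam
total = classified_spam + classified_ham

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (total - fp - fn) / total

print(f"precision {precision:.1%}, recall {recall:.1%}, accuracy {accuracy:.1%}")
# prints: precision 99.4%, recall 99.5%, accuracy 99.1%
```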

Considering that the filter had only been pretrained with 9 (actually, 8) messages, I feel it has done quite well :-)

To Do

The Mail::Box module chokes whenever it encounters a malformed mail header. Hence we shouldn't use it to extract SFIDs since we should not expect all spammers to comply with the RFCs ;-)
Currently the script first trains all collected ham mails and then all collected spam mails. It would be better to train spam and ham mails in a mixed order, or in the order they arrived. Also, the script should be modified to just send a single message with the new "batch_train" command introduced in OSBF-Lua v2.0.2, instead of generating a separate command message for each mail it trains.
The training script could be rewritten in Lua and integrated more tightly with the spamfilter, instead of just triggering it remotely.
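As an illustration of the "mixed order" idea, one simple option is to alternate between the two folders. This Python sketch is only one possible reading of that item, not something the current script does; the function name is made up.

```python
from itertools import chain, zip_longest

def interleave(ham, spam):
    """Alternate ham and spam items; leftovers of the longer list follow."""
    skip = object()  # sentinel marking the gaps left by the shorter list
    pairs = zip_longest(ham, spam, fillvalue=skip)
    return [item for item in chain.from_iterable(pairs) if item is not skip]

print(interleave(["h1", "h2", "h3"], ["s1"]))
# prints: ['h1', 's1', 'h2', 'h3']
```

Training in arrival order would instead require sorting the combined messages by their delivery date.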

[Last generated: 2024-09-21]