The OSBF-Lua filter written by Fidelis Assis is an amazing spam filter, which is, however, somewhat hard to train. The filter requires feedback when it has misclassified an email or when it is uncertain about classification ("reinforcement training"). The "normal" way to train is to reply to yourself with a re-classification command in the subject line, which is a somewhat complicated and error-prone process.
This program makes the process easier by scanning two Mbox folders for messages to train as spam and as ham, respectively. (It should also work for Maildir folders, but I haven't tested this.) The user moves or copies messages that have been misclassified or require reinforcement into the appropriate folder, and this program is run periodically by a cron entry to train the filter from all messages it finds in these folders.
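The folder scan described above can be sketched roughly as follows. This is a Python illustration using the standard `mailbox` module, not the actual Perl script: the function name `train_from_folders` and the `train` callback are hypothetical, the OSBF-Lua training command itself is deliberately left out, and removing each message once it has been handed to the callback is an assumption.

```python
import mailbox
from pathlib import Path
from typing import Callable


def train_from_folders(mail_dir: Path,
                       train: Callable[[str, mailbox.mboxMessage], None]) -> int:
    """Hand every message in TrainAsSpam / TrainAsHam to `train`.

    `train` is a hypothetical callback receiving ("spam" | "ham", message).
    Messages are removed from the folder after the callback returns
    (an assumption made for this sketch).
    """
    count = 0
    for folder, label in (("TrainAsSpam", "spam"), ("TrainAsHam", "ham")):
        path = mail_dir / folder
        if not path.exists():
            continue  # nothing to train for this label
        box = mailbox.mbox(str(path), create=False)
        box.lock()
        try:
            for key in box.keys():
                train(label, box[key])
                box.remove(key)
                count += 1
        finally:
            box.close()  # flushes the removals and releases the lock
    return count
```

A cron job would call something like `train_from_folders(Path.home() / "Mail", my_train_function)`, where both the mail directory and `my_train_function` depend on the local setup.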
This program is loosely based on training scripts written by John Johnston and Michael J. Chudobiak for Bill Yerazunis' CRM114. This version was created by Christian Siefkes.
Installation is a bit complicated, I'm sorry to say. But it might be worth the trouble – after installation, everything will be very easy! :-)
Install the nice little scripting language Lua if it is not yet present on your system (see the INSTALL file in the tarball). If you don't want to install to /usr/local/ (e.g. if you can't get root access), you have to edit the Makefile accordingly.
Install OSBF-Lua, following the instructions in the "Installation" section. If you didn't install Lua in /usr/local/, you have to edit the "config" file accordingly and set the LUA_CPATH environment variable as described; you also need to correct the "#!" interpreter line in the Lua scripts. I used $HOME/.osbf-lua/ as the local directory instead of $HOME/osbf-lua/ so as not to clutter my home directory.
Fidelis recommends setting the initial osbf.cfg_threshold to 20 in the config file, since the interval [-20, +20] is best for reinforcement training. You can reduce the threshold back to 10 after the databases are well trained. (Personally, I used 10 since I didn't know this recommendation when I installed the filter.)
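To make the effect of the threshold concrete, here is a small Python sketch of the reinforcement interval. The function name is hypothetical, and treating the interval boundaries as inclusive is an assumption based on the closed interval notation above; OSBF-Lua's own comparison may differ.

```python
def needs_reinforcement(score: float, threshold: float = 20.0) -> bool:
    """True if the score lies inside [-threshold, +threshold].

    Messages scoring inside this interval are the ones the filter is
    uncertain about, so they are candidates for reinforcement training.
    """
    return -threshold <= score <= threshold


# A score of 15 is uncertain with the recommended initial threshold of 20,
# but counts as a confident classification once the threshold is lowered to 10.
print(needs_reinforcement(15.0))                  # threshold 20
print(needs_reinforcement(15.0, threshold=10.0))  # threshold 10
```

A larger threshold widens the zone, which means more messages to check at first but fewer confident mistakes.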
Create two mail folders: TrainAsHam, TrainAsSpam. You can also create a third folder SupposedSpam for the procmail recipe below.
I added two additional recipes to my .procmailrc, after the recipe invoking the spam filter (remove initial spaces when copying and be careful not to introduce additional line breaks). Neither of them is strictly necessary, but you might find them useful too. The first recipe writes the OSBF-Lua score + label to a log file ("classify.log" in the local OSBF directory). The second recipe delivers those spam messages where the filter is confident about the classification to a special folder (SupposedSpam) instead of to my INBOX.
# Write OSBF-Lua label + score to classify.log file (for statistics):
:0Ach:
| perl -ne 'if (s/^X-OSBF-Lua-Score:\s*([^\]]*\]).*/\1/i) {print localtime(time()) . ": $_"};' >> $OSBF_LUA_USER_DIR/classify.log
# Move mails tagged by the spam filter as probably spam into a special folder
:0AD:
* ^X-OSBF-Lua-Score:.*\[S\]
$MAILDIR/SupposedSpam
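For readers less familiar with Perl and procmail regular expressions, the two patterns used in these recipes can be tried out in Python as sketched below. The function names are hypothetical, and the sample header value in the usage lines is made up for illustration; only the header name and the `[S]` label are taken from the recipes.

```python
import re
from typing import Optional

# Same pattern as the Perl one-liner: capture everything after the header
# name up to and including the first "]" (case-insensitive, like /i).
SCORE_HEADER = re.compile(r"^X-OSBF-Lua-Score:\s*([^\]]*\])", re.IGNORECASE)

# Same condition as the second recipe; the procmail "D" flag makes the
# match case-sensitive, so no IGNORECASE here.
CONFIDENT_SPAM = re.compile(r"^X-OSBF-Lua-Score:.*\[S\]")


def extract_score(header_line: str) -> Optional[str]:
    """Return the 'score [label]' part of a score header, or None."""
    match = SCORE_HEADER.match(header_line)
    return match.group(1) if match else None


def is_confident_spam(header_line: str) -> bool:
    """True if the header carries the [S] label."""
    return CONFIDENT_SPAM.match(header_line) is not None


# Hypothetical header value, for illustration only.
sample = "X-OSBF-Lua-Score: -31.52/0.00 [S] (extra text)"
print(extract_score(sample))
print(is_confident_spam(sample))
```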
After each training operation, OSBF-Lua sends you either a new, correctly classified copy of the trained message or a report on the result of the training, depending on how you configure it. If you want neither, you can comment out the osbf.cfg_output = "message" line in the spamfilter_config.lua file (to get the training report) and then add the following procmail recipe after the other ones. The recipe identifies all reports on successful trainings and deletes them, appending only the subject line to a log file (train-results.log in the local OSBF directory). Note that this recipe will destroy all matching messages, so be careful if you want to use it!
# Delete OSBF-Lua training reports (unless there was an error), logging
# only training result and original subject line in train-results.log.
# (These messages have a "X-OSBF-Lua-Version" header instead of
# "X-OSBF-Lua-Score", Subject starts with "Train" unless there was an error.)
# Note that this recipe will DESTROY all matching messages so be careful!
:0A:
* ^X-OSBF-Lua-Version:
* !^X-OSBF-Lua-Score:
* ^Subject: Train
| perl -e 'print "" . localtime(time()); while (<>) { print ": $1" if /^Subject:\s*(.*)/}; print "\n";' >> $OSBF_LUA_USER_DIR/train-results.log
Install the Mail::Box module if it's not present in your Perl installation.
Download the script and rename it to trainspamfilter.pl.
Edit the Configuration section of the script to provide your e-mail address and the password you specified for controlling OSBF-Lua, and to correct other settings (such as the maildir) where necessary.
Make the script executable and move it into the directory where you installed spamfilter.lua (or any other suitable location).
Add a new entry to your crontab ("crontab -e") to execute the script every ten minutes:
*/10 * * * * <PATH-TO-SCRIPT>/trainspamfilter.pl
Congratulations, the spam filter should now be working and you are ready to train it as described below!
OSBF-Lua logs all incoming messages in its local directory. Messages need to be available for a few days for training, but there is no need to store them there forever. I have written a small shell script that can be used to clean up this directory from time to time. If you want to use this script, download and install it just like the other one. The script assumes $HOME/.osbf-lua/ as your local OSBF-Lua directory; if you use a different one, you'll have to adapt it.
To execute this script once a month, add this entry or a similar one to your crontab:
0 3 15 * * <PATH-TO-SCRIPT>/spamfilter-cleanup.sh
(This will run the script at 3 am on the 15th of each month.)
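The idea behind such a clean-up can be sketched in Python as follows. This is not the shell script itself: the function name, the `max_age_days` parameter, and the dry-run default are all hypothetical. Point it only at the directory that holds the logged messages, never at the whole OSBF-Lua directory, since it would otherwise also consider configuration files and databases.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_old_files(directory: Path, max_age_days: float,
                     dry_run: bool = True,
                     now: Optional[float] = None) -> List[Path]:
    """Return (and, unless dry_run, delete) files older than max_age_days."""
    reference = time.time() if now is None else now
    cutoff = reference - max_age_days * 86400  # seconds per day
    stale = []
    for entry in sorted(directory.iterdir()):
        # Only plain files are considered; subdirectories are left alone.
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            if not dry_run:
                entry.unlink()
            stale.append(entry)
    return stale
```

Running it once with the default `dry_run=True` lists what would be deleted; passing `dry_run=False` actually removes the files.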
For regular usage, 3 kinds of user actions are required:
TrainAsHam folder (unless you don't need them any more). If you happen to use Thunderbird, messages will be copied instead of moved if you press the Ctrl key while dragging them.
I enabled the filter for the first time on 2006-08-17. For initial training I used a tiny, tiny "training set" of 4 ham and 5 spam mails (the 5th of which was already reported as "Training not necessary"). Actually, these were meant less for serious training than for testing that my filter setup actually worked.
At the same time, I turned off my old spam filters (a chain of SpamAssassin and the Naive Bayes filter built into Thunderbird). Since I had hardly trained the new filter and I routinely get 1000 or more spam mails per week (people keep telling me I'm lucky not to get more!), I expected a huge flood of spam to hit my INBOX. Amazingly, that was not what happened. Instead, the filter, trained from just 8 messages, happily started filing away most of the spam that came in. During the first day of usage, I only had to (reinforcement) train 1 spam and 4 ham messages, while the filter delivered 98 messages as "certainly spam" into the SupposedSpam folder and 23 messages as "certainly ham" into my INBOX.
Of course, in the following days, some more training was necessary – but far less than I would have expected.
In the first month of usage, the filter classified 5985 messages:
Classified as | Spam | Ham |
---|---|---|
Certain | 4472 | 1183 |
Uncertain | 198 | 132 |
Sum | 4670 | 1315 |
In 330 cases the classifier was uncertain of the classification, i.e. I had to check and reinforce the decision.
Obviously, some of the classifications turned out to be wrong:
Misclassified as | Spam | Ham |
---|---|---|
Certain | 4 | 6 |
Uncertain | 24 | 17 |
Sum | 28 | 23 |
Most of the misclassified messages had been in the reinforcement zone, where the spam filter had been uncertain about its decision. The number that's slightly worrisome here is that 4 ham messages had been classified as certain spam, i.e. they ended up in the SupposedSpam folder. They all occurred in the first 1-2 weeks when the spam filter hadn't had much training and they were all borderline/problematic cases (a forwarded HTML mail with a link and very little other text; a digest from a list I hadn't read before; a commercial inquiry from a company I had been in contact with before; a very short mail from a list which had previously been abused as a spam relay).
After the first two weeks, I didn't have any such problems – but it really shows that you should check your spam folder regularly, at least until the classifier is properly trained! (Note that I used a small reinforcement threshold of 10 from the very start. Fidelis now recommends starting with a larger threshold of 20 and only reducing the threshold to 10 after the filter is well-trained. This would have meant more initial training effort, but it might have avoided such false positives.)
To judge the spam filtering performance, we can calculate the frequently used metrics precision (percentage of supposed spam which actually was spam) and recall (percentage of spam that was caught). For this we need the counts:
Count | Meaning | Value |
---|---|---|
True positives | tp, correctly identified spam | 4642 |
False positives | fp, ham misclassified as spam | 28 |
False negatives | fn, spam misclassified as ham | 23 |
Which gets us:
Metric | Value | Formula |
---|---|---|
Precision | 99.4% | tp / (tp+fp) |
Recall | 99.5% | tp / (tp+fn) |
Since 51 of 5985 messages had been misclassified, the overall accuracy is 99.1%.
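The arithmetic behind these figures can be reproduced with a few lines of Python. The counts come from the tables above; the function name `spam_metrics` is just an illustrative choice.

```python
def spam_metrics(tp: int, fp: int, fn: int, total: int) -> dict:
    """Compute precision, recall and overall accuracy as fractions."""
    return {
        "precision": tp / (tp + fp),           # share of supposed spam that was spam
        "recall": tp / (tp + fn),              # share of actual spam that was caught
        "accuracy": (total - fp - fn) / total  # share of all messages classified correctly
    }


classified_as_spam = 4670      # "Sum" of the Spam column in the first table
fp = 28                        # ham misclassified as spam
fn = 23                        # spam misclassified as ham
tp = classified_as_spam - fp   # 4642 correctly identified spam messages
total = 5985                   # all messages classified in the first month

metrics = spam_metrics(tp, fp, fn, total)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# prints:
# precision: 99.4%
# recall: 99.5%
# accuracy: 99.1%
```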
Considering that the filter had only been pretrained with 9 (actually, 8) messages, I feel it has done quite well :-)
[Last generated: 2024-09-21]