The OSBF-Lua filter written by Fidelis Assis is an amazing spam filter, which is, however, somewhat hard to train. The filter requires feedback when it has misclassified an email or when it is uncertain about classification ("reinforcement training"). The "normal" way to train is to reply to yourself with a re-classification command in the subject line, which is a somewhat complicated and error-prone process.
This program makes the process easier by scanning two Mbox folders for messages to train as spam and as ham, respectively. (It should also work for Maildir folders, but I haven't tested this.) The user moves or copies messages that have been misclassified or require reinforcement into the appropriate folder, and this program is run periodically by a cron entry to train the filter from all messages it finds in these folders.
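The scanning step can be sketched roughly as follows. This is not the actual trainspamfilter.pl, only a minimal Python illustration using the standard `mailbox` module. The function name `train_from_mbox` and the `train` callback (a stand-in for whatever command hands a message to OSBF-Lua) are hypothetical, and removing each message after training is an assumption about how repeated training of the same message is avoided.

```python
import mailbox
from typing import Callable


def train_from_mbox(path: str, label: str,
                    train: Callable[[bytes, str], None]) -> int:
    """Feed every message in the mbox at `path` to `train`, then empty the folder.

    `train` is a hypothetical callback that would pass the raw message to the
    filter; `label` is "spam" or "ham". Returns the number of messages handled.
    """
    box = mailbox.mbox(path)
    box.lock()  # keep other programs from writing to the folder meanwhile
    try:
        count = 0
        for key in list(box.keys()):
            train(box.get_bytes(key), label)
            box.remove(key)  # assumption: trained messages are dropped
            count += 1
        box.flush()  # write the now-empty folder back to disk
        return count
    finally:
        box.unlock()
        box.close()
```

A cron-driven wrapper would call this once for the spam folder and once for the ham folder, each time with the matching label.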
This program is loosely based on training scripts written by John Johnston and Michael J. Chudobiak for Bill Yerazunis' CRM114. This version was created by Christian Siefkes.
Installation is a bit complicated, I'm sorry to say. But it might be worth the trouble – after installation, everything will be very easy! :-)
/usr/local/ (e.g. if you can't get root
access), you have to edit the Makefile accordingly.
/usr/local/, you have to edit the "config" file accordingly and set
the LUA_CPATH environment variable as described; you also need to correct
the "#!" interpreter line in the Lua scripts. I used $HOME/.osbf-lua/
as local directory instead of $HOME/osbf-lua/ so as not to clutter my
home directory.
osbf.cfg_threshold to 20 in the
config file, since the interval [-20, +20] is best for reinforcement
training. You can reduce the threshold back to 10 after the databases are
well trained. (Personally, I used 10 since I didn't know this recommendation
when I installed the filter.)
# Write OSBF-Lua label + score to classify.log file (for statistics):
:0Ach:
| perl -ne 'if (s/^X-OSBF-Lua-Score:\s*([^\]]*\]).*/\1/i) {print localtime(time()) . ": $_"};' >> $OSBF_LUA_USER_DIR/classify.log
# Move mails tagged by the spam filter as probably spam into a special folder
:0AD:
* ^X-OSBF-Lua-Score:.*\[S\]
$MAILDIR/SupposedSpam
osbf.cfg_output = "message" line in the
spamfilter_config.lua file (to get the training report) and then add the
following procmail recipe after the other ones. The recipe identifies all
reports on successful trainings and deletes them, appending only the
subject line to a log file (train-results.log in the local OSBF
directory). Note that this recipe will destroy all matching messages
so be careful if you want to use it!
# Delete OSBF-Lua training reports (unless there was an error), logging
# only training result and original subject line in train-results.log.
# (These messages have a "X-OSBF-Lua-Version" header instead of
# "X-OSBF-Lua-Score", Subject starts with "Train" unless there was an error.)
# Note that this recipe will DESTROY all matching messages so be careful!
:0A:
* ^X-OSBF-Lua-Version:
* !^X-OSBF-Lua-Score:
* ^Subject: Train
| perl -e 'print "" . localtime(time()); while (<>) { print ": $1" if /^Subject:\s(.*)/}; print "\n";' >> $OSBF_LUA_USER_DIR/train-results.log
trainspamfilter.pl .
spamfilter.lua (or any other suitable location).
*/10 * * * * <PATH-TO-SCRIPT>/trainspamfilter.pl
OSBF-Lua logs all incoming messages in its local directory. Messages need to
be available for a few days for training, but there is no need to store them
there forever. I have written a small shell script that can be used to clean up
this directory from time to time. If you want to use this script, download and
install it just like the other one. The script assumes $HOME/.osbf-lua/ as
your local OSBF-Lua directory; if you use a different one, you'll have to
adapt it.
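For illustration, the kind of clean-up such a script performs can be sketched in Python. This is not the shell script itself. The function name `remove_old_files`, the default of 30 days and the choice to select files by modification time are assumptions made for the example, and the directory to clean is passed in rather than guessed.

```python
import os
import time


def remove_old_files(directory: str, max_age_days: float = 30.0) -> list:
    """Delete regular files in `directory` last modified more than
    `max_age_days` ago. Subdirectories are left alone.

    Returns the sorted names of the deleted files.
    """
    cutoff = time.time() - max_age_days * 86400  # 86400 seconds per day
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot first, then delete
    removed = []
    for entry in entries:
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Since messages need to stay available for a few days for training, `max_age_days` should not be set too low.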
To execute this script once a month, add this entry or a similar one to your crontab:
0 3 15 * * <PATH-TO-SCRIPT>/spamfilter-cleanup.sh
(This will run the script at 3 am on the 15th of each month.)
For regular usage, 3 kinds of user actions are required:
TrainAsHam folder
(unless you don't need them any more). If you happen to use Thunderbird,
messages will be copied instead of moved if you press the Ctrl key while
dragging them.
I enabled the filter for the first time on 2006-08-17. For initial training I used a tiny, tiny "training set" of 4 ham and 5 spam mails (the 5th of which was already reported as "Training not necessary"). Actually, these were meant less for serious training than for testing that my filter setup actually worked.
At the same time, I turned off my old spam filters (a chain of SpamAssassin and the Naive Bayes filter built into Thunderbird). Since I practically hadn't trained the new filter and I routinely get 1000 or more spam mails per week (people keep telling me I'm lucky not to get more!), I expected a huge flood of spam to hit my INBOX. Amazingly, that was not what happened. Instead, the filter, trained from just 8 messages, happily started filing away most of the spam that came in. During the first day of usage, I only had to (reinforcement) train 1 spam and 4 ham messages, while the filter delivered 98 messages as "certainly spam" into the SupposedSpam folder and 23 messages as "certainly ham" into my INBOX.
Of course, in the following days, some more training was necessary – but far less than I would have expected.
In the first month of usage, the filter classified 5985 messages:
| Classified as | Spam | Ham |
| Certain | 4472 | 1183 |
| Uncertain | 198 | 132 |
| Sum | 4670 | 1315 |
In 330 cases the classifier was uncertain of the classification, i.e. I had to check and reinforce the decision.
Obviously, some of the classifications turned out to be wrong:
| Misclassified as | Spam | Ham |
| Certain | 4 | 6 |
| Uncertain | 24 | 17 |
| Sum | 28 | 23 |
Most of the misclassified messages had been in the reinforcement zone, where the spam filter had been uncertain about its decision. The number that's slightly worrisome here is that 4 ham messages had been classified as certain spam, i.e. they ended up in the SupposedSpam folder. They all occurred in the first 1-2 weeks when the spam filter hadn't had much training and they were all borderline/problematic cases (a forwarded HTML mail with a link and very little other text; a digest from a list I hadn't read before; a commercial inquiry from a company I had been in contact with before; a very short mail from a list which had previously been abused as a spam relay).
After the first two weeks, I didn't have any such problems – but it really shows that you should check your spam folder regularly, at least until the classifier is properly trained! (Note that I used a small reinforcement threshold of 10 from the very start. Fidelis now recommends starting with a larger threshold of 20 and only reducing the threshold to 10 after the filter is well-trained. This would have meant more initial training effort, but it might have avoided such false positives.)
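The effect of the threshold can be illustrated with a small sketch. It encodes only what is described here: a message whose score falls inside the interval [-threshold, +threshold] counts as uncertain and should be trained. The function name `needs_reinforcement` is made up for the example, and treating the boundary values as part of the zone is an assumption.

```python
def needs_reinforcement(score: float, threshold: float = 20.0) -> bool:
    """Return True if `score` lies in the reinforcement zone
    [-threshold, +threshold] (boundaries included by assumption)."""
    return abs(score) <= threshold


# A score of 15 counts as "certain" with a threshold of 10, but falls into
# the reinforcement zone once the threshold is raised to 20.
print(needs_reinforcement(15.0, threshold=10.0))  # False
print(needs_reinforcement(15.0, threshold=20.0))  # True
```

With the larger threshold more messages land in the zone, which is where the additional initial training effort comes from.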
To judge the spam filtering performance, we can calculate the frequently used metrics precision (percentage of supposed spam which actually was spam) and recall (percentage of spam that was caught). For this we need the counts:
| True positives | (tp, correctly identified spam): | 4642 |
| False positives | (fp, ham misclassified as spam): | 28 |
| False negatives | (fn, spam misclassified as ham): | 23 |
Which gets us:
| Precision: | 99.4% | (tp / (tp+fp)) |
| Recall: | 99.5% | (tp / (tp+fn)) |
Since 51 of 5985 messages had been misclassified, the overall accuracy is 99.1%.
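The arithmetic behind these figures can be checked with a few lines of Python. The function name `spam_metrics` and its parameter names are chosen for this example only; the input numbers are the sums from the two tables above.

```python
def spam_metrics(classified_spam: int, classified_ham: int,
                 false_positives: int, false_negatives: int) -> dict:
    """Compute precision, recall and accuracy from the table sums."""
    tp = classified_spam - false_positives  # correctly identified spam
    total = classified_spam + classified_ham
    errors = false_positives + false_negatives
    return {
        "tp": tp,
        "precision": tp / (tp + false_positives),
        "recall": tp / (tp + false_negatives),
        "accuracy": (total - errors) / total,
    }


m = spam_metrics(classified_spam=4670, classified_ham=1315,
                 false_positives=28, false_negatives=23)
print(f"tp={m['tp']}, precision={m['precision']:.1%}, "
      f"recall={m['recall']:.1%}, accuracy={m['accuracy']:.1%}")
# tp=4642, precision=99.4%, recall=99.5%, accuracy=99.1%
```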
Considering that the filter had only been pretrained with 9 (actually, 8) messages, I feel it has done quite well :-)
[Last generated: 2009-06-07]