Spam Filtering and Text Classification

Spam is ubiquitous, highly adaptive to countermeasures, and most annoying. This makes spam filtering one of the most important and interesting research topics in text classification and text mining. During my research, I have cooperated with the spam filtering geniuses Bill Yerazunis (CRM114) and Fidelis Assis (OSBF-Lua), trying to advance the state of the art in spam filtering and text classification. I have also contributed some small software utilities of my own and published several papers in the field.

Software

CRM114 and OSBF-Lua (GPL): The two excellent spam filters and general-purpose text classifiers I've been involved with.
trainspamfilter (Perl, public domain): A small script that makes it easier to train the OSBF-Lua spam filter, converting the training process into a drag-and-drop operation.
Moonfilter (Lua, GPL): A general-purpose text classifier based on OSBF-Lua. It can be used either as a Lua module or as a stand-alone script that is easily controlled from the command line.

Talk

As part of my PhD defense, I gave a talk on "Challenges in Spam Filtering Research." The talk not only highlights some of the current issues facing the anti-spam community, but also gives a short historical overview of the development of spam and spam filters. The abstract and slides of the talk are available.

(This wasn't the only talk of my disputation; I also gave an overview of the information extraction approach I pursued in my thesis, which you will find on my information extraction page.)

Publications

My publications on spam filtering and text classification:

Exponential Differential Document Count: A Feature Selection Factor for Improving Bayesian Filters Accuracy, Fidelis Assis, William Yerazunis, Christian Siefkes, and Shalendra Chhabra. In 2006 Spam Conference, Cambridge, MA, 2006.
CRM114 versus Mr. X: CRM114 Notes for the TREC 2005 Spam Track, Fidelis Assis, William Yerazunis, Christian Siefkes, and Shalendra Chhabra. In TREC: Text REtrieval Conference, 2005.
A Unified Model of Spam Filtration, William S. Yerazunis, Shalendra Chhabra, Christian Siefkes, Fidelis Assis, and Dimitrios Gunopulos. In 2005 Spam Conference, Cambridge, MA, 2005.
This article has been reprinted in: Satish D, Rajesh Prabhakar, editors: Combating Spam. Icfai Books, Hyderabad, India, 2007.
Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas, Shalendra Chhabra, William S. Yerazunis, and Christian Siefkes. In ICDM '04: Proceedings of the Fourth IEEE International Conference on Data Mining, 2004.
Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering, Christian Siefkes, Fidelis Assis, Shalendra Chhabra, and William S. Yerazunis. In Jean-Francois Boulicaut, Floriana Esposito, Fosca Giannotti, and Dino Pedreschi, editors, PKDD 2004: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, volume 3202 of Lecture Notes in Artificial Intelligence, pages 410-421. Springer, 2004.
Copyright © Springer-Verlag 2004.

[Last generated: 2024-09-21]