Spam Filtering and Text Classification
Spam is ubiquitous, adapts quickly to countermeasures, and is highly
annoying. This makes spam filtering one of the most important and
interesting research topics in text classification and text mining.
During my research, I have worked with the spam filtering geniuses
Bill Yerazunis (CRM114) and Fidelis Assis (OSBF-Lua) to advance the
state of the art in spam filtering and text classification. I have also
contributed some small software utilities of my own and published several
papers in the field.
Software
- CRM114 and OSBF-Lua (GPL): The two excellent spam filters and
general-purpose text classifiers I've been involved with.
- trainspamfilter (Perl, public domain): A
small script that makes it easier to train the OSBF-Lua spam filter,
converting the training process into a drag-and-drop operation.
- Moonfilter (Lua, GPL): A general-purpose
text classifier based on OSBF-Lua. It is usable both as a Lua module and
as a stand-alone script that can easily be controlled from the
command line.
Talk
As part of my PhD defense, I gave a talk on "Challenges in Spam Filtering
Research." The talk highlights some of the current issues facing the
anti-spam community and gives a short historical overview of the
development of spam and spam filters.
The abstract and slides of the talk are available.
(This wasn't the only talk of my disputation; I also gave an overview of
the information extraction approach I pursued in my thesis, which you will
find on my information extraction page.)
Publications
My publications on spam filtering and text classification:
- Exponential Differential Document Count: A Feature Selection Factor
for Improving Bayesian Filters Accuracy, Fidelis Assis, William
Yerazunis, Christian Siefkes, and Shalendra Chhabra. In 2006 Spam
Conference, Cambridge, MA, 2006.
- CRM114 versus Mr. X: CRM114 Notes for the TREC 2005 Spam Track,
Fidelis Assis, William Yerazunis, Christian Siefkes, and Shalendra
Chhabra. In TREC: Text REtrieval Conference, 2005.
- A Unified Model of Spam Filtration, William S. Yerazunis, Shalendra
Chhabra, Christian Siefkes, Fidelis Assis, and Dimitrios Gunopulos. In
2005 Spam Conference, Cambridge, MA, 2005.
This article has been reprinted in: Satish D, Rajesh Prabhakar,
editors: Combating Spam. Icfai Books, Hyderabad, India, 2007.
- Spam Filtering using a Markov Random Field Model with Variable
Weighting Schemas, Shalendra Chhabra, William S. Yerazunis, and
Christian Siefkes. In ICDM '04: Proceedings of the Fourth IEEE
International Conference on Data Mining, 2004.
- Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam
Filtering, Christian Siefkes, Fidelis Assis, Shalendra Chhabra, and
William S. Yerazunis. In Jean-Francois Boulicaut, Floriana Esposito,
Fosca Giannotti, and Dino Pedreschi, editors, PKDD 2004: Proceedings of
the 8th European Conference on Principles and Practice of Knowledge
Discovery in Databases, volume 3202 of Lecture Notes in Artificial
Intelligence, pages 410-421. Springer, 2004.
Copyright © Springer-Verlag 2004.
[Last generated: 2024-09-21]