The primary goal of my Ph.D. research was to develop and evaluate a statistical system for the extraction of information from natural language texts that supports incremental training and that takes the structure of texts into account. The full title of my Ph.D. thesis is "An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models." I defended the thesis in early 2007.
For a (very) short impression of my work, you might want to check the slides of the talk I gave about it during my Ph.D. defense. (This wasn't the only talk of my defense; I also gave a longer talk on "Challenges in Spam Filtering Research," which you will find on my spam filtering page.)
While doing my Ph.D., I was a member of the Berlin-Brandenburg Graduate School in Distributed Information Systems. My primary adviser was Professor Heinz F. Schweppe, the head of the Database and Information Systems Group of the Freie Universität Berlin. Before the start of the scholarship, I worked for some months as a research assistant for the FEx project on related topics.
For historical reasons, I have kept the initial proposal of my dissertation plans from January 2003 around. It's amazing how much has changed since then, but the core ideas are already there.
My publications on information extraction and related topics:
My publications regarding text classification and spam filtering can be found on my spam filtering page.
The software I wrote during my Ph.D. project is called Trainable Information Extractor, or TIE for short. It is a statistical system that supports not only information extraction but also text classification and related tasks such as preprocessing and XML merging and repair.
It's written in Java and available under the GPL. But be warned that this is experimental software. It worked fine for my purposes, but it's hardly ready for general use because it lacks sufficient user documentation, convenient user interfaces, etc. Sorry – you know how it is.
Last generated: 2008-04-28