Information Extraction

Thesis

The primary goal of my Ph.D. research was to develop and evaluate a statistical system for the extraction of information from natural language texts that supports incremental training and that takes the structure of texts into account. The full title of my Ph.D. thesis is "An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models." I defended the thesis in early 2007. The full text of my thesis is available online (also as PDF), and it has been published as a printed book:

An Incrementally Trainable Statistical Approach to Information Extraction. VDM Verlag, Saarbrücken, 2008.

My Ph.D. thesis, finished in 2007: The purpose of Information Extraction (IE) is to find desired pieces of information in unstructured or weakly structured texts and store them in a form that is suitable for automatic querying and processing. This book presents an innovative approach to statistical information extraction. It introduces a new algorithm which supports functionality not available in previous IE systems, such as interactive incremental training to reduce the human training effort.
The system also utilizes new sources of information, employing rich tree-based context representations to combine document structure (HTML or XML markup) with linguistic and semantic information. The resulting IE system is designed as a generic framework for statistical information extraction.

Printed copies can be bought from Amazon.com or Amazon.de as well as through other offline or online booksellers.

Research Overview

To get a (very) short impression of what I have been doing, you might want to check the slides of the talk I gave about my work during my PhD defense. (This wasn't the only talk of my defense; I also gave a longer talk on "Challenges in Spam Filtering Research," which you will find on my spam filtering page.)

While doing my Ph.D., I was a member of the Berlin-Brandenburg Graduate School in Distributed Information Systems. My primary adviser was Professor Heinz F. Schweppe, the head of the Database and Information Systems Group of the Freie Universität Berlin. Before the start of the scholarship, I worked for some months as a research assistant for the FEx project on related topics.

For historical reasons, I have kept the initial proposal of my dissertation plans from January 2003 around. It's amazing how much has changed since then, but the core ideas are already there.

Publications

My publications on information extraction and related topics:

My publications regarding text classification and spam filtering can be found on my spam filtering page.

Software

The software I wrote during my PhD project is called Trainable Information Extractor, or TIE for short: a statistical system that supports not just information extraction, but also text classification and some related tasks such as preprocessing and XML merging and repair.

It's written in Java and available under the GPL. But be warned that this is experimental software. It worked fine for my purposes, but it's hardly ready for general use, due to the lack of sufficient user documentation, convenient user interfaces, etc. Sorry, you know how it is.

WARNING: apparently, the TIE classifiers don't work correctly under Java 6. Please use Java 5 instead.

Courses Given


[Last generated: 2017-04-26]