net.siefkes.nlstego.util
Class TokenizerFactory

java.lang.Object
  extended by net.siefkes.nlstego.util.TokenizerFactory

public class TokenizerFactory
extends Object

Factory for creating TextTokenizers of different types.

Version:
$Revision: 1.4 $, $Date: 2005/07/12 17:02:22 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String WHITESPACE_CONTROL_OTHER
          Pattern string capturing whitespace and control/other characters.
 
Constructor Summary
TokenizerFactory()
           
 
Method Summary
static TextTokenizer createAlnumTokenizer(CharSequence text)
          Static factory method to create an instance for tokenizing alphanumeric and symbol sequences and punctuation.
static TextTokenizer createCategoryTokenizer(CharSequence text)
          Static factory method to create an instance for tokenizing according to Unicode categories.
static TextTokenizer createThoroughTokenizer(CharSequence text)
          Static factory method to create an instance that uses the "thorough" patterns listed below.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

WHITESPACE_CONTROL_OTHER

public static final String WHITESPACE_CONTROL_OTHER
Pattern string capturing whitespace and control/other characters.

See Also:
Constant Field Values
Constructor Detail

TokenizerFactory

public TokenizerFactory()
Method Detail

createAlnumTokenizer

public static TextTokenizer createAlnumTokenizer(CharSequence text)
Static factory method to create an instance for tokenizing alphanumeric and symbol sequences and punctuation. Token types:

If you are only interested in words and numbers (e.g. for indexing), use the captured text: it contains the full token for alphanumeric sequences and is empty for symbols and punctuation.

The whitespace pattern matches a sequence of whitespace and control/other characters (Unicode categories "Z" and "C").
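
The following is a minimal, self-contained sketch of what a pattern over the "Z" and "C" categories can look like with java.util.regex. It does not use this class: the constant WHITESPACE_CONTROL_OTHER_SKETCH, the class name and the helper splitOnWhitespace are hypothetical, and the actual value of WHITESPACE_CONTROL_OTHER may differ.

```java
import java.util.regex.Pattern;

public class WhitespacePatternSketch {
    // Assumption: one or more characters from the Unicode categories
    // Z (separators) and C (control/other). Not the library's actual constant.
    static final String WHITESPACE_CONTROL_OTHER_SKETCH = "[\\p{Z}\\p{C}]+";

    // Hypothetical helper: splits the text wherever the sketch pattern matches.
    static String[] splitOnWhitespace(CharSequence text) {
        return Pattern.compile(WHITESPACE_CONTROL_OTHER_SKETCH).split(text);
    }

    public static void main(String[] args) {
        // Tab and BEL are control characters (C); U+00A0 is a space separator (Z).
        for (String token : splitOnWhitespace("one\ttwo\u00A0three\u0007four")) {
            System.out.println(token);
        }
    }
}
```

A single character class keeps mixed runs (for example a space followed by a tab) together as one separator, which matches the "sequence of" wording above.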

Parameters:
text - the text to tokenize
Returns:
the created tokenizer

createCategoryTokenizer

public static TextTokenizer createCategoryTokenizer(CharSequence text)
Static factory method to create an instance for tokenizing according to Unicode categories. Token types:

If you are only interested in words and numbers (e.g. for indexing), use the captured text: it contains the full token for letter and digit sequences and is empty for symbols and punctuation.

The whitespace pattern matches a sequence of whitespace and control/other characters (Unicode categories "Z" and "C").
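
The captured-text behaviour described above can be illustrated with plain java.util.regex. This is only a sketch under assumptions, not the patterns this factory actually installs: the class CategoryTokenSketch, the method tokenize and the grouping into letter runs, digit runs and single punctuation/symbol characters are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryTokenSketch {
    // Assumed token patterns: group 1 captures letter runs (L), group 2 captures
    // digit/number runs (N); a single punctuation (P) or symbol (S) character
    // matches without any capturing group.
    private static final Pattern TOKEN =
        Pattern.compile("(\\p{L}+)|(\\p{N}+)|[\\p{P}\\p{S}]");

    // Returns {token, capturedText} pairs. Characters that match none of the
    // alternatives (e.g. whitespace and control characters) are skipped by find().
    static List<String[]> tokenize(CharSequence text) {
        List<String[]> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String captured = m.group(1) != null ? m.group(1)
                : m.group(2) != null ? m.group(2)
                : ""; // symbols and punctuation capture nothing
            result.add(new String[] {m.group(), captured});
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] pair : tokenize("Room 42, ok!")) {
            System.out.println("token=" + pair[0] + " captured=" + pair[1]);
        }
    }
}
```

An indexer built on this sketch would keep only the pairs whose captured text is non-empty, which drops symbols and punctuation without a separate filtering step.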

Parameters:
text - the text to tokenize
Returns:
the created tokenizer

createThoroughTokenizer

public static TextTokenizer createThoroughTokenizer(CharSequence text)
Static factory method to create an instance that uses the "thorough" patterns listed below.

With these patterns, TextTokenizer.capturedText() does not return any useful information.

The whitespace pattern matches a sequence of whitespace and control/other characters.

Parameters:
text - the text to tokenize
Returns:
the created tokenizer


Copyright © 2003-2005 Christian Siefkes. All Rights Reserved.