java.lang.Object
  net.siefkes.nlstego.util.TokenizerFactory
public class TokenizerFactory
Factory for creating TextTokenizers of
different types.
| Field Summary | |
|---|---|
| static String | WHITESPACE_CONTROL_OTHER: Pattern string capturing whitespace and control/other characters. |
| Constructor Summary | |
|---|---|
| TokenizerFactory() | |
| Method Summary | |
|---|---|
| static TextTokenizer | createAlnumTokenizer(CharSequence text): Static factory method to create an instance for tokenizing alphanumeric and symbol sequences and punctuation. |
| static TextTokenizer | createCategoryTokenizer(CharSequence text): Static factory method to create an instance for tokenizing according to Unicode categories. |
| static TextTokenizer | createThoroughTokenizer(CharSequence text): Static factory method to create an instance that uses the "thorough" patterns. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final String WHITESPACE_CONTROL_OTHER
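The value of this constant is not reproduced on this page. As a rough illustration only, the sketch below assumes a pattern equivalent to one or more characters from the Unicode "C" (control/other) and "Z" (separator) categories, as described in the method details, and uses plain java.util.regex to split text on it. The class name, the constant name and the pattern string are assumptions for illustration, not part of TokenizerFactory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WhitespacePatternSketch {
    // Assumption: a stand-in for TokenizerFactory.WHITESPACE_CONTROL_OTHER.
    // The real constant may be defined differently.
    static final String ASSUMED_WHITESPACE_CONTROL_OTHER = "[\\p{C}\\p{Z}]+";

    // Splits the text on runs of "C" and "Z" category characters,
    // dropping any empty pieces.
    static List<String> splitOnWhitespace(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        Pattern separator = Pattern.compile(ASSUMED_WHITESPACE_CONTROL_OTHER);
        for (String part : separator.split(text)) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tab and newline are control characters ("C");
        // the no-break space U+00A0 is a separator ("Z").
        System.out.println(splitOnWhitespace("Hello,\tworld\u00A0again\n"));
    }
}
```

Note that tabs and line breaks belong to the "C" categories rather than "Z", which is presumably why the pattern covers both.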
| Constructor Detail |
|---|
public TokenizerFactory()
| Method Detail |
|---|
public static TextTokenizer createAlnumTokenizer(CharSequence text)
Static factory method to create an instance for tokenizing alphanumeric and symbol sequences and punctuation.
When you are only interested in words and numbers (e.g. for indexing),
you can use the captured text (TextTokenizer.capturedText()):
it contains the full token for alphanumeric sequences and is
empty for symbols and punctuation.
The whitespace pattern comprises a sequence of whitespace and control/other characters ("C" and "Z" categories).
Parameters:
text - the text to tokenize
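The captured-text behaviour described above can be imitated with a plain java.util.regex capturing group. This is a sketch under assumptions, not the patterns this class actually uses: the class name, the method names and the regular expression are hypothetical. Group 1 captures alphanumeric runs, while symbol and punctuation runs match without capturing anything, so their captured text comes out empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapturedTextSketch {
    // Hypothetical pattern: group 1 captures letter/number runs;
    // the second alternative matches symbol/punctuation runs uncaptured.
    static final Pattern TOKEN =
            Pattern.compile("([\\p{L}\\p{N}]+)|[\\p{P}\\p{S}]+");

    // Returns one captured text per token: the token itself for
    // alphanumeric runs, the empty string for symbols and punctuation.
    static List<String> capturedTexts(CharSequence text) {
        List<String> captured = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            String word = matcher.group(1);
            captured.add(word == null ? "" : word);
        }
        return captured;
    }

    // Keeps only the non-empty captured texts, e.g. as index terms.
    static List<String> indexTerms(CharSequence text) {
        List<String> terms = new ArrayList<>();
        for (String captured : capturedTexts(text)) {
            if (!captured.isEmpty()) {
                terms.add(captured);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(capturedTexts("Price: 42 EUR!"));
        System.out.println(indexTerms("Price: 42 EUR!"));
    }
}
```

An indexer built this way can skip tokens whose captured text is empty instead of classifying each token a second time.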
public static TextTokenizer createCategoryTokenizer(CharSequence text)
Static factory method to create an instance for tokenizing according to Unicode categories.
When you are only interested in words and numbers (e.g. for indexing),
you can use the captured text (TextTokenizer.capturedText()):
it contains the full token for letter and digit sequences and is
empty for symbols and punctuation.
The whitespace pattern comprises a sequence of whitespace and control/other characters ("C" and "Z" categories).
Parameters:
text - the text to tokenize
public static TextTokenizer createThoroughTokenizer(CharSequence text)
Static factory method to create an instance that uses the "thorough" patterns.
These patterns don't contain any useful information for
TextTokenizer.capturedText().
The whitespace pattern comprises a sequence of whitespace and control/other characters.
Parameters:
text - the text to tokenize