net.siefkes.nlstego.textgen
Class TextModel

java.lang.Object
  extended by net.siefkes.nlstego.textgen.TextModel
All Implemented Interfaces:
StatelessFilter

public class TextModel
extends Object
implements StatelessFilter

A text model that can be used to complete texts or to generate "typical" texts based on a prediction model. One or several sample texts must be provided to train the prediction models (train(CharSequence)); then the expected next token(s) in a sequence can be predicted (predictTokens(int, int, boolean, StringBuilder)) or "typical" texts can be generated.

Version:
$Revision: 1.30 $, $Date: 2005/07/28 15:03:12 $, $Author: siefkes $
Author:
Christian Siefkes

Field Summary
static String VOID_TOKEN
          Pseudo-token to use when a token is expected but none is available (e.g., before the first token starting a text).
 
Constructor Summary
TextModel(Configuration conf)
          Creates a new instance.
 
Method Summary
 boolean acceptForStateless(String key)
          Whether a key should be accepted for the stateless model. This implementation rejects tokens that start with an alphanumeric character to ensure correct tokenization when decoding.
 TextTokenizer createTokenizer(CharSequence contents)
          Creates a tokenizer for splitting text into tokens.
 PredictedToken findFirstEndOfSentenceToken(List<PredictedToken> predictions)
          Finds the first (most likely) end-of-sentence token among a list of predictions.
 double getThreshold()
          Returns the threshold for discarding unlikely predictions.
 boolean isEndOfSentenceToken(String token)
          Checks whether a token is suitable as the end of a sentence.
static void main(String[] args)
          Does some tests, tokenizing and printing some files.
 List<PredictedToken> predictLikely()
          Returns a list of likely predictions, sorted by probability.
 List<PredictedToken> predictLikelyWithoutRepetitions()
          Returns a list of likely predictions, sorted by probability.
 void predictTokens(int number, boolean reset, StringBuilder appender)
          Predicts the tokens that will probably be generated next, randomly selecting tokens according to their probability (considering only the first 50% of the probability mass to avoid unlikely predictions which are likely to turn out dead ends).
 void predictTokens(int number, int maxExtra, boolean reset, StringBuilder appender)
          Predicts the tokens that will probably be requested next, randomly selecting tokens according to their probability (considering only the first 50% of the probability mass to avoid unlikely predictions which are likely to turn out dead ends).
protected  boolean startsWithWS(String token)
          Checks whether a token starts with normalized whitespace.
 void startTokenSequence()
          Prepares the model to start a new token sequence.
 String toString()
          Returns a string representation of this object.
 void train(CharSequence contents)
          Trains the prediction models from the provided string.
 void train(File file)
          Trains the prediction models from the contents of a file.
 void train(Reader reader)
          Trains the prediction models from the contents of the provided reader.
 void train(String[] filePaths, boolean recurse)
          Trains the prediction models from an array of file paths, using the InOutUtils.listFileContents(String[], boolean) method to resolve file contents.
 void train(URL url)
          Trains the prediction models from the contents of a URL.
 void updateState(String token, boolean train)
          Updates the prediction models by feeding them the specified token.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

VOID_TOKEN

public static final String VOID_TOKEN
Pseudo-token to use when a token is expected but none is available (e.g., before the first token starting a text).

See Also:
Constant Field Values
Constructor Detail

TextModel

public TextModel(Configuration conf)
Creates a new instance.

Parameters:
conf - the configuration to use
Method Detail

acceptForStateless

public boolean acceptForStateless(String key)
Whether a key should be accepted for the stateless model. This implementation rejects tokens that start with an alphanumeric character to ensure correct tokenization when decoding.

Specified by:
acceptForStateless in interface StatelessFilter
Parameters:
key - the key to check
Returns:
true if this key should be considered by the stateless model; false otherwise
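
The rejection rule described above can be sketched as follows. This is a minimal illustration, not the class's actual code: StatelessKeySketch and accept are hypothetical names, and it assumes that "alphanumeric" corresponds to Character.isLetterOrDigit and that an empty key is accepted.

```java
/** Hypothetical stand-in for illustration; not part of net.siefkes.nlstego. */
final class StatelessKeySketch {

    /**
     * Accepts a key unless its first character is a letter or digit.
     * Assumption: Character.isLetterOrDigit approximates "alphanumeric".
     */
    static boolean accept(String key) {
        if (key.isEmpty()) {
            return true; // assumption: an empty key has nothing to reject
        }
        // Use the code point so supplementary characters are classified correctly.
        return !Character.isLetterOrDigit(key.codePointAt(0));
    }
}
```

Under these assumptions, keys such as " word" or "," pass the filter, while "word" and "7th" are rejected.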

createTokenizer

public TextTokenizer createTokenizer(CharSequence contents)
Creates a tokenizer for splitting text into tokens.

Parameters:
contents - the text to tokenize
Returns:
a tokenizer initialized to the given text

getThreshold

public double getThreshold()
Returns the threshold for discarding unlikely predictions.

Returns:
the threshold value

isEndOfSentenceToken

public boolean isEndOfSentenceToken(String token)
Checks whether a token is suitable as the end of a sentence.

Parameters:
token - the token to check
Returns:
true iff this token is a typical end-of-sentence token

findFirstEndOfSentenceToken

public PredictedToken findFirstEndOfSentenceToken(List<PredictedToken> predictions)
Finds the first (most likely) end-of-sentence token among a list of predictions.

Parameters:
predictions - the predictions to check
Returns:
the first (most likely) end-of-sentence token found in the list, or null if there are none
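
Since the predictions are sorted by probability, the lookup amounts to a linear scan that stops at the first match. The sketch below illustrates this under stated assumptions: Pred is a hypothetical stand-in for PredictedToken, and the end-of-sentence test (".", "!" or "?" after trimming) is a guess, since this page does not say which tokens count as typical sentence endings.

```java
import java.util.List;

/** Hypothetical sketch; Pred stands in for PredictedToken. */
final class EndOfSentenceSketch {

    /** Minimal stand-in for a prediction: a token and its probability. */
    static final class Pred {
        final String token;
        final double probability;

        Pred(String token, double probability) {
            this.token = token;
            this.probability = probability;
        }
    }

    /** Assumption: a token ends a sentence if, after trimming, it is ".", "!" or "?". */
    static boolean isEndOfSentence(String token) {
        String t = token.trim();
        return t.equals(".") || t.equals("!") || t.equals("?");
    }

    /** Returns the first match in a list sorted by descending probability, or null. */
    static Pred findFirst(List<Pred> predictions) {
        for (Pred p : predictions) {
            if (isEndOfSentence(p.token)) {
                return p;
            }
        }
        return null;
    }
}
```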

predictLikely

public List<PredictedToken> predictLikely()
Returns a list of likely predictions, sorted by probability. Applies the configured dynamic threshold to discard unlikely predictions: if P(pred_{n+1}) < P(pred_n) / threshold, pred_{n+1} and all following (less likely) predictions will be discarded.

Don't forget to call updateState(String, boolean) after choosing a token from the result list.

Returns:
a list of predictions whose probability is high enough, sorted by probability
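
The dynamic threshold rule can be illustrated on a plain list of probabilities. This is a sketch only: DynamicThresholdSketch and keepLikely are hypothetical names, the input is assumed to be sorted in descending order, and the threshold is assumed to be greater than 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the dynamic threshold cut; not the class's actual code. */
final class DynamicThresholdSketch {

    /**
     * Keeps the leading probabilities of a descending-sorted list and cuts at
     * the first position where P(pred_{n+1}) < P(pred_n) / threshold.
     */
    static List<Double> keepLikely(List<Double> sortedProbs, double threshold) {
        List<Double> kept = new ArrayList<>();
        for (int i = 0; i < sortedProbs.size(); i++) {
            if (i > 0 && sortedProbs.get(i) < sortedProbs.get(i - 1) / threshold) {
                break; // this prediction and all less likely ones are discarded
            }
            kept.add(sortedProbs.get(i));
        }
        return kept;
    }
}
```

For example, with probabilities 0.4, 0.3, 0.05, 0.04 and a threshold of 2, the first two entries are kept: 0.3 is not below 0.4 / 2, but 0.05 is below 0.3 / 2, so it and everything after it are dropped.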

predictLikelyWithoutRepetitions

public List<PredictedToken> predictLikelyWithoutRepetitions()
Returns a list of likely predictions, sorted by probability. Predictions that would lead to the repetition of a recent phrase are filtered out.

Returns:
a list of likely predictions
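
This page does not specify how a "recent phrase" is detected. One possible reading is sketched below: a candidate is dropped if appending it to the last tokens would recreate a token sequence of a fixed length that already occurs in the history. The class name, the fixed phrase length, the exact-match comparison and the treatment of the whole history as "recent" are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a repetition filter; the real criterion is not documented here. */
final class RepetitionFilterSketch {

    /**
     * Drops candidates that would complete a phrase of phraseLen tokens
     * already present in history. Assumptions: fixed phrase length (>= 1),
     * exact token match, and the whole history counts as "recent".
     */
    static List<String> withoutRepetitions(List<String> history,
                                           List<String> candidates,
                                           int phraseLen) {
        int prefixLen = phraseLen - 1;
        if (history.size() < phraseLen) {
            return new ArrayList<>(candidates); // too short to contain a phrase
        }
        List<String> prefix = history.subList(history.size() - prefixLen, history.size());
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            boolean repeats = false;
            // Slide a window over the history and compare it with prefix + candidate.
            for (int start = 0; start + phraseLen <= history.size() && !repeats; start++) {
                repeats = history.subList(start, start + prefixLen).equals(prefix)
                        && history.get(start + prefixLen).equals(candidate);
            }
            if (!repeats) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```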

predictTokens

public final void predictTokens(int number,
                                boolean reset,
                                StringBuilder appender)
Predicts the tokens that will probably be generated next, randomly selecting tokens according to their probability (considering only the first 50% of the probability mass to avoid unlikely predictions which are likely to turn out dead ends). The number of predicted words will be in the range from number to 4/3 × number to attain a suitable ending (it might also be lower if the prediction model runs out of predictions).

Parameters:
number - the number of words to generate, >0
reset - whether to reset the model to the start of a new sentence; if false, the current model state won't be modified
appender - a StringBuilder to append the predictions to
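
Restricting the random choice to the first 50% of the probability mass can be sketched as follows. The names are hypothetical, the probabilities are assumed to be sorted in descending order, and it is an assumption that the prediction crossing the 50% boundary is still included. The random number is passed in as u so that the selection is reproducible.

```java
import java.util.List;

/** Hypothetical sketch of sampling from the head of a probability distribution. */
final class TopMassSamplingSketch {

    /**
     * Picks an index from descending-sorted probabilities, restricted to the
     * shortest prefix whose cumulative probability reaches massLimit.
     * u is a uniform random number in [0, 1), e.g. from Random.nextDouble().
     */
    static int pick(List<Double> sortedProbs, double massLimit, double u) {
        // Find the prefix covering the first massLimit of the probability mass.
        double mass = 0.0;
        int end = 0;
        while (end < sortedProbs.size() && mass < massLimit) {
            mass += sortedProbs.get(end);
            end++;
        }
        // Sample within that prefix, proportionally to each probability.
        double target = u * mass;
        double cumulative = 0.0;
        for (int i = 0; i < end; i++) {
            cumulative += sortedProbs.get(i);
            if (target < cumulative) {
                return i;
            }
        }
        return end - 1; // guards against rounding; -1 if the list is empty
    }
}
```

With probabilities 0.4, 0.3, 0.2, 0.1 and a limit of 0.5, only the first two predictions can ever be selected, in a ratio of 4 to 3.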

predictTokens

public final void predictTokens(int number,
                                int maxExtra,
                                boolean reset,
                                StringBuilder appender)
Predicts the tokens that will probably be requested next, randomly selecting tokens according to their probability (considering only the first 50% of the probability mass to avoid unlikely predictions which are likely to turn out dead ends). The number of predicted words will be in the range from number to number + maxExtra to attain a suitable ending (it might also be lower if the prediction model ran out of predictions).

Parameters:
number - the number of words to generate, >0
maxExtra - the maximum number of extra words that can be generated until a suitable ending is found
reset - whether to reset the model to the start of a new sentence; if false, the current model state won't be modified
appender - a StringBuilder to append the predictions to (for output/debugging)
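
The length rule (at least number words, at most number + maxExtra, fewer if predictions run out) can be sketched as a loop over a token source. LengthControlSketch is a hypothetical name, an Iterator stands in for the prediction model, every token is counted as a word, and sentence-final punctuation is assumed to mark a suitable ending.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of the length control; not the class's actual code. */
final class LengthControlSketch {

    /** Assumption: sentence-final punctuation marks a suitable ending. */
    static boolean isEnding(String token) {
        return token.equals(".") || token.equals("!") || token.equals("?");
    }

    /**
     * Collects tokens until at least number were taken and the last one is a
     * suitable ending, giving up after number + maxExtra tokens. Stops early
     * if the source runs out of tokens.
     */
    static List<String> generate(Iterator<String> source, int number, int maxExtra) {
        List<String> out = new ArrayList<>();
        while (source.hasNext() && out.size() < number + maxExtra) {
            String token = source.next();
            out.add(token);
            if (out.size() >= number && isEnding(token)) {
                break; // suitable ending reached within the allowed range
            }
        }
        return out;
    }
}
```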

startsWithWS

protected boolean startsWithWS(String token)
Checks whether a token starts with normalized whitespace.

Parameters:
token - the token to check
Returns:
true if the token starts with normalized whitespace

startTokenSequence

public void startTokenSequence()
Prepares the model to start a new token sequence.


toString

public String toString()
Returns a string representation of this object.

Overrides:
toString in class Object
Returns:
a textual representation

train

public final void train(CharSequence contents)
Trains the prediction models from the provided string.

Parameters:
contents - the training data

train

public final void train(File file)
                 throws IOException
Trains the prediction models from the contents of a file.

Parameters:
file - the source of the training data
Throws:
IOException - if an I/O error occurs

train

public final void train(Reader reader)
                 throws IOException
Trains the prediction models from the contents of the provided reader. The reader is not closed after usage.

Parameters:
reader - the source of the training data
Throws:
IOException - if an I/O error occurs

train

public final void train(String[] filePaths,
                        boolean recurse)
                 throws IOException
Trains the prediction models from an array of file paths, using the InOutUtils.listFileContents(String[], boolean) method to resolve file contents.

Parameters:
filePaths - the array of file paths
recurse - whether to recursively add the children of folders and other files containing nested entries
Throws:
IOException - if an I/O error occurs

train

public final void train(URL url)
                 throws IOException
Trains the prediction models from the contents of a URL.

Parameters:
url - the source of the training data
Throws:
IOException - if an I/O error occurs

updateState

public final void updateState(String token,
                              boolean train)
Updates the prediction models by feeding them the specified token.

Parameters:
token - the token to add
train - true iff the models should be trained, i.e. this request should be incorporated into the prediction models; otherwise only the internal state of the models will be changed so the next call for predictions will consider the updated state
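
The difference between training and a pure state update can be illustrated with a toy bigram counter. This is not how the prediction models of this class are implemented; BigramStateSketch, its methods and the "<void>" placeholder value are hypothetical and only mirror the documented behavior of the train flag.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy model showing train versus state-only updates. */
final class BigramStateSketch {

    /** Placeholder context before the first token; the value is an assumption. */
    static final String VOID = "<void>";

    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private String previous = VOID;

    /** Feeds a token; counts change only if train is true, the context always moves. */
    void update(String token, boolean train) {
        if (train) {
            counts.computeIfAbsent(previous, k -> new HashMap<>())
                  .merge(token, 1, Integer::sum);
        }
        previous = token;
    }

    /** Returns how often token followed the current context during training. */
    int countAfterCurrent(String token) {
        Map<String, Integer> next = counts.get(previous);
        return next == null ? 0 : next.getOrDefault(token, 0);
    }

    /** Returns to the start-of-sequence context without touching the counts. */
    void startSequence() {
        previous = VOID;
    }
}
```

In this sketch, update("a", false) moves the context to "a" so that the next lookup is conditioned on it, but leaves the learned counts unchanged.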

main

public static void main(String[] args)
Does some tests, tokenizing and printing some files.

Parameters:
args - the command line arguments (ignored)


Copyright © 2003-2005 Christian Siefkes. All Rights Reserved.