java.lang.Object
  net.siefkes.nlstego.textgen.TextModel
public class TextModel
A text model that can be used to complete texts or to generate "typical"
texts based on a prediction model.
One or several sample texts must be provided to train the prediction
models (train(CharSequence)), then the expected next token(s)
in a sequence can be predicted
(predictTokens(int, int, boolean, StringBuilder)) or "typical"
texts can be generated.
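The train-then-predict workflow described above can be illustrated with a toy model. The following is a minimal sketch, not the TextModel implementation: the class `ToyBigramModel`, its `VOID` constant, the whitespace tokenization, and the first-order (previous token only) counting are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy first-order token model for illustration; NOT the TextModel implementation. */
final class ToyBigramModel {
    /** Hypothetical stand-in for a pseudo-token that precedes the first real token. */
    static final String VOID = "<void>";

    // counts.get(prev).get(next) = how often `next` followed `prev` during training
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Trains on whitespace-separated tokens (an assumed, simplistic tokenization). */
    void train(CharSequence contents) {
        String prev = VOID;
        for (String token : contents.toString().trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty string
            }
            counts.computeIfAbsent(prev, k -> new HashMap<>()).merge(token, 1, Integer::sum);
            prev = token;
        }
    }

    /** Returns the tokens seen after `prev`, most frequent first (ties broken alphabetically). */
    List<String> predictLikely(String prev) {
        Map<String, Integer> next = counts.getOrDefault(prev, new HashMap<>());
        List<String> result = new ArrayList<>(next.keySet());
        result.sort((a, b) -> {
            int byCount = Integer.compare(next.get(b), next.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return result;
    }

    public static void main(String[] args) {
        ToyBigramModel model = new ToyBigramModel();
        model.train("the cat sat on the mat");
        model.train("the cat ran");
        System.out.println(model.predictLikely(VOID));  // prints [the]
        System.out.println(model.predictLikely("the")); // prints [cat, mat]
    }
}
```

Several sample texts are fed to `train` before any prediction is requested, mirroring the order of operations described for TextModel.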
| Field Summary | |
|---|---|
| static String | VOID_TOKEN - Pseudo-token to use when a token is expected but none is available (e.g., before the first token starting a text). |
| Constructor Summary | |
|---|---|
| TextModel(Configuration conf) | Creates a new instance. |
| Method Summary | |
|---|---|
| boolean | acceptForStateless(String key) - Whether a key should be accepted for the stateless model. This implementation rejects tokens that start with an alphanumeric character to ensure correct tokenization when decoding. |
| TextTokenizer | createTokenizer(CharSequence contents) - Creates a tokenizer for splitting text into tokens. |
| PredictedToken | findFirstEndOfSentenceToken(List<PredictedToken> predictions) - Finds the first (most likely) end-of-sentence token among a list of predictions. |
| double | getThreshold() - Returns the threshold for discarding unlikely predictions. |
| boolean | isEndOfSentenceToken(String token) - Checks whether a token is suitable as the end of a sentence. |
| static void | main(String[] args) - Does some tests, tokenizing and printing some files. |
| List<PredictedToken> | predictLikely() - Returns a list of likely predictions, sorted by probability. |
| List<PredictedToken> | predictLikelyWithoutRepetitions() - Returns a list of likely predictions, sorted by probability. |
| void | predictTokens(int number, boolean reset, StringBuilder appender) - Predicts the tokens that will probably be generated next, randomly selecting tokens according to their probability (considering only the first 50% of the probability mass to avoid unlikely predictions, which are likely to turn out to be dead ends). |
| void | predictTokens(int number, int maxExtra, boolean reset, StringBuilder appender) - Predicts the tokens that will probably be requested next, randomly selecting tokens according to their probability (considering only the first 50% of the probability mass to avoid unlikely predictions, which are likely to turn out to be dead ends). |
| protected boolean | startsWithWS(String token) - Checks whether a token starts with normalized whitespace. |
| void | startTokenSequence() - Prepares the model to start a new token sequence. |
| String | toString() - Returns a string representation of this object. |
| void | train(CharSequence contents) - Trains the prediction models from the provided string. |
| void | train(File file) - Trains the prediction models from the contents of a file. |
| void | train(Reader reader) - Trains the prediction models from the contents of the provided reader. |
| void | train(String[] filePaths, boolean recurse) - Trains the prediction models from an array of file paths, using the InOutUtils.listFileContents(String[], boolean) method to resolve file contents. |
| void | train(URL url) - Trains the prediction models from the contents of a URL. |
| void | updateState(String token, boolean train) - Updates the prediction models by feeding them the specified token. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|
public static final String VOID_TOKEN
| Constructor Detail |
|---|
public TextModel(Configuration conf)
conf - the configuration to use
| Method Detail |
|---|
public boolean acceptForStateless(String key)
acceptForStateless in interface StatelessFilter
key - the key to check
true if this key should be considered by the stateless model; false otherwise
public TextTokenizer createTokenizer(CharSequence contents)
contents - the text to tokenize
public double getThreshold()
public boolean isEndOfSentenceToken(String token)
token - the token to check
true iff this token is a typical end-of-sentence token
public PredictedToken findFirstEndOfSentenceToken(List<PredictedToken> predictions)
predictions - the predictions to check
the first end-of-sentence token, or null if there are none
public List<PredictedToken> predictLikely()
Don't forget to call updateState(String, boolean) after
choosing a token from the result list.
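As a rough illustration of what a threshold-filtered, probability-sorted prediction list could look like, here is a minimal sketch. The `Prediction` class is a hypothetical stand-in for PredictedToken, and treating the threshold as an inclusive lower bound on the probability is an assumption; the actual filtering rule is not documented on this page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; not the TextModel implementation. */
final class LikelyPredictions {
    /** Hypothetical stand-in for PredictedToken: a token plus its probability. */
    static final class Prediction {
        final String token;
        final double probability;

        Prediction(String token, double probability) {
            this.token = token;
            this.probability = probability;
        }
    }

    /** Keeps predictions with probability >= threshold (assumed rule), most probable first. */
    static List<Prediction> likely(List<Prediction> all, double threshold) {
        List<Prediction> kept = new ArrayList<>();
        for (Prediction p : all) {
            if (p.probability >= threshold) {
                kept.add(p);
            }
        }
        kept.sort(Comparator.comparingDouble((Prediction p) -> p.probability).reversed());
        return kept;
    }

    public static void main(String[] args) {
        List<Prediction> all = new ArrayList<>();
        all.add(new Prediction(" the", 0.25));
        all.add(new Prediction(".", 0.5));
        all.add(new Prediction(" zebra", 0.01));
        for (Prediction p : likely(all, 0.05)) {
            System.out.println("'" + p.token + "' " + p.probability);
        }
    }
}
```

A caller would pick one entry from the returned list and then report that choice back to the model, which is the role updateState(String, boolean) plays for TextModel.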
public List<PredictedToken> predictLikelyWithoutRepetitions()
public final void predictTokens(int number,
boolean reset,
StringBuilder appender)
The number of generated words may range from number to 4/3 number to attain a suitable ending (it might also be lower if the prediction model ran out of predictions).
number - the number of words to generate, >0
reset - whether to reset the model to the start of a new sentence; if false, the current model state won't be modified
appender - a StringBuilder to append the predictions to
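The two ideas behind predictTokens, sampling only from the most probable part of the distribution and allowing a bounded number of extra tokens to reach a suitable ending, can be sketched as follows. This is one possible reading, not the library's code. All names (`GenerationSketch`, `TokenSource`, `sampleFromTopMass`, `generate`, `isEndOfSentence`) are hypothetical. The sketch assumes that "the first 50% of the probability mass" means the shortest prefix of the descending-sorted predictions whose cumulative probability reaches 0.5, that generation stops at the first end-of-sentence token once `number` tokens exist, and that the shorter overload corresponds to `maxExtra = number / 3`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; names and details are assumptions, not the TextModel code. */
final class GenerationSketch {
    /** Hypothetical token supplier standing in for the model; returns null when exhausted. */
    interface TokenSource {
        String next();
    }

    /**
     * Samples an index from probabilities sorted in descending order, restricted to the
     * shortest prefix whose cumulative probability reaches massLimit (e.g. 0.5).
     */
    static int sampleFromTopMass(double[] sortedProbs, double massLimit, Random rnd) {
        if (sortedProbs.length == 0) {
            throw new IllegalArgumentException("no predictions to sample from");
        }
        int cutoff = 0;
        double mass = 0.0;
        do { // always keep at least the most probable entry
            mass += sortedProbs[cutoff++];
        } while (cutoff < sortedProbs.length && mass < massLimit);
        // Draw proportionally to probability within the retained prefix.
        double r = rnd.nextDouble() * mass;
        for (int i = 0; i < cutoff; i++) {
            r -= sortedProbs[i];
            if (r < 0) {
                return i;
            }
        }
        return cutoff - 1; // fallback for floating-point rounding
    }

    /** Assumed end-of-sentence test; the real criteria are not documented here. */
    static boolean isEndOfSentence(String token) {
        return token.equals(".") || token.equals("!") || token.equals("?");
    }

    /** Emits at most number + maxExtra tokens, stopping early at a suitable ending. */
    static List<String> generate(int number, int maxExtra, TokenSource source) {
        List<String> out = new ArrayList<>();
        while (out.size() < number + maxExtra) {
            String token = source.next();
            if (token == null) {
                break; // the source ran out of predictions
            }
            out.add(token);
            if (out.size() >= number && isEndOfSentence(token)) {
                break; // suitable ending reached
            }
        }
        return out;
    }

    /** Assumed mapping of the shorter overload: up to 4/3 number via maxExtra = number / 3. */
    static List<String> generate(int number, TokenSource source) {
        return generate(number, number / 3, source);
    }

    public static void main(String[] args) {
        Iterator<String> tokens =
                Arrays.asList("It", "rained", "all", "day", ".", "Then").iterator();
        // Asks for 3 tokens with up to 2 extra; prints [It, rained, all, day, .]
        System.out.println(generate(3, 2, () -> tokens.hasNext() ? tokens.next() : null));
        // Only indices 0 and 1 fall within the first 50% of the mass here.
        System.out.println(sampleFromTopMass(new double[] {0.4, 0.3, 0.2, 0.1}, 0.5, new Random()));
    }
}
```

In this reading, truncating the distribution trades variety for safety: rare continuations are never chosen, so the generator is less likely to reach a state with no further predictions.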
public final void predictTokens(int number,
int maxExtra,
boolean reset,
StringBuilder appender)
The number of generated words may range from number to number + maxExtra to attain a suitable ending (it might also be lower if the prediction model ran out of predictions).
number - the number of words to generate, >0
maxExtra - the maximum number of extra words that can be generated until a suitable ending is found
reset - whether to reset the model to the start of a new sentence; if false, the current model state won't be modified
appender - a StringBuilder to append the predictions to (for output/debugging)
protected boolean startsWithWS(String token)
Checks whether a token starts with normalized whitespace.
token - the token to check
true if the token starts with normalized whitespace
public void startTokenSequence()
public String toString()
toString in class Object
public final void train(CharSequence contents)
contents - the training data
public final void train(File file)
throws IOException
file - the source of the training data
IOException - if an I/O error occurs
public final void train(Reader reader)
throws IOException
reader - the source of the training data
IOException - if an I/O error occurs
public final void train(String[] filePaths,
boolean recurse)
throws IOException
Trains the prediction models from an array of file paths, using the InOutUtils.listFileContents(String[], boolean) method to resolve file contents.
filePaths - the array of file paths
recurse - whether to recursively add the children of folders and other files containing nested entries
IOException - if an I/O error occurs
public final void train(URL url)
throws IOException
url - the source of the training data
IOException - if an I/O error occurs
public final void updateState(String token,
boolean train)
token - the token to add
train - true iff the models should be trained, i.e. this request should be incorporated into the prediction models; otherwise only the internal state of the models will be changed so the next call for predictions will consider the updated state
public static void main(String[] args)
args - the command line arguments (ignored)