Moonfilter

Moonfilter is a general-purpose text classifier based on OSBF-Lua. It is usable both as a Lua module and as a stand-alone script that can easily be controlled from the command line.

Moonfilter has been written by Christian Siefkes and is licensed under the GPL.

Idea

The OSBF-Lua filter written by Fidelis Assis is an amazing spam filter and general-purpose text classifier. However, some of its "magic" is hidden in code specific to spam filtering, so it takes some experience and know-how to use it efficiently for tasks other than spam filtering.

Moonfilter is a wrapper around OSBF-Lua that aims at making this process easier. It picks up Fidelis Assis' "best practice" experience from spam filtering (regarding issues such as thick threshold/reinforcement training, recommended number of buckets, etc.) and offers a comfortable interface for training and classifying any other text classes in the same way.

The Moonfilter API includes:

The API also exports some public variables so the user can modify the training threshold, the number of buckets, or other important settings. The default settings should already be reasonable for normal usage, leaving such parameter fiddling for those who really want to do it.

In addition to the Lua API, which can be invoked flexibly from Lua scripts, there is also an easy-to-use script which reads lines from standard input, executes each of them as a command, and writes a line containing the result to standard output. This allows using the filter for standard purposes without having to write any code, and it also makes it easy to remote-control the filter from other languages such as Java.

API and Scripting Interface

The moonfilter Lua API is documented in a separate file.

The executable Lua script maps the Lua API to a command-response syntax in a straightforward way. Since the external scripting interface is merely a generic wrapper around the internal API, it is nearly identical to the Lua API. Each command corresponds to one exported Lua function which expects its parameters (if any) as Lua strings or numbers.

Each line of standard input is passed as a command to the wrapped moonfilter module. Each line consists of a command name followed by any number of parameters, separated by whitespace. Parameters containing whitespace or starting with a double quote must be enclosed in "double quotes" (double quotes and backslashes in such quoted strings must be backslash-escaped).
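A program that remote-controls the filter has to apply these quoting rules when it builds command lines. The following Python sketch shows one way to do so; it is not part of Moonfilter, the function names are made up for illustration, and the handling of empty parameters is an assumption that the text above does not cover.

```python
import re


def quote_param(value):
    """Quote one parameter if the rules described above require it."""
    text = str(value)
    # Assumption: an empty parameter is quoted too, since it would
    # otherwise vanish from the command line.
    needs_quotes = text == "" or text.startswith('"') or re.search(r"\s", text)
    if not needs_quotes:
        return text
    # Escape backslashes first, then double quotes.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_command(name, *params):
    """Join a command name and its parameters into one command line."""
    return " ".join([name] + [quote_param(p) for p in params])


def split_command(line):
    """Split a command line back into command name and parameters."""
    tokens = []
    i, n = 0, len(line)
    while i < n:
        if line[i].isspace():
            i += 1
        elif line[i] == '"':
            i += 1
            chars = []
            while i < n and line[i] != '"':
                if line[i] == "\\" and i + 1 < n:
                    i += 1  # take the escaped character literally
                chars.append(line[i])
                i += 1
            i += 1  # skip the closing quote
            tokens.append("".join(chars))
        else:
            start = i
            while i < n and not line[i].isspace():
                i += 1
            tokens.append(line[start:i])
    return tokens
```

For example, build_command("classify", "my file.txt") yields the line classify "my file.txt", and split_command turns that line back into the two original tokens.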

If the command is a function in the wrapped module, the function is executed with the specified parameters. Otherwise it should be a public variable of the wrapped module, which will be set to the value(s) specified as parameters; if there are no parameters, the variable is simply returned without changing it.
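The function-or-variable rule can be illustrated with a small Python sketch. This is only a model of the behaviour described above, not the script's actual Lua code; the wrapped object, its "threshold" variable and its "classes" function are hypothetical stand-ins.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the wrapped module: one public variable
# and one function (names and values are invented for this sketch).
wrapped = SimpleNamespace(
    threshold=20,
    classes=lambda *names: list(names),
)


def dispatch(module, name, params):
    """Call a function, or set/get a public variable, by command name."""
    target = getattr(module, name)  # raises AttributeError if unknown
    if callable(target):
        return target(*params)
    if params:
        # One parameter sets a single value, several set a list of values.
        value = params[0] if len(params) == 1 else list(params)
        setattr(module, name, value)
    return getattr(module, name)
```

With this model, dispatch(wrapped, "threshold", []) just returns the current value, while dispatch(wrapped, "threshold", [30]) changes it and returns the new one.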

If execution of the command is successful, the program will print the name of the command followed by "ok" and the return value of the function or the (new) value of the variable, if any (a "nil" value is omitted). Key/value pairs are separated by an equals sign: key=value; booleans are serialized as true or false. In case of an error, the program will print the command name followed by "failed" and an error message.
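As a rough illustration of this response format, here is a Python sketch that produces such lines. The exact spacing and the order of key/value pairs in the real output are assumptions, and the keys used in the example ("class", "reinforce") are invented.

```python
def serialize(value):
    """Render a result value as described above."""
    if value is None:
        return ""  # nil values are omitted
    if isinstance(value, bool):  # check bool before other types
        return "true" if value else "false"
    if isinstance(value, dict):
        # Assumption: pairs are separated from each other by spaces.
        return " ".join(f"{k}={serialize(v)}" for k, v in value.items())
    return str(value)


def format_response(command, value=None, error=None):
    """Build the 'ok' or 'failed' response line for one command."""
    if error is not None:
        return f"{command} failed {error}"
    parts = [command, "ok"]
    text = serialize(value)
    if text:
        parts.append(text)
    return " ".join(parts)
```

A command without a return value would then be answered with a line such as "create ok", and a failing one with a line such as "train failed" followed by the error message.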

The special command "exit" terminates the program. Alternatively, the program terminates when it reaches the end of standard input. In the latter case, it will terminate immediately without printing a response.

Commands can also be passed in on the command line. Each command line argument is considered a full command call, so if a command call contains parameters, it must be quoted so the operating system will treat it as a single argument. Commands from the command line are read and executed prior to executing commands from standard input.
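When the script is started from another program, the one-argument-per-command rule is easiest to satisfy by passing an argument list instead of a shell string. The Python sketch below builds such a list and prints the equivalent shell command line; the interpreter name "lua" and the script file name "moonrunner.lua" are assumptions, so adjust them to your installation.

```python
import shlex


def build_argv(commands, interpreter="lua", script="moonrunner.lua"):
    """Return an argument list with one argument per full command call.

    The interpreter and script names are placeholders for this sketch.
    """
    return [interpreter, script] + list(commands)


argv = build_argv(["classes nonspam spam", "classify message.txt"])

# Show how the same call would have to be quoted in a POSIX shell.
print(shlex.join(argv))
# prints: lua moonrunner.lua 'classes nonspam spam' 'classify message.txt'
```

A list like argv can be handed to subprocess.run directly, in which case no shell quoting is needed at all.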

Moonrunner Usage Examples

For all simple purposes, two lines should be sufficient (one for selecting the classes and the other for doing the job):

Create database files:

        classes nonspam spam
        create

Classify a file:

        classes nonspam spam
        classify FILENAME

Train a file (e.g. as spam):

        classes nonspam spam
        train spam FILENAME

If you want to classify and then train, it should be sufficient to give the filename once (train will use the same file):

        classes nonspam spam
        classify FILENAME
        train spam

If the text to classify isn't already stored in a file, having to create a temp file would be inefficient. Hence you can use "-" as a special filename that tells the filter to read from standard input until the end of input. This will only work as a parameter for the very last command, since it will consume all the rest of stdin.

For purposes where this won't do, you can use the "readuntil" command, which is equivalent to Perl's here-documents and allows you to write things like:

        classes nonspam spam
        readuntil <EOF>
        (...text to classify...)
        <EOF>
        classify
        readuntil <EOF>
        (...text to train as spam...)
        <EOF>
        train spam

and so on.
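If such command sequences are generated by another program, the terminator must not occur in the text itself. The following Python sketch builds a "readuntil" block and picks a different terminator when needed; it assumes that the terminator is matched against whole lines, as in the example above, and the helper name is made up.

```python
def readuntil_block(text, command, marker="<EOF>"):
    """Wrap text in a readuntil block followed by the given command."""
    lines = text.splitlines()
    n = 0
    # Assumption: the terminator is recognised when it is alone on a
    # line, so pick one that matches no line of the text.
    while marker in lines:
        n += 1
        marker = f"<EOF{n}>"
    return "\n".join([f"readuntil {marker}", *lines, marker, command])


script = "\n".join([
    "classes nonspam spam",
    readuntil_block("text to classify", "classify"),
    readuntil_block("text to train as spam", "train spam"),
])
print(script)
```

The printed script has the same shape as the example above and can be written to the standard input of the filter script.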

Download and Installation

Installation is straightforward:

The code depends on Fidelis Assis' OSBF-Lua, so you will need to install that one as well (in case you haven't already). And, obviously, you need a functioning Lua installation :-)

To Do

Meta Stuff

The website for this program has been created with the help of docext.lua, a small script I have written for extracting documentation from Lua files, and the txt2html text-to-HTML converter.


[Last generated: 2013-10-12]