NL Stego

NL Stego is a system for text generation and text-based steganography. It combines Markov models of several orders to generate random text resembling a given training text (or text corpus). It can also embed secret messages in the pseudo-randomly generated text.
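To illustrate the general idea of generating text from Markov models of several orders, here is a minimal sketch. It is not NL Stego's actual code: the class name MarkovSketch, its methods, and the back-off strategy (try the longest known context first, then fall back to shorter ones) are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of multi-order Markov text generation; not the NL Stego API. */
final class MarkovSketch {
    private final int maxOrder;
    // Maps a context (0..maxOrder preceding tokens, joined by single spaces)
    // to every token observed directly after it. Tokens must not contain spaces.
    private final Map<String, List<String>> followers = new HashMap<>();

    MarkovSketch(int maxOrder) {
        this.maxOrder = maxOrder;
    }

    /** Records, for each position, the followers of all contexts of order 0..maxOrder. */
    void train(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            for (int order = 0; order <= maxOrder && order <= i; order++) {
                String context = String.join(" ", tokens.subList(i - order, i));
                followers.computeIfAbsent(context, k -> new ArrayList<>()).add(tokens.get(i));
            }
        }
    }

    /** Generates up to {@code length} tokens, backing off to shorter contexts when needed. */
    List<String> generate(int length, Random random) {
        List<String> out = new ArrayList<>();
        while (out.size() < length) {
            List<String> candidates = null;
            // Assumption of this sketch: prefer the longest context that was seen in training.
            for (int order = Math.min(maxOrder, out.size()); order >= 0 && candidates == null; order--) {
                String context = String.join(" ", out.subList(out.size() - order, out.size()));
                candidates = followers.get(context);
            }
            if (candidates == null) {
                break; // the model has not been trained
            }
            // Duplicates in the list make frequent followers proportionally more likely.
            out.add(candidates.get(random.nextInt(candidates.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        MarkovSketch model = new MarkovSketch(2);
        model.train(Arrays.asList("the cat sat on the mat and the cat ran".split(" ")));
        System.out.println(String.join(" ", model.generate(12, new Random())));
    }
}
```

Higher orders reproduce longer stretches of the training text, while the lower orders keep generation going when a long context has no known continuation.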

Author: Christian Siefkes <christian@siefkes.net>
Website: https://www.siefkes.net/software/nlstego/
License: GPL

Download and Installation

For easy installation, we provide a self-contained JAR file. You only need Java (JDK 5.0 or higher) and the JAR file; there are no further dependencies.

Download the JAR file and store it in a place of your choice.

The JAR file can be executed directly by Java by calling

    java -Xmx400M -jar PATH-TO-JAR/nlstego.jar

The -Xmx option raises the maximum amount of memory (heap size) available to the system to 400 MB to avoid problems with large text models. On Windows, use '\' instead of '/' in the path.

For convenient invocation you can define an alias for the above command (at least on Unix).

Documentation

Usage notes
API documentation
Idea and algorithm (slides in OpenOffice format)

Building the System

Apache Ant is required to build the software. To build the documentation, you need to have txt2html in your path.

To rebuild the whole system including documentation, run ant in the current directory. Call ant -projecthelp for a list of other build options.

Bug Reports

Send bug reports to christian@siefkes.net. Before reporting, make sure you have tested against the latest published version; the bug may already have been fixed. If it has not been fixed in the latest version, state which code version you tested against. Also include enough information to reproduce the bug, along with full exception stack traces.

Dependencies

A further dependency is only required for building the system, not for running it:

Ant-Contrib 1.0b2

All dependencies are contained in the lib directory.

To Do / Possible Extensions

Integrate with Enigmail to allow direct stego encoding/decoding of messages (encode encrypted message, but use raw binary data instead of Base64, maybe integrate PGP Stealth to remove any standard headers if necessary). Maybe also add direct support for external GnuPG (gpg): -key=userID.
Refine finalization model: Determine unsuitable pre-end/final tokens (to avoid tokens such as 'Dr.'): store each token occurring in front of an EOS marker in a cache, counting how often it occurs generally and before EOS and excluding tokens where the EOS/all ratio is higher than 50%. Probably still append next token(s) if they appear without intervening whitespace and are not alphanumeric (e.g. trailing " ). Also avoid starting the text with such a token (/"/ from /."/, sometimes leads to texts starting with /" "/).
Add code to define shortcuts alias[list], e.g. –rfclist expects a list of RFC numbers to build the model (end prints the ), while –rfc expects all RFC numbers encoded in a single number (prime-multiplied: Ax2, Bx3, Cx5, Dx7 etc.). Other suitable sources: date range of articles from a Telepolis subcategory, range of Slashdot news (or Heise news, but probably too many topics). Additional shortcuts can be defined in the user-specific config file, syntax: shortcut.aliasname = expansion where \1 is replaced with the argument (predefined: shortcut.rfc = http://www.ietf.org/rfc/rfc\1.txt).
If the ContentType contains 'xml', try to process the text as XML, discarding any markup. Do the same for 'html', after first invoking JTidy to clean up.
When training from HTML, write code to find the main text block of a web page: recursively collect the text of all elements containing textual or mixed content and use the longest one (cf. BTE). Specify lists of tags that should always be ignored: style, script...
Add support for embedding stego text in HTML/XML/other formats: any markup can be added to the stego version which is ignored when decoding (only plain text is used).
As an alternative to a pre-agreed training corpus, the model-building text could be added in front of the stego text – then there must be a way to signal where the stego text begins, maybe some conventional stego within the model-building text (e.g. using whitespace stego as done in snow (free for non-commercial use, Java version), or using wbStego4open (GPL, Delphi, Win+Linux)).
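The shortcut syntax proposed in the list above (shortcut.aliasname = expansion, with \1 replaced by the argument) could be handled roughly as follows. This is only a sketch of one possible approach: the class ShortcutSketch and its methods are hypothetical, and nothing here reflects existing NL Stego code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed shortcut expansion; not the NL Stego API. */
final class ShortcutSketch {
    private static final String PREFIX = "shortcut.";
    private static final String PLACEHOLDER = "\\1"; // the two characters: backslash, '1'

    private final Map<String, String> expansions = new HashMap<>();

    ShortcutSketch() {
        // The predefined shortcut named in the To Do item.
        expansions.put("rfc", "http://www.ietf.org/rfc/rfc\\1.txt");
    }

    /** Parses one config line of the form "shortcut.aliasname = expansion". */
    void defineFromLine(String line) {
        int eq = line.indexOf('=');
        String key = eq < 0 ? "" : line.substring(0, eq).trim();
        if (!key.startsWith(PREFIX) || key.length() == PREFIX.length()) {
            throw new IllegalArgumentException("Not a shortcut definition: " + line);
        }
        expansions.put(key.substring(PREFIX.length()), line.substring(eq + 1).trim());
    }

    /** Replaces every \1 in the alias's expansion with the given argument. */
    String expand(String alias, String argument) {
        String template = expansions.get(alias);
        if (template == null) {
            throw new IllegalArgumentException("Unknown shortcut: " + alias);
        }
        // String.replace performs literal replacement, so no regex escaping is needed.
        return template.replace(PLACEHOLDER, argument);
    }

    public static void main(String[] args) {
        ShortcutSketch shortcuts = new ShortcutSketch();
        System.out.println(shortcuts.expand("rfc", "2822"));
    }
}
```

One point to keep in mind: if the user-specific config file were loaded with java.util.Properties (an assumption; the format of that file is not stated here), a single backslash before the 1 would be silently dropped on loading, so the placeholder would have to be written with a doubled backslash in the file.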
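The idea of finding the main text block when training from HTML (collect the text of every element with textual or mixed content and keep the longest) might look like the sketch below. It assumes the input is already well-formed XML, for example after a cleanup pass such as the JTidy step mentioned above, and it uses only the JDK's built-in DOM parser. The class MainTextSketch and its method names are hypothetical.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical sketch of the "longest text block" heuristic; not the NL Stego API. */
final class MainTextSketch {
    // Tags that are always ignored (the two named in the To Do item; extend as needed).
    private static final Set<String> IGNORED = new HashSet<>(Arrays.asList("style", "script"));

    /** Returns the longest text of any element with textual or mixed content. */
    static String longestTextBlock(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            return longest(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    private static boolean isIgnored(Node node) {
        return IGNORED.contains(node.getNodeName().toLowerCase(Locale.ROOT));
    }

    // Recursively compares this element's own text with the best text of its descendants.
    private static String longest(Element element) {
        if (isIgnored(element)) {
            return "";
        }
        String best = hasDirectText(element) ? collectText(element).trim() : "";
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                String candidate = longest((Element) child);
                if (candidate.length() > best.length()) {
                    best = candidate;
                }
            }
        }
        return best;
    }

    // True if the element has at least one non-whitespace text node as a direct child.
    private static boolean hasDirectText(Element element) {
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE && !child.getNodeValue().trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // Concatenates all text below the node, discarding markup and ignored tags.
    private static String collectText(Node node) {
        StringBuilder text = new StringBuilder();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE || child.getNodeType() == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            } else if (child instanceof Element && !isIgnored(child)) {
                text.append(collectText(child));
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Menu</p><div>The <b>main</b> article text.</div></body></html>";
        System.out.println(longestTextBlock(page));
    }
}
```

Because an element with mixed content contributes the text of all its descendants, a container that mixes direct text with nested markup will usually win over its children, which matches the "use the longest one" rule but may need tuning on real pages.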

[Last generated: 2024-09-21]