NL Stego is a system for text generation and text-based steganography. It
combines Markov models of several orders to generate random text resembling
a given training text (or text corpus). It can also embed secret messages
into pseudo-randomly generated text.
Christian Siefkes <email@example.com>
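To illustrate the idea of combining Markov models of several orders, here
is a minimal Java sketch. It is not taken from the NL Stego sources: the
class name MarkovSketch, its methods, and the back-off strategy (try the
longest known context first, then fall back to shorter ones) are
assumptions made for this example. How NL Stego actually combines its
models may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration; not part of NL Stego.
class MarkovSketch {
    private final int maxOrder;
    // Maps a context (the previous 1..maxOrder tokens, joined by spaces)
    // to all tokens that followed this context in the training text.
    private final Map<String, List<String>> followers =
            new HashMap<String, List<String>>();

    MarkovSketch(int maxOrder) {
        this.maxOrder = maxOrder;
    }

    // For every order from 1 to maxOrder, record which token follows
    // each context.
    void train(List<String> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            for (int order = 1; order <= maxOrder && order <= i; order++) {
                String context = join(tokens.subList(i - order, i));
                List<String> next = followers.get(context);
                if (next == null) {
                    next = new ArrayList<String>();
                    followers.put(context, next);
                }
                next.add(tokens.get(i));
            }
        }
    }

    // Generate at most 'length' tokens, backing off from the longest
    // known context to shorter ones (assumed combination strategy).
    List<String> generate(String start, int length, Random random) {
        List<String> out = new ArrayList<String>();
        out.add(start);
        while (out.size() < length) {
            List<String> candidates = null;
            for (int order = Math.min(maxOrder, out.size());
                    order >= 1 && candidates == null; order--) {
                int from = out.size() - order;
                candidates = followers.get(join(out.subList(from, out.size())));
            }
            if (candidates == null) {
                break; // no known continuation for any context length
            }
            out.add(candidates.get(random.nextInt(candidates.size())));
        }
        return out;
    }

    private static String join(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        MarkovSketch model = new MarkovSketch(2);
        String training = "the cat sat on the mat and the cat ran";
        model.train(Arrays.asList(training.split(" ")));
        System.out.println(join(model.generate("the", 8, new Random(42))));
    }
}
```

Running the main method prints a short pseudo-random token sequence built
from the toy training sentence. A real model would be trained on a much
larger text corpus.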
Download and Installation
For easy installation, we provide a self-contained JAR file; you only need
Java (JDK 5.0 or higher) and the JAR file. There are no further
dependencies.
Download the JAR file and store it in a place of your choice.
The JAR file can be executed directly by Java by calling
java -Xmx400M -jar PATH-TO-JAR/nlstego.jar
The -Xmx option increases the amount of memory available to the system to
400 MB to avoid problems with large text models. On Windows you need to use
'\' instead of '/'.
For convenient invocation you can define an alias for the above command (at
least on Unix).
Building the System
Apache Ant is required to build the software. To build the documentation,
you need to have txt2html in your path.
To rebuild the whole system including documentation, run ant in the current
directory. Call ant -projecthelp for a list of other build targets.
Bug reports go to firstname.lastname@example.org. Before you post, make sure
you have tested against the latest published version; your bug may already
have been fixed. If it has not been fixed in the latest version, state
which code version you tested against when posting. Also include enough
information to reproduce the bug and the full exception trace.
A further dependency is only required for building the system, not for
All dependencies are contained in the
To Do / Possible Extensions
- Integrate with Enigmail to allow direct stego encoding/decoding of
messages (encode encrypted message, but use raw binary data instead of
Base64, maybe integrate PGP Stealth to remove any standard headers if
necessary). Maybe also add direct support for external GnuPG (gpg):
-key=userID.
- Refine finalization model: Determine unsuitable pre-end/final tokens (to
avoid tokens such as 'Dr.'): store each token occurring in front of an
EOS marker in a cache, counting how often it occurs generally and before
EOS and excluding tokens where the EOS/all ratio is higher than 50%.
Probably still append next token(s) if they appear without intervening
whitespace and are not alphanumeric (e.g. trailing " ). Also avoid
starting the text with such a token (/"/ from /."/, sometimes leads to
texts starting with /" "/).
- Add code to define shortcuts alias[list], e.g. –rfclist expects a list
of RFC numbers to build the model (and prints the ), while –rfc expects
all RFC numbers encoded in a single number (prime-multiplied: Ax2, Bx3,
Cx5, Dx7 etc.). Other suitable sources: date range of articles from
Telepolis subcategory, range of Slashdot news (or Heise news, but
probably too many topics). Additional shortcuts can be defined in the
user-specific config file, syntax: shortcut.aliasname = expansion where
\1 is replaced with the argument (predefined: shortcut.rfc =
- If ContentType contains 'xml', try to process the text as XML,
discarding any markup. Same with 'html', after first invoking JTidy to
- When training from HTML, write code to find the main text block in a
website: recursively collect text of all elements containing textual or
mixed content and use the longest one (cf. BTE). Specify lists of tags
that should always be ignored: style, script...
- Add support for embedding stego text in HTML/XML/other formats: any
markup can be added to the stego version which is ignored when decoding
(only plain text is used).
- As an alternative to a pre-agreed training corpus, the model-building
text could be added in front of the stego text – then there must be a way
to signal where the stego text begins, maybe some conventional stego
within the model-building text (e.g. using whitespace stego as done in
snow (free for non-commercial use, Java version), or using wbStego4open
(GPL,
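The finalization refinement from the list above could be sketched as
follows. This is only one possible reading of the idea, not existing NL
Stego code: the class FinalTokenStats, the "<EOS>" marker string, and the
decision to treat tokens with an EOS/all ratio of at most 50% as unsuitable
endings (so that only tokens above 50% are accepted as final tokens) are
assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of NL Stego.
class FinalTokenStats {
    static final String EOS = "<EOS>"; // assumed end-of-sentence marker

    private final Map<String, Integer> total = new HashMap<String, Integer>();
    private final Map<String, Integer> beforeEos = new HashMap<String, Integer>();

    // Count every token, and separately count how often it is directly
    // followed by the EOS marker.
    void observe(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (EOS.equals(token)) {
                continue; // the marker itself is not a candidate token
            }
            increment(total, token);
            if (i + 1 < tokens.size() && EOS.equals(tokens.get(i + 1))) {
                increment(beforeEos, token);
            }
        }
    }

    // Share of a token's occurrences that were directly before EOS;
    // 0.0 for tokens that were never observed.
    double eosRatio(String token) {
        int all = count(total, token);
        if (all == 0) {
            return 0.0;
        }
        return count(beforeEos, token) / (double) all;
    }

    // Assumed reading: only tokens that precede EOS in more than 50% of
    // their occurrences are accepted as final tokens.
    boolean isSuitableFinalToken(String token) {
        return eosRatio(token) > 0.5;
    }

    private static int count(Map<String, Integer> counts, String key) {
        Integer value = counts.get(key);
        return value == null ? 0 : value.intValue();
    }

    private static void increment(Map<String, Integer> counts, String key) {
        counts.put(key, Integer.valueOf(count(counts, key) + 1));
    }

    public static void main(String[] args) {
        FinalTokenStats stats = new FinalTokenStats();
        stats.observe(Arrays.asList("Dr.", "Smith", "left", ".", EOS,
                "Dr.", EOS, "He", "came", ".", EOS));
        for (String token : new String[] {"Dr.", "."}) {
            System.out.println(token + " suitable as final token: "
                    + stats.isSuitableFinalToken(token));
        }
    }
}
```

The 50% threshold is hard-coded here to mirror the note above; in practice
it would probably be a tunable parameter.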
[Last generated: 2014-07-16]