FauxCrypt - A Method of Text Obfuscation

Warnings have been raised about the steady diminution of privacy. More and more personal information, such as that contained electronic mail, is moving to cloud computing servers where it might be machine-searched and indexed. FauxCrypt is an algorit…

Authors: Devlin M. Gualtieri

Cryptography is a process for converting plaintext to an unreadable cyphertext that can be recovered to plaintext through use of an alphanumeric key phrase and a cryptographic algorithm. The purpose of encryption is to make the plaintext unreadable by both people and machines in the absence of the key. Often, a writer's purpose is not to make the plaintext completely unreadable by people, but to modify it in a way to circumvent certain machine processes. This is not encryption, since a key is not used. It's obfuscation, since the plaintext can still be understood by a person, albeit with some difficulty. Examples of such obfuscation abound in the realm of email messaging, where authors of "spam" messages attempt to circumvent filters through creative misspelling. Their intent is to fool the machine but still get the message across to the person. Likewise, substitution of certain words for others (e.g., "pr0n") and slight modification of URLs ("hxxp" in place of "http") are used as a means to post messages that break local custom or rules on message boards. Many people substitute phrases such as "*at*" in place of the "@" symbol in Internet postings to circumvent the robotic harvesting of email addresses. These modified email addresses are still decipherable by people without a special tool. The use of ROT13 encoding, a Caesar cypher in which letters of the 26 character English alphabet are substituted by letters 13 places distant (with modulus rotation to map "N" to "A", etc.), is often used to publish solutions to puzzles. A second application of ROT13 to a ROT13 encoded message recovers the plaintext without the need for a key. The intent of ROT13 is to make the plaintext easy to recover. Still, an operation is required for the recovery, akin to an encryption for which the key is commonly known, although a skilled person could train himself to read ROT13 encoded text on sight. FauxCrypt has been developed as a way to modify plaintext in a way that makes it difficult for a machine to index, but simple for a person to read. FauxCrypt does not require a deciphering operation to make the text readable. The inspiration for this algorithm was an anonymous posting on the Internet in September 2003, purportedly from Cambridge University [1], that reads as follows: "Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe." This statement is still unattributed, and its origin is classified by some as an "urban legend" [2]. Whatever its source, it makes its point admirably in such a small number of words. FauxCrypt goes beyond the stated algorithm of just keeping the first and last letters of a word fixed. It uses certain processes to scramble the remainder of the letters in the word. The scrambling is done in a way that maintains reasonable readability. Some previous programs obfuscate files in different ways. Balakrishnan, et al., have published a US patent application that modifies data files to eliminate private information by substitution [3]. This is quite unlike FauxCrypt, which does not remove content, but rather makes it more difficult, but not impossible, to interpret. The program, "strob," by Vedanta Barooah, converts text to a form not readily readable by machine through substitution of the HTML character codes for characters [4]. This makes text files considerably larger, but there is the advantage that the text is rendered as plaintext by web browsers. The FauxCrypt algorithm is summarized, as follows: 1) All characters are converted to lower case. 2) Words from a selected dictionary are replaced by equivalents, and these words are marked as not subsequently modifiable. 3) First and last letters of words are unchanged and these characters are marked as not subsequently modifiable. 4) Vowels in digraphs are swapped (oe -> eo) and these characters are marked as not subsequently modifiable. 5) The order of vowels and consonants in a word are kept the same, but some vowels are shifted forwards or backwards. 6) A single pair of consonants are swapped in words larger than N. Readability is presumed to be enhanced when a "riser" and a "dangler" are swapped; e.g., dg becomes gd). 7) For extreme obfuscation, a consonant or vowel can be moved quite a bit up or down. As an example of FauxCrypt operation, consider the first two paragraphs of Charles Dickens' "A Christmas Carol" [6], and their FauxCrypt output. "MARLEY was dead: to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to. Old Marley was as dead as a door-nail. Mind! I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail. I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade. But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country's done for. You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail." "marley was daed: to begin with. tehre is no duobt wahtever abuot taht. the regsiter of his burial was signed by the cyerglmna, the clerk, the udnertaker, and the cihef muorner. scrooge signed it: and scrooge's name was good upon 'cahgne, for anyhtnig he chose to put his hnad to. old marley was as daed as a doro-nail. mnid! i don't maen to say taht i know, of my own knowgdele, waht tehre is paritcularly daed abuot a doro-nail. i mihgt have been inclnide, mfsely, to regard a cfofni-nail as the daedset piece of irnomnogery in the trade. but the wsidom of our ancseotrs is in the simile; and my unahllowed hnads sahll not dsiturb it, or the cuotnry's dnoe fro. you will teherfroe permit me to retaep, empahitcalyl, taht marley was as daed as a doro-nail." The Levenshtein distance [7] is a convenient measure of the difference between two strings. It's the minimum number of operations needed to transform one string into the other by deletion, insertion, or substitution of single characters. Since FauxCrypt does not insert characters into words, or delete characters from words, the Levenshtein distance will count substitutions, only. Another measure of string difference is a modified form of the Levenshtein distance developed by Damerau [8], known as the Damerau-Levenshtein distance. The Damerau-Levenshtein distance considers transposition of characters as a single operation, whereas such an operation would be doubly counted as substitutions by Levenshtein. We analyzed the Levenshtein distance on a word-by-word basis between the entire Dickens' text of "A Christmas Carol" and its FauxCrypt translation. Letter case was ignored in this analysis. The 28,559 word texts had a total Levenshtein distance of 22,330, a Levenshtein distance of 0.782 per word, or roughly three-quarters of a character change per word. Because of the nature of the FauxCrypt algorithm, shorter words are generally unmodified, so the greatest contribution to the Levenshtein distance is born by longer words. Table III lists some of the longer Levenshtein distances between word pairs after FauxCrypt obfuscation. It can be seen that some words identified with a longer Levenshtein distance are less readable than others, but probably readable in their context. FauxCrypt is a method for obfuscating text to make it difficult for machines to search and index, but still possible for people to read. Source code for FauxCrypt in the C programming language is listed in the appendix. FauxCrypt is free software, licensed under version 3 of the GNU General Public License, as published by the Free Software Foundation.

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment