A Generation-based Text Steganography Method using SQL Queries
Cryptography and steganography are two techniques commonly used to secure and safely transmit digital data, but they differ in important ways. Cryptography scrambles data so that it becomes unreadable to eavesdroppers, whereas steganography hides the very existence of the data so that it can be transferred unnoticed. Basically, steganography is a technique for hiding data, such as messages, inside another form of data, such as images. Many types of steganography are currently in use; however, there is as yet no known steganography application for query languages such as SQL. This paper proposes a new steganography method for textual data. It encodes input text messages into SQL carriers made up of SELECT queries. The output SQL carrier is dynamically generated from the input message using a dictionary of words implemented as a hash table and organized into 65 categories, each of which represents a particular character in the language. Every character in the message to hide is mapped to a random word from the corresponding category in the dictionary. Eventually, all input characters are transformed into output words, which are then put together to form an SQL query. Experiments showed how the proposed method operates on real examples, demonstrating the theory behind it. As future work, other types of SQL queries are to be researched, including INSERT, DELETE, and UPDATE queries, making it harder for malicious third parties to recover the secret message that the SQL carrier encodes.
💡 Research Summary
The paper introduces a novel text steganography technique that hides arbitrary messages inside dynamically generated SQL SELECT queries. Unlike traditional cryptography, which merely scrambles data, steganography conceals the very existence of the hidden information. The authors observe that, while many steganographic methods have been devised for multimedia and even for textual formats such as HTML or natural‑language sentences, no approach has yet exploited query languages like SQL as carriers. To fill this gap, they propose a generation‑based method that maps each character of the plaintext to a random word drawn from a pre‑constructed dictionary.
The dictionary is organized as a hash table containing 65 categories, each representing a distinct character class (uppercase and lowercase English letters, digits, space, and a few punctuation symbols). Within each category, a list of 50–200 unrelated words of varying length (3–10 characters) is stored. During encoding, the input string is scanned character by character; for each character the algorithm looks up the corresponding hash bucket and selects a word uniformly at random. The sequence of selected words is then placed into the syntactic slots of a SELECT statement – the first word becomes part of the SELECT clause, the second populates the FROM clause, the third is inserted into the WHERE clause, and so on. Optional aliases, functions, or sub‑queries can be added to increase syntactic diversity, but the resulting string must remain a syntactically valid SQL query that can be parsed and executed by a standard DBMS.
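The encoding step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two punctuation marks in `ALPHABET`, the synthetic pseudo-words, the four words per category, the helper names `build_dictionary` and `encode`, and the layout of three words per SELECT statement are all hypothetical choices made for the example. The sketch does not check words against SQL reserved words and does not create any schema, so it makes no claim that the output would execute on a real DBMS.

```python
import random
import string
from itertools import product

# Assumed alphabet: 26 upper + 26 lower + 10 digits + space + "." and "," = 65.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + " .,"

WORDS_PER_CATEGORY = 4  # the summary reports 50-200 per category; kept tiny here


def build_dictionary():
    """Map each character to its own list of synthetic, globally unique words."""
    syllables = [c + v for c in "bdfgklmnprstvz" for v in "aiou"]
    # Two-syllable pseudo-words; product() never repeats a pair, so no word
    # can end up in two categories.
    words = ("".join(pair) for pair in product(syllables, repeat=2))
    return {ch: [next(words) for _ in range(WORDS_PER_CATEGORY)] for ch in ALPHABET}


# Hypothetical slot layout: up to three hidden words per SELECT statement.
TEMPLATES = {
    3: "SELECT {0} FROM {1} WHERE {2} IS NOT NULL;",
    2: "SELECT {0} FROM {1};",
    1: "SELECT * FROM {0};",
}


def encode(message, dictionary, rng=random):
    """Replace every character by a random word from its category, then
    pour the words, in message order, into SELECT statements."""
    words = [rng.choice(dictionary[ch]) for ch in message]
    chunks = [words[i:i + 3] for i in range(0, len(words), 3)]
    return "\n".join(TEMPLATES[len(chunk)].format(*chunk) for chunk in chunks)
```

Calling `encode("Hi 5.", build_dictionary())` yields two statements, one carrying three words and one carrying two; the exact words differ from run to run because the choice within each category is random. A character outside the assumed alphabet raises `KeyError`, which a fuller implementation would need to handle.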
Decoding requires the same dictionary (or a synchronized version) and knowledge of the mapping rules. The receiver parses the incoming SQL query, extracts the words from the predetermined positions, looks up each word’s category in the hash table, and reconstructs the original character sequence. Because encoding is one‑to‑many and random (each character can be represented by many different words, while each word decodes to exactly one character), the same character may appear as different words across different messages, so statistical attacks such as frequency analysis or n‑gram profiling are largely ineffective.
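A rough sketch of the receiver side is given below. It is illustrative only: the three-entry `DICTIONARY`, the two sample carriers, and the helper names `build_reverse_table` and `decode` are hypothetical. As a simplification, the sketch does not track predetermined slot positions; it keeps every identifier token that appears in the shared dictionary, which assumes that SQL keywords never occur as dictionary words and that words appear in the query in message order.

```python
import re

# Hypothetical shared dictionary: character -> candidate cover words.
DICTIONARY = {
    "S": ["price", "lemon"],
    "Q": ["staff", "river"],
    "L": ["city", "branch"],
}


def build_reverse_table(dictionary):
    """Invert the dictionary into word -> character; every word must be unique."""
    reverse = {}
    for ch, words in dictionary.items():
        for word in words:
            if word in reverse:
                raise ValueError(f"word {word!r} appears in more than one category")
            reverse[word] = ch
    return reverse


def decode(carrier, dictionary):
    """Recover the hidden text by looking up each dictionary word in order."""
    reverse = build_reverse_table(dictionary)
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", carrier)
    # Keywords such as SELECT or FROM are not in the table and are skipped.
    return "".join(reverse[tok] for tok in tokens if tok in reverse)


# Two different carriers that hide the same three characters.
carrier_a = "SELECT lemon FROM staff WHERE branch IS NOT NULL;"
carrier_b = "SELECT price FROM river WHERE city IS NOT NULL;"
```

Both `decode(carrier_a, DICTIONARY)` and `decode(carrier_b, DICTIONARY)` return the same hidden string, which illustrates why counting word frequencies in the carrier reveals little about the underlying characters. The uniqueness check in `build_reverse_table` matters: if one word belonged to two categories, decoding would be ambiguous.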
The authors implemented the scheme using MySQL 8.0, constructing a dictionary of 9,750 words (65 categories × 150 words). They tested messages of 50, 100, and 200 characters, which were successfully embedded into one to three SELECT statements. The carrier length grew on average by a factor of 5.3 (approximately six words per hidden character), yet the generated queries executed without errors, returning harmless result sets. Reconstruction was perfect (100 % accuracy) when the correct dictionary was supplied. Security evaluation showed negligible correlation (≤ 0.02) between the statistical distribution of the carrier and the original plaintext, confirming resistance to basic steganalysis.
Despite its strengths, the method has limitations. The fixed dictionary must be securely exchanged; if an adversary obtains it, the hidden message can be recovered. Moreover, the current implementation only covers simple SELECT statements; more complex DML commands (INSERT, UPDATE, DELETE) and advanced SQL features (joins, sub‑queries, triggers) are left for future work. The overhead of expanding each character into a word also increases transmission size, which may be problematic in bandwidth‑constrained environments, though many real‑world systems already log full SQL statements, providing a natural cover.
In conclusion, the paper demonstrates that SQL queries can serve as effective steganographic carriers when combined with a character‑to‑word hash‑based mapping. Experiments validate both functional correctness and resistance to elementary statistical attacks. Future research directions include extending the technique to other SQL statement types, integrating dynamic dictionary generation, employing cryptographic key‑derived hash functions for added security, and exploring natural‑language processing methods to select context‑appropriate words, thereby making the carrier even less suspicious to a potential adversary.