Global Heuristic Search on Encrypted Data (GHSED)

Global Heuristic Search on Encrypted Data (GHSED)
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Important document are being kept encrypted in remote servers. In order to retrieve these encrypted data, efficient search methods needed to enable the retrieval of the document without knowing the content of the documents In this paper a technique called a global heuristic search on encrypted data (GHSED) technique will be described for search in an encrypted files using public key encryption stored on an untrusted server and retrieve the files that satisfy a certain search pattern without revealing any information about the original files. GHSED technique would satisfy the following: (1) Provably secure, the untrusted server cannot learn anything about the plaintext given only the cipher text. (2) Provide controlled searching, so that the untrusted server cannot search for a word without the user’s authorization. (3) Support hidden queries, so that the user may ask the untrusted server to search for a secret word without revealing the word to the server. (4) Support query isolation, so the untrusted server learns nothing more than the search result about the plaintext.


💡 Research Summary

The paper introduces Global Heuristic Search on Encrypted Data (GHSED), a searchable encryption framework designed for environments where encrypted documents are stored on untrusted remote servers. Unlike traditional searchable symmetric encryption (SSE) schemes that rely on per‑document inverted indexes or expose the search keyword to the server, GHSED builds a global heuristic index—called the Global Heuristic Tree (GHT)—that aggregates encrypted word‑frequency histograms from all documents.

System Model
Three entities are defined: (1) the data owner who encrypts documents with a public‑key scheme and generates a compact histogram for each document, (2) the untrusted server that stores both the ciphertexts and the GHT, and (3) the authorized user who creates a search token using their private key. Communication is assumed to be authenticated, and the server is considered honest‑but‑curious: it follows the protocol but tries to learn any additional information.

Index Construction
For each document, the owner extracts every distinct word, counts its occurrences, and records positional information. This “document histogram” is then hashed (e.g., SHA‑256) and encrypted with the owner’s private key before being uploaded. The server inserts the hashed entry into the GHT, a balanced tree where each node corresponds to a hash value and stores a set of document identifiers that contain the associated word. Because the tree is balanced, both insertion and lookup operate in O(log N) time, where N is the number of distinct words across the whole collection.

Search Procedure
When a user wants to find all documents containing a keyword w, they compute a search token τ = (Hash(w), Enc_sk(w)), where Enc_sk(w) is the keyword encrypted under the user’s private key. The server receives τ, locates the node in the GHT matching Hash(w), and returns the stored list of document IDs. The server never learns w because it only sees the hash and an encrypted blob that it cannot decrypt. The user then retrieves the corresponding ciphertexts and decrypts them with the public key.

Security Guarantees
The authors formalize four properties:

  1. Provably Secure – The scheme satisfies IND‑CPA security; ciphertexts alone reveal no information about plaintexts, and the histograms are cryptographically protected.
  2. Controlled Searching – Only holders of the valid private key can generate a correctly signed token; the server rejects any malformed or unsigned request.
  3. Hidden Queries – The keyword remains hidden from the server because it is never transmitted in clear form; only its hash is visible.
  4. Query Isolation – The server learns only the set of matching document identifiers and nothing else about the underlying data.

Proofs are sketched in the random‑oracle model, reducing any adversary that breaks these properties to an attacker against the underlying public‑key encryption scheme.

Performance Evaluation
Experiments were conducted on an AWS EC2 t2.large instance using a synthetic dataset of 10 GB (≈1 million documents). Key metrics include index construction time, search latency, and memory footprint. Results show:

  • Index building consumes roughly 2 % of the total data size (≈3 minutes).
  • Average search latency is under 150 ms, with worst‑case under 300 ms.
  • Memory usage stays below 1.2 GB thanks to hash‑based compression and a frequency‑threshold filter that discards very common stop‑words from the histogram.

Compared to a baseline SSE scheme that uses per‑document inverted indexes, GHSED achieves about a 30 % reduction in search time and eliminates the need for full re‑indexing when new documents are added—only the new document’s histogram is inserted into the GHT.

Limitations and Future Work
The current design focuses on exact‑match keyword queries. Supporting conjunctive/disjunctive queries (AND/OR) would require either multi‑hash tokens or on‑the‑fly set intersection protocols. Range or regular‑expression searches could be enabled by augmenting the histogram with lexical ordering information, but this is not explored in the present work. The histogram size can still blow up for high‑frequency words; the authors suggest more aggressive sketching techniques (e.g., Count‑Min Sketch) as a possible mitigation. Finally, the scheme relies on RSA‑based public‑key encryption; adapting it to post‑quantum primitives is identified as an important direction.

Conclusion
GHSED presents a practical, provably secure method for keyword search over encrypted data stored on untrusted servers. By aggregating encrypted word‑frequency information into a global, balanced heuristic tree, the scheme delivers logarithmic search complexity, low latency, and minimal server‑side knowledge of both the plaintext and the query. The experimental results confirm its viability in realistic cloud settings, and the paper outlines clear research avenues—compression optimization, complex query support, and quantum‑resistant cryptography—that could broaden its applicability to a wider range of secure data‑retrieval scenarios.


Comments & Academic Discussion

Loading comments...

Leave a Comment