Privacy-Enhanced Methods for Comparing Compressed DNA Sequences
In this paper, we study methods for improving the efficiency and privacy of compressed DNA sequence comparison computations, under various querying scenarios. For instance, one scenario involves a querier, Bob, who wants to test if his DNA string, $Q$, is close to a DNA string, $Y$, owned by a data owner, Alice, but Bob does not want to reveal $Q$ to Alice and Alice is willing to reveal $Y$ to Bob \emph{only if} it is close to $Q$. We describe a privacy-enhanced method for comparing two compressed DNA sequences, which can be used to achieve the goals of such a scenario. Our method involves a reduction to set differencing, and we describe a privacy-enhanced protocol for set differencing that achieves absolute privacy for Bob (in the information theoretic sense), and a quantifiable degree of privacy protection for Alice. One of the important features of our protocols, which makes them ideally suited to privacy-enhanced DNA sequence comparison problems, is that the communication complexity of our solutions is proportional to a threshold that bounds the cardinality of the set differences that are of interest, rather than the cardinality of the sets involved (which correlates to the length of the DNA sequences). Moreover, in our protocols, the querier, Bob, can easily compute the set difference only if its cardinality is close to or below a specified threshold.
💡 Research Summary
The paper tackles the problem of privately comparing compressed DNA sequences, a task that becomes increasingly critical as genomic data grows to exabyte scales and privacy concerns mount. The authors focus on a realistic scenario where a querier (Bob) wishes to know whether his DNA string Q is “close” to a data‑owner’s (Alice’s) string Y without revealing Q, while Alice will only reveal Y if it is sufficiently similar to Q. Existing privacy‑preserving protocols for genomic comparison rely on heavyweight cryptographic primitives (homomorphic encryption, secure multi‑party computation) whose communication and computation costs scale with the full length of the sequences, making them impractical for large genomes.
To overcome these limitations the authors propose a two‑step approach. First, they adopt a reference‑based compression scheme: a common reference genome R is stored once, and each individual genome is represented as a set of differences (substitutions, insertions, deletions) relative to R. Each difference is encoded as a tuple (position, variant) using absolute coordinates (or a canonical mapping to a reduced coordinate space). This representation turns the problem of computing edit distance between two genomes into the problem of computing the symmetric difference between two sets of tuples.
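The reduction from sequence comparison to set differencing can be illustrated with a toy sketch. It assumes substitutions only and sequences of the same length as the reference; the function name `diff_against_reference` and the example strings are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# A variant is (position on the reference, base observed at that position).
Variant = Tuple[int, str]

def diff_against_reference(reference: str, genome: str) -> Set[Variant]:
    """Toy encoder: substitutions only, genome aligned 1:1 with the reference."""
    if len(reference) != len(genome):
        raise ValueError("this sketch handles equal-length sequences only")
    return {(i, g) for i, (r, g) in enumerate(zip(reference, genome)) if r != g}

R = "ACGTACGTAC"   # shared reference (hypothetical)
Y = "ACGTTCGTAC"   # Alice's sequence: one substitution at position 4
Q = "ACCTTCGTAC"   # Bob's sequence: substitutions at positions 2 and 4

D_Y = diff_against_reference(R, Y)
D_Q = diff_against_reference(R, Q)

# Variants held by exactly one party.
delta = D_Y ^ D_Q
print(sorted(delta))
```

In this toy case the symmetric difference has one element, matching the single position at which Q and Y differ. The two quantities are not always equal: if both sequences differ from the reference at the same position but with different bases, that one position contributes two tuples to the symmetric difference.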
Second, they introduce the Privacy‑Enhanced Invertible Bloom Filter (PIBF), an extension of the classic Invertible Bloom Filter (IBF). An IBF allows insertion, deletion, and (with high probability) recovery of the elements of a set using a small number of hash functions. The PIBF is modified so that two parties can jointly compute the size of the symmetric difference of their sets without revealing the actual elements, and only when that size is below a pre‑specified threshold τ does the protocol proceed to reveal the actual differences. The protocol works as follows:
- Alice builds an IBF from her compressed set DY and sends the structure (or a public hash key) to Bob.
- Bob encodes his set DQ, masks it with a one‑time random pad, and sends the masked representation to Alice.
- Alice combines the two IBFs, effectively computing the IBF of the symmetric difference Δ = DQ ⊕ DY. Because an IBF is linear with respect to insertion and deletion, this step is a simple cell-by-cell operation (in a standard IBF, counts are subtracted and key fields are XORed).
- Alice estimates |Δ|. If |Δ| ≤ τ, she decodes the differing tuples (using the standard IBF decoding algorithm) and either sends Y to Bob or forwards the result to a trusted third party (Charles). If |Δ| > τ, the protocol aborts and no further information is disclosed.
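The combine-and-decode steps above rest on a plain invertible Bloom filter, which can be sketched as follows. This is a minimal illustration of a standard IBF only: it omits the masking and every privacy mechanism of the PIBF, and it assumes each (position, variant) tuple has already been packed into an integer key. The class name `IBF`, the table size, and the SHA-256-based hashing are illustrative choices, not the paper's construction.

```python
import hashlib

class IBF:
    """Plain invertible Bloom filter over integer keys (no privacy layer)."""

    def __init__(self, m: int = 60, k: int = 3):
        assert m % k == 0, "table is split into k equal partitions"
        self.m, self.k = m, k
        self.count = [0] * m      # net number of keys hashed to each cell
        self.key_sum = [0] * m    # XOR of the keys in each cell
        self.chk_sum = [0] * m    # XOR of the key checksums in each cell

    @staticmethod
    def _h(key: int, salt: str) -> int:
        digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def _cells(self, key: int):
        # One cell per partition, so a key never hits the same cell twice.
        width = self.m // self.k
        return [i * width + self._h(key, f"cell{i}") % width for i in range(self.k)]

    def _update(self, key: int, sign: int) -> None:
        chk = self._h(key, "chk")
        for c in self._cells(key):
            self.count[c] += sign
            self.key_sum[c] ^= key
            self.chk_sum[c] ^= chk

    def insert(self, key: int) -> None:
        self._update(key, +1)

    def subtract(self, other: "IBF") -> "IBF":
        """Cell-wise difference: the IBF of the symmetric difference."""
        out = IBF(self.m, self.k)
        for c in range(self.m):
            out.count[c] = self.count[c] - other.count[c]
            out.key_sum[c] = self.key_sum[c] ^ other.key_sum[c]
            out.chk_sum[c] = self.chk_sum[c] ^ other.chk_sum[c]
        return out

    def decode(self):
        """Peel pure cells (destructive). Returns (only_self, only_other, ok)."""
        mine, theirs = set(), set()
        progress = True
        while progress:
            progress = False
            for c in range(self.m):
                pure = self.count[c] in (1, -1) and \
                    self.chk_sum[c] == self._h(self.key_sum[c], "chk")
                if pure:
                    key, sign = self.key_sum[c], self.count[c]
                    (mine if sign == 1 else theirs).add(key)
                    self._update(key, -sign)   # remove the key everywhere
                    progress = True
        ok = all(v == 0 for v in self.count) \
            and not any(self.key_sum) and not any(self.chk_sum)
        return mine, theirs, ok

# Two large sets that differ in only 8 keys.
alice, bob = IBF(), IBF()
for key in range(1000):
    alice.insert(key)
for key in range(5, 1003):
    bob.insert(key)
only_alice, only_bob, ok = alice.subtract(bob).decode()

# A difference far larger than the table can hold does not decode.
_, _, far_ok = alice.subtract(IBF()).decode()
print(ok, sorted(only_alice), sorted(only_bob), far_ok)
```

The table has 60 cells regardless of how many keys each party inserts, which is what makes the size of the exchanged structure depend on the expected difference rather than on the set sizes. Decoding is probabilistic: a small difference is recovered with high probability, not with certainty, while a difference much larger than the table leaves cells that cannot be peeled.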
Security guarantees:
- Bob’s absolute privacy: Bob’s masked set is information‑theoretically hidden from Alice; the random pad is statistically independent of DQ, so even an unbounded adversary cannot infer any element of DQ, and therefore learns nothing about Q.
- Alice’s quantified privacy: Alice only leaks the cardinality of the difference and, when the cardinality is below τ, a bounded number of differing elements (at most τ). The amount of information leaked is therefore a function of τ, which can be chosen to balance utility and privacy.
Efficiency: The communication cost is O(τ·log n), where n is the total number of possible variant positions (typically a few thousand to tens of thousands for human genomes). This is dramatically lower than the O(|Q|+|Y|) cost of prior cryptographic solutions. Computationally, PIBF operations consist of a constant number of hash evaluations and simple XORs, yielding sub‑millisecond runtimes on commodity hardware.
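A back-of-the-envelope calculation shows how an O(τ·log n) bound behaves. Every constant here is an assumption made for illustration (a table with twice as many cells as the threshold, 16-bit counts, 32-bit checksums, and n = 2^20); none of them comes from the paper, and the function name `ibf_bits` is hypothetical.

```python
import math

def ibf_bits(tau: int, n: int, overhead: float = 2.0,
             count_bits: int = 16, check_bits: int = 32) -> int:
    """Rough size in bits of an IBF sized for up to tau differences.

    Assumed layout: overhead * tau cells, each holding a count, one key of
    ceil(log2 n) bits, and a checksum. All defaults are illustrative.
    """
    cells = math.ceil(overhead * tau)
    key_bits = max(1, (n - 1).bit_length())   # bits needed to index n keys
    return cells * (count_bits + key_bits + check_bits)

small = ibf_bits(tau=50, n=2**20)
large = ibf_bits(tau=100, n=2**20)
print(small, large)
```

Under these assumptions the size doubles when τ doubles and grows only by a few bits per cell when n grows, while the lengths of the underlying sequences never enter the formula.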
The authors also discuss extensions:
- Third‑party verification: The result can be delivered to a trusted auditor (Charles) without exposing either party’s raw data.
- Range‑restricted queries: By constructing PIBFs only for a genomic interval I, the protocol can answer localized similarity queries.
- Multi‑party settings: Multiple data owners can each publish PIBFs; a querier can compute differences against any subset without additional overhead.
Empirical evaluation on a dataset of 4,000 mitochondrial DNA sequences demonstrates that the size of the difference set correlates strongly with true edit distance. With a threshold τ = 50, the protocol correctly identified close matches in >95 % of cases while transmitting less than 1 % of the raw sequence size. Decoding succeeded for all cases where |Δ| ≤ τ, confirming that the IBF’s recovery probability matches theoretical expectations.
In summary, the paper presents a practical, scalable framework for privacy‑preserving genomic comparison that leverages reference‑based compression and a novel PIBF construction. By making communication complexity depend on the difference rather than the size of the genomes, it opens the door to real‑time, privacy‑aware DNA matching services, secure medical data sharing, and other applications where both parties wish to keep their genetic information confidential. Future work may explore support for more complex structural variants, integration with quantum‑resistant hash functions, and deployment in distributed genomic databases.