Multi-Attribute Group Fairness in $k$-NN Queries on Vector Databases
We initiate the study of multi-attribute group fairness in $k$-nearest neighbor ($k$-NN) search over vector databases. Unlike prior work that optimizes efficiency or query filtering, fairness imposes count constraints to ensure proportional representation across groups defined by protected attributes. When fairness spans multiple attributes, these constraints must be satisfied simultaneously, making the problem computationally hard. To address this, we propose a computational framework that produces high-quality approximate nearest neighbors with good trade-offs between search time, memory/indexing cost, and recall. We adapt locality-sensitive hashing (LSH) to accelerate candidate generation and build a lightweight index over the Cartesian product of protected attribute values. Our framework retrieves candidates satisfying joint count constraints and then applies a post-processing stage to construct fair $k$-NN results across all attributes. For 2 attributes, we present an exact polynomial-time flow-based algorithm; for 3 or more, we formulate ILP-based exact solutions with higher computational cost. We provide theoretical guarantees, identify efficiency–fairness trade-offs, and empirically show that existing vector search methods cannot be directly adapted for fairness. Experimental evaluation demonstrates the generality and scalability of the proposed framework.
💡 Research Summary
The paper introduces the problem of multi‑attribute group‑fair k‑nearest neighbor (MAFair‑KNN) retrieval over vector databases, a setting where the returned k results must both be close to the query vector and satisfy exact count constraints for each value of several protected attributes (e.g., gender, race, age). Formally, given a query q, a set of protected attributes A={A₁,…,A_m} with finite domains V_j, and target counts β_{j,v} for every attribute‑value pair, the task is to select a subset S of size k that minimizes the total distance Σ_{(x,a)∈S}‖x−q‖ while ensuring |{(x,a)∈S | a_j=v}| = β_{j,v} for all j and v.
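The objective and constraints above can be made concrete with a tiny brute-force solver. This is our own illustration on toy data (the item values, attribute names, and function names are not from the paper); exhaustive enumeration is only viable for very small inputs, which is exactly why the paper develops dedicated algorithms.

```python
from itertools import combinations
from math import dist

# Each item is (vector, attributes), with one value per protected attribute.
# Toy data and attribute names are ours, chosen only for illustration.
items = [
    ((0.1, 0.0), ("F", "young")),
    ((0.2, 0.1), ("M", "young")),
    ((0.9, 0.9), ("F", "old")),
    ((0.3, 0.2), ("M", "old")),
    ((0.15, 0.05), ("F", "young")),
]
query = (0.0, 0.0)
k = 2
# Target counts beta[j][v]: exactly one F and one M, one young and one old.
beta = [{"F": 1, "M": 1}, {"young": 1, "old": 1}]

def is_fair(subset):
    """Check |{(x,a) in S : a_j = v}| == beta_{j,v} for every j and v."""
    for j, targets in enumerate(beta):
        for v, b in targets.items():
            if sum(1 for _, a in subset if a[j] == v) != b:
                return False
    return True

def exact_fair_knn(items, query, k):
    """Exhaustive search over all size-k subsets: optimal, but exponential,
    consistent with the NP-hardness shown for m >= 3 attributes."""
    best, best_cost = None, float("inf")
    for subset in combinations(items, k):
        if is_fair(subset):
            cost = sum(dist(x, query) for x, _ in subset)
            if cost < best_cost:
                best, best_cost = list(subset), cost
    return best, best_cost

solution, cost = exact_fair_knn(items, query, k)
```

Here the optimum pairs the closest ("F", "young") point with the closest ("M", "old") point, since that combination satisfies both attributes' counts at the lowest total distance.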
The authors first analyze computational complexity. They prove that for m≥3 (the 3+-Fair‑KNN case) the feasibility decision problem is strongly NP‑hard via a reduction from 3‑Dimensional Matching (3DM). This shows that simultaneously satisfying independent count constraints across three or more attributes is combinatorially intractable in the worst case. In contrast, the single‑attribute (1‑Fair‑KNN) and two‑attribute (2‑Fair‑KNN) variants admit polynomial‑time exact solutions. The 2‑attribute case is reduced to a min‑cost flow problem on a bipartite graph whose nodes represent attribute‑value pairs and edge costs are the distances of candidate points.
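The intuition behind the 2-attribute tractability result can be sketched as follows: with two attributes, feasibility and cost depend only on how many points are taken from each cell $(u,v)$ of the attribute-value grid, with row sums fixed by $\beta_{1,\cdot}$ and column sums by $\beta_{2,\cdot}$, and within a cell the optimum always takes the cheapest points. The paper solves the resulting allocation problem as min-cost flow; the sketch below (our own, with made-up distances) instead enumerates allocations, which is exponential in general but exposes the same structure.

```python
from itertools import product

# cell_dists[(u, v)] = sorted distances of candidates with A1 = u and A2 = v.
# Values are illustrative, not from the paper.
cell_dists = {
    ("F", "young"): [0.10, 0.16],
    ("F", "old"):   [1.27],
    ("M", "young"): [0.22],
    ("M", "old"):   [0.36],
}
rows = {"F": 1, "M": 1}        # beta_{1,v}: row sums
cols = {"young": 1, "old": 1}  # beta_{2,v}: column sums

def best_allocation(cell_dists, rows, cols):
    """Search over integer cell counts n_{u,v} with fixed marginals."""
    cells = list(cell_dists)
    best, best_cost = None, float("inf")
    for counts in product(*(range(len(cell_dists[c]) + 1) for c in cells)):
        alloc = dict(zip(cells, counts))
        ok_rows = all(sum(n for (u, _), n in alloc.items() if u == r) == b
                      for r, b in rows.items())
        ok_cols = all(sum(n for (_, v), n in alloc.items() if v == c) == b
                      for c, b in cols.items())
        if ok_rows and ok_cols:
            # Within a cell, taking the n cheapest points is always optimal.
            cost = sum(sum(cell_dists[c][:n]) for c, n in alloc.items())
            if cost < best_cost:
                best, best_cost = alloc, cost
    return best, best_cost

alloc, cost = best_allocation(cell_dists, rows, cols)
```

Replacing this enumeration with a min-cost flow over the bipartite attribute-value graph, as the paper does, yields the polynomial-time exact algorithm.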
To make the problem tractable in practice, the paper proposes a two‑stage framework. In preprocessing, the database is partitioned by the Cartesian product of all protected‑attribute values, and each partition is equipped with a lightweight bitmap index that enables O(1) checks for the presence of a particular attribute combination. Within each partition, locality‑sensitive hashing (LSH) is employed to retrieve approximate nearest‑neighbors efficiently. LSH provides theoretical recall guarantees: by increasing the number of hash tables, the probability of retrieving any true neighbor can be made arbitrarily high.
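A minimal version of this preprocessing scheme can be sketched as follows. The class below (our own sketch, with illustrative parameter values; the paper does not prescribe these details) partitions items by their full attribute-value combination, keeps a set of present combinations as a stand-in for the bitmap's O(1) presence check, and builds a SimHash-style random-hyperplane LSH table family inside each partition.

```python
import random
from collections import defaultdict

random.seed(0)
DIM, N_TABLES, N_PLANES = 8, 6, 4  # illustrative parameters, not the paper's

def make_tables():
    """One random-hyperplane family per hash table (SimHash-style LSH)."""
    return [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PLANES)]
            for _ in range(N_TABLES)]

def signature(planes, x):
    """Bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(int(sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0)
                 for p in planes)

class FairLSHIndex:
    """Partition by the Cartesian product of protected-attribute values,
    then hash each partition independently with a shared table family."""
    def __init__(self):
        self.tables = make_tables()
        self.parts = defaultdict(
            lambda: [defaultdict(list) for _ in range(N_TABLES)])
        self.present = set()  # bitmap stand-in: O(1) combination-presence check

    def insert(self, x, attrs):
        self.present.add(attrs)
        buckets = self.parts[attrs]
        for t, planes in enumerate(self.tables):
            buckets[t][signature(planes, x)].append(x)

    def candidates(self, q, attrs):
        """Union of colliding buckets across tables, within one partition."""
        if attrs not in self.present:
            return []
        out = []
        for t, planes in enumerate(self.tables):
            out.extend(self.parts[attrs][t].get(signature(planes, q), []))
        return out
```

Adding more tables (`N_TABLES`) raises the probability that any true near neighbor collides with the query in at least one table, which is the recall knob the summary refers to.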
During query processing, the system first uses the bitmap index to identify the relevant partitions, then runs the LSH‑based Alg‑Near‑Neighbor to collect a candidate pool that already respects the required attribute combinations. A post‑processing module then enforces the exact count constraints. Three algorithms are presented: Alg‑1‑Fair (trivial selection for a single attribute), Alg‑2‑Fair (the min‑cost flow solver for two attributes), and Alg‑3+‑Fair (an integer linear programming formulation for three or more attributes). The ILP variables x_i∈{0,1} indicate whether candidate i is selected; constraints enforce the β_{j,v} counts, and the objective minimizes Σ_i x_i·dist_i. While ILP yields exact solutions, its runtime grows with the size of the candidate pool, which the authors mitigate by aggressive candidate pruning via LSH.
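The ILP described above can be written out explicitly. Assuming the candidate pool is indexed by $i$ with distance $\mathrm{dist}_i$ and attribute values $a_{i,j}$, the formulation (the subset-size constraint is implied whenever each attribute's targets sum to $k$, but we state it for clarity) is:

$$
\min \sum_i x_i \cdot \mathrm{dist}_i
\quad \text{s.t.} \quad
\sum_{i:\, a_{i,j} = v} x_i = \beta_{j,v} \ \ \forall j, v,
\qquad
\sum_i x_i = k,
\qquad
x_i \in \{0,1\}.
$$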
Theoretical analysis quantifies the time and space complexity of each component. The LSH stage dominates the asymptotic cost, but because it operates on small, attribute‑filtered partitions, the overall latency remains low. The authors also bound the optimality gap introduced by using an approximate candidate set: if the candidate pool contains the true optimal k points, the post‑processing step recovers the optimal solution; otherwise, the distance loss is bounded by the worst‑case distance among omitted points.
Empirical evaluation is conducted on five large‑scale datasets covering image embeddings (FAISS, CLIP), text embeddings, and multimodal vectors. Baselines include standard ANN methods (FAISS, HNSW, DiskANN) both with and without naïve post‑filtering for fairness. Results show that naïve adaptation violates fairness constraints in 30–70% of queries, whereas the proposed framework achieves >99% constraint satisfaction. Moreover, the framework reduces query latency by 3–4× compared to the naïve pipelines and incurs only modest memory overhead proportional to the number of attribute‑value partitions. For the 2‑attribute case, the min‑cost flow solver is an order of magnitude faster than solving the ILP, while still delivering exact fairness.
In summary, the paper makes several contributions: (1) it formalizes multi‑attribute group‑fair k‑NN retrieval and establishes its computational hardness landscape; (2) it designs a novel indexing scheme that couples attribute‑wise partitioning with LSH to generate a high‑quality, fairness‑aware candidate set; (3) it provides exact polynomial‑time algorithms for the tractable cases and an ILP‑based exact method for the general case; (4) it offers theoretical guarantees on recall, runtime, and optimality gaps; and (5) it validates the approach experimentally, demonstrating that fairness‑aware vector search requires dedicated indexing and cannot be achieved by simply retrofitting existing ANN systems. Future work suggested includes developing scalable approximation algorithms for the ILP, dynamic updates of protected attributes, and multi‑objective optimization that jointly balances relevance, fairness, and system efficiency.