Maximizing Diversity in (near-)Median String Selection


Given a set of strings over a specified alphabet, identifying a median or consensus string that minimizes the total distance to all input strings is a fundamental data aggregation problem. When the Hamming distance is considered as the underlying metric, this problem has extensive applications, ranging from bioinformatics to pattern recognition. However, modern applications often require the generation of multiple (near-)optimal yet diverse median strings to enhance flexibility and robustness in decision-making. In this study, we address this need by focusing on two prominent diversity measures: sum dispersion and min dispersion. We first introduce an exact algorithm for the diameter variant of the problem, which identifies pairs of near-optimal medians that are maximally diverse. Subsequently, we propose a $(1-ε)$-approximation algorithm (for any $ε>0$) for sum dispersion, as well as a bi-criteria approximation algorithm for the more challenging min dispersion case, allowing the generation of multiple (more than two) diverse near-optimal Hamming medians. Our approach primarily leverages structural insights into the Hamming median space and also draws on techniques from error-correcting code construction to establish these results.


💡 Research Summary

The paper studies the problem of generating multiple (near‑optimal) median strings under the Hamming distance while simultaneously maximizing diversity among the selected strings. Two standard diversity measures are considered: sum‑dispersion (the sum of all pairwise Hamming distances) and min‑dispersion (the minimum pairwise distance). The authors first exploit a structural property of Hamming medians: at each coordinate, a median must use one of the characters that occur most frequently at that coordinate among the input strings. Consequently, any two distinct exact medians can differ only at coordinates where the maximum frequency is attained by more than one character (a tie). Using this observation they design a linear‑time algorithm that, given a dataset X⊂Γ^d, finds two exact medians whose Hamming distance (the “diameter”) is maximized (Theorem 13).
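The coordinate‑wise structure can be made concrete with a short sketch. The function below (a hypothetical illustration, not code from the paper) builds an exact Hamming median by taking a most frequent character at each coordinate, and also reports the tie coordinates where distinct exact medians may differ:

```python
from collections import Counter

def hamming_median(strings):
    """Coordinate-wise exact Hamming median: at each position, pick a
    most frequent character (ties broken alphabetically). Also return
    the tie positions, since two exact medians can differ only there."""
    d = len(strings[0])
    median, tie_positions = [], []
    for i in range(d):
        counts = Counter(s[i] for s in strings)
        top = max(counts.values())
        winners = sorted(c for c, f in counts.items() if f == top)
        median.append(winners[0])
        if len(winners) > 1:
            tie_positions.append(i)
    return "".join(median), tie_positions

# Example: position 2 is tied between 'c' and 'd', so "abc" and "abd"
# are both exact medians of this dataset.
med, ties = hamming_median(["abc", "abd", "abd", "acc"])
```

Swapping the tied characters at the reported positions enumerates exactly the other optimal medians, which is what makes the diameter variant tractable in linear time.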

Because exact medians are often unique, the paper moves to (1+ε)‑approximate medians, i.e., strings whose total distance to the input is at most (1+ε) times the optimum. By relaxing the allowed total distance to (1+ε) times the optimum and applying the same coordinate‑wise reasoning, they obtain an algorithm that returns two (1+ε)‑approximate medians with maximum possible diameter in O((1+ε)·n·d + d log d) time (Theorem 1).
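The relaxation can be pictured as a budget argument. The following is a hedged sketch (a greedy heuristic, not the paper's optimal algorithm): each of the two output strings may spend up to ε·OPT extra cost on coordinates where it deviates to a second‑most‑frequent character, and deviations are placed on disjoint coordinates so every one paid for increases the pairwise distance.

```python
from collections import Counter

def diverse_approx_medians(strings, eps):
    """Greedy sketch: build two (1+eps)-approximate medians that are far
    apart. Tie coordinates give diversity for free; elsewhere, the two
    strings alternately 'flip' the cheapest coordinates to a runner-up
    character, each charging the extra cost against its own eps*OPT
    budget, so both stay within (1+eps) times the optimal total cost."""
    n, d = len(strings), len(strings[0])
    a, b, flips, opt = [], [], [], 0
    for i in range(d):
        ranked = Counter(s[i] for s in strings).most_common()
        best_char, best_f = ranked[0]
        opt += n - best_f            # optimal cost contributed here
        a.append(best_char)
        b.append(best_char)
        if len(ranked) > 1:
            alt_char, alt_f = ranked[1]
            if alt_f == best_f:      # tie: diverge at zero extra cost
                b[i] = alt_char
            else:                    # candidate paid flip
                flips.append((best_f - alt_f, i, alt_char))
    flips.sort()                     # cheapest deviations first
    budgets, targets = [eps * opt, eps * opt], [a, b]
    for j, (cost, i, alt) in enumerate(flips):
        t = j % 2                    # alternate: keep flips disjoint
        if budgets[t] >= cost:
            targets[t][i] = alt
            budgets[t] -= cost
    return "".join(a), "".join(b)
```

With eps = 0 only tie coordinates contribute, recovering the exact‑median diameter; larger budgets buy extra disagreements at the cheapest coordinates first.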

Next, the authors address the problem of selecting k (1+ε)‑approximate medians that maximize sum‑dispersion. Since the objective decomposes over coordinates, the optimal assignment at each coordinate is to distribute the admissible characters (those an approximate median may use at that coordinate) as evenly as possible among the k strings. By constructing the k strings accordingly, they obtain a PTAS: for any ε,δ>0 the algorithm returns a set S of k (1+ε)‑approximate medians with sum‑dispersion at least (1−δ)·v*, where v* is the optimum for k (1+ε)‑approximate medians (Theorem 2).
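The per‑coordinate claim is easy to verify numerically. In the toy check below (an illustration, with hypothetical helper names), a coordinate's contribution to sum‑dispersion is the number of string pairs disagreeing there, and a balanced round‑robin assignment of the admissible characters beats any skewed one:

```python
from itertools import combinations

def coord_sum_dispersion(assignment):
    """One coordinate's contribution to sum-dispersion: the number of
    pairs of strings that disagree on this coordinate."""
    return sum(x != y for x, y in combinations(assignment, 2))

def even_assignment(chars, k):
    """Round-robin (as-even-as-possible) distribution of the admissible
    characters over the k strings; this maximizes pairwise disagreement."""
    return [chars[j % len(chars)] for j in range(k)]

# k = 4 strings, 2 admissible characters at this coordinate:
balanced = even_assignment(["a", "b"], 4)   # ['a', 'b', 'a', 'b']
skewed = ["a", "a", "a", "b"]
```

Here the balanced split yields 4 disagreeing pairs versus 3 for the skewed one; summing this choice independently over all coordinates is what makes the objective decompose.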

The min‑dispersion objective is substantially harder. For constant k the problem can be solved exactly by dynamic programming, but for general k it is NP‑hard even in the Hamming metric. The paper distinguishes two regimes based on the optimal diameter D* of the instance. If D* ≥ Ω((1/δ²)·log k), a randomized construction based on error‑correcting codes yields, with high probability, a set of k (1+ε)‑approximate medians whose minimum pairwise distance is at least (1−δ)·t*, where t* is the optimal min‑dispersion (Theorem 3, first case). If D* ≤ O((1/δ²)·log k), the algorithm falls back to the classic 1/2‑approximation for max‑min dispersion.
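The large‑diameter regime can be sketched with a toy randomized construction (hypothetical names, capturing only the idea of the coding‑theoretic step, not the paper's construction): at every "free" coordinate where an approximate median may choose among several tied characters, each of the k strings picks independently and uniformly; Chernoff bounds then concentrate every pairwise distance near its expectation once the number of free coordinates is Ω((1/δ²)·log k).

```python
import random
from itertools import combinations

def random_code_medians(base, free_positions, tied_chars, k, seed=0):
    """Randomized-code sketch: start each of the k strings from a fixed
    median 'base' and, at every free (tied) coordinate i, substitute an
    independent uniform choice from tied_chars[i]. With enough free
    coordinates relative to log(k), all pairs differ on roughly a
    constant fraction of them with high probability."""
    rng = random.Random(seed)
    out = []
    for _ in range(k):
        s = list(base)
        for i in free_positions:
            s[i] = rng.choice(tied_chars[i])
        out.append("".join(s))
    return out

def min_dispersion(strings):
    """Minimum pairwise Hamming distance of a set of strings."""
    return min(sum(x != y for x, y in zip(s, t))
               for s, t in combinations(strings, 2))
```

For example, with 64 free binary coordinates and k = 8 strings, each pair differs on about 32 coordinates in expectation, and the minimum over all 28 pairs stays close to that. When too few coordinates are free (the small‑D* regime), this concentration fails, which is why the paper falls back to the classic 1/2‑approximation there.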

Finally, the authors provide a bi‑criteria approximation for the min‑dispersion problem over (1+ε)‑approximate medians. When D* is small (constant), they output k (1+ε)‑approximate medians achieving minimum pairwise distance at least t*/2. When D* is large, they output k (1+2ε)‑approximate medians achieving at least (1/2−δ)·t* with high probability (Theorem 4). Moreover, if t* ≥ Ω((1/δ)·√d·log k), an even stronger guarantee is possible: (1+ε+δ)‑approximate medians achieving a (1/2−δ)‑approximation to min‑dispersion.

The technical contribution lies in combining combinatorial insights about the Hamming median space with constructions from coding theory. The coordinate‑wise analysis yields exact or optimal solutions for diameter and sum‑dispersion, while randomized coding techniques enable high‑probability guarantees for the more challenging min‑dispersion objective. All algorithms run in polynomial time with respect to the input size, the number of desired medians k, and the approximation parameters ε and δ.

Overall, the paper delivers three main results: (1) an optimal algorithm for the diameter of two (exact or (1+ε)‑approximate) medians, (2) a PTAS for maximizing sum‑dispersion among k (1+ε)‑approximate medians, and (3) a bi‑criteria approximation scheme for maximizing min‑dispersion among k (1+ε)‑approximate medians. These results have immediate relevance to applications such as motif discovery in bioinformatics, prototype selection in pattern recognition, and diverse query generation in machine learning, where providing a set of high‑quality yet diverse solutions is essential.

