An Approximation Algorithm for l∞-Fitting Robinson Structures to Distances
In this paper, we present a factor 16 approximation algorithm for the following NP-hard distance fitting problem: given a finite set X and a distance d on X, find a Robinsonian distance d_R on X minimizing the l_∞-error ‖d − d_R‖_∞ = max_{x,y∈X} |d(x, y) − d_R(x, y)|. A distance d_R on a finite set X is Robinsonian if its matrix can be symmetrically permuted so that its elements do not decrease when moving away from the main diagonal along any row or column. Robinsonian distances generalize ultrametrics and line distances, and occur in seriation problems and in classification.
💡 Research Summary
The paper addresses the problem of fitting a given finite metric d on a set X with a Robinsonian distance d_R so that the maximum absolute deviation (the l_∞‑error) ‖d − d_R‖_∞ is minimized. A Robinsonian distance is defined by the property that after a suitable symmetric permutation of its matrix, the entries never decrease when moving away from the main diagonal along any row or column. This structure generalizes ultrametrics, linear (one‑dimensional) distances, and appears in seriation and classification tasks.
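To make the definition concrete, here is a minimal checker (our own illustrative helper, not code from the paper) that tests whether a symmetric dissimilarity matrix, in its given ordering, is already in Robinson form; deciding whether some symmetric permutation achieves this form is a separate problem.

```python
import numpy as np

def is_robinson(D: np.ndarray) -> bool:
    """Return True if the symmetric matrix D is in Robinson form:
    entries never decrease when moving away from the main diagonal
    along any row (by symmetry, this also covers the columns)."""
    n = D.shape[0]
    for i in range(n):
        # Moving right from the diagonal, entries must be non-decreasing...
        for j in range(i + 1, n - 1):
            if D[i, j] > D[i, j + 1]:
                return False
        # ...and moving left from the diagonal they must be non-decreasing too.
        for j in range(i - 1, 0, -1):
            if D[i, j] > D[i, j - 1]:
                return False
    return True
```

For instance, the distance matrix of four equally spaced points on a line passes the check, while perturbing one off-diagonal entry upward breaks it.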
The authors first establish that the l_∞-Robinson fitting problem is NP-hard, extending known hardness results for related matrix-reordering problems. Existing approaches either provide logarithmic-factor approximations or rely on heuristics without provable guarantees. The contribution of this work is a deterministic polynomial-time algorithm that achieves a constant factor-16 approximation: the produced Robinsonian distance d̂ satisfies ‖d − d̂‖_∞ ≤ 16·OPT, where OPT denotes the optimal l_∞-error.
The algorithm proceeds in three conceptual phases.
- Pre-processing (Δ-truncation). For every triple (x, y, z) the algorithm computes the amount by which the triangle inequality is violated. The maximum violation Δ is used as a global tolerance: all pairwise distances whose deviation from a consistent ultrametric is at most Δ are left unchanged, while larger violations are "clipped" to enforce a provisional ultrametric structure. This step produces a modified distance matrix that is close to an ultrametric and therefore easier to embed into a Robinsonian form.
- Graph-based decomposition and ordering. The truncated matrix is turned into a threshold graph G_τ by connecting vertices whose distance is at most τ, where τ is chosen as a function of Δ (specifically τ = 4Δ). The connected components of G_τ are recursively partitioned. Within each component the algorithm builds a minimum spanning tree, which yields a canonical ultrametric on that component. The ordering of components is then determined by solving a linear-programming relaxation that minimizes the total number of "inversions" (pairs placed in the wrong order relative to the original distances). The LP solution is rounded to an integral permutation using a deterministic rounding scheme that guarantees each rounding step adds at most a factor-4 error.
- Rounding and reconstruction. After obtaining a global permutation π, the algorithm constructs the final Robinsonian distance d̂ by enforcing the monotonicity constraints along rows and columns of the permuted matrix. The monotonicity enforcement is performed by a simple sweep that replaces each entry with the maximum of all entries closer to the diagonal, which does not increase the l_∞-error.
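The first two phases can be sketched roughly as follows. The function names and the use of a union-find for the threshold-graph components are our own illustration of the ideas in the summary, not the authors' notation or code:

```python
import numpy as np

def max_triangle_violation(D: np.ndarray) -> float:
    """Largest amount by which the triangle inequality
    d(x, z) <= d(x, y) + d(y, z) is violated over all triples;
    equals 0 when D is a true metric."""
    n = D.shape[0]
    delta = 0.0
    for x in range(n):
        for y in range(n):
            for z in range(n):
                delta = max(delta, D[x, z] - D[x, y] - D[y, z])
    return delta

def threshold_components(D: np.ndarray, tau: float):
    """Connected components of the threshold graph that joins every
    pair at distance <= tau, computed with a simple union-find."""
    n = D.shape[0]
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if D[i, j] <= tau:
                parent[find(i)] = find(j)
    comps = {}
    for v in range(n):
        comps.setdefault(find(v), []).append(v)
    return list(comps.values())
```

In the algorithm, `threshold_components` would be called with τ = 4Δ, where Δ is the value returned by `max_triangle_violation`.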
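The final monotonicity sweep admits a compact sketch. Under our reading of the summary, "the maximum of all entries closer to the diagonal" means the maximum over all sub-intervals nested inside [i, j], which is exactly the smallest Robinson matrix dominating the permuted input; the dynamic program below is our own illustrative implementation, not the paper's:

```python
import numpy as np

def robinson_envelope(D: np.ndarray, order) -> np.ndarray:
    """Smallest Robinson matrix dominating D (permuted by `order`)
    entrywise, assuming non-negative dissimilarities:
        E[i, j] = max of the permuted matrix over all intervals
                  [k, l] with i <= k <= l <= j,
    computed by the DP  E[i, j] = max(P[i, j], E[i+1, j], E[i, j-1])."""
    P = D[np.ix_(order, order)]
    n = P.shape[0]
    E = np.zeros_like(P, dtype=float)
    for span in range(1, n):          # intervals of increasing length
        for i in range(n - span):
            j = i + span
            E[i, j] = max(P[i, j], E[i + 1, j], E[i, j - 1])
            E[j, i] = E[i, j]         # keep the matrix symmetric
    return E
```

Because each entry only grows to the largest nearby entry, the sweep inflates the l_∞-error by at most the largest pre-existing monotonicity violation, in line with the claim above.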
The theoretical analysis hinges on two lemmas. First, the Δ‑truncation guarantees that any optimal Robinsonian distance must incur at least Δ error on at least one pair, establishing a lower bound on OPT. Second, the LP relaxation’s objective value is at most 4·OPT, and the deterministic rounding inflates this value by at most another factor of 4, yielding the overall 16‑approximation bound. The authors also prove that the recursive decomposition introduces only additive errors that are subsumed by the multiplicative factor.
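Schematically, the constant 16 arises from chaining the two lemmas (a restatement of the bounds above, not a proof):

```latex
\Delta \;\le\; \mathrm{OPT}, \qquad
\mathrm{LP} \;\le\; 4\,\mathrm{OPT}, \qquad
\|d - \hat d\|_\infty \;\le\; 4\,\mathrm{LP} \;\le\; 16\,\mathrm{OPT}.
```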
Complexity-wise, the Δ‑truncation and the construction of G_τ each require O(n²) time for an n‑element set. The recursive decomposition runs in O(n log n) because each level splits the vertex set roughly in half. Solving the LP relaxation can be done in polynomial time using standard interior‑point methods; the number of variables is O(n²) but the structure is highly sparse, allowing practical implementation. Consequently, the total running time is bounded by O(n³), and the memory consumption is O(n²), comparable to storing the original distance matrix.
Experimental evaluation is performed on two benchmark families. The first consists of synthetic random metrics generated by perturbing Euclidean distances in low dimensions; the second comprises real‑world seriation data such as gene‑expression similarity matrices. The proposed algorithm is compared against a 2‑approximation algorithm for the related ultrametric fitting problem, a heuristic based on spectral seriation, and a naïve greedy reordering. Results show that the l_∞‑error of the 16‑approximation algorithm is typically within a factor of 1.8 of the optimum (as estimated by exhaustive search on small instances), substantially better than the logarithmic‑factor guarantees of prior work. Moreover, for n ≈ 2000 the runtime is 30‑50 % lower than that of the competing methods, thanks to the highly parallelizable Δ‑truncation step and the sparse LP formulation. Visual inspection of the reordered matrices confirms that the algorithm recovers the latent linear order present in the biological data, demonstrating its practical utility for seriation.
The paper concludes by highlighting several avenues for future research. One direction is to tighten the constant factor, possibly by refining the rounding scheme or by employing stronger combinatorial relaxations. Another is to extend the framework to other error norms such as l₁ or l₂, which are more appropriate for certain statistical applications. Finally, the authors suggest investigating online or incremental versions of the algorithm to handle streaming distance data, as well as applying the decomposition‑ordering paradigm to related matrix‑reordering problems like the consecutive‑ones property or the bandwidth minimization problem.
In summary, this work delivers the first constant‑factor polynomial‑time approximation algorithm for the l_∞‑Robinson fitting problem, achieving a provable 16‑approximation while remaining computationally efficient and empirically effective on both synthetic and real datasets. It bridges a gap between theoretical hardness and practical applicability, and it opens a pathway for further algorithmic advances in distance‑based data analysis.