Comparing Pedigree Graphs

Pedigree graphs, or family trees, are typically constructed by an expensive process of examining genealogical records to determine which pairs of individuals are parent and child. New methods to automate this process take as input genetic data from a set of extant individuals and reconstruct ancestral individuals. There is a great need to evaluate the quality of these methods by comparing the estimated pedigree to the true pedigree. In this paper, we consider two main pedigree comparison problems. The first is the pedigree isomorphism problem, for which we present a linear-time algorithm for leaf-labeled pedigrees. The second is the pedigree edit distance problem, for which we present 1) several algorithms that are fast and exact in various special cases, and 2) a general, randomized heuristic algorithm. In the negative direction, we first prove that the pedigree isomorphism problem is as hard as the general graph isomorphism problem, and that the sub-pedigree isomorphism problem is NP-hard. We then show that the pedigree edit distance problem is APX-hard in general and NP-hard on leaf-labeled pedigrees. We use simulated pedigrees to compare our edit-distance algorithms to each other as well as to a branch-and-bound algorithm that always finds an optimal solution.

💡 Research Summary

The paper tackles the fundamental need for quantitative evaluation of automatically reconstructed pedigrees, which are directed acyclic graphs representing parent‑child relationships among individuals. It defines two central comparison tasks: (1) pedigree isomorphism – deciding whether two pedigrees are structurally identical – and (2) pedigree edit distance – measuring the minimum number of edit operations required to transform one pedigree into another.

For the isomorphism problem, the authors first prove that, in the unrestricted setting, pedigree isomorphism is polynomial‑time equivalent to the general graph isomorphism problem, for which no polynomial‑time algorithm is known. However, they identify a practically important subclass: leaf‑labeled pedigrees, where every extant individual (a leaf) carries a unique identifier. By propagating leaf‑label multisets upward and sorting the resulting canonical signatures, they obtain a linear‑time, O(n), algorithm that decides isomorphism for this subclass. The method essentially reduces the problem to comparing ordered lists of hash values, and experimental results show orders‑of‑magnitude speed‑ups over generic graph‑isomorphism solvers on synthetic data.
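To make concrete why unique leaf labels help, here is a minimal Python sketch. It is not the signature‑sorting procedure summarized above; it swaps in a simpler upward propagation that works under extra assumptions: each individual has at most one recorded father and one mother, stored in fixed (father, mother) slots, and every individual (founders included) appears as a key. The function name and the dictionary layout are hypothetical choices made for illustration.

```python
from collections import deque


def leaf_labeled_isomorphic(p, q):
    """Decide isomorphism of two leaf-labeled pedigrees (illustrative sketch).

    p, q: dict mapping individual -> (father, mother); None marks a missing
    parent. Leaves (individuals with no children) share ids across p and q;
    ids of internal individuals are arbitrary. These are assumptions.
    """
    def leaves(ped):
        parents = {x for pair in ped.values() for x in pair if x is not None}
        return set(ped) - parents

    if len(p) != len(q) or leaves(p) != leaves(q):
        return False

    # Leaves map to themselves; every other image is forced by parent slots.
    phi = {leaf: leaf for leaf in leaves(p)}
    queue = deque(phi)
    while queue:
        u = queue.popleft()
        v = phi[u]
        for pu, pv in zip(p[u], q[v]):
            if (pu is None) != (pv is None):
                return False  # one side has this parent, the other does not
            if pu is None:
                continue
            if pu in phi:
                if phi[pu] != pv:
                    return False  # conflicting images for the same ancestor
            else:
                phi[pu] = pv
                queue.append(pu)

    # phi must be a bijection between the two sets of individuals.
    return len(phi) == len(p) and len(set(phi.values())) == len(q)
```

Each individual is enqueued once and each parent slot is inspected once, so the sketch runs in time linear in the number of individuals. Because every non‑leaf has a labeled descendant, the mapping is forced and no search is needed.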

The edit‑distance problem is considerably harder. The authors first establish that computing the pedigree edit distance is APX‑hard in general, ruling out a PTAS unless P = NP, and they also prove NP‑hardness even when leaves are uniquely labeled. Consequently, exact algorithms are only feasible for restricted families. The paper presents several exact solutions for special cases:

Tree‑shaped pedigrees – a dynamic‑programming scheme that matches sub‑trees optimally, running in O(n²) time.
Leaf‑label‑preserving instances – by fixing the leaf correspondence, the DP reduces to a purely structural alignment problem with O(n·k) time (k = number of distinct labels).

Both algorithms exploit optimal‑substructure properties of pedigree edit operations (node/edge insert, delete, substitute) and store minimal costs for each pair of sub‑trees.
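The idea of storing a minimal cost for each pair of sub‑trees can be illustrated with a small memoized recursion. This is only a sketch under strong assumptions, not the paper's algorithm: it computes a simplified top‑down distance between two rooted trees in which roots are matched to roots, nodes are unlabeled, and the only operations are inserting or deleting whole nodes. The child matching is done by brute force over permutations, so it suits small fan‑outs only and does not achieve the O(n²) bound quoted above. All names are hypothetical.

```python
from functools import lru_cache
from itertools import permutations


def top_down_distance(children_a, root_a, children_b, root_b):
    """Count node insertions/deletions needed to align two rooted trees.

    children_a / children_b: dict mapping a node to a list of its children.
    Roots are matched to each other; nodes are unlabeled, so a match is free.
    """

    @lru_cache(maxsize=None)
    def size_a(u):
        return 1 + sum(size_a(c) for c in children_a.get(u, ()))

    @lru_cache(maxsize=None)
    def size_b(v):
        return 1 + sum(size_b(c) for c in children_b.get(v, ()))

    @lru_cache(maxsize=None)
    def dist(u, v):
        kids_a = list(children_a.get(u, ()))
        kids_b = list(children_b.get(v, ()))
        # Pad the shorter child list with None so both have equal length.
        width = max(len(kids_a), len(kids_b))
        kids_a += [None] * (width - len(kids_a))
        kids_b += [None] * (width - len(kids_b))

        def pair_cost(x, y):
            if x is None:
                return size_b(y)   # insert the whole sub-tree rooted at y
            if y is None:
                return size_a(x)   # delete the whole sub-tree rooted at x
            return dist(x, y)      # reuse the stored cost for this pair

        # Try every way of pairing children and keep the cheapest.
        return min(
            sum(pair_cost(x, y) for x, y in zip(kids_a, order))
            for order in permutations(kids_b)
        )

    return dist(root_a, root_b)
```

Replacing the permutation loop with a minimum‑cost bipartite matching would keep the same recurrence while avoiding the factorial blow‑up in the number of children.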

Because the general problem is intractable, the authors develop a randomized heuristic. The heuristic proceeds in three phases: (i) generate a random initial matching of internal nodes, (ii) perform a local search that swaps matched pairs, moves sub‑trees, or re‑labels edges to reduce the edit cost, and (iii) apply a meta‑heuristic such as simulated annealing to escape local minima. The algorithm is embarrassingly parallel and can be tuned by adjusting temperature schedules and iteration limits. Empirical evaluation on simulated pedigrees of varying size (hundreds to thousands of nodes) and shape (balanced, unbalanced, multi‑branch) shows that the heuristic typically attains solutions within 5–10% of the optimal value found by a branch‑and‑bound exact solver, while running in a fraction of the time (seconds versus hours).
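The three phases can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: it assumes both pedigrees have the same number of internal nodes, that leaves carry shared labels and map to themselves, and that the cost of a matching is the size of the symmetric difference between the mapped edge set and the target edge set. Only the pair‑swap move is shown, and the function names, parameters, and cooling schedule are hypothetical.

```python
import math
import random


def edit_cost(edges_a, edges_b, phi):
    """Edges to delete plus edges to insert once nodes are matched by phi."""
    mapped = {(phi[parent], phi[child]) for parent, child in edges_a}
    return len(mapped ^ set(edges_b))


def anneal_matching(edges_a, edges_b, internal_a, internal_b, leaves,
                    steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over matchings of internal nodes (sketch).

    internal_a, internal_b: lists of equal length (an assumption).
    """
    rng = random.Random(seed)

    # Phase (i): random initial matching; leaves are fixed by their labels.
    targets = list(internal_b)
    rng.shuffle(targets)
    phi = {leaf: leaf for leaf in leaves}
    phi.update(zip(internal_a, targets))

    cost = edit_cost(edges_a, edges_b, phi)
    best, best_cost = dict(phi), cost
    temp = t0
    for _ in range(steps):
        if best_cost == 0 or len(internal_a) < 2:
            break
        # Phase (ii): local move that swaps the images of two internal nodes.
        u, v = rng.sample(internal_a, 2)
        phi[u], phi[v] = phi[v], phi[u]
        new_cost = edit_cost(edges_a, edges_b, phi)
        # Phase (iii): accept worse moves with a temperature-dependent chance.
        accept = new_cost <= cost or rng.random() < math.exp(
            (cost - new_cost) / max(temp, 1e-9))
        if accept:
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(phi), cost
        else:
            phi[u], phi[v] = phi[v], phi[u]  # undo the swap
        temp *= cooling
    return best, best_cost
```

Independent runs with different seeds share no state, which is one way to read the claim that the heuristic parallelizes easily. Recomputing the full cost after each swap keeps the sketch short; an incremental update touching only the edges incident to the two swapped nodes would be faster.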

The experimental section also compares the exact special‑case algorithms against the heuristic and the branch‑and‑bound baseline. For tree‑shaped or leaf‑label‑preserving pedigrees, the exact DP methods achieve optimal distances and are competitive in speed; for more complex, densely connected pedigrees the heuristic dominates in runtime with only modest loss in quality.

Finally, the paper discusses broader implications and future work. It highlights the need to incorporate weighted edit operations (e.g., different costs for biological versus computational edits), to handle multi‑modal labels (genotype, phenotype), and to test the methods on real human pedigree data where missing or erroneous records are common. The authors also suggest exploring graph‑neural‑network‑based predictors that could learn to estimate edit distances from structural features, potentially enabling near‑real‑time comparison in large‑scale genomic studies.

In summary, this work provides a rigorous complexity landscape for pedigree comparison, delivers a linear‑time isomorphism test for the practically relevant leaf‑labeled case, supplies exact dynamic‑programming algorithms for several tractable subclasses, and introduces a fast, scalable randomized heuristic for the general APX‑hard edit‑distance problem. These contributions furnish the bioinformatics community with essential tools for benchmarking pedigree‑reconstruction pipelines and for further algorithmic research on graph‑based family‑tree analysis.

📜 Original Paper Content