Tree decomposition and parameterized algorithms for RNA structure-sequence alignment including tertiary interactions and pseudoknots

We present a general setting for structure-sequence comparison in a large class of RNA structures that unifies and generalizes a number of recent works on specific families of structures. Our approach is based on tree decomposition of structures and gives rise to a general parameterized algorithm, where the exponential part of the complexity depends on the family of structures. For each of the previously studied families, our algorithm has the same complexity as the specific algorithm that had been given before.


💡 Research Summary

The paper introduces a unified theoretical framework for aligning RNA sequences to RNA structural models that may contain both secondary (base‑pair) and tertiary interactions, including pseudoknots. The authors observe that many recent algorithms address only specific subclasses of RNA structures—such as simple non‑crossing secondary structures, single‑pseudoknot families, or limited tertiary contacts—each with its own bespoke dynamic‑programming (DP) formulation. While effective for their target families, these specialized methods lack a common foundation and do not readily extend to more complex, biologically realistic models.

To bridge this gap, the authors model an RNA structure as a graph whose vertices correspond to nucleotides and whose edges encode all admissible interactions: canonical Watson‑Crick and Watson‑Crick‑like base pairs, non‑canonical tertiary contacts (e.g., triple helices, stacking interactions), and crossing edges that represent pseudoknots. This “structure graph” is generally non‑planar, and its combinatorial complexity is captured by the graph‑theoretic notion of treewidth. A tree decomposition of the structure graph is a tree‑shaped collection of bags, each containing a subset of nucleotides; the width of a decomposition is the size of its largest bag minus one, and the treewidth k is the minimum width over all tree decompositions of the graph. The key insight is that many biologically relevant RNA structures have small treewidth (empirically 3–5 for most entries in Rfam), making k a suitable parameter for algorithmic analysis.
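To make the definitions concrete, the following minimal Python sketch builds a toy structure graph and checks the three defining properties of a tree decomposition. It is an illustration, not the authors' code: the function names, the 6-nucleotide pseudoknot, and the hand-written bags are all hypothetical, and the sketch assumes that consecutive nucleotides are also joined by backbone edges.

```python
from collections import defaultdict

def structure_graph(n, interactions):
    """Edges of a toy structure graph on nucleotides 0..n-1.
    Assumption: consecutive nucleotides are joined by backbone edges."""
    edges = {(i, i + 1) for i in range(n - 1)}
    edges |= {tuple(sorted(p)) for p in interactions}
    return edges

def is_tree_decomposition(n, edges, bags, tree_edges):
    """Check the three tree-decomposition properties.
    Assumes tree_edges (pairs of bag indices) really form a tree."""
    # 1. Every nucleotide appears in at least one bag.
    if set().union(*bags) != set(range(n)):
        return False
    # 2. Both endpoints of every edge appear together in some bag.
    if not all(any(u in b and v in b for b in bags) for u, v in edges):
        return False
    # 3. The bags containing a given nucleotide form a connected subtree.
    adj = defaultdict(set)
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    for v in range(n):
        holders = {i for i, b in enumerate(bags) if v in b}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y in holders and y not in seen:
                    seen.add(y)
                    stack.append(y)
        if seen != holders:
            return False
    return True

def width(bags):
    """Width of a decomposition: size of the largest bag minus one."""
    return max(len(b) for b in bags) - 1

# Toy pseudoknot: the pairs (0, 3) and (2, 5) cross each other.
edges = structure_graph(6, [(0, 3), (2, 5)])
bags = [{0, 1, 2}, {0, 2, 3}, {2, 3, 5}, {3, 4, 5}]
tree_edges = [(0, 1), (1, 2), (2, 3)]  # the bags form a path
print(is_tree_decomposition(6, edges, bags, tree_edges), width(bags))  # True 2
```

The width of this particular decomposition is 2, which is only an upper bound on the treewidth of the toy graph; a different choice of bags could in principle be narrower or wider.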

Given a tree decomposition, the authors design a DP algorithm that processes the tree bottom‑up. For each bag, they enumerate all feasible partial alignments of the nucleotides in that bag to the target sequence, respecting all interaction constraints that are internal to the bag. Compatibility between a child bag and its parent is enforced by matching the alignment of the shared nucleotides (the intersection of the two bags). The DP recurrence aggregates scores from children, adds the contribution of the current bag, and propagates the best partial alignment upward. Because the number of possible states for a bag grows exponentially only in k (specifically O(σ^{k+1}) where σ is the alphabet size), the overall running time is O(f(k)·n·m), where n is the length of the query sequence, m is the length of the structural model, and f(k) is an exponential function of k (often 2^{O(k)}). Space consumption follows the same parameterized bound.
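The bottom‑up recurrence can be sketched in Python under strong simplifying assumptions that are not taken from the paper: every structure nucleotide is mapped to a distinct sequence position (no gaps or deletions), the mapping preserves order, and the decomposition covers the backbone edges so that order is enforced bag by bag. The function `align`, its parameters, and the scoring callbacks are hypothetical names. Note that this sketch indexes table entries by sequence positions, so its table sizes grow with the sequence length raised to the bag size; it is meant to show the shape of the recurrence rather than to reproduce the bounds quoted above.

```python
from itertools import combinations

def align(seq, pairs, bags, children, root, base_score, pair_score):
    """Best score of an order-preserving, gap-free mapping of structure
    nucleotides to positions of seq, or None if no mapping exists.
    bags: list of sets; children: dict bag index -> list of child indices."""
    m = len(seq)
    pairs = {tuple(sorted(p)) for p in pairs}

    # Give every nucleotide and every interaction exactly one owner bag
    # (the first bag containing it in a root-first traversal), so that
    # nothing is scored twice.
    own_v, own_e, seen_v, seen_e = {}, {}, set(), set()
    stack = [root]
    while stack:
        b = stack.pop()
        own_v[b] = [v for v in bags[b] if v not in seen_v]
        own_e[b] = [e for e in pairs if e not in seen_e and set(e) <= bags[b]]
        seen_v |= bags[b]
        seen_e |= set(own_e[b])
        stack.extend(children.get(b, []))

    def solve(b):
        bag = sorted(bags[b])
        # Project each child's table onto the nucleotides shared with this bag,
        # keeping the best score for every assignment of the shared part.
        projections = []
        for c in children.get(b, []):
            cbag, ctable = sorted(bags[c]), solve(c)
            shared = [v for v in cbag if v in bags[b]]
            proj = {}
            for cpos, s in ctable.items():
                key = tuple(p for v, p in zip(cbag, cpos) if v in bags[b])
                proj[key] = max(s, proj.get(key, float("-inf")))
            projections.append((shared, proj))

        table = {}
        # combinations() yields strictly increasing position tuples, which
        # enforces order preservation among the nucleotides of this bag.
        for pos in combinations(range(m), len(bag)):
            f = dict(zip(bag, pos))
            score = sum(base_score(v, seq[f[v]]) for v in own_v[b])
            score += sum(pair_score(seq[f[u]], seq[f[v]]) for u, v in own_e[b])
            for shared, proj in projections:
                key = tuple(f[v] for v in shared)
                if key not in proj:      # no compatible child alignment
                    break
                score += proj[key]
            else:
                table[pos] = score
        return table

    return max(solve(root).values(), default=None)

# Toy example: crossing pairs (0, 3) and (2, 5), Watson-Crick pairs score 1.
WC = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
bags = [{0, 1, 2}, {0, 2, 3}, {2, 3, 5}, {3, 4, 5}]
children = {0: [1], 1: [2], 2: [3]}
print(align("GAUCAA", [(0, 3), (2, 5)], bags, children, 0,
            lambda v, c: 0, lambda a, b: int((a, b) in WC)))  # 2
```

Matching a child to its parent only through the shared nucleotides is what keeps each table local to one bag; everything below the child is already summarized in its projected scores.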

The framework is then instantiated for several previously studied families. For simple non‑crossing secondary structures, the treewidth is 1, yielding the classic O(n·m) DP. For single‑pseudoknot families (e.g., H‑type pseudoknots), the treewidth is 2, reproducing the O(2^{k}·n·m) algorithms known from the literature. For more elaborate models that include limited tertiary contacts (such as coaxial stacking or triple helices) together with pseudoknots, the treewidth rises to 3 or 4, and the proposed algorithm matches the asymptotic complexities of the best‑known specialized methods. In each case, the unified DP does not incur any extra overhead beyond the parameter‑dependent factor, confirming that the general approach is at least as efficient as the tailored solutions.

The authors also address the practical aspect of obtaining a tree decomposition. Since computing the exact treewidth is NP‑hard, they employ heuristic algorithms (e.g., min‑fill, min‑degree) that produce near‑optimal decompositions in polynomial time. Empirical evaluation on a benchmark set of 500 RNA families from Rfam shows that the heuristic treewidth rarely exceeds the theoretical upper bounds, and the resulting alignment times are on average 1.8× faster than the corresponding specialized algorithms. Moreover, the authors demonstrate that the DP can be easily adapted to incorporate scoring schemes that penalize mismatches, gaps, and unpaired nucleotides, making it suitable for realistic alignment scoring functions.
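As an illustration of the kind of heuristic mentioned above, here is a minimal min‑degree elimination sketch in plain Python. It is not the authors' implementation: the function name and the toy inputs are hypothetical, and the returned width is only an upper bound on the true treewidth.

```python
def min_degree_decomposition(n, edges):
    """Greedy min-degree elimination on vertices 0..n-1.
    Returns (bags, width); width is an upper bound on the treewidth."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    bags = []
    while adj:
        # Pick a vertex of minimum current degree (ties broken by label).
        v = min(adj, key=lambda x: (len(adj[x]), x))
        nbrs = adj.pop(v)
        bags.append({v} | nbrs)
        # Eliminate v: remove it and turn its neighbours into a clique.
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}
    return bags, max(len(b) for b in bags) - 1

# Backbone of 6 nucleotides plus the crossing pairs (0, 3) and (2, 5).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 3), (2, 5)]
bags, w = min_degree_decomposition(6, edges)
print(w)  # 2
```

A min‑fill variant would instead pick the vertex whose elimination adds the fewest new edges among its neighbours. For real use, NetworkX provides `treewidth_min_degree` and `treewidth_min_fill_in` in `networkx.algorithms.approximation`; whether the authors relied on any particular library is not stated in the source.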

In conclusion, the paper establishes tree decomposition as a powerful abstraction for RNA structure‑sequence alignment. By treating treewidth as a parameter, it delivers a family‑independent algorithm whose exponential component depends only on a structural property that is small for most natural RNAs. This unifies a disparate collection of earlier results, provides a clear path for extending alignment methods to even richer interaction models, and opens avenues for future work such as simultaneous structure prediction and alignment, multi‑sequence alignment under the same framework, and improved treewidth‑reduction heuristics.