RNAGenScape: Property-Guided, Optimized Generation of mRNA Sequences with Manifold Langevin Dynamics


Generating property-optimized mRNA sequences is central to applications such as vaccine design and protein replacement therapy, but remains challenging due to limited data, complex sequence-function relationships, and the narrow space of biologically viable sequences. Generative methods that drift away from the data manifold can yield sequences that fail to fold, translate poorly, or are otherwise nonfunctional. We present RNAGenScape, a property-guided manifold Langevin dynamics framework for mRNA sequence generation that operates directly on a learned manifold of real data. By performing iterative local optimization constrained to this manifold, RNAGenScape preserves biological viability, accesses reliable guidance, and avoids excursions into nonfunctional regions of the ambient sequence space. The framework integrates three components: (1) an autoencoder jointly trained with a property predictor to learn a property-organized latent manifold, (2) a denoising autoencoder that projects updates back onto the manifold, and (3) a property-guided Langevin dynamics procedure that performs optimization along the manifold. Across three real-world mRNA datasets spanning two orders of magnitude in size, RNAGenScape increases median property gain by up to 148% and success rate by up to 30% while ensuring biological viability of generated sequences, and achieves competitive inference efficiency relative to existing generative approaches.


💡 Research Summary

RNAGenScape tackles the problem of generating mRNA sequences that are both biologically viable and optimized for a desired functional property (e.g., stability, translation efficiency, ribosome load). The authors argue that most existing generative approaches either ignore the low‑dimensional manifold on which real mRNA sequences lie or rely on diffusion from pure Gaussian noise, causing the generated candidates to drift into regions of sequence space that are unlikely to fold, translate, or be otherwise functional. To address this, RNAGenScape performs iterative, local optimization directly on a learned data manifold, ensuring that every intermediate sequence stays close to the space of naturally occurring transcripts.

The framework consists of three tightly coupled components. First, an “organized autoencoder” (OAE) jointly trains an encoder‑decoder pair with a property predictor. The encoder maps a one‑hot mRNA sequence x to a latent vector z; the decoder reconstructs the sequence, while the predictor estimates the target property ŷ from z. By minimizing a weighted sum of reconstruction loss and property prediction loss, the latent space is shaped into a manifold that simultaneously captures sequence information and is organized according to the property of interest. Hyper‑parameters λ_pred and λ_recon balance these two objectives.
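The joint objective described above can be sketched as follows. This is a minimal NumPy illustration with hypothetical linear maps standing in for the encoder, decoder, and property predictor; the dimensions, weights, and loss weights λ_recon and λ_pred are toy values, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a flattened one-hot mRNA of length 50 over 4 nucleotides,
# mapped to a 16-dimensional latent space (values are illustrative only).
seq_dim, d = 50 * 4, 16
x = rng.random(seq_dim)           # stand-in for a one-hot sequence x
y = 0.7                           # scalar target property (e.g., ribosome load)

# Hypothetical linear encoder, decoder, and property predictor weights.
W_enc = rng.normal(size=(d, seq_dim)) * 0.01
W_dec = rng.normal(size=(seq_dim, d)) * 0.01
w_pred = rng.normal(size=d) * 0.01

z = W_enc @ x                     # encode: z = E(x)
x_hat = W_dec @ z                 # decode: reconstruct the sequence
y_hat = w_pred @ z                # predict the property from z

# Weighted sum of reconstruction and property-prediction losses;
# minimizing this shapes the latent manifold around the property.
lambda_recon, lambda_pred = 1.0, 0.5
loss = (lambda_recon * np.mean((x_hat - x) ** 2)
        + lambda_pred * (y_hat - y) ** 2)
```

In practice both terms would be minimized jointly by gradient descent over all three weight matrices; the sketch only shows how the two objectives combine into one scalar.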

Second, because experimental mRNA datasets are often small and unevenly sampled, the authors augment the latent embeddings using SUGAR, a diffusion‑geometry method that samples uniformly along the learned manifold. This augmentation fills sparsely populated regions, producing an expanded set Z = Z_original ∪ Z_SUGAR that is more representative of the true data geometry.

Third, a “manifold projector” Ψ is trained as a denoising autoencoder. Given a clean latent point z, a short Gaussian corruption chain (typically 1–3 steps) produces noisy versions ẑ^{(k)}. The projector learns to map each noisy point back to its predecessor, minimizing Σ_k ‖Ψ(ẑ^{(k)}) − ẑ^{(k−1)}‖². This operation acts as a retraction in Riemannian optimization, pulling any off‑manifold update back onto the learned manifold without requiring an explicit analytic description of the manifold.
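The corruption chain and the projector's training target can be sketched as below. The chain length K, noise scale σ, and the identity stand-in for Ψ are illustrative assumptions; in the paper Ψ is a learned denoising autoencoder trained to minimize this loss.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 16, 3                      # latent dim; K corruption steps (1-3 per the summary)
sigma = 0.1                       # assumed Gaussian corruption scale

z0 = rng.normal(size=d)           # a clean latent point on the manifold

# Build the short Gaussian corruption chain z^(0) -> z^(1) -> ... -> z^(K).
chain = [z0]
for _ in range(K):
    chain.append(chain[-1] + sigma * rng.normal(size=d))

def projector_loss(psi, chain):
    """Sum_k || psi(z^(k)) - z^(k-1) ||^2: each noisy point should map
    back to its immediate predecessor in the corruption chain."""
    return sum(np.sum((psi(chain[k]) - chain[k - 1]) ** 2)
               for k in range(1, len(chain)))

# An untrained stand-in projector (identity) incurs a positive loss;
# training Psi drives this toward zero, yielding the retraction.
identity = lambda z: z
baseline_loss = projector_loss(identity, chain)
```

Because each step only has to undo one small corruption, the learned map stays local, which is what makes it behave like a retraction rather than a global denoiser.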

With the encoder, property predictor, and projector Ψ in place, the authors define a property‑guided manifold Langevin dynamics. Starting from the latent embedding of a seed sequence, they compute a drift term η·τ·∇_z f(z), where ∇_z f(z) is the gradient of the property predictor, add isotropic Gaussian noise √η·ε, and then apply the projector: z_{t+1} = Ψ(z_t + η·τ·∇_z f(z_t) + √η·ε_t). The temperature τ controls the trade‑off between deterministic gradient ascent (small τ) and stochastic exploration (large τ). After each update, the decoder can translate the latent point back to a nucleotide sequence, allowing real‑time inspection of the optimization trajectory.
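The update loop above can be sketched in a few lines. Here the property predictor f is a hypothetical linear function with an analytic gradient, and the learned projector Ψ is replaced by a simple stand-in retraction onto the unit sphere, purely so the projection step is concrete; the step size η, temperature τ, and step count are toy values.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
eta, tau, n_steps = 0.05, 1.0, 20   # illustrative hyperparameters

# Hypothetical property predictor f(z) = w . z, so grad_z f(z) = w.
w = rng.normal(size=d)
grad_f = lambda z: w

# Stand-in projector: here the "manifold" is the unit sphere, so the
# retraction is plain normalization (the paper instead uses the learned Psi).
psi = lambda z: z / np.linalg.norm(z)

z = psi(rng.normal(size=d))         # seed latent point, projected on-manifold
for _ in range(n_steps):
    drift = eta * tau * grad_f(z)                 # property-guided ascent
    noise = np.sqrt(eta) * rng.normal(size=d)     # stochastic exploration
    z = psi(z + drift + noise)                    # project back onto manifold
```

Every iterate stays exactly on the (stand-in) manifold by construction, which is the property the learned projector Ψ provides for the real latent manifold.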

The method was evaluated on three real‑world mRNA datasets of increasing size and complexity: (1) OpenVaccine (≈2.4 k sequences, 107 nt each) with stability as the target; (2) Zebrafish 5′UTR collection (≈55 k sequences, 124 nt) targeting translation efficiency; and (3) Ribosome (≈260 k sequences, 50 nt) targeting mean ribosome load. RNAGenScape was compared against state‑of‑the‑art diffusion‑based generators (DiffAb), language‑model‑based approaches (IgLM), and recent optimization‑specific models (NOS‑C/D). Across all benchmarks, RNAGenScape achieved median property improvements up to 148 % higher than baselines and increased the success rate (percentage of sequences with improved property) by up to 30 percentage points. Importantly, biological viability metrics—such as reduction of upstream open‑reading‑frame start codons (uORF OOF AUG), preservation or increase of Kozak consensus similarity, and lowering of minimum free energy—were consistently better than competing methods, demonstrating that the generated sequences remain translationally competent and structurally sound. In terms of computational efficiency, RNAGenScape’s inference throughput was 68 % higher than the fastest baseline, making it suitable for large‑scale in‑silico screening pipelines.

In summary, RNAGenScape introduces a principled way to perform property‑guided optimization on the intrinsic manifold of mRNA sequences. By jointly learning a property‑organized latent space, augmenting it with geometry‑preserving samples, and repeatedly projecting Langevin updates back onto the manifold, the framework achieves superior functional gains while guaranteeing biological plausibility. The authors suggest that this approach can be extended to other RNA‑based therapeutics, synthetic biology applications, and any domain where the design space is a thin, structured manifold embedded in a high‑dimensional ambient space.

