Generative models derived from large protein sequence alignments define complex fitness landscapes, but their utility for accurately modeling non-equilibrium evolutionary dynamics remains unclear. In this work, we perform a rigorous comparative analysis of three simulation schemes designed to mimic evolution in silico by local sampling of the probability distribution defined by a generative model. We compare standard independent Markov Chain Monte Carlo, Monte Carlo on a phylogenetic tree, and population genetics dynamics, benchmarking their outputs against deep sequencing data from four distinct in vitro evolution experiments. We find that standard Monte Carlo fails to reproduce the correct phylogenetic structure and generates unrealistic, gradual mutational sweeps. Performing Monte Carlo on a tree inferred from data improves phylogenetic fidelity and historical accuracy. The population genetics scheme successfully captures phylogenetic correlations, mutational abundances, and selective sweeps as emergent properties, without the need to infer additional information from data. However, this scheme comes at the price of not sampling the proper generative model distribution at long times. Our findings highlight the crucial role of phylogenetic correlations and finite-population effects in shaping evolutionary trajectories on fitness landscapes. These models therefore provide powerful tools for predicting complex adaptive paths and for reliably extrapolating evolutionary dynamics beyond current experimental limitations.
Proteins are fundamental building blocks of life, performing a vast array of essential biological functions. Over evolutionary timescales, mutations in the genetic code can alter the amino acid sequence of a protein, potentially affecting its stability, structure, or function. Accurately modeling protein evolution is therefore of both theoretical and practical importance. From a theoretical perspective, realistic models are crucial for phylogenetic inference and for understanding the constraints that shape and structure evolutionary landscapes. From a practical standpoint, such models are essential for predicting the evolutionary trajectories of pathogens and for informing the design of effective drug therapies.
The modeling of molecular evolution fundamentally relies on statistical frameworks to describe how protein or nucleic acid sequences change over time. Historically, this effort has been motivated by the need to understand the forces of natural selection and to infer evolutionary relationships, leading to the development of sequence substitution models [1]. These models provide the core mathematical framework for comparative genomics and phylogenetic inference. They treat the evolution of a sequence as a continuous-time Markov process, where a transition rate matrix governs the rate of change from one nucleotide (or amino acid) to another. The complexity of these models ranges widely: the simplest, like the Jukes-Cantor (JC69) model for nucleotides [2], assumes that all substitutions occur at the same rate. More sophisticated models, such as HKY85 and GTR for DNA [3,4], and WAG [5], JTT [6] and LG [7] for proteins, account for biases like transition/transversion rates and unequal stationary frequencies.
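As an illustration of the continuous-time Markov description above, the following minimal sketch computes the JC69 transition probabilities in closed form. It assumes a rate matrix whose off-diagonal entries all equal a single rate `alpha` and whose diagonal entries are `-3 * alpha`; the parameter name `alpha` and the helper names are choices made for this example, not notation taken from the cited works.

```python
import math

NUCLEOTIDES = "ACGT"


def jc69_transition_matrix(t, alpha=1.0):
    """Return P(t) = exp(Q t) for the JC69 model.

    Assumed parameterization: every off-diagonal entry of Q equals alpha
    and every diagonal entry equals -3 * alpha, so rows of Q sum to zero.
    """
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay   # probability of observing the same base
    diff = 0.25 - 0.25 * decay   # probability of each specific other base
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Plain-Python product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    p = jc69_transition_matrix(0.1)
    for base, row in zip(NUCLEOTIDES, p):
        print(base, [round(x, 4) for x in row])
```

Because every site uses the same 4 x 4 matrix regardless of the rest of the sequence, the probability of a whole sequence change factorizes over sites. At long times all entries approach 1/4, the uniform stationary distribution of this model.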
While these substitution models have been very successful, they mostly operate under a severe simplification: the mutation rate of each site does not depend on the rest of the sequence. This is equivalent to assuming that sites evolve independently. In order to reproduce statistical properties of evolving sequences, such as the distribution of substitution rates among sites or the time-dependent mutability constraints at different residues, one then needs to add empirical observations explicitly. More generally, these standard models are inherently unable to capture epistatic interactions, in which the functional effect of one mutation depends on the context in which it occurs. Such interactions are crucial for maintaining protein structure and function, and strongly influence evolutionary trajectories [8][9][10][11][12][13][14][15][16]. Accounting for interactions between sites makes a model more computationally expensive and prevents the use of per-site transition matrices, but it also makes the model more accurate. One attempt at incorporating epistasis into evolutionary modeling was made with structural substitution models [17,18], which take into account interactions between sites mediated by the protein structure.
Here, we consider an alternative approach in which a fitness landscape is first inferred from natural sequence data using generative sequence modeling, and the evolutionary dynamics is then simulated on that landscape. The advantage of working directly in sequence space, without explicitly incorporating structural information, is that the resulting models are computationally very simple and efficient. We employ a powerful energy-based generative modeling framework known as Direct Coupling Analysis (DCA) [19]. DCA is trained on a dataset of natural sequences, from which it infers a probability distribution with parameters representing site-specific constraints and, critically, the long-range epistatic interactions that stabilize protein structure and dictate function [20][21][22][23][24]. DCA-based models have already proven useful in various tasks, including contact prediction [25,26], the in silico generation of functional proteins [27][28][29], the study of specificity-switching pathways [30] and the inference of mutational fitness landscapes from experiments [21][31][32][33][34][35]. Similar approaches have also been used in the context of HIV/HCV [36][37][38][39][40] and SARS-CoV-2 [41,42]. Hence, sequence fitness landscapes inferred by DCA contain information about structure, and are able to capture the main ingredients needed to make a sequence functional, as well as the epistatic effects of mutations. While alternative generative modeling approaches to fitness landscapes exist, e.g. based on Large Language Models, to our knowledge they do not outperform DCA on the modeling tasks listed above; we therefore adopt DCA for its simplicity of implementation and computational efficiency [43][44][45].
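To make the energy-based picture concrete, the sketch below evaluates a Potts-type energy with single-site fields and pairwise couplings and performs Metropolis single-site updates, i.e. local sampling of a distribution proportional to exp(-E). It is only an illustration under stated assumptions: the sign convention, the names `h`, `J`, `potts_energy` and `metropolis_step`, and the random toy parameters are hypothetical choices for this example, and the parameters are not inferred from any alignment.

```python
import math
import random


def potts_energy(seq, h, J):
    """E(a) = -sum_i h[i][a_i] - sum_{i<j} J[i][j][a_i][a_j] (assumed convention)."""
    L = len(seq)
    energy = -sum(h[i][seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            energy -= J[i][j][seq[i]][seq[j]]
    return energy


def metropolis_step(seq, h, J, rng):
    """Propose one single-site change; accept with probability min(1, exp(-dE))."""
    L, q = len(h), len(h[0])
    site = rng.randrange(L)
    proposal = list(seq)
    proposal[site] = rng.randrange(q)
    d_e = potts_energy(proposal, h, J) - potts_energy(seq, h, J)
    if d_e <= 0 or rng.random() < math.exp(-d_e):
        return proposal
    return list(seq)


def random_toy_model(L, q, rng):
    """Random fields and couplings, for illustration only (not inferred from data)."""
    h = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(L)]
    J = [[[[0.0] * q for _ in range(q)] for _ in range(L)] for _ in range(L)]
    for i in range(L):
        for j in range(i + 1, L):  # only the i < j entries are used
            for a in range(q):
                for b in range(q):
                    J[i][j][a][b] = rng.gauss(0.0, 0.5)
    return h, J


if __name__ == "__main__":
    rng = random.Random(0)
    L, q = 6, 4  # toy sizes; real protein models use q = 21 (20 amino acids plus gap)
    h, J = random_toy_model(L, q, rng)
    seq = [rng.randrange(q) for _ in range(L)]
    for _ in range(1000):
        seq = metropolis_step(seq, h, J, rng)
    print(seq, potts_energy(seq, h, J))
```

The energy difference is recomputed from scratch here for readability; since a proposal changes a single site, a practical implementation would evaluate only the terms involving that site. The couplings make the acceptance of a mutation depend on the rest of the sequence, which is how epistasis enters the dynamics.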
While DCA has traditionally been applied at equilibrium (generating novel protein sequences by sampling the inferred probability distribution via independent Markov chains, for times much larger than the mixing time of the model), its use has increasingly expanded to forward evolutionary simulations, where its energy is