Branching Flows: Discrete, Continuous, and Manifold Flow Matching with Splits and Deletions
Diffusion and flow matching approaches to generative modeling have shown promise in domains where the state space is continuous, such as image generation or protein folding & design, and discrete, exemplified by diffusion large language models. They offer a natural fit when the number of elements in a state is fixed in advance (e.g. images), but require ad hoc solutions when, for example, the length of a response from a large language model, or the number of amino acids in a protein chain is not known a priori. Here we propose Branching Flows, a generative modeling framework that, like diffusion and flow matching approaches, transports a simple distribution to the data distribution. But in Branching Flows, the elements in the state evolve over a forest of binary trees, branching and dying stochastically with rates that are learned by the model. This allows the model to control, during generation, the number of elements in the sequence. We also show that Branching Flows can compose with any flow matching base process on discrete sets, continuous Euclidean spaces, smooth manifolds, and 'multimodal' product spaces that mix these components. We demonstrate this in three domains: small molecule generation (multimodal), antibody sequence generation (discrete), and protein backbone generation (multimodal), and show that Branching Flows is a capable distribution learner with a stable learning objective, and that it enables new capabilities.
💡 Research Summary
The paper “Branching Flows: Discrete, Continuous, and Manifold Flow Matching with Splits and Deletions” introduces a novel generative modeling framework designed to handle sequences of variable length, a significant challenge for existing diffusion and flow matching models, which typically require a fixed state dimension.
The core innovation of Branching Flows lies in its integration of stochastic branching and deletion processes within the Generator Matching paradigm. Generator Matching works by defining a conditional probability path that transports samples from a simple prior distribution (p) to the data distribution (q), and then training a neural network to mimic the generator of this conditional process, ultimately enabling sampling from the data distribution. Branching Flows extends this by enriching the conditioning variable Z. For each data sample (x1, e.g., a protein sequence) and a paired initial sample (x0), Z includes a forest of labeled binary trees (T) and a set of anchor values (A) for internal tree nodes. The roots of the trees correspond to elements of x0, while the surviving leaves correspond to elements of x1. Leaves marked for deletion represent elements that should be removed by the end of the process.
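The structure of the conditioning variable Z described above can be made concrete with a small sketch. This is a hypothetical illustration, not the paper's implementation: the names `TreeNode` and `Conditioning` are invented here, and anchors are left as opaque values.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    # Anchor value at an internal node (the set A in the summary above);
    # leaves take their values from x1, or are marked for deletion.
    anchor: object = None
    children: list = field(default_factory=list)  # 0 or 2 children (binary tree)
    deleted: bool = False  # True for a leaf marked for deletion

    def is_leaf(self):
        return not self.children

    def surviving_leaves(self):
        """Leaves not marked for deletion; these align with elements of x1."""
        if self.is_leaf():
            return [] if self.deleted else [self]
        return [leaf for child in self.children for leaf in child.surviving_leaves()]

@dataclass
class Conditioning:
    # One root per element of the initial sample x0.
    forest: list

    def target_length(self):
        # Number of elements the generated sequence has at t = 1,
        # i.e. the length of the data sample x1.
        return sum(len(root.surviving_leaves()) for root in self.forest)

# Example: x0 has two elements; the first splits into two leaves,
# one of which is deleted, so x1 ends up with two elements.
root_a = TreeNode(anchor="a", children=[TreeNode(), TreeNode(deleted=True)])
root_b = TreeNode()
z = Conditioning(forest=[root_a, root_b])
print(z.target_length())  # → 2
```

The key invariant is that the sequence length at any time t is the number of tree branches alive at t, interpolating between len(x0) at t = 0 and len(x1) at t = 1.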
During training, samples are drawn from the conditional path given Z. In this path, each element evolves independently along its assigned branch in the tree according to a chosen “base generator” (which can be defined for discrete, continuous, or manifold-valued states). Crucially, when an element reaches a bifurcation point predefined in the tree T, it undergoes a “split” event: it is duplicated in place, with the two copies proceeding down the two child branches. Conversely, elements on branches leading to a “deleted” leaf are subject to a “deletion” event, removing them from the sequence. The timing of splits and deletions is governed by time-dependent rates derived from hazard distributions (H_split, H_del). These hazard rates are designed to explode as time approaches 1, ensuring that all prescribed splits and deletions are guaranteed to occur, forcing the conditional path to terminate exactly at the target x1.
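The "exploding hazard" idea can be illustrated with a toy example. Assuming a hazard rate of the form lambda(t) = c / (1 - t) (an illustrative choice, not necessarily the paper's H_split or H_del), the survival function is S(t) = (1 - t)^c, so event times can be drawn by inverse-CDF sampling and are guaranteed to land strictly before t = 1:

```python
import random

def sample_event_time(c=2.0, rng=random):
    """Sample an event time from the hazard lambda(t) = c / (1 - t).

    The hazard explodes as t -> 1, so the survival function
    S(t) = (1 - t)^c reaches 0 at t = 1 and the event is certain
    to occur before the end of the path. Inverse-CDF sampling:
    F(t) = 1 - (1 - t)^c  =>  t = 1 - u^(1/c) for u ~ Uniform(0, 1].
    """
    u = 1.0 - rng.random()  # in (0, 1], avoids t == 1 exactly
    return 1.0 - u ** (1.0 / c)

# Every sampled split/deletion time falls in [0, 1), so all
# prescribed events have fired by the time the path reaches x1.
times = [sample_event_time() for _ in range(10_000)]
print(min(times) >= 0.0 and max(times) < 1.0)  # → True
```

Larger c concentrates event times earlier in the trajectory; the guarantee that every event occurs before t = 1 holds for any hazard whose integral diverges as t approaches 1.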
This architecture confers several key advantages. First, it allows the model to naturally control the number of elements in a sequence during generation, solving the variable-length problem. Second, it is highly composable: the branching-and-deletion mechanism operates independently of the base process, allowing Branching Flows to seamlessly work with discrete token spaces, continuous Euclidean spaces, smooth manifolds (like rotation groups), and multimodal combinations thereof. Third, it offers flexibility in controlling the behavior of the generative trajectories through the choice of tree structures and anchor distributions.
The authors demonstrate the efficacy and versatility of Branching Flows across three challenging domains: small molecule generation on the QM9 dataset (combining continuous atomic coordinates and discrete atom types), antibody sequence generation (discrete amino acids), and protein backbone generation (a multimodal task involving continuous coordinates, SO(3) rotations for torsion angles, and discrete amino acid types). The experiments show that Branching Flows is a capable distribution learner with a stable training objective. Furthermore, it enables new capabilities, such as solving the “unknown-length infix sampling” problem—inserting a segment of variable length between two known flanking regions—which is difficult for fixed-length flow models and autoregressive models alike. The work thus presents a unified and powerful framework for generative modeling over structured, variable-length data across diverse state spaces.