MapPFN: Learning Causal Perturbation Maps in Context

Notice: This research summary and analysis were automatically generated using AI technology. For definitive details, please refer to the original arXiv source.

Planning effective interventions in biological systems requires treatment-effect models that adapt to unseen biological contexts by identifying their specific underlying mechanisms. Yet single-cell perturbation datasets span only a handful of biological contexts, and existing methods cannot leverage new interventional evidence at inference time to adapt beyond their training data. To meta-learn a perturbation effect estimator, we present MapPFN, a prior-data fitted network (PFN) pretrained on synthetic data generated from a prior over causal perturbations. Given a set of experiments, MapPFN uses in-context learning to predict post-perturbation distributions, without gradient-based optimization. Despite being pretrained on in silico gene knockouts alone, MapPFN identifies differentially expressed genes, matching the performance of models trained on real single-cell data. Our code and data are available at https://github.com/marvinsxtr/MapPFN.


💡 Research Summary

MapPFN (Mapping Perturbation Effects with Prior‑Data Fitted Networks) introduces a novel approach for predicting the outcomes of gene perturbations in single‑cell data, especially when the biological context (e.g., cell line, experimental conditions) has not been seen during training. Traditional perturbation‑prediction models rely heavily on the distribution of training contexts and often require fine‑tuning or additional data collection to generalize to new settings. MapPFN overcomes these limitations by combining three key ideas: (1) meta‑learning via Prior‑Data Fitted Networks (PFNs) trained on massive synthetic datasets generated from a prior over structural causal models (SCMs) and synthetic gene‑regulatory networks (GRNs); (2) a Multimodal Diffusion Transformer (MMDiT) architecture that treats each cell as a token and processes three modalities—noise, cell state, and one‑hot encoded treatment—through separate streams with cross‑attention; and (3) in‑context learning (ICL) that conditions predictions on a small set of observed interventional distributions (the “context”) together with the observational distribution.
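The in-context input described above can be sketched as a per-cell token sequence. The following is a minimal illustration, not the paper's implementation: the function names (`one_hot`, `assemble_input`) and the exact packing of observational cells, context interventions, and the query treatment are assumptions for clarity; in the actual MMDiT, state and treatment are processed in separate streams with cross-attention rather than simply concatenated.

```python
import numpy as np

def one_hot(t, n_treatments):
    """One-hot treatment code; observational cells get the all-zero vector."""
    v = np.zeros(n_treatments)
    if t is not None:
        v[t] = 1.0
    return v

def assemble_input(y_obs, context, t_q, n_treatments):
    """Pack the observational distribution and K interventional distributions
    into one per-cell token sequence of (cell state, treatment code) pairs,
    plus the one-hot code of the query treatment t_q.

    context: list of (treatment index, cells) pairs, one per observed intervention.
    """
    states = [y_obs]
    treatments = [np.tile(one_hot(None, n_treatments), (len(y_obs), 1))]
    for t_k, y_k in context:
        states.append(y_k)
        treatments.append(np.tile(one_hot(t_k, n_treatments), (len(y_k), 1)))
    return np.concatenate(states), np.concatenate(treatments), one_hot(t_q, n_treatments)
```

Each row of the returned arrays corresponds to one cell token, matching the summary's description of the transformer treating each cell as a token.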

During pre‑training, the authors sample a causal graph ψ from a predefined prior, generate an observational distribution Y_obs by propagating Gaussian noise through ψ, and then create a context C consisting of K interventional distributions Y_int_k for randomly chosen treatments t_k. A query treatment t_q, unseen in the context, is also sampled, and its true post‑perturbation distribution Y_int_q is generated. The model is trained to predict Y_int_q directly from (Y_obs, C, t_q) using a Conditional Flow Matching (CFM) loss. The CFM objective interpolates between a standard Gaussian noise tensor Y_0 and the target distribution Y_int_q via a stochastic time variable τ drawn from a LogitNormal distribution, and minimizes the L2 distance between the model’s predicted intermediate state and the interpolated target. This formulation enables the network to learn a continuous probability flow that maps any observational‑context pair to a plausible post‑perturbation distribution without ever performing gradient‑based optimization at test time.
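The training objective above can be sketched in a few lines. Note one hedge: the summary describes an L2 loss between the model's predicted intermediate state and the interpolated target, while the sketch below uses the standard velocity-matching form of Conditional Flow Matching (the two are closely related reparameterizations); `model` is a stand-in for the MMDiT, and the LogitNormal parameters are illustrative defaults, not values from the paper.

```python
import numpy as np

def sample_logit_normal(rng, mu=0.0, sigma=1.0):
    """Draw an interpolation time tau in (0, 1) from a LogitNormal distribution."""
    return 1.0 / (1.0 + np.exp(-rng.normal(mu, sigma)))

def cfm_loss(model, y_obs, context, t_q, y_int_q, rng):
    """One Conditional Flow Matching training step on a single task.

    y_int_q: target post-perturbation expression matrix (cells x genes).
    The model is queried at a noisy interpolant y_tau and regressed, in L2,
    onto the straight-line velocity between noise and target.
    """
    y_0 = rng.standard_normal(y_int_q.shape)      # Gaussian noise endpoint Y_0
    tau = sample_logit_normal(rng)                # stochastic time variable
    y_tau = (1.0 - tau) * y_0 + tau * y_int_q     # linear interpolant
    v_target = y_int_q - y_0                      # constant velocity field
    v_pred = model(y_tau, tau, y_obs, context, t_q)
    return np.mean((v_pred - v_target) ** 2)
```

At inference, the learned flow is integrated from fresh Gaussian noise, conditioned on (Y_obs, C, t_q), to sample a post-perturbation distribution with no test-time gradient steps.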

A notable design choice is the use of “paired” synthetic data, where the same noise matrix N_k is reused across different interventions, effectively providing a counterfactual pairing between pre‑ and post‑perturbation cells. Experiments show that paired pre‑training yields superior downstream performance compared to unpaired sampling, suggesting that implicit alignment of cells across interventions aids the model’s ability to capture causal effects.
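The paired-sampling idea can be made concrete with a toy linear SCM, in the spirit of the paper's synthetic benchmark. This is a simplified sketch, assuming a strictly upper-triangular weight matrix as the DAG and modeling a knockout as clamping one gene to zero; the paper's prior over causal graphs and intervention mechanics may differ.

```python
import numpy as np

def sample_linear_scm(rng, d=20, edge_prob=0.3):
    """Random strictly upper-triangular weight matrix, i.e. a DAG over d genes."""
    W = rng.normal(size=(d, d)) * (rng.random((d, d)) < edge_prob)
    return np.triu(W, k=1)

def simulate(W, noise, knockout=None):
    """Propagate exogenous noise through the linear SCM in topological order.

    A knockout clamps one gene to zero before it influences its children,
    so downstream genes see the intervened value.
    """
    n, d = noise.shape
    X = np.zeros((n, d))
    for j in range(d):
        X[:, j] = X @ W[:, j] + noise[:, j]
        if knockout == j:
            X[:, j] = 0.0
    return X

rng = np.random.default_rng(0)
W = sample_linear_scm(rng)
N = rng.standard_normal((256, 20))     # shared exogenous noise matrix N_k
y_obs = simulate(W, N)                 # observational cells
y_int = simulate(W, N, knockout=5)     # paired: same noise, gene 5 knocked out
# Row i of y_obs and y_int form a counterfactual pair for the same "cell".
```

Reusing `N` across the observational and interventional simulations is exactly what makes rows counterfactually aligned; unpaired sampling would draw fresh noise per intervention.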

The authors evaluate MapPFN in two settings. The first is a controlled synthetic benchmark using linear SCMs with 20 nodes and varying edge probabilities. In both the few‑shot regime (a few observed interventions) and the zero‑shot regime (no observed interventions), MapPFN matches or exceeds state‑of‑the‑art methods based on optimal transport, generative adversarial networks, and recent causal PFNs on AUROC and mean absolute error. Importantly, MapPFN avoids the “identity collapse” failure mode that plagues many competing approaches, in which a model predicts the same output for every query.
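Identity collapse is easy to probe empirically. The diagnostic below is a hypothetical sketch (not a metric from the paper): it measures the mean pairwise distance between the mean expression profiles predicted for different query treatments, which is near zero when a model returns essentially the same output for every query.

```python
import numpy as np

def identity_collapse_score(predictions):
    """Mean pairwise L2 distance between per-query predicted mean profiles.

    predictions: list of (cells x genes) arrays, one per query treatment.
    A score near zero indicates the model predicts the same output for all
    queries, i.e. the identity-collapse failure mode.
    """
    means = np.stack([p.mean(axis=0) for p in predictions])
    d = np.linalg.norm(means[:, None] - means[None, :], axis=-1)
    return d[np.triu_indices(len(means), k=1)].mean()
```

A healthy perturbation model should produce a clearly positive score across a diverse set of query treatments.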

Second, the model is transferred to real single‑cell Perturb‑Seq data (Frangieh et al., 2021). Despite being trained exclusively on synthetic gene‑knockout data, MapPFN accurately identifies differentially expressed genes in unseen cell lines, attaining an AUROC of approximately 0.86, comparable to models that were trained directly on the real perturbation dataset. This synthetic‑to‑real transfer demonstrates that a sufficiently diverse synthetic prior can capture the statistical regularities needed for real biological systems, even without explicit knowledge of the underlying gene‑regulatory network.
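The AUROC for differentially-expressed-gene (DEG) identification can be computed as follows. This is an illustrative sketch, assuming genes are scored by the absolute shift in predicted mean expression and evaluated against ground-truth DEG labels; the paper's exact scoring and labeling protocol is not specified in this summary, and the rank-based AUROC below ignores ties.

```python
import numpy as np

def deg_scores(y_obs, y_pred):
    """Score each gene by the absolute shift in mean expression between
    observational cells and predicted post-perturbation cells."""
    return np.abs(y_pred.mean(axis=0) - y_obs.mean(axis=0))

def auroc(labels, scores):
    """Rank-based AUROC: Mann-Whitney U statistic divided by n_pos * n_neg."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    u = ranks[pos].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

An AUROC of ~0.86, as reported for the Frangieh et al. transfer experiment, means a randomly chosen true DEG outranks a randomly chosen non-DEG about 86% of the time under this scoring.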

Overall, MapPFN contributes a powerful, gradient‑free inference mechanism that can adapt to new biological contexts at test time simply by feeding a few observed perturbation examples. By framing perturbation prediction as a distribution‑to‑distribution mapping problem and leveraging the flexibility of PFNs and diffusion transformers, the work sidesteps the need for explicit causal graph inference, large labeled datasets, or costly fine‑tuning. Future directions include extending the synthetic prior to non‑linear SCMs, handling multi‑gene or drug‑dose interventions, and integrating additional modalities such as protein or epigenetic measurements. The approach opens a promising pathway toward scalable, foundation‑model‑style tools for in silico hypothesis testing in functional genomics and drug discovery.

