Amortized Causal Discovery with Prior-Fitted Networks
In recent years, differentiable penalized likelihood methods have gained popularity, learning the causal structure by maximizing the likelihood of the data under that structure. However, recent research has shown that errors in likelihood estimation, even at relatively large sample sizes, can prevent recovery of the correct structure. We propose a new approach to amortized causal discovery that addresses the limitations of likelihood estimator accuracy. Our method leverages Prior-Fitted Networks (PFNs) to amortize data-dependent likelihood estimation, yielding more reliable scores for structure learning. Experiments on synthetic, simulated, and real-world datasets show significant gains in structure recovery compared to standard baselines. Furthermore, we demonstrate directly that PFNs provide more accurate likelihood estimates than conventional neural-network-based approaches.
💡 Research Summary
The paper tackles a fundamental bottleneck in modern causal discovery pipelines: the accuracy of data‑dependent likelihood estimation. While differentiable penalized‑likelihood methods have become popular for learning causal graphs by maximizing a score of the form log p(D | G) − λ|G|, recent work has shown that even with relatively large samples, errors in the likelihood estimator can prevent recovery of the true structure, especially as graph size and density increase. To address this, the authors propose an amortized approach that separates likelihood estimation from graph optimization and replaces the former with Prior‑Fitted Networks (PFNs), specifically the TabPFN transformer model.
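The penalized score described above can be made concrete with a short sketch. This is a minimal illustration of the score form log p(D | G) − λ|G|, not the authors' implementation; the function name and the λ value are placeholders:

```python
import numpy as np

def penalized_score(per_node_loglik, adjacency, lam=0.1):
    """Penalized-likelihood graph score: log p(D | G) - lambda * |G|.

    per_node_loglik: one log-likelihood contribution per variable, each
    computed given that variable's parents in the candidate graph.
    adjacency: binary (d, d) matrix; |G| is the number of edges.
    """
    log_lik = float(np.sum(per_node_loglik))
    n_edges = int(adjacency.sum())
    return log_lik - lam * n_edges

A = np.array([[0, 1], [0, 0]])  # graph X0 -> X1, a single edge
score = penalized_score(np.array([-1.2, -0.8]), A, lam=0.1)
# score = (-1.2 - 0.8) - 0.1 * 1 = -2.1
```

Denser graphs can always raise the raw likelihood, so the λ-penalty is what makes sparser structures competitive.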
PFNs are large transformer models pre‑trained on a massive collection of synthetic Bayesian networks and structural causal models. During inference, a PFN receives both the training set and a test point as a single context and directly outputs an approximation of the posterior predictive distribution qθ(y | x, Dtrain). This “in‑context” inference eliminates the need to train a separate neural network for each dataset, dramatically reducing variance and bias in the estimated log‑likelihood, even when the available data are scarce.
For graph optimization, the authors adopt a continuous DAG sampling scheme based on two real‑valued matrices P and W. Using Gumbel‑Softmax and SoftSort, these matrices are transformed into a binary edge matrix E and a permutation matrix Π, which together with an upper‑triangular mask M produce an adjacency matrix A that is guaranteed to be acyclic. This sampling process is fully differentiable, allowing the use of gradient‑based reinforcement learning.
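The sampling scheme can be sketched in its hard-sample form. The paper uses the differentiable Gumbel-Softmax and SoftSort relaxations of P and W; the sketch below takes hard samples instead, purely to make the acyclicity-by-construction argument concrete, and the exact combination of Π, E, and M here is an assumption rather than the paper's code:

```python
import numpy as np

def sample_dag(P, W, rng):
    """Hard-sample a DAG from node scores P and edge scores W.

    P induces a distribution over node orderings (hard analogue of
    SoftSort); W induces edge probabilities (hard analogue of
    Gumbel-Softmax). Masking edges against the sampled ordering
    guarantees the result is acyclic.
    """
    d = len(P)
    # Gumbel-perturbed argsort samples an ordering from the scores.
    order = np.argsort(-(P + rng.gumbel(size=d)))
    rank = np.empty(d, dtype=int)
    rank[order] = np.arange(d)            # rank[i] = position of node i
    # Bernoulli edge samples with probability sigmoid(W[i, j]).
    E = (rng.uniform(size=(d, d)) < 1 / (1 + np.exp(-W))).astype(int)
    # Allow i -> j only when i precedes j in the sampled ordering:
    # this is the upper-triangular mask in permuted coordinates.
    M = (rank[:, None] < rank[None, :]).astype(int)
    return E * M

rng = np.random.default_rng(0)
A = sample_dag(np.zeros(4), np.zeros((4, 4)), rng)
# For a DAG, the diagonal of (I + A)^d counts only trivial closed walks,
# so the trace equals d.
assert np.trace(np.linalg.matrix_power(A + np.eye(4, dtype=int), 4)) == 4
```

In the differentiable version, the hard argsort and Bernoulli draws are replaced by their soft relaxations so gradients can flow into P and W.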
The workflow proceeds as follows: (1) split the observed dataset D into a training portion Dtrain and an estimation portion Dest; (2) sample a candidate DAG A from the current posterior; (3) for each variable i, extract its parent set paA(i) from A and fit a PFN in‑context using only the columns corresponding to paA(i) together with Dtrain; (4) compute the log‑likelihood log p_i(Dest | Dtrain, paA(i)) from the PFN output; (5) aggregate these per‑variable log‑likelihoods, subtract the λ‑penalty on the number of edges, and obtain a scalar reward ŝ(A, D); (6) store (A, log πθ(A), ŝ(A, D)) in a trajectory buffer; (7) after K such roll‑outs, update the posterior parameters (P, W) using Proximal Policy Optimization (PPO). Because the PFN likelihood estimates are non‑differentiable with respect to the graph, the reinforcement‑learning formulation naturally accommodates them as rewards while preserving differentiability in the policy (the DAG sampler).
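The scoring part of this loop, steps (1) through (6), can be sketched end to end. Since no PFN is available here, a simple Gaussian least-squares fit stands in for the in-context PFN likelihood; the function names, the toy data, and the λ value are all illustrative assumptions, and the PPO update of step (7) is omitted:

```python
import numpy as np

def standin_loglik(parents_train, y_train, parents_est, y_est):
    """Stand-in for the in-context PFN likelihood: a Gaussian fit.

    Fits y on its parents by least squares using Dtrain, then scores
    Dest under the fitted Gaussian. The actual method would query a
    pre-trained PFN (e.g. TabPFN) here instead of fitting anything.
    """
    X = np.hstack([parents_train, np.ones((len(y_train), 1))])
    beta, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    sigma2 = max(float((y_train - X @ beta).var()), 1e-6)
    Xe = np.hstack([parents_est, np.ones((len(y_est), 1))])
    err = y_est - Xe @ beta
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma2) - err**2 / (2 * sigma2)))

def reward(A, D_train, D_est, lam=0.1):
    """Steps (3)-(5): per-variable likelihoods minus the edge penalty."""
    total = 0.0
    for i in range(A.shape[0]):
        pa = np.flatnonzero(A[:, i])   # parents of variable i in A
        total += standin_loglik(D_train[:, pa], D_train[:, i],
                                D_est[:, pa], D_est[:, i])
    return total - lam * A.sum()

# Step (1): split the data. Steps (2) and (6): sample a DAG, buffer the roll-out.
rng = np.random.default_rng(1)
D = rng.normal(size=(200, 3))
D[:, 1] += 2.0 * D[:, 0]                           # toy SCM: X0 -> X1
D_train, D_est = D[:100], D[100:]
A = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])    # candidate graph X0 -> X1
buffer = [(A, reward(A, D_train, D_est))]          # (graph, reward); log-prob omitted
```

On this toy data the candidate graph containing the true edge scores higher than the empty graph, which is exactly the signal PPO needs to shift the posterior toward better structures.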
The authors first validate that PFN‑based likelihood estimates are more accurate and less variable than a baseline multilayer perceptron (MLP) trained specifically for this task. Using bootstrap resampling on synthetic data generated from a 5‑node Erdős‑Rényi graph, they show that PFNs achieve lower variance (both mean and median) and lower negative log‑likelihood (NLL) across a range of sample sizes (125 to 2000). Moreover, PFNs correctly rank the true graph higher than all alternatives for all but the smallest sample size, indicating superior fidelity of the score.
For full causal discovery, the method—named Amortized Causal Discovery (ACD)—is benchmarked against two strong baselines: the constraint‑based PC algorithm (using Fisher‑Z conditional independence tests) and DCDI, a differentiable causal discovery method that directly optimizes a penalized likelihood. Experiments span three data regimes: (i) synthetic Erdős‑Rényi graphs of varying sizes and densities, (ii) semi‑synthetic data from the SERGIO single‑cell gene expression simulator (scale‑free graphs with biologically realistic dynamics), and (iii) a real‑world physical system (Causal Chambers) with a known 20‑node DAG. Performance is measured by Structural Hamming Distance (SHD) between the estimated CPDAG and the ground‑truth CPDAG.
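For reference, SHD can be computed with a few lines. This is a minimal sketch of the standard definition (each node pair whose connection differs counts once, including orientation mismatches), not the exact evaluation code used in the paper:

```python
import numpy as np

def shd(A, B):
    """Structural Hamming Distance between two graphs on the same nodes.

    Graphs are (d, d) binary matrices; an undirected edge i - j (as in a
    CPDAG) is encoded by setting both A[i, j] and A[j, i]. A missing,
    extra, or differently oriented edge each contributes 1.
    """
    d = A.shape[0]
    dist = 0
    for i in range(d):
        for j in range(i + 1, d):
            if (A[i, j], A[j, i]) != (B[i, j], B[j, i]):
                dist += 1
    return dist

truth = np.array([[0, 1], [0, 0]])   # X0 -> X1
est   = np.array([[0, 0], [1, 0]])   # X1 -> X0: reversed orientation
# shd(truth, est) == 1
```

Comparing CPDAGs rather than DAGs, as the paper does, avoids penalizing methods for orientations that are not identifiable from observational data.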
Results show that ACD achieves the lowest SHD on synthetic Erdős‑Rényi benchmarks, substantially outperforming both PC and DCDI. On the SERGIO datasets, ACD’s SHD is comparable to the baselines, with overlapping 95 % confidence intervals, indicating competitive performance. On the Causal Chambers benchmark, ACD slightly trails PC but still beats DCDI. The authors attribute the strong synthetic performance to the alignment between TabPFN’s pre‑training distribution (synthetic Bayesian networks) and the synthetic test graphs. Conversely, the weaker performance on SERGIO and real data suggests a mis‑alignment between PFN’s prior and the true data‑generating processes, motivating future work on domain‑specific prior fine‑tuning or new PFN pre‑training regimes.
A key theoretical contribution is that, because only the likelihood estimation is amortized while the graph score remains the classic penalized likelihood, the method retains the identifiability guarantees established for penalized‑likelihood causal discovery (e.g., recovering the Markov equivalence class under mild assumptions). This contrasts with fully amortized or large‑language‑model‑based approaches that lack such guarantees. Moreover, the method requires minimal hyper‑parameter tuning and can be applied “out‑of‑the‑box” to a wide range of problems.
In summary, the paper demonstrates that leveraging pre‑trained PFNs to provide high‑quality, amortized likelihood estimates dramatically improves causal structure learning, especially in regimes where data are limited and traditional likelihood estimators are noisy. By integrating these estimates into a reinforcement‑learning‑based DAG sampler, the authors achieve state‑of‑the‑art performance on synthetic benchmarks and competitive results on semi‑synthetic and real datasets, while preserving theoretical identifiability. The work opens a promising avenue for combining foundation models with causal inference, suggesting that further alignment of PFN priors to specific scientific domains could yield even greater gains in practical causal discovery.