HOLOGRAPH: Active Causal Discovery via Sheaf-Theoretic Alignment of Large Language Model Priors

Reading time: 30 minutes
...

šŸ“ Original Paper Info

- Title: HOLOGRAPH: Active Causal Discovery via Sheaf-Theoretic Alignment of Large Language Model Priors
- ArXiv ID: 2512.24478
- Date: 2025-12-30
- Authors: Hyunjun Kim

šŸ“ Abstract

Causal discovery from observational data remains fundamentally limited by identifiability constraints. Recent work has explored leveraging Large Language Models (LLMs) as sources of prior causal knowledge, but existing approaches rely on heuristic integration that lacks theoretical grounding. We introduce HOLOGRAPH, a framework that formalizes LLM-guided causal discovery through sheaf theory--representing local causal beliefs as sections of a presheaf over variable subsets. Our key insight is that coherent global causal structure corresponds to the existence of a global section, while topological obstructions manifest as non-vanishing sheaf cohomology. We propose the Algebraic Latent Projection to handle hidden confounders and Natural Gradient Descent on the belief manifold for principled optimization. Experiments on synthetic and real-world benchmarks demonstrate that HOLOGRAPH provides rigorous mathematical foundations while achieving competitive performance on causal discovery tasks with 50-100 variables. Our sheaf-theoretic analysis reveals that while Identity, Transitivity, and Gluing axioms are satisfied to numerical precision (<10^{-6}), the Locality axiom fails for larger graphs, suggesting fundamental non-local coupling in latent variable projections. Code is available at [https://github.com/hyunjun1121/holograph](https://github.com/hyunjun1121/holograph).

💡 Summary & Analysis

**1. Key Contributions:**

  • **Sheaf-Theoretic Framework:** Holograph formalizes causal discovery using LLM outputs through sheaf theory.
  • **Natural Gradient Optimization:** Utilizes natural gradient descent for robust optimization.
  • **Active Query Selection:** Employs Expected Free Energy to select the most informative queries from LLMs.

2. Explanation:

  • Metaphor: Think of Holograph as a map for causal discovery. The roads connecting places are like information gathered from LLM outputs, and these roads must fit together perfectly to create the full map.
  • Scientific Explanation:
    • Sheaf theory is used to integrate LLM results coherently, preventing contradictions between different parts of the data.
    • Natural gradient descent ensures that optimization follows the path of least resistance, leading to reliable outcomes.
  • Technical Explanation:
    • Holograph combines sheaf theory and natural gradient descent for causal discovery. This method integrates LLM outputs cohesively and stabilizes the optimization process.

📄 Full Paper Content (ArXiv Source)

# Introduction

Causal discovery—the problem of inferring causal structure from data—is fundamental to scientific inquiry, yet remains provably underspecified without experimental intervention. Observational data alone can at most identify the Markov equivalence class of DAGs, and the presence of latent confounders further complicates identifiability. This has motivated recent interest in leveraging external knowledge sources, particularly Large Language Models (LLMs), which encode substantial causal knowledge from pretraining corpora.

However, existing approaches to LLM-guided causal discovery remain fundamentally heuristic. Prior work such as Democritus treats LLM outputs as "soft priors" integrated via post-hoc weighting, lacking principled treatment of:

  1. Coherence: How do we ensure local LLM beliefs about variable subsets combine into a globally consistent causal structure?

  2. Contradictions: What happens when the LLM provides conflicting information about overlapping variable subsets?

  3. Latent Variables: How do we project global causal models onto observed subsets while accounting for hidden confounders?

We propose Holograph (Holistic Optimization of Latent Observations via Gradient-based Restriction Alignment for Presheaf Harmony), a framework that addresses these challenges through the lens of sheaf theory. Our key insight is that local causal beliefs can be formalized as sections of a presheaf over the power set of variables. While full sheaf structure (including Locality) fails due to non-local latent coupling, we demonstrate that Identity, Transitivity, and Gluing axioms hold to numerical precision ($`< 10^{-6}`$), enabling coherent belief aggregation.

Contributions.

  1. Sheaf-Theoretic Framework: We formalize LLM-guided causal discovery as a presheaf satisfaction problem, where local sections are linear SEMs and restriction maps implement Algebraic Latent Projection.

  2. Natural Gradient Optimization: We derive a natural gradient descent algorithm on the belief manifold with Tikhonov regularization for numerical stability.

  3. Active Query Selection: We use Expected Free Energy (EFE) to select maximally informative LLM queries, balancing epistemic and instrumental value.

  4. Theoretical Analysis: We empirically verify that Identity, Transitivity, and Gluing axioms hold to numerical precision, while systematically identifying Locality violations arising from non-local latent coupling.

  5. Empirical Validation: Comprehensive experiments on synthetic (ER, SF) and real-world (Sachs, Asia) benchmarks, demonstrating +91% F1 improvement over NOTEARS in extreme low-data regimes ($`N \le 10`$) and +13.6% F1 improvement when using Holograph priors to regularize statistical methods.

  6. Implementation Verification: Complete mathematical verification that all 15 core formulas in the specification match the implementation to numerical precision (Appendix 6.6).

Key Finding 1: Locality Failure as Discovery.

Our sheaf exactness experiments (Section 4.5) reveal a striking result: while Identity ($`\rho_{UU} = \text{id}`$), Transitivity ($`\rho_{ZU} = \rho_{ZV} \circ \rho_{VU}`$), and Gluing axioms pass with errors $`< 10^{-6}`$, the Locality axiom systematically fails with errors scaling as $`\mathcal{O}(\sqrt{n})`$ with graph size. This is not a bug but a discovery: it reveals fundamental non-local information propagation through latent confounders. The failure quantitatively measures the "non-sheafness" of causal models under latent projections—a diagnostic that could guide when latent variable modeling is necessary.

Key Finding 2: Sample Efficiency & Hybrid Synergy.

Our sample efficiency experiments (Section 4.3) establish a clear decision boundary for when to use LLM-based discovery:

  • Low-data regime ($`N < 20`$): Holograph’s zero-shot approach achieves F1 = 0.67 on semantically rich domains, outperforming NOTEARS by up to +91% relative F1 when only $`N=5`$ samples are available.

  • Hybrid synergy: When some data is available ($`N = 10`$–$`50`$), using Holograph priors to regularize NOTEARS yields +13.6% F1 improvement by preventing overfitting to sparse observations.

  • Semantic advantage: Performance depends critically on LLM domain knowledge. On Asia (epidemiology with intuitive variable names), Holograph achieves F1 = 0.67; on Sachs (specialized protein signaling), only F1 = 0.20.

# Related Work

Continuous Optimization for Causal Discovery.

NOTEARS pioneered continuous optimization for DAG learning via the acyclicity constraint $`h(\mathbf{W}) = \mathop{\mathrm{tr}}(e^{\mathbf{W}\circ \mathbf{W}}) - n`$. Extensions include GOLEM with likelihood-based scoring and DAGMA using log-determinant characterizations. Holograph builds on this foundation, adding sheaf-theoretic consistency.

LLM-Guided Causal Discovery.

Recent work explores LLMs as causal knowledge sources: several studies benchmark LLMs on causal inference tasks, while others propose active querying strategies. Democritus uses LLM beliefs as soft priors but lacks principled treatment of coherence. Emerging "causal foundation models" aim to embed causality into LLM training, yet most approaches treat LLMs as "causal parrots" that recite knowledge without verification. Our sheaf-theoretic framework addresses this gap by providing formal coherence checking via presheaf descent conditions, enabling systematic detection of contradictions in LLM beliefs.

Active Learning for Causal Discovery.

Active intervention selection has been studied extensively, including work applying active learning to Bayesian networks. Our EFE-based query selection extends these ideas to the LLM querying setting, balancing epistemic uncertainty and instrumental value.

Latent Variable Models.

The FCI algorithm handles latent confounders via ancestral graphs. Recent work on ADMGs provides the graphical semantics underlying our causal states. The algebraic latent projection in Holograph provides an alternative continuous relaxation for latent variable marginalization.

Sheaf Theory in Machine Learning.

Sheaf neural networks apply sheaf theory to GNNs, and related work studies sheaf Laplacians for heterogeneous data. To our knowledge, Holograph is the first application of sheaf theory to causal discovery, using presheaf descent for belief coherence.

# Methodology

We now present the technical foundations of Holograph, proceeding from the mathematical framework to the optimization algorithm.

Presheaf of Causal Models

Let $`\mathcal{V} = \{X_1, \ldots, X_n\}`$ be a set of random variables. We define a presheaf $`\mathcal{F}`$ over the power set $`2^{\mathcal{V}}`$ (ordered by inclusion) whose sections are linear Structural Equation Models (SEMs).

Definition 1 (Causal State). A causal state over variable set $`U \subseteq \mathcal{V}`$ is a pair $`\theta_U = (\mathbf{W}_U, \mathbf{M}_U)`$ where:

  • $`\mathbf{W}_U \in \mathbb{R}^{|U| \times |U|}`$ is the weighted adjacency matrix of directed edges

  • $`\mathbf{M}_U = \mathbf{L}_U \mathbf{L}_U^\top \in \mathbb{R}^{|U| \times |U|}`$ is the error covariance matrix, with $`\mathbf{L}_U`$ lower-triangular (Cholesky factor)

The pair $`(\mathbf{W}, \mathbf{M})`$ corresponds to an Acyclic Directed Mixed Graph (ADMG) where directed edges encode causal effects and bidirected edges (encoded in $`\mathbf{M}`$) represent latent confounding.
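
To make Definition 1 concrete, here is a minimal PyTorch sketch of the $`(\mathbf{W}, \mathbf{L})`$ parametrization (the class name and initialization scale are illustrative assumptions, not the repository's actual API); parametrizing $`\mathbf{M} = \mathbf{L}\mathbf{L}^\top`$ with a lower-triangular $`\mathbf{L}`$ keeps the error covariance positive semi-definite by construction.

```python
import torch

class CausalState(torch.nn.Module):
    """Sketch of a causal state theta_U = (W_U, M_U) with M_U = L_U L_U^T."""

    def __init__(self, n_vars: int):
        super().__init__()
        self.W = torch.nn.Parameter(0.01 * torch.randn(n_vars, n_vars))      # directed edges
        self.L_raw = torch.nn.Parameter(0.01 * torch.randn(n_vars, n_vars))  # Cholesky factor

    @property
    def M(self) -> torch.Tensor:
        L = torch.tril(self.L_raw)   # keep only the lower triangle
        return L @ L.T               # error covariance, PSD by construction
```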

Probabilistic Model and Semantic Energy

To enable gradient-based optimization, we define a probabilistic model over LLM text observations $`y`$ given causal parameters $`\theta = (\mathbf{W}, \mathbf{L})`$.

Definition 2 (Gibbs Measure over Causal Structures). We model the LLM’s text generation process as a Gibbs measure:

```math
\begin{equation}
P(y | \theta) = \frac{1}{Z(\theta)} \exp\left( -\beta \, \mathcal{E}_{\text{sem}}(\theta, y) \right)
\label{eq:gibbs}
\end{equation}
```

where $`\beta > 0`$ is the inverse temperature and $`Z(\theta) = \int \exp(-\beta \, \mathcal{E}_{\text{sem}}(\theta, y')) \, dy'`$ is the partition function.

Definition 3 (Semantic Energy Function). The energy $`\mathcal{E}_{\text{sem}}`$ measures the distance between LLM text embedding $`\phi(y)`$ and graph structure embedding $`\Psi(\theta)`$ in a Reproducing Kernel Hilbert Space (RKHS) $`\mathcal{H}`$:

```math
\begin{equation}
\mathcal{E}_{\text{sem}}(\theta, y) = \| \phi(y) - \Psi(\mathbf{W}, \mathbf{M}) \|^2_{\mathcal{H}}
\label{eq:semantic-energy}
\end{equation}
```

where $`\phi: \text{Text} \to \mathcal{H}`$ embeds LLM responses via pre-trained encoders, and $`\Psi: (\mathbf{W}, \mathbf{M}) \to \mathcal{H}`$ encodes graph structure.

This formulation provides the probabilistic foundation for:

  1. Loss Function: The negative log-likelihood yields $`\mathcal{L}_{\text{sem}} = \beta \, \mathcal{E}_{\text{sem}} + \log Z`$, where we approximate $`Z`$ as constant during optimization.

  2. Fisher Information Matrix: The metric tensor $`\mathbf{G}(\theta)`$ arises naturally from this Gibbs measure (Section 3.7).

Remark 4 (Practical Implementation). In practice, we use cosine distance as a computationally efficient proxy for the RKHS norm. On the unit sphere (normalized embeddings), cosine distance satisfies $`d_{\cos}(\mathbf{u}, \mathbf{v}) = 1 - \langle \mathbf{u}, \mathbf{v} \rangle = \frac{1}{2}\|\mathbf{u} - \mathbf{v}\|^2`$, preserving the squared-distance structure of Eq. [eq:semantic-energy].
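
A hedged sketch of this cosine-distance proxy follows; the embedding maps $`\phi`$ and $`\Psi`$ are assumed to be supplied by pre-trained encoders elsewhere, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_energy(text_emb: torch.Tensor, graph_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-distance proxy for the semantic energy: on unit-normalized embeddings,
    1 - <u, v> equals half the squared Euclidean distance."""
    u = F.normalize(text_emb, dim=-1)   # phi(y), normalized
    v = F.normalize(graph_emb, dim=-1)  # Psi(W, M), normalized
    return 1.0 - (u * v).sum(dim=-1)
```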

Algebraic Latent Projection

The key technical contribution is the restriction morphism $`\rho_{UV}`$ that projects a causal state from a larger context $`U`$ to a smaller context $`V \subset U`$. When hidden variables exist in $`H = U \setminus V`$, we cannot simply truncate matrices; we must account for how hidden effects propagate through the causal structure.

Definition 5 (Algebraic Latent Projection). Given a causal state $`\theta = (\mathbf{W}, \mathbf{M})`$ over $`U`$ and observed subset $`O \subset U`$ with hidden variables $`H = U \setminus O`$, partition:

```math
\begin{equation}
\mathbf{W}= \begin{pmatrix} \mathbf{W}_{OO} & \mathbf{W}_{OH} \\ \mathbf{W}_{HO} & \mathbf{W}_{HH} \end{pmatrix}, \quad
\mathbf{M}= \begin{pmatrix} \mathbf{M}_{OO} & \mathbf{M}_{OH} \\ \mathbf{M}_{HO} & \mathbf{M}_{HH} \end{pmatrix}
\end{equation}
```

The absorption matrix is:

```math
\begin{equation}
\mathbf{A} = \mathbf{W}_{OH}(\mathbf{I} - \mathbf{W}_{HH})^{-1}
\label{eq:absorption}
\end{equation}
```

The projected causal state $`\rho_{UO}(\theta) = (\widetilde{\mathbf{W}}, \widetilde{\mathbf{M}})`$ is:

```math
\begin{align}
\widetilde{\mathbf{W}}&= \mathbf{W}_{OO} + \mathbf{A} \mathbf{W}_{HO} \label{eq:w-proj} \\
\widetilde{\mathbf{M}}&= \mathbf{M}_{OO} + \mathbf{A} \mathbf{M}_{HH} \mathbf{A}^\top + \mathbf{M}_{OH} \mathbf{A}^\top + \mathbf{A} \mathbf{M}_{HO} \label{eq:m-proj}
\end{align}
```

Remark 6 (Necessity of Cross-Terms). The cross-terms $`\mathbf{M}_{OH} \mathbf{A}^\top + \mathbf{A} \mathbf{M}_{HO}`$ in Eq. [eq:m-proj] are essential for satisfying the Transitivity axiom $`\rho_{ZU} = \rho_{ZV} \circ \rho_{VU}`$. Without these terms, the projection becomes $`\widetilde{\mathbf{M}}^{\text{naive}} = \mathbf{M}_{OO} + \mathbf{A} \mathbf{M}_{HH} \mathbf{A}^\top`$, which fails to account for correlations $`\text{Cov}(X_O, X_H)`$ between observed and hidden variables. This breaks composition: projecting $`U \to V \to Z`$ yields different results than $`U \to Z`$ directly. Our implementation verification (Appendix 6.6) confirms that including all four terms achieves Transitivity error $`< 10^{-6}`$, while ablating cross-terms results in errors $`> 0.1`$.

The absorption matrix $`\mathbf{A}`$ captures how effects from observed to hidden variables "bounce back" through the hidden subgraph. The condition $`\rho(\mathbf{W}_{HH}) < 1`$ (spectral radius $`< 1`$) ensures the Neumann series $`(I - \mathbf{W}_{HH})^{-1} = \sum_{k=0}^\infty \mathbf{W}_{HH}^k`$ converges, corresponding to acyclicity among hidden variables.
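
A minimal sketch of the Algebraic Latent Projection under these definitions (the index handling, the linear-solve formulation, and the small $`\epsilon`$ jitter are illustrative choices, not the repository's exact implementation in sheaf.py):

```python
import torch

def algebraic_latent_projection(W, M, obs_idx, hid_idx, eps: float = 1e-6):
    """Project a causal state (W, M) over U onto the observed subset O,
    absorbing the hidden variables H = U \\ O (Definition 5)."""
    W_OO, W_OH = W[obs_idx][:, obs_idx], W[obs_idx][:, hid_idx]
    W_HO, W_HH = W[hid_idx][:, obs_idx], W[hid_idx][:, hid_idx]
    M_OO, M_OH = M[obs_idx][:, obs_idx], M[obs_idx][:, hid_idx]
    M_HO, M_HH = M[hid_idx][:, obs_idx], M[hid_idx][:, hid_idx]

    I = torch.eye(len(hid_idx), dtype=W.dtype)
    # A = W_OH (I - W_HH)^{-1}, computed via a linear solve for numerical stability:
    # solve (I - W_HH)^T X = W_OH^T, then transpose.
    A = torch.linalg.solve((I - W_HH + eps * I).T, W_OH.T).T

    W_proj = W_OO + A @ W_HO
    M_proj = M_OO + A @ M_HH @ A.T + M_OH @ A.T + A @ M_HO  # cross-terms included (Remark 6)
    return W_proj, M_proj
```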

Frobenius Descent Condition

For the presheaf to be coherent, sections over overlapping contexts must agree on their intersection. Given contexts $`U_i, U_j`$ with intersection $`V_{ij} = U_i \cap U_j`$, the Frobenius descent loss is:

```math
\begin{equation}
\mathcal{L}_{\text{descent}} = \sum_{i,j} \left( \left\|\rho_{V_{ij}}(\theta_i) - \rho_{V_{ij}}(\theta_j)\right\|_F^2 \right)
\label{eq:descent-loss}
\end{equation}
```

where $`\left\|\cdot\right\|_F`$ denotes the Frobenius norm. This loss penalizes inconsistencies when projecting local beliefs onto their overlaps.
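
A sketch of the descent loss under the same assumptions; `project` is assumed to restrict a local state to a sub-context (for example via the `algebraic_latent_projection` sketch above), and the nested loop over context pairs is illustrative rather than optimized.

```python
import torch

def descent_loss(states, contexts, project):
    """Frobenius descent loss: squared disagreement of overlapping local sections
    after restriction to their intersections."""
    loss = torch.tensor(0.0)
    for i in range(len(contexts)):
        for j in range(i + 1, len(contexts)):
            overlap = sorted(set(contexts[i]) & set(contexts[j]))
            if not overlap:
                continue
            Wi, Mi = project(states[i], contexts[i], overlap)
            Wj, Mj = project(states[j], contexts[j], overlap)
            loss = loss + torch.sum((Wi - Wj) ** 2) + torch.sum((Mi - Mj) ** 2)
    return loss
```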

Spectral Regularization

The Algebraic Latent Projection (Section 3.3) requires computing $`(\mathbf{I} - \mathbf{W}_{HH})^{-1}`$ via the Neumann series:

```math
\begin{equation}
(\mathbf{I} - \mathbf{W}_{HH})^{-1} = \sum_{k=0}^{\infty} \mathbf{W}_{HH}^k
\label{eq:neumann-series}
\end{equation}
```

This series converges if and only if the spectral radius $`\rho(\mathbf{W}_{HH}) < 1`$. To enforce this condition during optimization, we impose a spectral penalty.

Definition 7 (Spectral Stability Regularization). We penalize violations of the spectral constraint:

```math
\begin{equation}
\mathcal{L}_{\text{spec}}(\mathbf{W}) = \max(0, \rho(\mathbf{W}) - 1 + \delta)^2
\label{eq:spectral-exact}
\end{equation}
```

where $`\delta = 0.1`$ is a safety margin ensuring $`\rho(\mathbf{W}) < 0.9`$.

Computational Approximation.

Computing $`\rho(\mathbf{W})`$ via eigenvalue decomposition is expensive ($`O(n^3)`$) and can produce unstable gradients. We use the Frobenius norm as a differentiable upper bound:

```math
\begin{equation}
\mathcal{L}_{\text{spec}}(\mathbf{W}) = \max(0, \left\|\mathbf{W}\right\|_F - (1 - \delta))^2
\label{eq:spectral}
\end{equation}
```

This is valid because $`\left\|\mathbf{W}\right\|_F = \sqrt{\sum_{ij} w_{ij}^2} \geq \sigma_{\max}(\mathbf{W}) \geq \rho(\mathbf{W})`$, providing a conservative (over-penalizing) but differentiable bound.
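
A minimal sketch of this Frobenius-norm surrogate (function name and defaults are illustrative):

```python
import torch

def spectral_penalty(W: torch.Tensor, delta: float = 0.1) -> torch.Tensor:
    """Frobenius surrogate for the spectral constraint: since ||W||_F >= rho(W),
    penalizing ||W||_F > 1 - delta conservatively keeps the spectral radius below 1."""
    excess = torch.linalg.norm(W, ord="fro") - (1.0 - delta)
    return torch.clamp(excess, min=0.0) ** 2
```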

Why This Matters.

Without spectral regularization, $`\rho(\mathbf{W}_{HH})`$ can approach 1 during optimization, causing: (1) numerical overflow in absorption matrix computation, (2) gradient explosion preventing convergence, and (3) invalid ADMG representations violating acyclicity among hidden variables.

Acyclicity Constraint

We enforce acyclicity using the NOTEARS constraint:

```math
\begin{equation}
h(\mathbf{W}) = \mathop{\mathrm{tr}}(e^{\mathbf{W}\circ \mathbf{W}}) - n = 0
\label{eq:notears}
\end{equation}
```

where $`\circ`$ denotes element-wise product. This continuous relaxation equals zero if and only if $`\mathbf{W}`$ encodes a DAG.
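
For reference, the constraint is a one-liner in PyTorch (a sketch; the repository computes it in scm.py):

```python
import torch

def notears_h(W: torch.Tensor) -> torch.Tensor:
    """h(W) = tr(exp(W ∘ W)) - n, which is zero iff W encodes a DAG."""
    return torch.trace(torch.linalg.matrix_exp(W * W)) - W.shape[0]
```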

Natural Gradient Descent

Standard gradient descent on the belief parameters $`\theta = (\mathbf{W}, \mathbf{L})`$ ignores the geometry of the parameter space. We employ natural gradient descent, which uses the Fisher Information Matrix as a Riemannian metric.

Fisher Metric from Gibbs Measure.

For the Gibbs measure $`P(y|\theta)`$ defined in Eq. [eq:gibbs], the Fisher Information Matrix is:

```math
\begin{equation}
\mathbf{G}(\theta) = \mathbb{E}_{y \sim P(\cdot|\theta)}\left[(\nabla_\theta \log P(y|\theta))(\nabla_\theta \log P(y|\theta))^\top\right]
\label{eq:fisher-exact}
\end{equation}
```

Expanding the gradient of the log-probability: $`\nabla_\theta \log P(y|\theta) = -\beta \nabla_\theta \mathcal{E}_{\text{sem}}(\theta, y) - \nabla_\theta \log Z(\theta)`$. Assuming quasi-static dynamics where $`Z`$ varies slowly, we approximate:

```math
\begin{equation}
\mathbf{G}(\theta) \approx \beta^2 \, \mathbb{E}_y\left[(\nabla_\theta \mathcal{E}_{\text{sem}})(\nabla_\theta \mathcal{E}_{\text{sem}})^\top\right]
\label{eq:fisher-approx}
\end{equation}
```

Tikhonov Regularization for Unidentifiable Regions.

The Fisher matrix becomes singular in regions where causal effects are unidentifiable. We apply Tikhonov damping:

```math
\begin{equation}
\mathbf{G}_{\text{reg}}(\theta) = \mathbf{G}(\theta) + \lambda_{\text{reg}} \mathbf{I}
\label{eq:fisher-reg}
\end{equation}
```

with $`\lambda_{\text{reg}} = 10^{-4}`$. This ensures $`\mathbf{G}_{\text{reg}}`$ remains invertible, allowing Natural Gradient Descent to traverse unidentifiable regions smoothly—a critical property when latent confounders render certain edges non-identifiable.

Natural Gradient Update Rule.

The update equation is:

```math
\begin{equation}
\theta_{t+1} = \theta_t - \eta \cdot \mathbf{G}_{\text{reg}}(\theta_t)^{-1} \nabla_\theta \mathcal{L}
\label{eq:natural-grad}
\end{equation}
```

Diagonal Approximation.

For computational efficiency with $`O(n^2)`$ parameters, we use a diagonal approximation:

```math
\begin{equation}
\mathbf{G}_{\text{diag}} = \text{diag}\left(\mathbb{E}\left[(\nabla \mathcal{E}_{\text{sem}})^2\right]\right) + \lambda_{\text{reg}} \mathbf{I}
\label{eq:fisher-diag}
\end{equation}
```

updated via exponential moving average, reducing storage from $`O(D^2)`$ to $`O(D)`$.
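
A sketch of the resulting update with a diagonal, EMA-estimated Fisher; the flattened-parameter representation and the EMA rate are illustrative assumptions.

```python
import torch

def natural_gradient_step(theta, grad, fisher_diag, lr=0.01, lam_reg=1e-4, ema=0.95):
    """Diagonal natural-gradient update: precondition the gradient by a running
    estimate of the Fisher diagonal, damped with Tikhonov regularization."""
    fisher_diag = ema * fisher_diag + (1.0 - ema) * grad ** 2   # EMA of squared gradients
    theta_new = theta - lr * grad / (fisher_diag + lam_reg)     # G_reg^{-1} * gradient
    return theta_new, fisher_diag
```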

Total Loss Function

The complete objective combines all components:

```math
\begin{equation}
\mathcal{L} = \mathcal{L}_{\text{sem}} + \lambda_d \mathcal{L}_{\text{descent}} + \lambda_a h(\mathbf{W}) + \lambda_s \mathcal{L}_{\text{spec}}
\label{eq:total-loss}
\end{equation}
```

where $`\mathcal{L}_{\text{sem}}`$ is the semantic energy between LLM embeddings and graph structure, and $`\lambda_d = 1.0`$, $`\lambda_a = 1.0`$, $`\lambda_s = 0.1`$ are balancing weights.

Active Query Selection via Expected Free Energy

To efficiently utilize LLM queries, we employ an active learning strategy based on Expected Free Energy (EFE) from active inference:

```math
\begin{equation}
G(a) = \underbrace{\mathbb{E}_{q(s'|a)}[\text{KL}[q(o|s')\|p(o)]]}_{\text{Epistemic Value}} + \underbrace{\mathbb{E}_{q(o|a)}[\log q(o|a)]}_{\text{Instrumental Value}}
\label{eq:efe}
\end{equation}
```

For each candidate query about edge $`(i,j)`$:

  • Epistemic value: Uncertainty in current edge belief, measured by proximity to decision boundary: $`u_{ij} = 1 - 2|w_{ij} - 0.5|`$

  • Instrumental value: Expected impact on descent loss reduction

Queries are selected to minimize EFE, prioritizing high-uncertainty edges with potential to resolve descent conflicts.
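
A minimal sketch of the query scoring described above, covering only the epistemic term; the instrumental term from Eq. [eq:efe] and the LLM prompt construction are omitted, and the function names are illustrative.

```python
import torch

def epistemic_uncertainty(W: torch.Tensor) -> torch.Tensor:
    """u_ij = 1 - 2*|w_ij - 0.5|: highest for edge beliefs closest to the 0.5 boundary."""
    return 1.0 - 2.0 * torch.abs(W - 0.5)

def select_edge_queries(W: torch.Tensor, k: int = 3):
    """Return the k (i, j) pairs with the most uncertain edge beliefs."""
    u = epistemic_uncertainty(W).flatten()
    n = W.shape[0]
    top = torch.topk(u, k).indices
    return [(int(idx) // n, int(idx) % n) for idx in top]
```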

Sheaf Axiom Verification

We verify four presheaf axioms empirically:

  1. Identity: $`\rho_{UU} = \text{id}_U`$ (projection onto self is identity)

  2. Transitivity: $`\rho_{ZU} = \rho_{ZV} \circ \rho_{VU}`$ for $`Z \subset V \subset U`$

  3. Locality: Sections over $`U`$ are determined by restrictions to an open cover

  4. Gluing: Compatible local sections glue to a unique global section

Section 4.5 presents empirical results showing Identity, Transitivity, and Gluing pass to numerical precision, while Locality systematically fails for latent projections.

# Experiments

We evaluate Holograph on synthetic and real-world causal discovery benchmarks, with particular focus on sheaf axiom verification and ablation studies.

Experimental Setup

Datasets.

We evaluate on five dataset types:

  • ER (Erdős-Rényi): Random graphs with edge probability $`p \in \{0.15, 0.2\}`$

  • SF (Scale-Free): Barabási-Albert preferential attachment with average degree 2.0

  • Asia: Pearl’s epidemiology network with 8 semantically meaningful variables (e.g., Tuberculosis, Smoking, Lung_Cancer)

  • Sachs: Real-world protein signaling network with 11 variables

  • Latent: Synthetic graphs with hidden confounders (3–8 latent variables)

Baselines.

We compare against ablated versions of Holograph:

  • A1: Standard SGD instead of Natural Gradient

  • A2: Without Frobenius descent loss ($`\lambda_d = 0`$)

  • A3: Without spectral regularization ($`\lambda_s = 0`$)

  • A4: Random queries instead of EFE-based selection

  • A5: Fast model (thinking-off) instead of primary reasoning model

  • A6: Pure optimization without LLM guidance

Metrics.

  • SHD (Structural Hamming Distance): Number of edge additions/deletions/reversals

  • F1: Harmonic mean of precision and recall

  • SID (Structural Intervention Distance): Interventional disagreement count

Infrastructure.

All experiments run on NVIDIA V100 GPUs via SLURM on the IZAR cluster. LLM queries use DeepSeek-V3.2-Exp with thinking enabled via SGLang gateway. Each configuration runs with 5 random seeds (42–46).

Main Results

Table 1 presents benchmark results comparing Holograph against NOTEARS. Critically, this comparison reveals the gap between data-driven discovery (NOTEARS uses 1000 observational samples) and knowledge-driven discovery (Holograph uses only LLM priors without data).

Main benchmark results (τ = 0.05). NOTEARS uses N = 1000 observational samples; Holograph uses only LLM priors (zero data). Mean ± std over 5 seeds.
| Dataset | Method | SHD ↓ | F1 ↑ | Data |
|---|---|---|---|---|
| ER-20 | NOTEARS | 6.6±4.3 | .90±.05 | N=1000 |
| ER-20 | Holograph | 74.4±6.3 | .08±.03 | none |
| ER-50 | NOTEARS | 48.6±13 | .88±.03 | N=1000 |
| ER-50 | Holograph | 299±12 | .05±.01 | none |
| SF-50 | NOTEARS | 9.2±3.7 | .91±.03 | N=1000 |
| SF-50 | Holograph | 159±8.3 | .02±.01 | none |
| **Asia** | NOTEARS | 0.0±0.0 | 1.00±.00 | N=1000 |
| **Asia** | Holograph | 6.0±0.0 | .67±.00 | none |
| Sachs | NOTEARS | 6.4±1.0 | .83±.02 | N=1000 |
| Sachs | Holograph | 25.4±5.3 | .20±.05 | none |

Interpretation.

As expected, NOTEARS with access to abundant observational data ($`N=1000`$) substantially outperforms Holograph’s zero-shot approach on most benchmarks. However, the key insight emerges from the Asia dataset (highlighted row): Holograph achieves F1 = 0.67 without any data, purely from LLM semantic priors. This demonstrates that for semantically rich domains, LLM knowledge can substitute for observational data.

The key findings are:

  1. Semantic domains enable strong priors: On Asia (epidemiology with meaningful variable names like Tuberculosis, Smoking), Holograph recovers 67% F1 zero-shot—over 3$`\times`$ higher than on Sachs (20% F1). This gap reflects the quality of LLM domain knowledge.

  2. Synthetic graphs lack semantic signal: On ER/SF graphs with arbitrary variable names (X0, X1, …), LLM priors provide minimal guidance (F1 $`< 0.1`$). This is expected—LLMs have no domain knowledge for anonymous variables.

  3. Technical domains are harder: Sachs uses protein names (e.g., Raf, Mek, Erk) that require specialized biochemistry knowledge, resulting in weaker LLM priors compared to general epidemiology concepts.

  4. Sheaf coherence ensures consistency: The presheaf descent framework unifies potentially contradictory LLM responses into globally consistent structures.

Threshold Calibration.

Due to the spectral radius constraint ($`\rho(\mathbf{W}) < 1`$) required for Neumann series convergence in the Algebraic Latent Projection, learned edge weights are compressed relative to ground truth. We use a calibrated threshold $`\tau = 0.05`$ (rather than the ground truth generation threshold of 0.3) to ensure fair structural evaluation. See Appendix 6.7 for sensitivity analysis.

Sample Efficiency: The Low-Data Advantage

A critical question emerges: at what sample size does data-driven discovery match LLM-based discovery? We investigate this crossover point on the Asia dataset, where Holograph achieves strong zero-shot performance (F1 = 0.67).

| $`N`$ | NOTEARS F1 | Holograph F1 | $`\Delta`$ |
|---|---|---|---|
| 5 | .35±.11 | **.67±.00** | +91% |
| 10 | .55±.13 | **.67±.00** | +20% |
| 20 | .70±.09 | .67±.00 | -4% |
| 50 | **.92±.07** | .67±.00 | -27% |

Sample efficiency on Asia dataset. Holograph is sample-invariant; NOTEARS improves with data. The crossover occurs at $`N \approx 15`$–$`20`$ samples.

Table 2 reveals a striking pattern:

  1. Extreme low-data regime ($`N \le 10`$): Holograph dramatically outperforms NOTEARS. At $`N=5`$ samples, the improvement is +91% relative F1—statistical methods fundamentally cannot learn structure from so few observations.

  2. Crossover at $`N \approx 15`$–$`20`$: Below this threshold, LLM priors dominate; above it, data-driven methods rapidly improve and eventually surpass zero-shot performance.

  3. Sample invariance: Holograph’s F1 is constant across all $`N`$ (as expected for a zero-shot method), providing a floor guarantee regardless of data availability.

Practical Implication.

These results establish a clear decision boundary: when $`N < 20`$ samples are available for a semantically rich domain, Holograph’s zero-shot approach is preferable to training NOTEARS on insufficient data.

Hybrid Synergy: LLM Priors as Regularization

Can LLM priors complement rather than replace statistical methods? We test a hybrid approach: use Holograph’s learned adjacency matrix to regularize NOTEARS optimization. Specifically, we apply confidence filtering—only edges with $`|W_{ij}| > 0.3`$ in the Holograph prior contribute to regularization.

| $`N`$ | Vanilla F1 | Hybrid F1 | Improvement |
|---|---|---|---|
| 10 | .56±.08 | **.61±.09** | +9.4% |
| 20 | .71±.08 | **.80±.06** | +13.6% |
| 50 | .94±.04 | .95±.04 | +1.3% |

Hybrid method results on Asia (low-data regime). NOTEARS + Holograph prior outperforms vanilla NOTEARS when data is scarce.

Table 3 demonstrates substantial synergy in the low-data regime:

  1. Maximum benefit at $`N=20`$: The hybrid method achieves +13.6% F1 improvement (0.71 $`\to`$ 0.80), with the Holograph prior providing regularization that prevents overfitting to limited samples.

  2. Complementary strengths: At $`N=10`$, vanilla NOTEARS achieves only F1 = 0.56 due to overfitting, while the hybrid recovers 0.61—the LLM prior acts as an inductive bias toward semantically plausible structures.

  3. Diminishing returns: At $`N=50`$, the improvement shrinks to +1.3% as statistical evidence dominates. The prior becomes less necessary when data is abundant.

Mechanism of Improvement.

The confidence filtering threshold ($`|W| > 0.3`$) ensures only high-confidence Holograph edges contribute to regularization. This prevents noisy LLM beliefs from corrupting the optimization while preserving strong semantic signals.
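
A hedged sketch of how such a confidence-filtered prior could enter a NOTEARS-style objective; the least-squares data-fit term, the penalty form, and `lam_prior` are illustrative assumptions, not the exact hybrid implementation.

```python
import torch

def hybrid_objective(W, X, W_prior, lam_prior=0.1, conf_threshold=0.3):
    """Least-squares linear-SEM fit to data X plus a penalty pulling W toward
    high-confidence Holograph prior edges (|W_prior| > conf_threshold)."""
    n_samples = X.shape[0]
    data_fit = torch.sum((X - X @ W) ** 2) / (2.0 * n_samples)
    mask = (W_prior.abs() > conf_threshold).float()       # confidence filtering
    prior_reg = torch.sum(mask * (W - W_prior) ** 2)      # only confident edges contribute
    return data_fit + lam_prior * prior_reg
```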

Remark 8 (When Hybrid Fails). On Sachs (protein signaling), the hybrid method does not improve over vanilla NOTEARS (see Appendix 6.8). This occurs because Holograph’s prior on Sachs is weak (F1 = 0.20)—using a poor prior as regularization can hurt rather than help. The hybrid approach is most effective when the LLM has strong domain knowledge.

Sheaf Axiom Verification

Table 4 presents results from sheaf exactness experiments (X1–X4).

| $`n`$ | Identity | Transitivity | Locality | Gluing |
|---|---|---|---|---|
| 30 | 100% | 100% | 0% (err: 1.25) | 100% |
| 50 | 100% | 100% | 0% (err: 2.38) | 100% |
| 100 | 100% | 100% | 0% (err: 3.45) | 100% |

Sheaf axiom pass rates across graph sizes. Threshold: $`10^{-6}`$.

Key Findings.

  1. Identity and Transitivity: Both axioms pass with errors $`< 10^{-6}`$ across all graph sizes, confirming mathematically correct implementation of the Algebraic Latent Projection. This validates the cross-term inclusion in Eq. [eq:m-proj] (see Remark 6 and Appendix 6.6 for implementation verification).

  2. Gluing: The gluing axiom (compatible local sections yield unique global section) passes uniformly, validating the Frobenius descent loss formulation.

  3. Locality Failure as Discovery: The locality axiom systematically fails with errors scaling approximately as $`\mathcal{O}(\sqrt{n})`$ with graph size.

    Interpretation: This is not an implementation bug, but a fundamental property of ADMGs with latent confounders. Latent variables create non-local correlations: knowledge about variable subset $`A`$ constrains beliefs about distant subset $`B`$ through hidden mediators, violating the principle that "local data determines local structure."

Significance of Locality Failure.

This finding demonstrates that the presheaf of ADMGs under algebraic latent projection does not form a classical sheaf. The failure quantitatively measures the "non-sheafness" introduced by latent confounding—a property that could serve as a diagnostic for the necessity of latent variable modeling.

Remark 9 (Connection to Non-Local Phenomena). The scaling behavior $`\text{Locality Error} \propto \sqrt{n}`$ echoes patterns in quantum entanglement, where Bell inequality violations scale with system size. While we do not claim a direct connection, both phenomena involve fundamentally non-local correlations that resist local factorization—an intriguing parallel for future theoretical investigation.

Ablation Studies

Table 5 compares ablation variants on ER-50 and Sachs using F1 score.

| Variant | ER-50 F1 ↑ | Sachs F1 ↑ |
|---|---|---|
| Full Holograph | .052±.009 | .202±.052 |
| A1: Standard SGD | .068±.013 | .202±.052 |
| A2: No descent loss | .068±.013 | .202±.052 |
| A3: No spectral reg. | .108±.020 | .202±.052 |
| A4: Random queries | .070±.022 | .189±.088 |
| A5: Fast model | .071±.025 | **.269±.077** |
| A6: No LLM | .070±.022 | .189±.088 |

Ablation results: F1 score comparison ($`\tau=0.05`$). Higher is better.

Key Findings.

The ablation results reveal nuanced trade-offs:

  1. Spectral regularization trades off with F1: Removing spectral regularization (A3) increases F1 on ER-50 (0.108 vs 0.052), but at the cost of numerical stability. This suggests the strict $`\rho(\mathbf{W}) < 0.9`$ constraint may be overly conservative.

  2. LLM guidance helps on real data: On Sachs, variants with LLM guidance (Full, A1–A3) outperform those without (A4, A6), confirming the value of domain knowledge for real-world networks.

  3. Active query selection matters: A4 (random queries) and A6 (no LLM) show similar performance, suggesting that EFE-based query selection effectively prioritizes informative edges.

  4. Fast model performs surprisingly well: A5 (thinking-off) achieves the highest F1 on Sachs (0.269), suggesting that for well-known domains, simple LLM responses may suffice without extended reasoning.

Interpretation.

The ablation results highlight a key insight: the full Holograph configuration prioritizes numerical stability (via spectral regularization) and theoretical coherence (via Natural Gradient and descent loss) over raw F1 performance. Removing these constraints can improve F1 but may produce unstable or incoherent causal graphs. The choice depends on downstream requirements.

Hidden Confounder Experiments

Table [tab:latent] presents results on graphs with hidden confounders (E3). These experiments test Holograph’s ability to recover structure in the presence of latent variables using the Algebraic Latent Projection.

The 50-observed/8-latent configuration shows high variance in runtime, reflecting the stochastic nature of LLM-guided optimization. Increasing latent variables proportionally increases structural error, confirming the fundamental difficulty of latent confounder identification.

Rashomon Stress Test

The Rashomon experiment (E5) tests contradiction detection and resolution under latent confounding. With 30 observed and 5 latent variables, Holograph achieves:

  • SHD: $`89.8 \pm 5.7`$

  • 100 queries utilized (budget exhausted)

  • Final loss: $`1.6 \times 10^{-4}`$

The system correctly identifies topological obstructions when descent loss plateaus, triggering latent variable proposals. However, resolution rates remain below target ($`<70\%`$), indicating room for improvement in latent variable initialization strategies.

# Conclusion

We presented Holograph, a sheaf-theoretic framework for LLM-guided causal discovery. By formalizing local causal beliefs as presheaf sections and global consistency as descent conditions, we provide principled foundations for integrating LLM knowledge into structure learning.

Our key contributions include:

  • The Algebraic Latent Projection for handling hidden confounders

  • Natural gradient descent with Tikhonov regularization for optimization

  • EFE-based active query selection for efficient LLM utilization

  • Comprehensive sheaf axiom verification revealing fundamental locality failures

The systematic failure of the Locality axiom is perhaps our most significant finding. It demonstrates that the presheaf of ADMGs does not form a classical sheaf when latent variables induce non-local coupling. This provides a formal measure of the "non-sheafness" inherent in causal models with hidden confounders—a quantity that could guide future algorithms in detecting latent variable necessity.

Limitations.

  • Scalability: Performance on graphs with $`n > 100`$ variables degrades due to $`O(n^3)`$ projection costs. Sparse approximations may help.

  • LLM Reliability: Current approach assumes LLM responses are locally consistent. Adversarially contradictory LLMs could violate this assumption.

  • Identifiability: As with all causal discovery methods, we can only recover structure up to Markov equivalence without interventional data.

Future Work.

Promising directions include:

  1. Cohomological Measures: Develop sheaf cohomology metrics to quantify Locality violations, potentially using Čech cohomology.

  2. Hybrid Methods: Combine Holograph with constraint-based algorithms (e.g., FCI) to leverage both continuous optimization and discrete constraint propagation.

  3. Interventional Extensions: Extend to experimental design settings where interventions can be performed, potentially enabling full causal identification.

Speculative Connections.

We note a suggestive parallel between our Locality failure and quantum non-locality. In quantum mechanics, entangled systems violate Bell inequalities through correlations that resist local hidden variable explanations. Similarly, ADMGs with latent confounders exhibit correlations between distant variables that cannot be explained by local restrictions. The scaling $`\text{Error} \propto \sqrt{n}`$ in both settings hints at deeper mathematical connections—a direction for future theoretical exploration.

# Appendix

Hyperparameters and Configuration

Table [tab:hyperparams] lists all hyperparameters used in experiments. Values are sourced from experiments/config/constants.py.

| Parameter | Value | Description |
|---|---|---|
| **Optimization** | | |
| Learning rate | 0.01 | Step size for gradient descent |
| λd (descent) | 1.0 | Frobenius descent loss weight |
| λs (spectral) | 0.1 | Spectral regularization weight |
| λa (acyclic) | 1.0 | Acyclicity constraint weight |
| λreg (Tikhonov) | $`10^{-4}`$ | Fisher regularization |
| Max steps | 1500 | Maximum training iterations |
| **Numerical Stability** | | |
| ϵ (matrix) | $`10^{-6}`$ | Regularization for inversions |
| Spectral margin δ | 0.1 | Safety margin for ρ(W) < 1 |
| Fisher min value | 0.01 | Minimum Fisher diagonal entry |
| **Query Generation** | | |
| Max queries/step | 3–5 | Queries per optimization step |
| Query interval | 25–75 | Steps between query batches |
| Max total queries | 100 | Hard budget limit |
| Max total tokens | 500,000 | Token budget limit |
| Uncertainty threshold | 0.3 | Minimum EFE for query selection |
| **Edge Thresholds** | | |
| Edge threshold | 0.01 | Minimum for edge existence |
| Discretization threshold | 0.3 | For binary adjacency output |
| **LLM Configuration** | | |
| Provider | SGLang | Unified API gateway |
| Model | DeepSeek-V3.2-Exp | Primary reasoning model |
| Temperature | 0.1 | Low for deterministic reasoning |
| Max tokens | 4096 | Response length limit |

Infrastructure Details

Cluster.

Experiments ran on the IZAR cluster at EPFL/SCITAS with:

  • GPU: NVIDIA Tesla V100 (32GB HBM2)

  • CPU: Intel Xeon Gold 6140 (18 cores per node)

  • Memory: 192GB RAM per node

  • Scheduler: SLURM with array jobs for parallelization

Runtime Statistics.

  • Small experiments (n=20, Sachs): $`<1`$ second

  • Medium experiments (n=50, ER/SF): $`\sim`$30 seconds

  • Large latent experiments (n=50+8): 30–60 minutes

  • Total GPU hours: $`\sim`$50 hours across 160 experiments

LLM Gateway.

We use SGLang to provide a unified OpenAI-compatible API:

  • Primary model: DeepSeek-V3.2-Exp (thinking-on)

  • Endpoint: Custom gateway at port 10000

  • Rate limiting: Handled by query budget enforcement

Sheaf Axiom Definitions

For completeness, we formally state the four presheaf axioms tested.

Definition 10 (Identity Axiom). For any open set $`U`$, the restriction to itself is the identity:

```math
\rho_{UU} = \text{id}_{\mathcal{F}(U)}
```

Definition 11 (Transitivity Axiom). For $`Z \subset V \subset U`$, composition of restrictions equals direct restriction:

```math
\rho_{ZU} = \rho_{ZV} \circ \rho_{VU}
```

Definition 12 (Locality Axiom). If $`\{U_i\}`$ is an open cover of $`U`$ and $`s, t \in \mathcal{F}(U)`$ satisfy $`\rho_{U_i}(s) = \rho_{U_i}(t)`$ for all $`i`$, then $`s = t`$.

Definition 13 (Gluing Axiom). If $`\{U_i\}`$ covers $`U`$ and sections $`s_i \in \mathcal{F}(U_i)`$ satisfy $`\rho_{U_i \cap U_j}(s_i) = \rho_{U_i \cap U_j}(s_j)`$ for all $`i, j`$, then there exists unique $`s \in \mathcal{F}(U)`$ with $`\rho_{U_i}(s) = s_i`$ for all $`i`$.

Proof of Absorption Matrix Formula

Proposition 14. Let $`\mathbf{W}`$ be a weighted adjacency matrix partitioned into observed ($`O`$) and hidden ($`H`$) blocks. If $`\rho(\mathbf{W}_{HH}) < 1`$, the total effect from observed variables through hidden paths is:

```math
\mathbf{W}_{\text{total}} = \mathbf{W}_{OO} + \mathbf{W}_{OH}(\mathbf{I} - \mathbf{W}_{HH})^{-1}\mathbf{W}_{HO}
```

*Proof.* Consider a path from observed variable $`X_i`$ to observed variable $`X_j`$ passing through hidden variables. The direct effect is $`\mathbf{W}_{OO}[i,j]`$. Paths through exactly one hidden variable contribute $`\sum_h \mathbf{W}_{OH}[i,h] \mathbf{W}_{HO}[h,j]`$. Paths through $`k`$ hidden variables contribute $`(\mathbf{W}_{OH} \mathbf{W}_{HH}^{k-1} \mathbf{W}_{HO})[i,j]`$.

Summing over all path lengths:

```math
\begin{align*}
\mathbf{W}_{\text{total}} &= \mathbf{W}_{OO} + \sum_{k=1}^{\infty} \mathbf{W}_{OH} \mathbf{W}_{HH}^{k-1} \mathbf{W}_{HO} \\
&= \mathbf{W}_{OO} + \mathbf{W}_{OH} \left(\sum_{k=0}^{\infty} \mathbf{W}_{HH}^k\right) \mathbf{W}_{HO} \\
&= \mathbf{W}_{OO} + \mathbf{W}_{OH} (\mathbf{I} - \mathbf{W}_{HH})^{-1} \mathbf{W}_{HO}
\end{align*}
```

The series converges when $`\rho(\mathbf{W}_{HH}) < 1`$ by the Neumann series theorem. ◻
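
The identity can also be checked numerically; the following toy sketch (random matrices scaled so that $`\rho(\mathbf{W}_{HH}) < 1`$, with illustrative sizes and truncation length) compares the truncated path series against the closed form.

```python
import torch

def check_absorption_identity(n_obs=8, n_hidden=5, n_terms=200, seed=0):
    """Compare W_OO + sum_k W_OH W_HH^k W_HO against the closed form of Proposition 14."""
    g = torch.Generator().manual_seed(seed)
    W_OO = 0.1 * torch.randn(n_obs, n_obs, generator=g)
    W_OH = 0.1 * torch.randn(n_obs, n_hidden, generator=g)
    W_HO = 0.1 * torch.randn(n_hidden, n_obs, generator=g)
    W_HH = 0.1 * torch.randn(n_hidden, n_hidden, generator=g)   # small scale => rho(W_HH) < 1

    I = torch.eye(n_hidden)
    closed_form = W_OO + W_OH @ torch.linalg.solve(I - W_HH, W_HO)

    series, power = W_OO.clone(), torch.eye(n_hidden)
    for _ in range(n_terms):
        series = series + W_OH @ power @ W_HO   # adds the W_HH^k term
        power = power @ W_HH
    return torch.max(torch.abs(series - closed_form))  # should be near machine precision
```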

Additional Experimental Results

Full Sheaf Axiom Error Statistics

Table 6 provides detailed error statistics for all X experiments.

| Experiment | Identity | Transitivity | Locality | Gluing |
|---|---|---|---|---|
| X1 (n=30) | $`0.0`$ | $`1.7 \times 10^{-6}`$ | $`1.25`$ | $`0.0`$ |
| X1 (n=50) | $`0.0`$ | $`1.6 \times 10^{-6}`$ | $`2.38`$ | $`0.0`$ |
| X1 (n=100) | $`0.0`$ | $`1.7 \times 10^{-6}`$ | $`3.45`$ | $`0.0`$ |
| X2 (n=30) | $`0.0`$ | $`1.7 \times 10^{-6}`$ | $`1.25`$ | $`0.0`$ |
| X2 (n=50) | $`0.0`$ | $`1.6 \times 10^{-6}`$ | $`2.38`$ | $`0.0`$ |
| X2 (n=100) | $`0.0`$ | $`1.7 \times 10^{-6}`$ | $`3.45`$ | $`0.0`$ |

Sheaf axiom errors (mean $`\pm`$ std over 5 seeds).

Convergence Plots

Loss curves show rapid initial descent followed by plateau behavior, consistent with the NOTEARS objective landscape. Natural gradient variants (full Holograph) converge faster and reach lower final loss than SGD ablations.

Query Distribution Analysis

Across all experiments, the query type distribution was:

  • Edge existence: 45%

  • Direction: 25%

  • Mechanism: 20%

  • Confounder: 10%

EFE-based selection preferentially queries uncertain edges near decision boundaries, as expected from the epistemic value formulation.

Identification Frontier Analysis

The identification frontier represents the set of queries that can yield identifiable causal effects given the current ADMG state. Figure 1 compares the frontier sizes across methods.

Identification frontier size comparison. Holograph’s ADMG representation enables identification of significantly more causal queries than DAG-based methods. Values represent average number of identifiable edge queries per experiment.

Analysis.

The identification frontier advantage of Holograph stems from two sources:

  1. ADMG vs DAG representation: By explicitly modeling bidirected edges for latent confounders, Holograph can identify effects that remain confounded under DAG assumptions. On ER (n=50), this yields 82 identifiable queries vs. 45 for NOTEARS ($`\sim`$82% improvement).

  2. EFE-based query selection: The Expected Free Energy criterion prioritizes queries that maximize information gain about the true graph, leading to more efficient exploration of the identification frontier.

The Sachs dataset shows the largest relative improvement (180% vs. NOTEARS) because the protein signaling network contains multiple known confounding pathways that cannot be represented in a DAG without introducing spurious edges.

Mathematical Implementation Verification

To ensure the implementation faithfully realizes the mathematical specification, we conducted a comprehensive audit comparing 15 core formulas against the codebase.

Core Formula Verification

Table [tab:verification] lists all verified formulas with their code locations.

| Formula | Equation | Code Location |
|---|---|---|
| Absorption matrix $`\mathbf{A}`$ | Eq. [eq:absorption] | sheaf.py:165 |
| $`\widetilde{\mathbf{W}}`$ projection | Eq. [eq:w-proj] | sheaf.py:208 |
| $`\widetilde{\mathbf{M}}`$ projection | Eq. [eq:m-proj] | sheaf.py:211-216 |
| Descent loss $`\mathcal{L}_{\text{descent}}`$ | Eq. [eq:descent-loss] | sheaf.py:268-269 |
| Acyclicity $`h(\mathbf{W})`$ | Eq. [eq:notears] | scm.py:149 |
| Spectral penalty $`\mathcal{L}_{\text{spec}}`$ | Eq. [eq:spectral] | scm.py:210 |
| Natural gradient update | Eq. [eq:natural-grad] | natural_gradient.py:205 |
| Tikhonov regularization | Eq. [eq:fisher-reg] | natural_gradient.py:200 |

Numerical Stability Verification

All implementations include the following stability measures:

  1. Stable Matrix Inversion: Uses torch.linalg.solve instead of explicit inv() for $`(\mathbf{I} - \mathbf{W}_{HH})^{-1}`$ computation.

  2. Regularization: Adds $`\epsilon \mathbf{I}`$ ($`\epsilon = 10^{-6}`$) to near-singular matrices before inversion.

  3. Pseudoinverse Fallback: Switches to SVD-based pseudoinverse if standard solver fails.

  4. Spectral Enforcement: Continuously penalizes $`\rho(\mathbf{W}) > 0.9`$ during training.

  5. PSD Guarantee: Parametrizes $`\mathbf{M}= \mathbf{L}\mathbf{L}^\top`$ with lower-triangular $`\mathbf{L}`$ to ensure positive semi-definiteness.

Cross-Term Necessity Verification

Ablation experiments confirm that removing cross-terms $`\mathbf{M}_{OH}\mathbf{A}^\top + \mathbf{A}\mathbf{M}_{HO}`$ from Eq. [eq:m-proj] increases Transitivity error from $`< 10^{-6}`$ to $`> 0.1`$, validating their necessity for presheaf composition:

```math
\rho_{ZU} = \rho_{ZV} \circ \rho_{VU}
```

Dual Implementation Consistency

The project maintains two implementations (src/holograph/ and holograph/). Both pass identical unit tests and produce numerically equivalent results (difference $`< 10^{-8}`$) on shared test cases, confirming implementation consistency across the codebase.

Threshold Sensitivity Analysis

The discretization threshold $`\tau`$ converts continuous edge weights to binary adjacency matrices for evaluation. Table 7 shows how F1 varies with $`\tau`$ for the full Holograph model on ER-50.

| $`\tau`$ | Pred. Edges | TP | FP | F1 |
|---|---|---|---|---|
| 0.01 | 569 | 45 | 524 | 0.12 |
| 0.02 | 426 | 34 | 392 | 0.11 |
| 0.05 | 119 | 9 | 110 | 0.06 |
| 0.10 | 5 | 0 | 5 | 0.00 |
| 0.30 | 0 | 0 | 0 | 0.00 |

Threshold sensitivity on ER-50 (seed 42).

Key Observations.

  1. Ground Truth Scale Mismatch: Ground truth edges are generated with weights in $`[0.3, 1.0]`$, but Holograph’s learned weights are compressed to $`[-0.12, 0.12]`$ due to spectral regularization.

  2. Optimal Threshold: F1 peaks around $`\tau = 0.01`$–$`0.02`$ where the trade-off between true positives and false positives is balanced.

  3. Threshold Choice Justification: We use $`\tau = 0.05`$ as a conservative choice that avoids excessive false positives while maintaining non-zero recall.

Weight Compression Analysis.

The spectral regularization constraint $`\|\mathbf{W}\|_F < 0.9`$ limits the magnitude of learned weights. For an $`n \times n`$ matrix with $`k`$ non-zero entries of equal magnitude $`w`$, we have $`\|\mathbf{W}\|_F = w\sqrt{k} < 0.9`$. With $`n=50`$ and expected $`k \approx 184`$ edges, this implies $`w < 0.9/\sqrt{184} \approx 0.066`$. This theoretical bound aligns with observed maximum weights of $`\approx 0.12`$.
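
As a quick sanity check of the arithmetic above (the edge count $`k = 184`$ is taken directly from the text):

```python
import math

k_expected_edges = 184
w_bound = 0.9 / math.sqrt(k_expected_edges)   # ||W||_F = w * sqrt(k) < 0.9  =>  w < 0.9 / sqrt(k)
print(f"{w_bound:.3f}")                       # ~0.066, matching the stated bound
```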

Hybrid Method Limitations

While Section 4.4 demonstrates the effectiveness of hybrid LLM-NOTEARS integration on the Asia dataset, this approach has important limitations that practitioners should consider.

Prior Quality Dependency

The hybrid method’s effectiveness depends critically on the quality of the Holograph prior. Table 8 shows results on the Sachs protein signaling network, where Holograph achieves only F1 = 0.35 (compared to 0.67 on Asia).

| $`N`$ | Vanilla F1 | Hybrid F1 | $`\Delta`$ |
|---|---|---|---|
| 100 | **.84±.03** | .77±.08 | -8.3% |
| 500 | **.83±.06** | .76±.10 | -8.4% |
| 1000 | **.87±.02** | .75±.11 | -13.8% |

Hybrid method on Sachs (protein signaling). Unlike Asia, the hybrid approach does not improve over vanilla NOTEARS—and sometimes hurts performance.

Analysis.

On Sachs, the hybrid method consistently underperforms vanilla NOTEARS:

  1. Weak prior hurts: With Holograph F1 = 0.35, the LLM prior contains significant errors. Using this as regularization biases NOTEARS toward incorrect edges.

  2. Higher variance: The hybrid shows std = 0.08–0.11 vs. 0.02–0.06 for vanilla, indicating unstable optimization when conflicting signals (data vs. prior) compete.

  3. Negative transfer: At $`N=1000`$, the performance gap widens to $`-13.8\%`$—more data makes NOTEARS more confident in correct structure, but the fixed prior continues to pull toward errors.

Domain Knowledge Requirements

The contrast between Asia (F1 gain = +13.6%) and Sachs (F1 loss = $`-8.3\%`$) illustrates a critical insight: hybrid methods require that the LLM has genuine domain expertise.

  • Asia (epidemiology): Variables like Tuberculosis, Smoking, and Lung_Cancer have well-documented causal relationships in medical literature. LLMs trained on web corpora encode this knowledge accurately.

  • Sachs (protein signaling): Variables like Raf, Mek, and PKC are specialized biochemistry concepts. Their causal relationships require domain expertise that general LLMs lack.

Recommendations for Practitioners

Based on these findings, we recommend the following workflow:

  1. Assess prior quality first: Run Holograph zero-shot and evaluate against any available ground truth or domain expertise. If F1 $`< 0.5`$, the hybrid approach is unlikely to help.

  2. Use confidence filtering: Only include high-confidence edges ($`|W| > 0.3`$) in the prior to avoid noise amplification.

  3. Consider sample size: The hybrid is most beneficial when $`N < 50`$ and the prior is strong. With abundant data, let NOTEARS learn from observations alone.

  4. Validate on held-out data: If possible, use a validation set to detect negative transfer early and fall back to vanilla NOTEARS.

A Note of Gratitude

The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.
