CodeCircuit: Toward Inferring LLM-Generated Code Correctness via Attribution Graphs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Current paradigms for code verification rely heavily on external mechanisms, such as execution-based unit tests or auxiliary LLM judges, which are often labor-intensive or limited by the judging model's own capabilities. This raises a fundamental, yet unexplored question: can the functional correctness of LLM-generated code be assessed purely from the model's internal computational structure? Our primary objective is to investigate whether the model's neural dynamics encode internally decodable signals that are predictive of logical validity during code generation. Inspired by mechanistic interpretability, we propose to treat code verification as a mechanistic diagnostic task, mapping the model's algorithmic trajectory into line-level attribution graphs. By decomposing complex residual flows, we aim to identify the structural signatures that distinguish sound reasoning from logical failure within the model's internal circuits. Analysis across Python, C++, and Java confirms that intrinsic correctness signals are robust across diverse syntaxes. Topological features from these internal graphs predict correctness more reliably than surface heuristics and enable targeted causal interventions to fix erroneous logic. These findings establish internal introspection as a decodable property for verifying generated code. Our code is at https://github.com/bruno686/CodeCircuit.


💡 Research Summary

The paper introduces CodeCircuit, a novel white‑box framework that assesses the functional correctness of code generated by large language models (LLMs) by inspecting the models’ internal computational dynamics rather than relying on external execution or auxiliary judging models. The authors argue that, analogous to recent advances in mechanistic interpretability, the latent algorithmic trajectory of an LLM during code generation can be captured as a line‑level attribution graph (AG), where nodes represent interpretable components (token embeddings, sparse features from per‑layer transcoders, error residuals, and output logits) and directed edges encode linear contributions derived from Jacobians of the residual stream.

Technical pipeline

  1. Per‑Layer Transcoders (PLTs) replace each MLP layer with a sparse autoencoder that encodes the residual stream into a low‑dimensional feature vector f(l) and decodes it back to reconstruct the original MLP output. This yields a locally linearized computation that isolates interpretable features while preserving the exact forward pass.
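The PLT idea can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the widths, the ReLU encoder, and the stand-in `tanh` "MLP" are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 8, 32                      # toy widths, illustrative only

W_enc = rng.normal(scale=0.1, size=(d_model, d_feat))
W_dec = rng.normal(scale=0.1, size=(d_feat, d_model))

def true_mlp(x):
    """Stand-in for the frozen MLP layer being replaced."""
    return np.tanh(x)

def plt_forward(resid):
    f = np.maximum(resid @ W_enc, 0.0)       # sparse, non-negative features f(l)
    recon = f @ W_dec                        # decoded reconstruction of MLP output
    err = true_mlp(resid) - recon            # error term: unexplained computation
    return f, recon, err

resid = rng.normal(size=d_model)
f, recon, err = plt_forward(resid)
# Carrying the error term alongside the reconstruction keeps the forward pass exact:
exact = bool(np.allclose(recon + err, true_mlp(resid)))   # True by construction
```

The point of keeping `err` is exactly what the next step exploits: the reconstruction is interpretable, and the error term guarantees nothing is silently dropped from the forward pass.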
  2. Graph construction: For a given generation step (i.e., a line of code), the authors build a DAG where an edge weight wᵢⱼ quantifies the linear influence of source node i on target node j using the Jacobian of the frozen residual transformation. Error nodes V_err capture the discrepancy between true MLP outputs and PLT reconstructions, providing a proxy for unexplained computation.
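Because the transformation is frozen (locally linear), each edge weight is just the source activation scaled by the corresponding Jacobian entry. A toy sketch of that bookkeeping, with dimensions chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
n_src, n_tgt = 4, 3
J = rng.normal(size=(n_tgt, n_src))   # frozen Jacobian d(target)/d(source)
a = rng.normal(size=n_src)            # source-node activations

# Edge weight w_ij = a_i * J[j, i]: the linear contribution of source node i
# to target node j under the frozen residual transformation.
W = J * a                             # broadcasts a across the rows of J

# Sanity check: summing a target's incoming edge weights recovers its
# total linear response J @ a, so the decomposition is exact.
consistent = bool(np.allclose(W.sum(axis=1), J @ a))   # True
```

This additivity is what makes the resulting graph "mechanistically faithful": edge weights partition each node's output exactly, rather than approximating it.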
  3. Pruning: Nodes and edges contributing negligibly to the final logit are removed, yielding a sparse, mechanistically faithful subgraph.
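A pruning pass of this kind reduces to thresholding edge influence and discarding orphaned nodes. The sketch below uses hypothetical node names and an arbitrary threshold:

```python
# Toy pruning: drop edges whose absolute influence on the final logit falls
# below tau, then keep only nodes that still touch a surviving edge.
edges = {
    ("tok_0", "feat_3"): 0.42, ("feat_3", "logit"): 0.81,
    ("tok_1", "feat_7"): 0.004, ("feat_7", "logit"): 0.002,
}
tau = 0.01
pruned = {e: w for e, w in edges.items() if abs(w) >= tau}
kept_nodes = {n for edge in pruned for n in edge}
# The negligible path tok_1 -> feat_7 -> logit vanishes; the high-influence
# path tok_0 -> feat_3 -> logit survives intact.
```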
  4. Feature extraction (Φ): From each AG the authors compute a rich set of statistics (>30), including global density, number of connected components, clustering coefficient, error‑to‑feature influence ratio η, graph density ρ, betweenness and degree centrality moments, and total logit attribution scores. These capture both the overall structural health of the latent algorithm and localized bottlenecks.
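A few of these statistics are simple enough to compute directly. The sketch below covers a small subset (directed density, degree moments, and the error-to-feature influence ratio η); the helper name and edge encoding are ours:

```python
import numpy as np

def graph_features(n_nodes, edges, err_nodes):
    """Compute a small subset of the attribution-graph statistics.

    edges: {(src, tgt): weight}; err_nodes: set of error-node ids.
    """
    density = len(edges) / (n_nodes * (n_nodes - 1))     # directed graph density
    deg = np.zeros(n_nodes)
    for (i, j), _ in edges.items():
        deg[i] += 1
        deg[j] += 1
    err_flow = sum(abs(w) for (i, _), w in edges.items() if i in err_nodes)
    feat_flow = sum(abs(w) for (i, _), w in edges.items() if i not in err_nodes)
    return {
        "density": density,
        "deg_mean": float(deg.mean()),
        "deg_var": float(deg.var()),
        "eta": err_flow / feat_flow,     # error-to-feature influence ratio
    }

# 4 nodes (0-2 are feature nodes, 3 an error node), weighted edges:
feats = graph_features(4, {(0, 1): 0.5, (1, 2): 1.0, (3, 2): 0.3}, err_nodes={3})
# feats["eta"] == 0.3 / 1.5 == 0.2; feats["density"] == 3 / 12 == 0.25
```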
  5. Prediction model (h_φ): The extracted feature vector xᵢ = Φ(Gᵢ) is fed into a lightweight classifier (logistic regression or shallow MLP) that outputs a probability that the i‑th line is correct. The model is trained with cross‑entropy loss on a dataset of human‑annotated correct/incorrect lines.
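For the logistic-regression variant of h_φ, the training loop is standard cross-entropy gradient descent. A self-contained sketch on synthetic feature vectors (the data, learning rate, and iteration count are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                    # stand-ins for x_i = Phi(G_i)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # toy correct/incorrect labels

w, b = np.zeros(5), 0.0
for _ in range(500):                             # gradient descent on CE loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted P(line correct)
    grad = p - y                                 # dCE/dlogit for the sigmoid
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

train_acc = float((((X @ w + b) > 0) == y).mean())
```

The same loop applies unchanged if h_φ is a shallow MLP; only the forward pass and gradient change.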

Experimental evaluation
The authors evaluate CodeCircuit on three popular LLM families (CodeX‑GLM, LLaMA‑2‑7B, StarCoder) across three programming languages (Python, C++, Java). For each language they collect 5–10 k code snippets, annotate each line as correct or erroneous, and compare against several baselines: temperature‑scaled probability heuristics, Chain‑of‑Embedding gray‑box methods, and LLM‑as‑judge approaches.

Key findings:

  • CodeCircuit consistently outperforms baselines, achieving 7–12 % higher line‑level accuracy and 0.05–0.08 improvements in ROC‑AUC across all settings.
  • The error‑to‑feature ratio η is a strong predictor of failure; high η correlates with fragmented reasoning and higher error rates.
  • Centrality metrics (mean and variance of betweenness) differentiate correct from incorrect steps, indicating that successful code generation relies on well‑coordinated “hub” nodes in the internal circuit.
  • Causal intervention: By attenuating the activation of specific error nodes or reinforcing high‑centrality features, the authors demonstrate that the generated code can be repaired without re‑prompting the model, confirming that the AG captures causal mechanisms rather than mere correlation.
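The intervention in the last finding can be illustrated on a locally linear toy graph: scaling down an error node's activation and recomputing the logit directly shows the causal effect. Node names and weights below are hypothetical:

```python
# Toy causal intervention: attenuating an error node's activation shifts the
# recomputed logit through the (locally linear) attribution graph.
acts = {"feat_hub": 1.2, "err_5": 0.9}
w_to_logit = {"feat_hub": 0.8, "err_5": -0.6}

def logit(a):
    """Linear readout: sum of activation * edge weight into the logit node."""
    return sum(a[n] * w_to_logit[n] for n in a)

before = logit(acts)           # 1.2*0.8 + 0.9*(-0.6) = 0.42
acts["err_5"] *= 0.1           # suppress the error node's influence
after = logit(acts)            # 0.96 - 0.054 = 0.906
# The logit for the intended token rises once the error pathway is attenuated.
```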

Limitations

  • Constructing AGs requires PLT training and Jacobian computation, which are memory‑ and compute‑intensive; current experiments are limited to sequences under ~1 k tokens on high‑end GPUs.
  • The approach depends on line‑level ground‑truth labels, which may be costly to obtain at scale.
  • Generalization to unseen languages, libraries, or larger models remains an open question.

Future directions suggested by the authors include:

  1. Developing lightweight transcoders and graph compression techniques to enable real‑time verification pipelines.
  2. Extending the method to multi‑line or function‑level graphs for global program correctness detection.
  3. Integrating reinforcement learning or meta‑learning to automate corrective interventions, moving toward self‑repairing code‑generating LLMs.

In summary, CodeCircuit demonstrates that internal attribution graphs provide a decodable, causally relevant signal of code correctness across multiple programming languages. By shifting verification from external execution to intrinsic model dynamics, the work opens a new research avenue for trustworthy, introspective code generation.
