Protein Circuit Tracing via Cross-layer Transcoders

Reading time: 5 minutes
...

📝 Original Info

  • Title: Protein Circuit Tracing via Cross-layer Transcoders
  • ArXiv ID: 2602.12026
  • Date: 2026-02-12
  • Authors: Not specified in the provided text (the title, abstract, and body do not list authors; see the original PDF or preprint page).

📝 Abstract

Protein language models (pLMs) have emerged as powerful predictors of protein structure and function. However, the computational circuits underlying their predictions remain poorly understood. Recent mechanistic interpretability methods decompose pLM representations into interpretable features, but they treat each layer independently and thus fail to capture cross-layer computation, limiting their ability to approximate the full model. We introduce ProtoMech, a framework for discovering computational circuits in pLMs using cross-layer transcoders that learn sparse latent representations jointly across layers to capture the model's full computational circuitry. Applied to the pLM ESM2, ProtoMech recovers 82-89% of the original performance on protein family classification and function prediction tasks. ProtoMech then identifies compressed circuits that use <1% of the latent space while retaining up to 79% of model accuracy, revealing correspondence with structural and functional motifs, including binding, signaling, and stability. Steering along these circuits enables high-fitness protein design, surpassing baseline methods in more than 70% of cases. These results establish ProtoMech as a principled framework for protein circuit tracing.

💡 Deep Analysis

📄 Full Content

Protein language models (pLMs) have driven rapid progress in the biosciences by learning rich statistical representations from large protein sequence databases. They now achieve strong performance across a broad range of downstream tasks, including 3D structure and function prediction (Lin et al., 2023; Hayes et al., 2025; Bhatnagar et al., 2025). These results suggest that pLMs may capture latent structural and functional motifs governing protein sequences (Rives et al., 2021; Tsui et al., 2025a; Tsui & Aghazadeh, 2024). However, the internal computational pathways, or circuits, responsible for these predictions remain opaque, poorly understood, and difficult to extract.

Recent progress in mechanistic interpretability, particularly through sparse autoencoders (SAEs) (Templeton et al., 2024; Gao et al., 2025), has enabled the decomposition of pLM hidden states into interpretable features (Adams et al., 2025; Simon & Zou, 2025; Walton et al., 2025a; Gujral et al., 2025; Nainani et al., 2025; Parsan et al., 2025). Steering these features has been shown to generate protein sequences with specific functional attributes (Tsui et al., 2025b; Corominas et al., 2025; Garcia & Ansuini, 2025). However, SAEs provide only a representational factorization and do not capture the layer-to-layer transformations that constitute the model's computation. Consequently, they cannot recover the full circuitry responsible for pLM predictions. Recovering such circuits requires a replacement model that faithfully emulates the internal computation of the original network.

In the natural language processing literature, recent efforts have aimed to construct replacement models using transcoders (Dunefsky et al., 2024). In contrast to SAEs, which factorize representations, transcoders approximate the functional mapping of individual transformer MLP layers by passing activations through a sparse latent bottleneck. Composing these approximations across layers yields per-layer transcoders (PLTs), which seek to reconstruct model computation from locally sparse surrogates. Yet this construction remains inherently local: approximating each layer in isolation neglects the accumulation of context and computation across depth, leading to degraded representations and unreliable circuit recovery (Ameisen et al., 2025).
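To make the per-layer construction concrete, the following is a minimal sketch of a single-layer transcoder surrogate, assuming a PyTorch setting; the class name, dimensions, and the plain ReLU bottleneck are illustrative assumptions rather than the implementation used in the papers cited above.

```python
import torch
import torch.nn as nn

class PerLayerTranscoder(nn.Module):
    """Illustrative per-layer transcoder: approximates one MLP block's
    input-output mapping through a sparse latent bottleneck.
    Names, dimensions, and the ReLU choice are assumptions for illustration."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: residual-stream activation entering the MLP at this layer
        a = torch.relu(self.encoder(x))  # latent code, pushed toward sparsity during training
        return self.decoder(a)           # surrogate for the MLP block's output

# Each layer gets its own surrogate, trained to regress onto that layer's true
# MLP output in isolation -- which is exactly the locality the text criticizes.
```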

In this work, we bridge the gap between interpretable feature recovery and mechanistic circuit discovery in pLMs. We introduce ProtoMech, a framework for uncovering computational circuits in pLMs using cross-layer transcoders (CLTs) (Ameisen et al., 2025). Similar to standard transcoders, CLTs learn an input-output mapping for each transformer MLP layer. However, rather than relying on isolated layerwise approximations, CLTs compute each layer’s output as a function of the sparse latent variables from all preceding layers. By explicitly modeling these cross-layer dependencies, ProtoMech constructs a replacement model that more faithfully reproduces the internal computation of the pLM (Fig. 1), enabling direct identification of the circuits that govern its predictions.

Our contributions are as follows:

• We develop ProtoMech, a framework for discovering and analyzing computational circuits within and across transformer layers in pLMs. Applied to ESM2 (Lin et al., 2023), ProtoMech achieves state-of-the-art recovery of the original model’s performance, attaining 89% and 82% on protein family classification and function prediction tasks, respectively.

• … computational pathways, (iii) a steering mechanism for manipulating these circuits, and (iv) visualization tools for interpreting the circuits.

CLTs extend standard transcoders by approximating the input-output mapping of each MLP layer as a function of the sparse latent variables derived from all preceding layers (Fig. 2a). This formulation replaces independent layerwise surrogates with a compositional model of interlayer computation.

Let $L$ be the number of transformer layers and $d_{\text{model}}$ the hidden dimension of the pLM. For $\ell \in \{1, \dots, L\}$, we denote by $x^\ell \in \mathbb{R}^{d_{\text{model}}}$ the residual stream activation prior to the MLP block at layer $\ell$. Sparsity in the latent space is enforced using a TopK activation function (Makhzani & Frey, 2013).
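A minimal sketch of such a TopK operator, assuming PyTorch; the function name and the choice to rank by absolute magnitude along the last (latent) dimension are illustrative assumptions, not the paper's code.

```python
import torch

def topk_activation(z: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude entries along the last (latent) dimension,
    zeroing all others. Illustrative sketch; name and layout are assumptions."""
    _, idx = torch.topk(z.abs(), k, dim=-1)             # indices of the k largest |z|
    mask = torch.zeros_like(z).scatter_(-1, idx, 1.0)   # 1s at the kept positions
    return z * mask
```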

Each residual stream activation is encoded into a latent vector $a^\ell \in \mathbb{R}^{d_{\text{latent}}}$ via an encoder matrix:

$$a^\ell = \mathrm{TopK}\!\left(W^{\ell}_{\mathrm{enc}}\left(x^\ell - b^{\ell}_{\mathrm{pre}}\right)\right),$$

where $W^{\ell}_{\mathrm{enc}} \in \mathbb{R}^{d_{\text{latent}} \times d_{\text{model}}}$ denotes the encoder matrix and $b^{\ell}_{\mathrm{pre}} \in \mathbb{R}^{d_{\text{model}}}$ is the corresponding bias term at layer $\ell$. To regulate the number of active latent features, the TopK operator retains only the $k$ largest-magnitude latent activations and sets all others to zero. To reconstruct the output of the MLP block at layer $\ell$, denoted $y^\ell \in \mathbb{R}^{d_{\text{model}}}$, CLTs employ decoder matrices that map latent representations from preceding layers to layer $\ell$ according to:

$$\hat{y}^\ell = \sum_{\ell'=1}^{\ell} W^{\ell' \to \ell}_{\mathrm{dec}}\, a^{\ell'},$$

where $\hat{y}^\ell \in \mathbb{R}^{d_{\text{model}}}$ denotes the reconstructed MLP output at layer $\ell$, and $W^{\ell' \to \ell}_{\mathrm{dec}}$ is a decoder matrix mapping latent features from layer $\ell'$ to layer $\ell$. By requiring each r
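Putting the encoder and the cross-layer decoders together, the following PyTorch sketch shows one way the CLT forward pass described above could look. It reuses the topk_activation sketch from earlier; all class, parameter, and key names are assumptions, and this is not the paper's released code.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Illustrative CLT: layer l's MLP output is reconstructed from the sparse
    latents of all layers l' <= l. Names and initialization are assumptions."""

    def __init__(self, n_layers: int, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.W_enc = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(d_latent, d_model)) for _ in range(n_layers)]
        )
        self.b_pre = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_model)) for _ in range(n_layers)]
        )
        # One decoder matrix per (source layer l', target layer l) pair with l' <= l.
        self.W_dec = nn.ParameterDict({
            f"{src}_to_{tgt}": nn.Parameter(0.01 * torch.randn(d_model, d_latent))
            for tgt in range(n_layers) for src in range(tgt + 1)
        })

    def forward(self, residual_streams: list[torch.Tensor]) -> list[torch.Tensor]:
        # residual_streams[l]: x^l entering the MLP at layer l, shape (batch, d_model)
        latents = [
            topk_activation((x - self.b_pre[l]) @ self.W_enc[l].T, self.k)
            for l, x in enumerate(residual_streams)
        ]
        outputs = []
        for l in range(len(residual_streams)):
            # Reconstruct y^l as a sum of decoded latents from layers 0..l.
            y_hat = sum(latents[src] @ self.W_dec[f"{src}_to_{l}"].T
                        for src in range(l + 1))
            outputs.append(y_hat)
        return outputs

# Training would minimize sum_l ||y^l - y_hat^l||^2 over tokens, so that the
# surrogate jointly explains the MLP outputs at every layer rather than one
# layer at a time.
```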

Reference

This content is AI-processed based on open access ArXiv data.
