
Claude Team Unveils "Circuit Tracing" to Reveal Large Language Model Reasoning

The Claude research team introduced a novel "circuit tracing" technique that builds substitute models and attribution graphs to expose the internal reasoning steps of large language models, uncovering capabilities such as multilingual understanding, long‑term planning, multi‑step inference, and hidden mathematical computation strategies.

The Claude team has released a new interpretability tool called circuit tracing, which constructs a substitute model that replaces the original transformer's multi-layer perceptron (MLP) layers with a cross-layer transcoder (CLT). This substitute model enables the generation of attribution graphs that visualize the model's computation flow for a given prompt.
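The substitute-model idea can be sketched in a few lines of PyTorch. The toy class below is an illustration under stated assumptions (class name, shapes, and wiring are mine, not the paper's implementation): sparse features read from one layer's residual stream and write reconstructed MLP output into that layer and every later layer, which is what makes the transcoder "cross-layer".

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Toy sketch of a cross-layer transcoder: sparse features are read
    from one layer's residual stream and decode reconstructed MLP output
    into that layer and all downstream layers. Illustrative only."""
    def __init__(self, d_model: int, n_features: int, n_layers: int, layer: int):
        super().__init__()
        self.layer = layer
        # encoder maps the residual stream to a wider, sparse feature basis
        self.encoder = nn.Linear(d_model, n_features)
        # one decoder per current-or-later layer (the cross-layer writes)
        self.decoders = nn.ModuleList(
            nn.Linear(n_features, d_model, bias=False)
            for _ in range(n_layers - layer)
        )

    def forward(self, resid: torch.Tensor):
        # ReLU keeps activations non-negative and encourages sparsity
        acts = torch.relu(self.encoder(resid))
        # reconstructed MLP contribution for each downstream layer
        return acts, [dec(acts) for dec in self.decoders]
```

In a full substitute model, the reconstructed contributions stand in for the original MLP outputs while attention is left untouched.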

Using this method, researchers observed several notable behaviours in Claude 3.5 Haiku:

Shared conceptual space across languages, suggesting a universal "thought language".

Pre‑planning of output, such as anticipating rhyming words in poetry, indicating long‑range planning abilities.

Generation of plausible but fabricated reasoning chains to align with user expectations.

Accurate mental arithmetic without explicit mathematical algorithms.

To build the substitute model, the team trained CLT features to reconstruct the original MLP outputs, minimizing a combined objective of reconstruction error and a sparsity penalty. These features replace the original MLP neurons; for a specific prompt, a local substitute model is then created that additionally freezes attention patterns and normalisation terms at their original values, ensuring identical activations and outputs.
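That combined training objective can be sketched as a mean-squared reconstruction term plus an L1 sparsity penalty on the feature activations. This is a minimal sketch, assuming a simple additive weighting; the `l1_coeff` name and the exact form of the penalty are illustrative, not taken from the paper:

```python
import torch

def clt_loss(reconstructed: torch.Tensor,
             mlp_out: torch.Tensor,
             feature_acts: torch.Tensor,
             l1_coeff: float = 1e-3) -> torch.Tensor:
    """Reconstruction error against the frozen model's MLP outputs,
    plus an L1 penalty that pushes features toward sparse activation.
    Hyperparameter name and weighting are assumptions for illustration."""
    recon = torch.mean((reconstructed - mlp_out) ** 2)
    sparsity = l1_coeff * feature_acts.abs().sum(dim=-1).mean()
    return recon + sparsity
```

The sparsity term is what makes individual features more likely to correspond to single, human-interpretable concepts.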

Attribution graphs are then constructed by tracing linear influence edges between input, intermediate, output, and error nodes, using reverse‑mode Jacobians. A pruning algorithm removes low‑impact nodes, yielding a concise, interpretable graph. Interactive visualisations allow manual annotation of feature semantics and grouping into super‑nodes.
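Conceptually, each edge in such a graph is linear: the source node's activation times the sensitivity of the target node to it (one entry of a reverse-mode Jacobian), and pruning then keeps only the strongest edges. The sketch below is a toy stand-in; the `keep_fraction` top-k heuristic is an assumption, whereas the paper uses an influence-based pruning criterion:

```python
import torch

def edge_weights(src_acts: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    """Linear attribution: edge weight = source activation times the
    gradient of the target node w.r.t. that activation."""
    return src_acts * grads

def prune_graph(edges: dict, keep_fraction: float = 0.2) -> dict:
    """Keep only the largest-|weight| edges so the graph stays readable.
    edges maps (src, dst) pairs to scalar weights. Illustrative heuristic."""
    ranked = sorted(edges.items(), key=lambda kv: abs(kv[1]), reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return dict(ranked[:k])
```

Because both attention patterns and normalisation terms are frozen in the local substitute model, these activation-times-gradient products are exact linear effects rather than first-order approximations.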

Experiments include feature‑perturbation studies that confirm the causal impact of identified nodes, and global‑weight analyses that address spurious connections by limiting feature scope or incorporating co‑activation statistics (e.g., TWERA). The authors evaluate CLT feature interpretability and graph fidelity, noting that while the approach reveals many internal mechanisms, it struggles with highly complex semantic relations and fine‑grained behavioural changes.
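A feature-perturbation study of the kind described above can be as simple as rescaling one feature's activation and re-running the downstream computation to see whether behaviour shifts the way the attribution graph predicts. The helper below is a hypothetical illustration (function and parameter names are mine):

```python
import torch

def perturb_feature(acts: torch.Tensor, feature_idx: int,
                    scale: float = 0.0) -> torch.Tensor:
    """Intervention sketch: scale one feature's activation across all
    positions (0.0 ablates it, values > 1.0 amplify it) and return the
    patched activations, leaving the originals untouched."""
    patched = acts.clone()
    patched[..., feature_idx] = patched[..., feature_idx] * scale
    return patched
```

Feeding the patched activations through the rest of the model and comparing outputs against the unpatched run is what turns a correlational graph edge into a causal claim.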

Overall, circuit tracing provides a powerful lens into large model cognition, offering insights into multilingual reasoning, planning, multi‑step inference, and parallel mathematical computation, while also highlighting current limitations and avenues for future research.

Artificial Intelligence · Model Interpretability · Claude · Attribution Graphs · Circuit Tracing
Written by

DevOps

Shares premium content and events on trends, applications, and practices in development efficiency, AI, and related technologies. The IDCF (International DevOps Coach Federation) trains end-to-end development-efficiency talent, connecting high-performance organizations and individuals in the pursuit of excellence.
