Induction Heads: The Circuit Behind In-Context Learning
Give a language model a few examples of a pattern — say, foo → FOO, bar → BAR, baz → — and it completes the sequence correctly without retraining. No weights change. Somehow the model reads the pattern and applies it. This is in-context learning: the ability to adapt to a new task using only examples in the prompt.
In-context learning emerged in GPT-3 and has gotten sharper with every generation since. But the mechanism behind it was a black box. Why should predicting the next token — the thing transformers are trained to do — produce the ability to recognize and execute new tasks on the fly?
In 2022, a team at Anthropic published “In-context Learning and Induction Heads” (Olsson et al.) and provided the first mechanistic answer. They found that a specific two-head attention circuit (built around what they named the induction head) is responsible for a substantial portion of in-context learning across transformers of all sizes, from two-layer toy models to 13-billion-parameter language models.
The Operation: Copy-and-Complete
An induction head implements one basic operation: if the sequence contains the pattern [A][B] somewhere earlier and the model is now at a second occurrence of [A], predict [B].
Sequence: ... [A] [B] ... ... [A]  ?
                                   ↑
                   Induction head predicts [B]
This is remarkably general. If A is “Marie” and B is “Curie”, the head completes a name. If A is a few-shot prompt example and B is its label, the head completes the task. The operation is the same in all cases: find the pattern, continue it.
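The strict version of this operation fits in a few lines of plain Python. A minimal sketch (the function and its backward scan are my own framing, not code from the paper):

```python
def induction_complete(tokens):
    """Strict induction: find the most recent earlier occurrence of
    the current token and predict whatever followed it."""
    current = tokens[-1]
    # Scan backward over earlier positions so the most recent match wins.
    for j in range(len(tokens) - 2, 0, -1):
        if tokens[j - 1] == current:   # token j-1 is a previous [A]
            return tokens[j]           # predict [B], the token after it
    return None                        # nothing earlier to copy from

print(induction_complete(["A", "B", "C", "A"]))  # -> "B"
```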
The Two-Head Circuit
The induction circuit consists of two attention heads working in composition, most clearly visible in two-layer attention-only transformers:
Layer 1 — Previous token head: At each position, this head attends to the token immediately before it. Its job is simple: copy information about the preceding token into the current position’s representation in the residual stream.
Layer 2 — Induction head: This head reads the previous token head’s output through K-composition: its keys are computed from the residual stream, which now contains both the original token embedding and the previous token head’s contribution. The key at position j therefore encodes information about both token j and token j-1.
The result: when the model sits at the second occurrence of [A], the induction head’s query matches most strongly against keys at positions j where token j-1 is [A], i.e., positions immediately following the first occurrence of [A]. It attends to position j (where [B] is) and reads the value there.
Layer 1: Previous Token Head
─────────────────────────────────────────────────────────
Pos:  ...  [j-1]  [j]  ...  [t]
            (A)   (B)       (A)
             ↑     │
             └─────┘   the head at position j attends to j-1,
                       writing A's info into position j's
                       residual stream (later read as the key)

Layer 2: Induction Head (K-composition)
─────────────────────────────────────────────────────────
Key at j:     encodes [j] and [j-1]  =  B + A
Query at t:   encodes the current token A: "what followed A?"
Strongest match: the key at j, because token j-1 is A

  ...  [A]  [B]  ...  [A]   ?
        ↑    ↑         ↑
       j-1   j         t ── attends to j ──▶ predicts B
The elegance here is that the circuit works compositionally: head 1 writes a signal into the residual stream, and head 2 reads it through its key computation. Neither head alone could implement copy-and-complete; both are necessary.
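You can simulate this information flow directly. Here is a toy numpy sketch, assuming one-hot token vectors and hard (argmax) attention in place of a trained model; it demonstrates the read/write pattern, not learned weights:

```python
import numpy as np

tokens = ["A", "B", "C", "A"]
vocab = sorted(set(tokens))
d = len(vocab)
emb = np.stack([np.eye(d)[vocab.index(t)] for t in tokens])  # (seq, d)

# Layer 1, previous token head: write token i-1 into a separate
# "previous token" subspace of position i's residual stream.
prev = np.zeros_like(emb)
prev[1:] = emb[:-1]
resid = np.concatenate([emb, prev], axis=-1)                 # (seq, 2d)

# Layer 2, induction head (K-composition): queries read the current-token
# subspace; keys read the previous-token subspace that head 1 wrote.
q = resid[:, :d]                     # query at t: "I am token A"
k = resid[:, d:]                     # key at j: "my predecessor was ..."
scores = q @ k.T                     # high where token j-1 == token t
scores[np.triu(np.ones_like(scores, dtype=bool))] = -np.inf  # causal mask
attn = scores.argmax(axis=-1)        # hard attention

t = len(tokens) - 1                  # the second "A"
print(tokens[attn[t]])               # -> "B": attends right after the first A
```

Zero out the `prev` write (head 1's contribution) and the keys go empty, so the match disappears, mirroring the claim that both heads are necessary.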
The Phase Change
The researchers didn’t identify this circuit by reading attention weights. They found it by observing something unusual during training.
When you train a two-layer transformer, loss doesn’t decrease smoothly. There’s a sudden drop — a phase change — after which in-context learning measurably improves:
Loss
      │
High  │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓\
      │                 \
      │                  \_____
Low   │                        ▔▔▔▔▔▔▔▔▔▔▔▔▔▔
      │
      └─────────────────────────────────────────▶ Training steps
                       ↑
                  Phase change:
               induction heads form
Before the transition: no induction heads, weak in-context learning. After: the circuit is present and in-context learning ability jumps. The transition wasn’t gradual. It happened sharply, a single computational structure clicking into place and unlocking a whole class of capability.
The transition occurs because induction heads require composition between two layers. The previous token head needs to exist and write a useful signal before the induction head can make use of it. Discovering this joint structure takes longer than any individual head’s local optimization. Once it forms, the circuit is stable.
Critically, this phase change was observed across model sizes and architectures, not just in tiny toy models. The same discontinuous jump in in-context learning ability appeared in models ranging from two-layer attention-only transformers to 13-billion-parameter language models. The circuit is not an artifact of scale; it’s a convergent solution that transformers reliably discover.
In-Context Learning as Meta-Learning
The key insight of the paper is that few-shot prompts are just sequences full of [A][B] patterns.
Take a sentiment classification prompt:
Review: "The food was excellent." → Positive
Review: "Service was terrible." → Negative
Review: "Decent, nothing special." → ?
From the model’s perspective, this is: [Review₁][Positive][Review₂][Negative][Review₃][?]. The induction head sees Review₃ and asks: “what followed similar inputs earlier?” It finds that review-like inputs were followed by sentiment labels and predicts one.
The model hasn’t “learned” sentiment analysis in the prompt. It’s running a general pattern-completion algorithm that happens to implement task inference. The paper calls this meta-learning: during pre-training on a vast corpus, the model learns a task-recognition algorithm (the induction circuit), and that algorithm generalizes at inference time to tasks it never explicitly trained on.
This reframes in-context learning from something almost magical into something mechanistic. The model isn’t understanding your examples in a deep semantic sense. It’s running a learned algorithm that says: find patterns in the context, continue them.
Dai et al. (2023) pushed this further, showing mathematically that the attention-based mechanism for in-context learning is functionally similar to gradient descent. The model is, in a sense, running an implicit optimizer inside the forward pass — updating an internal task representation based on the examples you provided, without touching the weights.
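Concretely, the correspondence is easiest to see with softmax-free (linear) attention. A sketch of the identity, with notation mine rather than Dai et al.’s:

```latex
% Linear (softmax-free) attention over the in-context examples:
\mathrm{Attn}(q) \;=\; \sum_i (k_i^{\top} q)\, v_i
               \;=\; \Big(\sum_i v_i k_i^{\top}\Big)\, q
               \;=\; \Delta W_{\mathrm{ICL}}\, q

% One gradient step on a linear layer y = Wx, with per-example errors e_i:
\Delta W_{\mathrm{GD}} \;=\; \eta \sum_i e_i x_i^{\top}
```

Both quantities are sums of outer products over the examples: the demonstrations act like a one-step weight update applied inside the forward pass.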
Evidence That the Circuit Is Causal
Correlation between induction head formation and in-context learning ability is suggestive but not proof. To establish causation, the researchers used ablation: setting specific attention heads’ outputs to zero and measuring the effect.
They measured in-context learning through an “in-context learning score”: for each token in a sequence, how much does the loss on that token decrease when that same token appeared earlier in the context? Higher score = the model is using prior occurrences to improve predictions.
Ablating the identified induction heads caused a large drop in this score. Ablating other heads had no comparable effect. Ablating the specific previous-token head that feeds into the induction head also caused degradation — consistent with the circuit requiring both components.
This is as close to causal identification as you can get without rewriting the model architecture.
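Both the score and the ablation are easy to reproduce in miniature with the open-source TransformerLens library. In the sketch below, loss on the second half of a repeated random sequence serves as a crude in-context learning score; the (layer, head) indices are placeholders, since real induction heads must first be located by scoring attention patterns:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# Repeated random tokens: the second half is predictable only by
# copying from the first half, so second-half loss measures
# in-context learning (lower = more copying).
torch.manual_seed(0)
half = torch.randint(1000, 20000, (1, 64))
tokens = torch.cat([half, half], dim=1)

def zero_head(layer, head):
    """Hook that zeroes one attention head's output (its 'z' vector)."""
    def hook(z, hook):                 # z: (batch, pos, n_heads, d_head)
        z[:, :, head, :] = 0.0
        return z
    return (utils.get_act_name("z", layer), hook)

def second_half_loss(hooks=()):
    loss = model.run_with_hooks(tokens, return_type="loss",
                                loss_per_token=True, fwd_hooks=list(hooks))
    return loss[0, 64:].mean().item()

print("baseline:", second_half_loss())
# Placeholder indices; score each head's attention pattern to find real ones.
print("ablated: ", second_half_loss([zero_head(5, 1), zero_head(5, 5)]))
```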
What This Reveals About Transformer Intelligence
Induction heads are interesting beyond their specific function. They are a case where we have a mechanistic explanation for an emergent capability — and that combination is rare.
The residual stream as a communication bus
The Transformer Circuits framework (Elhage et al., 2021) — the theoretical foundation for this line of work — reframes how to think about transformer computation. Instead of “each layer transforms the representation,” think of it this way: there’s a shared residual stream running from input to output, and attention heads are computations that read from and write to that stream.
Embeddings
    │
    ▼
[Head 1 reads/writes] ──▶ residual stream ──▶ [Head 2 reads/writes] ──▶ ... ──▶ Logits
Each head is small and focused. The residual (skip) connection isn’t just a training trick — it’s the architecture’s communication channel. Induction heads demonstrate this perfectly: head 1 writes a signal into the stream at layer 1, head 2 reads it at layer 2 through key computation. The circuit is a communication protocol.
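In code, the bus view amounts to one detail: every component adds its output to a shared tensor rather than replacing it. A schematic sketch with toy dimensions and linear layers standing in for real attention and MLPs:

```python
import torch
import torch.nn as nn

d_model = 16

class ToyBlock(nn.Module):
    """Stand-in for a transformer block: each sublayer reads the stream
    and adds its write back, so later layers can read anything
    earlier layers wrote."""
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(d_model, d_model)  # stand-in for attention
        self.mlp = nn.Linear(d_model, d_model)   # stand-in for the MLP

    def forward(self, resid):
        resid = resid + self.attn(resid)  # read the stream, write back additively
        resid = resid + self.mlp(resid)
        return resid

resid = torch.randn(8, d_model)          # stand-in for token embeddings
for block in [ToyBlock(), ToyBlock()]:   # two layers sharing one stream
    resid = block(resid)
```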
Circuits, not neurons
Earlier interpretability work tried to understand neural networks by analyzing individual neurons. This mostly led to dead ends — neurons tend to be polysemantic, activating for multiple unrelated concepts due to a phenomenon called superposition (models pack more features than they have dimensions by using nearly-orthogonal directions). Analyzing neurons individually obscures the actual computational structure.
The Transformer Circuits approach looks for circuits instead: subgraphs of the network that implement specific functions through composition. Induction heads are the cleanest example. Wang et al. (2022) used the same methodology to reverse-engineer how models perform indirect object identification — completing “John and Mary went to the store. John gave a present to ___” — and found a multi-head circuit spanning several layers that implements subject-verb-object tracking.
The shift from neurons to circuits is like switching from individual logic gates to understanding entire algorithms. It’s slower, more painstaking work, but it actually answers the question “what is this model doing?”
The Current Research Frontier
The induction head paper opened a productive research program. A few threads:
Sparse autoencoders (SAEs) have become the main tool for scaling mechanistic interpretability. The polysemanticity problem means individual neurons aren’t interpretable. SAEs decompose activations into sparse combinations of interpretable features — each feature activates rarely but, when it does, corresponds to a recognizable concept. Anthropic’s “Scaling Monosemanticity” (Templeton et al., 2024) applied this to Claude 3 Sonnet, training SAEs with up to 34 million features and surfacing representations as specific as “the Golden Gate Bridge” alongside behavioral features associated with sycophancy.
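The architecture itself is small; the hard parts are scale and training. A minimal PyTorch sketch of the standard setup, with illustrative dimensions and L1 coefficient (not values from the paper):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: decompose a d_model-dim activation into a sparse,
    overcomplete code whose directions are candidate features."""
    def __init__(self, d_model=768, n_features=16384):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(f), f         # reconstruction + code

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Faithful reconstruction, plus an L1 penalty that forces sparsity.
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

sae = SparseAutoencoder()
x = torch.randn(32, 768)              # stand-in for cached activations
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
```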
Universal circuits. The induction head circuit appears across GPT-2, GPT-Neo, and models trained from scratch with different seeds. The same functional circuits appear to be convergently discovered — optimal solutions to common computational subproblems that any sufficiently trained transformer will find. This suggests there’s a grammar of transformer computation waiting to be catalogued.
Causal tracing (Meng et al., 2022) let researchers surgically identify where factual knowledge lives in model weights. The finding: factual associations are stored and retrieved by specific mid-layer MLP modules, through a recognizable computation pattern. “The Eiffel Tower is in [Paris]” is not distributed across all weights — it’s in specific places that can be located and edited.
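The core move in causal tracing is activation patching: corrupt the subject tokens with noise, then restore one layer’s clean activations at a time and watch where the correct answer recovers. A compressed sketch with TransformerLens; the subject token span and noise scale are illustrative:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")
paris = model.to_single_token(" Paris")
SUBJ = slice(2, 6)                    # rough span of "Eiffel Tower" (illustrative)

_, clean_cache = model.run_with_cache(tokens)   # record the clean run

def corrupt_subject(emb, hook):
    # Drown the subject embeddings in noise (the "corrupted run").
    emb[:, SUBJ] += 3.0 * torch.randn_like(emb[:, SUBJ])
    return emb

def restore(layer):
    # Patch the clean residual stream back in at one layer's output.
    name = utils.get_act_name("resid_post", layer)
    def hook(resid, hook):
        resid[:, SUBJ] = clean_cache[name][:, SUBJ]
        return resid
    return (name, hook)

for layer in range(model.cfg.n_layers):
    logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(utils.get_act_name("embed"), corrupt_subject), restore(layer)],
    )
    # Layers where restoring recovers " Paris" are where the fact is retrieved.
    print(layer, logits[0, -1, paris].item())
```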
Reasoning circuits are the current hard problem. Multi-step logical inference, mathematical reasoning, chain-of-thought — these involve more heads, cross more layers, and don’t fit into the clean two-head template of induction heads. Progress is happening but slowly.
What the Future Might Look Like
The bigger the model, the harder the analysis. Most circuit-level explanations cover carefully chosen tasks in small models. Scaling mechanistic interpretability to frontier models with hundreds of billions of parameters is an open engineering problem.
But the trajectory suggests several concrete possibilities:
Interpretability-assisted alignment. If specific circuits are responsible for deceptive behavior, refusal, or sycophancy, they can potentially be monitored or edited directly — not by adjusting prompts, but by intervening on activations. The SAE work already identified features associated with specific behavioral tendencies. Features and circuits become levers.
Diagnosing failures. When a model fails on a task, mechanistic analysis can sometimes locate the misfiring circuit. This could make failure diagnosis a principled engineering activity rather than empirical guesswork — identify what the model is doing wrong structurally, not just what output it produces.
Architecture from first principles. Understanding why induction heads form and what makes them effective could inform architecture decisions. If we want strong in-context learning, we could design circuits that implement it more efficiently. If we understand superposition better, we could build models that are natively more interpretable.
The skeptical view deserves mention: most interesting behaviors in large models may not decompose into human-understandable circuits. Superposition means the decomposition is never clean. The complexity may simply be too high. Full mechanistic understanding of a frontier model may be intractable even in principle.
Still: finding induction heads changed the conversation from “emergence is mysterious” to “emergence has structure.” One circuit doesn’t explain everything about intelligence in transformers. But it proved there’s something to find — that underneath the black box is a mechanism, and mechanisms can be understood.
The paper: Olsson et al. (2022), “In-context Learning and Induction Heads”. The foundational framework: Elhage et al. (2021), “A Mathematical Framework for Transformer Circuits”. Both are on Anthropic’s Transformer Circuits site.