Speculative Decoding: Getting K Tokens for the Price of One
Every token you’ve ever received from an LLM was generated one at a time. No matter how capable the model, no matter how fast the hardware: one forward pass, one token, repeat. This constraint is so fundamental that it has a name — autoregressive decoding — and it is the dominant factor in LLM inference latency.
Speculative decoding is a technique that breaks this constraint. Not by changing how the model works, but by exploiting a structural asymmetry in transformer computation: generating a token is sequential and slow, but verifying a sequence of proposed tokens can be done in a single parallel pass. The trick is to use a cheap draft model to make the proposals, then let the expensive target model verify all of them at once. If the proposals are good, you get several tokens for roughly the cost of one target model forward pass. If they’re bad, you fall back to normal decoding. Either way, the output distribution is provably unchanged.
This is not an approximation. It’s a lossless speedup.
Why Autoregressive Decoding Is Slow
To understand why speculative decoding helps, you need to understand what makes normal decoding slow.
The bottleneck isn’t computation — it’s memory bandwidth. At each decoding step, the GPU must load every weight in the model from HBM (High Bandwidth Memory) to compute the next token. For a 70-billion-parameter model stored in float16, that’s roughly 140 GB of data that must travel from HBM to the compute units for each token. A high-end H100 has about 3.35 TB/s of HBM bandwidth, which puts a floor of ~42 milliseconds per token purely on the data movement — before any actual computation.
The problem is that a transformer’s computation per token is relatively small. Feeding one token through a 70B model requires roughly 140 billion floating-point operations. At the H100’s peak of ~2,000 TFLOPS, that’s about 0.07 milliseconds of pure compute. The ratio between compute and memory movement is wildly lopsided: the GPU is mostly waiting for data to arrive, not crunching numbers.
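The two figures above can be reproduced with a few lines of arithmetic. This is a rough sketch that uses only the numbers quoted in the text; it assumes perfect hardware utilization and ignores the KV cache and activations, so treat the results as orders of magnitude rather than measurements.

```python
# Back-of-the-envelope cost of decoding one token, using the figures quoted above.
PARAMS = 70e9            # model parameters
BYTES_PER_PARAM = 2      # float16
HBM_BANDWIDTH = 3.35e12  # bytes/second, as quoted above
PEAK_FLOPS = 2e15        # FLOP/second, the peak figure quoted above

weight_bytes = PARAMS * BYTES_PER_PARAM   # bytes read from HBM per decoding step
flops_per_token = 2 * PARAMS              # one multiply and one add per weight

memory_ms = weight_bytes / HBM_BANDWIDTH * 1e3
compute_ms = flops_per_token / PEAK_FLOPS * 1e3
ratio = memory_ms / compute_ms

print(f"memory floor: {memory_ms:.1f} ms")
print(f"compute time: {compute_ms:.2f} ms")
print(f"memory/compute ratio: {ratio:.0f}x")
```

Under these assumptions, moving the weights takes several hundred times longer than the arithmetic itself.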
This is called arithmetic intensity — the ratio of FLOPs to bytes transferred. Autoregressive decoding has low arithmetic intensity. The GPU hardware was designed for workloads (like training, or the attention pass over a long prompt) where the intensity is much higher. During token generation, a large fraction of the GPU’s compute capacity sits idle.
Speculative decoding addresses this directly: if the bottleneck is memory bandwidth per token, generate more tokens per weight-loading cycle.
The Asymmetry That Makes It Possible
Here is the structural fact that speculative decoding exploits: a transformer forward pass is parallelizable across positions.
During training, this is obvious — you feed in the full training sequence and the model processes all positions simultaneously, which is what makes training efficient. During generation, you only have one new token per step, so there’s nothing to parallelize. But what if you had a candidate sequence of K tokens that you wanted to evaluate? You could pass all K positions through the model in a single forward pass and get the model’s probability distribution at every position simultaneously.
This is the verification step. Verifying a K-token sequence costs only modestly more than a single-token pass, because you’re loading the same weights regardless. The arithmetic intensity improves roughly linearly with K.
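A simple roofline estimate makes the "same weights regardless" point concrete. The sketch below is an idealized model, not a benchmark: it reuses the 70B, float16, 3.35 TB/s and ~2,000 TFLOPS figures quoted earlier, and the function names `pass_time_ms` and `arithmetic_intensity` are illustrative. It ignores attention over the KV cache and activation traffic, which is why real verification passes cost modestly more than a single-token pass rather than exactly the same.

```python
def pass_time_ms(k, params=70e9, bytes_per_param=2,
                 bandwidth=3.35e12, peak_flops=2e15):
    """Idealized time for one forward pass over k positions (roofline model)."""
    memory_s = params * bytes_per_param / bandwidth  # weights are read once per pass
    compute_s = 2 * params * k / peak_flops          # FLOPs grow with the number of positions
    return max(memory_s, compute_s) * 1e3            # the slower of the two dominates


def arithmetic_intensity(k, bytes_per_param=2):
    """FLOPs per byte of weights read: (2 * params * k) / (params * bytes_per_param)."""
    return 2 * k / bytes_per_param


for k in (1, 4, 8, 64):
    print(f"K={k:>2}  {pass_time_ms(k):.1f} ms  {arithmetic_intensity(k):.0f} FLOP/byte")
```

In this model the pass time stays flat while the intensity grows linearly with K, until K is large enough (hundreds of positions under these assumptions) for compute to become the bottleneck.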
Without speculation:
target pass 1 → token 1
target pass 2 → token 2
target pass 3 → token 3
target pass 4 → token 4
4 passes, 4 tokens
With speculation (K=4, all accepted):
draft passes 1-4 → draft tokens 1,2,3,4 (cheap)
target pass 1 → verifies all 4 + bonus (one expensive pass)
1 target pass, 5 tokens (4 drafted + 1 bonus)
The draft model is a smaller version of the same architecture that shares the target’s vocabulary, small enough that its four sequential passes cost significantly less than one target model pass. In practice, the draft model is typically 10-100× smaller in parameter count.
The Algorithm
Here’s the procedure in full, as originally described by Leviathan, Kalman, and Matias (2023):
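Step 1: Draft. Using the draft model, autoregressively generate $K$ candidate tokens $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_K$. Record the draft model’s distribution at each step, $q(x \mid \text{context}, \tilde{x}_1, \ldots, \tilde{x}_{t-1})$ for $t = 1 \ldots K$: the accept test in Step 3 needs $q(\tilde{x}_t)$, and the fallback in Step 4 needs the full distribution.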
Step 2: Verify. Pass all $K$ candidate tokens through the target model in a single forward pass. This produces the target model’s probability distributions at positions $1$ through $K+1$: $p(x \mid \text{context})$, $p(x \mid \text{context}, \tilde{x}_1)$, …, $p(x \mid \text{context}, \tilde{x}_1, \ldots, \tilde{x}_K)$.
Step 3: Accept or reject, left to right. For each position $t$ from $1$ to $K$, in order, decide whether to accept $\tilde{x}_t$, using a fresh random draw at each position:
- Draw $u \sim \text{Uniform}[0, 1]$
- If $u \leq \dfrac{p(\tilde{x}_t)}{q(\tilde{x}_t)}$, accept $\tilde{x}_t$
- Otherwise, reject $\tilde{x}_t$ and stop processing further positions
Step 4: Sample the fallback. If $\tilde{x}_t$ was rejected, sample a corrected token from the distribution
$$p'(x) = \text{normalize}\!\left(\max\!\left(0,\; p(x \mid \ldots) - q(x \mid \ldots)\right)\right)$$
and stop. If all $K$ tokens were accepted, sample one additional token from $p(x \mid \text{context}, \tilde{x}_1, \ldots, \tilde{x}_K)$, the distribution the target model computed at position $K+1$ for free.
Step 5: Repeat. The accepted tokens (plus the fallback or bonus token) extend the sequence. Return to Step 1.
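Steps 3 and 4 are compact enough to sketch directly. The version below is a minimal illustration on toy distributions, not a production implementation: distributions are plain Python lists indexed by token id, the names `sample` and `speculative_step` are made up for this example, and the toy `q_dists` are fixed rather than conditioned on earlier draft tokens as they would be with a real draft model.

```python
import random


def sample(weights, rng):
    """Draw one token id in proportion to weights (they need not sum to 1)."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(draft_tokens, q_dists, p_dists, rng):
    """One round of Steps 3-4.

    draft_tokens: K token ids proposed by the draft model
    q_dists:      K draft distributions, one per drafted position
    p_dists:      K+1 target distributions from the single verification pass
    Returns the tokens to append to the sequence (between 1 and K+1 of them).
    """
    out = []
    for t, tok in enumerate(draft_tokens):
        p, q = p_dists[t], q_dists[t]
        # u lies in [0, 1), so this accepts with probability min(1, p/q).
        if rng.random() < p[tok] / q[tok]:
            out.append(tok)
            continue
        # Rejected: resample from max(0, p - q); sample() normalizes implicitly.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out  # everything after the first rejection is discarded
    # All K drafts accepted: take the bonus token from position K+1.
    out.append(sample(p_dists[len(draft_tokens)], rng))
    return out


# Toy run: vocabulary of 3 tokens, K = 2.
rng = random.Random(0)
q_dists = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
p_dists = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
draft_tokens = [sample(q, rng) for q in q_dists]
print(draft_tokens, "->", speculative_step(draft_tokens, q_dists, p_dists, rng))
```

The strict `<` comparison is a small design choice: it guarantees that a token the target assigns zero probability is never accepted, and for a continuous draw it is otherwise equivalent to the `≤` in the description above.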
Context: "The capital of France is"
Draft model generates (K=4):
"Paris" (q=0.82), "," (q=0.91), " which" (q=0.44), " is" (q=0.71)
Target model verifies in one pass:
p("Paris") = 0.89 → accept (0.89/0.82 ≈ 1.09 ≥ 1, always accepted)
p(",") = 0.87 → accept (0.87/0.91 ≈ 0.96, accepted with probability 0.96)
p(" which") = 0.11 → reject (0.11/0.44 = 0.25, u=0.41 > 0.25)
Sample fallback from normalize(max(0, p(x) - q(x)))
Tokens produced from one target pass: "Paris", "," + one fallback token
Why the Output Distribution Is Exact
The key question is: does this actually produce the same distribution as running the target model alone?
The answer is yes, and the proof follows from properties of rejection sampling. Consider a single position where the draft model proposes token $x$ with probability $q(x)$ and the target model assigns probability $p(x)$.
The probability that token $x$ appears in the output is:
$$P(\text{output} = x) = q(x) \cdot \min\!\left(1, \frac{p(x)}{q(x)}\right) + P(\text{reject}) \cdot \frac{\max(0, p(x) - q(x))}{Z}$$
where the first term is the probability the draft model produces $x$ and it gets accepted, and the second term is the probability the draft token is rejected and we resample from the fallback distribution.
The rejection probability is:
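$$P(\text{reject}) = \sum_{x'} q(x') \cdot \max\!\left(0, 1 - \frac{p(x')}{q(x')}\right) = \sum_{x'} \max(0, q(x') - p(x'))$$
And $Z = \sum_x \max(0, p(x) - q(x))$ is the normalization constant for the fallback distribution. It equals $P(\text{reject})$ because $p$ and $q$ each sum to 1, so $\sum_x (p(x) - q(x)) = 0$: the total mass where $p$ exceeds $q$ must match the total mass where $q$ exceeds $p$. Both sums equal the total variation distance between $p$ and $q$.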
After substituting and simplifying:
$$P(\text{output} = x) = \min(p(x), q(x)) + \frac{P(\text{reject})}{Z} \cdot \max(0, p(x) - q(x)) = p(x)$$
The last equality holds because $P(\text{reject}) = Z$, which leaves $\min(p(x), q(x)) + \max(0, p(x) - q(x))$; that sum is $p(x)$ whether $p(x)$ is above or below $q(x)$. The output distribution at every position is exactly $p$, regardless of what $q$ is. The draft model can be terrible; the output is still distributed as if you ran the target model alone. Bad draft models just slow things down (more rejections, fewer accepted tokens per pass); they don’t change correctness.
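The identity is easy to check numerically. The sketch below computes the exact output probabilities of one accept/reject round for a small made-up vocabulary; the function name `output_distribution` and the example distributions are illustrative only.

```python
def output_distribution(p, q):
    """Exact per-token output probabilities of one accept/reject round."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]        # q(x) * min(1, p(x)/q(x))
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]  # unnormalized fallback
    p_reject = sum(max(0.0, qi - pi) for pi, qi in zip(p, q))
    z = sum(residual)
    if z == 0.0:  # p == q everywhere, so nothing is ever rejected
        return accepted
    return [a + p_reject * r / z for a, r in zip(accepted, residual)]


p = [0.6, 0.3, 0.1]
q = [0.1, 0.1, 0.8]  # a deliberately poor draft distribution
out = output_distribution(p, q)
print([round(v, 6) for v in out])
```

Even with a draft distribution that puts most of its mass on the wrong token, the result matches `p` up to floating-point error.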
The Speedup
How much faster is speculative decoding? It depends on the acceptance rate — how often the draft model’s proposals are accepted.
Let $\alpha$ be the expected acceptance probability per token (a simplification, since it varies by position and context). The expected number of tokens produced per target model forward pass is:
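$$\mathbb{E}[\text{tokens per pass}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$
This is a truncated geometric series: the first $i$ draft tokens are all accepted with probability $\alpha^i$, so the expected number of accepted drafts is $\sum_{i=1}^{K} \alpha^i$, and every pass contributes one more token (the fallback or the bonus), giving $\sum_{i=0}^{K} \alpha^i$. Some values: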
| $\alpha$ | $K = 4$ | $K = 8$ |
|---|---|---|
| 0.5 | 1.94 | 2.00 |
| 0.7 | 2.77 | 3.20 |
| 0.9 | 4.10 | 6.13 |
| 0.95 | 4.52 | 7.40 |
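The closed form can be evaluated directly, and cross-checked against the explicit sum it comes from. This is a small sketch under the same simplifying assumption as the formula (a single constant acceptance probability $\alpha$); the function names are illustrative.

```python
def expected_tokens_per_pass(alpha, k):
    """Closed form (1 - alpha^(k+1)) / (1 - alpha), valid for 0 <= alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_tokens_by_sum(alpha, k):
    """Same quantity as an explicit sum of P(first i drafts accepted), i = 0..k."""
    return sum(alpha ** i for i in range(k + 1))


for alpha in (0.5, 0.7, 0.9, 0.95):
    row = [expected_tokens_per_pass(alpha, k) for k in (4, 8)]
    print(alpha, [f"{v:.2f}" for v in row])
```

Whatever $K$ you pick, the value stays below $1/(1-\alpha)$, the limit of the series.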
High acceptance rate + large $K$ is where the speedup is significant. But returns diminish in $K$: the expected count can never exceed $1/(1-\alpha)$, so at low $\alpha$ extra draft tokens add almost nothing, and even at high $\alpha$ doubling $K$ falls short of doubling throughput.
The actual wall-clock speedup also depends on the cost ratio between draft and target passes. If the draft model takes $c$ seconds per token and the target takes $T$ seconds per token, the speedup factor is roughly:
$$\text{speedup} \approx \frac{\mathbb{E}[\text{tokens per pass}]}{1 + K \cdot (c / T)}$$
For speculative decoding to help, you need $K \cdot c \ll T$. If the draft model is 1/10 the cost of the target, and $K = 4$, the denominator is $1 + 0.4 = 1.4$, and with $\alpha = 0.8$ the numerator is $(1 - 0.8^5)/0.2 \approx 3.36$, giving a speedup of about $2.4\times$. Real-world numbers are in this range for favorable workloads.
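Because the numerator saturates while the denominator keeps growing with $K$, this estimate has a best draft length for any given $\alpha$ and cost ratio. The sketch below searches for it by brute force. It inherits every simplification of the formula above (constant $\alpha$, fixed per-pass costs), and `speedup` and `best_k` are names invented for this example.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens produced per target pass for acceptance rate alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha, k, cost_ratio):
    """Estimated speedup; cost_ratio is c / T (draft pass cost over target pass cost)."""
    return expected_tokens_per_pass(alpha, k) / (1 + k * cost_ratio)


def best_k(alpha, cost_ratio, max_k=32):
    """Draft length in 1..max_k that maximizes the estimated speedup."""
    return max(range(1, max_k + 1), key=lambda k: speedup(alpha, k, cost_ratio))


for alpha in (0.6, 0.8, 0.9):
    k = best_k(alpha, cost_ratio=0.1)
    print(f"alpha={alpha}: best K={k}, estimated speedup={speedup(alpha, k, 0.1):.2f}x")
```

The pattern to look for is that the best $K$ grows with the acceptance rate: a draft model that is usually right justifies longer speculative runs.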
When Speculation Helps (and When It Doesn’t)
Speculative decoding is not universally beneficial. The gain depends on what you’re generating.
High acceptance rate tasks are the sweet spot: code completion with a domain-specific draft model, factual question answering where the answer is predictable, structured output generation, and repetitive or formulaic text. In these cases, the small model’s distribution closely tracks the large model’s, and most proposals are accepted.
Low acceptance rate tasks are where speculation backfires: creative writing with high temperature, diverse open-ended generation, tasks where the large model’s behavior diverges significantly from any small model. Here, the draft tokens are mostly rejected, and you’re paying the cost of the draft model for almost no benefit.
Batch size matters too. Speculative decoding was designed for single-stream, latency-sensitive inference — one conversation, one user, minimizing time to each token. In high-throughput serving scenarios (large batches of requests processed together), the target model’s forward pass is already compute-bound rather than memory-bandwidth-bound. The arithmetic intensity is higher, and the bottleneck shifts. Batching makes better use of the GPU’s compute capacity, reducing the opportunity for speculation to help.
There’s also no improvement to time to first token. The first token still requires a full target model pass. Speculative decoding improves the throughput of subsequent tokens, not the initial latency.
Variants: When You Don’t Have a Draft Model
The original algorithm assumes you have a separate draft model — ideally one trained on the same data as the target, just smaller. Maintaining two models adds operational complexity. Several variants remove this requirement.
Self-Speculative Decoding
Some architectures support “early exit”: producing an approximate prediction from an intermediate layer rather than running all layers. If the intermediate layers are good enough to draft, the full model can serve as both draft and target, drafting tokens from the early exit and verifying them with the full pass. The cost is that the draft quality is limited by the early layers, but there’s no second model to manage.
Medusa
Medusa trains multiple additional “heads” on top of the target model, where each head predicts a future token. Head 1 predicts position $t+1$, head 2 predicts $t+2$, and so on. These heads are much smaller than the main model and are trained to mimic the model’s output at each future position.
                          ┌── head 1 → predicts t+1
Token t → [main model] ───┤── head 2 → predicts t+2
                          └── head 3 → predicts t+3
Accept/reject combinations with target logits from the same pass.
Because the heads run alongside the main model in the same forward pass (adding minimal overhead), Medusa avoids the separate draft model entirely. The tradeoff is that the heads are less accurate than a dedicated small model — they see less context and have less capacity — so acceptance rates are lower.
EAGLE
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) trains a lightweight draft model that operates on the target model’s internal feature representations, not just its token outputs. Rather than predicting the next token from the output distribution, the EAGLE draft model predicts the next feature vector, then uses the target model’s existing heads to convert that to a token distribution. This feature-level alignment gives much higher acceptance rates than a standalone small model of comparable size, because the draft model is working in the target model’s representation space.
The Tradeoff Space
Summarizing the design choices:
| Separate draft model | Same-model speculation |
|---|---|
| Higher acceptance rate | No second model to host |
| More operational complexity | Lower acceptance rate |
| Can specialize draft per domain | Architecture-dependent |
| Small K (few draft tokens) | Large K (many draft tokens) |
|---|---|
| Lower overhead when rejected | Higher potential speedup |
| Lower potential speedup | More wasted compute when rejected |
| Good for uncertain tasks | Good for predictable tasks |
In practice, K between 4 and 8 with a dedicated draft model roughly an order of magnitude or more smaller than the target is a common configuration. For production deployments, the draft model is often fine-tuned on the same domain as the expected workload to maximize acceptance rates.
Where It Runs Today
Speculative decoding is now widely deployed. Large providers such as Anthropic and Google are understood to use it in their serving stacks, though few publish the details. The open-source ecosystem has broad support: llama.cpp added speculative decoding in 2023, vLLM and SGLang both support it with configurable draft models, and the HuggingFace generate() API supports it via the assistant_model parameter.
The technique was introduced independently in two papers that appeared within months of each other: Leviathan et al. (“Fast Inference from Transformers via Speculative Decoding,” Google, posted in late 2022 and published at ICML 2023) and Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling,” DeepMind, early 2023). Both papers arrive at the same losslessness result.
What It Tells You About LLM Inference
Speculative decoding works because of a fundamental asymmetry in transformer computation: the model can check faster than it can create. Verifying a proposed completion — deciding whether each token is plausible — requires less sequential work than generating it from scratch. The draft model is just a cheap way to populate the candidate sequence with tokens that have a reasonable chance of passing verification.
A similar asymmetry shows up elsewhere too. Checking a chain of thought is often easier than producing one, and reviewing a proposed code edit is easier than writing the code from scratch. Speculation just industrializes it: given that checking is cheaper than generating, offload the generation to a cheap model and use the expensive model exclusively for checking.
The output distribution is preserved because the mathematics of rejection sampling guarantees it. The draft model’s probability assignments influence only the efficiency — how many proposals get accepted — not the final token distribution. Whether the draft model is brilliant or terrible, every token you receive is exactly as likely as it would have been from the target model alone. The speedup is free in the information-theoretic sense: you get more tokens per unit time without any change to what those tokens are.