microGPT from First Principles: 200 Lines That Explain LLMs


Andrej Karpathy recently published microGPT – a complete GPT implementation in 200 lines of pure Python with zero dependencies. No PyTorch, no TensorFlow, no NumPy. Just math, random, and the raw algorithm. He also wrote an excellent companion blog post explaining the motivation and design.

What makes this implementation remarkable is the claim in its opening docstring: “This file is the complete algorithm. Everything else is just efficiency.” That’s a strong claim. And it’s true. The same mathematical operations running in these 200 lines are what run inside ChatGPT, Claude, Gemini, and every other transformer-based language model. The difference is scale and speed – not algorithm.

This post walks through microGPT’s key lines from first principles. I’ve added ASCII diagrams at every stage to make the data flow visible. The goal is not to replace Karpathy’s explanation but to add another layer of accessibility – to make these ideas click for people who haven’t spent years in machine learning.

The 30-Second Version

Here’s what the entire program does:

┌─────────────────────────────────────────────────────┐
│  1. DATASET: Load 32,000 human names ("emma", ...)  │
│  2. TOKENIZER: Map each character → integer ID      │
│  3. MODEL: Build a tiny GPT (4,192 parameters)      │
│  4. TRAIN: Show it names, adjust parameters         │
│  5. GENERATE: Ask it to invent new names            │
└─────────────────────────────────────────────────────┘

After training, the model produces plausible-sounding names it has never seen, like “Aalina” or “Relyn”. It learned the statistical patterns of English names – which letters follow which, how names start and end – purely from examples.

Part 1: Data and Tokenization

Neural networks don’t understand text. They understand numbers. The first job is to convert characters into integers.

docs = [line.strip() for line in open('input.txt') if line.strip()]
uchars = sorted(set(''.join(docs)))
BOS = len(uchars)
vocab_size = len(uchars) + 1

The dataset is 32,000 names. uchars collects every unique character across all names and sorts them, giving us a character-to-integer mapping:

Character:  a  b  c  d  e ... x  y  z
Token ID:   0  1  2  3  4 ... 23 24 25
BOS token:  26

BOS (Beginning of Sequence) is a special token that marks “the name starts here” and “the name ends here.” A name like “emma” becomes the token sequence [26, 4, 12, 12, 0, 26] – BOS, then e-m-m-a, then BOS again. The second BOS acts as an end marker:

 BOS   e    m    m    a   BOS
[26]  [4]  [12] [12] [0] [26]
  ↑                        ↑
start                    end

This is a character-level tokenizer. Production models like GPT-4 use subword tokenizers (BPE) with vocabularies of ~200,000 tokens, where common words like “the” are a single token and rare words get split into pieces. The principle is identical: map text to a sequence of integers.
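The tokenization steps above can be condensed into a runnable sketch. The `docs`/`uchars`/`BOS` names mirror the article's code; the tiny three-name dataset and the `encode`/`decode` helpers are illustrative additions, not part of microGPT itself:

```python
# Character-level tokenizer, as described above (toy dataset stands in
# for the 32,000 names).
docs = ["emma", "olivia", "ava"]
uchars = sorted(set(''.join(docs)))   # unique characters, sorted
BOS = len(uchars)                     # special begin/end-of-sequence id
vocab_size = len(uchars) + 1

def encode(name):
    # BOS marks both the start and the end of a name
    return [BOS] + [uchars.index(ch) for ch in name] + [BOS]

def decode(tokens):
    return ''.join(uchars[t] for t in tokens if t != BOS)

print(encode("ava"))           # [7, 0, 6, 0, 7] for this toy vocabulary
print(decode(encode("ava")))   # "ava"
```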

Part 2: The Autograd Engine

This is the most elegant part of the code. The Value class implements automatic differentiation – the ability to compute derivatives through an arbitrary chain of math operations. This is what makes neural network training possible.

Why Derivatives Matter

Training a neural network means finding parameter values that minimize a loss function. The loss measures how wrong the model’s predictions are. To reduce it, we need to know: for each parameter, if I nudge it slightly, does the loss go up or down? That’s the derivative (gradient) of the loss with respect to each parameter.

Parameter: 0.5
                        ┌─────────────┐
  nudge right → 0.501 ──┤             ├── loss = 2.38   ← went up
  original  →   0.500 ──┤    model    ├── loss = 2.37
  nudge left →  0.499 ──┤             ├── loss = 2.36   ← went down
                        └─────────────┘

  Gradient is positive → move the parameter left (decrease it)

With 4,192 parameters, we need 4,192 gradients. Computing them by nudging each parameter one at a time would require 4,192 forward passes. Backpropagation computes all of them in a single backward pass. That’s the magic of autograd.
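The "nudge and see" approach can be made concrete with a finite-difference estimate on a toy one-parameter loss (the function `f` here is a made-up stand-in, not part of microGPT):

```python
# Gradient by nudging (central finite difference) on a toy "loss"
# minimized at p = 2.0.
def f(p):
    return (p - 2.0) ** 2

p, h = 0.5, 1e-6
grad = (f(p + h) - f(p - h)) / (2 * h)   # ≈ analytic derivative 2(p-2) = -3.0
print(grad)

p -= 0.1 * grad   # one gradient-descent step: p moves from 0.5 toward 2.0
print(p)
```

Doing this once per parameter would cost one forward pass per parameter; backpropagation gets all the gradients from a single backward pass.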

The Value Class

class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

Every Value stores three things: the computed number (data), its gradient (grad, filled in during the backward pass), and how it was created (_children and _local_grads). Together, these form a computation graph – a record of every mathematical operation.

When you write c = a + b, the resulting Value remembers that it came from a and b via addition:

  a (data=3.0) ──┐
                  ├──(+)──→ c (data=5.0)
  b (data=2.0) ──┘

  children: (a, b)
  local_grads: (1, 1)     ← derivative of (a+b) w.r.t. a is 1,
                              derivative of (a+b) w.r.t. b is 1

For multiplication, the local gradients are different:

def __mul__(self, other):
    return Value(self.data * other.data, (self, other), (other.data, self.data))
  a (data=3.0) ──┐
                  ├──(×)──→ c (data=6.0)
  b (data=2.0) ──┘

  children: (a, b)
  local_grads: (2.0, 3.0)  ← d(a×b)/da = b = 2.0
                               d(a×b)/db = a = 3.0

This is the product rule from calculus: the derivative of $a \times b$ with respect to $a$ is $b$, and vice versa. Each operation in the Value class encodes its own derivative rule:

| Operation  | Forward        | Local gradient(s)            |
|------------|----------------|------------------------------|
| `a + b`    | $a + b$        | $1, 1$                       |
| `a * b`    | $a \times b$   | $b, a$                       |
| `a ** n`   | $a^n$          | $n \cdot a^{n-1}$            |
| `a.exp()`  | $e^a$          | $e^a$                        |
| `a.log()`  | $\ln(a)$       | $1/a$                        |
| `a.relu()` | $\max(0, a)$   | $1$ if $a > 0$, else $0$     |

Backpropagation

The backward() method walks the computation graph in reverse and accumulates gradients using the chain rule: if $z$ depends on $y$ which depends on $x$, then $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$.

def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad

First, build_topo performs a topological sort – it arranges all nodes so that every node appears after its children. Then gradients flow backward from the loss (whose gradient is 1 by definition) through every operation to every parameter:

Forward pass (left to right):  compute values
─────────────────────────────────────────────────────────────→

  a=3.0 ──(×)──→ d=6.0 ──(+)──→ f=7.0 ──(-log)──→ loss=−1.95
  b=2.0 ──┘      e=1.0 ──┘

←─────────────────────────────────────────────────────────────
Backward pass (right to left): compute gradients

  loss.grad = 1.0

  f.grad = 1.0 × (−1/f.data) = −0.143        ← chain rule through -log
  d.grad = f.grad × 1 = −0.143                 ← chain rule through +
  e.grad = f.grad × 1 = −0.143                 ← chain rule through +
  a.grad = d.grad × b.data = −0.143 × 2 = −0.286  ← chain rule through ×
  b.grad = d.grad × a.data = −0.143 × 3 = −0.429  ← chain rule through ×

The += in child.grad += local_grad * v.grad is important – when a value is used in multiple operations, its gradient accumulates contributions from all of them. This handles the case where one parameter influences the loss through multiple paths.
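The Value class and backward() shown above are runnable as-is; here they are condensed to just `+` and `*` so the gradient accumulation through multiple paths is easy to see:

```python
# Minimal autograd: the article's Value class, trimmed to + and *.
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

a = Value(3.0)
b = Value(2.0)
loss = a * b + a    # a influences the loss through two paths
loss.backward()
print(a.grad)       # 2.0 (via a*b) + 1.0 (via +a) = 3.0
print(b.grad)       # 3.0
```

Because `a` feeds the loss through both the multiplication and the addition, its gradient is the sum of both contributions: exactly what the `+=` provides.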

This is the same backpropagation algorithm that PyTorch, TensorFlow, and JAX implement. The difference is that those frameworks operate on tensors (multi-dimensional arrays of numbers) and run on GPUs. microGPT operates on individual scalar values, which makes it ~1,000,000× slower but conceptually identical.

Part 3: The Model Architecture

With autograd in place, we can define the transformer. microGPT follows the GPT-2 architecture with minor simplifications.

Parameters

n_layer = 1       # number of transformer layers
n_embd = 16       # embedding dimension
block_size = 16   # maximum sequence length
n_head = 4        # number of attention heads
head_dim = n_embd // n_head  # = 4 dimensions per head

The model has 4,192 learnable parameters. For comparison, GPT-2 has 1.5 billion, GPT-4 reportedly has over a trillion, and Claude’s parameter count is undisclosed. But the architecture is the same.

Embeddings

The first thing the model does with a token is look up its embedding – a learned vector of numbers that represents that token in the model’s internal space:

tok_emb = state_dict['wte'][token_id]   # token embedding
pos_emb = state_dict['wpe'][pos_id]     # position embedding
x = [t + p for t, p in zip(tok_emb, pos_emb)]
Token "e" (id=4)                Position 1
        │                              │
        ▼                              ▼
 ┌─────────────┐               ┌─────────────┐
 │  wte[4]     │               │  wpe[1]     │
 │  (lookup    │               │  (lookup    │
 │   row 4)    │               │   row 1)    │
 └──────┬──────┘               └──────┬──────┘
        │                              │
        ▼                              ▼
 [0.02, -0.05, 0.11, ...]      [0.08, 0.01, -0.03, ...]
        │                              │
        └──────────┬───────────────────┘
                   ▼
              element-wise add
                   │
                   ▼
         [0.10, -0.04, 0.08, ...]
              16 numbers

The token embedding (wte) is a table with one row per vocabulary entry (27 rows × 16 columns). Each row is a 16-dimensional vector that the model will learn to associate with that character’s meaning. Initially these are random numbers; during training, the model adjusts them so that characters with similar roles (like vowels) end up near each other in this 16-dimensional space.

The position embedding (wpe) is a separate table (16 rows × 16 columns) that encodes where in the sequence a token appears. The model needs this because the transformer processes tokens in parallel – without position information, it couldn’t distinguish “ab” from “ba”. Adding the position embedding to the token embedding gives each token a representation that encodes both what it is and where it is.
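The two lookups and the element-wise add can be sketched with tiny random tables. The `wte`/`wpe` names follow the article; the table values here are random placeholders, and `n_embd` is shrunk to 4 for readability:

```python
# Token + position embedding lookup, sketched with made-up tables.
import random
random.seed(0)

n_embd, vocab_size, block_size = 4, 27, 16
wte = [[random.uniform(-0.1, 0.1) for _ in range(n_embd)]
       for _ in range(vocab_size)]            # one row per vocab entry
wpe = [[random.uniform(-0.1, 0.1) for _ in range(n_embd)]
       for _ in range(block_size)]            # one row per position

token_id, pos_id = 4, 1                       # character "e" at position 1
x = [t + p for t, p in zip(wte[token_id], wpe[pos_id])]
print(len(x))   # one combined vector encoding both "what" and "where"
```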

RMSNorm: Keeping Numbers Stable

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

Before the transformer layers process a vector, rmsnorm normalizes it. Without normalization, values can grow or shrink uncontrollably as they pass through layers, making training unstable.

RMSNorm computes the root mean square of the vector and divides each element by it:

Input:  [4.0, 3.0, 0.0, 1.0]

Mean square:  (16 + 9 + 0 + 1) / 4 = 6.5
RMS:          √6.5 ≈ 2.55
Scale:        1 / 2.55 ≈ 0.39

Output: [1.57, 1.18, 0.0, 0.39]

The result is a vector with roughly unit magnitude. The relative proportions between elements are preserved – 4.0 is still the largest – but the absolute scale is controlled. This is a simplified version of LayerNorm (which also subtracts the mean), used in modern architectures like LLaMA because it’s faster and works just as well.
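The rmsnorm function above runs unchanged on plain floats, so the worked example can be verified directly:

```python
# The article's rmsnorm, applied to the worked example.
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)   # mean of squares
    scale = (ms + 1e-5) ** -0.5              # 1 / RMS (epsilon avoids /0)
    return [xi * scale for xi in x]

out = rmsnorm([4.0, 3.0, 0.0, 1.0])
print([round(v, 2) for v in out])   # [1.57, 1.18, 0.0, 0.39]
```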

Linear Layers: Matrix Multiplication

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

This is a matrix-vector multiply – the fundamental operation of neural networks. Each output element is a dot product between one row of the weight matrix and the input vector:

           Input x (4 values)
           [x₀, x₁, x₂, x₃]
                │
    ┌───────────┼───────────┐
    ▼           ▼           ▼
 ┌──────┐   ┌──────┐   ┌──────┐
 │ w₀·x │   │ w₁·x │   │ w₂·x │    w is a 3×4 matrix
 │ =Σwx │   │ =Σwx │   │ =Σwx │    each row is a different
 └──┬───┘   └──┬───┘   └──┬───┘    "filter" or "feature detector"
    ▼           ▼           ▼
   [y₀,        y₁,        y₂]
           Output (3 values)

A linear layer with a 64×16 weight matrix takes a 16-dimensional input and produces a 64-dimensional output. Each of the 64 output values is a weighted combination of the 16 inputs – what those weights are determines what “feature” that output detects. Training adjusts these weights so the features become useful for prediction.
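The diagram's 3×4 case can be run directly with the article's `linear`; the weight rows below are made-up examples chosen so each output is easy to read off:

```python
# Matrix-vector multiply: each output is one row's dot product with x.
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

x = [1.0, 2.0, 0.0, -1.0]
w = [[1.0, 0.0, 0.0, 0.0],   # row 0: passes x0 through unchanged
     [0.0, 1.0, 1.0, 0.0],   # row 1: sums x1 and x2
     [0.5, 0.5, 0.5, 0.5]]   # row 2: equal blend of all four inputs

print(linear(x, w))          # [1.0, 2.0, 1.0]
```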

Multi-Head Attention: How Tokens Communicate

This is the core innovation of the transformer. Attention lets each token look at all previous tokens and decide which ones are relevant.

q = linear(x, state_dict[f'layer{li}.attn_wq'])
k = linear(x, state_dict[f'layer{li}.attn_wk'])
v = linear(x, state_dict[f'layer{li}.attn_wv'])

The current token’s embedding is projected into three vectors:

  • Query (Q): “What am I looking for?”
  • Key (K): “What do I contain?”
  • Value (V): “What information do I carry?”

Think of it like a search engine. The query is your search terms. Each previous token has a key (its label) and a value (its content). Attention computes how well each key matches the query, then returns a weighted blend of the values.

                    Attention for token at position 3
                    Query: "What should follow 'e','m','m'?"

         Position 0     Position 1     Position 2     Position 3
         (BOS)          (e)            (m)            (m)
            │               │              │              │
            ▼               ▼              ▼              ▼
         Key₀           Key₁           Key₂           Key₃
            │               │              │              │
            ▼               ▼              ▼              ▼
     ┌──────────────────────────────────────────────────────────────────┐
     │ score₀ = Q₃·K₀   score₁ = Q₃·K₁   score₂ = Q₃·K₂   score₃ = Q₃·K₃ │
     │   = 0.3            = 1.8            = 2.1            = 0.9       │
     │                                                                  │
     │ softmax → weights:  [0.07,      0.34,       0.45,      0.14]     │
     └──────────────────────────────────────────────────────────────────┘
            │               │              │              │
            ▼               ▼              ▼              ▼
       0.07 × Val₀    0.34 × Val₁    0.45 × Val₂    0.14 × Val₃
            │               │              │              │
            └───────────────┴──────────────┴──────────────┘
                                   │
                                   ▼
                          Weighted sum = Output

The scores are dot products between the query and each key, divided by $\sqrt{d}$, where $d$ is the head dimension:

attn_logits = [
    sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
    for t in range(len(k_h))
]

The $\div \sqrt{d}$ scaling prevents the dot products from becoming so large that softmax saturates (producing weights like [0.0, 0.0, 1.0, 0.0] that ignore most tokens).
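The score → softmax → weighted-blend pipeline for one head can be sketched end to end with plain floats. The `q_h`/`k_h`/`v_h` names follow the article; the vectors themselves are made-up toy values:

```python
# Single-head attention for one query position over cached keys/values.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

head_dim = 4
q_h = [0.1, 0.2, -0.1, 0.3]            # query for the current token
k_h = [[0.2, 0.1, 0.0, 0.1],           # cached keys, one per past position
       [-0.1, 0.3, 0.2, 0.0]]
v_h = [[1.0, 0.0, 0.0, 0.0],           # cached values
       [0.0, 1.0, 0.0, 0.0]]

# Scaled dot-product scores, exactly as in the article's attn_logits
attn_logits = [sum(q_h[j] * k[j] for j in range(head_dim)) / head_dim ** 0.5
               for k in k_h]
weights = softmax(attn_logits)
out = [sum(w * v[j] for w, v in zip(weights, v_h)) for j in range(head_dim)]
print(out)   # a blend of the value vectors, weighted by attention
```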

Multi-head means this process runs in parallel across multiple “heads,” each operating on a different slice of the embedding:

Full embedding (16 dims)
[████████████████]
  ▼    ▼    ▼    ▼
 Head  Head Head Head     4 heads × 4 dims each = 16 dims
  0     1    2    3
  │     │    │    │
  ▼     ▼    ▼    ▼       Each head attends independently
 [██] [██] [██] [██]
  │     │    │    │
   └────┴──┬──┴────┘
           ▼
 [████████████████]       Concatenate back to 16 dims
          │
          ▼
   linear(x_attn, Wo)    Output projection mixes head results

Different heads can learn different attention patterns. One head might learn “look at the previous character,” another might learn “look at the first character of the name,” and another might learn “look at the most recent vowel.” The output projection (attn_wo) combines these perspectives into a single updated representation.

The KV Cache

Notice that keys and values are appended to lists that grow with each position:

keys[li].append(k)
values[li].append(v)

This is the KV cache. When processing position 3, the model needs keys and values from positions 0, 1, 2, and 3. Rather than recomputing them, it stores them. At each new position, only the new token’s key and value are computed – the previous ones are read from the cache:

Processing position 0:  keys = [K₀]             values = [V₀]
Processing position 1:  keys = [K₀, K₁]         values = [V₀, V₁]
Processing position 2:  keys = [K₀, K₁, K₂]     values = [V₀, V₁, V₂]
Processing position 3:  keys = [K₀, K₁, K₂, K₃] values = [V₀, V₁, V₂, V₃]
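The growth pattern above is just repeated appends; here is a toy sketch (the 2-dim `k`/`v` vectors stand in for the real projections):

```python
# KV-cache growth: each step appends only the new key/value instead of
# recomputing the whole history.
n_layer = 1
keys = [[] for _ in range(n_layer)]
values = [[] for _ in range(n_layer)]

for pos in range(4):
    k = [0.1 * pos, 0.2 * pos]   # stand-in for linear(x, attn_wk)
    v = [0.3 * pos, 0.4 * pos]   # stand-in for linear(x, attn_wv)
    keys[0].append(k)
    values[0].append(v)
    # attention at this position reads all of keys[0][:pos + 1]

print(len(keys[0]), len(values[0]))   # 4 4
```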

In production systems, the KV cache is one of the biggest memory consumers. GPT-4 serving millions of users needs to store KV caches for all active conversations simultaneously, which is why KV cache compression and eviction strategies are active areas of engineering.

Residual Connections: The Highway

After attention (and later, the MLP), the output is added back to the input:

x = [a + b for a, b in zip(x, x_residual)]
            ┌───────────────────────────┐
            │                           │
  x ────────┤──→ [Attention/MLP] ───(+)─┤──→ output
  (input)   │         │                 │
            │    (transformation)       │
            └───────────────────────────┘
                   residual connection

This is a residual (skip) connection. Instead of replacing the input, the transformation is added to it. This has two benefits:

  1. Gradient flow. During backpropagation, gradients flow through both the transformation and the skip connection. The skip connection provides a direct gradient highway that doesn’t diminish, even through many layers.
  2. Default identity. If the transformation learns to output all zeros, the output equals the input unchanged. This makes it easy for layers to learn “do nothing” when that’s optimal.

Without residual connections, deep networks (100+ layers) are nearly impossible to train. With them, each layer only needs to learn a small delta to the representation.

The MLP: Thinking About Each Token

After attention lets tokens communicate, the MLP (feed-forward network) processes each token individually:

x = linear(x, state_dict[f'layer{li}.mlp_fc1'])   # 16 → 64
x = [xi.relu() for xi in x]                        # nonlinearity
x = linear(x, state_dict[f'layer{li}.mlp_fc2'])   # 64 → 16
Input (16 dims)
      │
      ▼
 ┌──────────┐
 │ Linear   │  16 → 64  (expand)
 │ (fc1)    │
 └────┬─────┘
      │
      ▼
 ┌──────────┐
 │  ReLU    │  max(0, x) — zero out negatives
 └────┬─────┘
      │
      ▼
 ┌──────────┐
 │ Linear   │  64 → 16  (compress)
 │ (fc2)    │
 └────┬─────┘
      │
      ▼
Output (16 dims)

The MLP expands the representation to 4× the embedding dimension (16 → 64), applies a nonlinearity, then compresses back (64 → 16). The expansion gives the network a high-dimensional space to compute in, and the ReLU (Rectified Linear Unit) introduces nonlinearity – the ability to model relationships that aren’t straight lines.

Why does nonlinearity matter? Without it, stacking linear layers is mathematically equivalent to a single linear layer. No matter how many layers you add, the network can only learn linear relationships. ReLU breaks this – by zeroing out negative values, it creates different linear regions that, together, can approximate any function:

           Linear only:              With ReLU:
           (can only learn           (can learn curves
            straight lines)           and complex patterns)

    y│    /                   y│         ╱
     │   /                    │     ___╱
     │  /                     │   ╱
     │ /                      │  ╱
     │/                       │_╱
     └──────── x              └──────── x

Full Transformer Block

Putting it all together, one transformer layer looks like this:

Input x
   │
   ├────────────────────────────────┐
   ▼                                │
 RMSNorm                            │
   │                                │
   ▼                                │
 Multi-Head Attention               │
 (tokens communicate)               │
   │                                │
   ▼                                │
 (+) ←──────────────────────────────┘  residual connection
   │
   ├────────────────────────────────┐
   ▼                                │
 RMSNorm                            │
   │                                │
   ▼                                │
 MLP                                │
 (per-token computation)            │
   │                                │
   ▼                                │
 (+) ←──────────────────────────────┘  residual connection
   │
   ▼
 Output x

microGPT uses 1 layer. GPT-2 uses 48. GPT-4 reportedly uses 120. Each additional layer gives the model more capacity to learn complex patterns – more rounds of tokens communicating (attention) and being individually processed (MLP).

From Hidden State to Prediction

After the transformer layers, the final hidden state is projected to vocabulary-sized logits:

logits = linear(x, state_dict['lm_head'])
Hidden state (16 dims)
[0.3, -0.1, 0.8, ...]
          │
          ▼
    ┌───────────┐
    │  lm_head  │   16 → 27 (one score per vocab token)
    │  (linear) │
    └─────┬─────┘
          │
          ▼
Raw logits (27 values):
[ 1.2, -0.5,  0.3,  2.1, -1.0, ... ]
   a     b     c     d     e    ...

          │
          ▼
       softmax
          │
          ▼
Probabilities (27 values, sum to 1):
[0.08, 0.01, 0.03, 0.19, 0.01, ... ]
   a     b     c     d     e    ...

"After 'emm', the model thinks 'd' is most likely"

The logits are raw scores – they can be any number. Softmax converts them to probabilities (positive, summing to 1):

def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]

The subtraction of max_val is a numerical stability trick. Mathematically, $\text{softmax}(z - c) = \text{softmax}(z)$ for any constant $c$. But practically, computing $e^{1000}$ overflows while $e^{0}$ doesn’t. By subtracting the maximum, the largest exponent is always $e^0 = 1$.
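The same trick works on plain floats, where the overflow is easy to demonstrate: `math.exp(1000)` raises `OverflowError`, while the shifted version is perfectly stable:

```python
# Numerically stable softmax on plain floats (same structure as above).
import math

def softmax(logits):
    max_val = max(logits)
    exps = [math.exp(v - max_val) for v in logits]   # largest exponent is 0
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 999.0, 998.0])   # naive exp(1000) would overflow
print([round(p, 3) for p in probs])       # [0.665, 0.245, 0.09]
```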

Part 4: Training

Training is the process of adjusting parameters so the model’s predictions get better. The loop is conceptually simple:

┌──────────────────────────────────────────────┐
│  for each training step:                     │
│                                              │
│    1. Pick a name from the dataset           │
│    2. For each position in the name:         │
│       - Ask model: "what comes next?"        │
│       - Measure how wrong it was (loss)      │
│    3. Backpropagate: compute all gradients   │
│    4. Update parameters to reduce the loss   │
│                                              │
│  Repeat 1000 times                           │
└──────────────────────────────────────────────┘

The Forward Pass

doc = docs[step % len(docs)]
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

for pos_id in range(n):
    token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax(logits)
    loss_t = -probs[target_id].log()
    losses.append(loss_t)

For the name “emma”, the model makes predictions at each position:

Position 0: See BOS  → predict next → target is 'e'
Position 1: See 'e'  → predict next → target is 'm'
Position 2: See 'm'  → predict next → target is 'm'
Position 3: See 'm'  → predict next → target is 'a'
Position 4: See 'a'  → predict next → target is BOS (end)

At each position, the loss is $-\log(p_{\text{target}})$ – the negative log probability the model assigned to the correct next token. This is cross-entropy loss:

Model's predicted probabilities for next token after 'e':

  a: 0.05   d: 0.03   m: 0.08 ← correct answer
  b: 0.02   e: 0.04   n: 0.12
  c: 0.01   ...       ...

  loss = -log(0.08) = 2.53     ← high loss (model was uncertain)

After training, the model might predict:

  a: 0.02   d: 0.01   m: 0.45 ← correct answer
  b: 0.01   e: 0.02   n: 0.05
  c: 0.01   ...       ...

  loss = -log(0.45) = 0.80     ← low loss (model was confident and right)

The loss is 0 when the model assigns probability 1.0 to the correct token, and it approaches infinity as the probability approaches 0. The average loss across all positions gives a single number measuring model quality.
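The two scenarios above are one line of arithmetic each; cross-entropy at a position is just the negative log of the probability assigned to the correct token:

```python
# Cross-entropy for the before/after training scenarios above.
import math

before = -math.log(0.08)   # uncertain model: high loss
after = -math.log(0.45)    # confident, correct model: low loss
perfect = -math.log(1.0)   # probability 1.0 on the right token: zero loss
print(round(before, 2), round(after, 2), perfect)   # 2.53 0.8 -0.0
```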

The Backward Pass

loss.backward()

This single line triggers the entire backpropagation algorithm we described earlier. It walks backward through every computation that produced the loss – through the softmax, the linear layers, the attention operations, the embeddings – and computes the gradient of the loss with respect to every one of the 4,192 parameters.

After this call, p.grad on every parameter holds the answer to “how should this parameter change to reduce the loss?”

Adam Optimizer

lr_t = learning_rate * (1 - step / num_steps)
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
    m_hat = m[i] / (1 - beta1 ** (step + 1))
    v_hat = v[i] / (1 - beta2 ** (step + 1))
    p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
    p.grad = 0

The simplest optimizer would be gradient descent: p.data -= lr * p.grad. Move each parameter in the direction that reduces the loss, proportional to the learning rate. Adam is smarter. It maintains two running statistics for each parameter:

  • m (first moment): A smoothed average of recent gradients. This gives the optimizer momentum – if the gradient has been pointing the same direction for many steps, Adam moves faster in that direction.
  • v (second moment): A smoothed average of recent squared gradients. This is an estimate of the gradient’s variance. Parameters with volatile gradients get smaller updates; parameters with stable gradients get larger updates.
Gradient descent:              Adam:

  ·····→→→→→→→→→→→→          ·····→→→→→→→→→→→→
  Step size is always          Step size adapts:
  learning_rate × gradient     - Accelerates in consistent directions
                               - Slows down in noisy directions
                               - Adjusts per-parameter

The m_hat and v_hat lines apply bias correction. Because m and v start at zero, they’re underestimates during early steps. Dividing by $(1 - \beta^{t+1})$ corrects this, making the estimates accurate from the first step.

The learning rate decays linearly: lr_t = learning_rate * (1 - step / num_steps). This means large updates early (when parameters are far from good values) and small, careful updates later (when fine-tuning).
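The full update rule can be exercised on a single scalar parameter. This is a hedged sketch, not microGPT's training loop: the toy loss $(p-2)^2$, its analytic gradient, and the hyperparameter values are all assumptions chosen for illustration:

```python
# Adam with linear learning-rate decay on one parameter, minimizing
# the toy loss (p - 2)^2.
learning_rate, beta1, beta2, eps_adam = 0.1, 0.9, 0.95, 1e-8
num_steps = 200
p, m, v = 0.0, 0.0, 0.0

for step in range(num_steps):
    grad = 2 * (p - 2.0)                           # gradient of (p-2)^2
    lr_t = learning_rate * (1 - step / num_steps)  # linear decay
    m = beta1 * m + (1 - beta1) * grad             # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment (variance)
    m_hat = m / (1 - beta1 ** (step + 1))          # bias correction
    v_hat = v / (1 - beta2 ** (step + 1))
    p -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)

print(round(p, 2))   # ends close to the minimum at 2.0
```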

Part 5: Inference

After training, the model generates new names by sampling from its learned probability distributions:

temperature = 0.5
for sample_idx in range(20):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id = BOS
    sample = []
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])

Generation is autoregressive – each generated token becomes the input for the next step:

Step 0: Input BOS    → Model predicts → Sample 'a'
Step 1: Input 'a'    → Model predicts → Sample 'l'
Step 2: Input 'l'    → Model predicts → Sample 'i'
Step 3: Input 'i'    → Model predicts → Sample 'n'
Step 4: Input 'n'    → Model predicts → Sample 'a'
Step 5: Input 'a'    → Model predicts → Sample BOS (stop!)

Generated name: "alina"

Temperature controls how “creative” the model is. The logits are divided by temperature before softmax:

Original logits:     [1.0, 2.0, 3.0]

Temperature = 1.0:   softmax([1.0, 2.0, 3.0]) = [0.09, 0.24, 0.67]
                     → moderate diversity

Temperature = 0.5:   softmax([2.0, 4.0, 6.0]) = [0.02, 0.12, 0.87]
                     → confident, predictable (picks top choice more often)

Temperature = 2.0:   softmax([0.5, 1.0, 1.5]) = [0.19, 0.31, 0.51]
                     → flatter, more random

At temperature 0.5 (microGPT’s setting), the model mostly generates conventional-sounding names. At higher temperatures, it would produce more unusual combinations.
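The temperature examples above can be reproduced by dividing the logits before the softmax, exactly as the sampling loop does:

```python
# Effect of temperature: divide logits by T before softmax.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [1.0, 2.0, 3.0]
for temperature in (1.0, 0.5, 2.0):
    probs = softmax([l / temperature for l in logits])
    print(temperature, [round(p, 2) for p in probs])
# Low temperature sharpens the distribution; high temperature flattens it.
```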

This is the same temperature parameter exposed in the OpenAI, Anthropic, and Google APIs. When you set temperature=0.7 on a ChatGPT call, this exact operation is happening – just on a model with billions of parameters instead of thousands.

The Full Data Flow

Here’s every operation the model performs for a single token, end to end:

Token ID: 4 ("e")     Position: 1
     │                      │
     ▼                      ▼
 ┌────────┐            ┌────────┐
 │ wte[4] │            │ wpe[1] │     Embedding lookup
 └───┬────┘            └───┬────┘
     │     16 dims         │ 16 dims
     └──────────┬──────────┘
                ▼
           Add (element-wise)
                 │
     ┌──────────┤──────────────────────────── (save for residual)
     │          ▼
     │      RMSNorm
     │          │
     │    ┌─────┼─────┐
     │    ▼     ▼     ▼
     │   Wq    Wk    Wv               Q, K, V projections
     │    │     │     │
     │    │     │     │
     │    ▼     ▼     ▼
     │  ┌─────────────────┐
     │  │  Multi-Head      │
     │  │  Attention       │            4 heads, each 4 dims
     │  │  (score, weight, │
     │  │   blend values)  │
     │  └────────┬─────────┘
     │           ▼
     │       Wo (output projection)
     │           │
     └────► Add (residual) ◄────────┘
                │
     ┌──────────┤──────────────────────────── (save for residual)
     │          ▼
     │      RMSNorm
     │          │
     │          ▼
     │    fc1 (16 → 64)                MLP expand
     │          │
     │          ▼
     │        ReLU                     Nonlinearity
     │          │
     │          ▼
     │    fc2 (64 → 16)                MLP compress
     │          │
     └────► Add (residual) ◄────────┘
                │
                ▼
         lm_head (16 → 27)             Project to vocab
                │
                ▼
            Softmax
                │
                ▼
      Probabilities over 27 tokens
      [P(a), P(b), ..., P(z), P(BOS)]

Every arrow in this diagram is a differentiable operation tracked by the Value class. When loss.backward() is called, gradients flow backward through this entire graph, from the loss all the way to the embedding tables.

What Production Models Add

microGPT is algorithmically complete. But production LLMs differ in engineering and scale:

| Aspect         | microGPT             | Production (GPT-4, Claude, etc.)   |
|----------------|----------------------|------------------------------------|
| Parameters     | 4,192                | Hundreds of billions to trillions  |
| Layers         | 1                    | 80-120+                            |
| Embedding dim  | 16                   | 8,192-16,384+                      |
| Vocab size     | 27 (characters)      | 100,000-200,000 (subword BPE)      |
| Context length | 16 tokens            | 128K-1M+ tokens                    |
| Training data  | 32K names            | Trillions of tokens                |
| Compute        | Single CPU, hours    | Thousands of GPUs, months          |
| Math library   | Python Value scalars | CUDA tensor kernels                |
| Optimizer      | Adam (scalar)        | AdamW + gradient checkpointing     |
| Normalization  | RMSNorm              | RMSNorm (same)                     |
| Attention      | Standard             | + GQA, RoPE, Flash Attention       |
| Post-training  | None                 | RLHF, DPO, constitutional AI       |

The algorithmic core – embeddings, attention, MLPs, residual connections, softmax, cross-entropy, backpropagation, Adam – is identical. Everything in the right column is about making it bigger and faster, not about changing what the math does.

Karpathy put it well: “The model is a big math function that maps input tokens to a probability distribution over the next token.” microGPT makes that sentence literal. You can read every operation, trace every gradient, and see exactly how 4,192 numbers learn to generate plausible English names.

The complete code is here. Karpathy’s companion blog post is here. If you’ve read this far, I’d encourage running it yourself – it takes a few hours on a laptop, and watching the loss decrease and the generated names go from gibberish to recognizable is deeply satisfying.
