Fine-Tuning LLMs: What Happens to the Weights
In a previous post, we looked at post-training as a category — SFT, RLHF, DPO — and contrasted it with in-context learning. But we glossed over the mechanics: when fine-tuning actually runs, what happens to the model’s weights? Which weights change? How much do they change? And why does it matter?
These questions have practical consequences. Full fine-tuning produces a complete copy of the model — hundreds of gigabytes you need to store, serve, and manage. Parameter-efficient methods like LoRA produce adapters measured in megabytes, and multiple adapters can share one base model, swapping in and out per request. Understanding what’s physically happening to the weight matrices explains why these approaches work and when each one makes sense.
Full Fine-Tuning
The most straightforward approach: take a pre-trained model, pass your training data through it, compute the loss, and update every parameter via backpropagation. If the model has 70 billion parameters, all 70 billion get gradient updates.
Pre-trained weights W          Training data
          │                          │
          ▼                          ▼
┌──────────────────────────────────────┐
│             Forward pass             │
│        (compute predictions)         │
└──────────────┬───────────────────────┘
               │
               ▼
         Compute loss
               │
               ▼
┌──────────────────────────────────────┐
│            Backward pass             │
│  (compute gradients for ALL params)  │
└──────────────┬───────────────────────┘
               │
               ▼
W ← W - lr × ∇W   (update ALL 70B parameters)
The result is a new model: $W' = W - \eta \sum_t \nabla_W \mathcal{L}_t$, where every weight matrix in every layer has shifted from its pre-trained value. You now have two complete models on disk — the original and the fine-tuned version.
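The mechanics can be sketched in a few lines of numpy. This is a toy, not a real training loop: a single weight matrix stands in for all of the model's parameters, and the "task" is one invented input/target pair with a squared-error loss. The point is only the shape of the update: every entry of the weight matrix receives a gradient and moves.

```python
import numpy as np

# Toy stand-in: one weight matrix plays the role of "all parameters".
W = np.ones((3, 4)) * 0.1
x = np.array([1.0, 0.5, -0.5, 1.0])    # one training input
y_target = np.array([1.0, -1.0, 0.5])  # desired output for that input

lr = 0.1
for _ in range(200):
    y_pred = W @ x                     # forward pass
    err = y_pred - y_target            # dL/dy for L = 0.5 * ||y_pred - y||^2
    grad_W = np.outer(err, x)          # gradient w.r.t. EVERY entry of W
    W -= lr * grad_W                   # update all parameters

final_loss = 0.5 * np.sum((W @ x - y_target) ** 2)
```

After training, `W` is a wholly new matrix; nothing of the "pre-trained" values is kept separately, which is exactly why full fine-tuning leaves you with two complete copies of the model.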
The Costs
Full fine-tuning is expensive in several ways:
- Memory. You need to store the model weights, the gradients, and the optimizer states (Adam maintains two running averages per parameter). For a 70B parameter model in 16-bit precision, that’s roughly 70B × 2 bytes (weights) + 70B × 2 bytes (gradients) + 70B × 8 bytes (two 32-bit Adam states) ≈ 840 GB of GPU memory, before even counting activations. That’s more than ten 80 GB A100s for a single training run.
- Storage. Each fine-tuned variant is a full copy of the model. Ten tasks means ten copies.
- Catastrophic forgetting. Updating every parameter risks overwriting the general knowledge the model learned during pre-training. A model fine-tuned on medical Q&A might get worse at general conversation, because the weight updates that improved medical accuracy shifted other capabilities away from their pre-trained optima.
Catastrophic forgetting is the fundamental tension: you want the model to learn something new without losing what it already knows. Full fine-tuning makes this hard because every parameter is in play.
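The memory arithmetic above is easy to reproduce. A quick sketch (the 70B parameter count and per-value byte widths come from the text; the variable names are mine):

```python
# Reproducing the memory estimate from the text: fp16 weights and
# gradients, plus two fp32 Adam moment estimates per parameter.
n_params = 70e9

weights_bytes  = n_params * 2   # 16-bit weights
gradient_bytes = n_params * 2   # 16-bit gradients
adam_bytes     = n_params * 8   # two 32-bit running averages (momentum + variance)

total_gb = (weights_bytes + gradient_bytes + adam_bytes) / 1e9
print(total_gb)  # 840.0
```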
Feature Extraction and Last-Layer Tuning
At the opposite extreme, you can freeze all of the model’s pre-trained weights and only train a new head on top. This treats the pre-trained model as a fixed feature extractor — the transformer layers produce a rich representation of the input, and you train a small classifier or regression head on those representations.
Input
│
▼
┌──────────────────────┐
│ Pre-trained layers │ ← FROZEN (no gradient updates)
│ (all weights fixed) │
└──────────┬───────────┘
│
▼
Hidden representation
│
▼
┌──────────────────────┐
│ New trainable head │ ← TRAINED (updated via backprop)
│ (small linear layer) │
└──────────┬───────────┘
│
▼
Prediction
This is fast and cheap — you’re only training a tiny fraction of the model’s parameters. But the ceiling is low. Because the pre-trained layers can’t adapt their representations to your task, the model can only use features that were already useful during pre-training. For tasks that align well with what the model already understands, this works surprisingly well. For tasks requiring genuinely new representations, it falls short.
A middle ground is to unfreeze the last few transformer layers while keeping the rest frozen. This lets the model adapt its high-level representations while preserving the lower-level features, which tend to be more general and transferable across tasks.
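The frozen-backbone pattern is simple enough to sketch end to end. The example below is entirely synthetic: a random projection plus `tanh` stands in for the pre-trained transformer, the task is an invented binary label, and the only trained parameters are the weights of a small logistic head. Note that the gradient is computed with respect to the head alone; the backbone never changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pre-trained" backbone: a fixed random projection standing in
# for the transformer stack. It is never updated.
W_backbone = rng.normal(size=(16, 8))
W_backbone_before = W_backbone.copy()

# The only trainable parameters: a small linear (logistic) head.
w_head = np.zeros(16)

# Tiny synthetic task: label = whether the first input coordinate is positive.
X = rng.normal(size=(64, 8))
y = (X[:, 0] > 0).astype(float)

lr = 0.5
for _ in range(300):
    H = np.tanh(X @ W_backbone.T)        # frozen features (no gradient here)
    p = 1.0 / (1.0 + np.exp(-(H @ w_head)))
    grad_head = H.T @ (p - y) / len(y)   # gradient w.r.t. the head ONLY
    w_head -= lr * grad_head

# The backbone is untouched; only the head has moved.
assert np.array_equal(W_backbone, W_backbone_before)
```

Because `H` depends only on the frozen backbone, the features can even be precomputed once for the whole dataset, which is why this approach is so cheap in practice.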
LoRA: Low-Rank Adaptation
LoRA (Hu et al., 2022) is the approach that has come to dominate fine-tuning in practice. Its key insight is that the weight changes produced by fine-tuning have low intrinsic dimensionality — the updates can be well-approximated by low-rank matrices.
The Intuition
When you fully fine-tune a model, each weight matrix $W$ gets updated to $W' = W + \Delta W$, where $\Delta W$ is the accumulated change from training. Hu et al. showed that $\Delta W$ tends to have low effective rank. Even though $W$ might be a $4096 \times 4096$ matrix (16.7 million parameters), the actual change $\Delta W$ often lives in a subspace of rank 8, 16, or 32 — orders of magnitude smaller.
This means instead of storing and computing $\Delta W$ directly, you can decompose it into two much smaller matrices: $\Delta W = BA$, where $B$ is $d \times r$ and $A$ is $r \times k$, with rank $r \ll \min(d, k)$.
Full fine-tuning:                     LoRA:

W (4096 × 4096)                       W (4096 × 4096)  ← FROZEN
       │                                     │
       │ update all                          │ no updates
       │ 16.7M params                        │
       ▼                                     ▼
W' (4096 × 4096)                          W + BA
                                             │
                                      B (4096 × 16)    ← TRAINED
                                      A (16 × 4096)    ← TRAINED
                                             │
                                      Only 131K params
                                      (0.78% of original)
With rank $r = 16$, the LoRA adapter has $4096 \times 16 + 16 \times 4096 = 131{,}072$ parameters per weight matrix — less than 1% of the full matrix. Across the whole model, typical LoRA adapters are 0.1–1% of the base model’s parameter count.
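The parameter counting is worth doing explicitly; a quick check in Python using the dimensions from the text:

```python
# Per-matrix parameter counts for a 4096 x 4096 weight with LoRA rank 16.
d, k, r = 4096, 4096, 16

full_matrix_params = d * k          # a dense weight (or a dense delta-W)
lora_params = d * r + r * k         # B is d x r, A is r x k

print(lora_params)                       # 131072
print(lora_params / full_matrix_params)  # 0.0078125, i.e. 0.78%
```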
The Math
During the forward pass, the output of a LoRA-adapted layer is:
$$h = Wx + \frac{\alpha}{r} BAx$$

The pre-trained weight $W$ stays frozen. The matrices $B$ and $A$ are the only trainable parameters. At initialization, $A$ is drawn from a random Gaussian and $B$ is set to zero, so $BA = 0$ and the model starts exactly at pre-trained behavior. The scaling factor $\frac{\alpha}{r}$ controls the magnitude of the adaptation, where $\alpha$ is a hyperparameter typically set equal to $r$ or $2r$.
The product $B(Ax)$ is computed as two sequential multiplications rather than materializing the full $BA$ matrix:
Input x (dim k)
│
▼
┌──────────┐
│ A │ k → r (project DOWN to low-rank space)
│ (r × k) │
└─────┬────┘
│ dim r (small!)
▼
┌──────────┐
│ B │ r → d (project UP back to full dimension)
│ (d × r) │
└─────┬────┘
│ dim d
▼
Scale by α/r
│
▼
Add to Wx ──→ output h
$A$ projects the input into a low-dimensional “task” space, and $B$ projects it back up. The rank $r$ controls the adapter’s capacity — with $r = 1$, it can only learn a single direction of change; with $r = 64$, it approaches (but doesn’t reach) full fine-tuning expressiveness.
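Putting the pieces together, here is a minimal numpy sketch of the LoRA forward pass. The dimensions are shrunk for illustration, and the initializations follow the scheme described above: Gaussian $A$, zero $B$, so the adapted model starts exactly at pre-trained behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4
alpha = r                            # common default: alpha = r

W = rng.normal(size=(d, k))          # frozen pre-trained weight
A = rng.normal(size=(r, k)) * 0.01   # Gaussian init
B = np.zeros((d, r))                 # zero init, so BA = 0 at the start

def lora_forward(x):
    # Two small matmuls; the full d x k matrix BA is never materialized.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
assert np.allclose(lora_forward(x), W @ x)    # starts at pre-trained behavior

B = rng.normal(size=(d, r)) * 0.1             # pretend training has moved B
assert not np.allclose(lora_forward(x), W @ x)
```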
Which Layers Get LoRA?
In practice, LoRA is typically applied to the attention projection matrices — $W_Q$, $W_K$, $W_V$, and $W_O$ — in each transformer layer. These are the matrices that project the input into queries, keys, values, and the output projection that combines attention heads. The original LoRA paper found that, for a fixed parameter budget, adapting $W_Q$ and $W_V$ together gave the best results.
Some practitioners also apply LoRA to the MLP weight matrices ($W_{fc1}$ and $W_{fc2}$). QLoRA (Dettmers et al., 2023) found that applying LoRA to all linear layers was necessary to match full fine-tuning performance. The choice depends on the trade-off between adapter size and task performance.
Transformer Layer
┌─────────────────────────────────────────┐
│ │
│ Attention: │
│ Wq ──→ Wq + Bq·Aq ← LoRA │
│ Wk ──→ Wk + Bk·Ak ← LoRA │
│ Wv ──→ Wv + Bv·Av ← LoRA │
│ Wo ──→ Wo + Bo·Ao ← LoRA │
│ │
│ MLP: │
│ Wfc1 ──→ (frozen or + LoRA) │
│ Wfc2 ──→ (frozen or + LoRA) │
│ │
└─────────────────────────────────────────┘
× n_layers
Merging: Zero-Cost Inference
After training, you can merge the adapter back into the base weights. Since $h = (W + BA)x$, you compute $W' = W + BA$ once and store $W'$. Inference now runs at exactly the same speed as the original model — no additional computation for the LoRA path, no overhead at all.
Training time:                     After merging:

       x                                  x
       │                                  │
    ┌──┴──┐                               │
    │     │                               ▼
    ▼     ▼                          ┌────────┐
  ┌───┐ ┌───┐                        │   W'   │    W' = W + BA
  │ W │ │B·A│                        │(merged)│
  └─┬─┘ └─┬─┘                        └────┬───┘
    │     │                               │
    ▼     ▼                               ▼
   Add ──→ h                              h

Two matrix-vector                  One matrix-vector
multiplies + add                   multiply (same as original)
This gives you the best of both worlds: efficient training (only update the small adapter) and efficient inference (no adapter overhead). But once you merge, you’ve committed — if you want to switch tasks, you need the unmerged base weights and a different adapter.
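The merge identity is just matrix algebra, and easy to verify numerically. A small numpy sketch (for simplicity, the $\alpha/r$ scaling is assumed already folded into $B$; all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 16, 16, 4

W = rng.normal(size=(d, k))   # frozen base weight
A = rng.normal(size=(r, k))   # trained adapter (scaling folded into B)
B = rng.normal(size=(d, r))
x = rng.normal(size=(k,))

# Unmerged: two matmuls plus an add on every forward pass.
h_unmerged = W @ x + B @ (A @ x)

# Merged: pay the d x k addition once, then inference is a single matmul.
W_merged = W + B @ A
h_merged = W_merged @ x

assert np.allclose(h_unmerged, h_merged)
```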
This brings us to the most practically interesting capability of LoRA.
Multi-LoRA: Serving Many Tasks from One Model
Because LoRA adapters are separate from the base model and very small, you can maintain a library of adapters and apply the right one at request time. One base model in GPU memory, many adapters on disk or in CPU memory, swapped in per request.
                 ┌─────────────────────┐
                 │  Base Model (70B)   │
                 │  (loaded once in    │
                 │   GPU memory)       │
                 └──────────┬──────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  LoRA: Legal  │   │ LoRA: Medical │   │  LoRA: Code   │
│    (~50 MB)   │   │    (~50 MB)   │   │    (~50 MB)   │
└───────────────┘   └───────────────┘   └───────────────┘

Request 1 (legal question)   → apply Legal adapter
Request 2 (medical question) → apply Medical adapter
Request 3 (code generation)  → apply Code adapter
This is transformative for serving. Instead of deploying three separate 70B models (210B parameters total, requiring massive GPU clusters), you deploy one 70B model plus three ~50 MB adapters. The memory savings are enormous.
How Hot-Loading Works
The naive approach: for each request, load the adapter’s $A$ and $B$ matrices, compute $BA$, add it to the base weights, run the forward pass, then remove it. But this is wasteful — the addition and removal happen on large matrices for every request.
The efficient approach keeps the base weights untouched and applies the LoRA computation on the fly during the forward pass:
For each layer during inference:
h = Wx + B_adapter · (A_adapter · x)
│ │
│ └── small matmuls using the adapter's
│ tiny B and A matrices
│
└── standard matmul using the frozen base weights
(shared across ALL requests)
The base model’s $Wx$ computation is shared. Only the small $B(Ax)$ computation is adapter-specific. Since $r$ is typically 8–64, these additional matrix multiplications are negligible compared to the base model’s operations.
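A toy numpy sketch of this serving pattern, with an invented two-adapter library (the task names and shapes are illustrative, and the dict stands in for a real adapter registry):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 16, 16, 4

W = rng.normal(size=(d, k))   # frozen base weight, shared by every request

# Hypothetical adapter library: one (B, A) pair per task.
adapters = {
    "legal":   (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "medical": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}

def serve(x, task):
    base = W @ x                 # shared base computation, identical for all tasks
    B, A = adapters[task]
    return base + B @ (A @ x)    # tiny adapter-specific correction

x = rng.normal(size=(k,))
h_legal = serve(x, "legal")
h_medical = serve(x, "medical")
assert not np.allclose(h_legal, h_medical)  # different adapter, different behavior
```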
Batching Across Adapters
The real challenge is batching requests that use different adapters. Batching is critical for GPU utilization — processing 32 requests simultaneously is far more efficient than processing them one at a time. But if those 32 requests use 5 different adapters, you can’t just do a single batched matrix multiplication for the LoRA part.
S-LoRA (Sheng et al., 2023) solved this with two key innovations:
Unified Paging. S-LoRA borrows the concept of virtual memory paging from operating systems to manage adapter memory. Adapters are stored in a paged memory pool that spans GPU memory, CPU memory, and disk. Frequently used adapters stay on GPU; rarely used ones are paged to CPU or disk. The system dynamically loads and evicts adapters based on the request mix — if legal questions spike, the legal adapter is pulled to GPU while a less-used adapter gets paged out.
Custom CUDA kernels. In a batch of 32 requests where 10 use the legal adapter, 15 use medical, and 7 use code, the base model computation ($Wx$) runs as a single batched operation. But the LoRA computation ($BAx$) must apply different $B$ and $A$ matrices to different requests within the batch. S-LoRA introduces custom CUDA kernels that perform these heterogeneous batched matrix multiplications efficiently, gathering the right adapter weights for each request without breaking the batch.
Batch of 4 requests:

  Request:   R1 (legal)   R2 (medical)   R3 (legal)   R4 (code)
                 │             │             │            │
                 ▼             ▼             ▼            ▼
   ┌────────────────────────────────────────────────────────┐
   │           Base model:  W × [x1, x2, x3, x4]            │
   │             (one batched matmul for all 4)             │
   └────────────────────────────┬───────────────────────────┘
                                │
        ┌───────────────┬───────┴───────┬───────────────┐
        ▼               ▼               ▼               ▼
 B_legal·A_legal   B_med·A_med   B_legal·A_legal  B_code·A_code
      × x1            × x2            × x3            × x4
        │               │               │               │
        ▼               ▼               ▼               ▼
   h1 = Wx1 +     h2 = Wx2 +     h3 = Wx3 +     h4 = Wx4 +
   LoRA_L(x1)     LoRA_M(x2)     LoRA_L(x3)     LoRA_C(x4)
S-LoRA demonstrated serving thousands of LoRA adapters simultaneously on a single machine, with throughput close to serving the base model alone.
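The batching idea can be sketched in numpy. The Python loop below stands in for S-LoRA's fused gathered-matmul kernels, and the adapter names and shapes are invented; what matters is that the base matmul is batched while the adapter part is gathered per request.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, r = 8, 8, 2

W = rng.normal(size=(d, k))   # shared frozen base weight
adapters = {                  # hypothetical per-task (B, A) pairs
    "legal": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "med":   (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "code":  (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}

X = rng.normal(size=(4, k))                 # batch of 4 request inputs
tasks = ["legal", "med", "legal", "code"]   # adapter assignment per request

# Base model: one batched matmul for the whole batch.
base_out = X @ W.T                          # shape (4, d)

# LoRA part: gather each request's adapter and apply it individually.
# (S-LoRA does this with custom CUDA kernels instead of a Python loop.)
lora_out = np.stack([adapters[t][0] @ (adapters[t][1] @ x)
                     for t, x in zip(tasks, X)])

H = base_out + lora_out

# Sanity check: matches computing each request fully on its own.
for i, (t, x) in enumerate(zip(tasks, X)):
    B, A = adapters[t]
    assert np.allclose(H[i], W @ x + B @ (A @ x))
```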
QLoRA: Fine-Tuning on a Budget
QLoRA (Dettmers et al., 2023) combines LoRA with aggressive quantization to make fine-tuning accessible on consumer hardware. The idea: quantize the base model to 4-bit precision, freeze it, and train LoRA adapters in 16-bit precision on top.
Standard LoRA:                       QLoRA:

Base model:     16-bit (140 GB)      Base model:     4-bit (35 GB)
LoRA adapters:  16-bit (~50 MB)      LoRA adapters:  16-bit (~50 MB)
Total GPU memory:  ~140 GB           Total GPU memory:  ~35 GB
(needs multiple A100s)               (fits on a single 48 GB GPU)
QLoRA introduced two innovations to make this work:
- NF4 (NormalFloat 4-bit) — A 4-bit data type designed for normally distributed weight values. Pre-trained weights are approximately Gaussian, so NF4 places its 16 quantization levels at the quantiles of a normal distribution, giving each level an equal expected share of the weights.
- Double quantization — The quantization constants themselves are quantized, further reducing the memory overhead of storing the per-block constants.
During the forward pass, the 4-bit base weights are dequantized to 16-bit on the fly (in small blocks), the computation runs in 16-bit, and the LoRA gradients flow through the dequantized weights. The LoRA parameters are always in 16-bit, so training stability is maintained despite the aggressively quantized base model.
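For intuition, here is a deliberately simplified block-wise 4-bit quantizer in numpy. It uses symmetric absmax levels rather than NF4's normal-distribution quantiles, and it omits double quantization; it only illustrates the "quantize in blocks, dequantize on the fly" mechanic.

```python
import numpy as np

rng = np.random.default_rng(4)

def quantize_blocks(w, block=64):
    """Simplified symmetric 4-bit absmax quantization (NOT actual NF4:
    real NF4 places its 16 levels at normal-distribution quantiles)."""
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True)  # one constant per block
    q = np.round(w / scales * 7).astype(np.int8)   # integer levels in -7..7
    return q, scales

def dequantize_blocks(q, scales):
    # This is the step QLoRA performs on the fly, block by block,
    # during the forward pass.
    return (q.astype(np.float32) / 7) * scales

W = rng.normal(size=(256,)).astype(np.float32)
q, scales = quantize_blocks(W)
W_deq = dequantize_blocks(q, scales)

max_err = np.abs(W - W_deq.reshape(-1)).max()
```

The reconstruction error per weight is bounded by half a quantization step within each block, which is why per-block scaling matters: one outlier only degrades its own block.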
QLoRA showed that fine-tuning a 65B parameter model on a single 48 GB GPU could produce results competitive with full 16-bit fine-tuning. This was a significant democratization — fine-tuning frontier-scale models no longer required a data center.
Choosing an Approach
| Approach | Parameters Updated | Memory (Training) | Adapter Size | Inference Overhead |
|---|---|---|---|---|
| Full fine-tuning | All | Very high (~6× the fp16 model size) | Full model copy | None |
| Last-layer tuning | < 0.1% | Low | Tiny | None |
| LoRA ($r$=16) | 0.1–1% | Model + small adapters | ~50 MB for 70B model | None (if merged) |
| QLoRA ($r$=16) | 0.1–1% | ~25% of standard LoRA | ~50 MB | Slight (dequantization) |
| Multi-LoRA serving | N/A (inference) | Model + active adapters | ~50 MB per task | Minimal |
For most practitioners today, LoRA or QLoRA is the default starting point. Full fine-tuning makes sense when you have the compute budget, need maximum performance, or are training a model for a single purpose. Last-layer tuning is a useful baseline — if a linear probe over frozen features solves your problem, you don’t need anything fancier.
The multi-LoRA serving pattern is where the field is heading for production deployments. Rather than fine-tuning one model for one purpose, organizations are building adapter libraries: collections of small, specialized LoRA modules that can be applied at inference time. The base model is a shared resource; the adapters are the customization layer. This maps naturally to multi-tenant platforms where different customers or use cases need different model behaviors, all served from the same infrastructure.
The key insight underlying all of this is that fine-tuning doesn’t require changing everything. The weight updates that matter for a specific task live in a surprisingly small subspace of the full parameter space. LoRA makes that insight concrete and practical.