Inference Engines
Make the numbers answer. This is the story of how.
// 01 — The Pipeline
Same skeleton, different scale. This chapter is the map you'll reference for everything after.
When you talk to a language model, words appear as if it's composing a thought. It isn't. Each word is a separate run through the same fixed pipeline — hundreds of matrix multiplications that read everything so far, produce a probability for every word it knows, and pick one. Append. Repeat. The model that wrote word five is the same machine that wrote word five thousand. Same weights, same path, same math.
The whole loop
def generate(prompt_tokens, model, max_tokens=50, temperature=1.0, eos_token=None):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        logits = forward_pass(tokens, model)   # ALL tokens, ALL layers
        probs = softmax(logits[-1] / temperature)
        tokens.append(sample(probs))
        if tokens[-1] == eos_token:
            break
    return tokens
# forward_pass → pick a word → append → repeat. That's it.
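To watch the loop run end to end, here is a self-contained toy version. `forward_pass` and `sample` are stand-ins (seeded random logits, greedy pick), not the real model — the point is the control flow, not the predictions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward_pass(tokens, model):
    # Stand-in for the real model: random logits per position,
    # seeded by the sequence length so runs are reproducible.
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=(len(tokens), model["vocab_size"]))

def sample(probs):
    return int(np.argmax(probs))  # greedy "sampling" for the demo

def generate(prompt_tokens, model, max_tokens=50, temperature=1.0, eos_token=None):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        logits = forward_pass(tokens, model)
        probs = softmax(logits[-1] / temperature)
        tokens.append(sample(probs))
        if tokens[-1] == eos_token:
            break
    return tokens

out = generate([1, 2, 3], {"vocab_size": 50257}, max_tokens=5)
print(len(out))  # 3 prompt tokens + 5 generated = 8
```

Swap in a real `forward_pass` and a real sampler and this is, structurally, the whole inference loop.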
“The capital of France” — ML’s Hello World
This sentence appears in almost every transformer tutorial and inference demo. It’s a single-factual-hop question — one lookup, one answer. Even the smallest model nails it, making it a perfect smoke test: if your model can’t finish “The capital of France is ___”, something is broken at the infrastructure level.
It also shows what certainty looks like inside the model. After “France”, the probability distribution is extraordinarily peaked — ~73% on “ is”, then ~92% on “ Paris”. Most prompts produce far flatter distributions. This one is a spotlight with almost zero ambiguity.
Seen in: Vaswani et al., HuggingFace quickstarts, llama.cpp smoke tests, vLLM benchmarks, and ~10,000 attention blog posts.
Now that you've seen the loop — text in, one word out, repeat — let's open the machine that runs each pass. The diagram below is the complete map: every stage your data touches between entering as text and leaving as a probability. Don't memorise it now. The rest of this article unpacks each box one at a time.
The Complete Forward Pass
"The capital of France"
↓
★ EMBED Look up each token in a table of 50,257 vectors.
│ "The" → [0.12, -0.34, 0.56, …] (768 numbers)
│ Add a position signal so the model knows word order.
↓
┌───────────────────────────────────────────────────────┐
│ ★ NORM Stabilise the numbers (rescale to unit RMS) │
│ ↓ │
│ ★ ATTENTION Each token asks: "who in this sentence matters │
│ to me?" Produces a relevance-weighted blend │
│ of all previous tokens' information. │
│ ↓ + skip connection (add original input back) │
│ │
│ ★ NORM Stabilise again │
│ ↓ │
│ ★ FFN Expand 768 → 3072, apply non-linearity, shrink │
│ back to 768. Attention gathers; the FFN │
│ rewrites — transforming mixed context into │
│ new features. │
│ ↓ + skip connection │
└───────────────────────────────────── ×12 identical ──┘
↓
★ NORM Final stabilisation
↓
★ PROJECT Multiply by the embedding table transposed to get
│ a score for every word in the vocabulary.
↓
[0.01, 0.02, …, 0.73, …] ← 50,257 scores (logits)
Highest score: token 318 = " is"
| Stage | What it does |
|---|---|
| ★ EMBED | Token ID → vector lookup. Each integer maps to a row in a table of 50,257×768 (GPT-2) or 152,064×5,120 (Qwen). This same table reappears at the end — the model asks which embedding is closest to its output. |
| ★ NORM | Stabilise numbers before each operation. RMSNorm (modern) or LayerNorm (GPT-2). Three passes over the vector — trivial cost. |
| ★ ATTENTION | "Who matters to me?" Each token broadcasts a query, every previous token advertises a key. High dot-product = relevance. The output is a weighted blend of value vectors — the token absorbs information from its context. |
| ★ FFN | Expand → activate → compress. The vector inflates to several times its width (4× in GPT-2: 3,072 dims; 3.4× in Qwen: 17,408 dims), a gating function decides which features survive, then it shrinks back. The heaviest computation in the block — ~78% of per-block FLOPs. |
| ★ PROJECT | Multiply by the embedding table transposed → one score per vocabulary token. The model is asking: "which word's embedding vector is closest to my output?" |
| ★ SAMPLE | 50,257 scores (or 152,064 for Qwen). Apply temperature, top-p, top-k. Draw one token. That's the output. |
Six weight-matrix multiplications per block (GPT-2) or seven (Qwen's SwiGLU adds a gate matrix). Twelve blocks (GPT-2) or 64 (Qwen). That's 73 matmuls for GPT-2, 449 for Qwen — to produce one token. For GPT-2: ~250M FLOPs. For Qwen-3.5 27B: ~44 billion. Same recipe, 176× the bill.
That diagram is the whole model. Here it is as working code — 60 lines of Python that implement GPT-2's forward pass. Every stage from the diagram maps to a function below.
Full Code (60 lines)
Python
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def attention(x, w_qkv, w_proj, b_qkv, b_proj, n_heads):
    # x: [seq_len, 768]
    B, D = x.shape
    qkv = x @ w_qkv + b_qkv              # [B, 3*768] — project to Q, K, V
    q, k, v = np.split(qkv, 3, axis=-1)  # each [B, 768]
    # Reshape into 12 independent heads of dim 64
    head_dim = D // n_heads              # 768/12 = 64
    q = q.reshape(B, n_heads, head_dim).transpose(1, 0, 2)
    k = k.reshape(B, n_heads, head_dim).transpose(1, 0, 2)
    v = v.reshape(B, n_heads, head_dim).transpose(1, 0, 2)
    # Causal attention: score every pair, mask the future, softmax, blend
    scores = (q @ k.transpose(0, 2, 1)) / np.sqrt(head_dim)  # [12, B, B]
    mask = np.triu(np.full((B, B), -1e9), k=1)               # -inf above diagonal
    scores += mask
    attn = softmax(scores)               # attention weights
    out = attn @ v                       # [12, B, 64] weighted blend of values
    # Merge heads back to [B, 768] and project
    out = out.transpose(1, 0, 2).reshape(B, D)
    return out @ w_proj + b_proj

def ffn(x, w_fc, b_fc, w_proj, b_proj):
    h = gelu(x @ w_fc + b_fc)            # [B, 768] → [B, 3072] expand + activate
    return h @ w_proj + b_proj           # [B, 3072] → [B, 768] compress back

def transformer_block(x, block):
    # Pre-norm → attention → residual
    h = layer_norm(x, block['ln1_g'], block['ln1_b'])
    x = x + attention(h, block['attn_qkv_w'], block['attn_proj_w'],
                      block['attn_qkv_b'], block['attn_proj_b'], 12)
    # Pre-norm → FFN → residual
    h = layer_norm(x, block['ln2_g'], block['ln2_b'])
    x = x + ffn(h, block['fc_w'], block['fc_b'],
                block['proj_w'], block['proj_b'])
    return x

def gpt2_forward(token_ids, model):
    x = model['wte'][token_ids] + model['wpe'][np.arange(len(token_ids))]
    # wte: [50257, 768] token embeddings
    # wpe: [1024, 768] position embeddings (max 1024 context)
    for block in model['blocks']:        # 12 transformer blocks
        x = transformer_block(x, block)
    x = layer_norm(x, model['ln_f_g'], model['ln_f_b'])
    logits = x @ model['wte'].T          # tie weights: reuse embedding table
    return logits                        # [seq_len, 50257]
This is GPT-2 — 124M params, 12 layers, 768 dims. Qwen-3.5 27B is the same skeleton at 176× the scale: 64 layers, 5120 dims, 152K vocab. The variations (RMSNorm, SwiGLU, RoPE, GQA) are in Chapter 3. But we skipped the first step. The pipeline starts with token IDs — where do those come from?
// 02 — Text to Numbers
How text becomes geometry. How geometry becomes meaning.
The model can't read text. It needs numbers. But not just any numbers — it needs meaningful numbers, vectors where "cat" and "dog" land near each other and "cat" and "uranium" don't. Getting there takes two transformations: tokenization splits your text into pieces and assigns each an integer ID; embedding converts each ID into a dense vector that encodes meaning. Both steps are surprisingly consequential.
Tokenization: What Makes a Token
A token is a piece of text — sometimes a word, sometimes part of a word, sometimes a single character. The model has a fixed vocabulary: a numbered list of every text fragment it can recognize. GPT-2 has 50,257 tokens. Qwen has 152,064. Every input, no matter how exotic, gets split into pieces from this list. The question is how you build the list.
Characters give a tiny vocab but painfully long sequences. Whole words give short sequences but can't handle typos or new words. Byte Pair Encoding (BPE) finds the sweet spot: common words are single tokens, rare words decompose into recognizable subwords, and nothing is ever out-of-vocabulary.
Comparison
Approach Vocab "unhappiness" → "The capital of France is Paris"
─────────────────────────────────────────────────────────────────────────────────────
Characters 256 u,n,h,a,p,p,i,n,e,s,s 30 tokens (long sequence)
BPE (GPT-2) 50,257 un|happ|iness 8 tokens (sweet spot)
BPE (Qwen) 152,064 unhappiness 6 tokens (bigger vocab → shorter)
Words ~500K unhappiness 6 tokens (but "ChatGPT" = ???)
Every token costs compute: ~250 million FLOPs in GPT-2, ~44 billion in a 27B model.
Fewer tokens = proportionally less work.
GPT-2 (50K vocab): ~4.5 chars/token → more tokens, more compute
Qwen (152K vocab): ~5.2 chars/token → 15% fewer tokens, 15% faster
Vocabulary size × sequence length is the budget.
BPE lets you tune the tradeoff.
How BPE Builds a Vocabulary
Start with individual characters. Scan the training corpus, find the most frequent adjacent pair, merge it into a new token. Repeat 50,000 times. What emerges: common words are single tokens, rare words decompose into recognizable subwords, and nothing is ever out-of-vocabulary — the base includes all 256 byte values.
The results are often not what you'd expect — walking through a few tokenizations reveals a lot about how the model actually sees your text.
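The merge loop is short enough to run. Here's a toy sketch of BPE training on a four-word corpus — illustrative only; production tokenizers work on bytes and pre-tokenized text, with far larger corpora:

```python
from collections import Counter

def bpe_train(words, num_merges):
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Merge that pair everywhere it occurs.
        new_corpus = Counter()
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The shared stem "low" crystallises into a single token within two merges — exactly the dynamic that turns common words into single tokens at 50,000-merge scale.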
Embedding: From IDs to Vectors
Tokenization gave us integers. Now each integer needs to become something the network can process: a vector — a list of numbers. GPT-2 uses 768 numbers per token. Qwen uses 5,120. Each number is a dimension, and each dimension encodes some learned feature of the token — not a feature anyone named, but one that emerged from training. One dimension might end up correlating with "is this a verb?", another with "does this relate to geography?", most with patterns too abstract to label.
The key insight: 768 dimensions means 768 independent axes. Two tokens can be similar on some axes and different on others — "Paris" and "London" are close on the geography axes but far apart on whatever encodes French vs. English. This is what makes the representation rich enough to work.
Python
# The entire embedding step:
x = embedding_table[token_ids] # shape: [seq_len, 768]
# That's it. A table lookup. No math, no matrix multiply.
# embedding_table.shape = [50257, 768] for GPT-2
# = [152064, 5120] for Qwen-3.5 27B
# The same table reappears at the end of the model:
logits = final_vector @ embedding_table.T # shape: [50257]
# Dot product of the output against EVERY row. No shortcut this time.
Each token is now a vector — 768 numbers for GPT-2, 5,120 for Qwen. The transformer blocks will process these vectors through the attention and FFN stages you saw in the pipeline. Chapter 3 traces one vector through one block, operation by operation.
// 03 — The Forward Pass
Same shape, different meaning. Then the next block does it again. Sixty-four times.
Chapter 1 showed the skeleton. Now we open the block and trace what happens to a single vector — at instruction level, the way the hardware sees it. Concrete dimensions, real FLOP counts, every weight matrix named. RMSNorm stabilizes. Attention looks backwards. The FFN decides what to keep. Each operation is a matmul with a specific shape, and we'll trace every one.
One Block, Traced
One vector enters the top of a transformer block. A handful of matrix multiplications (six in GPT-2, seven in Qwen) and two skip connections later, one updated vector exits the bottom. That block repeats 12 times (GPT-2) or 64 (Qwen). Here's the complete map — every weight matrix, every shape, every operation. The sections below trace each stage with real numbers.
input x [5120]
│
① RMSNorm ··· scale by learned γ, no bias
│
├→ W_q [5120×6144] → Q [24 heads × 256] ②
├→ W_k [5120×1024] → K [4 heads × 256] ② QKV Projection
├→ W_v [5120×1024] → V [4 heads × 256] ②
│
│ ↓ apply RoPE to Q, K (position encoding) ③
│
④ Attention scores = softmax(Q · K^T / √d)
│ output = scores · V
│ ↓
├→ W_out [6144×5120] → projection back to model dim
│
⊕ residual add original x back ← skip connection
│
RMSNorm ··· normalize again
│
├→ W_gate [5120×17408] → gate ⑤
│ │
│ SiLU(gate) ⊙ up ←────────⑤ SwiGLU FFN
│ │
├→ W_up [5120×17408] → up ⑤
│
├→ W_down [17408×5120] → back to [5120]
│
⊕ residual add pre-FFN input back ← skip connection
│
output → next block (or final norm + logits)
GPT-2: 6 matmuls (W_q, W_k, W_v, W_out, W_up, W_down)
Qwen: 7 matmuls (adds W_gate for SwiGLU)
× 64 blocks = 448 learned projections per token
(not counting score computation, softmax, norms, or cache ops)
Step 1: RMSNorm
Before any matrix multiply, the vector gets normalized. Why? Each layer multiplies the vector by large weight matrices. After a few layers, some numbers in the vector get enormous while others shrink toward zero — the scale drifts. If the next layer's weights were tuned for inputs around 1.0 and they receive 340.0, everything breaks.
RMSNorm — Root Mean Square Normalization — fixes this by dividing every number in the vector by a single value: the root mean square of the whole vector. That's literally: square every number, take the average, take the square root. Divide through. Now the vector's overall scale is back to ~1.0, and the next layer sees inputs in the range it expects. Two of these per block, 12 blocks in GPT-2 — 24 resets per token (128 in Qwen's 64 blocks).
The problem: each layer multiplies by weight matrices.
After a few layers, numbers drift — some hit 340, others drop
to 0.003. The next layer’s weights expect inputs near 1.0.
The fix: divide every number by the RMS of the whole vector.
RMS = √(mean of all values²). One number. Divide through. Done.
─────────────────────────────────────────────────────────────
Before RMSNorm:
dim 0: ████████████████████ 340.2 ← exploding
dim 1: ██ 0.003 ← vanishing
dim 2: ████████████ 89.7
After RMSNorm:
dim 0: ████ 1.73 ✓ tamed
dim 1: █ 0.00002 ✓ ratio preserved
dim 2: ██ 0.46 ✓ tamed
─────────────────────────────────────────────────────────────
Divide every dimension by the same number (the RMS).
Relative proportions stay intact. Absolute scale is reset.
Then γ (learned per-dimension) re-scales what matters.
LayerNorm vs RMSNorm
LayerNorm (GPT-2 era)
─────────────────────────────────────────────────────────────
1. Mean-center: μ = mean(x), subtract from every dim
the vector’s average becomes zero
2. Scale: σ = std(x), divide every dim by it
now the vector has unit variance
3. Re-scale: x × γ + β (learned per-dimension)
the network decides what to amplify & shift
RMSNorm (modern LLMs — LLaMA, Qwen, Mistral, …)
─────────────────────────────────────────────────────────────
1. Square: x² for every dimension
2. Mean: average those squared values
3. Root+divide: √(mean(x²)) = the RMS, divide every dim by it
4. Re-scale: x × γ (no β — no bias term)
─────────────────────────────────────────────────────────────
Why drop the mean-centering?
Re-centering contributes almost nothing to training stability —
the scale reset is what matters. Removing it:
• cuts norm parameters in half (no β vectors)
• saves one pass over the vector per norm call
• 12 blocks × 2 norms = 24 fewer reductions per token (GPT-2)
• simpler numerics → fewer edge cases in FP16/BF16
Math
# Input: x = [0.5, -1.2, 0.8, 0.3, ...] (768 values)
RMS(x) = sqrt(mean(x²))
= sqrt( (0.25 + 1.44 + 0.64 + 0.09 + ...) / 768 )
= sqrt(0.847) # typical RMS for a normalized hidden state
= 0.920
x_norm = x / RMS(x)
= [0.543, -1.304, 0.870, 0.326, ...]
output = x_norm * gamma # gamma is a learned per-dimension scale [768]
= [0.543 * 1.02, -1.304 * 0.97, ...]
= [0.554, -1.265, ...]
# Cost: 3 passes over the vector (square, sum, divide+multiply)
# FLOPs: ~3 × 768 = 2,304. Trivial compared to matmul.
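The same arithmetic as a runnable numpy sketch — a minimal RMSNorm, with an illustrative two-dimensional vector and unit gamma so the numbers are easy to check by hand:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Scale reset: divide every dimension by the vector's root mean square,
    # then let the learned per-dimension gamma re-amplify what matters.
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return (x / rms) * gamma

x = np.array([3.0, 4.0])
out = rms_norm(x, gamma=np.ones(2))
# RMS = sqrt((9 + 16) / 2) ≈ 3.536 → out ≈ [0.849, 1.131]
print(out.round(3))
```

After the divide, the output's own RMS is exactly 1.0 (up to eps) — the scale reset the text describes.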
Step 2: Attention — QKV Projection
The attention mechanism — introduced in Attention Is All You Need (Vaswani et al., 2017) — starts here. We multiply our normalized vector by three weight matrices to produce queries, keys, and values:
Every token produces three vectors from the same input:
Q = "What am I looking for?" the question this token asks
K = "What do I contain?" the label this token advertises
V = "What information do I carry?" the payload delivered if selected
─────────────────────────────────────────────────────────────
Example: token "France" at position 4
Q_France ≈ "I need context about countries, geography, politics"
K_France ≈ "I am a country name, European, noun"
V_France ≈ [rich vector encoding France's learned features]
─────────────────────────────────────────────────────────────
The attention mechanism in two steps:
Attention weights = softmax(Q · K^T / √d)
↑ match every query against every key
↑ high dot product = "this key answers my query"
Output = Attention · V
↑ read the winning payloads, weighted by relevance
Key insight: K decides WHO gets attention. V decides WHAT flows.
They're decoupled — a token can be highly relevant (strong K match)
but carry different information (V) than what K advertises.
If Qwen used standard multi-head attention (a hypothetical baseline — its real GQA layout follows below):
d_model = 5120, n_heads = 20, head_dim = 256
Q = x_norm @ W_q [5120] × [5120, 5120] → [5120] (20 heads × 256 dim)
K = x_norm @ W_k [5120] × [5120, 5120] → [5120] (20 heads × 256 dim)
V = x_norm @ W_v [5120] × [5120, 5120] → [5120] (20 heads × 256 dim)
Three identical matmuls. Symmetric.
FLOPs: 5120 × 5120 × 2 × 3 = ~157M per block per token.
─────────────────────────────────────────────────────────────
Dimension flow
Input: x_norm [5120]
x_norm [5120] × W_q [5120 × 5120] → Q [5120]
x_norm [5120] × W_k [5120 × 5120] → K [5120]
x_norm [5120] × W_v [5120 × 5120] → V [5120]
Reshape into heads:
Q [5120] → [20 heads × 256 dim]
K [5120] → [20 heads × 256 dim]
V [5120] → [20 heads × 256 dim]
Attention (per head):
scores = Q_head [1 × 256] · K_cache^T [256 × seq_len] → [1 × seq_len]
weights = softmax(scores / √256) → [1 × seq_len]
out = weights [1 × seq_len] · V_cache [seq_len × 256] → [1 × 256]
Concatenate all 20 heads, project back:
concat [20 × 256] = [5120] × W_out [5120 × 5120] → [5120]
Every head has its own Q, K, and V — 20 independent attention patterns.
KV cache stores all 20 heads × 256 dims = 5120 values per token per layer.
Most modern models use GQA instead of standard MHA. The idea:
keep many query heads, but share KV heads across groups.
─────────────────────────────────────────────────────────
Qwen 27B: 24 Q heads, 4 KV heads, head_dim = 256
Q = x @ W_q [5120] × [5120, 6144] → [6144] (24 × 256)
K = x @ W_k [5120] × [5120, 1024] → [1024] ( 4 × 256)
V = x @ W_v [5120] × [5120, 1024] → [1024] ( 4 × 256)
Q is 6× larger than K or V — 6 query heads share each KV head.
Total QKV FLOPs: ~84M (down from ~157M with standard MHA).
─────────────────────────────────────────────────────────
Query heads (24) KV heads (4)
Q0 Q1 Q2 Q3 Q4 Q5 →→→ KV0 group 0
Q6 Q7 Q8 Q9 Q10 Q11 →→→ KV1 group 1
Q12 Q13 Q14 Q15 Q16 Q17 →→→ KV2 group 2
Q18 Q19 Q20 Q21 Q22 Q23 →→→ KV3 group 3
Each group: 6 query heads share 1 KV head
─────────────────────────────────────────────────────────
Multi-Head Grouped-Query Multi-Query
Attention Attention Attention
KV heads: 24 4 1
KV cache: 100% 16.7% 4.2%
Quality: baseline ≈ same slightly worse
6 queries share 1 KV pair → 6× less KV cache memory
At 4K context: 1 GB KV cache instead of 6 GB
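Those cache-size figures follow directly from the shapes. A sketch under the stated assumptions — 64 layers, 4K context, FP16 (2 bytes per value), K and V both cached:

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim, bytes_per_val=2):
    # 2 tensors (K and V) per layer per token, each n_kv_heads × head_dim wide.
    return n_layers * seq_len * 2 * n_kv_heads * head_dim * bytes_per_val

gqa = kv_cache_bytes(n_layers=64, seq_len=4096, n_kv_heads=4, head_dim=256)
mha = kv_cache_bytes(n_layers=64, seq_len=4096, n_kv_heads=24, head_dim=256)
print(gqa / 2**30)  # 1.0  GiB with  4 KV heads (GQA)
print(mha / 2**30)  # 6.0  GiB with 24 KV heads (MHA)
```

Same context length, 6× less cache — the entire GQA trade in one function.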
Step 3: Attention — RoPE
How does the model know that "Paris" is the fifth word, not the first? Rotary Position Embedding encodes position by rotating query and key vectors in pairs of dimensions. Each dimension pair rotates at a different frequency, creating a unique angular signature for each position.
Each dimension pair is a clock hand spinning at a different speed.
Position = how far each hand has rotated from the start.
─────────────────────────────────────────────────────────────
Dimension pair i=0 (fastest clock):
pos 0: ↑ 0° pos 1: ↙ 180° pos 5: ↙ 900°
rapid rotation → distinguishes adjacent tokens
Dimension pair i=64 (medium clock):
pos 0: ↑ 0° pos 1: → 5.6° pos 5: ↓ 28°
moderate rotation → mid-range position sensitivity
Dimension pair i=127 (slowest clock):
pos 0: ↑ 0° pos 1: ↗ 0.06° pos 5: ↗ 0.3°
tiny rotation per step → detects long-range gaps
─────────────────────────────────────────────────────────────
Each position gets a unique angular “fingerprint” across all 128 pairs (head_dim 256 → 128 pairs).
No two positions produce the same combination of angles.
Python
# RoPE: rotate pairs of dimensions by position-dependent angles
# For dimension pair (2i, 2i+1) at position pos, i = 0 … head_dim/2 − 1:
freq = 1.0 / (base ** (2*i / head_dim)) # base=10000000 for Qwen-3.5
theta = pos * freq
# High-frequency pairs (i=0): rotate fast → capture fine, adjacent-token position
# Low-frequency pairs (i=127): rotate slowly → capture long-range position
q[2*i] = q_orig[2*i] * cos(theta) - q_orig[2*i+1] * sin(theta)
q[2*i+1] = q_orig[2*i] * sin(theta) + q_orig[2*i+1] * cos(theta)
# The key insight: Q·K^T now encodes RELATIVE position.
# "How far apart are these tokens?" — not "what absolute position?"
# This is why context extension (YaRN, NTK) works: adjust frequencies.
Scenario A: Q at position 5, K at position 3 → gap = 2
Scenario B: Q at position 100, K at position 98 → gap = 2
─────────────────────────────────────────────────────────────
How RoPE makes this work:
Q·K^T after rotation = f(pos_Q − pos_K)
The rotation of Q at pos 5 minus the rotation of K at pos 3
= the same angle as Q at pos 100 minus K at pos 98.
R(5) · R(3)^T = R(5−3) = R(2)
R(100) · R(98)^T = R(100−98) = R(2) ✓ identical
─────────────────────────────────────────────────────────────
Rotation matrices cancel: R(θ) · R(φ)^T = R(θ−φ)
This is why context extension (YaRN, NTK) works:
rescale the frequencies → stretch the position space.
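That cancellation can be checked directly. A minimal numpy RoPE on one head — head_dim=8 and base=10000 are illustrative, not Qwen's actual values:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each (2i, 2i+1) pair of dims by pos * freq_i.
    d = len(x)
    i = np.arange(d // 2)
    freqs = 1.0 / (base ** (2 * i / d))
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same gap (2), different absolute positions → identical attention score
s_a = rope(q, 5)   @ rope(k, 3)
s_b = rope(q, 100) @ rope(k, 98)
print(np.isclose(s_a, s_b))  # True — the score depends only on the gap
```

Shift both positions by any constant and the dot product never changes; only the gap matters.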
Step 4: Attention — Scores & Output
Now the model asks its central question: for each token, which other tokens matter most? Every token's query gets compared against every previous token's key — a dot product that measures relevance. High dot product means "you have what I'm looking for." Softmax turns these raw scores into weights that sum to 1. The output is a weighted blend of value vectors — each token absorbs information from the tokens that scored highest. Here's what that looks like for "The capital of France":
The model is about to predict the word after “The capital of France”.
Token “France” asks: “who in this sentence helps me decide what comes next?”
It compares its query against every previous token’s key.
The result: a weight for each token — how much to listen to it.
Token Weight How much “France” listens
───── ────── ──────────────────────────────────
"The" 0.08 ███░░░░░░░░░░░░░░░░░░░░░░
"capital" 0.62 ████████████████████░░░░░ ← highest
"of" 0.06 ██░░░░░░░░░░░░░░░░░░░░░░░
"France" 0.24 ████████░░░░░░░░░░░░░░░░░
↓
weighted blend of V vectors
↓
Output = 0.08·V_The + 0.62·V_capital + 0.06·V_of + 0.24·V_France
“France” focuses 62% on “capital” — not the nearest word,
but the most relevant one. It’s answering “capital of what?”
This is the core insight: attention is content-addressed, not positional.
Worked Example (4 tokens, 2 heads, head_dim=4)
# Simplified to illustrate the mechanics
Q (current token, head 0) = [0.8, -0.3, 0.5, 0.1]
K cache (all 4 tokens, head 0):
K₀ = [ 0.2, 0.5, -0.1, 0.3] "The"
K₁ = [-0.4, 0.1, 0.7, 0.2] "capital"
K₂ = [ 0.6, -0.2, 0.3, 0.8] "of"
K₃ = [ 0.3, 0.9, -0.5, 0.4] "France"
Scores = Q · K^T / sqrt(4):
s₀ = (0.16 - 0.15 - 0.05 + 0.03) / 2.0 = -0.005
s₁ = (-0.32 - 0.03 + 0.35 + 0.02) / 2.0 = +0.010
s₂ = (0.48 + 0.06 + 0.15 + 0.08) / 2.0 = +0.385
s₃ = (0.24 - 0.27 - 0.25 + 0.04) / 2.0 = -0.120
After softmax: [0.228, 0.232, 0.337, 0.203]
Token "of" gets highest attention weight → model is looking at "of France"
Output = 0.228·V₀ + 0.232·V₁ + 0.337·V₂ + 0.203·V₃
Weighted blend of all value vectors
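The worked example above can be replayed in numpy (same Q and K values):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

Q = np.array([0.8, -0.3, 0.5, 0.1])       # current token's query, head 0
K = np.array([[ 0.2,  0.5, -0.1, 0.3],    # "The"
              [-0.4,  0.1,  0.7, 0.2],    # "capital"
              [ 0.6, -0.2,  0.3, 0.8],    # "of"
              [ 0.3,  0.9, -0.5, 0.4]])   # "France"

scores = K @ Q / np.sqrt(4)    # one dot product per cached key
weights = softmax(scores)
# scores  ≈ [-0.005, 0.010, 0.385, -0.120]
# weights ≈ [ 0.228, 0.232, 0.337,  0.203] — "of" wins
```

Replace K with a real KV cache and this is, per head, the whole decode-time score computation.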
Who can attend to whom?
K₀ K₁ K₂ K₃
The capital of France
┌──────┬──────┬──────┬──────┐
Q₀ The │ ✓ │ × │ × │ × │
├──────┼──────┼──────┼──────┤
Q₁ cap │ ✓ │ ✓ │ × │ × │
├──────┼──────┼──────┼──────┤
Q₂ of │ ✓ │ ✓ │ ✓ │ × │
├──────┼──────┼──────┼──────┤
Q₃ Fra │ ✓ │ ✓ │ ✓ │ ✓ │
└──────┴──────┴──────┴──────┘
Upper triangle = −∞ before softmax → attention weight = 0
You can’t read the future.
─────────────────────────────────────────────────────────────
During generation (decode), this simplifies:
New token’s Q attends to all previous K’s in the cache.
No mask needed — there’s nothing ahead to block.
This is why attention heatmaps of causal models form a triangle:
each row has one more filled cell than the row above it.
Attention is a soft lookup table. Q is the search query, K is the index, V is the data. Unlike a hash table, every entry matches to some degree — softmax just decides how much.
In decode mode, this step is usually memory-bound, not math-bound — the GPU waits for data, not arithmetic. Part II explains why.
Step 5: SwiGLU FFN
After attention blends information across tokens, the FFN processes each token independently — expanding the vector to several times its width (5,120 → 17,408 in Qwen), selectively routing features through a learned gate, then compressing back. It's the largest single compute cost per block, and modern models use a gated variant called SwiGLU that differs fundamentally from GPT-2's simple expand-activate-shrink.
GPT-2's FFN was simple: expand to 4× width, apply GELU, shrink back.
Two matrices, one non-linearity. The activation decided which dimensions
to keep — but using only the expanded representation itself.
No separate "should I use this?" signal.
SwiGLU splits the decision from the content:
Same input x feeds both paths:
GATE path x → W_gate → SiLU → "should this feature activate?"
high → open the gate
low → shut it down
UP path x → W_up → "what value should it carry?"
the candidate content
COMBINE SiLU(gate) ⊙ up element-wise multiply
gate decides, up provides
DOWN result → W_down → compress back to [5120]
─────────────────────────────────────────────────────────────
Think of it like a mixing board. Each of the 17,408 channels has a
fader (the gate) and an audio signal (the up value). The model learned
which faders to push up for each kind of input.
A token about geography opens channels for location features and
closes channels for emotion features. A token about sentiment does
the opposite. Same weights, different routing per token.
The cost: three weight matrices instead of two. That's why the FFN
dominates compute — 535M FLOPs per block, ~78% of the budget.
But the quality gain is consistent, which is why every post-2022
model uses it.
The name: Swi from Swish (= SiLU) + GLU from Gated Linear Unit
─────────────────────────────────────────────────────────────
GLU family (Dauphin et al., 2017):
GLU(x, W, V) = σ(x·W) ⊙ (x·V) original: sigmoid gate
SwiGLU variant (Shazeer, 2020) replaces σ with Swish/SiLU:
SwiGLU(x, W_gate, W_up) = SiLU(x · W_gate) ⊙ (x · W_up)
Where SiLU (Sigmoid Linear Unit) = Swish with β=1:
SiLU(z) = z · σ(z) = z · 1/(1 + e^(−z))
─────────────────────────────────────────────────────────────
Full FFN with SwiGLU (as used in Qwen, Llama, Mistral, etc.):
FFN(x) = (SiLU(x · W_gate) ⊙ (x · W_up)) · W_down
Dimensions (Qwen 27B): d_model = 5120, d_ff = 17408
W_gate ∈ ℝ^(5120×17408) W_up ∈ ℝ^(5120×17408) W_down ∈ ℝ^(17408×5120)
─────────────────────────────────────────────────────────────
Why 3 matrices instead of 2?
Classic FFN (GPT-2): FFN(x) = GELU(x · W_1) · W_2 2 matmuls
SwiGLU FFN: FFN(x) = (SiLU(x·W_gate) ⊙ (x·W_up)) · W_down 3 matmuls
Extra matmul = ~50% more FFN FLOPs, but the gating mechanism
gives the model fine-grained control over which features survive.
Consistent quality wins across all benchmarks — worth the cost.
SwiGLU (4 lines)
gate = x @ W_gate # [5120] → [17408] "should this feature activate?"
up = x @ W_up # [5120] → [17408] "what value should it carry?"
h = SiLU(gate) * up # element-wise: gate decides, up provides
out = h @ W_down # [17408] → [5120] compress back
# vs GPT-2's FFN (2 matrices, no gate):
# h = GELU(x @ W_fc) # expand + activate
# out = h @ W_proj # compress
# FLOPs: 3 × 5120 × 17408 × 2 = 535M per block per token
# Attention FLOPs: ~147M. FFN is ~78% of per-block compute.
SiLU(x) = x × σ(x) = x × 1/(1 + e^(−x))
How SiLU behaves at different gate values:
gate = -3.0 → sigmoid = 0.05 → SiLU = -0.14 nearly killed
gate = -1.0 → sigmoid = 0.27 → SiLU = -0.27 suppressed
gate = 0.0 → sigmoid = 0.50 → SiLU = 0.00 on the fence
gate = 1.0 → sigmoid = 0.73 → SiLU = 0.73 mostly open
gate = 3.0 → sigmoid = 0.95 → SiLU = 2.86 wide open
vs ReLU: hard cutoff at 0. Everything negative → dead zero.
vs SiLU: smooth transition. Gradients flow even for negative inputs.
This smoothness gives the model fine-grained control over
"how open" each gate is — not just on/off.
─────────────────────────────────────────────────────────────
SiLU can output small negative values (minimum ≈ -0.28 at x ≈ -1.28).
The gate can slightly invert a feature, not just suppress it.
ReLU can't do this. GELU is similar but lacks the clean
multiplicative gating structure that makes SwiGLU work.
dim gate SiLU(gate) up gate × up status
─── ──── ────────── ── ───────── ──────
0 2.1 ████ 1.88 0.7 ███ 1.32 ✓ PASS
1 -0.5 ░ -0.19 1.3 ░ -0.25 × kill
2 3.4 █████ 3.30 -0.9 ████ -2.97 ✓ PASS
3 -2.8 ░ -0.16 2.1 ░ -0.34 × kill
4 0.1 ░ 0.05 0.4 ░ 0.02 × kill
5 1.7 ███ 1.44 -0.6 ██ -0.86 ✓ PASS
─────────────────────────────────────────────────────────────────
Each row is one dimension’s full pipeline: gate → SiLU → × up → result.
Dims 0, 2, 5 flow through. Dims 1, 3, 4 are suppressed.
The gate learns which features matter. The up-projection computes candidate values. Multiplying them = selective routing. It's like an if-statement learned from data: if gate[i] > 0: output[i] = up[i], but smooth and differentiable. One projection decides, the other provides — that's why SwiGLU needs three matrices instead of GPT-2's two, and why it works better: the model can learn to shut off irrelevant features instead of blending everything.
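The numbers in the gating table above are easy to reproduce — the same illustrative gate/up values, pushed through the SwiGLU element-wise path:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

gate = np.array([2.1, -0.5, 3.4, -2.8, 0.1, 1.7])
up   = np.array([0.7,  1.3, -0.9,  2.1, 0.4, -0.6])

h = silu(gate) * up                # gate decides, up provides
# silu(gate) ≈ [1.87, -0.19, 3.29, -0.16, 0.05, 1.44]
# h          ≈ [1.31, -0.25, -2.96, -0.34, 0.02, -0.86]
print(h.round(2))
```

Dimensions with strongly positive gates pass nearly unchanged; the rest are squashed toward zero — the learned if-statement in six array elements.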
Step 6: Exiting the Block
Two more operations close out the block. First, the attention output passes through an output projection — a matrix multiply (W_out: 768×768 in GPT-2) that merges all the attention heads back into a single vector. That's the fourth matmul, after the three QKV projections; the FFN's two matrices add the fifth and sixth. Second, a residual connection adds the block's input back to the FFN's output — the same skip-connection pattern used after attention. The block only has to learn what to change, not reconstruct the whole vector from scratch.
One block, summarised
input x [768]
↓
LayerNorm → Q,K,V projections (matmuls 1-3) → attention scores → W_out (matmul 4)
↓
+ x (residual)
↓
LayerNorm → FFN up (matmul 5) → GELU → FFN down (matmul 6) → GPT-2: 2 FFN matmuls
↓ Qwen: 3 (gate + up + down)
+ x (residual)
↓
output [768] ← same shape. different meaning. ready for the next block.
That's one block: six matmuls for GPT-2 (seven for Qwen's SwiGLU), two norms, two residual adds. The vector that enters is [768]. The vector that exits is [768]. Same shape, transformed meaning. Now do it 11 more times.
The Full Stack: 12 Blocks Deep
We traced one block. GPT-2 runs 12 of them, in sequence, on every token. The same six matrix multiplications, repeated with different learned weights. Here's what that stack looks like:
Block Operations FLOPs Cumulative
───── ────────── ───── ──────────
1 LayerNorm → QKV → Attn → Proj → FFN ~14M 14M
2 LayerNorm → QKV → Attn → Proj → FFN ~14M 28M
3 LayerNorm → QKV → Attn → Proj → FFN ~14M 42M
⋮ same structure, different learned weights ⋮ ⋮
12 LayerNorm → QKV → Attn → Proj → FFN ~14M ~170M
Then:
+1 Final LayerNorm → logit projection (W_embed^T) ~77M ~250M
─────────────────────────────────────────────────────────────────────
Per block: 6 matmuls × 1 block = 6 matmuls ~14M FLOPs
Per token: 6 matmuls × 12 blocks = 72 matmuls + 1 proj ~250M FLOPs
100 tokens: 73 × 100 = 7,300 matmuls ~25B FLOPs
Every token you see streaming? All 12 blocks ran. All 73 matmuls fired.
The vector that entered block 1 as a raw embedding exits block 12
as a rich, context-aware representation. Same shape in, same shape out.
Scale comparison: Qwen-3.5 27B runs 64 blocks with 7 matmuls each
= 449 matmuls per token, ~44B FLOPs. Same recipe, 176× the cost.
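The matmul counts above are plain arithmetic, easy to sanity-check in a few lines:

```python
MATMULS_PER_BLOCK = 6    # Q, K, V, output projection, FFN up, FFN down
BLOCKS = 12              # GPT-2 small

per_token = MATMULS_PER_BLOCK * BLOCKS + 1   # +1: the final logit projection
print(per_token)         # 73
print(per_token * 100)   # 7300 matmuls for a 100-token completion

# The scale comparison: 64 blocks x 7 matmuls (SwiGLU adds a gate) + 1 projection
print(7 * 64 + 1)        # 449
```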
The Final Step: PROJECT
After all 12 blocks (GPT-2) or 64 (Qwen), one last norm stabilises the vector. Then comes the PROJECT step from the pipeline — the moment the model turns its internal representation back into words. It multiplies the vector by the embedding table transposed. Remember that table from Chapter 2? The same 50,257×768 matrix that converted token IDs into vectors at the start now works in reverse: a dot product of the output vector against every row produces one score per word in the vocabulary. High score means "my output vector is close to that word's embedding." These 50,257 scores are called logits — raw, unbounded numbers with no guarantee of summing to anything. They're not probabilities yet. They're a ranked ballot, and the model needs a way to pick a winner. That's the final piece.
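A minimal sketch of the PROJECT step, using random stand-in values for the output vector and the embedding table (float32 to keep the toy table small):

```python
import numpy as np

vocab, d = 50257, 768
rng = np.random.default_rng(0)
x = rng.standard_normal(d, dtype=np.float32)           # vector leaving the final norm
W_embed = rng.standard_normal((vocab, d), dtype=np.float32)  # stand-in embedding table

logits = W_embed @ x          # one dot product per vocabulary row
print(logits.shape)           # (50257,)
best = int(logits.argmax())   # the token whose embedding points most in x's direction
```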
// 04 — Choosing Words
Top-p prunes the tail. One random draw. This is where determinism ends and generation begins._
Those raw scores are called logits. They're not probabilities yet — they're unbounded numbers, some positive, some negative, with no guarantee of summing to anything. Before we look at how sampling transforms them, let's see what they actually look like.
The sampling pipeline transforms these scores: penalize repetition, prune the tail, control the temperature, normalize to probabilities, and finally roll the dice. Each filter shapes the output in a different way, and their order matters — engines disagree on the right sequence. Here's one common order, used by llama.cpp and several other engines.
1. Repetition Penalty penalize the past
Recently generated tokens get their logits divided by 1.1 (negative logits are multiplied instead, so repeated tokens always lose probability).
Already said “Paris”? Its score drops. Prevents loops.
2. Top-K K=50 hard ceiling
Keep only the top K scores. Everything else → −∞.
Problem: K=50 is too many when confident, too few when uncertain.
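Steps 1 and 2 operate directly on the logits. A sketch under the common CTRL-style convention for the penalty (positive logits divided, negative ones multiplied); the helper names are mine, not from any particular engine:

```python
import numpy as np

def repetition_penalty(logits, recent_ids, penalty=1.1):
    # step 1: divide positive logits by the penalty, multiply negative ones
    # (dividing a negative logit would raise its probability instead)
    out = logits.copy()
    for t in set(recent_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def top_k(logits, k=50):
    # step 2: everything outside the k highest scores goes to -inf
    out = np.full_like(logits, -np.inf)
    keep = np.argsort(logits)[-k:]
    out[keep] = logits[keep]
    return out

logits = np.array([3.0, 1.5, -0.5, 0.2, 2.0])
after = top_k(repetition_penalty(logits, recent_ids=[0]), k=3)
# token 0's logit drops from 3.0 to ~2.73; the two lowest scores become -inf
```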
Pruning removes the obvious junk. But a fixed ceiling can't adapt to how confident the model is. The next three filters shape the distribution dynamically:
3. Top-P (nucleus) p=0.95 adaptive cutoff
Walk down the sorted list, accumulating probability. Stop at p.
Confident: [0.91, 0.04, …] → keeps 1–2 tokens
Uncertain: [0.15, 0.12, …] → keeps 8–10 tokens
4. Min-P min_p=0.05 floor filter
Cut anything below min_p × top_probability.
Newer than top-p. Simpler. Many practitioners prefer it alone.
5. Temperature T=0.7 sharpen/flatten
Divide all logits by T before softmax.
T=0.3: [0.99, 0.01, …] locks in. T=1.5: [0.37, 0.22, …] anything goes.
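Steps 3 to 5 can be sketched the same way. Top-p and min-p need probabilities to compare against, so this sketch runs an interim softmax just for filtering, then applies temperature to the surviving logits; real engines differ in the exact mechanics:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def top_p_mask(probs, p=0.95):
    # step 3: keep the smallest set of tokens whose cumulative mass reaches p
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    mask = np.zeros(len(probs), dtype=bool)
    mask[order[:cutoff]] = True
    return mask

def min_p_mask(probs, m=0.05):
    # step 4: drop anything below m x the top probability
    return probs >= m * probs.max()

logits = np.array([5.0, 3.0, 2.5, 1.0, -1.0, -3.0])
probs = softmax(logits)                   # interim distribution, used only for filtering
keep = top_p_mask(probs) & min_p_mask(probs)
filtered = np.where(keep, logits, -np.inf)
sharpened = filtered / 0.7                # step 5: temperature on the surviving logits
print(np.count_nonzero(np.isfinite(sharpened)))  # 3 tokens survive both filters
```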
What remains is a shaped set of scores. Two final steps turn them into a word:
6. Softmax e^logit / Σ e^logit logits → probabilities
Raw scores become a distribution that sums to 1.
7. Random Sample roll the dice
Draw one token. Done. Determinism ends, generation begins.
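The two final steps, softmax plus one seeded draw, are exactly where runs start to diverge. A sketch assuming NumPy, with temperature folded in to show its effect on the dice:

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    z = logits / temperature                      # sharpen (T<1) or flatten (T>1)
    probs = np.exp(z - z.max())                   # step 6: softmax, shifted for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))   # step 7: one categorical draw

logits = np.array([4.0, 2.0, 1.0, 0.5])
low  = [sample_token(logits, 0.3, np.random.default_rng(s)) for s in range(8)]
high = [sample_token(logits, 1.5, np.random.default_rng(s)) for s in range(8)]
print(low)   # at T=0.3 the distribution collapses: token 0 dominates
print(high)  # at T=1.5 the tail gets real probability mass
```

Fix the seed and the draw is reproducible; change it and the same logits yield a different token. That single `rng.choice` is the entire boundary between determinism and generation.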
Same prompt, same model, different roll → different story.
Sampling Explorer: Watch the Words Change
Drag the sliders. Watch the sentence change. At low temperature, the model always picks the highest-probability word. Crank it up, and rarer words start winning. Top-p prunes the long tail — lower it to force conservative output even at high temperature. Hit Resample to re-roll — at T=0.3 you get the same sentence every time; at T=1.5, every roll is different.
Practical Recipes
Different tasks want different sampling. Not gospel — but the defaults most engines converge on:
Starting points
Code generation T=0 greedy. determinism matters.
Chat T=0.7 min_p=0.05 rep=1.1 warm but not wild.
Creative writing T=1.0 min_p=0.02 top_p=0.95 let rare words win sometimes.
Factual Q&A T=0 always pick the favourite.
We started with a question: what happens when you type something and words appear?
Now we know. Raw text becomes tokens. Tokens become vectors — points in a 768-dimensional space where meaning has geometry. Those vectors pass through 12 identical blocks, each one normalising, attending across context, and rewriting through a feed-forward network. Twelve times the same structure, twelve different sets of learned weights, each refining the representation a little further. What emerges is a single vector that has absorbed the entire context of everything before it.
That vector projects back to vocabulary space — a dot product against every word the model knows. Temperature, top-p, and min-p shape the resulting distribution. One random draw. One word. Append it to the sequence, and run the entire pipeline again.
That's the whole machine. No magic, no understanding, no intent — just matrix multiplications, applied in the right order, with the right learned numbers. The rest is engineering.
And that’s where Part II begins._