Inference Engines
Make the numbers answer. This is the story of how.
// 01 — The Pipeline
Same skeleton, different scale. This chapter is the map you'll reference for everything after.
When you talk to a language model, words appear as if it's composing a thought. It isn't. Each word is a separate run through the same fixed pipeline — hundreds of matrix multiplications that read everything so far, produce a probability for every word it knows, and pick one. Append. Repeat. The model that wrote word five is the same machine that wrote word five thousand. Same weights, same path, same math.
The whole loop
def generate(prompt_tokens, model, max_tokens=50, temperature=1.0, eos_token=None):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        logits = forward_pass(tokens, model)   # ALL tokens, ALL layers
        probs = softmax(logits[-1] / temperature)
        tokens.append(sample(probs))
        if tokens[-1] == eos_token:
            break
    return tokens
# forward_pass → pick a word → append → repeat. That's it.
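To watch the loop run end to end, here is a self-contained toy version. `forward_pass` and `sample` are stand-ins (seeded random logits, greedy pick), not the real model — the point is the control flow, not the predictions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward_pass(tokens, model):
    # Stand-in for the real model: random logits per position,
    # seeded by the sequence length so runs are reproducible.
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=(len(tokens), model["vocab_size"]))

def sample(probs):
    return int(np.argmax(probs))  # greedy "sampling" for the demo

def generate(prompt_tokens, model, max_tokens=50, temperature=1.0, eos_token=None):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        logits = forward_pass(tokens, model)
        probs = softmax(logits[-1] / temperature)
        tokens.append(sample(probs))
        if tokens[-1] == eos_token:
            break
    return tokens

out = generate([1, 2, 3], {"vocab_size": 50257}, max_tokens=5)
print(len(out))  # 3 prompt tokens + 5 generated = 8
```

Swap in a real `forward_pass` and a real sampler and this is, structurally, the whole inference loop.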
“The capital of France” — ML’s Hello World
This sentence appears in almost every transformer tutorial and inference demo. It’s a single-factual-hop question — one lookup, one answer. Even the smallest model nails it, making it a perfect smoke test: if your model can’t finish “The capital of France is ___”, something is broken at the infrastructure level.
It also shows what certainty looks like inside the model. After “France”, the probability distribution is extraordinarily peaked — ~73% on “ is”, then ~92% on “ Paris”. Most prompts produce far flatter distributions. This one is a spotlight with almost zero ambiguity.
Seen in: Vaswani et al., HuggingFace quickstarts, llama.cpp smoke tests, vLLM benchmarks, and ~10,000 attention blog posts.
Now that you've seen the loop — text in, one word out, repeat — let's open the machine that runs each pass. The diagram below is the complete map: every stage your data touches between entering as text and leaving as a probability. Don't memorise it now. The rest of this article unpacks each box one at a time.
The Complete Forward Pass
"The capital of France"
↓
★ EMBED Look up each token in a table of 50,257 vectors.
│ "The" → [0.12, -0.34, 0.56, …] (768 numbers)
│ Add a position signal so the model knows word order.
↓
┌───────────────────────────────────────────────────────┐
│ ★ NORM Stabilise the numbers (rescale to unit RMS) │
│ ↓ │
│ ★ ATTENTION Each token asks: "who in this sentence matters │
│ to me?" Produces a relevance-weighted blend │
│ of all previous tokens' information. │
│ ↓ + skip connection (add original input back) │
│ │
│ ★ NORM Stabilise again │
│ ↓ │
│ ★ FFN Expand 768 → 3072, apply non-linearity, shrink │
│ back to 768. Attention gathers; the FFN │
│ rewrites — transforming mixed context into │
│ new features. │
│ ↓ + skip connection │
└───────────────────────────────────── ×12 identical ──┘
↓
★ NORM Final stabilisation
↓
★ PROJECT Multiply by the embedding table transposed to get
│ a score for every word in the vocabulary.
↓
[0.01, 0.02, …, 0.73, …] ← 50,257 scores (logits)
Highest score: token 318 = " is"
| Stage | What it does |
|---|---|
| ★ EMBED | Token ID → vector lookup. Each integer maps to a row in a table of 50,257×768 (GPT-2) or 152,064×5,120 (Qwen). This same table reappears at the end — the model asks which embedding is closest to its output. |
| ★ NORM | Stabilise numbers before each operation. RMSNorm (modern) or LayerNorm (GPT-2). Three passes over the vector — trivial cost. |
| ★ ATTENTION | "Who matters to me?" Each token broadcasts a query, every previous token advertises a key. High dot-product = relevance. The output is a weighted blend of value vectors — the token absorbs information from its context. |
| ★ FFN | Expand → activate → compress. The vector inflates to several times its width (4× in GPT-2: 3,072 dims; 3.4× in Qwen: 17,408 dims), a gating function decides which features survive, then it shrinks back. The heaviest computation in the block — ~78% of per-block FLOPs. |
| ★ PROJECT | Multiply by the embedding table transposed → one score per vocabulary token. The model is asking: "which word's embedding vector is closest to my output?" |
| ★ SAMPLE | 50,257 scores (or 152,064 for Qwen). Apply temperature, top-p, top-k. Draw one token. That's the output. |
Six weight-matrix multiplications per block (GPT-2) or seven (Qwen's SwiGLU adds a gate matrix). Twelve blocks (GPT-2) or 64 (Qwen). That's 73 matmuls for GPT-2, 449 for Qwen — to produce one token. For GPT-2: ~250M FLOPs. For Qwen-3.5 27B: ~44 billion. Same recipe, 176× the bill.
That diagram is the whole model. Here it is as working code — 60 lines of Python that implement GPT-2's forward pass. Every stage from the diagram maps to a function below.
Full Code (60 lines)
Python
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def attention(x, w_qkv, w_proj, b_qkv, b_proj, n_heads):
    # x: [seq_len, 768]
    B, D = x.shape
    qkv = x @ w_qkv + b_qkv              # [B, 3*768] — project to Q, K, V
    q, k, v = np.split(qkv, 3, axis=-1)  # each [B, 768]
    # Reshape into 12 independent heads of dim 64
    head_dim = D // n_heads              # 768/12 = 64
    q = q.reshape(B, n_heads, head_dim).transpose(1, 0, 2)
    k = k.reshape(B, n_heads, head_dim).transpose(1, 0, 2)
    v = v.reshape(B, n_heads, head_dim).transpose(1, 0, 2)
    # Causal attention: score every pair, mask the future, softmax, blend
    scores = (q @ k.transpose(0, 2, 1)) / np.sqrt(head_dim)  # [12, B, B]
    mask = np.triu(np.full((B, B), -1e9), k=1)               # -inf above diagonal
    scores += mask
    attn = softmax(scores)               # attention weights
    out = attn @ v                       # [12, B, 64] weighted blend of values
    # Merge heads back to [B, 768] and project
    out = out.transpose(1, 0, 2).reshape(B, D)
    return out @ w_proj + b_proj

def ffn(x, w_fc, b_fc, w_proj, b_proj):
    h = gelu(x @ w_fc + b_fc)            # [B, 768] → [B, 3072] expand + activate
    return h @ w_proj + b_proj           # [B, 3072] → [B, 768] compress back

def transformer_block(x, block):
    # Pre-norm → attention → residual
    h = layer_norm(x, block['ln1_g'], block['ln1_b'])
    x = x + attention(h, block['attn_qkv_w'], block['attn_proj_w'],
                      block['attn_qkv_b'], block['attn_proj_b'], 12)
    # Pre-norm → FFN → residual
    h = layer_norm(x, block['ln2_g'], block['ln2_b'])
    x = x + ffn(h, block['fc_w'], block['fc_b'],
                block['proj_w'], block['proj_b'])
    return x

def gpt2_forward(token_ids, model):
    x = model['wte'][token_ids] + model['wpe'][np.arange(len(token_ids))]
    # wte: [50257, 768] token embeddings
    # wpe: [1024, 768] position embeddings (max 1024 context)
    for block in model['blocks']:        # 12 transformer blocks
        x = transformer_block(x, block)
    x = layer_norm(x, model['ln_f_g'], model['ln_f_b'])
    logits = x @ model['wte'].T          # tie weights: reuse embedding table
    return logits                        # [seq_len, 50257]
This is GPT-2 — 124M params, 12 layers, 768 dims. Qwen-3.5 27B is the same skeleton at 176× the scale: 64 layers, 5120 dims, 152K vocab. The variations (RMSNorm, SwiGLU, RoPE, GQA) are in Chapter 3. But we skipped the first step. The pipeline starts with token IDs — where do those come from?
// 02 — Text to Numbers
How text becomes geometry. How geometry becomes meaning.
The model can't read text. It needs numbers. But not just any numbers — it needs meaningful numbers, vectors where "cat" and "dog" land near each other and "cat" and "uranium" don't. Getting there takes two transformations: tokenization splits your text into pieces and assigns each an integer ID; embedding converts each ID into a dense vector that encodes meaning. Both steps are surprisingly consequential.
Tokenization: What Makes a Token
A token is a piece of text — sometimes a word, sometimes part of a word, sometimes a single character. The model has a fixed vocabulary: a numbered list of every text fragment it can recognize. GPT-2 has 50,257 tokens. Qwen has 152,064. Every input, no matter how exotic, gets split into pieces from this list. The question is how you build the list.
Characters give a tiny vocab but painfully long sequences. Whole words give short sequences but can't handle typos or new words. Byte Pair Encoding (BPE) finds the sweet spot: common words are single tokens, rare words decompose into recognizable subwords, and nothing is ever out-of-vocabulary.
Comparison
Approach Vocab "unhappiness" → "The capital of France is Paris"
─────────────────────────────────────────────────────────────────────────────────────
Characters 256 u,n,h,a,p,p,i,n,e,s,s 30 tokens (long sequence)
BPE (GPT-2) 50,257 un|happ|iness 8 tokens (sweet spot)
BPE (Qwen) 152,064 unhappiness 6 tokens (bigger vocab → shorter)
Words ~500K unhappiness 6 tokens (but "ChatGPT" = ???)
Every token costs compute: ~250 million FLOPs in GPT-2, ~44 billion in a 27B model.
Fewer tokens = proportionally less work.
GPT-2 (50K vocab): ~4.5 chars/token → more tokens, more compute
Qwen (152K vocab): ~5.2 chars/token → 15% fewer tokens, 15% faster
Vocabulary size × sequence length is the budget.
BPE lets you tune the tradeoff.
How BPE Builds a Vocabulary
Start with individual characters. Scan the training corpus, find the most frequent adjacent pair, merge it into a new token. Repeat 50,000 times. What emerges: common words are single tokens, rare words decompose into recognizable subwords, and nothing is ever out-of-vocabulary — the base includes all 256 byte values.
The results are often not what you'd expect — walking through a few tokenizations reveals a lot about how the model actually sees your text.
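The merge loop is short enough to run. Here's a toy sketch of BPE training on a four-word corpus — illustrative only; production tokenizers work on bytes and pre-tokenized text, with far larger corpora:

```python
from collections import Counter

def bpe_train(words, num_merges):
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Merge that pair everywhere it occurs.
        new_corpus = Counter()
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The shared stem "low" crystallises into a single token within two merges — exactly the dynamic that turns common words into single tokens at 50,000-merge scale.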
Embedding: From IDs to Vectors
Tokenization gave us integers. Now each integer needs to become something the network can process: a vector — a list of numbers. GPT-2 uses 768 numbers per token. Qwen uses 5,120. Each number is a dimension, and each dimension encodes some learned feature of the token — not a feature anyone named, but one that emerged from training. One dimension might end up correlating with "is this a verb?", another with "does this relate to geography?", most with patterns too abstract to label.
The key insight: 768 dimensions means 768 independent axes. Two tokens can be similar on some axes and different on others — "Paris" and "London" are close on the geography axes but far apart on whatever encodes French vs. English. This is what makes the representation rich enough to work.
Python
# The entire embedding step:
x = embedding_table[token_ids] # shape: [seq_len, 768]
# That's it. A table lookup. No math, no matrix multiply.
# embedding_table.shape = [50257, 768] for GPT-2
# = [152064, 5120] for Qwen-3.5 27B
# The same table reappears at the end of the model:
logits = final_vector @ embedding_table.T # shape: [50257]
# Dot product of the output against EVERY row. No shortcut this time.
Each token is now a vector — 768 numbers for GPT-2, 5,120 for Qwen. The transformer blocks will process these vectors through the attention and FFN stages you saw in the pipeline. Chapter 3 traces one vector through one block, operation by operation.
// 03 — The Forward Pass
Same shape, different meaning. Then the next block does it again. Sixty-four times.
Chapter 1 showed the skeleton. Now we open the block and trace what happens to a single vector — at instruction level, the way the hardware sees it. Concrete dimensions, real FLOP counts, every weight matrix named. RMSNorm stabilizes. Attention looks backwards. The FFN decides what to keep. Each operation is a matmul with a specific shape, and we'll trace every one.
One Block, Traced
One vector enters the top of a transformer block. A handful of matrix multiplications (six in GPT-2, seven in Qwen) and two skip connections later, one updated vector exits the bottom. That block repeats 12 times (GPT-2) or 64 (Qwen). Here's the complete map — every weight matrix, every shape, every operation. The sections below trace each stage with real numbers.
input x [5120]
│
① RMSNorm ··· scale by learned γ, no bias
│
├→ W_q [5120×6144] → Q [24 heads × 256] ②
├→ W_k [5120×1024] → K [4 heads × 256] ② QKV Projection
├→ W_v [5120×1024] → V [4 heads × 256] ②
│
│ ↓ apply RoPE to Q, K (position encoding) ③
│
④ Attention scores = softmax(Q · K^T / √d)
│ output = scores · V
│ ↓
├→ W_out [6144×5120] → projection back to model dim
│
⊕ residual add original x back ← skip connection
│
RMSNorm ··· normalize again
│
├→ W_gate [5120×17408] → gate ⑤
│ │
│ SiLU(gate) ⊙ up ←────────⑤ SwiGLU FFN
│ │
├→ W_up [5120×17408] → up ⑤
│
├→ W_down [17408×5120] → back to [5120]
│
⊕ residual add pre-FFN input back ← skip connection
│
output → next block (or final norm + logits)
GPT-2: 6 matmuls (W_q, W_k, W_v, W_out, W_up, W_down)
Qwen: 7 matmuls (adds W_gate for SwiGLU)
× 64 blocks = 448 learned projections per token
(not counting score computation, softmax, norms, or cache ops)
Step 1: RMSNorm
Before any matrix multiply, the vector gets normalized. Why? Each layer multiplies the vector by large weight matrices. After a few layers, some numbers in the vector get enormous while others shrink toward zero — the scale drifts. If the next layer's weights were tuned for inputs around 1.0 and they receive 340.0, everything breaks.
RMSNorm — Root Mean Square Normalization — fixes this by dividing every number in the vector by a single value: the root mean square of the whole vector. That's literally: square every number, take the average, take the square root. Divide through. Now the vector's overall scale is back to ~1.0, and the next layer sees inputs in the range it expects. Two of these per block, 12 blocks in GPT-2 — 24 resets per token (128 in Qwen's 64 blocks).
The problem: each layer multiplies by weight matrices.
After a few layers, numbers drift — some hit 340, others drop
to 0.003. The next layer’s weights expect inputs near 1.0.
The fix: divide every number by the RMS of the whole vector.
RMS = √(mean of all values²). One number. Divide through. Done.
─────────────────────────────────────────────────────────────
Before RMSNorm:
dim 0: ████████████████████ 340.2 ← exploding
dim 1: ██ 0.003 ← vanishing
dim 2: ████████████ 89.7
After RMSNorm:
dim 0: ████ 1.73 ✓ tamed
dim 1: █ 0.00002 ✓ ratio preserved
dim 2: ██ 0.46 ✓ tamed
─────────────────────────────────────────────────────────────
Divide every dimension by the same number (the RMS).
Relative proportions stay intact. Absolute scale is reset.
Then γ (learned per-dimension) re-scales what matters.
LayerNorm vs RMSNorm
LayerNorm (GPT-2 era)
─────────────────────────────────────────────────────────────
1. Mean-center: μ = mean(x), subtract from every dim
the vector’s average becomes zero
2. Scale: σ = std(x), divide every dim by it
now the vector has unit variance
3. Re-scale: x × γ + β (learned per-dimension)
the network decides what to amplify & shift
RMSNorm (modern LLMs — LLaMA, Qwen, Mistral, …)
─────────────────────────────────────────────────────────────
1. Square: x² for every dimension
2. Mean: average those squared values
3. Root+divide: √(mean(x²)) = the RMS, divide every dim by it
4. Re-scale: x × γ (no β — no bias term)
─────────────────────────────────────────────────────────────
Why drop the mean-centering?
Re-centering contributes almost nothing to training stability —
the scale reset is what matters. Removing it:
• cuts norm parameters in half (no β vectors)
• saves one pass over the vector per norm call
• 12 blocks × 2 norms = 24 fewer reductions per token (GPT-2)
• simpler numerics → fewer edge cases in FP16/BF16
Math
# Input: x = [0.5, -1.2, 0.8, 0.3, ...] (768 values)
RMS(x) = sqrt(mean(x²))
= sqrt( (0.25 + 1.44 + 0.64 + 0.09 + ...) / 768 )
= sqrt(0.847) # typical RMS for a normalized hidden state
= 0.920
x_norm = x / RMS(x)
= [0.543, -1.304, 0.870, 0.326, ...]
output = x_norm * gamma # gamma is a learned per-dimension scale [768]
= [0.543 * 1.02, -1.304 * 0.97, ...]
= [0.554, -1.265, ...]
# Cost: 3 passes over the vector (square, sum, divide+multiply)
# FLOPs: ~3 × 768 = 2,304. Trivial compared to matmul.
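The same arithmetic as a runnable numpy sketch — a minimal RMSNorm, with an illustrative two-dimensional vector and unit gamma so the numbers are easy to check by hand:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Scale reset: divide every dimension by the vector's root mean square,
    # then let the learned per-dimension gamma re-amplify what matters.
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return (x / rms) * gamma

x = np.array([3.0, 4.0])
out = rms_norm(x, gamma=np.ones(2))
# RMS = sqrt((9 + 16) / 2) ≈ 3.536 → out ≈ [0.849, 1.131]
print(out.round(3))
```

After the divide, the output's own RMS is exactly 1.0 (up to eps) — the scale reset the text describes.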
Step 2: Attention — QKV Projection
The attention mechanism — introduced in Attention Is All You Need (Vaswani et al., 2017) — starts here. We multiply our normalized vector by three weight matrices to produce queries, keys, and values:
Every token produces three vectors from the same input:
Q = "What am I looking for?" the question this token asks
K = "What do I contain?" the label this token advertises
V = "What information do I carry?" the payload delivered if selected
─────────────────────────────────────────────────────────────
Example: token "France" at position 4
Q_France ≈ "I need context about countries, geography, politics"
K_France ≈ "I am a country name, European, noun"
V_France ≈ [rich vector encoding France's learned features]
─────────────────────────────────────────────────────────────
The attention mechanism in two steps:
Attention weights = softmax(Q · K^T / √d)
↑ match every query against every key
↑ high dot product = "this key answers my query"
Output = Attention · V
↑ read the winning payloads, weighted by relevance
Key insight: K decides WHO gets attention. V decides WHAT flows.
They're decoupled — a token can be highly relevant (strong K match)
but carry different information (V) than what K advertises.
If Qwen used standard multi-head attention (a hypothetical baseline — its real GQA layout follows below):
d_model = 5120, n_heads = 20, head_dim = 256
Q = x_norm @ W_q [5120] × [5120, 5120] → [5120] (20 heads × 256 dim)
K = x_norm @ W_k [5120] × [5120, 5120] → [5120] (20 heads × 256 dim)
V = x_norm @ W_v [5120] × [5120, 5120] → [5120] (20 heads × 256 dim)
Three identical matmuls. Symmetric.
FLOPs: 5120 × 5120 × 2 × 3 = ~157M per block per token.
─────────────────────────────────────────────────────────────
Dimension flow
Input: x_norm [5120]
x_norm [5120] × W_q [5120 × 5120] → Q [5120]
x_norm [5120] × W_k [5120 × 5120] → K [5120]
x_norm [5120] × W_v [5120 × 5120] → V [5120]
Reshape into heads:
Q [5120] → [20 heads × 256 dim]
K [5120] → [20 heads × 256 dim]
V [5120] → [20 heads × 256 dim]
Attention (per head):
scores = Q_head [1 × 256] · K_cache^T [256 × seq_len] → [1 × seq_len]
weights = softmax(scores / √256) → [1 × seq_len]
out = weights [1 × seq_len] · V_cache [seq_len × 256] → [1 × 256]
Concatenate all 20 heads, project back:
concat [20 × 256] = [5120] × W_out [5120 × 5120] → [5120]
Every head has its own Q, K, and V — 20 independent attention patterns.
KV cache stores all 20 heads × 256 dims = 5120 values per token per layer.
Most modern models use GQA instead of standard MHA. The idea:
keep many query heads, but share KV heads across groups.
─────────────────────────────────────────────────────────
Qwen 27B: 24 Q heads, 4 KV heads, head_dim = 256
Q = x @ W_q [5120] × [5120, 6144] → [6144] (24 × 256)
K = x @ W_k [5120] × [5120, 1024] → [1024] ( 4 × 256)
V = x @ W_v [5120] × [5120, 1024] → [1024] ( 4 × 256)
Q is 6× larger than K or V — 6 query heads share each KV head.
Total QKV FLOPs: ~84M (down from ~157M with standard MHA).
─────────────────────────────────────────────────────────
Query heads (24) KV heads (4)
Q0 Q1 Q2 Q3 Q4 Q5 →→→ KV0 group 0
Q6 Q7 Q8 Q9 Q10 Q11 →→→ KV1 group 1
Q12 Q13 Q14 Q15 Q16 Q17 →→→ KV2 group 2
Q18 Q19 Q20 Q21 Q22 Q23 →→→ KV3 group 3
Each group: 6 query heads share 1 KV head
─────────────────────────────────────────────────────────
Multi-Head Grouped-Query Multi-Query
Attention Attention Attention
KV heads: 24 4 1
KV cache: 100% 16.7% 4.2%
Quality: baseline ≈ same slightly worse
6 queries share 1 KV pair → 6× less KV cache memory
At 4K context: 1 GB KV cache instead of 6 GB
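Those cache-size figures follow directly from the shapes. A sketch under the stated assumptions — 64 layers, 4K context, FP16 (2 bytes per value), K and V both cached:

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim, bytes_per_val=2):
    # 2 tensors (K and V) per layer per token, each n_kv_heads × head_dim wide.
    return n_layers * seq_len * 2 * n_kv_heads * head_dim * bytes_per_val

gqa = kv_cache_bytes(n_layers=64, seq_len=4096, n_kv_heads=4, head_dim=256)
mha = kv_cache_bytes(n_layers=64, seq_len=4096, n_kv_heads=24, head_dim=256)
print(gqa / 2**30)  # 1.0  GiB with  4 KV heads (GQA)
print(mha / 2**30)  # 6.0  GiB with 24 KV heads (MHA)
```

Same context length, 6× less cache — the entire GQA trade in one function.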
Step 3: Attention — RoPE
How does the model know that "Paris" is the fifth word, not the first? Rotary Position Embedding encodes position by rotating query and key vectors in pairs of dimensions. Each dimension pair rotates at a different frequency, creating a unique angular signature for each position.
Each dimension pair is a clock hand spinning at a different speed.
Position = how far each hand has rotated from the start.
─────────────────────────────────────────────────────────────
Dimension pair i=0 (fastest clock):
pos 0: ↑ 0° pos 1: ↙ 180° pos 5: ↙ 900°
rapid rotation → distinguishes adjacent tokens
Dimension pair i=64 (medium clock):
pos 0: ↑ 0° pos 1: → 5.6° pos 5: ↓ 28°
moderate rotation → mid-range position sensitivity
Dimension pair i=127 (slowest clock):
pos 0: ↑ 0° pos 1: ↗ 0.06° pos 5: ↗ 0.3°
tiny rotation per step → detects long-range gaps
─────────────────────────────────────────────────────────────
Each position gets a unique angular “fingerprint” across all 128 pairs (head_dim 256 → 128 pairs).
No two positions produce the same combination of angles.
Python
# RoPE: rotate pairs of dimensions by position-dependent angles
# For dimension pair (2i, 2i+1) at position pos, i = 0 … head_dim/2 − 1:
freq = 1.0 / (base ** (2*i / head_dim)) # base=10000000 for Qwen-3.5
theta = pos * freq
# High-frequency pairs (i=0): rotate fast → capture fine, adjacent-token position
# Low-frequency pairs (i=127): rotate slowly → capture long-range position
q[2*i] = q_orig[2*i] * cos(theta) - q_orig[2*i+1] * sin(theta)
q[2*i+1] = q_orig[2*i] * sin(theta) + q_orig[2*i+1] * cos(theta)
# The key insight: Q·K^T now encodes RELATIVE position.
# "How far apart are these tokens?" — not "what absolute position?"
# This is why context extension (YaRN, NTK) works: adjust frequencies.
Scenario A: Q at position 5, K at position 3 → gap = 2
Scenario B: Q at position 100, K at position 98 → gap = 2
─────────────────────────────────────────────────────────────
How RoPE makes this work:
Q·K^T after rotation = f(pos_Q − pos_K)
The rotation of Q at pos 5 minus the rotation of K at pos 3
= the same angle as Q at pos 100 minus K at pos 98.
R(5) · R(3)^T = R(5−3) = R(2)
R(100) · R(98)^T = R(100−98) = R(2) ✓ identical
─────────────────────────────────────────────────────────────
Rotation matrices cancel: R(θ) · R(φ)^T = R(θ−φ)
This is why context extension (YaRN, NTK) works:
rescale the frequencies → stretch the position space.
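That cancellation can be checked directly. A minimal numpy RoPE on one head — head_dim=8 and base=10000 are illustrative, not Qwen's actual values:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each (2i, 2i+1) pair of dims by pos * freq_i.
    d = len(x)
    i = np.arange(d // 2)
    freqs = 1.0 / (base ** (2 * i / d))
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same gap (2), different absolute positions → identical attention score
s_a = rope(q, 5)   @ rope(k, 3)
s_b = rope(q, 100) @ rope(k, 98)
print(np.isclose(s_a, s_b))  # True — the score depends only on the gap
```

Shift both positions by any constant and the dot product never changes; only the gap matters.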
Step 4: Attention — Scores & Output
Now the model asks its central question: for each token, which other tokens matter most? Every token's query gets compared against every previous token's key — a dot product that measures relevance. High dot product means "you have what I'm looking for." Softmax turns these raw scores into weights that sum to 1. The output is a weighted blend of value vectors — each token absorbs information from the tokens that scored highest. Here's what that looks like for "The capital of France":
The model is about to predict the word after “The capital of France”.
Token “France” asks: “who in this sentence helps me decide what comes next?”
It compares its query against every previous token’s key.
The result: a weight for each token — how much to listen to it.
Token Weight How much “France” listens
───── ────── ──────────────────────────────────
"The" 0.08 ███░░░░░░░░░░░░░░░░░░░░░░
"capital" 0.62 ████████████████████░░░░░ ← highest
"of" 0.06 ██░░░░░░░░░░░░░░░░░░░░░░░
"France" 0.24 ████████░░░░░░░░░░░░░░░░░
↓
weighted blend of V vectors
↓
Output = 0.08·V_The + 0.62·V_capital + 0.06·V_of + 0.24·V_France
“France” focuses 62% on “capital” — not the nearest word,
but the most relevant one. It’s answering “capital of what?”
This is the core insight: attention is content-addressed, not positional.
Worked Example (4 tokens, 2 heads, head_dim=4)
# Simplified to illustrate the mechanics
Q (current token, head 0) = [0.8, -0.3, 0.5, 0.1]
K cache (all 4 tokens, head 0):
K₀ = [ 0.2, 0.5, -0.1, 0.3] "The"
K₁ = [-0.4, 0.1, 0.7, 0.2] "capital"
K₂ = [ 0.6, -0.2, 0.3, 0.8] "of"
K₃ = [ 0.3, 0.9, -0.5, 0.4] "France"
Scores = Q · K^T / sqrt(4):
s₀ = (0.16 - 0.15 - 0.05 + 0.03) / 2.0 = -0.005
s₁ = (-0.32 - 0.03 + 0.35 + 0.02) / 2.0 = +0.010
s₂ = (0.48 + 0.06 + 0.15 + 0.08) / 2.0 = +0.385
s₃ = (0.24 - 0.27 - 0.25 + 0.04) / 2.0 = -0.120
After softmax: [0.228, 0.232, 0.337, 0.203]
Token "of" gets highest attention weight → model is looking at "of France"
Output = 0.228·V₀ + 0.232·V₁ + 0.337·V₂ + 0.203·V₃
Weighted blend of all value vectors
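The worked example above can be replayed in numpy (same Q and K values):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

Q = np.array([0.8, -0.3, 0.5, 0.1])       # current token's query, head 0
K = np.array([[ 0.2,  0.5, -0.1, 0.3],    # "The"
              [-0.4,  0.1,  0.7, 0.2],    # "capital"
              [ 0.6, -0.2,  0.3, 0.8],    # "of"
              [ 0.3,  0.9, -0.5, 0.4]])   # "France"

scores = K @ Q / np.sqrt(4)    # one dot product per cached key
weights = softmax(scores)
# scores  ≈ [-0.005, 0.010, 0.385, -0.120]
# weights ≈ [ 0.228, 0.232, 0.337,  0.203] — "of" wins
```

Replace K with a real KV cache and this is, per head, the whole decode-time score computation.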
Who can attend to whom?
K₀ K₁ K₂ K₃
The capital of France
┌──────┬──────┬──────┬──────┐
Q₀ The │ ✓ │ × │ × │ × │
├──────┼──────┼──────┼──────┤
Q₁ cap │ ✓ │ ✓ │ × │ × │
├──────┼──────┼──────┼──────┤
Q₂ of │ ✓ │ ✓ │ ✓ │ × │
├──────┼──────┼──────┼──────┤
Q₃ Fra │ ✓ │ ✓ │ ✓ │ ✓ │
└──────┴──────┴──────┴──────┘
Upper triangle = −∞ before softmax → attention weight = 0
You can’t read the future.
─────────────────────────────────────────────────────────────
During generation (decode), this simplifies:
New token’s Q attends to all previous K’s in the cache.
No mask needed — there’s nothing ahead to block.
This is why attention heatmaps of causal models form a triangle:
each row has one more filled cell than the row above it.
Attention is a soft lookup table. Q is the search query, K is the index, V is the data. Unlike a hash table, every entry matches to some degree — softmax just decides how much.
In decode mode, this step is usually memory-bound, not math-bound — the GPU waits for data, not arithmetic. Part II explains why.
Step 5: SwiGLU FFN
After attention blends information across tokens, the FFN processes each token independently — expanding the vector to several times its width (5,120 → 17,408 in Qwen), selectively routing features through a learned gate, then compressing back. It's the largest single compute cost per block, and modern models use a gated variant called SwiGLU that differs fundamentally from GPT-2's simple expand-activate-shrink.
GPT-2's FFN was simple: expand to 4× width, apply GELU, shrink back.
Two matrices, one non-linearity. The activation decided which dimensions
to keep — but using only the expanded representation itself.
No separate "should I use this?" signal.
SwiGLU splits the decision from the content:
Same input x feeds both paths:
GATE path x → W_gate → SiLU → "should this feature activate?"
high → open the gate
low → shut it down
UP path x → W_up → "what value should it carry?"
the candidate content
COMBINE SiLU(gate) ⊙ up element-wise multiply
gate decides, up provides
DOWN result → W_down → compress back to [5120]
─────────────────────────────────────────────────────────────
Think of it like a mixing board. Each of the 17,408 channels has a
fader (the gate) and an audio signal (the up value). The model learned
which faders to push up for each kind of input.
A token about geography opens channels for location features and
closes channels for emotion features. A token about sentiment does
the opposite. Same weights, different routing per token.
The cost: three weight matrices instead of two. That's why the FFN
dominates compute — 535M FLOPs per block, ~78% of the budget.
But the quality gain is consistent, which is why every post-2022
model uses it.
The name: Swi from Swish (= SiLU) + GLU from Gated Linear Unit
─────────────────────────────────────────────────────────────
GLU family (Dauphin et al., 2017):
GLU(x, W, V) = σ(x·W) ⊙ (x·V) original: sigmoid gate
SwiGLU variant (Shazeer, 2020) replaces σ with Swish/SiLU:
SwiGLU(x, W_gate, W_up) = SiLU(x · W_gate) ⊙ (x · W_up)
Where SiLU (Sigmoid Linear Unit) = Swish with β=1:
SiLU(z) = z · σ(z) = z · 1/(1 + e^(−z))
─────────────────────────────────────────────────────────────
Full FFN with SwiGLU (as used in Qwen, Llama, Mistral, etc.):
FFN(x) = (SiLU(x · W_gate) ⊙ (x · W_up)) · W_down
Dimensions (Qwen 27B): d_model = 5120, d_ff = 17408
W_gate ∈ ℝ^(5120×17408) W_up ∈ ℝ^(5120×17408) W_down ∈ ℝ^(17408×5120)
─────────────────────────────────────────────────────────────
Why 3 matrices instead of 2?
Classic FFN (GPT-2): FFN(x) = GELU(x · W_1) · W_2 2 matmuls
SwiGLU FFN: FFN(x) = (SiLU(x·W_gate) ⊙ (x·W_up)) · W_down 3 matmuls
Extra matmul = ~50% more FFN FLOPs, but the gating mechanism
gives the model fine-grained control over which features survive.
Consistent quality wins across all benchmarks — worth the cost.
SwiGLU (4 lines)
gate = x @ W_gate # [5120] → [17408] "should this feature activate?"
up = x @ W_up # [5120] → [17408] "what value should it carry?"
h = SiLU(gate) * up # element-wise: gate decides, up provides
out = h @ W_down # [17408] → [5120] compress back
# vs GPT-2's FFN (2 matrices, no gate):
# h = GELU(x @ W_fc) # expand + activate
# out = h @ W_proj # compress
# FLOPs: 3 × 5120 × 17408 × 2 = 535M per block per token
# Attention FLOPs: ~147M. FFN is ~78% of per-block compute.
SiLU(x) = x × σ(x) = x × 1/(1 + e^(−x))
How SiLU behaves at different gate values:
gate = -3.0 → sigmoid = 0.05 → SiLU = -0.14 nearly killed
gate = -1.0 → sigmoid = 0.27 → SiLU = -0.27 suppressed
gate = 0.0 → sigmoid = 0.50 → SiLU = 0.00 on the fence
gate = 1.0 → sigmoid = 0.73 → SiLU = 0.73 mostly open
gate = 3.0 → sigmoid = 0.95 → SiLU = 2.86 wide open
vs ReLU: hard cutoff at 0. Everything negative → dead zero.
vs SiLU: smooth transition. Gradients flow even for negative inputs.
This smoothness gives the model fine-grained control over
"how open" each gate is — not just on/off.
─────────────────────────────────────────────────────────────
SiLU can output small negative values (minimum ≈ -0.28 at x ≈ -1.28).
The gate can slightly invert a feature, not just suppress it.
ReLU can't do this. GELU is similar but lacks the clean
multiplicative gating structure that makes SwiGLU work.
dim gate SiLU(gate) up gate × up status
─── ──── ────────── ── ───────── ──────
0 2.1 ████ 1.88 0.7 ███ 1.32 ✓ PASS
1 -0.5 ░ -0.19 1.3 ░ -0.25 × kill
2 3.4 █████ 3.30 -0.9 ████ -2.97 ✓ PASS
3 -2.8 ░ -0.16 2.1 ░ -0.34 × kill
4 0.1 ░ 0.05 0.4 ░ 0.02 × kill
5 1.7 ███ 1.44 -0.6 ██ -0.86 ✓ PASS
─────────────────────────────────────────────────────────────────
Each row is one dimension’s full pipeline: gate → SiLU → × up → result.
Dims 0, 2, 5 flow through. Dims 1, 3, 4 are suppressed.
The gate learns which features matter. The up-projection computes candidate values. Multiplying them = selective routing. It's like an if-statement learned from data: if gate[i] > 0: output[i] = up[i], but smooth and differentiable. One projection decides, the other provides — that's why SwiGLU needs three matrices instead of GPT-2's two, and why it works better: the model can learn to shut off irrelevant features instead of blending everything.
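The numbers in the gating table above are easy to reproduce — the same illustrative gate/up values, pushed through the SwiGLU element-wise path:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

gate = np.array([2.1, -0.5, 3.4, -2.8, 0.1, 1.7])
up   = np.array([0.7,  1.3, -0.9,  2.1, 0.4, -0.6])

h = silu(gate) * up                # gate decides, up provides
# silu(gate) ≈ [1.87, -0.19, 3.29, -0.16, 0.05, 1.44]
# h          ≈ [1.31, -0.25, -2.96, -0.34, 0.02, -0.86]
print(h.round(2))
```

Dimensions with strongly positive gates pass nearly unchanged; the rest are squashed toward zero — the learned if-statement in six array elements.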
Step 6: Exiting the Block
Two more operations close out the block. First, the attention output passes through an output projection — a matrix multiply (W_out: 768×768 in GPT-2) that merges all the attention heads back into a single vector. That's the fourth matmul, after the three QKV projections; the FFN's two matrices add the fifth and sixth. Second, a residual connection adds the block's input back to the FFN's output — the same skip-connection pattern used after attention. The block only has to learn what to change, not reconstruct the whole vector from scratch.
One block, summarised
input x [768]
↓
LayerNorm → Q,K,V projections (matmuls 1-3) → attention scores → W_out (matmul 4)
↓
+ x (residual)
↓
LayerNorm → FFN up (matmul 5) → GELU → FFN down (matmul 6) → GPT-2: 2 FFN matmuls
↓ Qwen: 3 (gate + up + down)
+ x (residual)
↓
output [768] ← same shape. different meaning. ready for the next block.
That's one block: six matmuls for GPT-2 (seven for Qwen's SwiGLU), two norms, two residual adds. The vector that enters is [768]. The vector that exits is [768]. Same shape, transformed meaning. Now do it 11 more times.
The Full Stack: 12 Blocks Deep
We traced one block. GPT-2 runs 12 of them, in sequence, on every token. The same six matrix multiplications, repeated with different learned weights. Here's what that stack looks like:
Block Operations FLOPs Cumulative
───── ────────── ───── ──────────
1 LayerNorm → QKV → Attn → Proj → FFN ~14M 14M
2 LayerNorm → QKV → Attn → Proj → FFN ~14M 28M
3 LayerNorm → QKV → Attn → Proj → FFN ~14M 42M
⋮ same structure, different learned weights ⋮ ⋮
12 LayerNorm → QKV → Attn → Proj → FFN ~14M ~170M
Then:
+1 Final LayerNorm → logit projection (W_embed^T) ~77M ~250M
─────────────────────────────────────────────────────────────────────
Per block: 6 matmuls × 1 block = 6 matmuls ~14M FLOPs
Per token: 6 matmuls × 12 blocks = 72 matmuls + 1 proj ~250M FLOPs
100 tokens: 73 × 100 = 7,300 matmuls ~25B FLOPs
Every token you see streaming? All 12 blocks ran. All 73 matmuls fired.
The vector that entered block 1 as a raw embedding exits block 12
as a rich, context-aware representation. Same shape in, same shape out.
Scale comparison: Qwen-3.5 27B runs 64 blocks with 7 matmuls each
= 449 matmuls per token, ~44B FLOPs. Same recipe, 176× the cost.
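The matmul counts above are plain arithmetic, easy to sanity-check in a few lines:

```python
MATMULS_PER_BLOCK = 6    # Q, K, V, output projection, FFN up, FFN down
BLOCKS = 12              # GPT-2 small

per_token = MATMULS_PER_BLOCK * BLOCKS + 1   # +1: the final logit projection
print(per_token)         # 73
print(per_token * 100)   # 7300 matmuls for a 100-token completion

# The scale comparison: 64 blocks x 7 matmuls (SwiGLU adds a gate) + 1 projection
print(7 * 64 + 1)        # 449
```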
The Final Step: PROJECT
After all 12 blocks (GPT-2) or 64 (Qwen), one last norm stabilises the vector. Then comes the PROJECT step from the pipeline — the moment the model turns its internal representation back into words. It multiplies the vector by the embedding table transposed. Remember that table from Chapter 2? The same 50,257×768 matrix that converted token IDs into vectors at the start now works in reverse: a dot product of the output vector against every row produces one score per word in the vocabulary. High score means "my output vector is close to that word's embedding." These 50,257 scores are called logits — raw, unbounded numbers with no guarantee of summing to anything. They're not probabilities yet. They're a ranked ballot, and the model needs a way to pick a winner. That's the final piece.
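A minimal sketch of the PROJECT step, using random stand-in values for the output vector and the embedding table (float32 to keep the toy table small):

```python
import numpy as np

vocab, d = 50257, 768
rng = np.random.default_rng(0)
x = rng.standard_normal(d, dtype=np.float32)           # vector leaving the final norm
W_embed = rng.standard_normal((vocab, d), dtype=np.float32)  # stand-in embedding table

logits = W_embed @ x          # one dot product per vocabulary row
print(logits.shape)           # (50257,)
best = int(logits.argmax())   # the token whose embedding points most in x's direction
```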
// 04 — Choosing Words
Top-p prunes the tail. One random draw. This is where determinism ends and generation begins._
Those raw scores are called logits. They're not probabilities yet — they're unbounded numbers, some positive, some negative, with no guarantee of summing to anything. Before we look at how sampling transforms them, let's see what they actually look like.
The sampling pipeline transforms these scores: penalize repetition, prune the tail, control the temperature, normalize to probabilities, and finally roll the dice. Each filter shapes the output in a different way, and their order matters — engines disagree on the right sequence. Here's one common order, used by llama.cpp and several other engines.
1. Repetition Penalty penalize the past
Recently generated tokens get their logits divided by 1.1 (negative logits are multiplied instead, so repeated tokens always lose probability).
Already said “Paris”? Its score drops. Prevents loops.
2. Top-K K=50 hard ceiling
Keep only the top K scores. Everything else → −∞.
Problem: K=50 is too many when confident, too few when uncertain.
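Steps 1 and 2 operate directly on the logits. A sketch under the common CTRL-style convention for the penalty (positive logits divided, negative ones multiplied); the helper names are mine, not from any particular engine:

```python
import numpy as np

def repetition_penalty(logits, recent_ids, penalty=1.1):
    # step 1: divide positive logits by the penalty, multiply negative ones
    # (dividing a negative logit would raise its probability instead)
    out = logits.copy()
    for t in set(recent_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def top_k(logits, k=50):
    # step 2: everything outside the k highest scores goes to -inf
    out = np.full_like(logits, -np.inf)
    keep = np.argsort(logits)[-k:]
    out[keep] = logits[keep]
    return out

logits = np.array([3.0, 1.5, -0.5, 0.2, 2.0])
after = top_k(repetition_penalty(logits, recent_ids=[0]), k=3)
# token 0's logit drops from 3.0 to ~2.73; the two lowest scores become -inf
```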
Pruning removes the obvious junk. But a fixed ceiling can't adapt to how confident the model is. The next three filters shape the distribution dynamically:
3. Top-P (nucleus) p=0.95 adaptive cutoff
Walk down the sorted list, accumulating probability. Stop at p.
Confident: [0.91, 0.04, …] → keeps 1–2 tokens
Uncertain: [0.15, 0.12, …] → keeps 8–10 tokens
4. Min-P min_p=0.05 floor filter
Cut anything below min_p × top_probability.
Newer than top-p. Simpler. Many practitioners prefer it alone.
5. Temperature T=0.7 sharpen/flatten
Divide all logits by T before softmax.
T=0.3: [0.99, 0.01, …] locks in. T=1.5: [0.37, 0.22, …] anything goes.
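Steps 3 to 5 can be sketched the same way. Top-p and min-p need probabilities to compare against, so this sketch runs an interim softmax just for filtering, then applies temperature to the surviving logits; real engines differ in the exact mechanics:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def top_p_mask(probs, p=0.95):
    # step 3: keep the smallest set of tokens whose cumulative mass reaches p
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    mask = np.zeros(len(probs), dtype=bool)
    mask[order[:cutoff]] = True
    return mask

def min_p_mask(probs, m=0.05):
    # step 4: drop anything below m x the top probability
    return probs >= m * probs.max()

logits = np.array([5.0, 3.0, 2.5, 1.0, -1.0, -3.0])
probs = softmax(logits)                   # interim distribution, used only for filtering
keep = top_p_mask(probs) & min_p_mask(probs)
filtered = np.where(keep, logits, -np.inf)
sharpened = filtered / 0.7                # step 5: temperature on the surviving logits
print(np.count_nonzero(np.isfinite(sharpened)))  # 3 tokens survive both filters
```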
What remains is a shaped set of scores. Two final steps turn them into a word:
6. Softmax e^logit / Σ e^logit logits → probabilities
Raw scores become a distribution that sums to 1.
7. Random Sample roll the dice
Draw one token. Done. Determinism ends, generation begins.
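The two final steps, softmax plus one seeded draw, are exactly where runs start to diverge. A sketch assuming NumPy, with temperature folded in to show its effect on the dice:

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    z = logits / temperature                      # sharpen (T<1) or flatten (T>1)
    probs = np.exp(z - z.max())                   # step 6: softmax, shifted for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))   # step 7: one categorical draw

logits = np.array([4.0, 2.0, 1.0, 0.5])
low  = [sample_token(logits, 0.3, np.random.default_rng(s)) for s in range(8)]
high = [sample_token(logits, 1.5, np.random.default_rng(s)) for s in range(8)]
print(low)   # at T=0.3 the distribution collapses: token 0 dominates
print(high)  # at T=1.5 the tail gets real probability mass
```

Fix the seed and the draw is reproducible; change it and the same logits yield a different token. That single `rng.choice` is the entire boundary between determinism and generation.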
Same prompt, same model, different roll → different story.
Sampling Explorer: Watch the Words Change
Drag the sliders. Watch the sentence change. At low temperature, the model always picks the highest-probability word. Crank it up, and rarer words start winning. Top-p prunes the long tail — lower it to force conservative output even at high temperature. Hit Resample to re-roll — at T=0.3 you get the same sentence every time; at T=1.5, every roll is different.
Practical Recipes
Different tasks want different sampling. Not gospel — but the defaults most engines converge on:
Starting points
Code generation T=0 greedy. determinism matters.
Chat T=0.7 min_p=0.05 rep=1.1 warm but not wild.
Creative writing T=1.0 min_p=0.02 top_p=0.95 let rare words win sometimes.
Factual Q&A T=0 always pick the favourite.
We started with a question: what happens when you type something and words appear?
Now we know. Raw text becomes tokens. Tokens become vectors — points in a 768-dimensional space where meaning has geometry. Those vectors pass through 12 identical blocks, each one normalising, attending across context, and rewriting through a feed-forward network. Twelve times the same structure, twelve different sets of learned weights, each refining the representation a little further. What emerges is a single vector that has absorbed the entire context of everything before it.
That vector projects back to vocabulary space — a dot product against every word the model knows. Temperature, top-p, and min-p shape the resulting distribution. One random draw. One word. Append it to the sequence, and run the entire pipeline again.
That's the whole machine. No magic, no understanding, no intent — just matrix multiplications, applied in the right order, with the right learned numbers. The rest is engineering.
And that’s where Part II begins._