GGUF · AWQ · EXL2, DISSECTED
01 // The File
DOWNLOAD → HEADER → METADATA → TENSORS → WEIGHTS ← the whole map
Three ecosystems. Three design philosophies. One question: what is actually inside the file you downloaded?
In February 2023, Meta’s LLaMA weights leaked. Within a week, a Bulgarian developer named Georgi Gerganov had the model running on a MacBook — no Python, no PyTorch, pure C. His project, llama.cpp, needed a file format. The first attempt, GGML, broke constantly. Five months later he replaced it with GGUF: a single self-describing binary. Architecture, tokenizer, quantization, weights — one file, no sidecar configs. That bet turned llama.cpp into the engine behind Ollama, LM Studio, and most of the local AI ecosystem.
But Gerganov was not the only one building. From MIT’s Han Lab came AWQ — a quantization algorithm that stores its outputs as standard HuggingFace safetensors. No self-describing header, but native to vLLM and the GPU serving stack. From the ExLlamaV2 project came EXL2 — same safetensors packaging, but with a measurement.json that maps reconstruction error per column, enabling precision to the tenth of a bit. Three bets on the same problem:
| | GGUF | AWQ | EXL2 |
|---|---|---|---|
| On disk | 1 file | 5–8 files | 5–8 files |
| Metadata | Embedded | External JSON | External JSON + error map |
| Runs on | CPU, Apple Silicon, GPU | NVIDIA GPU | NVIDIA GPU |
| Precision | Per-tensor (S/M/L policy) | Per-group (128 weights) | Per-column (continuous bpw) |
GGUF optimises for portability — one file that runs on anything: a MacBook, a Raspberry Pi, a phone, an NVIDIA GPU, an AMD GPU. No dependencies, no ecosystem, no internet. AWQ optimises for GPU serving throughput — its weight layout is designed for vLLM’s continuous batching and CUDA tensor cores, and it loads with one line of Python in any HuggingFace-native framework. EXL2 optimises for quality per bit — the measurement pass and per-column allocation squeeze more intelligence from fewer bits than either alternative.
We start with GGUF because it is the dominant format. It has the widest hardware support (CPU, Apple Silicon, NVIDIA, AMD, mobile, browser via WebAssembly), the largest community of quantizers, the deepest integration with local tools (Ollama, LM Studio, GPT4All), and almost every model on HuggingFace ships a GGUF variant. When someone says “I downloaded a model,” they usually mean a .gguf file. Once you understand its anatomy, AWQ and EXL2 are variations on themes you already know.
02 // The Metadata
DOWNLOAD → HEADER + METADATA → TENSORS → WEIGHTS
Before a single weight is loaded, the runtime knows the entire architecture.
The Header
Every GGUF file starts with 24 bytes. Four magic characters, a version number, a tensor count, and a metadata count. The shortest possible handshake: I am GGUF v3, I contain 370 tensors and 300 metadata entries.
magic (4B) version (4B) tensor count (8B) metadata KV count (8B)
That tells the runtime how much is coming. Immediately after: what it all means.
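The 24-byte handshake is easy to verify by hand. Here is a minimal sketch using only Python's standard library — the field layout (little-endian "GGUF" magic, uint32 version, two uint64 counts) follows the published GGUF spec:

```python
import struct

def parse_gguf_header(buf: bytes) -> dict:
    """Parse the 24-byte GGUF header: magic, version, tensor count, metadata KV count."""
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", buf[:24])
    assert magic == b"GGUF", "not a GGUF file"
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header matching the example in the text: v3, 370 tensors, 300 KV pairs.
header = struct.pack("<4sIQQ", b"GGUF", 3, 370, 300)
print(parse_gguf_header(header))
# {'version': 3, 'tensor_count': 370, 'metadata_kv_count': 300}
```

The `<4sIQQ` format string is exactly 4 + 4 + 8 + 8 = 24 bytes — the whole handshake fits in one `struct.unpack` call.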
The Metadata
Three hundred typed key-value pairs, and the very first one is the most important. general.architecture — a single string, "qwen3" — sets the namespace for everything else. It is the key that unlocks all the other keys: qwen3.block_count, qwen3.attention.head_count, qwen3.feed_forward_length. Get this one wrong and the runtime cannot find anything.
Three namespaces organize the metadata. general.* identifies the model. {arch}.* (here qwen3.*) prescribes the architecture. tokenizer.* embeds the full vocabulary, merge rules, and special tokens. Together they form a complete instruction set for the runtime — no external files needed.
No runtime reads these keys for documentation. head_count_kv=8 with head_count=32 tells it: use the GQA kernel, not MHA. Allocate a KV cache 4× smaller. feed_forward_length=12288 tells it: wire three FFN projections, not two. The metadata is the computation graph — described before a single weight is loaded.
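The "4× smaller KV cache" claim is simple arithmetic. A sketch, assuming FP16 cache entries and a head dimension of 128 (as in Qwen3-8B); the 8192-token context length is an arbitrary example:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Size of the K + V caches in bytes for one sequence (FP16 by default).
    The leading 2 accounts for storing both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Qwen3-8B-like figures: 36 layers, 32 query heads, 8 KV heads, head_dim 128.
mha = kv_cache_bytes(36, 32, 128, 8192)  # if every query head kept its own K/V
gqa = kv_cache_bytes(36, 8, 128, 8192)   # what head_count_kv=8 actually allocates
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, ratio: {mha // gqa}x")
# MHA: 4.5 GiB, GQA: 1.1 GiB, ratio: 4x
```

Two metadata keys — head_count and head_count_kv — are the difference between a 4.5 GiB cache and a 1.1 GiB one at the same context length.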
03 // The Tensors
DOWNLOAD → HEADER → METADATA → TENSORS → WEIGHTS
After the metadata: 370 tensor entries. Every matrix in the model, listed by name, shape, and quantization type.
The tensor manifest is the bill of materials. Here is what it looks like for the first layer of Qwen3-8B — exactly what you would see if you opened the file:
Qwen3-8B — Tensor Manifest (blk.0)
token_embd.weight [151936, 4096] Q6_K ← vocabulary embedding
blk.0.attn_norm.weight [4096] F32 ← layer norm (full precision)
blk.0.attn_q.weight [4096, 4096] Q4_K ← 32 query heads × 128 dims
blk.0.attn_k.weight [1024, 4096] Q4_K ← 8 KV heads (GQA 4:1)
blk.0.attn_v.weight [1024, 4096] Q4_K ← values, same GQA shape
blk.0.attn_output.weight [4096, 4096] Q4_K ← merge heads
blk.0.ffn_norm.weight [4096] F32 ← layer norm
blk.0.ffn_gate.weight [12288, 4096] Q4_K ← SwiGLU gate
blk.0.ffn_up.weight [12288, 4096] Q4_K ← expand 3×
blk.0.ffn_down.weight [4096, 12288] Q6_K ← compress (promoted!)
... blk.1 through blk.35 — same 9 tensors, same names ...
output.weight [151936, 4096] Q6_K ← output head
370 entries like this. Three things to notice:
- Names tell you the architecture. attn_q, attn_k, attn_v as separate tensors = split attention. K smaller than Q = GQA. ffn_gate present = SwiGLU. The names are the blueprint.
- Shapes tell you the design. [1024, 4096] for K and V means 8 KV heads shared across 32 query heads. The GQA ratio is in the shape.
- Quant types tell you the compression. Most tensors are Q4_K. But the embedding, output head, and some FFN projections are Q6_K — promoted to higher precision. Layer norms are F32. This is the mixed-precision policy at work.
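The quant types in the manifest map directly to bytes on disk. A back-of-envelope sketch, using the per-block storage sizes from llama.cpp's k-quant definitions (Q4_K packs each 256-weight super-block into 144 bytes, Q6_K into 210):

```python
# Bytes per super-block of 256 weights, per quant type in the manifest.
BLOCK_BYTES = {"Q4_K": 144, "Q6_K": 210, "F32": 256 * 4}

def bpw(qtype):
    """Effective bits per weight for a type."""
    return BLOCK_BYTES[qtype] * 8 / 256

def tensor_bytes(shape, qtype):
    """Storage for one tensor: number of 256-weight blocks x bytes per block."""
    n = 1
    for d in shape:
        n *= d
    return (n // 256) * BLOCK_BYTES[qtype]

print(f"Q4_K: {bpw('Q4_K')} bpw, Q6_K: {bpw('Q6_K'):.2f} bpw")  # Q4_K: 4.5 bpw, Q6_K: 6.56 bpw
print(tensor_bytes([4096, 4096], "Q4_K") / 2**20, "MiB")   # blk.0.attn_q
print(tensor_bytes([4096, 12288], "Q6_K") / 2**20, "MiB")  # blk.0.ffn_down (promoted)
```

Summing this over all 370 manifest entries is how an 8B-parameter model ends up around 5 GB on disk: most weights at 4.5 bits, the promoted tensors at 6.56, and only the tiny norm vectors at full 32-bit precision.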
That is the entire GGUF anatomy: a single file that describes itself completely. But GGUF is not the only way to package a quantized model. Before we look at how the weights are compressed, let’s see what the alternatives put on disk.
04 // The Alternatives
AWQ and EXL2 store weights in standard safetensors and let the HuggingFace ecosystem handle the rest. Here is what is actually on disk.
AWQ: The Activation-Aware Directory
A typical AWQ model on HuggingFace looks like this:
AWQ Directory
Qwen3-8B-AWQ/
├── config.json # architecture: layers, heads, dims, vocab
├── generation_config.json # sampling defaults (temperature, top_p)
├── model.safetensors # quantized weights (~4.8 GB)
├── quantize_config.json # AWQ-specific: bits, group_size, method
├── tokenizer.json # BPE vocabulary and merge rules
├── tokenizer_config.json # special tokens, chat template
└── special_tokens_map.json # BOS, EOS, PAD token mappings
Seven files. The weight file — model.safetensors — is the bulk, but it carries no metadata about what model it belongs to. Without config.json, the runtime doesn’t know how many layers to build. Without quantize_config.json, it doesn’t know the weights are 4-bit. The architecture knowledge that GGUF bakes into its header lives here in loose JSON files.
The weight encoding itself is distinctive. GGUF stores each tensor as a single quantized blob — you look up blk.0.attn_q and get the compressed weights directly. AWQ splits each quantized linear layer into three safetensors entries:
| Key | Contents | Purpose |
|---|---|---|
| model.layers.0.self_attn.q_proj.qweight | INT32 packed | Eight 4-bit weights per int32 |
| model.layers.0.self_attn.q_proj.qzeros | INT32 packed | Zero-points for each group of 128 weights |
| model.layers.0.self_attn.q_proj.scales | FP16 | Per-group scale factors |
Every linear layer becomes a triplet: packed weights, zero-points, and scales. The dequantization formula is weight = scale × (qweight − qzero), applied per group of 128 weights. This is asymmetric uniform quantization — each group gets its own scale and zero-point, but every value within a group maps to one of 16 evenly-spaced levels. No double quantization, no super-blocks. Simple, GPU-friendly, and fast to dequant in a CUDA kernel.
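The triplet layout fits in a few lines of code. A toy sketch of the unpack-and-dequantize step — note that the nibble ordering inside each int32 varies by kernel, so lowest-nibble-first here is an illustrative assumption, not the layout of any specific CUDA kernel:

```python
def unpack_int4(word: int) -> list[int]:
    """Split one int32 into eight 4-bit values (lowest nibble first — an
    illustrative assumption; real kernels choose their own packing order)."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

def dequant_group(qvals, qzero, scale):
    """Asymmetric uniform dequantization: weight = scale * (q - qzero)."""
    return [scale * (q - qzero) for q in qvals]

# One packed word holding eight 4-bit weights, plus one group's scale and zero-point.
packed = 0x76543210
qvals = unpack_int4(packed)          # -> [0, 1, 2, 3, 4, 5, 6, 7]
print([round(w, 3) for w in dequant_group(qvals, qzero=8, scale=0.05)])
```

Each of the 16 possible nibble values maps to one of 16 evenly spaced levels around the group's zero-point — exactly the asymmetric uniform scheme described above, with no super-blocks or second-level quantization of the scales.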
The file that makes AWQ AWQ is quantize_config.json:
quantize_config.json
{
"quant_method": "awq",
"bits": 4,
"group_size": 128,
"zero_point": true,
"version": "gemm"
}
Five fields. Compare this to GGUF’s 300 metadata key-value pairs. AWQ’s quantization config is minimal because the architecture is already in config.json — the standard HuggingFace config that every model ships with, quantized or not. AWQ just adds the quant-specific overlay.
No mmap. No single-file portability. No embedded tokenizer. But: native to the GPU serving stack. vLLM, TGI, and HuggingFace Transformers load AWQ models with a single line of Python. The format that serves billions of tokens per day in production is not the self-describing one — it is the one that plugs directly into the existing infrastructure.
EXL2: The Measured Directory
An EXL2 model directory looks almost identical to AWQ’s — safetensors, config, tokenizer. But one file changes everything:
EXL2 Directory
Qwen3-8B-EXL2-4.0bpw/
├── config.json # architecture (same as AWQ)
├── generation_config.json # sampling defaults
├── model.safetensors # quantized weights (~4.3 GB)
├── tokenizer.json # BPE vocabulary
├── tokenizer_config.json # special tokens, chat template
└── measurement.json # per-layer, per-column error map
measurement.json is EXL2’s unique artifact. Before quantizing a single weight, EXL2 runs the full-precision model on calibration data and measures reconstruction error at every bit-width (2, 3, 4, 5, 6, 8) for every column of every layer. The result is a complete sensitivity map of the model — which columns can tolerate 2-bit precision, which ones fall apart below 6-bit, and everything in between.
This measurement is expensive — it requires the full FP16 model and takes hours on a fast GPU. But it only happens once. The measurement.json file is reusable: given any target bits-per-weight, EXL2 solves a knapsack problem to allocate precision where the error map says it matters most. Want 3.5 bpw? 4.25 bpw? 6.0 bpw? Re-run the quantizer with the same measurement file and a different target. No re-measuring needed.
The precision allocation is per-column, not per-tensor. Inside a single weight matrix, column 47 might be 2-bit while column 48 is 6-bit. The bit allocation is so fine-grained that the effective bpw is a continuous number. This is why EXL2 models are named by their target: 4.0bpw, 3.5bpw, 6.5bpw — the number is a budget that the knapsack solver distributes optimally. (The next section compares all three approaches side-by-side.)
Best quality-per-bit of the three formats — at 4 bpw, EXL2 typically matches or beats GGUF’s Q4_K_M and AWQ’s 4-bit on perplexity benchmarks. But: GPU-only, ExLlamaV2-only, and the measurement pass requires the full FP16 model. One runtime, one hardware target, maximum precision.
Now you have seen all three anatomies. The next question is the same for all of them: how do you compress 8 billion parameters without destroying the model?
05 // The Weights & The Variants
How 8 billion parameters fit in 5 GB, and what Q4_K_M, IQ4_XS, and Q5_K_S actually mean.
You saw the quant types in the manifest: most tensors Q4_K, some Q6_K, norms at F32. That is not random. Quantization is triage — the edges of the network (embedding, output head) get more precision because errors there propagate through every layer. The middle layers are resilient. The _S, _M, _L suffixes you see on HuggingFace are precision policies: how aggressively the quantizer promotes sensitive tensors to higher bit-widths.
Each format handles this differently. GGUF uses network position (no calibration needed). AWQ uses activation magnitude (small calibration set — see Fewer Bits, Same Brain for the full algorithm). EXL2 uses per-column reconstruction error (expensive measurement pass). Three signals, same principle: not all weights are equal.
| Format | Signal | Granularity | Calibration? |
|---|---|---|---|
| GGUF (K-quant) | Network position | Per-tensor (370 decisions) | No |
| AWQ | Activation magnitude | Per-group (128 weights) | Yes (small) |
| EXL2 | Reconstruction error | Per-column (thousands) | Yes (expensive) |
Within GGUF, the naming encodes three things: family (Legacy, K-quant, or I-quant), bit-width (the number), and precision policy (the suffix). Here is every type you will encounter:
Sizes shown for Qwen3-27B — where the quant choice is the difference between a laptop and a server.
AWQ Variants
AWQ’s naming is simpler because there is less to vary. Almost every AWQ model is 4-bit with group size 128. The main axes of variation:
- Group size (64 or 128): Smaller groups = better quality, more overhead. 128 is the default; 64 appears on some careful quantizations.
- Kernel compatibility: Some uploads specify “Marlin” or “GEMM” in the name. Marlin is a fast GPU kernel for 4-bit inference — it does not change the quantization, only the runtime speed. GEMM is the default, slower kernel.
- Quantizer: llm-compressor (vLLM’s official tool, now the standard) or AutoAWQ (the original community tool). Both produce compatible files.
When you see Qwen3-8B-AWQ on HuggingFace with no further qualifiers, it is almost certainly 4-bit, group 128, GEMM kernel. The lack of variety is a feature: AWQ optimises for one sweet spot rather than offering a menu.
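The group-size tradeoff is quantifiable. Assuming an FP16 scale and a packed 4-bit zero-point per group — matching the qweight/qzeros/scales layout described above — the per-group metadata amortizes like this:

```python
def effective_bpw(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight once per-group metadata (one FP16 scale and
    one packed 4-bit zero-point, an assumption stated in the text above) is
    amortized over the group."""
    return bits + (scale_bits + zero_bits) / group_size

print(effective_bpw(group_size=128))  # 4.15625
print(effective_bpw(group_size=64))   # 4.3125
```

Halving the group size doubles the metadata overhead (about 0.16 extra bpw at group 128, 0.31 at group 64) in exchange for scales and zero-points that track the weights more closely — which is why 64 only appears on careful quantizations where the quality gain justifies the size.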
EXL2 Variants
EXL2 names its variants by target bits-per-weight: 3.0bpw, 4.0bpw, 5.0bpw, 6.5bpw. The number is not a quantization type — it is the budget that the knapsack solver distributes across columns. A 4.0bpw model has columns ranging from 2-bit to 6-bit internally; 4.0 is the average.
The precision is continuous. A quantizer can target 3.75bpw or 4.25bpw with equal ease — useful for fitting a model into a specific VRAM budget. And because the measurement.json is reusable, re-quantizing at a different bpw target takes minutes, not hours. The expensive measurement pass happens once; every target after that is a fast knapsack solve.
The Decision
There is no single best format. Each wins on a different axis.
- No NVIDIA GPU (CPU or Apple Silicon): no choice — AWQ and EXL2 require NVIDIA. GGUF Q4_K_M is the default; IQ4_XS if you need it smaller.
- Local GPU, minimal setup: Ollama = zero setup, one command. EXL2 = 2–3× faster generation but needs ExLlamaV2. Convenience vs speed.
- Production serving: AWQ is native to vLLM and TGI. Continuous batching, tensor-core layout. The production choice.
- Maximum quality per bit: EXL2’s per-column allocation beats per-tensor and per-group at the same bpw. Expensive measurement pass, but you pay it once.
- Model larger than VRAM: GGUF can offload layers to CPU — the others cannot. If it fits in VRAM, EXL2 at 3.0–3.5 bpw gets better quality per byte.
- VRAM to spare: Q6_K saves ~60% space for +0.03 PPL. If you have the VRAM, just use FP16 — no quantization to worry about.
At 4-bit, the three formats produce nearly identical perplexity. On LLaMA-2 13B: AWQ 4.33, GGUF Q4_K_M 4.33, GPTQ 4.34. The real differentiators are not quality — they are hardware compatibility (GGUF runs everywhere), ecosystem fit (AWQ plugs into vLLM), and precision granularity (EXL2 lets you dial in the exact bpw). Pick the format that matches your runtime, not the one with the lowest perplexity on a benchmark.
06 // Running It
You understand the anatomy. Now download one and talk to it.
Each format has its own path from HuggingFace to running inference.
GGUF — llama.cpp / Ollama
llama.cpp is the runtime that created GGUF. Ollama wraps it with a friendlier interface.
Terminal
# Ollama — one command, zero setup
ollama run qwen3:8b
# Or pull a specific quant from HuggingFace
ollama run hf.co/bartowski/Qwen3-8B-GGUF:Q4_K_M
# llama.cpp directly — more control, same engine
llama-cli -m Qwen3-8B-Q4_K_M.gguf -p "Hello" -n 128
Both read the GGUF header, build the graph from metadata, and load tensors via mmap. CPU, Apple Silicon, or NVIDIA GPU — same binary. Ollama adds model management and an API; llama.cpp gives you direct control.
AWQ — vLLM
The GPU serving path. vLLM loads the safetensors directory, reads config.json for architecture and quantize_config.json for the quant parameters, and starts serving.
Terminal
# Serve an AWQ model with vLLM
vllm serve Qwen/Qwen3-8B-AWQ \
--quantization awq \
--max-model-len 8192
The model downloads from HuggingFace automatically. vLLM’s continuous batching serves hundreds of concurrent requests on a single GPU. No mmap, no single-file portability — but the weight layout is optimised for CUDA tensor cores and the serving stack is battle-tested at scale.
EXL2 — ExLlamaV2
The quality-per-bit path. ExLlamaV2 loads the safetensors, reads the per-column bit allocation, and dequantizes on the fly during inference.
Terminal
# Serve via TabbyAPI (ExLlamaV2 backend)
python -m tabbyapi \
--model-dir ./Qwen3-8B-EXL2-4.0bpw
Same model, three paths. The format you chose on HuggingFace determines which runtime loads it, which hardware runs it, and how it gets there.
Every runtime — Ollama, vLLM, ExLlamaV2 — does the same thing: read the architecture, load the weights, wire the computation graph, start generating. The format determines how it reads, not what it reads. GGUF embeds everything in one file. AWQ and EXL2 scatter it across a directory. The model inside is the same 8 billion parameters, the same 36 layers, the same attention patterns. Now you know what is inside all of them.
Where this fits. This is the Prologue to the Efficiency Papers — the foundation layer. Before you optimise attention (Part I), quantize weights (Part II: Fewer Bits, Same Brain), or serve at scale (Part III), you need to understand what is inside the file. For the inference pipeline that executes these tensors, see The Inference Engine Deep Dive.