
How sensitivity works

Per-layer KL-divergence sensitivity is mlx-optiq's measurement for deciding which transformer layers can safely be quantized to 4 bits and which need 8. Not every layer in a transformer is equally fragile under quantization: some take a 4-bit hit cleanly, while others fall apart. mlx-optiq tells them apart by perturbing one layer at a time and watching how the model's output distribution shifts.

Measure once, allocate everywhere. The same per-layer sensitivity number informs weight bit-width, KV bit-width, and LoRA rank.

Two methods: optiq and static

The per-layer allocation comes from one of two signals, selected with optiq convert --method. They trade measurement fidelity against convert time and memory.

Method	Signal	Convert cost	Best for
`optiq` · default	Exact logit KL from calibration forward passes	High — `n_layers × n_bits × n_samples` forward passes	The gold-standard signal; small / medium models
`static`	Structural rules from architecture, no measurement	Negligible — no forward passes, loads lazily, respects target BPW	The fast path; large bases where exact KL is impractical

Measured on the smallest base, Qwen/Qwen3.5-0.8B (GSM8K, 3-shot, 200 questions, target 5.0 BPW with candidate bits 4 and 8):

Method	Convert time	Achieved BPW	GSM8K
`optiq`	899 s	5.59	34.5%
`static`	7.2 s	5.15	34.5%

static does no measurement at all, yet its structural rules — high bits to the embedding and output head, the first and last block, attention and the MoE router; low bits to the dense MLP and routed experts — match optiq here at a lower BPW and 125× faster. The lesson: for a typical transformer, architecture is most of the sensitivity signal, so the layers optiq would pick out are largely known in advance. optiq still earns its cost when you want certainty or are characterising a new family. And because static loads lazily and runs no forward passes, it is the practical choice for the very large bases — a 2/4-bit mixed quant of a 122 B-A10B MoE that runs on a laptop through SSD expert streaming, for instance.

Choosing a method. Default to optiq for the gold-standard signal on small and medium models. Use static for a fast convert, and for any base too large for exact sensitivity to be practical: it matches optiq on typical architectures and is the path for 100 B-class MoEs.

method.sh · bash

# default: exact KL sensitivity
$ optiq convert Qwen/Qwen3.5-9B --target-bpw 5.0 --candidate-bits 4,8

# fast structural allocation for a large MoE, mixed 2/4-bit
$ optiq convert mlx-community/Qwen3.5-122B-A10B-bf16 \
    --method static --candidate-bits 2,4 --target-bpw 2.5

The measurement (`optiq` method)

For each pair (layer L, candidate bits b):

Forward-pass calibration data through the model with all weights in their reference precision. Record the output logits.
Replace just L's weight with a simulate-quantized version at b bits (round-trip: quantize → dequantize). All other layers stay at reference precision.
Forward-pass the same calibration data again. Record the perturbed logits.
Compute KL divergence between the reference and perturbed logit distributions, averaged across calibration samples and tokens.
Restore L to its reference precision and move on to the next layer.

The result is a table: for every layer, the KL cost of dropping it from reference to each candidate bit-width. This is mlx-optiq's per-layer sensitivity signal.
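The loop above can be sketched in plain Python. This is a toy illustration under stated assumptions, not mlx-optiq's implementation: the "model" is a stack of small dense matrices, `fake_quantize` uses a simple symmetric per-row scheme rather than MLX's group-wise quantization, and every function name here is made up for the example.

```python
import math
import random

def fake_quantize(weights, bits):
    """Round-trip a matrix through symmetric per-row quantization (quantize -> dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in weights:
        scale = (max(abs(w) for w in row) / qmax) or 1.0  # avoid a zero scale
        out.append([round(w / scale) * scale for w in row])
    return out

def forward(layers, x):
    """Toy model: tanh hidden layers, the last layer produces logits."""
    for i, w in enumerate(layers):
        x = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def kl_divergence(ref_logits, new_logits):
    """KL(reference || perturbed) between the two softmax distributions."""
    lp, lq = log_softmax(ref_logits), log_softmax(new_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def sensitivity_table(layers, samples, candidate_bits):
    ref = [forward(layers, x) for x in samples]  # reference logits, computed once
    table = {}
    for idx, original in enumerate(layers):
        for bits in candidate_bits:
            layers[idx] = fake_quantize(original, bits)  # perturb only this layer
            kls = [kl_divergence(r, forward(layers, x)) for r, x in zip(ref, samples)]
            table[(idx, bits)] = sum(kls) / len(kls)     # average over samples
        layers[idx] = original                           # restore before moving on
    return table

random.seed(0)
dim = 8
layers = [[[random.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)] for _ in range(4)]
samples = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
table = sensitivity_table(layers, samples, candidate_bits=(4, 8))
```

The returned `table` maps `(layer index, bits)` to an average KL cost, which is the shape of input the allocator consumes.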

Two reference modes

What does "reference precision" mean concretely? Two options, picked automatically by --reference auto based on whether the bf16 weights fit in your Mac's RAM:

1. `bf16` reference: gold standard

Load the original bf16 model into RAM. Each sensitivity probe swaps a single layer between bf16 and a quantized copy. This is the highest-fidelity measurement. Required RAM ≈ the model's size in bf16, about 2 GB per billion parameters (2 bytes per weight).

Used automatically when bf16 fits in ~70% of available RAM. On a 36 GB Mac, this means models up to ~10 B parameters.

2. `uniform_4bit` reference: for big models

Build a uniform-4-bit MLX baseline first and load it as the running model (~25% of bf16 size). Stream the bf16 weights off disk one layer at a time, swapping each in for its sensitivity probe. The signal is slightly weaker, because you are measuring KL relative to uniform-4 instead of bf16, but it lets 27 B+ models still get a calibration-driven mixed-precision allocation on a 36 GB Mac.

Used automatically when bf16 doesn't fit. The bf16 weights still need to be on disk for the streaming probes; only RAM is the constraint.
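The routing rule described above is simple arithmetic, sketched here as a hypothetical helper. The function name and parameters are assumptions made for illustration; only the 2-bytes-per-parameter estimate and the ~70% threshold come from the text, and the real tool may measure memory differently.

```python
def pick_reference(n_params_billion, available_ram_gb, ram_fraction=0.70):
    """Pick the reference mode from a rough bf16 size estimate."""
    bf16_gb = 2.0 * n_params_billion  # 2 bytes per parameter
    if bf16_gb <= ram_fraction * available_ram_gb:
        return "bf16"
    return "uniform_4bit"

# On a 36 GB machine the budget is about 25 GB:
small = pick_reference(9, 36)   # 18 GB of bf16 weights fits
large = pick_reference(27, 36)  # 54 GB of bf16 weights does not
```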

Auto-routing in practice

convert.sh · bash

# auto picks bf16 if it fits, else uniform_4bit
$ optiq convert Qwen/Qwen3.5-9B \
    --target-bpw 5.0 --candidate-bits 4,8 \
    --reference auto

# force bf16 (will OOM if model doesn't fit)
$ optiq convert Qwen/Qwen3.5-9B --reference bf16

# force uniform-4-bit reference (works on 27 B+ on 36 GB)
$ optiq convert Qwen/Qwen3.5-27B --reference uniform_4bit

How `static` assigns bits

The static method does no measurement. It scores each layer by architecture alone, from a small set of priors that hold across transformer families:

The embedding and output head rank highest.
The first and last transformer block are protected.
Attention and the MoE router rank above the dense MLP.
The routed experts and mid-network layers rank lowest.

Those priorities feed the same allocator the measured path uses, at the requested candidate bits and target BPW — so static produces a real mixed-precision quant, not a fixed recipe. It reads only layer names and shapes (the bf16 model is memory-mapped, never resident) and runs no forward passes, which is what lets it allocate a 100 B-class base on a 36 GB Mac in seconds. On a sparse-MoE base it puts the high bits on the router and the protected blocks and keeps the large routed experts low — the right call when the expert budget is what makes the quant fit through SSD streaming.
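A minimal sketch of what such a name-based scoring rule could look like. The score values, the layer-name patterns, and the function itself are assumptions for illustration, not mlx-optiq's actual rules, and the sketch leaves out the lower ranking of mid-network layers.

```python
import re

def static_priority(name, block_index=None, n_blocks=None):
    """Higher score means a stronger claim on the high bit-width (illustrative values)."""
    if "embed_tokens" in name or "lm_head" in name:
        return 4  # embedding and output head rank highest
    if block_index is not None and block_index in (0, n_blocks - 1):
        return 3  # first and last transformer block
    if "self_attn" in name or re.search(r"\.mlp\.(gate|router)$", name):
        return 2  # attention and the MoE router
    if "switch_mlp" in name or "switch_glu" in name:
        return 0  # routed experts rank lowest
    return 1      # dense MLP and everything else

names = [
    ("lm_head", None),
    ("model.layers.5.self_attn.q_proj", 5),
    ("model.layers.5.mlp.gate_proj", 5),
    ("model.layers.5.mlp.switch_mlp.up_proj", 5),
]
scores = {n: static_priority(n, i, 24) for n, i in names}
```

Scores like these only rank layers against each other. The bit-widths themselves still come from the allocator, at the requested candidate bits and target BPW.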

Calibration data

mlx-optiq ships a hand-curated 6-domain mix at optiq/calibration/data/optiq.jsonl: 40 samples spanning prose, reasoning (with <think> blocks), code, multi-turn agent loops, function-calling traces, and constraint-bearing instructions. Chat samples are auto-rendered through the target model's tokenizer.apply_chat_template() before tokenization, so the activated subspace matches production. The file ships in the wheel, so convert needs no network access. Pass --calibration-mix /path/to/your.jsonl to override with a domain-specific mix; rebuild the default with python scripts/build_calibration.py. The calibration-mix blog post explains the reasoning behind the mix.
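As a rough sketch of the rendering step, assuming each JSONL record carries either a `messages` list (chat samples) or a `text` string (plain samples). Those field names and the helper itself are hypothetical; the actual schema of optiq.jsonl is not described here. The only real API used is the tokenizer's `apply_chat_template(..., tokenize=False)`.

```python
import json

def render_calibration(jsonl_lines, tokenizer):
    """Turn calibration records into plain strings ready for tokenization."""
    texts = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "messages" in record:  # hypothetical field name for chat samples
            texts.append(
                tokenizer.apply_chat_template(record["messages"], tokenize=False)
            )
        else:                     # hypothetical field name for plain text
            texts.append(record["text"])
    return texts
```

Rendering through the target model's own template means the probes see the same special tokens and role markers the model sees in production.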

The allocator

Once you have the sensitivity table, allocating per-layer bits is a knapsack problem. mlx-optiq uses a greedy heuristic, which is near-optimal in practice for the small number of layers and bit-widths involved:

Start every layer at the lowest candidate bit-width (e.g. 4-bit).
Compute the average bits per weight (BPW) so far.
If under target BPW: find the layer where upgrading by one bit-width tier buys the largest KL reduction per extra bit. Upgrade it.
Repeat until the average BPW reaches the target.

Some layers are protected. They always get the highest bit-width regardless of the knapsack. By default these are lm_head, embed_tokens, the first attention block and the last attention block. They're cheap (small parameter share) and pathological to quantize.
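The greedy steps and the protection rule can be sketched as follows. This is an illustration under assumptions, not mlx-optiq's code: the function name and data layout are invented, BPW counts weight bits only (no scale or bias overhead), and "KL reduction per extra bit" is interpreted as KL saved divided by the extra bits the upgrade costs across the whole layer.

```python
def allocate_bits(kl, n_params, candidate_bits, target_bpw, protected=()):
    """kl[(layer, bits)] is the KL cost; n_params[layer] is the parameter count."""
    tiers = sorted(candidate_bits)
    # Protected layers start (and stay) at the top tier, the rest at the bottom.
    level = {n: (len(tiers) - 1 if n in protected else 0) for n in n_params}
    total = sum(n_params.values())

    def bpw():
        return sum(tiers[level[n]] * n_params[n] for n in n_params) / total

    while bpw() < target_bpw:
        best, best_gain = None, 0.0
        for name, lv in level.items():
            if lv + 1 >= len(tiers):
                continue  # already at the highest tier
            extra_bits = (tiers[lv + 1] - tiers[lv]) * n_params[name]
            gain = (kl[(name, tiers[lv])] - kl[(name, tiers[lv + 1])]) / extra_bits
            if best is None or gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break  # nothing left to upgrade
        level[best] += 1
    return {name: tiers[lv] for name, lv in level.items()}

n_params = {"embed_tokens": 10, "layers.0.self_attn": 10,
            "layers.1.mlp": 20, "layers.2.mlp": 60}
kl = {("layers.0.self_attn", 4): 0.30, ("layers.0.self_attn", 8): 0.01,
      ("layers.1.mlp", 4): 0.20, ("layers.1.mlp", 8): 0.02,
      ("layers.2.mlp", 4): 0.05, ("layers.2.mlp", 8): 0.01}
bit_map = allocate_bits(kl, n_params, (4, 8), target_bpw=5.0,
                        protected={"embed_tokens"})
```

Because the loop stops only once the average reaches the target, the achieved BPW can overshoot it: here the last upgrade takes the toy model from 4.8 to 5.6 BPW.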

The output

mlx-optiq hands the per-layer bit map to mlx_lm.convert as a quant_predicate. The output is a standard MLX checkpoint indistinguishable from a uniform-quantized one to mlx_lm.load, except that some layers are at 8-bit and others at 4-bit.
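A hedged sketch of how a bit map could be wrapped as a predicate. It assumes the predicate is called with the layer path and the module (plus possibly a config argument, depending on the mlx_lm version) and may return a dict of quantization parameters. The factory name, the default group size, and the fallback bit-width are assumptions, not mlx-optiq's actual code.

```python
def make_quant_predicate(bit_map, group_size=64, default_bits=4):
    """Wrap a {layer path: bits} map as a per-layer quantization predicate."""
    def predicate(path, module, *_):
        # Skip modules that cannot be quantized (norms, activations, ...).
        if not hasattr(module, "to_quantized"):
            return False
        return {"bits": bit_map.get(path, default_bits), "group_size": group_size}
    return predicate
```

The resulting callable would then be passed as the `quant_predicate` argument of `mlx_lm.convert`, so each layer is quantized at its allocated width.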

Why KL and not perplexity? Perplexity is a single scalar computed from the probability of the reference tokens, which is too coarse to discriminate between layers that fail in different ways. KL divergence on the full output distribution captures shifts in which tokens get mass, not just how confident the model is on the chosen one. The KL signal correlates strongly with downstream task accuracy (GSM8K, MMLU) and is much cheaper to compute than running a full eval.
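A small numeric illustration of the difference, using made-up three-token distributions. Both perturbed distributions keep the same probability on the target token, so their negative log-likelihood (and hence perplexity) is identical, yet their KL to the reference differs widely.

```python
import math

def nll(dist, target):
    """Negative log-likelihood of the target token; perplexity is exp of its mean."""
    return -math.log(dist[target])

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.70, 0.20, 0.10]
mild      = [0.70, 0.18, 0.12]  # small reshuffle among the non-target tokens
severe    = [0.70, 0.05, 0.25]  # large reshuffle, same mass on the target
target = 0

same_nll = nll(mild, target) == nll(severe, target)
kl_mild, kl_severe = kl(reference, mild), kl(reference, severe)
```

Perplexity cannot tell `mild` from `severe`, while KL separates them by more than an order of magnitude.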

What about MoE models?

Sparse mixture-of-experts models (Qwen3.5-35B-A3B, Qwen3.6-35B-A3B, gemma-4-26B-A4B) go through the same sensitivity knapsack, with three structural rules layered on top:

The fused expert tensor (shape (num_experts, hidden, intermediate)) is treated as one knapsack entry per (block, projection) — so the routed experts in a block share a bit-width, but each block × projection picks its own. mlx-optiq walks the MoE structure with arch-aware naming: Gemma's switch_glu keeps the experts segment, Qwen's switch_mlp drops it.
Router projections (Qwen3's mlp.gate, Gemma's mlp.router) are explicitly protected at the high bit-width. The router's output controls which experts a token visits, so quantization noise here propagates to which paths get taken, not just the magnitude of activations on a fixed path.
Shared experts (any layer matching *.shared_experts.*) are also protected at the high bit-width. These fire for every token like a dense MLP, so they're sensitive in the per-token sense even though they sit inside an MoE block.

Empirically, even before these explicit rules landed, the KL knapsack put routers and shared experts at 8-bit on every shipped MoE quant — the per-layer KL signal is large at exactly those layers. The protection rule encodes the same outcome by design so it does not depend on the calibration mix happening to land that way.
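The router and shared-expert protection rules amount to a name-pattern override applied on top of the knapsack result. A minimal sketch, assuming layer names end in `mlp.gate` or `mlp.router` for routers and contain a `shared_experts` segment; the full layer names and the helper function are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the rules above; the surrounding name layout is assumed.
PROTECTED_PATTERNS = ("*.mlp.gate", "*.mlp.router", "*.shared_experts.*")

def apply_moe_protection(bit_map, high_bits):
    """Return a copy of bit_map with routers and shared experts forced to high_bits."""
    return {
        name: high_bits
        if any(fnmatchcase(name, p) for p in PROTECTED_PATTERNS)
        else bits
        for name, bits in bit_map.items()
    }

knapsack_result = {
    "model.layers.3.mlp.gate": 4,                      # router
    "model.layers.3.mlp.shared_experts.up_proj": 4,    # shared expert
    "model.layers.3.mlp.switch_mlp.up_proj": 4,        # routed experts
    "model.layers.3.mlp.gate_proj": 4,                 # dense projection, not a router
}
protected_map = apply_moe_protection(knapsack_result, high_bits=8)
```

Applying the override after allocation keeps the outcome independent of what the calibration mix happens to measure at those layers.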

Next up: see the algorithm in action in our research write-up, or get started with a model family.