The eval framework that drives every quant we ship.
For most of mlx-optiq's life, the regression check on a fresh quant was: does it answer the same GSM8K-50 questions the bf16 model answered? A quick 50-sample math check. Five minutes. If it passed, we shipped.
That worked when the things we cared about were *does it still talk* and *can it still do arithmetic*. It stopped working the day someone tried to use one of our quants behind an MCP server and got `JSONDecodeError` on every third tool call. The model could still do math. It had quietly forgotten how to emit valid function-call syntax. GSM8K-50 didn't ask the question.
You catch the regression with the eval that tests the workload. Not the one that's cheap to run.
This release replaces the GSM8K-50 smoketest with a two-stage suite: a fast smoketest for triage, then a full benchmark run for the headline number. Sandboxed HumanEval execution, auto-resolved KL reference, single roll-up Capability Score, all from one CLI command.
## Two stages: a smoketest and the full benchmarks
Quants pass through two checkpoints. The first is fast and triages; the second is slow and decides what we ship.
| Stage | Time / model | What it answers | Tasks |
|---|---|---|---|
| Smoketest | ~5 min | Did the convert work? Are we close to the reference distribution? | KL on 64 prompts × 256 tokens · GSM8K-50 (chat-templated, thinking off) |
| Benchmarks | ~90 min | How much capability did we keep across the workloads users actually run? | MMLU-1k 5-shot · GSM8K-1k · IFEval (full) · BFCL-V3 simple (200) · HumanEval (164) |
The smoketest is the gate. A quant that fails it doesn't get the full benchmarks. The benchmark numbers are what end up on the model card.
## The smoketest: KL + GSM8K-50
KL divergence between two language models, computed token-by-token over a small batch of held-out prompts, is a cheap signal that works well in practice. The reference is the highest-fidelity version of the model that fits on the box. The candidate is the OptIQ quant. We compute KL(reference ‖ candidate) per token, average across 64 prompts × 256 tokens, and report mean + p95.
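The per-token statistic is simple enough to sketch. The helper below is illustrative, not the mlx-optiq implementation; it assumes both models' next-token log-probabilities have already been materialized as arrays of shape `(prompts, tokens, vocab)`, e.g. `(64, 256, V)` for the smoketest batch:

```python
import numpy as np

def kl_stats(ref_logprobs: np.ndarray, cand_logprobs: np.ndarray):
    """Per-token KL(reference ‖ candidate), summarized as mean and p95.

    Inputs are log-probabilities of shape (prompts, tokens, vocab).
    """
    p = np.exp(ref_logprobs)
    # KL(P ‖ Q) = sum_v p_v * (log p_v - log q_v), computed per token position.
    kl_per_token = (p * (ref_logprobs - cand_logprobs)).sum(axis=-1)
    return kl_per_token.mean(), np.percentile(kl_per_token, 95)
```

An identical candidate gives KL of exactly zero at every position; the p95 is what surfaces a quant that matches the reference on easy tokens but diverges on a long tail of hard ones.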
The auto-resolver picks the reference automatically:
```python
# Pick the highest-fidelity reference that fits on the box.
bf16_gb = hf_repo_size_gb(strip_quant_suffix(model_id))
avail_gb = psutil.virtual_memory().available / 1024**3
if bf16_gb <= 0.70 * avail_gb:
    return "bf16", strip_quant_suffix(model_id)
# bf16 doesn't fit; fall back to the uniform-4-bit MLX baseline.
return "uniform_4bit", uniform_4bit_repo(model_id)
```
The smoketest sweep across the 12 quants we shipped pre-v0.1.0 produced this table. It's also the source of the re-quant list at the bottom of the post:
| Model | KL mean | KL p95 | GSM8K-50 | Reference |
|---|---|---|---|---|
| Qwen3.5-27B-OptiQ-4bit | 0.05 | 0.15 | 96 % | uniform-4bit |
| Qwen3.6-27B-OptiQ-4bit | 0.06 | 0.29 | 100 % | uniform-4bit |
| Qwen3.5-9B-OptiQ-4bit | 0.18 | 0.80 | 82 % | bf16 |
| Qwen3.5-2B-OptiQ-4bit | 0.19 | 0.84 | 56 % | bf16 |
| gemma-4-e4b-it-OptiQ-4bit | 0.28 | 1.35 | 92 % | bf16 |
| gemma-4-e2b-it-OptiQ-4bit | 0.57 | 3.04 | 56 % | bf16 |
| gemma-4-26B-A4B-it-OptiQ-4bit 🔻 | 0.93 | 4.31 | 96 % | uniform-4bit |
| gemma-4-31B-it-OptiQ-4bit 🔻 | 0.99 | 4.76 | 98 % | uniform-4bit |
Two flags fell out: the Gemma-4 26B-A4B sparse-MoE and the 31B dense, both with KL 20× higher than the Qwen3.5-27B at comparable size, even though their GSM8K-50 numbers (96 % and 98 %) looked fine. KL caught a calibration regression that single-task accuracy missed.
The full benchmarks then surfaced two more regressors with low KL but degraded BFCL and IFEval: Qwen3.5-27B and Qwen3.6-27B, both under-protected on tool-call layers by WikiText-only calibration.
## The benchmarks: five metrics
The benchmark suite is what ends up on the model card. Each task targets a capability slice:
- MMLU: 5-shot, stratified across the 57 subjects, 1000 samples. Encyclopedic knowledge after instruction-tuning. The bf16 anchor.
- GSM8K: 1000 samples, 3-shot CoT, chat-templated, `enable_thinking=False` for reasoning models. Multi-step arithmetic.
- IFEval: full Google IFEval set with all 25+ constraint verifiers. Measures whether the model can follow detailed format / length / capitalization / inclusion-exclusion instructions. We report strict (the standard, harder metric).
- BFCL-V3 simple: 200 single-turn function-calls with AST equivalence scoring. Whether the model can emit a syntactically valid call and pick the right tool from a small candidate set.
- HumanEval: all 164 problems, sandboxed Python execution, pass@1 only.
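The AST-equivalence idea behind the BFCL scoring can be illustrated with Python's `ast` module. This is a simplified sketch of the concept, not the BFCL harness: two calls match if they resolve to the same callee and the same argument set, regardless of formatting or keyword order.

```python
import ast

def calls_equivalent(pred: str, gold: str) -> bool:
    """Compare two single function-call strings structurally, not textually."""
    try:
        a = ast.parse(pred, mode="eval").body
        b = ast.parse(gold, mode="eval").body
    except SyntaxError:
        return False  # syntactically invalid output scores zero
    if not (isinstance(a, ast.Call) and isinstance(b, ast.Call)):
        return False

    def shape(call: ast.Call):
        # Callee, positional args in order, keyword args order-insensitive.
        return (
            ast.dump(call.func),
            tuple(ast.dump(arg) for arg in call.args),
            tuple(sorted((kw.arg, ast.dump(kw.value)) for kw in call.keywords)),
        )

    return shape(a) == shape(b)
```

So `get_weather(city="SF", unit="C")` and `get_weather(unit="C", city="SF")` score as a match, while a truncated or mis-named call does not.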
Run from the CLI as a single task:
```shell
optiq eval ./optiq_mixed --task all --score
```

Each individual task is also addressable (`--task mmlu`, `--task ifeval`, etc.) for when you only need one number.
## Sandboxing HumanEval
HumanEval requires actually executing the model's generated Python against a unit-test harness. Doing that on the user's machine with no isolation is a footgun. A model that emits `os.system("rm -rf …")` ruins someone's afternoon. The sandbox helper falls through three tiers:
- apple/container: when present, runs each candidate inside a fresh container with no network, no filesystem mount outside `/tmp`, and a wall-clock timeout. Hardest isolation, slowest start.
- sandbox-exec: macOS native, when `/usr/bin/sandbox-exec` is available. Subprocess with a tight seatbelt profile (no network, deny file-write outside `/tmp`). Fast.
- subprocess + rlimit: universal fallback. Spawn a Python child with `RLIMIT_AS`, `RLIMIT_CPU`, `RLIMIT_FSIZE` caps and a process-group timeout. No filesystem isolation; exists so the eval doesn't simply fail to run on Linux CI.
The helper picks the strictest tier available at runtime. Reported pass@1 is identical across tiers because the test harness is deterministic. Only the blast radius of malicious code changes.
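The rlimit tier can be sketched roughly like this. It's a simplified stand-in for the real helper, Unix-only, and the limit values are illustrative:

```python
import resource
import subprocess
import sys

def run_limited(code: str, timeout_s: int = 10) -> subprocess.CompletedProcess:
    """Run untrusted Python in a child process with CPU/memory/file-size caps.

    Unix-only (uses the `resource` module). No filesystem isolation; this tier
    only bounds resource consumption, not what the code can read or write.
    """
    def set_limits():
        # Applied in the child between fork and exec.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        resource.setrlimit(resource.RLIMIT_AS, (2 << 30, 2 << 30))     # 2 GiB address space
        resource.setrlimit(resource.RLIMIT_FSIZE, (1 << 20, 1 << 20))  # 1 MiB max file size

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_limits,
        start_new_session=True,  # own process group, so a timeout kill is clean
        capture_output=True,
        text=True,
        timeout=timeout_s,       # wall-clock cap, on top of the CPU-time rlimit
    )
```

The `timeout=` argument covers wall-clock stalls (e.g. a sleep loop) that `RLIMIT_CPU` alone would miss.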
## The Capability Score
Five percentages are hard to compare side-by-side. We want one number that answers "which quant is more capable on average?" And we want a formula the reader can audit, not a hidden value judgement dressed up as math.
The simplest one that meets that bar:
```
Capability_Score = mean(MMLU, GSM8K, IFEval, BFCL, HumanEval)
```
We tried a weighted formula first. Something like MMLU + 0.3 × IFEval + 0.5 × BFCL − 5 × disk_GB. It looked clever. It also embedded our quality/disk tradeoff in a way users can't see, and it could turn a +1 pp capability win into a "loss" if the disk grew by half a gigabyte. That's a recommendation, not a measurement.
So we stripped it down. The five benchmarks each get an equal vote. disk_gb is reported next to the score as an unweighted second axis, and the reader picks their own tradeoff. If you're optimizing for an 8 GB Mac, smaller wins. If you're on a 64 GB Studio, larger probably wins. The score doesn't pretend to know.
Two consequences worth flagging. (1) GSM8K is now back in the average. Earlier we worried about double-counting MMLU's reasoning content, but in practice GSM8K and MMLU disagree often enough on quants that letting GSM8K vote catches regressions MMLU misses. (2) HumanEval is in too, which means a quant that breaks code generation can't hide behind strong instruction-following.
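Concretely, the roll-up is nothing more than an unweighted mean. A sketch (the dict keys here are illustrative, not the tool's actual output schema):

```python
from statistics import mean

def capability_score(results: dict[str, float]) -> float:
    """Equal-vote average of the five benchmark percentages.

    `results` maps task name -> accuracy in percent; every task votes equally,
    and disk size is deliberately NOT part of the score.
    """
    tasks = ("mmlu", "gsm8k", "ifeval", "bfcl", "humaneval")
    return mean(results[t] for t in tasks)
```

For example, a quant scoring 70 / 80 / 75 / 85 / 60 rolls up to 74.0, and a regression on any one axis moves the score by exactly one fifth of its size.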
## Picking the KL reference
One technical note that took us a few iterations to get right.
For models that fit in RAM (everything ≤ ~10 B at bf16 on a 36 GB Mac), the KL reference is unambiguous: it's the bf16 model itself. For 27 B+, bf16 doesn't fit, and you need a substitute reference that's still strictly higher-fidelity than the candidate. The community's uniform-4-bit MLX publish of the same model is exactly that: same architecture and weights modulo quantization noise, just at uniform 4-bit (no per-layer mixed precision).
The auto-resolver picks bf16 if available, falls back to uniform-4-bit otherwise. The fall-back was originally driven by a crude params × 2 bytes size estimate, which under-counted gemma-4-26B-A4B's MoE expert tensors and tried to load 110 GB of bf16 into 36 GB of RAM. Now we hit HfApi.model_info() and sum the actual safetensors shard sizes. The resolver is exact and the OOM is gone.
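The shard-summing step can be sketched like this. It's a simplified stand-in that takes a pre-fetched file listing instead of hitting the network; in the real resolver the (filename, size) pairs would come from the repo metadata returned by `HfApi.model_info(..., files_metadata=True)`:

```python
def repo_safetensors_gb(files: list[tuple[str, int]]) -> float:
    """Sum the sizes of a repo's *.safetensors shards, in GiB.

    `files` is a list of (filename, size_in_bytes) pairs. Summing actual shard
    sizes avoids the params-times-dtype-width estimate that under-counts
    MoE expert tensors.
    """
    total = sum(size for name, size in files if name.endswith(".safetensors"))
    return total / 1024**3
```

Non-weight files (`config.json`, tokenizer artifacts) are excluded, so the number compared against available RAM reflects only what actually gets loaded.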
## Reproducing
Everything in this post runs from the CLI. No special setup beyond `pip install mlx-optiq`:

```shell
# Fast smoketest (KL + GSM8K-50)
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task smoketest

# Full benchmarks (MMLU + GSM8K + IFEval + BFCL + HumanEval + Score)
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task all --score

# Single tasks if you only need one number
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task bfcl
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task humaneval

# Custom reference for KL (skip auto-resolver)
optiq eval ./my-quant --task kl --reference-model Qwen/Qwen3.5-9B --reference-mode bf16
```
Every task above is callable on its own. Pick the one you need with `optiq eval --task <name>`.
— the mlx-optiq team