The eval framework that drives every quant we ship.
For most of mlx-optiq's life, the regression check on a fresh quant was: does it answer the same GSM8K-50 questions the bf16 model answered? A quick 50-sample math check. Five minutes. If it passed, we shipped.
That worked when the things we cared about were *does it still talk* and *can it still do arithmetic*. It stopped working the day someone tried to use one of our quants behind an MCP server and got `JSONDecodeError` on every third tool call. The model could still do math. It had quietly forgotten how to emit valid function-call syntax. GSM8K-50 didn't ask the question.
You catch the regression with the eval that tests the workload. Not the one that's cheap to run.
This release replaces the GSM8K-50 smoketest with a two-stage suite: a fast smoketest for triage, then a full benchmark run for the headline number. Sandboxed HumanEval execution, auto-resolved KL reference, single roll-up Capability Score, all from one CLI command.
## Two stages: a smoketest and the full benchmarks
Quants pass through two checkpoints. The first is fast and triages; the second is slow and decides what we ship.
| Stage | Time / model | What it answers | Tasks |
|---|---|---|---|
| Smoketest | ~5 min | Did the convert work? Are we close to the reference distribution? | KL on 64 prompts × 256 tokens · GSM8K-50 (chat-templated, thinking off) |
| Benchmarks | ~90 min | How much capability did we keep across the workloads users actually run? | MMLU-1k 5-shot · GSM8K-1k · IFEval (full) · BFCL-V3 simple (200) · HumanEval (164) |
The smoketest is the gate. A quant that fails it doesn't get the full benchmarks. The benchmark numbers are what end up on the model card.
## The smoketest: KL + GSM8K-50
KL divergence between two language models, computed token-by-token over a small batch of held-out prompts, is a cheap signal that works well in practice. The reference is the highest-fidelity version of the model that fits on the box. The candidate is the OptIQ quant. We compute KL(reference ‖ candidate) per token, average across 64 prompts × 256 tokens, and report mean + p95.
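The per-token statistic is simple enough to sketch. The helper below is illustrative, not the mlx-optiq implementation; it assumes both models' next-token log-probabilities have already been materialized as arrays of shape `(prompts, tokens, vocab)`, e.g. `(64, 256, V)` for the smoketest batch:

```python
import numpy as np

def kl_stats(ref_logprobs: np.ndarray, cand_logprobs: np.ndarray):
    """Per-token KL(reference ‖ candidate), summarized as mean and p95.

    Inputs are log-probabilities of shape (prompts, tokens, vocab).
    """
    p = np.exp(ref_logprobs)
    # KL(P ‖ Q) = sum_v p_v * (log p_v - log q_v), computed per token position.
    kl_per_token = (p * (ref_logprobs - cand_logprobs)).sum(axis=-1)
    return kl_per_token.mean(), np.percentile(kl_per_token, 95)
```

An identical candidate gives KL of exactly zero at every position; the p95 is what surfaces a quant that matches the reference on easy tokens but diverges on a long tail of hard ones.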
The auto-resolver picks the reference automatically:
```python
# Pick the highest-fidelity reference that fits on the box.
bf16_gb = hf_repo_size_gb(strip_quant_suffix(model_id))
avail_gb = psutil.virtual_memory().available / 1024**3
if bf16_gb <= 0.70 * avail_gb:
    return "bf16", strip_quant_suffix(model_id)
# bf16 doesn't fit; fall back to the uniform-4-bit MLX baseline.
return "uniform_4bit", uniform_4bit_repo(model_id)
```
The smoketest sweep across the 12 quants we shipped pre-v0.1.0 produced this table. It's also the source of the re-quant list at the bottom of the post:
| Model | KL mean | KL p95 | GSM8K-50 | Reference |
|---|---|---|---|---|
| Qwen3.5-27B-OptiQ-4bit | 0.05 | 0.15 | 96 % | uniform-4bit |
| Qwen3.6-27B-OptiQ-4bit | 0.06 | 0.29 | 100 % | uniform-4bit |
| Qwen3.5-9B-OptiQ-4bit | 0.18 | 0.80 | 82 % | bf16 |
| Qwen3.5-2B-OptiQ-4bit | 0.19 | 0.84 | 56 % | bf16 |
| gemma-4-e4b-it-OptiQ-4bit | 0.28 | 1.35 | 92 % | bf16 |
| gemma-4-e2b-it-OptiQ-4bit | 0.57 | 3.04 | 56 % | bf16 |
| gemma-4-26B-A4B-it-OptiQ-4bit 🔻 | 0.93 | 4.31 | 96 % | uniform-4bit |
| gemma-4-31B-it-OptiQ-4bit 🔻 | 0.99 | 4.76 | 98 % | uniform-4bit |
Two flags fell out: the Gemma-4 26B-A4B sparse-MoE and the 31B dense, both with KL 20× higher than the Qwen3.5-27B at comparable size, even though their GSM8K-50 numbers (96 % and 98 %) looked fine. KL caught a calibration regression that single-task accuracy missed.
The full benchmarks then surfaced two more regressors with low KL but degraded BFCL and IFEval: Qwen3.5-27B and Qwen3.6-27B, both under-protected on tool-call layers by WikiText-only calibration.
## The benchmarks: five metrics
The benchmark suite is what ends up on the model card. Each task targets a capability slice:
- MMLU: 5-shot, stratified across the 57 subjects, 1000 samples. Encyclopedic knowledge after instruction-tuning. The bf16 anchor.
- GSM8K: 1000 samples, 3-shot CoT, chat-templated, `enable_thinking=False` for reasoning models. Multi-step arithmetic.
- IFEval: full Google IFEval set with all 25+ constraint verifiers. Measures whether the model can follow detailed format / length / capitalization / inclusion-exclusion instructions. We report strict (the standard, harder metric).
- BFCL-V3 simple: 200 single-turn function-calls with AST equivalence scoring. Whether the model can emit a syntactically valid call and pick the right tool from a small candidate set.
- HumanEval: all 164 problems, sandboxed Python execution, pass@1 only.
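The AST-equivalence idea behind the BFCL scoring can be illustrated with Python's `ast` module. This is a simplified sketch of the concept, not the BFCL harness: two calls match if they resolve to the same callee and the same argument set, regardless of formatting or keyword order.

```python
import ast

def calls_equivalent(pred: str, gold: str) -> bool:
    """Compare two single function-call strings structurally, not textually."""
    try:
        a = ast.parse(pred, mode="eval").body
        b = ast.parse(gold, mode="eval").body
    except SyntaxError:
        return False  # syntactically invalid output scores zero
    if not (isinstance(a, ast.Call) and isinstance(b, ast.Call)):
        return False

    def shape(call: ast.Call):
        # Callee, positional args in order, keyword args order-insensitive.
        return (
            ast.dump(call.func),
            tuple(ast.dump(arg) for arg in call.args),
            tuple(sorted((kw.arg, ast.dump(kw.value)) for kw in call.keywords)),
        )

    return shape(a) == shape(b)
```

So `get_weather(city="SF", unit="C")` and `get_weather(unit="C", city="SF")` score as a match, while a truncated or mis-named call does not.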
Run from the CLI as a single task:
```shell
optiq eval ./optiq_mixed --task all --score
```

Each individual task is also addressable (`--task mmlu`, `--task ifeval`, etc.) for when you only need one number.
## Sandboxing HumanEval
HumanEval requires actually executing the model's generated Python against a unit-test harness. Doing that on the user's machine with no isolation is a footgun. A model that emits `os.system("rm -rf …")` ruins someone's afternoon. The sandbox helper falls through three tiers:
- apple/container: when present, runs each candidate inside a fresh container with no network, no filesystem mount outside `/tmp`, and a wall-clock timeout. Hardest isolation, slowest start.
- sandbox-exec: macOS native, when `/usr/bin/sandbox-exec` is available. Subprocess with a tight seatbelt profile (no network, deny file-write outside `/tmp`). Fast.
- subprocess + rlimit: universal fallback. Spawn a Python child with `RLIMIT_AS`, `RLIMIT_CPU`, `RLIMIT_FSIZE` caps and a process-group timeout. No filesystem isolation; exists so the eval doesn't simply fail to run on Linux CI.
The helper picks the strictest tier available at runtime. Reported pass@1 is identical across tiers because the test harness is deterministic. Only the blast radius of malicious code changes.
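The rlimit tier can be sketched roughly like this. It's a simplified stand-in for the real helper, Unix-only, and the limit values are illustrative:

```python
import resource
import subprocess
import sys

def run_limited(code: str, timeout_s: int = 10) -> subprocess.CompletedProcess:
    """Run untrusted Python in a child process with CPU/memory/file-size caps.

    Unix-only (uses the `resource` module). No filesystem isolation; this tier
    only bounds resource consumption, not what the code can read or write.
    """
    def set_limits():
        # Applied in the child between fork and exec.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        resource.setrlimit(resource.RLIMIT_AS, (2 << 30, 2 << 30))     # 2 GiB address space
        resource.setrlimit(resource.RLIMIT_FSIZE, (1 << 20, 1 << 20))  # 1 MiB max file size

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_limits,
        start_new_session=True,  # own process group, so a timeout kill is clean
        capture_output=True,
        text=True,
        timeout=timeout_s,       # wall-clock cap, on top of the CPU-time rlimit
    )
```

The `timeout=` argument covers wall-clock stalls (e.g. a sleep loop) that `RLIMIT_CPU` alone would miss.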
## The Capability Score
Five percentages are hard to compare side-by-side. We want one number that answers "which quant is more capable on average?" And we want a formula the reader can audit, not a hidden value judgement dressed up as math.
The simplest one that meets that bar:
```
Capability_Score = mean(MMLU, GSM8K, IFEval, BFCL, HumanEval)
```
We tried a weighted formula first. Something like MMLU + 0.3 × IFEval + 0.5 × BFCL − 5 × disk_GB. It looked clever. It also embedded our quality/disk tradeoff in a way users can't see, and it could turn a +1 pp capability win into a "loss" if the disk grew by half a gigabyte. That's a recommendation, not a measurement.
So we stripped it down. The five benchmarks each get an equal vote. disk_gb is reported next to the score as an unweighted second axis, and the reader picks their own tradeoff. If you're optimizing for an 8 GB Mac, smaller wins. If you're on a 64 GB Studio, larger probably wins. The score doesn't pretend to know.
Two consequences worth flagging. (1) GSM8K is now back in the average. Earlier we worried about double-counting MMLU's reasoning content, but in practice GSM8K and MMLU disagree often enough on quants that letting GSM8K vote catches regressions MMLU misses. (2) HumanEval is in too, which means a quant that breaks code generation can't hide behind strong instruction-following.
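Concretely, the roll-up is nothing more than an unweighted mean. A sketch (the dict keys here are illustrative, not the tool's actual output schema):

```python
from statistics import mean

def capability_score(results: dict[str, float]) -> float:
    """Equal-vote average of the five benchmark percentages.

    `results` maps task name -> accuracy in percent; every task votes equally,
    and disk size is deliberately NOT part of the score.
    """
    tasks = ("mmlu", "gsm8k", "ifeval", "bfcl", "humaneval")
    return mean(results[t] for t in tasks)
```

For example, a quant scoring 70 / 80 / 75 / 85 / 60 rolls up to 74.0, and a regression on any one axis moves the score by exactly one fifth of its size.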
## Picking the KL reference
One technical note that took us a few iterations to get right.
For models that fit in RAM (everything ≤ ~10 B at bf16 on a 36 GB Mac), the KL reference is unambiguous: it's the bf16 model itself. For 27 B+, bf16 doesn't fit, and you need a substitute reference that's still strictly higher-fidelity than the candidate. The community's uniform-4-bit MLX publish of the same model is exactly that: same architecture and weights modulo quantization noise, just at uniform 4-bit (no per-layer mixed precision).
The auto-resolver picks bf16 if available, falls back to uniform-4-bit otherwise. The fall-back was originally driven by a crude params × 2 bytes size estimate, which under-counted gemma-4-26B-A4B's MoE expert tensors and tried to load 110 GB of bf16 into 36 GB of RAM. Now we hit HfApi.model_info() and sum the actual safetensors shard sizes. The resolver is exact and the OOM is gone.
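The shard-summing step can be sketched like this. It's a simplified stand-in that takes a pre-fetched file listing instead of hitting the network; in the real resolver the (filename, size) pairs would come from the repo metadata returned by `HfApi.model_info(..., files_metadata=True)`:

```python
def repo_safetensors_gb(files: list[tuple[str, int]]) -> float:
    """Sum the sizes of a repo's *.safetensors shards, in GiB.

    `files` is a list of (filename, size_in_bytes) pairs. Summing actual shard
    sizes avoids the params-times-dtype-width estimate that under-counts
    MoE expert tensors.
    """
    total = sum(size for name, size in files if name.endswith(".safetensors"))
    return total / 1024**3
```

Non-weight files (`config.json`, tokenizer artifacts) are excluded, so the number compared against available RAM reflects only what actually gets loaded.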
## Reproducing
Everything in this post runs from the CLI. No special setup beyond `pip install mlx-optiq`:

```shell
# Fast smoketest (KL + GSM8K-50)
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task smoketest

# Full benchmarks (MMLU + GSM8K + IFEval + BFCL + HumanEval + Score)
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task all --score

# Single tasks if you only need one number
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task bfcl
optiq eval mlx-community/Qwen3.5-9B-OptiQ-4bit --task humaneval

# Custom reference for KL (skip auto-resolver)
optiq eval ./my-quant --task kl --reference-model Qwen/Qwen3.5-9B --reference-mode bf16
```
Every task above is callable on its own. Pick the one you need with `optiq eval --task <name>`.
— the mlx-optiq team