mlx-optiq
Engineering · Research · Releases

Blog

Release notes, methodology write-ups, and benchmarking deep dives. New posts land alongside major releases and research findings.

2026·04·28
The mlx-optiq eval framework: five benchmarks, one Capability Score
GSM8K-50 alone misses tool-calling regressions. The two-stage eval (KL + GSM8K-50 for triage; MMLU + GSM8K + IFEval + BFCL + HumanEval for the headline numbers) drives every quant we ship. Plus the auto-resolved KL reference, sandboxed code execution, and a single Capability Score: the unweighted mean of the five headline benchmarks (sketched below).
methodology
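A minimal sketch of that aggregation, assuming per-benchmark accuracies on a 0–1 scale. The benchmark keys, the `capability_score` name, and the dict layout are illustrative assumptions, not the framework's actual API.

```python
# Capability Score as described above: the unweighted mean of the five
# headline benchmarks. Keys and layout are illustrative assumptions.
HEADLINE_BENCHMARKS = ("mmlu", "gsm8k", "ifeval", "bfcl", "humaneval")

def capability_score(results: dict[str, float]) -> float:
    """Unweighted mean of the five headline benchmark accuracies (0-1)."""
    return sum(results[b] for b in HEADLINE_BENCHMARKS) / len(HEADLINE_BENCHMARKS)

# Example: a quant that holds up on reasoning but regresses on
# tool-calling (BFCL) still pays for it in the single headline number.
print(capability_score({
    "mmlu": 0.71, "gsm8k": 0.84, "ifeval": 0.78,
    "bfcl": 0.52, "humaneval": 0.66,
}))  # 0.702
```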
2026·04·28
optiq.jsonl: a six-domain calibration mix for mixed-precision quantization
WikiText-2 measures prose; modern LLMs do prose, reasoning, code, agent loops, tool-calling, and constrained instruction-following. We replaced the calibration set with 40 hand-curated samples spanning all six domains, bundled inside the package and fully reproducible (coverage check sketched below). What you calibrate on is what you protect.
engineering
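A hedged coverage check for the bundled file. The `domain` field and the one-JSON-object-per-line schema are assumptions for illustration, not the documented format of optiq.jsonl.

```python
# Sanity-check that a calibration file covers all six domains with the
# expected 40 samples. Schema (a "domain" key per record) is assumed.
import json
from collections import Counter

def domain_counts(path: str) -> Counter:
    with open(path) as f:
        return Counter(json.loads(line)["domain"] for line in f if line.strip())

counts = domain_counts("optiq.jsonl")
assert sum(counts.values()) == 40   # 40 hand-curated samples
assert len(counts) == 6             # prose, reasoning, code, agent loops,
                                    # tool-calling, constrained instructions
print(counts.most_common())
```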
2026·04·25
Gemma-4 lands on mlx-optiq: four sizes, +32 pp on the small one
Adding Google's full Gemma-4 instruct lineup: e2b, e4b, 26B-A4B sparse MoE, and 31B dense. The +32-point GSM8K recovery on gemma-4-e4b is the cleanest mixed-precision win we have. Plus the shared-KV caveat that means you'll want Qwen for quantized-KV serving.
engineering
2026·04·17
TurboQuant: postmortem on a research path we didn't ship
We built rotated-space KV attention with a custom Metal kernel. The benchmarks looked good: 100% needle retrieval at 4-bit vs. 73% for affine. We still chose affine for the shipping path. Plain writeup of the technique, the numbers, and why the marginal win didn't justify a parallel serving stack.
postmortem
2026·04·08
Sensitivity-aware LoRA: fine-tuning that respects the bit budget
The same per-layer sensitivity signal that drives mixed-precision quantization also drives adapter rank: 8-bit-quantized layers get 2× the adapter rank of 4-bit-quantized ones at the same parameter budget (allocation rule sketched below). Validation loss drops 12% in head-to-head A/Bs. Plus the empirical training-ceiling map for a 36 GB Mac across all 10 supported models.
engineering
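A toy version of the allocation rule under one shared parameter budget. The layer names, dimensions, and budget are made up, and the real system derives the bit map from per-layer sensitivity rather than hard-coding it.

```python
# Rank allocation: 8-bit layers get 2x the LoRA rank of 4-bit layers,
# all fitted under a single adapter parameter budget.
def assign_ranks(layers: dict[str, tuple[int, int, int]], budget: int) -> dict[str, int]:
    """layers maps name -> (d_in, d_out, bits); returns name -> LoRA rank."""
    # LoRA adds r * (d_in + d_out) params per layer; 8-bit layers cost 2x.
    unit = sum((2 if bits == 8 else 1) * (d_in + d_out)
               for d_in, d_out, bits in layers.values())
    base = max(1, budget // unit)                   # rank for 4-bit layers
    return {name: (2 if bits == 8 else 1) * base
            for name, (d_in, d_out, bits) in layers.items()}

print(assign_ranks(
    {"q_proj": (4096, 4096, 8), "v_proj": (4096, 4096, 8),
     "up_proj": (4096, 11008, 4), "down_proj": (11008, 4096, 4)},
    budget=2_000_000,
))  # {'q_proj': 62, 'v_proj': 62, 'up_proj': 31, 'down_proj': 31}
```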
2026·03·20
Not All Layers Are Equal: mixed-precision quantization for weights and KV cache on Apple Silicon
The research foundation behind mlx-optiq. Some layers are 56× more sensitive to quantization than others. The KV cache becomes the dominant memory cost at long contexts. Mixed-precision recovers what uniform 4-bit drops; mixed-precision KV fixes the perplexity collapse that uniform 4-bit causes. A toy bit allocator is sketched below.
research
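To make the core idea concrete, a toy allocator that spends bits where sensitivity is highest. The sensitivity values and the keep-the-top-quarter rule are illustrative, not the paper's actual procedure.

```python
# Give the most sensitive fraction of layers 8 bits, the rest 4.
def assign_bits(sensitivity: dict[str, float], hi_frac: float = 0.25) -> dict[str, int]:
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_hi = max(1, round(hi_frac * len(ranked)))
    return {name: (8 if i < n_hi else 4) for i, name in enumerate(ranked)}

# A 56x sensitivity spread, like the one measured in the paper:
sens = {"layers.0.attn": 5.6, "layers.1.attn": 1.2,
        "layers.0.mlp": 0.4, "layers.1.mlp": 0.1}
print(assign_bits(sens))
# {'layers.0.attn': 8, 'layers.1.attn': 4, 'layers.0.mlp': 4, 'layers.1.mlp': 4}
```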