mlx-optiq
Postmortem · April 17, 2026

TurboQuant: what we built, what we measured, and why we didn't ship it.

This is the postmortem we wish we'd written months ago. We built a thing. The benchmarks looked good. We didn't ship it. The reason isn't a flaw in the technique; it's that the marginal win didn't justify the cost of a parallel serving path. That's a reasonable call, but a deeply unsexy one to write up, so most teams don't.

The right answer to "should we ship this research?" is sometimes "no, even though the numbers are real."

The technique: rotated-space attention

Affine quantization is the standard way to compress the KV cache: map each group of values onto a small integer range and store a scale and an offset per group. That works fine for storage. But attention is a dot-product operation, and affine quantization doesn't preserve dot products well; it preserves magnitudes. The error in q · k after quantizing k isn't isotropic: it correlates with semantic structure in ways that bias which tokens win the softmax.
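To make that concrete, here is a minimal NumPy sketch of per-group affine quantization and the kind of score error it leaves behind. This is an illustration of the idea, not mlx-lm's QuantizedKVCache implementation; all names are ours.

```python
import numpy as np

def affine_quantize(x, bits=4, group_size=32):
    """Per-group affine quantization: each group of `group_size` values is
    mapped onto integers in [0, 2**bits - 1] with its own scale and offset."""
    qmax = 2 ** bits - 1
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)            # per-group offset
    hi = groups.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)  # guard constant groups
    codes = np.clip(np.round((groups - lo) / scale), 0, qmax).astype(np.uint8)
    return codes, scale, lo

def affine_dequantize(codes, scale, lo, shape):
    return (codes * scale + lo).reshape(shape)

# Dot-product error after quantizing only k, as a KV cache does:
rng = np.random.default_rng(0)
d = 128
q_vec, k = rng.standard_normal(d), rng.standard_normal(d)
codes, scale, lo = affine_quantize(k)
k_hat = affine_dequantize(codes, scale, lo, k.shape)
print(q_vec @ k, q_vec @ k_hat)  # the gap is the score error attention sees
```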

The fix is mathematically simple. Multiply both K and Q by the same random orthogonal matrix R. The attention score is preserved exactly: (Rq) · (Rk) = qᵀRᵀR k = q · k, because RᵀR = I. So you can quantize Rk instead of k and recover the original score. A random rotation distributes a vector's mass roughly uniformly across all coordinates, so the per-coordinate distribution becomes nearly Gaussian (concentration of measure). That's exactly what scalar quantizers like best.
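Reusing the helpers from the sketch above, the invariance and the quantize-in-rotated-space step look like this (again illustrative; in practice R would be fixed once per model, not drawn per call):

```python
rng = np.random.default_rng(1)
d = 128
# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q_vec, k = rng.standard_normal(d), rng.standard_normal(d)

# Exact invariance: rotating both sides leaves the attention score unchanged.
assert np.isclose((R @ q_vec) @ (R @ k), q_vec @ k)

# Quantize the rotated key instead of the raw key...
codes, scale, lo = affine_quantize(R @ k)
rk_hat = affine_dequantize(codes, scale, lo, k.shape)

# ...and score the rotated query against it.
print(q_vec @ k, (R @ q_vec) @ rk_hat)
```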

The catch is the cost. Naive rotated quantization requires dequantizing the keys and rotating them back at attention time, which is O(seq_len × d²) per token and fatal at long context. Our trick was to attend in rotated space: rotate the query once per step (a fixed O(d²) cost) and dot it directly against the stored quantized rotated keys via a fused Metal kernel that does the dequantization inline. That brings the per-step overhead back to roughly that of the affine path.
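A plain-NumPy stand-in for that decode-step flow (the real path is a fused Metal kernel; the function name, layouts, and the omission of the value path are simplifications of ours):

```python
def decode_step_scores(q_vec, R, k_codes, k_scale, k_lo):
    """Score one new query against quantized rotated keys, in rotated space.

    q_vec   : (d,) new query in the original basis
    R       : (d, d) shared random orthogonal matrix
    k_codes : (seq_len, n_groups, group_size) quantized rotated keys
    k_scale : (seq_len, n_groups, 1) per-group scales
    k_lo    : (seq_len, n_groups, 1) per-group offsets
    """
    d = q_vec.shape[0]
    rq = (R @ q_vec).reshape(1, *k_codes.shape[1:])  # the only O(d^2) work per step
    # NumPy materializes the dequantized keys here; the fused kernel folds
    # `codes * scale + lo` into the dot product in registers instead.
    scores = ((k_codes * k_scale + k_lo) * rq).sum(axis=(1, 2))
    return scores / np.sqrt(d)
```

The contrast with the naive path is the absence of any per-token d × d rotation: the stored keys are already in rotated space, so each step pays only the single query rotation.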

The numbers we measured

On Qwen3.5-9B with 4-bit KV at 64 k context, head-to-head against mlx-lm's affine QuantizedKVCache:

Method                Bits   Needle retrieval   Reasoning   PPL drift   Speed vs fp16
fp16 (reference)      16     100 %              32 %        –           1.00 ×
Affine                4      73 %               30 %        +0.48       0.96 ×
TurboQuant (rotated)  4      100 %              32 %        +0.37       0.98 ×

The needle-in-a-haystack number was the most striking: 100 % vs 73 % at 4-bit, on the exact same model and exact same prompts. The retrieval failures with affine concentrated on long-distance matches, exactly where small inner-product errors push the wrong token over the softmax threshold. Reasoning quality and perplexity were close to fp16; speed was within 2 % of the affine path.

The optimization journey itself was satisfying engineering. Our first naive Python implementation of rotated-space attention was 47 % slower than affine. Three rounds of work (incremental dequantization, a custom Metal kernel, a fused SDPA pass that never materializes dequantized keys) closed the gap to 2 %.

So why didn't we ship it

Three things killed it in turn.

1. The 100 % vs 73 % needle test was synthetic

Needle-in-a-haystack is a worst-case probe: single rare token, single position, against a vast distractor field. Real workloads almost never look like this. When we re-ran on more realistic long-context tasks (multi-fact retrieval, tool-result interpretation, multi-turn chat with file context), the gap shrank dramatically. At 32 k context on a 9 B model, both quantizers landed within 1–2 percentage points of fp16 quality on the tasks our users actually run, and within noise of each other.

The headline number was real. It just wasn't predictive of user-visible quality.

2. Per-layer mixed-precision affine already captured most of the win

The companion experiment, per-layer KV bit-width assignment, was the bigger lever. Once we measured per-layer KV sensitivity and protected layer 0 (often 56× more sensitive than the average) at 8-bit, the affine path's quality regressions on long context largely disappeared. Mixed-precision affine was already covering most of what TurboQuant fixed, and it integrates with mx.quantized_matmul's fused fast path on Apple Silicon. TurboQuant, with its custom kernel, does not.
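The assignment itself is a small amount of code once per-layer sensitivities have been measured. A rough sketch of the idea (threshold, names, and numbers here are illustrative, not optiq's actual configuration):

```python
def assign_kv_bits(sensitivity, default_bits=4, protected_bits=8, factor=3.0):
    """Per-layer KV-cache bit widths from measured sensitivity (illustrative).

    sensitivity : {layer index: quality loss when that layer's KV is quantized}
    Layers whose sensitivity exceeds `factor` times the mean are protected
    at the higher bit width; everything else stays at the default.
    """
    mean = sum(sensitivity.values()) / len(sensitivity)
    return {
        layer: protected_bits if s > factor * mean else default_bits
        for layer, s in sensitivity.items()
    }

# Toy example where layer 0 dominates the measured sensitivity:
sens = {0: 5.6, 1: 0.11, 2: 0.09, 3: 0.10}
print(assign_kv_bits(sens))  # {0: 8, 1: 4, 2: 4, 3: 4}
```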

Net for users: mixed-precision affine gave roughly the same quality as TurboQuant on real workloads, and ~30–60 % faster decode at 64 k context. We can't ship a quality wash that's also slower.

3. Two parallel serving paths is one too many

Even ignoring quality and speed, shipping TurboQuant in optiq serve meant maintaining a fork of mlx-lm's attention path. Every kernel update, every model class addition, every new attention variant (sliding window, hybrid, MoE-routed) would need to be ported into our rotated-space version. That's a structural ongoing cost the technique would have had to keep earning.

For a research codebase that's a fine trade. For something we ship to users on PyPI as their default serving stack, it's not.

What we kept, what we removed

For a long time we kept the TurboQuant code in the package as an "import-it-yourself" library primitive: disabled by default, available for users who specifically wanted to play with rotated-space KV. As we got close to our first real release, we audited what was still in active use and what was just historical research that hadn't earned its place. TurboQuant was in the second bucket. We removed it from the package.

What stayed:

  • The conceptual framing that informed TurboQuant ("attention compression has different sensitivity than weight compression") is now embedded in the standard mixed-precision KV pipeline.

What went away:

  • optiq.core.turbo_kv_cache, optiq.core.turbo_quant, optiq.core.turbo_metal, optiq.core.turbo_state_cache: all deleted.
  • The demo/demo_turbo_kv.py stress test.
  • The TurboQuant-flavoured options that never made it into the public CLI anyway.

Lessons we took with us

Synthetic benchmarks lie about marginal wins. The 100 % vs 73 % needle gap was real and reproducible. It was also a worst-case probe of one failure mode that doesn't dominate real usage. We've since moved to multi-domain eval suites (5-shot MMLU, IFEval, BFCL, plus task-specific tests at multiple context lengths) so we don't get fooled the same way again.

If a feature requires a separate serving path, it has to clear a much higher bar. The cost of maintaining parallel infrastructure compounds with every new model architecture. Anything that doesn't cleanly compose with stock mlx-lm needs to deliver a meaningful, durable win, not a marginal one that erodes as the upstream landscape changes.

Ship fewer things, but ship them well. A library that does five things excellently beats one that does six things plus a research demo. mlx-optiq's edge is mixed-precision quantization, sensitivity-aware LoRA, and a dual-protocol serving stack. That's enough.

The TurboQuant code is gone from main. We won't be maintaining it.

— the mlx-optiq team