Weight quantization
Quality vs uniform 4-bit
OptiQ's per-layer bit allocation recovers accuracy lost to uniform 4-bit quantization. Accuracy is measured on GSM8K with 200 samples per model, using the same prompt template and sampling settings for both configurations.
LLMs — GSM8K (200 samples)
| Model | Uniform 4-bit | OptiQ 4-bit | Delta |
|---|---|---|---|
| Qwen3.5-0.8B | 11.5% | 27.0% | +15.5pp |
| Qwen3.5-2B | 48.5% | 48.0% | -0.5pp |
| Qwen3.5-4B | 79.5% | 81.5% | +2.0pp |
| Qwen3.5-9B | 90.0% | 90.0% | 0.0pp |
| gemma-4-e2b-it | 5.5% | 13.0% | +7.5pp |
| gemma-4-e4b-it | 23.5% | 55.5% | +32.0pp |
Pattern: OptiQ's wins grow with how much uniform 4-bit degrades the model. On saturated benchmarks (Qwen3.5-9B at 90.0%), there's nothing to recover. On models that uniform 4-bit breaks (gemma-4-e4b-it: 23.5% → 55.5%), the win is dramatic.
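To make the mechanism concrete, here is a minimal sketch of sensitivity-guided per-layer bit allocation under an average-bit budget. The function name, the greedy strategy, and the sensitivity scores are illustrative assumptions, not OptiQ's actual algorithm or API.

```python
def allocate_bits(sensitivities, target_avg_bits=4.0, choices=(3, 4, 6, 8)):
    """Start every layer at the narrowest width, then repeatedly widen the
    most sensitive layer that can still be widened within the budget."""
    n = len(sensitivities)
    bits = [min(choices)] * n
    budget = target_avg_bits * n
    while True:
        best = None
        for i, s in enumerate(sensitivities):
            wider = [c for c in choices if c > bits[i]]
            if not wider:
                continue                      # layer already at the widest width
            step = wider[0]                   # next width up for this layer
            if sum(bits) - bits[i] + step <= budget and (best is None or s > best[0]):
                best = (s, i, step)
        if best is None:                      # no affordable promotion left
            return bits
        _, i, step = best
        bits[i] = step

# Example: four layers, the first much more sensitive than the rest.
print(allocate_bits([0.9, 0.1, 0.4, 0.2], target_avg_bits=4.5))  # [8, 3, 4, 3]
```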

KV cache serving
Mixed-precision KV A/B at 64k context
Apple M3 Max (36 GB). 64,000-token English prose prompt, streaming 500 output tokens. Comparing stock mlx_lm.server (fp16 KV) vs optiq serve --kv-config (per-layer sensitivity-guided KV quantization).
Decode tok/s at 64k
| Model | fp16 TTFT | fp16 decode | Mixed TTFT | Mixed decode | Decode speedup |
|---|---|---|---|---|---|
| Qwen3.5-0.8B | 34.5 s | 47.2 tok/s | 40.5 s | 42.4 tok/s | -10% |
| Qwen3.5-2B | 82.8 s | 27.9 tok/s | 161.7 s | 41.8 tok/s | +50% |
| Qwen3.5-4B | 165.8 s | 8.1 tok/s | 252.0 s | 13.1 tok/s | +62% |
| Qwen3.5-9B | 214.8 s | 20.7 tok/s | 163.8 s | 27.1 tok/s | +31% |
Takeaway: For Qwen3.5 2B and larger, mixed-precision KV gives a 31–62% decode speedup at 64k context. 0.8B is too small to benefit — optiq kv-cache's sensitivity pass correctly picks uniform 4-bit for it (no layer needs 8-bit protection), but on M3 Max even that is marginally slower than fp16 at this scale.
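A minimal client-side harness along these lines can reproduce the TTFT and decode numbers. It assumes the server under test (stock mlx_lm.server or optiq serve) exposes an OpenAI-style streaming /v1/chat/completions endpoint on localhost:8080; the URL, payload fields, and the one-chunk-per-token approximation are assumptions to adapt to the actual setup.

```python
import json, time, requests

def measure(prompt, url="http://localhost:8080/v1/chat/completions"):
    """Return (ttft_seconds, decode_tok_per_s) for one streamed request."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 500,
        "stream": True,
    }
    start = time.perf_counter()
    first_token = None
    n_tokens = 0
    with requests.post(url, json=payload, stream=True, timeout=3600) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            chunk = json.loads(line[len(b"data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                n_tokens += 1                       # approximation: one chunk per token
                if first_token is None:
                    first_token = time.perf_counter()
    end = time.perf_counter()
    return first_token - start, (n_tokens - 1) / (end - first_token)
```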
Per-layer KV configs (generated by optiq kv-cache --target-bits 4.5)
| Model | Full-attn layers | Config | Avg bits |
|---|---|---|---|
| Qwen3.5-0.8B | 6 of 24 | 6 @ 4-bit | 4.00 |
| Qwen3.5-2B | 6 of 24 | 6 @ 4-bit | 4.00 |
| Qwen3.5-4B | 8 of 32 | 7 @ 4-bit + 1 @ 8-bit (layer 3) | 4.50 |
| Qwen3.5-9B | 8 of 32 | 7 @ 4-bit + 1 @ 8-bit (layer 3) | 4.50 |
Why layer 3? In Qwen3.5's hybrid architecture, layer 3 is the first full-attention layer (layers 0, 1, and 2 use linear attention). It is consistently the most KV-sensitive layer across both 4B and 9B, and protecting it at 8-bit also happens to flip it onto mx.quantized_matmul's fast path on Apple Silicon, so quality preservation and speed point in the same direction.
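For illustration, a per-layer bit map in the spirit of that table, and the arithmetic behind the 4.50 average, might look like this. The dict representation and the layer positions other than layer 3 are assumptions, not the actual config file optiq kv-cache emits.

```python
# Hypothetical per-layer KV bit map for Qwen3.5-4B: 8 full-attention layers,
# layer 3 (the first full-attention layer) protected at 8-bit, the rest at 4-bit.
# Layer positions other than 3 are placeholders.
full_attn_layers = [3, 7, 11, 15, 19, 23, 27, 31]
kv_bits = {layer: 4 for layer in full_attn_layers}
kv_bits[3] = 8

avg_bits = sum(kv_bits.values()) / len(kv_bits)
print(avg_bits)   # 4.5, so the --target-bits 4.5 budget is met exactly
```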

TurboQuant research
Rotated-space KV attention
The research path: rotation-based vector quantization that preserves inner products for attention. Compared against mlx-lm's affine QuantizedKVCache.
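The core property is easy to demonstrate in isolation: rotating queries and keys by the same orthogonal matrix leaves their inner products unchanged, and quantizing in the rotated basis spreads outlier dimensions so a low-bit code loses less of the attention score. The sketch below uses a plain random rotation and naive uniform 4-bit quantization; it illustrates the principle, not TurboQuant's actual quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

def quantize_4bit(x):
    """Naive per-vector uniform (affine) 4-bit quantize + dequantize."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 15 or 1.0
    return np.round((x - lo) / scale) * scale + lo

# A key with a few outlier dimensions, as attention activations often have.
k = rng.normal(size=d)
k[:4] *= 25.0
q_vec = rng.normal(size=d)

# Shared random orthogonal rotation (QR decomposition of a Gaussian matrix).
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

exact = q_vec @ k
plain = q_vec @ quantize_4bit(k)                # quantize in the original basis
rotated = (R @ q_vec) @ quantize_4bit(R @ k)    # rotate both, quantize the key

print("exact score:", exact)
print("error, plain 4-bit:  ", abs(plain - exact))
print("error, rotated 4-bit:", abs(rotated - exact))  # typically much smaller
```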
Quality and speed comparison
TurboQuant 4-bit vs affine 4-bit: better reasoning (32% vs 30%), perfect needle retrieval (100% vs 73%), comparable speed (-2%).
Needle retrieval
100% at 4-bit vs 73% for affine — rotated-space quantization preserves inner products across all sequence positions.
Perplexity
Tight perplexity gap at matched bit-widths — TurboQuant MSE 4-bit is PPL +0.37 vs affine's +0.48.
Memory scaling
Per-token KV storage as a function of sequence length — 4-bit TurboQuant is 4× smaller than fp16.
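The 4× figure follows directly from the storage arithmetic. A back-of-envelope sketch, with placeholder model dimensions (not any specific Qwen3.5 config) and ignoring per-group scale/zero-point overhead:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits):
    # Keys and values are both cached, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bits / 8

# Placeholder dimensions for illustration only.
n_layers, n_kv_heads, head_dim = 32, 8, 128
ctx = 64_000

fp16_gib = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, 16) * ctx / 2**30
q4_gib = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, 4) * ctx / 2**30
print(f"fp16 KV at 64k: {fp16_gib:.1f} GiB, 4-bit KV: {q4_gib:.1f} GiB")  # 4x smaller
```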

YOLO object detection
COCO128 detection deltas
OptiQ 6-bit YOLO models vs the original fp16. Quality measured on COCO128 (128 images, standard detection metric).
| Model | Original (fp16) | OptiQ (6-bit) | Compression | Detection delta |
|---|---|---|---|---|
| YOLO26n | 9.9 MB | 2.5 MB | 3.9× | -1.6% |
| YOLO26s | 38.4 MB | 8.9 MB | 4.3× | -7.0% |
| YOLO26m | 83.8 MB | 18.9 MB | 4.4× | +0.1% |
| YOLO26l | 100.7 MB | 22.9 MB | 4.4× | 0.0% |
| YOLO26x | 225.5 MB | 50.6 MB | 4.5× | -1.1% |