Apple Silicon native
Mixed-precision quantization for MLX.
Per-layer sensitivity analysis assigns optimal bit-widths. TurboQuant KV cache with rotated-space attention. Better quality, same speed.
What it does
Two optimization layers
OptiQ models work with standard mlx-lm. Install mlx-optiq for the TurboQuant KV cache on top.
Mixed-precision weights
Measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths using greedy knapsack. Sensitive layers get 8-bit, robust layers get 4-bit.
2.3x better accuracy on Qwen3.5-0.8B
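The greedy knapsack step can be sketched roughly as follows: start every layer at 4-bit, then spend a size budget upgrading the layers with the highest sensitivity per extra byte. The function and the example sensitivities below are hypothetical, for illustration only; the real pipeline measures sensitivities via KL divergence.

```python
def allocate_bits(sensitivities, sizes, budget_bytes):
    """Greedy knapsack sketch (hypothetical helper, not the OptiQ API).

    sensitivities: per-layer sensitivity scores (higher = more sensitive).
    sizes: per-layer parameter counts.
    budget_bytes: extra bytes available beyond the all-4-bit baseline.
    """
    bits = [4] * len(sensitivities)
    # Upgrading layer i from 4-bit to 8-bit costs sizes[i] * 0.5 extra bytes
    # (4 extra bits per parameter).
    candidates = sorted(
        range(len(sensitivities)),
        key=lambda i: sensitivities[i] / (sizes[i] * 0.5),
        reverse=True,
    )
    spent = 0.0
    for i in candidates:
        cost = sizes[i] * 0.5
        if spent + cost <= budget_bytes:
            bits[i] = 8
            spent += cost
    return bits

# Hypothetical example: 4 equally sized layers, budget for two upgrades.
bits = allocate_bits([0.9, 0.1, 0.5, 0.05], [100, 100, 100, 100], 100)
# The two most sensitive layers get 8-bit; the rest stay at 4-bit.
```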
K TurboQuant KV cache
Rotation-based vector quantization that preserves attention inner products. Rotated-space attention eliminates per-key rotation overhead at inference time.
100% needle retrieval at 4-bit vs 73% FP16
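The property this relies on is that an orthogonal rotation preserves inner products: attention scores computed between rotated queries and rotated keys equal the original scores. A minimal numpy check of that identity (a sketch, not the library's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Random orthogonal matrix R from a QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)
k = rng.standard_normal(d)

# Inner products are invariant under rotation: (Rq) . (Rk) == q . k,
# so quantizing keys in rotated space preserves attention scores.
rotated_score = (R @ q) @ (R @ k)
original_score = q @ k
```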
Benchmarks
Measured on Qwen3.5-0.8B, M3 Max
Quality + Speed
TurboQuant 4-bit: better reasoning (32% vs 30%), perfect retrieval (100% vs 73%), same speed (-2%).
Optimization Journey
From 47% slower to 2% slower through incremental dequantization, custom Metal kernels, and rotated-space attention.
Rotated-Space Attention
Rotate the incoming query once per step instead of un-rotating every stored key: a fixed O(d²) cost vs O(seq_len × d²) per token.
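The cost argument can be sketched in numpy: with keys stored already rotated, the naive path multiplies every cached key by R each step, while rotated-space attention applies R to the query once, and the scores are identical (a sketch under those assumptions, not the Metal kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
d, seq_len = 8, 16

# Random orthogonal rotation R.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)
keys = rng.standard_normal((seq_len, d))
rotated_keys = keys @ R.T  # what the cache stores (before quantization)

# Naive: un-rotate every stored key each step -> O(seq_len * d^2).
scores_naive = q @ (rotated_keys @ R).T

# Rotated-space: rotate the query once -> O(d^2); scores match exactly.
scores_fast = (R @ q) @ rotated_keys.T
```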
Usage
Drop-in integration
OptiQ models work with stock mlx-lm. Install mlx-optiq to add TurboQuant KV cache on top.
Weight quantization
stock mlx-lm
```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
out = generate(model, tok, prompt="Hello", max_tokens=100)
```
TurboQuant KV cache
mlx-optiq
```python
from optiq.core.turbo_kv_cache import TurboQuantKVCache, patch_attention

patch_attention()

cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim,
            bits=4,
            seed=42 + i,
        )
```
Pre-built models
Ready to use on HuggingFace
Work with standard mlx-lm. No special code needed for weight quantization.
| Model | GSM8K (OptiQ) | GSM8K (uniform 4-bit) | Size |
|---|---|---|---|
| Qwen3.5-0.8B-OptiQ-4bit | 27.0% | 11.5% | 570 MB |
| Qwen3.5-2B-OptiQ-4bit | 48.0% | 48.5% | 1,365 MB |
| Qwen3.5-4B-OptiQ-4bit | 81.5% | 79.5% | 2,811 MB |
| Qwen3.5-9B-OptiQ-4bit | 90.0% | 90.0% | 5,763 MB |
Under the hood
How it works
Weight Quantization Pipeline
Load (HuggingFace) → Sensitivity (KL divergence) → Optimize (greedy knapsack) → Convert (MLX format)
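The sensitivity step in the pipeline above can be sketched: quantize one layer at a time and score how far the model's output distribution drifts from the full-precision reference, using KL divergence. Helper names and logits below are hypothetical, for illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two probability distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical logits: full-precision model vs. the model with one layer
# quantized to 4-bit. A larger KL means that layer is more sensitive,
# so the knapsack step should prefer to keep it at 8-bit.
ref_logits = np.array([2.0, 1.0, 0.1])
quant_logits = np.array([1.8, 1.1, 0.3])
sensitivity = kl_divergence(softmax(ref_logits), softmax(quant_logits))
```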
TurboQuant KV Cache
Rotate (random orthogonal R) → Quantize (Lloyd-Max centroids) → Attend (rotated-space SDPA) → Output (post-rotate result)
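The Lloyd-Max step can be sketched as 1-D k-means over the rotated values: alternate between assigning each value to its nearest centroid and moving each centroid to the mean of its cluster. A minimal sketch under that framing, not the library's implementation:

```python
import numpy as np

def lloyd_max(values, bits=2, iters=20):
    """Fit 2**bits scalar centroids to `values` via Lloyd's algorithm
    (hypothetical helper for illustration)."""
    k = 2 ** bits
    # Initialize centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(iters):
        # Assignment step: nearest centroid for each value.
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: each centroid moves to its cluster mean.
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = values[idx == j].mean()
    return np.sort(centroids)

rng = np.random.default_rng(0)
values = rng.standard_normal(4096)
centroids = lloyd_max(values, bits=2)  # 4 centroids for 2-bit codes
```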