Mixed-precision quantization optimizer for MLX models on Apple Silicon. Better quality through per-layer sensitivity analysis and rotation-based KV cache compression.
```bash
pip install mlx-optiq
```
OptiQ quantized models work with standard mlx-lm. Install mlx-optiq for the TurboQuant KV cache on top.
OptiQ measures each layer's quantization sensitivity via KL divergence and assigns per-layer bit-widths with a greedy knapsack optimization: sensitive layers get 8-bit, robust layers get 4-bit.
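A toy sketch of the greedy knapsack idea (names, numbers, and the function signature are illustrative, not OptiQ's actual API): start every layer at 4-bit, then upgrade the layers with the highest sensitivity per extra byte to 8-bit until a size budget is exhausted.

```python
# Toy sketch of greedy knapsack bit allocation; illustrative only,
# not OptiQ's actual API. Every layer starts at 4-bit; layers with the
# most KL-divergence sensitivity per extra byte are upgraded to 8-bit
# until the size budget is spent.
def allocate_bits(sensitivity, param_counts, budget_bytes):
    bits = {name: 4 for name in sensitivity}
    # Upgrading 4-bit -> 8-bit costs 4 extra bits (0.5 bytes) per parameter.
    extra_bytes = {name: param_counts[name] // 2 for name in sensitivity}
    # Greedy order: highest sensitivity per extra byte first.
    order = sorted(sensitivity,
                   key=lambda n: sensitivity[n] / extra_bytes[n],
                   reverse=True)
    spent = 0
    for name in order:
        if spent + extra_bytes[name] <= budget_bytes:
            bits[name] = 8
            spent += extra_bytes[name]
    return bits

# Hypothetical per-layer KL sensitivities and parameter counts:
sensitivity = {"layers.0": 0.9, "layers.1": 0.1, "layers.2": 0.5}
param_counts = {"layers.0": 1000, "layers.1": 1000, "layers.2": 1000}
bits = allocate_bits(sensitivity, param_counts, budget_bytes=800)
# Only the most sensitive layer fits the 800-byte budget and gets 8-bit.
```

Greedy selection by sensitivity-per-byte is a standard knapsack heuristic; it is not guaranteed optimal, but it is fast and works well when layer costs are similar.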
Rotation-based vector quantization (from TurboQuant) preserves attention inner products, and computing attention directly in rotated space eliminates per-key rotation overhead.
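A minimal numpy sketch (not the TurboQuant implementation) of why rotated-space attention works: an orthogonal rotation leaves inner products unchanged, so if both keys and queries live in rotated space, attention scores come out identical and the rotation never has to be undone per key. Only the quantization of the rotated vectors introduces error.

```python
import numpy as np

# Illustrative sketch, not the TurboQuant implementation.
rng = np.random.default_rng(0)
d = 64
# Random orthogonal rotation via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

key = rng.standard_normal(d)
query = rng.standard_normal(d)

# Orthogonal rotations preserve inner products exactly:
# <Q k, Q q> == <k, q>, so attention scores can be computed
# directly in rotated space.
assert np.allclose(key @ query, (Q @ key) @ (Q @ query))

# Quantization error is then incurred only on the rotated key,
# e.g. a crude 4-bit-style symmetric scalar quantizer:
scale = np.abs(Q @ key).max() / 7  # signed 4-bit range [-7, 7]
key_q = np.round((Q @ key) / scale) * scale
```

The rotation spreads each vector's energy across coordinates, which makes per-coordinate scalar quantization better behaved; the exact-preservation property is what lets the cache skip de-rotating every key at attention time.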
All benchmarks were run on an Apple Silicon M3 Max.
```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
```
```python
from mlx_lm import load
from optiq.core.turbo_kv_cache import (
    TurboQuantKVCache,
    patch_attention,
)

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
patch_attention()  # Install rotated-space attention

cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim,
            bits=4,
            seed=42 + i,
        )
```
The models below are available on Hugging Face and work with standard mlx-lm; no special code needed.
| Model | GSM8K (OptiQ) | GSM8K (uniform 4-bit) | Size |
|---|---|---|---|
| Qwen3.5-0.8B-OptiQ-4bit | 27.0% | 11.5% | 570 MB |
| Qwen3.5-2B-OptiQ-4bit | 48.0% | 48.5% | 1,365 MB |
| Qwen3.5-4B-OptiQ-4bit | 81.5% | 79.5% | 2,811 MB |
| Qwen3.5-9B-OptiQ-4bit | 90.0% | 90.0% | 5,763 MB |