Mixed-precision quantization for MLX.
Per-layer sensitivity analysis for weights and the KV cache. Protect the layers that matter at 8-bit, quantize the rest to 4-bit. Smaller and faster than uniform quantization — and, on some models, faster than fp16.
Run optiq serve to get the KV-cache speedup out of the box.

Mixed-precision weights
Per-layer KL-divergence sensitivity analysis + greedy knapsack allocation. Sensitive layers stay at higher precision; the rest get aggressively quantized.
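The allocation step can be pictured as a greedy knapsack: start every layer at 4-bit, then promote the most KL-sensitive layers to 8-bit until the target average bit-width is exhausted. A minimal sketch (illustrative only; the actual optiq scoring and allocator may differ):

```python
def allocate_bits(sensitivity, target_avg_bits, low=4, high=8):
    """Greedy knapsack: promote the most sensitive layers from
    `low` to `high` bits while the average stays within target."""
    n = len(sensitivity)
    bits = {layer: low for layer in sensitivity}
    budget = target_avg_bits * n - low * n   # extra bits available to spend
    cost = high - low                        # cost of promoting one layer
    # Spend the budget on the most sensitive layers first
    for layer in sorted(sensitivity, key=sensitivity.get, reverse=True):
        if budget >= cost:
            bits[layer] = high
            budget -= cost
    return bits

# Example: 4 layers, target average of 5 bits -> exactly one 8-bit promotion
scores = {"layers.0": 0.9, "layers.1": 0.1, "layers.2": 0.05, "layers.3": 0.3}
print(allocate_bits(scores, target_avg_bits=5))
```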
Mixed-precision KV cache
Serve with per-layer KV bit-widths from a sensitivity pass. Protects KV-sensitive layers, aggressively quantizes the rest. Works with stock mlx-lm generation.
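The intuition behind per-layer bit-widths: a sensitive layer kept at 8-bit reconstructs its K/V tensors far more accurately than one squeezed to 4-bit. A toy affine-quantization sketch in numpy (not the optiq implementation):

```python
import numpy as np

def affine_quantize(x, bits):
    """Toy per-tensor affine quantization: map x onto [0, 2**bits - 1]."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 64)).astype(np.float32)  # stand-in K/V tensor

# 8-bit reconstruction error is ~16x smaller than 4-bit (error <= scale / 2)
for bits in (8, 4):
    q, s, z = affine_quantize(kv, bits)
    err = np.abs(dequantize(q, s, z) - kv).max()
    print(f"{bits}-bit max abs error: {err:.4f}")
```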
TurboQuant KV cache
Rotation-based vector quantization that preserves attention inner products. Research path for quality-critical workloads and affine-KV comparisons.
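The property being exploited is that an orthogonal rotation R leaves inner products unchanged (⟨Rq, Rk⟩ = ⟨q, k⟩), so keys can be rotated into a quantization-friendly basis without perturbing attention scores. A minimal numpy check (illustrative; not the TurboQuant codebook):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Random orthogonal rotation via QR decomposition
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)  # query
k = rng.standard_normal(d)  # key

# Attention logits are identical in the rotated basis:
# (Rq) . (Rk) = q^T R^T R k = q . k, since R^T R = I.
assert np.allclose(q @ k, (R @ q) @ (R @ k))

# For real activations with outlier channels, such a rotation spreads the
# outlier energy across coordinates, which is what makes low-bit vector
# quantization of the rotated keys viable.
print("inner product preserved:", np.allclose(q @ k, (R @ q) @ (R @ k)))
```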
OpenAI-compatible server
optiq serve wraps mlx_lm.server and injects the mixed-precision KV path. Same API, same ports, same clients — faster generation.
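Since the server speaks the OpenAI API, any existing client can point at it unchanged. For example, with curl against the default port (model name illustrative):

```shell
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
      "messages": [{"role": "user", "content": "Hello"}],
      "max_tokens": 100
    }'
```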
$ pip install mlx-optiq
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Hello", max_tokens=100)
# Sensitivity analysis once per model (~1-2 min)
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b

# Serve (OpenAI-compatible on :8080)
$ optiq serve \
    --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --max-tokens 32768 --temp 0.6 --top-p 0.95 --top-k 20