Apple Silicon Native

mlx-optiq

Mixed-precision quantization optimizer for MLX models on Apple Silicon. Better quality through per-layer sensitivity analysis and rotation-based KV cache compression.

pip install mlx-optiq

Two optimization layers

OptiQ quantized models work with standard mlx-lm. Install mlx-optiq for the TurboQuant KV cache on top.

Mixed-Precision Weights

OptiQ measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths using greedy knapsack optimization. Sensitive layers get 8-bit, robust layers get 4-bit.

2.3x better accuracy — Qwen3.5-0.8B: 27% GSM8K vs 11.5% uniform 4-bit
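The bit-width assignment described above can be sketched as a greedy knapsack: start every layer at 4-bit, then upgrade the layers with the highest sensitivity per extra bit of storage until the size budget runs out. This is an illustrative toy, not OptiQ's actual implementation; the layer names, sensitivities, and budget below are made up.

```python
# Toy sketch of greedy per-layer bit allocation under a size budget.
# Sensitivities (stand-ins for measured KL divergence) and parameter
# counts are invented for illustration.

def allocate_bits(sensitivity, n_params, budget_bits):
    """Start every layer at 4-bit; greedily upgrade the most
    sensitivity-dense layers to 8-bit while the budget allows."""
    bits = {name: 4 for name in sensitivity}
    used = sum(4 * n_params[name] for name in sensitivity)
    # Upgrade order: sensitivity gained per extra bit of storage.
    order = sorted(sensitivity,
                   key=lambda n: sensitivity[n] / n_params[n],
                   reverse=True)
    for name in order:
        extra = 4 * n_params[name]   # cost of moving 4-bit -> 8-bit
        if used + extra <= budget_bits:
            bits[name] = 8
            used += extra
    return bits

sensitivity = {"embed": 0.9, "attn.0": 0.5, "mlp.0": 0.1, "attn.1": 0.4}
n_params = {"embed": 100, "attn.0": 50, "mlp.0": 200, "attn.1": 50}
budget = 4 * sum(n_params.values()) + 600   # room for ~600 extra bits
print(allocate_bits(sensitivity, n_params, budget))
```

With these numbers the densest layers (attn.0, embed) land at 8-bit while the robust MLP stays at 4-bit, which is the shape of result the sensitivity analysis is after.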

TurboQuant KV Cache

Rotation-based vector quantization from TurboQuant that preserves attention inner products. Rotated-space attention eliminates per-key rotation overhead.

100% needle retrieval at 4-bit vs 73% FP16. Near-zero speed overhead.
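The key property behind the rotation trick is that an orthogonal rotation preserves inner products, so attention scores q·k computed in rotated space match the originals exactly. A minimal 2-D sketch (toy vectors, not TurboQuant's actual random orthogonal matrices):

```python
# Sketch: orthogonal rotations preserve inner products, so q.k is
# unchanged when both vectors are rotated by the same matrix R.
import math

def rotate(v, theta):
    """Apply a 2-D rotation matrix (the simplest orthogonal R)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (3.0, -1.0)
rq, rk = rotate(q, 0.7), rotate(k, 0.7)
print(dot(q, k), dot(rq, rk))  # identical up to float rounding
```

Because the score is unchanged, keys can be stored (and quantized) in rotated space and never rotated back for attention.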

Measured results

All benchmarks on Qwen3.5-0.8B running on Apple Silicon M3 Max.

Quality + Speed
TurboQuant 4-bit: better reasoning (32% vs 30%), perfect retrieval (100% vs 73%), same speed (-2%). TurboQuant 3-bit is actually 4% faster than FP16.

Optimization Journey
From 47% slower to 2% slower through incremental dequantization, custom Metal kernels, and rotated-space attention.

Rotated-Space Attention
Rotate the query once instead of rotating every stored key: O(d²) fixed cost vs O(seq_len × d²), which eliminates the dequantization bottleneck entirely.
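The asymptotic claim above is easy to check with back-of-envelope arithmetic; the head dimension and sequence length below are illustrative, not measured from OptiQ:

```python
# Rotating every cached key each step costs one d x d mat-vec per key,
# i.e. O(seq_len * d^2); rotating only the incoming query costs O(d^2).
# Illustrative sizes, not OptiQ's actual configuration.
d, seq_len = 128, 4096

per_key_flops = seq_len * d * d   # rotate all cached keys every step
per_query_flops = d * d           # rotate the query once instead

print(per_key_flops // per_query_flops)  # savings factor == seq_len
```

The savings factor equals seq_len, so the gap grows with context length, which is why the per-key approach was the dequantization bottleneck.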

Drop-in integration

OptiQ models work with stock mlx-lm. Install mlx-optiq to add TurboQuant KV cache compression on top.

Weight Quantization (works with stock mlx-lm)
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
response = generate(model, tokenizer,
    prompt="Hello", max_tokens=100)
TurboQuant KV Cache (requires mlx-optiq)
from mlx_lm import load
from optiq.core.turbo_kv_cache import (
    TurboQuantKVCache, patch_attention
)

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")
patch_attention()  # Install rotated-space attention

cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim,
            bits=4, seed=42+i
        )

Ready to use

All models are available on HuggingFace and work with standard mlx-lm — no special code needed.

Model                     GSM8K (OptiQ)   GSM8K (uniform 4-bit)   Size
Qwen3.5-0.8B-OptiQ-4bit   27.0%           11.5%                   570 MB
Qwen3.5-2B-OptiQ-4bit     48.0%           48.5%                   1,365 MB
Qwen3.5-4B-OptiQ-4bit     81.5%           79.5%                   2,811 MB
Qwen3.5-9B-OptiQ-4bit     90.0%           90.0%                   5,763 MB

Under the hood

Weight Quantization Pipeline
Load (HuggingFace) → Sensitivity (KL divergence) → Optimize (greedy knapsack) → Convert (MLX format)

TurboQuant KV Cache
Rotate (random orthogonal R) → Quantize (Lloyd-Max centroids) → Attend (rotated-space SDPA) → Output (post-rotate result)
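The Quantize step above refers to Lloyd-Max centroids, which in one dimension is the classic alternate-and-recenter loop (1-D k-means). A toy sketch with made-up data, not OptiQ's actual codebook training:

```python
# Sketch of Lloyd-Max scalar quantization (1-D k-means): alternately
# assign each value to its nearest centroid, then recenter each
# centroid on the mean of its bucket. Toy data for illustration.
def lloyd_max(values, centroids, iters=20):
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for v in values:
            i = min(range(len(centroids)),
                    key=lambda j: abs(v - centroids[j]))
            buckets[i].append(v)
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids

data = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
print(lloyd_max(data, [0.0, 1.0]))  # converges near [0.15, 0.95]
```

The converged centroids minimize mean squared quantization error for the data, which is the criterion the cache's bit-constrained codebooks optimize.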