Apple Silicon native

Mixed-precision
quantization for MLX.

Per-layer sensitivity analysis assigns optimal bit-widths. TurboQuant KV cache with rotated-space attention. Better quality, same speed.

$ pip install mlx-optiq

What it does
Two optimization layers
OptiQ models work with standard mlx-lm. Install mlx-optiq for the TurboQuant KV cache on top.

Mixed-precision weights

Measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths using greedy knapsack. Sensitive layers get 8-bit, robust layers get 4-bit.

2.3x higher GSM8K accuracy on Qwen3.5-0.8B
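The greedy knapsack step can be sketched in a few lines (function and parameter names here are hypothetical, not the library's API): start every layer at 4-bit, then spend a fixed bit budget upgrading the layers with the highest sensitivity per parameter to 8-bit.

```python
import numpy as np

def assign_bits(sensitivity, n_params, avg_bits=5.0, low=4, high=8):
    """Greedy knapsack sketch (hypothetical helper): begin with `low`
    bits everywhere, then upgrade the most sensitive layers to `high`
    bits until the average-bit budget is exhausted."""
    sensitivity = np.asarray(sensitivity, dtype=float)
    n_params = np.asarray(n_params, dtype=float)
    bits = np.full(len(sensitivity), low, dtype=int)
    budget = (avg_bits - low) * n_params.sum()  # extra bits available
    # visit layers in order of sensitivity relieved per parameter
    order = np.argsort(-sensitivity / n_params)
    for i in order:
        cost = (high - low) * n_params[i]       # price of upgrading layer i
        if cost <= budget:
            bits[i] = high
            budget -= cost
    return bits

# toy example: layer 0 is by far the most sensitive per parameter
bits = assign_bits(sensitivity=[9.0, 1.0, 0.5, 2.0],
                   n_params=[100, 100, 100, 100])
print(bits)  # → [8 4 4 4]
```

With a 5-bit average budget over four equal-sized layers, only the single most sensitive layer can afford the upgrade; the rest stay at 4-bit.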

TurboQuant KV cache

Rotation-based vector quantization that preserves attention inner products. Rotated-space attention eliminates per-key rotation overhead at inference time.

100% needle retrieval at 4-bit vs 73% for FP16
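The property this relies on is that an orthogonal rotation leaves inner products, and hence attention logits, unchanged, so keys can be quantized in the rotated space without distorting scores. A minimal numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# random orthogonal rotation R via QR decomposition
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)  # query vector
k = rng.standard_normal(d)  # key vector

# inner products are invariant under rotation: (Rq)·(Rk) == q·k,
# so attention logits survive the change of basis exactly
assert np.allclose(q @ k, (R @ q) @ (R @ k))
```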

Benchmarks
Measured on Qwen3.5-0.8B, M3 Max
Quality + Speed
TurboQuant 4-bit: better reasoning (32% vs 30%), perfect needle retrieval (100% vs 73%), near-identical speed (-2%).
Optimization Journey
From 47% slower to 2% slower through incremental dequantization, custom Metal kernels, and rotated-space attention.
Rotated-Space Attention
Rotate the query once instead of rotating every stored key: a fixed O(d²) cost vs O(seq_len × d²) per token.
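A numpy sketch of the idea (not the library's Metal implementation): store keys and values pre-rotated, which is where quantization would happen, then at query time rotate the query once and post-rotate the output. The result matches standard attention exactly.

```python
import numpy as np

def sdpa(q, K, V):
    """Plain single-query scaled-dot-product attention."""
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(1)
d, n = 32, 16
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # fixed orthogonal rotation

K = rng.standard_normal((n, d))   # stored keys
V = rng.standard_normal((n, d))   # stored values
q = rng.standard_normal(d)        # current query

# Keys/values are stored already rotated; at query time rotate q once
# (O(d^2)) instead of un-rotating every stored key (O(n * d^2)).
K_rot, V_rot = K @ R.T, V @ R.T
out = sdpa(R @ q, K_rot, V_rot) @ R  # post-rotate the result back

assert np.allclose(out, sdpa(q, K, V))
```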

Usage
Drop-in integration
OptiQ models work with stock mlx-lm. Install mlx-optiq to add TurboQuant KV cache on top.
Weight quantization (stock mlx-lm)
from mlx_lm import load, generate

# OptiQ checkpoints load like any other MLX model
model, tok = load(
    "mlx-community/Qwen3.5-0.8B-OptiQ-4bit"
)
out = generate(
    model, tok,
    prompt="Hello",
    max_tokens=100
)
TurboQuant KV cache (mlx-optiq)
from optiq.core.turbo_kv_cache import (
    TurboQuantKVCache, patch_attention
)

# model and tok loaded via mlx_lm.load as above
patch_attention()

# swap in a TurboQuant cache for every attention layer
cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim,
            bits=4, seed=42 + i  # per-layer rotation seed
        )

Pre-built models
Ready to use on HuggingFace
These models work with standard mlx-lm; no special code is needed for weight quantization.
Model                     GSM8K (OptiQ)   GSM8K (uniform 4-bit)   Size
Qwen3.5-0.8B-OptiQ-4bit   27.0%           11.5%                     570 MB
Qwen3.5-2B-OptiQ-4bit     48.0%           48.5%                   1,365 MB
Qwen3.5-4B-OptiQ-4bit     81.5%           79.5%                   2,811 MB
Qwen3.5-9B-OptiQ-4bit     90.0%           90.0%                   5,763 MB

Under the hood
How it works
Weight Quantization Pipeline
Load (HuggingFace) → Sensitivity (KL divergence) → Optimize (greedy knapsack) → Convert (MLX format)

TurboQuant KV Cache
Rotate (random orthogonal R) → Quantize (Lloyd-Max centroids) → Attend (rotated-space SDPA) → Output (post-rotate result)
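The Lloyd-Max step is essentially 1-D k-means over the distribution of rotated values: centroids move to the mean of their cell, cells are re-drawn around the centroids, and the fixed point is the MSE-optimal scalar quantizer. A minimal sketch (an illustration, not the library's implementation):

```python
import numpy as np

def lloyd_max(x, bits=2, iters=50):
    """Lloyd-Max sketch: alternate nearest-centroid assignment and
    cell-mean updates until the 2**bits centroids converge."""
    k = 2 ** bits
    # initialise centroids on evenly spaced quantiles of the data
    c = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # assign each sample to its nearest centroid
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        # move each centroid to the mean of its cell
        for j in range(k):
            if np.any(idx == j):
                c[j] = x[idx == j].mean()
    return np.sort(c)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
c = lloyd_max(x, bits=2)
# for Gaussian data the optimal 2-bit levels are roughly ±0.45 and ±1.51
print(c)
```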