Apple Silicon native

Mixed-precision quantization for MLX.

Per-layer sensitivity analysis for weights and the KV cache. Protect the layers that matter at 8-bit, quantize the rest to 4-bit. Smaller and faster than uniform quantization — and, on some models, faster than fp16.

$ pip install mlx-optiq

What it does
Three optimization layers, plus a drop-in server
Every layer of your OptiQ-processed model uses data-driven precision. Serve with optiq serve to get the KV-cache speedup out of the box.

W Mixed-precision weights

Per-layer KL-divergence sensitivity analysis + greedy knapsack allocation. Sensitive layers stay at higher precision; the rest get aggressively quantized.

2.3× better GSM8K on Qwen3.5-0.8B vs uniform 4-bit
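The allocation idea can be sketched in a few lines. This is an illustrative toy, not OptiQ's internals: assume each layer has a sensitivity score (e.g. the KL divergence induced by forcing that layer to 4-bit), and greedily spend a bit budget upgrading the most sensitive layers to 8-bit.

```python
# Hypothetical sketch of greedy bit allocation. Layer names and
# sensitivity scores below are illustrative, not OptiQ's actual output.
def allocate_bits(sensitivity, target_avg_bits, low=4, high=8):
    """Give `high` bits to the most sensitive layers while keeping
    the average bit-width at or below `target_avg_bits`."""
    n = len(sensitivity)
    budget = (target_avg_bits - low) * n   # extra bits available to spend
    per_upgrade = high - low               # cost of one low -> high upgrade
    bits = {name: low for name in sensitivity}
    # Upgrade the most sensitive layers first until the budget runs out.
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        if budget < per_upgrade:
            break
        bits[name] = high
        budget -= per_upgrade
    return bits

scores = {"layers.0.attn": 0.9, "layers.0.mlp": 0.2,
          "layers.1.attn": 0.7, "layers.1.mlp": 0.1}
print(allocate_bits(scores, target_avg_bits=6.0))
# the two attention layers get 8-bit, the MLPs stay at 4-bit
```

The greedy pass is the degenerate knapsack case where every upgrade costs the same; with per-layer size-weighted costs it becomes the classic value-per-weight greedy heuristic.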

K Mixed-precision KV cache

Serve with per-layer KV bit-widths from a sensitivity pass. Protects KV-sensitive layers, aggressively quantizes the rest. Works with stock mlx-lm generation.

+40–62% decode speedup at 64k on Qwen3.5 2B/4B/9B
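The mechanics can be illustrated with a minimal sketch, assuming simple affine (scale + zero-point) rounding at each layer's assigned bit-width. The layer-to-bits mapping here is hypothetical; OptiQ derives the real one from its sensitivity pass.

```python
import numpy as np

# Toy per-layer KV quantization: affine round-trip at each layer's
# assigned bit-width. The layer_bits mapping is a made-up example.
def quantize_affine(x, bits):
    qmax = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / qmax
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_affine(q, scale, zero):
    return q.astype(np.float32) * scale + zero

rng = np.random.default_rng(0)
kv = rng.standard_normal((3, 128, 64)).astype(np.float32)  # [layer, seq, dim]
layer_bits = {0: 8, 1: 4, 2: 4}  # e.g. layer 0 is KV-sensitive

for layer, bits in layer_bits.items():
    q, scale, zero = quantize_affine(kv[layer], bits)
    err = np.abs(dequantize_affine(q, scale, zero) - kv[layer]).mean()
    print(f"layer {layer}: {bits}-bit, mean abs error {err:.4f}")
```

The protected 8-bit layer reconstructs with roughly 17× less error than the 4-bit layers (255 vs 15 quantization levels over the same range), which is the trade the sensitivity pass is optimizing.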

T TurboQuant KV cache

Rotation-based vector quantization that preserves attention inner products. Research path for quality-critical workloads and affine-KV comparisons.

100% needle retrieval at 4-bit vs 73% for affine
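The core property rotation-based schemes rely on can be shown directly: an orthogonal rotation preserves attention inner products exactly, so keys and queries can be moved into a quantization-friendlier basis (outliers spread across dimensions) before low-bit rounding. This is a toy illustration of that property, not TurboQuant's actual construction.

```python
import numpy as np

# A random orthogonal matrix Q (from QR decomposition) is a rotation:
# <Qa, Qb> == <a, b> for all vectors, so attention scores computed in
# the rotated basis are identical to those in the original basis.
rng = np.random.default_rng(0)
d = 64
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

query = rng.standard_normal(d)
key = rng.standard_normal(d)

assert np.allclose(query @ key, (Q @ query) @ (Q @ key))
print("inner product preserved under rotation")
```

Quantization error enters only after the rotation, and the rotated coordinates are more uniformly scaled, which is why the scheme holds up at 4-bit where per-channel affine quantization degrades.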

O OpenAI-compatible server

optiq serve wraps mlx_lm.server and injects the mixed-precision KV path. Same API, same ports, same clients — faster generation.

0 code changes needed on the client side
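Because the server speaks the OpenAI chat-completions protocol, any existing client works unchanged. As a minimal stdlib sketch (URL, port, and model name are assumptions matching the Quickstart; adjust to your setup):

```python
import json
import urllib.request

# Talk to the local optiq server over the standard OpenAI-compatible
# endpoint. Host, port, and model name are assumed from the Quickstart.
API_URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt: str, max_tokens: int = 100) -> urllib.request.Request:
    payload = {
        "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Hello")  # requires `optiq serve` running on :8080
```

The official `openai` Python package works the same way: point `base_url` at `http://localhost:8080/v1` and keep the rest of your code as-is.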

Quickstart
Three commands
1. Install
$ pip install mlx-optiq
2. Use a pre-built model (stock mlx-lm)
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Hello", max_tokens=100)
3. Serve with mixed-precision KV
# Sensitivity analysis once per model (~1-2 min)
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b

# Serve (OpenAI-compatible on :8080)
$ optiq serve \
    --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --max-tokens 32768 --temp 0.6 --top-p 0.95 --top-k 20

Numbers that matter
At a glance
3.2× weight compression
+62% decode speedup (4B, 64k)
6 pre-built LLMs
5 pre-built YOLO models
See Results for full benchmarks: per-model GSM8K comparison with uniform 4-bit, COCO detection deltas for YOLO, and the 64k-context KV-serving A/B across Qwen3.5 0.8B / 2B / 4B / 9B.