Pre-built models
OptiQ-quantized models are published on Hugging Face. LLMs drop into stock mlx-lm; YOLO models require `pip install mlx-optiq[yolo]`.
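For LLMs this means the usual load-and-generate flow. A minimal sketch, assuming an `mlx-community/...` repo path (mirroring the YOLO example below; substitute the model's actual repo):

```python
from mlx_lm import load, generate

# OptiQ LLM checkpoints load like any other mlx-lm model.
model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")

# Standard mlx-lm generation; no OptiQ-specific code is needed at inference time.
print(generate(model, tokenizer, prompt="What is 12 * 7?", max_tokens=64))
```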
Qwen3.5 — quality measured on GSM8K (200 samples)
| Model | Original size | OptiQ size | Compression | GSM8K accuracy: OptiQ vs uniform 4-bit |
|---|---|---|---|---|
| Qwen3.5-0.8B-OptiQ-4bit | 1,666 MB | 570 MB | 2.9× | 27.0% vs 11.5% (+15.5pp) |
| Qwen3.5-2B-OptiQ-4bit | 4,338 MB | 1,365 MB | 3.2× | 48.0% vs 48.5% (-0.5pp) |
| Qwen3.5-4B-OptiQ-4bit | 8,888 MB | 2,811 MB | 3.2× | 81.5% vs 79.5% (+2.0pp) |
| Qwen3.5-9B-OptiQ-4bit | 18,412 MB | 5,763 MB | 3.2× | 90.0% vs 90.0% (0.0pp) |
Gemma-4 — quality measured on GSM8K (200 samples)
| Model | Original size | OptiQ size | Compression | GSM8K accuracy: OptiQ vs uniform 4-bit |
|---|---|---|---|---|
| gemma-4-e2b-it-OptiQ-4bit | 9,772 MB | 3,978 MB | 2.5× | 13.0% vs 5.5% (+7.5pp) |
| gemma-4-e4b-it-OptiQ-4bit | 15,252 MB | 6,028 MB | 2.5× | 55.5% vs 23.5% (+32.0pp) |
Note on Gemma-4 serving: Gemma-4 inference works fine with fp16 KV (stock `mlx_lm.server`, or `optiq serve` without `--kv-config`). The mixed-precision KV path currently fails on Gemma-4's shared-KV attention layers, an upstream mlx-lm limitation we're tracking for a future release.
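For example, serving one of the Gemma-4 models above with the stock server (the `mlx-community/...` repo path is an assumption; substitute the model's actual repo):

```bash
# Stock mlx-lm server; its default KV cache is fp16, so nothing extra is needed.
mlx_lm.server --model mlx-community/gemma-4-e4b-it-OptiQ-4bit --port 8080
```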
YOLO26 object detection — quality measured on COCO128
| Model | Original size | OptiQ size | Compression | Detection delta |
|---|---|---|---|---|
| YOLO26n-OptiQ-6bit | 9.9 MB | 2.5 MB | 3.9× | -1.6% |
| YOLO26s-OptiQ-6bit | 38.4 MB | 8.9 MB | 4.3× | -7.0% |
| YOLO26m-OptiQ-6bit | 83.8 MB | 18.9 MB | 4.4× | +0.1% |
| YOLO26l-OptiQ-6bit | 100.7 MB | 22.9 MB | 4.4× | 0.0% |
| YOLO26x-OptiQ-6bit | 225.5 MB | 50.6 MB | 4.5× | -1.1% |
YOLO26 usage — requires `mlx-optiq[yolo]`:

```python
from optiq.models.yolo import load_quantized_yolo

# Download and load an OptiQ-quantized YOLO26 checkpoint from Hugging Face.
model = load_quantized_yolo("mlx-community/YOLO26n-OptiQ-6bit")

# Run detection on a local image.
results = model.predict("image.jpg")
```