Apple Silicon native

Optimized deployment
for MLX.

Per-layer sensitivity drives every stage of the stack — weights, KV cache, LoRA fine-tuning, runtime adapter hot-swap. One data-driven signal, consistent from quantization to serving. Smaller models, faster decoding, and a sensitivity-aware LoRA budget spent where it helps most.

$ pip install mlx-optiq

What it does
One sensitivity signal, six optimization layers
OptIQ measures each layer's sensitivity once and reuses it everywhere — weight quant, KV quant, LoRA rank, TurboQuant rotation, adapter hot-swap, VLM stripping. Ship with optiq serve for the full stack.

W Mixed-precision weights

Per-layer KL-divergence sensitivity + greedy knapsack. Sensitive layers stay at higher precision; the rest get aggressively quantized. Auto-strips vision/audio metadata when quantizing VLM bases for text-only use.

2.3× better GSM8K on Qwen3.5-0.8B vs uniform 4-bit
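The weight pass boils down to a budgeted greedy allocation: spend extra precision on the layers a sensitivity metric flags, subject to an average-bits budget. A minimal sketch of the idea — `assign_bits` and the two-level 4/8-bit scheme are illustrative assumptions, not OptIQ's actual implementation:

```python
def assign_bits(sensitivity, budget_bits, low=4, high=8):
    """Greedy knapsack sketch: start every layer at `low` bits, then
    upgrade layers to `high` bits in descending sensitivity order
    until the average bit-width budget is spent."""
    n = len(sensitivity)
    bits = {name: low for name in sensitivity}
    spare = (budget_bits - low) * n   # bit budget beyond the all-low baseline
    cost = high - low                 # cost of upgrading one layer
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        if spare >= cost:
            bits[name] = high
            spare -= cost
    return bits

# Hypothetical per-layer KL-divergence scores (higher = more sensitive).
sens = {"layers.0": 0.9, "layers.1": 0.1, "layers.2": 0.7, "layers.3": 0.2}
print(assign_bits(sens, budget_bits=6.0))
# → layers.0 and layers.2 keep 8 bits; the robust layers drop to 4.
```

A real pass would also weigh layer size (parameters per layer differ), which is what makes it a knapsack rather than a plain top-k cut.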

K Mixed-precision KV cache

Serve with per-layer KV bit-widths from a sensitivity pass. Protects KV-sensitive layers, aggressively quantizes the rest. Works with stock mlx-lm generation.

+40–62% decode speedup at 64k on Qwen3.5 2B/4B/9B

T TurboQuant compression

Rotation-based vector quantization that preserves attention inner products. Research path for quality-critical workloads and affine-KV comparisons.

100% needle retrieval at 4-bit vs 73% for affine
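The rotation trick rests on a simple identity: an orthogonal rotation leaves inner products exactly unchanged, so attention scores computed on rotated keys and queries match the originals, while the rotation spreads outlier channels out and makes coarse quantization far less damaging. A sketch of that identity (illustrative only, not TurboQuant's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Random orthogonal rotation via QR decomposition: Q.T @ Q == I.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

k = rng.standard_normal(d)      # a key vector
q_vec = rng.standard_normal(d)  # a query vector

# Rotate both sides; the attention inner product is preserved exactly
# (up to float rounding), so quantization can happen in rotated space.
rk, rq = Q @ k, Q @ q_vec
print(k @ q_vec, rk @ rq)
```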

L Sensitivity-aware LoRA

optiq lora train reads OptIQ's per-layer bit assignments and gives sensitive layers higher adapter rank than robust ones. PEFT-compatible output plus an OptIQ sidecar.

-12% val loss at iter 50 vs constant rank (GSM8K subset)
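The rank rule can be pictured as a one-line mapping from the quantizer's bit assignment to adapter rank. A hypothetical sketch matching the behavior the quickstart describes (8-bit layers get 2× the rank of 4-bit layers at the same base rank); `rank_for_layer` is an assumed name, not OptIQ's API:

```python
def rank_for_layer(bits, base_rank=8):
    # Sensitive layers (kept at 8-bit by the weight pass) get double
    # the adapter rank; robust 4-bit layers get the base rank.
    return base_rank * 2 if bits >= 8 else base_rank

# Hypothetical per-layer bit assignments from a prior OptIQ weight pass.
layer_bits = {"layers.0": 8, "layers.1": 4, "layers.2": 8}
ranks = {name: rank_for_layer(b) for name, b in layer_bits.items()}
print(ranks)  # → {'layers.0': 16, 'layers.1': 8, 'layers.2': 16}
```

The upshot: adapter capacity follows the same sensitivity signal as precision, so the LoRA parameter budget concentrates where quantization already showed the model is fragile.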

A Hot-swap adapters

Reversible mounted LoRA. Mount N adapters on one base, switch per-request via ContextVar — no model reload, no weight re-upload. optiq serve --adapter <HF-repo-id> auto-downloads from HuggingFace.

~50 MB per extra adapter vs ~5 GB per extra base
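The ContextVar gate pattern is what makes the swap free: the base weights never move, and each request just flips which adapter delta gets added. A minimal sketch of the pattern — class and variable names are illustrative, not OptIQ's API:

```python
from contextvars import ContextVar

# Per-request adapter selection; reading a ContextVar is effectively free,
# and each async request/thread context sees its own value.
active_adapter = ContextVar("active_adapter", default=None)

class HotSwapLinear:
    def __init__(self, base_fn):
        self.base_fn = base_fn    # stand-in for the frozen base layer
        self.adapters = {}        # name -> LoRA delta function (~MBs each)

    def mount(self, name, delta_fn):
        self.adapters[name] = delta_fn  # base weights untouched

    def __call__(self, x):
        out = self.base_fn(x)
        name = active_adapter.get()     # per-request selection, no reload
        if name in self.adapters:
            out = out + self.adapters[name](x)
        return out

layer = HotSwapLinear(lambda x: 2 * x)
layer.mount("agent", lambda x: 0.5 * x)

token = active_adapter.set("agent")  # this request uses the adapter
print(layer(10))   # → 25.0
active_adapter.reset(token)          # back to the bare base, instantly
print(layer(10))   # → 20
```

Because only the small delta functions differ, N adapters cost N small residuals on one shared base — the source of the ~50 MB vs ~5 GB gap above.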

Y YOLO26 & OpenAI server

YOLO26 object detection gets the same per-layer sensitivity treatment on detection outputs. optiq serve is a drop-in mlx_lm.server replacement — same OpenAI API, KV quant + LoRA built in.

5 pre-built YOLO26 OptIQ 6-bit sizes (n/s/m/l/x)

Quickstart
From install to sensitivity-aware LoRA in four steps
1. Install
$ pip install mlx-optiq
2. Use a pre-built model (works with stock mlx-lm too)
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Hello", max_tokens=100)
3. Serve with mixed-precision KV + a HuggingFace adapter
# Sensitivity analysis once per model (~1-2 min)
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b

# Serve with quantized KV + a community LoRA (auto-downloaded from HF)
$ optiq serve \
    --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter codelion/my-agent-lora \
    --max-tokens 32768 --temp 0.6 --top-p 0.95
4. Fine-tune with sensitivity-aware LoRA
# 8-bit layers get 2× the rank of 4-bit layers at the same base rank
$ optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --rank 8 --rank-scaling by_bits \
    --iters 1000 -o ./my_adapter

# Inspect the per-layer rank distribution
$ optiq lora info ./my_adapter

Numbers that matter
At a glance
3.2×
Weight compression
+62%
Decode speedup (4B, 64k)
6
Pre-built LLMs
5
Pre-built YOLOs
100×
RAM vs N full copies
(multi-adapter)
0 ms
Adapter swap latency
(ContextVar gate)
See Results for full benchmarks: per-model GSM8K comparison with uniform 4-bit, COCO detection deltas for YOLO, the 64k-context KV-serving A/B across Qwen3.5 0.8B / 2B / 4B / 9B, and the v0.0.9 sensitivity-aware LoRA A/B on GSM8K.