Optimized deployment for MLX.
Per-layer sensitivity drives every layer of the stack: weights, KV cache, LoRA fine-tuning, and runtime adapter hot-swap. One data-driven signal, consistent from quantization to serving. Smaller, faster models, with the sensitivity-aware LoRA budget spent where it helps most. optiq serve covers the full stack.

Mixed-precision weights
Per-layer KL-divergence sensitivity + greedy knapsack. Sensitive layers stay at higher precision; the rest get aggressively quantized. Auto-strips vision/audio metadata when quantizing VLM bases for text-only use.
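The idea behind "sensitivity + greedy knapsack" can be sketched in a few lines: start every layer at the low bit-width, then promote the most KL-sensitive layers to the high bit-width until the average bit budget is spent. This is an illustrative sketch only; the layer names, scores, and function are made up and do not reflect OptIQ's actual API.

```python
def assign_bits(sensitivity, budget_bits, low=4, high=8):
    """Greedy knapsack-style bit assignment: everyone starts at `low`
    bits, then the most sensitive layers are promoted to `high` bits
    while the average bit-width stays within `budget_bits`."""
    bits = {name: low for name in sensitivity}
    n = len(sensitivity)
    # Visit layers from most to least KL-sensitive.
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        trial = dict(bits, **{name: high})
        if sum(trial.values()) / n <= budget_bits:
            bits = trial
    return bits

# Hypothetical per-layer KL-divergence scores.
scores = {"layers.0": 0.9, "layers.1": 0.1, "layers.2": 0.6, "layers.3": 0.2}
bits = assign_bits(scores, budget_bits=6.0)
print(bits)  # layers.0 and layers.2 (most sensitive) land at 8 bits
```

With a 6.0-bit budget over four layers, exactly the two most sensitive layers fit at 8 bits; the rest stay at 4.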
Mixed-precision KV cache
Serve with per-layer KV bit-widths from a sensitivity pass. Protects KV-sensitive layers, aggressively quantizes the rest. Works with stock mlx-lm generation.
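Why per-layer KV bit-widths pay off: the reconstruction error of a uniformly quantized cache shrinks as the bit-width grows, so spending bits only on KV-sensitive layers keeps quality where it matters. A toy uniform quantizer (not OptIQ's implementation) makes the trade-off concrete:

```python
import numpy as np

def quantize(x, bits):
    """Toy uniform quantize/dequantize round trip at `bits` precision."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 8))           # stand-in for one layer's KV slice
err8 = np.abs(quantize(kv, 8) - kv).max()  # protected, KV-sensitive layer
err2 = np.abs(quantize(kv, 2) - kv).max()  # aggressively quantized layer
print(err8 < err2)  # True: more bits, smaller worst-case error
```

A sensitivity pass decides which layers get the 8-bit treatment and which can tolerate the 2-bit one.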
TurboQuant compression
Rotation-based vector quantization that preserves attention inner products. Research path for quality-critical workloads and affine-KV comparisons.
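The key property rotation-based quantizers lean on is that an orthogonal rotation R preserves inner products exactly: (Rq) . (Rk) = q . (R^T R) k = q . k, so attention scores survive the rotation, and quantizing in the rotated basis spreads outliers more evenly. A minimal demonstration (illustrative, not TurboQuant's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# QR decomposition of a Gaussian matrix yields a random orthogonal R.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)
k = rng.standard_normal(d)
# The attention inner product is unchanged by the rotation.
print(np.allclose(q @ k, (R @ q) @ (R @ k)))  # True
```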
Sensitivity-aware LoRA
optiq lora train reads OptIQ's per-layer bit assignments and gives sensitive layers higher adapter rank than robust ones. PEFT-compatible output plus an OptIQ sidecar.
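The quick-start below notes that 8-bit layers get 2× the rank of 4-bit layers at the same base rank, which suggests a rank-proportional-to-bits rule. A hypothetical sketch of such a rule (the function and formula are assumptions, not OptIQ's exact implementation):

```python
def scaled_rank(base_rank, layer_bits, ref_bits=4):
    """Scale LoRA rank with the layer's assigned bit-width:
    layers kept at higher precision (more sensitive) get more rank."""
    return max(1, base_rank * layer_bits // ref_bits)

# Hypothetical per-layer bit assignments from a quantization pass.
layer_bits = {"layers.0": 8, "layers.1": 4, "layers.2": 8}
ranks = {name: scaled_rank(8, b) for name, b in layer_bits.items()}
print(ranks)  # {'layers.0': 16, 'layers.1': 8, 'layers.2': 16}
```

At base rank 8, the 8-bit (sensitive) layers get rank 16 and the 4-bit layers keep rank 8, matching the 2× relationship.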
Hot-swap adapters
Reversible mounted LoRA: mount N adapters on one base and switch per request via ContextVar, with no model reload and no re-upload of weights. optiq serve --adapter <HF-repo-id> auto-downloads from HuggingFace.
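The ContextVar pattern lets each request select its adapter without touching the weights: layers read a context-local variable at forward time, so concurrent requests can use different adapters on the same mounted base. A minimal sketch with stand-in math (class and variable names are illustrative, not OptIQ's internals):

```python
from contextvars import ContextVar

# Per-request adapter selection; None means base model only.
active_adapter = ContextVar("active_adapter", default=None)

class MountedLoRALinear:
    """Base layer with N mounted adapters; applies only the one
    selected for the current request context."""
    def __init__(self):
        self.adapters = {}  # name -> adapter delta (simplified: a callable)

    def mount(self, name, delta):
        self.adapters[name] = delta

    def __call__(self, x):
        y = 2 * x  # stand-in for the base matmul
        name = active_adapter.get()
        if name in self.adapters:
            y += self.adapters[name](x)  # add the LoRA delta
        return y

layer = MountedLoRALinear()
layer.mount("agent", lambda x: 0.5 * x)

token = active_adapter.set("agent")  # per-request switch, no reload
print(layer(10))  # 25.0  (base 20 + adapter 5.0)
active_adapter.reset(token)          # back to base behavior
print(layer(10))  # 20
```

Because ContextVar values are scoped to the current async task or thread context, one request's adapter choice never leaks into another's.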
YOLO26 & OpenAI server
YOLO26 object detection gets the same per-layer sensitivity treatment, with sensitivity measured on detection outputs. optiq serve is a drop-in mlx_lm.server replacement: same OpenAI API, with KV quantization and LoRA built in.
$ pip install mlx-optiq
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Hello", max_tokens=100)
# Sensitivity analysis once per model (~1-2 min)
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 4.5 -o optiq_output/kv_cache/qwen35_9b

# Serve with quantized KV + a community LoRA (auto-downloaded from HF)
$ optiq serve \
    --kv-config optiq_output/kv_cache/qwen35_9b/kv_config.json \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter codelion/my-agent-lora \
    --max-tokens 32768 --temp 0.6 --top-p 0.95
# 8-bit layers get 2× the rank of 4-bit layers at the same base rank
$ optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --data ./my_training_data \
    --rank 8 --rank-scaling by_bits \
    --iters 1000 -o ./my_adapter

# Inspect the per-layer rank distribution
$ optiq lora info ./my_adapter