# mlx-optiq

> mlx-optiq is a Python toolkit for running large language models entirely on Apple Silicon.

One PyPI package gives you:

1. Sensitivity-driven mixed-precision weight quantization that beats uniform 4-bit at the same size.
2. Per-layer mixed-precision KV cache for long-context decode.
3. Sensitivity-aware LoRA fine-tuning with PEFT-compatible output.
4. Hot-swappable mounted LoRA adapters.
5. A dual-protocol inference server speaking both the OpenAI `/v1/chat/completions` API and the Anthropic `/v1/messages` API from the same process.
6. A two-stage evaluation harness (smoketest + full benchmark suite, with HumanEval running in a layered sandbox).

## Install

```
pip install mlx-optiq
```

Requirements: macOS 14+, Apple Silicon (M1/M2/M3/M4), Python 3.11+.

Optional extras:

- `mlx-optiq[convert]`: psutil for the RAM precheck
- `mlx-optiq[eval]`: datasets for MMLU/GSM8K/IFEval/BFCL/HumanEval
- `mlx-optiq[serve]`: uvicorn/fastapi
- `mlx-optiq[all]`: everything above

## Pre-built models on Hugging Face

All twelve pre-built quants live under the `mlx-community` organization on HF. They load with stock `mlx_lm.load(...)`; no special runtime is required.
### Qwen3.5 family (dense + 1 sparse MoE)

- `mlx-community/Qwen3.5-0.8B-OptiQ-4bit`: 0.5 GB · 27.0% GSM8K (+15.5pp vs uniform-4)
- `mlx-community/Qwen3.5-2B-OptiQ-4bit`: 1.4 GB · 48.0% GSM8K
- `mlx-community/Qwen3.5-4B-OptiQ-4bit`: 2.8 GB · 81.5% GSM8K (+2.0pp)
- `mlx-community/Qwen3.5-9B-OptiQ-4bit`: 5.6 GB · 90.0% GSM8K (default daily-driver)
- `mlx-community/Qwen3.5-27B-OptiQ-4bit`: 15.7 GB · 87.5% GSM8K
- `mlx-community/Qwen3.5-35B-A3B-OptiQ-4bit`: 20.1 GB · 89.5% GSM8K (sparse MoE, 3 B active)

### Qwen3.6 family

- `mlx-community/Qwen3.6-27B-OptiQ-4bit`: 15.7 GB · 95.0% GSM8K (+1.0pp)
- `mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit`: 20.1 GB · 89.5% GSM8K (256-expert MoE, 3 B active)

### Gemma-4 family (instruct)

- `mlx-community/gemma-4-e2b-it-OptiQ-4bit`: 4.0 GB · 13.0% GSM8K (+7.5pp)
- `mlx-community/gemma-4-e4b-it-OptiQ-4bit`: 6.0 GB · 55.5% GSM8K (+32.0pp; best small-model recovery)
- `mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit`: 14.9 GB · 94.0% GSM8K (sparse MoE, 4 B active)
- `mlx-community/gemma-4-31B-it-OptiQ-4bit`: 18.1 GB · 96.0% GSM8K (strongest dense quant)

## Loading any pre-built quant

```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)
```

For Qwen3.5/3.6 reasoning models, pass `enable_thinking=False` when applying the chat template to skip the thinking channel for faster (slightly less accurate) output.
## Streaming generation

```python
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)
for response in stream_generate(model, tok, prompt="...", max_tokens=200, sampler=sampler):
    print(response.text, end="", flush=True)
```

## Quantizing your own model

```bash
# auto-routes between bf16 and uniform_4bit reference based on RAM
optiq convert Qwen/Qwen3.5-9B \
  --target-bpw 4.5 \
  --candidate-bits 4,8 \
  --reference auto \
  -o ./optiq_output/Qwen3.5-9B
```

Three reference modes:

- `bf16`: gold standard; requires the bf16 model in RAM (~2 × params in GB)
- `uniform_4bit`: for big models; builds a 4-bit baseline and streams bf16 layer-by-layer from disk
- `auto` (default): picks bf16 if it fits, else uniform_4bit

The output is a standard MLX checkpoint with per-layer bit assignments stored in metadata. It loads anywhere stock `mlx-lm` loads.

## Mixed-precision KV-cache serving

One-time sensitivity pass, then serve:

```bash
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 5.0 --candidate-bits 4,8 \
  -o ./kv/qwen35_9b

optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --kv-config ./kv/qwen35_9b/kv_config.json \
  --port 8080
```

Delivers +31% to +62% decode speedup at 64k context on Qwen3.5 4B/9B vs fp16 KV. KV-quant is currently broken on Gemma-4 (a shared-KV attention limitation upstream); use stock fp16 KV for Gemma-4 long-context serving.
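As a back-of-envelope check when picking `--target-bpw`: weight footprint scales as parameters × bits-per-weight / 8. This sketch is only an estimate (it ignores tokenizer files, metadata, and the 8-bit-protected layers, so real checkpoints land somewhat higher); `approx_weight_gb` is an illustrative helper, not part of the optiq API:

```python
def approx_weight_gb(n_params: float, bpw: float) -> float:
    """Rough on-disk weight size in GB: parameters x bits-per-weight / 8."""
    return n_params * bpw / 8 / 1e9

# e.g. a 9B-parameter model at --target-bpw 4.5
print(round(approx_weight_gb(9e9, 4.5), 2))  # ~5.06 GB before overhead
```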
## OpenAI- and Anthropic-compatible API

`optiq serve` exposes BOTH endpoints from the same process:

- OpenAI: `/v1/chat/completions` (streaming SSE)
- Anthropic: `/v1/messages` (streaming SSE); works with Claude Code and the official `anthropic` Python SDK

```python
# OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")
resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

# Anthropic client (same server)
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080", api_key="not-used")
resp = client.messages.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.content[0].text)
```

Claude Code via env vars (one-line setup):

```bash
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="not-used"
claude  # now driven by your local quant
```

## Sensitivity-aware LoRA fine-tuning

```bash
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --data ./my_training_data \
  --max-seq-length 1400 \
  --rank 8 --rank-scaling by_bits \
  --num-layers 16 --iters 1000 \
  -o ./my_adapter

optiq lora info ./my_adapter
```

`--rank-scaling by_bits` gives 8-bit mlx-optiq layers 2× the adapter rank of 4-bit layers at the same total parameter budget, so the layers mlx-optiq kept at 8-bit during quantization also get more adapter capacity.

Output is PEFT-compatible (`adapter_config.json` + `adapters.safetensors`) plus an mlx-optiq sidecar (`optiq_lora_config.json`) recording the per-layer rank distribution.

Data format is JSONL (one example per line, `{"text": "..."}` or `{"messages": [...]}`), the same as `mlx_lm.lora`.
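A training-data directory in that format can be produced with nothing but the standard library. A minimal sketch — the `train.jsonl` filename follows the usual `mlx_lm.lora` convention and is an assumption here:

```python
import json
from pathlib import Path

# Directory layout matching --data ./my_training_data above:
# one JSON object per line, either {"text": ...} or {"messages": [...]}.
data_dir = Path("my_training_data")
data_dir.mkdir(exist_ok=True)

examples = [
    {"text": "Q: What is 2 + 2?\nA: 4"},
    {"messages": [
        {"role": "user", "content": "Summarize this repo."},
        {"role": "assistant", "content": "A toolkit for on-device LLMs."},
    ]},
]
(data_dir / "train.jsonl").write_text(
    "\n".join(json.dumps(ex) for ex in examples) + "\n"
)
```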
### Empirical training-ceiling map (M3 Max 36 GB, default config)

| Model | Max seq len | Peak mem |
|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB |
| Qwen3.5-2B | 2,400 | 19.3 GB |
| Qwen3.5-4B | 1,600 | 24.8 GB |
| Qwen3.5-9B | 1,400 | 25.4 GB |
| Qwen3.5-27B / Qwen3.6-27B | 512 | 27.7 GB |
| gemma-4-26B-A4B | 512 | 27.6 GB |
| Qwen3.5-35B-A3B / Qwen3.6-35B-A3B | 128 | 25.3 GB |
| gemma-4-31B-it | 32 | 21.4 GB |

Two distinct failure modes when pushing past these:

- **Memory cliff** (~27-28 GB): macOS switches to compressed memory and throughput drops 9-30%.
- **MTLResource cliff** (independent of bytes): Apple GPUs cap at 499 K simultaneously bound resources. The 2B model at T=3,200 hits a hard `kIOGPUCommandBufferCallbackErrorOutOfMemory` even at 22 GB peak. Don't extrapolate "more GB headroom" → "longer T".

## Hot-swap mounted LoRA adapters

`optiq serve --adapter` accepts a single adapter (HF repo id or local path) per process. For multi-adapter, in-process hot-swap, use the Python primitive:

```python
from mlx_lm import load, generate
from optiq.adapters.mount import (
    prepare_model_for_mounted_lora,
    mount_adapter_on_model,
    AdapterActivation,
)

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prepare_model_for_mounted_lora(model)
mount_adapter_on_model(model, "agent-A", "./adapter_a")
mount_adapter_on_model(model, "agent-B", "./adapter_b")

with AdapterActivation("agent-A"):
    out_a = generate(model, tok, prompt=p, max_tokens=100)
with AdapterActivation("agent-B"):
    out_b = generate(model, tok, prompt=p, max_tokens=100)
```

Memory cost is ~50 MB per extra adapter vs ~5 GB per full base copy. The active adapter is tracked in a `ContextVar`, so concurrent asyncio tasks or threads with different active adapters don't step on each other.
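The `ContextVar` isolation mechanism can be illustrated in a few lines. This is a sketch of the general pattern, not the `optiq.adapters.mount` internals; `_active_adapter` and `lora_forward` are hypothetical names:

```python
from contextvars import ContextVar

# Each asyncio task (or copied context) sees its own value of this variable,
# so two concurrent requests can activate different adapters safely.
_active_adapter = ContextVar("active_adapter", default=None)

class AdapterActivation:
    """Context manager that sets the active adapter for the current context."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self._token = _active_adapter.set(self.name)
        return self
    def __exit__(self, *exc):
        # restore whatever was active before, even if an exception was raised
        _active_adapter.reset(self._token)

def lora_forward(x):
    # a mounted layer would read the ContextVar to pick its adapter branch
    return f"forward({x}) with adapter={_active_adapter.get()}"

with AdapterActivation("agent-A"):
    print(lora_forward("p"))   # agent-A's branch inside the context
print(lora_forward("p"))       # no adapter outside
```

Using a `ContextVar` instead of a plain global is what makes per-request activation safe: `asyncio` copies the context per task, so one task's `AdapterActivation` never leaks into another's.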
## Evaluation

```bash
# Fast triage (KL + GSM8K-50, ~5 min on 27B)
optiq eval ./optiq_mixed --task smoketest

# Full benchmark suite (MMLU + GSM8K + IFEval + BFCL + HumanEval)
optiq eval ./optiq_mixed --task all --score --output-json ./bench.json

# Single tasks
optiq eval ./optiq_mixed --task humaneval   # 164 problems, sandboxed code execution
optiq eval ./optiq_mixed --task bfcl        # 200 simple function-call questions
optiq eval ./optiq_mixed --task ifeval      # full 540-prompt IFEval
```

The `--score` flag (under `--task all`) computes a Capability Score: the unweighted mean of MMLU + GSM8K + IFEval + BFCL + HumanEval.

HumanEval runs in a layered sandbox (apple/container → sandbox-exec → subprocess + rlimit), so untrusted model-generated code can't escape.

The KL evaluator auto-resolves the reference: bf16 if it fits in RAM (per-shard sizes via `HfApi.model_info`), else the mlx-community uniform-4-bit baseline.

## CLI reference

- `optiq convert MODEL [--target-bpw 4.5] [--candidate-bits 4,8] [--reference auto|bf16|uniform_4bit] [--calibration-mix optiq|PATH] [--n-floor-per-block 2] [-o PATH]`
- `optiq kv-cache MODEL [--target-bits 5.0] [--candidate-bits 4,8] [--n-samples 5] [--seq-len 512] [-o PATH]`
- `optiq lora train MODEL --data PATH [--rank 8] [--scale 20.0] [--rank-scaling by_bits|constant|by_kl] [--num-layers 16] [--max-seq-length 1600] [--iters 1000] [-o PATH]`
- `optiq lora info ADAPTER_PATH`
- `optiq serve --model MODEL [--kv-config PATH | --kv-bits 4|8] [--adapter PATH-OR-REPO] [--anthropic/--no-anthropic] [-- mlx_lm.server flags]`
- `optiq eval MODEL_PATH --task smoketest|all|kl|mmlu|gsm8k|gsm8k-50|ifeval|bfcl|humaneval [--score] [--kv-config PATH | --kv-bits 4|8] [--reference-mode auto|bf16|uniform_4bit] [--output-json PATH]`
- `optiq benchmark MODEL_PATH [--baseline UNIFORM_PATH] [--n-samples 50]`
- `optiq latency MODEL_PATH [--calibrate]`
- `optiq --version`

## How sensitivity works (algorithm)

For each `(layer L, candidate bits b)`:

1. Forward-pass calibration data with all weights at reference precision; record the output logits.
2. Replace just L's weight with a simulate-quantized copy at b bits (round-trip quantize → dequantize).
3. Forward-pass the same calibration data; record the perturbed logits.
4. Compute the KL divergence between reference and perturbed logits, averaged over samples.
5. Restore L; move to the next layer.

Then a greedy knapsack: start every layer at the lowest bit width, then repeatedly upgrade the layer with the largest KL-reduction-per-bit until the average BPW reaches the target. `lm_head`, `embed_tokens`, and the first/last attention blocks are protected at 8-bit by default.

Calibration uses the bundled `optiq.jsonl` mix: 40 hand-curated samples across prose (5), reasoning (6), code (6), agent loops (8), function-calling (7), and constraint-bearing instructions (8). Chat samples are rendered through the target model's `tokenizer.apply_chat_template` before tokenization, so the activated subspace matches production. Pass `--calibration-mix /path/to/your.jsonl` to override; rebuild the default mix with `python scripts/build_calibration.py`.
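The greedy knapsack step can be sketched in a few lines of Python. The KL numbers below are made up, and `greedy_bit_allocation` is an illustrative name, not optiq's internal API:

```python
def greedy_bit_allocation(kl, bits=(4, 8), target_bpw=4.5):
    """kl[layer][b] = mean KL divergence with that layer quantized to b bits."""
    lo, hi = min(bits), max(bits)
    assign = {layer: lo for layer in kl}   # start every layer at the lowest bit

    def avg_bpw():
        return sum(assign.values()) / len(assign)

    while avg_bpw() < target_bpw:
        candidates = [l for l in kl if assign[l] == lo]
        if not candidates:
            break
        # upgrade the layer with the largest KL reduction per extra bit
        best = max(candidates, key=lambda l: (kl[l][lo] - kl[l][hi]) / (hi - lo))
        assign[best] = hi
    return assign

# toy sensitivities: "attn.0" hurts the most at 4-bit, so it is upgraded first
kl = {
    "attn.0": {4: 0.90, 8: 0.02},
    "mlp.0":  {4: 0.10, 8: 0.01},
    "attn.1": {4: 0.40, 8: 0.03},
    "mlp.1":  {4: 0.05, 8: 0.01},
}
print(greedy_bit_allocation(kl, target_bpw=5.0))
```

With four layers and candidates {4, 8}, one upgrade moves the average from 4.0 to 5.0 BPW, so exactly the most sensitive layer ends up at 8-bit.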
## Site map

- https://mlx-optiq.com/: overview
- https://mlx-optiq.com/models: all 12 pre-built quants
- https://mlx-optiq.com/docs/: documentation index
- https://mlx-optiq.com/docs/install: installation
- https://mlx-optiq.com/docs/quants: using pre-built quants
- https://mlx-optiq.com/docs/sensitivity: methodology
- https://mlx-optiq.com/docs/qwen3.5: Qwen3.5 family guide
- https://mlx-optiq.com/docs/qwen3.6: Qwen3.6 family guide
- https://mlx-optiq.com/docs/gemma-4: Gemma-4 family guide
- https://mlx-optiq.com/docs/finetune: LoRA fine-tuning
- https://mlx-optiq.com/docs/serve: KV-quant serving
- https://mlx-optiq.com/docs/cli: CLI reference
- https://mlx-optiq.com/blog/: engineering posts and research
- https://mlx-optiq.com/blog/gemma-4-support: Gemma-4 family launch (e2b/e4b/26B-A4B/31B), +32 pp recovery on e4b
- https://mlx-optiq.com/blog/turboquant-postmortem: postmortem on the rotated-space KV experiment we built but didn't ship
- https://mlx-optiq.com/blog/sensitivity-aware-lora: LoRA fine-tuning with rank scaled by per-layer bit assignment
- https://mlx-optiq.com/blog/not-all-layers-are-equal: research foundation: per-layer sensitivity for weights and KV cache
- https://mlx-optiq.com/blog/eval-framework: two-stage eval (smoketest + benchmarks) + Capability Score
- https://mlx-optiq.com/blog/calibration-mix: six-domain calibration mix methodology

## Distribution

- PyPI: https://pypi.org/project/mlx-optiq/
- Hugging Face quants: https://huggingface.co/mlx-community