# mlx-optiq

> mlx-optiq is a Python toolkit for running large language models entirely on Apple Silicon.

One PyPI package gives you:

1. Sensitivity-driven mixed-precision weight quantization that beats uniform 4-bit at the same size.
2. Per-layer mixed-precision KV cache for long-context decode.
3. Sensitivity-aware LoRA fine-tuning with PEFT-compatible output.
4. Hot-swappable mounted LoRA adapters.
5. A dual-protocol inference server speaking both the OpenAI `/v1/chat/completions` API and the Anthropic `/v1/messages` API from the same process.
6. A two-stage evaluation harness (smoketest + full benchmark suite, with HumanEval running in a layered sandbox).

## Install

```
pip install mlx-optiq
```

Requirements: macOS 14+, Apple Silicon (M1/M2/M3/M4), Python 3.11+.

Optional extras:

- `mlx-optiq[convert]`: psutil for the RAM precheck
- `mlx-optiq[eval]`: datasets for MMLU/GSM8K/IFEval/BFCL/HumanEval
- `mlx-optiq[serve]`: uvicorn/fastapi
- `mlx-optiq[all]`: everything above

## Pre-built models on Hugging Face

All twelve pre-built quants live under the `mlx-community` organization on HF. They load with stock `mlx_lm.load(...)`; no special runtime is required.
### Qwen3.5 family (dense + 1 sparse MoE)

- `mlx-community/Qwen3.5-0.8B-OptiQ-4bit`: 0.5 GB · 27.0% GSM8K (+15.5pp vs uniform-4)
- `mlx-community/Qwen3.5-2B-OptiQ-4bit`: 1.4 GB · 48.0% GSM8K
- `mlx-community/Qwen3.5-4B-OptiQ-4bit`: 2.8 GB · 81.5% GSM8K (+2.0pp)
- `mlx-community/Qwen3.5-9B-OptiQ-4bit`: 5.6 GB · 90.0% GSM8K (default daily-driver)
- `mlx-community/Qwen3.5-27B-OptiQ-4bit`: 15.7 GB · 87.5% GSM8K
- `mlx-community/Qwen3.5-35B-A3B-OptiQ-4bit`: 20.1 GB · 89.5% GSM8K (sparse MoE, 3 B active)

### Qwen3.6 family

- `mlx-community/Qwen3.6-27B-OptiQ-4bit`: 15.7 GB · 95.0% GSM8K (+1.0pp)
- `mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit`: 20.1 GB · 89.5% GSM8K (256-expert MoE, 3 B active)

### Gemma-4 family (instruct)

- `mlx-community/gemma-4-e2b-it-OptiQ-4bit`: 4.0 GB · 13.0% GSM8K (+7.5pp)
- `mlx-community/gemma-4-e4b-it-OptiQ-4bit`: 6.0 GB · 55.5% GSM8K (+32.0pp; best small-model recovery)
- `mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit`: 14.9 GB · 94.0% GSM8K (sparse MoE, 4 B active)
- `mlx-community/gemma-4-31B-it-OptiQ-4bit`: 18.1 GB · 96.0% GSM8K (strongest dense quant)

## Loading any pre-built quant

```python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain mixed-precision quantization."}],
    tokenize=False,
    add_generation_prompt=True,
)
out = generate(model, tok, prompt=prompt, max_tokens=300)
print(out)
```

For Qwen3.5/3.6 reasoning models, pass `enable_thinking=False` when applying the chat template to skip the thinking channel for faster (slightly less accurate) output.
## Streaming generation

```python
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)
for response in stream_generate(model, tok, prompt="...", max_tokens=200, sampler=sampler):
    print(response.text, end="", flush=True)
```

## Quantizing your own model

```bash
# auto-routes between bf16 and uniform_4bit reference based on RAM
optiq convert Qwen/Qwen3.5-9B \
  --target-bpw 4.5 \
  --candidate-bits 4,8 \
  --reference auto \
  -o ./optiq_output/Qwen3.5-9B
```

Three reference modes:

- `bf16`: gold standard; requires the bf16 model in RAM (~2 × params in GB)
- `uniform_4bit`: for big models; builds a 4-bit baseline and streams bf16 layer-by-layer from disk
- `auto` (default): picks bf16 if it fits, else uniform_4bit

The output is a standard MLX checkpoint with per-layer bit assignments stored in metadata. It loads anywhere stock `mlx-lm` loads.

## Mixed-precision KV-cache serving

One-time sensitivity pass, then serve:

```bash
optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --target-bits 5.0 --candidate-bits 4,8 \
  -o ./kv/qwen35_9b

optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --kv-config ./kv/qwen35_9b/kv_config.json \
  --port 8080
```

Delivers +31% to +62% decode speedup at 64k context on Qwen3.5 4B/9B vs fp16 KV. KV-quant is currently broken on Gemma-4 (a shared-KV attention limitation upstream); use stock fp16 KV for Gemma-4 long-context serving.
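As a back-of-envelope check when picking `--target-bpw`: weight footprint scales as parameters × bits-per-weight / 8. This sketch is only an estimate (it ignores tokenizer files, metadata, and the 8-bit-protected layers, so real checkpoints land somewhat higher); `approx_weight_gb` is an illustrative helper, not part of the optiq API:

```python
def approx_weight_gb(n_params: float, bpw: float) -> float:
    """Rough on-disk weight size in GB: parameters x bits-per-weight / 8."""
    return n_params * bpw / 8 / 1e9

# e.g. a 9B-parameter model at --target-bpw 4.5
print(round(approx_weight_gb(9e9, 4.5), 2))  # ~5.06 GB before overhead
```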
## OpenAI- and Anthropic-compatible API

`optiq serve` exposes BOTH endpoints from the same process:

- OpenAI: `/v1/chat/completions` (streaming SSE)
- Anthropic: `/v1/messages` (streaming SSE); works with Claude Code and the official `anthropic` Python SDK

```python
# OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")
resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

# Anthropic client (same server)
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080", api_key="not-used")
resp = client.messages.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.content[0].text)
```

Claude Code via env vars (one-line setup):

```bash
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="not-used"
claude  # now driven by your local quant
```

## Sensitivity-aware LoRA fine-tuning

```bash
optiq lora train mlx-community/Qwen3.5-9B-OptiQ-4bit \
  --data ./my_training_data \
  --max-seq-length 1400 \
  --rank 8 --rank-scaling by_bits \
  --num-layers 16 --iters 1000 \
  -o ./my_adapter

optiq lora info ./my_adapter
```

`--rank-scaling by_bits` gives 8-bit mlx-optiq layers 2× the adapter rank of 4-bit layers at the same total parameter budget, so the layers mlx-optiq kept at 8-bit during quantization also get more adapter capacity.

Output is PEFT-compatible (`adapter_config.json` + `adapters.safetensors`) plus an mlx-optiq sidecar (`optiq_lora_config.json`) recording the per-layer rank distribution.

Data format is JSONL (one example per line, `{"text": "..."}` or `{"messages": [...]}`), the same as `mlx_lm.lora`.
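A training-data directory in that format can be produced with nothing but the standard library. A minimal sketch — the `train.jsonl` filename follows the usual `mlx_lm.lora` convention and is an assumption here:

```python
import json
from pathlib import Path

# Directory layout matching --data ./my_training_data above:
# one JSON object per line, either {"text": ...} or {"messages": [...]}.
data_dir = Path("my_training_data")
data_dir.mkdir(exist_ok=True)

examples = [
    {"text": "Q: What is 2 + 2?\nA: 4"},
    {"messages": [
        {"role": "user", "content": "Summarize this repo."},
        {"role": "assistant", "content": "A toolkit for on-device LLMs."},
    ]},
]
(data_dir / "train.jsonl").write_text(
    "\n".join(json.dumps(ex) for ex in examples) + "\n"
)
```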
### Empirical training-ceiling map (M3 Max 36 GB, default config)

| Model | Max seq len | Peak mem |
|---|---|---|
| Qwen3.5-0.8B | 2,800 | 23.4 GB |
| Qwen3.5-2B | 2,400 | 19.3 GB |
| Qwen3.5-4B | 1,600 | 24.8 GB |
| Qwen3.5-9B | 1,400 | 25.4 GB |
| Qwen3.5-27B / Qwen3.6-27B | 512 | 27.7 GB |
| gemma-4-26B-A4B | 512 | 27.6 GB |
| Qwen3.5-35B-A3B / Qwen3.6-35B-A3B | 128 | 25.3 GB |
| gemma-4-31B-it | 32 | 21.4 GB |

Two distinct failure modes when pushing past these:

- **Memory cliff** (~27-28 GB): macOS switches to compressed memory and throughput drops 9-30%.
- **MTLResource cliff** (independent of bytes): Apple GPUs cap at 499 K simultaneously bound resources. The 2B model at T=3,200 hits a hard `kIOGPUCommandBufferCallbackErrorOutOfMemory` even at 22 GB peak. Don't extrapolate "more GB headroom" → "longer T".

## Hot-swap mounted LoRA adapters

`optiq serve --adapter` accepts a single adapter (HF repo id or local path) per process. For multi-adapter, in-process hot-swap, use the Python primitive:

```python
from mlx_lm import load, generate
from optiq.adapters.mount import (
    prepare_model_for_mounted_lora,
    mount_adapter_on_model,
    AdapterActivation,
)

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prepare_model_for_mounted_lora(model)
mount_adapter_on_model(model, "agent-A", "./adapter_a")
mount_adapter_on_model(model, "agent-B", "./adapter_b")

with AdapterActivation("agent-A"):
    out_a = generate(model, tok, prompt=p, max_tokens=100)
with AdapterActivation("agent-B"):
    out_b = generate(model, tok, prompt=p, max_tokens=100)
```

Memory cost is ~50 MB per extra adapter vs ~5 GB per full base copy. The active adapter is tracked in a `ContextVar`, so concurrent asyncio tasks or threads with different active adapters don't step on each other.
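The `ContextVar` isolation mechanism can be illustrated in a few lines. This is a sketch of the general pattern, not the `optiq.adapters.mount` internals; `_active_adapter` and `lora_forward` are hypothetical names:

```python
from contextvars import ContextVar

# Each asyncio task (or copied context) sees its own value of this variable,
# so two concurrent requests can activate different adapters safely.
_active_adapter = ContextVar("active_adapter", default=None)

class AdapterActivation:
    """Context manager that sets the active adapter for the current context."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self._token = _active_adapter.set(self.name)
        return self
    def __exit__(self, *exc):
        # restore whatever was active before, even if an exception was raised
        _active_adapter.reset(self._token)

def lora_forward(x):
    # a mounted layer would read the ContextVar to pick its adapter branch
    return f"forward({x}) with adapter={_active_adapter.get()}"

with AdapterActivation("agent-A"):
    print(lora_forward("p"))   # agent-A's branch inside the context
print(lora_forward("p"))       # no adapter outside
```

Using a `ContextVar` instead of a plain global is what makes per-request activation safe: `asyncio` copies the context per task, so one task's `AdapterActivation` never leaks into another's.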
## Evaluation

```bash
# Fast triage (KL + GSM8K-50, ~5 min on 27B)
optiq eval ./optiq_mixed --task smoketest

# Full benchmark suite (MMLU + GSM8K + IFEval + BFCL + HumanEval)
optiq eval ./optiq_mixed --task all --score --output-json ./bench.json

# Single tasks
optiq eval ./optiq_mixed --task humaneval   # 164 problems, sandboxed code execution
optiq eval ./optiq_mixed --task bfcl        # 200 simple function-call questions
optiq eval ./optiq_mixed --task ifeval      # full 540-prompt IFEval
```

The `--score` flag (under `--task all`) computes a Capability Score: the unweighted mean of MMLU + GSM8K + IFEval + BFCL + HumanEval.

HumanEval runs in a layered sandbox (apple/container → sandbox-exec → subprocess + rlimit), so untrusted model-generated code can't escape.

The KL evaluator auto-resolves the reference: bf16 if it fits in RAM (per-shard sizes via `HfApi.model_info`), else the mlx-community uniform-4-bit baseline.

## CLI reference

- `optiq convert MODEL [--target-bpw 4.5] [--candidate-bits 4,8] [--reference auto|bf16|uniform_4bit] [--calibration-mix optiq|PATH] [--n-floor-per-block 2] [-o PATH]`
- `optiq kv-cache MODEL [--target-bits 5.0] [--candidate-bits 4,8] [--n-samples 5] [--seq-len 512] [-o PATH]`
- `optiq lora train MODEL --data PATH [--rank 8] [--scale 20.0] [--rank-scaling by_bits|constant|by_kl] [--num-layers 16] [--max-seq-length 1600] [--iters 1000] [-o PATH]`
- `optiq lora info ADAPTER_PATH`
- `optiq serve --model MODEL [--kv-config PATH | --kv-bits 4|8] [--adapter PATH-OR-REPO] [--anthropic/--no-anthropic] [-- mlx_lm.server flags]`
- `optiq eval MODEL_PATH --task smoketest|all|kl|mmlu|gsm8k|gsm8k-50|ifeval|bfcl|humaneval [--score] [--kv-config PATH | --kv-bits 4|8] [--reference-mode auto|bf16|uniform_4bit] [--output-json PATH]`
- `optiq benchmark MODEL_PATH [--baseline UNIFORM_PATH] [--n-samples 50]`
- `optiq latency MODEL_PATH [--calibrate]`
- `optiq --version`

## How sensitivity works (algorithm)

For each `(layer L, candidate bits b)`:

1. Forward-pass calibration data with all weights at reference precision; record the output logits.
2. Replace just L's weight with a simulate-quantized copy at b bits (round-trip quantize → dequantize).
3. Forward-pass the same calibration data; record the perturbed logits.
4. Compute the KL divergence between reference and perturbed logits, averaged over samples.
5. Restore L; move to the next layer.

Then a greedy knapsack: start every layer at the lowest bit width, then repeatedly upgrade the layer with the largest KL-reduction-per-bit until the average BPW reaches the target. `lm_head`, `embed_tokens`, and the first/last attention blocks are protected at 8-bit by default.

Calibration uses the bundled `optiq.jsonl` mix: 40 hand-curated samples across prose (5), reasoning (6), code (6), agent loops (8), function-calling (7), and constraint-bearing instructions (8). Chat samples are rendered through the target model's `tokenizer.apply_chat_template` before tokenization, so the activated subspace matches production. Pass `--calibration-mix /path/to/your.jsonl` to override; rebuild the default mix with `python scripts/build_calibration.py`.
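The greedy knapsack step can be sketched in a few lines of Python. The KL numbers below are made up, and `greedy_bit_allocation` is an illustrative name, not optiq's internal API:

```python
def greedy_bit_allocation(kl, bits=(4, 8), target_bpw=4.5):
    """kl[layer][b] = mean KL divergence with that layer quantized to b bits."""
    lo, hi = min(bits), max(bits)
    assign = {layer: lo for layer in kl}   # start every layer at the lowest bit

    def avg_bpw():
        return sum(assign.values()) / len(assign)

    while avg_bpw() < target_bpw:
        candidates = [l for l in kl if assign[l] == lo]
        if not candidates:
            break
        # upgrade the layer with the largest KL reduction per extra bit
        best = max(candidates, key=lambda l: (kl[l][lo] - kl[l][hi]) / (hi - lo))
        assign[best] = hi
    return assign

# toy sensitivities: "attn.0" hurts the most at 4-bit, so it is upgraded first
kl = {
    "attn.0": {4: 0.90, 8: 0.02},
    "mlp.0":  {4: 0.10, 8: 0.01},
    "attn.1": {4: 0.40, 8: 0.03},
    "mlp.1":  {4: 0.05, 8: 0.01},
}
print(greedy_bit_allocation(kl, target_bpw=5.0))
```

With four layers and candidates {4, 8}, one upgrade moves the average from 4.0 to 5.0 BPW, so exactly the most sensitive layer ends up at 8-bit.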
## Site map

- https://mlx-optiq.com/: overview
- https://mlx-optiq.com/models: all 12 pre-built quants
- https://mlx-optiq.com/docs/: documentation index
- https://mlx-optiq.com/docs/install: installation
- https://mlx-optiq.com/docs/quants: using pre-built quants
- https://mlx-optiq.com/docs/sensitivity: methodology
- https://mlx-optiq.com/docs/qwen3.5: Qwen3.5 family guide
- https://mlx-optiq.com/docs/qwen3.6: Qwen3.6 family guide
- https://mlx-optiq.com/docs/gemma-4: Gemma-4 family guide
- https://mlx-optiq.com/docs/finetune: LoRA fine-tuning
- https://mlx-optiq.com/docs/serve: KV-quant serving
- https://mlx-optiq.com/docs/cli: CLI reference
- https://mlx-optiq.com/blog/: engineering posts and research
- https://mlx-optiq.com/blog/gemma-4-support: Gemma-4 family launch (e2b/e4b/26B-A4B/31B), +32 pp recovery on e4b
- https://mlx-optiq.com/blog/turboquant-postmortem: postmortem on the rotated-space KV experiment we built but didn't ship
- https://mlx-optiq.com/blog/sensitivity-aware-lora: LoRA fine-tuning with rank scaled by per-layer bit assignment
- https://mlx-optiq.com/blog/not-all-layers-are-equal: research foundation: per-layer sensitivity for weights and KV cache
- https://mlx-optiq.com/blog/eval-framework: two-stage eval (smoketest + benchmarks) + Capability Score
- https://mlx-optiq.com/blog/calibration-mix: six-domain calibration mix methodology

## Distribution

- PyPI: https://pypi.org/project/mlx-optiq/
- Hugging Face quants: https://huggingface.co/mlx-community