Quantize, fine-tune
and serve LLMs
entirely on Apple Silicon.
Run large language models natively on a Mac. Per-layer sensitivity analysis for mixed-precision weights. LoRA fine-tuning that respects the bit budget. A server that speaks both OpenAI and Anthropic APIs (point Claude Code at your local quant). No GPU cluster, no API key.
$ pip install mlx-optiq
Drop-in 4-bit quants. Same weights, smarter bits.
Ten production mlx-optiq-quantized LLMs on Hugging Face. Qwen3.5, Qwen3.6 and Gemma-4 families, from 0.8 B dense to 35 B-A3B mixture-of-experts. They load directly into stock mlx-lm. No special runtime.
gemma-4-31B-it-OptiQ-4bit
The largest single quant we ship. 31 B parameters in 18.1 GB at 96.0% GSM8K, within noise of the original. Use it when you want the strongest model that still fits on a 36 GB Mac.
gemma-4-26B-A4B-it-OptiQ-4bit
Mixture-of-experts at 14.9 GB. mlx-optiq quantizes router and expert blocks independently, so each expert lands at its own per-layer bit-width.
Qwen3.6-27B-OptiQ-4bit
Frontier-class reasoning at 15.7 GB. Fits comfortably on a 36 GB Mac with room left for fine-tuning.
Qwen3.5-9B-OptiQ-4bit
The default daily-driver. 9 B parameters in 5.6 GB at GSM8K parity with the original. Long context to 64 k via mixed-precision KV.
From zero to a serving LLM in three commands.
Each step is reversible and works with stock MLX tools. mlx-optiq is additive. Skip any of these and you still have a working pipeline.
Install
Pure Python. Pulls in mlx, mlx-lm and huggingface-hub. Python 3.11+ on Apple Silicon.
$ pip install mlx-optiq
Use a pre-built quant
Pre-built mlx-optiq quants load with stock mlx-lm. Per-layer bit assignment is recorded in the model metadata. No special loader required.
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok,
               prompt="Explain mixed-precision quantization.",
               max_tokens=200)
print(out)
Serve with mixed-precision KV
The KV cache is its own sensitivity problem. optiq kv-cache measures it once per model; optiq serve serves with the resulting per-layer config behind an OpenAI-compatible API.
# 1-2 min, once per model
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 5.0 -o ./kv

# OpenAI + Anthropic compatible server on :8080
# /v1/chat/completions (OpenAI)
# /v1/messages (Anthropic; works with Claude Code, anthropic SDK, etc.)
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/kv_config.json \
    --port 8080
One sensitivity signal. A whole toolkit around it.
A single per-layer KL-divergence pass drives weight, KV-cache and LoRA-rank allocation. The rest of the toolkit (hot-swap adapters, dual-protocol serving, sandboxed code execution) sits around that core.
Mixed-precision weights
Per-layer KL on calibration data + greedy knapsack. Sensitive layers stay at higher precision; the rest get aggressively quantized at the same average BPW.
Mixed-precision KV cache
Independent sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than the average. Uniform 4-bit KV is catastrophic.
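As a back-of-the-envelope sketch, here is how a per-layer sensitivity profile could be turned into a KV bit map. The helper and numbers are illustrative, not mlx-optiq's actual allocator:

```python
# Hypothetical helper: upgrade the most sensitive KV layers to the high
# bit-width until the average hits the target. Not mlx-optiq's real code.
def kv_bits(sensitivities, target_bits=5.0, low=4, high=8):
    n = len(sensitivities)
    # Number of layers that can hold the high bit-width within the budget.
    upgrades = int(n * (target_bits - low) / (high - low))
    ranked = sorted(range(n), key=lambda i: sensitivities[i], reverse=True)
    bits = {i: low for i in range(n)}
    for i in ranked[:upgrades]:
        bits[i] = high
    return bits

# Layer 0 dominates the sensitivity profile, so it is upgraded first.
sens = [5.6, 0.10, 0.08, 0.12, 0.09, 0.11, 0.10, 0.10]
print(kv_bits(sens))  # layer 0 (and the next most sensitive) land at 8-bit
```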
Sensitivity-aware LoRA
optiq lora train assigns higher adapter rank to layers mlx-optiq kept at 8-bit. Same parameter budget, smarter capacity allocation.
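The rank-scaling idea fits in a few lines. This is a hypothetical sketch; the real optiq lora train heuristic may weight layers differently:

```python
# Hypothetical: scale per-layer LoRA rank by the bit-width mlx-optiq kept,
# so 8-bit (sensitive) layers get more adapter capacity at the same budget.
def lora_ranks(layer_bits, total_rank_budget):
    scale = total_rank_budget / sum(layer_bits.values())
    # Floor of rank 2 so no layer is starved entirely.
    return {name: max(2, round(bits * scale))
            for name, bits in layer_bits.items()}

bits = {"layers.0.attn": 8, "layers.1.attn": 4,
        "layers.2.attn": 4, "layers.3.attn": 8}
print(lora_ranks(bits, total_rank_budget=48))
# 8-bit layers get rank 16, 4-bit layers rank 8: same total, smarter split
```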
Hot-swap adapters
Mounted-LoRA primitive: keep N adapters resident on one base model, switch per request via ContextVar. No reload, no GPU re-upload.
OpenAI and Anthropic API
One server, both protocols. /v1/chat/completions for OpenAI clients; /v1/messages for Anthropic clients. Point Claude Code at your local quant.
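As a concrete illustration, the same prompt can target either endpoint. The request bodies below follow the public OpenAI and Anthropic chat wire formats; any HTTP client works, assuming the server from the quick-start is listening on :8080:

```python
MODEL = "mlx-community/Qwen3.5-9B-OptiQ-4bit"

def openai_body(prompt: str) -> dict:
    # POST to http://localhost:8080/v1/chat/completions
    return {"model": MODEL,
            "messages": [{"role": "user", "content": prompt}]}

def anthropic_body(prompt: str) -> dict:
    # POST to http://localhost:8080/v1/messages
    return {"model": MODEL,
            "max_tokens": 256,  # required by the Anthropic Messages API
            "messages": [{"role": "user", "content": prompt}]}
```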
Sandboxed code execution
optiq.sandbox.run_python runs model-generated Python with no network, contained file writes, and wall-time / memory caps. Drop it into your agent's tool-call handler.
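For intuition, here is what a minimal version of such a sandbox could look like using only the standard library; the real run_python applies stricter isolation (network blocking, memory caps) than this sketch:

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: float = 5.0) -> str:
    """Run model-generated Python in a fresh, isolated process."""
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True, text=True,
            cwd=workdir,        # contain relative file writes
            env={},             # no inherited environment
            timeout=timeout_s,  # wall-time cap (raises TimeoutExpired)
        )
    return proc.stdout

print(run_untrusted("print(2 + 2)"))  # prints "4"
```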
Sensitivity, in three steps.
Uniform 4-bit quantization treats every layer the same, but layers are not the same. mlx-optiq measures, then allocates.
1. Measure
For each layer, simulate-quantize just that layer at each candidate bit-width, forward-pass the calibration data, and measure the KL divergence between the perturbed logits and the reference logits. Repeat for every layer; you now have a (layer, bits) → quality-cost table.
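A fully self-contained toy version of the measurement pass; a two-weight "model" and crude uniform rounding stand in for the real forward pass and quantizer:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def kl(p, q, eps=1e-12):
    # KL divergence between reference and perturbed output distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def simulate_quant(w, bits):
    # Crude uniform rounding over [-1, 1]: the per-layer perturbation.
    levels = 2 ** bits - 1
    return [round((x + 1) / 2 * levels) / levels * 2 - 1 for x in w]

# Toy "model": two layers, logits are per-class sums of layer weights.
weights = {"layer0": [0.31, -0.77], "layer1": [0.05, 0.02]}
logits = lambda ws: [sum(w[i] for w in ws.values()) for i in range(2)]

ref = softmax(logits(weights))
cost = {}
for name in weights:
    for bits in (4, 8):
        perturbed = dict(weights, **{name: simulate_quant(weights[name], bits)})
        cost[(name, bits)] = kl(ref, softmax(logits(perturbed)))
# cost is now the (layer, bits) -> quality-cost table that step 2 consumes;
# 8-bit entries perturb the output less than their 4-bit counterparts.
```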
2. Allocate
Greedy knapsack on the table: start every layer at the lowest bit-width,
then greedily upgrade the layer that buys the most KL-reduction per extra bit
until the average bit-budget is exhausted. Layers like lm_head
and the first/last attention blocks are protected at 8-bit by default.
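Under the same toy assumptions (a cost table keyed by (layer, bits) and two candidate widths), the greedy upgrade loop looks roughly like this. An illustrative allocator, not mlx-optiq's implementation:

```python
# Start everyone at the low bit-width, then upgrade by best
# KL-reduction-per-extra-bit until the average-bit budget runs out.
# Protected layers (e.g. lm_head) start pinned at the high bit-width.
def allocate(cost, layers, low=4, high=8, target_bpw=6.0, protected=()):
    bits = {l: (high if l in protected else low) for l in layers}
    # KL saved per extra bit if this layer is upgraded from low to high.
    gain = lambda l: (cost[(l, low)] - cost[(l, high)]) / (high - low)
    for l in sorted((l for l in layers if bits[l] == low),
                    key=gain, reverse=True):
        if (sum(bits.values()) + (high - low)) / len(layers) > target_bpw:
            break  # next upgrade would exceed the average-bit budget
        bits[l] = high
    return bits

cost = {("a", 4): 1.0, ("a", 8): 0.0, ("b", 4): 0.5, ("b", 8): 0.0,
        ("c", 4): 0.1, ("c", 8): 0.0, ("d", 4): 0.05, ("d", 8): 0.0}
print(allocate(cost, ["a", "b", "c", "d"]))
# the two most sensitive layers ("a", "b") land at 8-bit within 6.0 BPW
```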
3. Convert
Hand the per-layer bit map to mlx_lm.convert as a quant
predicate. The output is a standard MLX checkpoint that loads anywhere
stock mlx-lm loads, with sensitivity metadata stashed on
the side for downstream LoRA training.
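The predicate mechanics can be sketched as follows. The bit map is illustrative, and the callback signature is an assumption (recent mlx-lm versions accept a quant_predicate in convert, but check your version), so the convert call itself is left commented out:

```python
# bit_map would come from the allocator in step 2; this one is illustrative.
bit_map = {"model.layers.0.self_attn.q_proj": 8}

def quant_predicate(path, module, config):
    """Per-layer quant settings for mlx_lm.convert (assumed signature)."""
    bits = bit_map.get(path)
    if bits is None:
        return True  # defer to the default bits for unlisted layers
    return {"bits": bits, "group_size": 64}

# from mlx_lm import convert
# convert("Qwen/Qwen3.5-9B", mlx_path="optiq_output/Qwen3.5-9B",
#         quantize=True, quant_predicate=quant_predicate)
```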
# Auto-routes between bf16 and uniform-4-bit reference
# based on available RAM.
$ optiq convert Qwen/Qwen3.5-9B \
    --target-bpw 4.5 \
    --candidate-bits 4,8 \
    --reference auto \
    -o optiq_output/Qwen3.5-9B
--reference auto
picks bf16 if it fits, otherwise falls back to a uniform-4-bit baseline
with bf16-streaming probes, so 27 B+ models still get a calibration-driven
signal on a 36 GB Mac. The full methodology lives in
our research write-up →
Where mlx-optiq sits among the Mac LLM options.
A snapshot of how the popular paths stack up on the things that actually move quality and speed on Apple Silicon. None of these are wrong; they're optimizing different axes.
| | mlx-optiq | mlx-lm (stock) | llama.cpp Q4_K_M | Unsloth dynamic GGUFs |
|---|---|---|---|---|
| Per-layer mixed-precision weights | Yes, calibration-driven | Uniform 4-bit | Block-wise K-quant | Block-wise + recovery passes |
| Per-layer mixed-precision KV cache | Yes | Uniform 4 / 8 / fp16 | Group-wise int8 only | Inherits llama.cpp's |
| Sensitivity-aware LoRA fine-tuning | Rank scaled by per-layer bits | Constant rank LoRA | Inference only | CUDA-first; Mac path via llama.cpp adapter loading |
| OpenAI- and Anthropic-compatible server | One process, both | OpenAI only | llama-server (OpenAI shim) | No native server |
| Native MLX runtime on Apple Silicon | Yes | Yes | Metal directly, not MLX | Runs through llama.cpp on Mac |
| Sandboxed code execution helper | Ships in package | — | — | — |
Make your Mac an LLM workstation.
Pick a model, get a snippet, ship it. The docs cover every supported family, fine-tuning recipes, and the OpenAI-compatible serving stack.