Quantize, fine-tune
and serve LLMs
entirely on Apple Silicon.
Run large language models natively on a Mac. Per-layer sensitivity analysis for mixed-precision weights. LoRA fine-tuning that respects the bit budget. A server that speaks both OpenAI and Anthropic APIs (point Claude Code at your local quant). No GPU cluster, no API key.
$ pip install mlx-optiq
Drop-in 4-bit quants. Same weights, smarter bits.
Ten production mlx-optiq-quantized LLMs on Hugging Face. Qwen3.5, Qwen3.6 and Gemma-4 families, from 0.8 B dense to 35 B-A3B mixture-of-experts. They load directly into stock mlx-lm. No special runtime.
gemma-4-31B-it-OptiQ-4bit
The largest single quant we ship. 31 B parameters in 18.1 GB at 96.0% GSM8K, within noise of the original. Use it when you want the strongest model that still fits on a 36 GB Mac.
gemma-4-26B-A4B-it-OptiQ-4bit
Mixture-of-experts at 14.9 GB. mlx-optiq quantizes router and expert blocks independently, so each expert lands at its own per-layer bit-width.
Qwen3.6-27B-OptiQ-4bit
Frontier-class reasoning at 15.7 GB. Fits comfortably on a 36 GB Mac with room left for fine-tuning.
Qwen3.5-9B-OptiQ-4bit
The default daily-driver. 9 B parameters in 5.6 GB at GSM8K parity with the original. Long context to 64 k via mixed-precision KV.
From zero to a serving LLM in three commands.
Each step is reversible and works with stock MLX tools. mlx-optiq is additive. Skip any of these and you still have a working pipeline.
Install
Pure Python. Pulls in mlx, mlx-lm and huggingface-hub. Python 3.11+ on Apple Silicon.
$ pip install mlx-optiq
Use a pre-built quant
Pre-built mlx-optiq quants load with stock mlx-lm. Per-layer bit assignment is recorded in the model metadata. No special loader required.
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok,
               prompt="Explain mixed-precision quantization.",
               max_tokens=200)
print(out)
Serve with mixed-precision KV
The KV cache is its own sensitivity problem. optiq kv-cache measures it once per model; optiq serve serves with the resulting per-layer config behind an OpenAI-compatible API.
# 1-2 min, once per model
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 5.0 -o ./kv

# OpenAI + Anthropic compatible server on :8080
# /v1/chat/completions (OpenAI)
# /v1/messages (Anthropic; works with Claude Code, anthropic SDK, etc.)
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/kv_config.json \
    --port 8080
One sensitivity signal. A whole toolkit around it.
A single per-layer KL-divergence pass drives weight, KV-cache and LoRA-rank allocation. The rest of the toolkit (hot-swap adapters, dual-protocol serving, sandboxed code execution) sits around that core.
Mixed-precision weights
Per-layer KL on calibration data + greedy knapsack. Sensitive layers stay at higher precision; the rest get aggressively quantized at the same average BPW.
Mixed-precision KV cache
Independent sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than the average. Uniform 4-bit KV is catastrophic.
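As a back-of-the-envelope sketch, here is how a per-layer sensitivity profile could be turned into a KV bit map. The helper and numbers are illustrative, not mlx-optiq's actual allocator:

```python
# Hypothetical helper: upgrade the most sensitive KV layers to the high
# bit-width until the average hits the target. Not mlx-optiq's real code.
def kv_bits(sensitivities, target_bits=5.0, low=4, high=8):
    n = len(sensitivities)
    # Number of layers that can hold the high bit-width within the budget.
    upgrades = int(n * (target_bits - low) / (high - low))
    ranked = sorted(range(n), key=lambda i: sensitivities[i], reverse=True)
    bits = {i: low for i in range(n)}
    for i in ranked[:upgrades]:
        bits[i] = high
    return bits

# Layer 0 dominates the sensitivity profile, so it is upgraded first.
sens = [5.6, 0.10, 0.08, 0.12, 0.09, 0.11, 0.10, 0.10]
print(kv_bits(sens))  # layer 0 (and the next most sensitive) land at 8-bit
```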
Sensitivity-aware LoRA
optiq lora train assigns higher adapter rank to layers mlx-optiq kept at 8-bit. Same parameter budget, smarter capacity allocation.
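The rank-scaling idea fits in a few lines. This is a hypothetical sketch; the real optiq lora train heuristic may weight layers differently:

```python
# Hypothetical: scale per-layer LoRA rank by the bit-width mlx-optiq kept,
# so 8-bit (sensitive) layers get more adapter capacity at the same budget.
def lora_ranks(layer_bits, total_rank_budget):
    scale = total_rank_budget / sum(layer_bits.values())
    # Floor of rank 2 so no layer is starved entirely.
    return {name: max(2, round(bits * scale))
            for name, bits in layer_bits.items()}

bits = {"layers.0.attn": 8, "layers.1.attn": 4,
        "layers.2.attn": 4, "layers.3.attn": 8}
print(lora_ranks(bits, total_rank_budget=48))
# 8-bit layers get rank 16, 4-bit layers rank 8: same total, smarter split
```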
Hot-swap adapters
Mounted-LoRA primitive: keep N adapters resident on one base model, switch per request via ContextVar. No reload, no GPU re-upload.
OpenAI and Anthropic API
One server, both protocols. /v1/chat/completions for OpenAI clients; /v1/messages for Anthropic clients. Point Claude Code at your local quant.
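As a concrete illustration, the same prompt can target either endpoint. The request bodies below follow the public OpenAI and Anthropic chat wire formats; any HTTP client works, assuming the server from the quick-start is listening on :8080:

```python
MODEL = "mlx-community/Qwen3.5-9B-OptiQ-4bit"

def openai_body(prompt: str) -> dict:
    # POST to http://localhost:8080/v1/chat/completions
    return {"model": MODEL,
            "messages": [{"role": "user", "content": prompt}]}

def anthropic_body(prompt: str) -> dict:
    # POST to http://localhost:8080/v1/messages
    return {"model": MODEL,
            "max_tokens": 256,  # required by the Anthropic Messages API
            "messages": [{"role": "user", "content": prompt}]}
```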
Sandboxed code execution
optiq.sandbox.run_python runs model-generated Python with no network, contained file writes, and wall-time / memory caps. Drop it into your agent's tool-call handler.
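For intuition, here is what a minimal version of such a sandbox could look like using only the standard library; the real run_python applies stricter isolation (network blocking, memory caps) than this sketch:

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: float = 5.0) -> str:
    """Run model-generated Python in a fresh, isolated process."""
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True, text=True,
            cwd=workdir,        # contain relative file writes
            env={},             # no inherited environment
            timeout=timeout_s,  # wall-time cap (raises TimeoutExpired)
        )
    return proc.stdout

print(run_untrusted("print(2 + 2)"))  # prints "4"
```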
Sensitivity, in three steps.
Uniform 4-bit quantization treats every layer the same, but layers are not the same. mlx-optiq measures, then allocates.
1. Measure
For each layer, simulate-quantize just that layer at each candidate bit-width, forward-pass the calibration data, and measure the KL divergence between the perturbed logits and the reference logits. Repeat for every layer; you now have a (layer, bits) → quality-cost table.
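A fully self-contained toy version of the measurement pass; a two-weight "model" and crude uniform rounding stand in for the real forward pass and quantizer:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def kl(p, q, eps=1e-12):
    # KL divergence between reference and perturbed output distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def simulate_quant(w, bits):
    # Crude uniform rounding over [-1, 1]: the per-layer perturbation.
    levels = 2 ** bits - 1
    return [round((x + 1) / 2 * levels) / levels * 2 - 1 for x in w]

# Toy "model": two layers, logits are per-class sums of layer weights.
weights = {"layer0": [0.31, -0.77], "layer1": [0.05, 0.02]}
logits = lambda ws: [sum(w[i] for w in ws.values()) for i in range(2)]

ref = softmax(logits(weights))
cost = {}
for name in weights:
    for bits in (4, 8):
        perturbed = dict(weights, **{name: simulate_quant(weights[name], bits)})
        cost[(name, bits)] = kl(ref, softmax(logits(perturbed)))
# cost is now the (layer, bits) -> quality-cost table that step 2 consumes;
# 8-bit entries perturb the output less than their 4-bit counterparts.
```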
2. Allocate
Greedy knapsack on the table: start every layer at the lowest bit-width,
then greedily upgrade the layer that buys the most KL-reduction per extra bit
until the average bit-budget is exhausted. Layers like lm_head
and the first/last attention blocks are protected at 8-bit by default.
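Under the same toy assumptions (a cost table keyed by (layer, bits) and two candidate widths), the greedy upgrade loop looks roughly like this. An illustrative allocator, not mlx-optiq's implementation:

```python
# Start everyone at the low bit-width, then upgrade by best
# KL-reduction-per-extra-bit until the average-bit budget runs out.
# Protected layers (e.g. lm_head) start pinned at the high bit-width.
def allocate(cost, layers, low=4, high=8, target_bpw=6.0, protected=()):
    bits = {l: (high if l in protected else low) for l in layers}
    # KL saved per extra bit if this layer is upgraded from low to high.
    gain = lambda l: (cost[(l, low)] - cost[(l, high)]) / (high - low)
    for l in sorted((l for l in layers if bits[l] == low),
                    key=gain, reverse=True):
        if (sum(bits.values()) + (high - low)) / len(layers) > target_bpw:
            break  # next upgrade would exceed the average-bit budget
        bits[l] = high
    return bits

cost = {("a", 4): 1.0, ("a", 8): 0.0, ("b", 4): 0.5, ("b", 8): 0.0,
        ("c", 4): 0.1, ("c", 8): 0.0, ("d", 4): 0.05, ("d", 8): 0.0}
print(allocate(cost, ["a", "b", "c", "d"]))
# the two most sensitive layers ("a", "b") land at 8-bit within 6.0 BPW
```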
3. Convert
Hand the per-layer bit map to mlx_lm.convert as a quant
predicate. The output is a standard MLX checkpoint that loads anywhere
stock mlx-lm loads, with sensitivity metadata stashed on
the side for downstream LoRA training.
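The predicate mechanics can be sketched as follows. The bit map is illustrative, and the callback signature is an assumption (recent mlx-lm versions accept a quant_predicate in convert, but check your version), so the convert call itself is left commented out:

```python
# bit_map would come from the allocator in step 2; this one is illustrative.
bit_map = {"model.layers.0.self_attn.q_proj": 8}

def quant_predicate(path, module, config):
    """Per-layer quant settings for mlx_lm.convert (assumed signature)."""
    bits = bit_map.get(path)
    if bits is None:
        return True  # defer to the default bits for unlisted layers
    return {"bits": bits, "group_size": 64}

# from mlx_lm import convert
# convert("Qwen/Qwen3.5-9B", mlx_path="optiq_output/Qwen3.5-9B",
#         quantize=True, quant_predicate=quant_predicate)
```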
# Auto-routes between bf16 and uniform-4-bit reference
# based on available RAM.
$ optiq convert Qwen/Qwen3.5-9B \
    --target-bpw 4.5 \
    --candidate-bits 4,8 \
    --reference auto \
    -o optiq_output/Qwen3.5-9B
--reference auto
picks bf16 if it fits, otherwise falls back to a uniform-4-bit baseline
with bf16-streaming probes, so 27 B+ models still get a calibration-driven
signal on a 36 GB Mac. The full methodology lives in
our research write-up →
Where mlx-optiq sits among the Mac LLM options.
A snapshot of how the popular paths stack up on the things that actually move quality and speed on Apple Silicon. None of these are wrong; they're optimizing different axes.
| | mlx-optiq | mlx-lm (stock) | llama.cpp Q4_K_M | Unsloth dynamic GGUFs |
|---|---|---|---|---|
| Per-layer mixed-precision weights | Yes, calibration-driven | Uniform 4-bit | Block-wise K-quant | Block-wise + recovery passes |
| Per-layer mixed-precision KV cache | Yes | Uniform 4 / 8 / fp16 | Group-wise int8 only | Inherits llama.cpp's |
| Sensitivity-aware LoRA fine-tuning | Rank scaled by per-layer bits | Constant rank LoRA | Inference only | CUDA-first; Mac path via llama.cpp adapter loading |
| OpenAI- and Anthropic-compatible server | One process, both | OpenAI only | llama-server (OpenAI shim) | No native server |
| Native MLX runtime on Apple Silicon | Yes | Yes | Metal directly, not MLX | Runs through llama.cpp on Mac |
| Sandboxed code execution helper | Ships in package | — | — | — |
Make your Mac an LLM workstation.
Pick a model, get a snippet, ship it. The docs cover every supported family, fine-tuning recipes, and the OpenAI-compatible serving stack.