
KV-quant serving

optiq serve is a drop-in replacement for mlx_lm.server. It exposes both the OpenAI /v1/chat/completions endpoint and the Anthropic /v1/messages endpoint from the same process. Point Claude Code, the OpenAI SDK, the Anthropic SDK, or plain curl at the same local URL. On top of that: sensitivity-aware quantized KV cache for long-context throughput, and mounted LoRA adapters that swap per request.

Quickstart

terminalbash
# Stock fp16 KV serving. Works for any mlx-optiq quant.
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --port 8080

Then call it like any OpenAI endpoint:

curl_chat.shbash
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
      "messages": [{"role": "user", "content": "What is RoPE?"}],
      "max_tokens": 300,
      "stream": false
    }'

Mixed-precision KV cache

By default, optiq serve uses an fp16 KV cache. For long context (16k+) on Qwen3.5/3.6, a sensitivity-driven quantized KV cache delivers a 30-60% decode speedup with no quality loss. Why mixed precision matters for KV (and why uniform 4-bit KV is catastrophic on layer 0) is covered in Not All Layers Are Equal; for the alternative we built but didn't ship, see the TurboQuant postmortem.

Step 1: measure

terminalbash
# 1-2 min. Once per model.
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 5.0 \
    --candidate-bits 4,8 \
    -o ./kv/qwen35_9b

# writes ./kv/qwen35_9b/kv_config.json
# [{"layer_idx": 3, "bits": 8, "group_size": 64}, ...]

Step 2: serve

terminalbash
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/qwen35_9b/kv_config.json \
    --max-tokens 32768 --temp 0.6 --top-p 0.95

Why this works

Layer 0's KV is often 56× more sensitive than the layer average. Uniform 4-bit KV is catastrophic. A single 8-bit layer (the most KV-sensitive one, often layer 3 in Qwen3.5's hybrid attention) protects quality while every other layer runs 4-bit. Apple Silicon's mx.quantized_matmul also handles the 8-bit fast path more efficiently than 4-bit, so protecting that one layer also flips it onto a faster kernel. Quality and speed point the same way.
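
The measured config is plain JSON, so you can sanity-check it before serving. A minimal sketch, assuming kv_config.json holds one {"layer_idx", "bits", "group_size"} entry per layer (the entry format is shown in step 1; one-entry-per-layer is an assumption):

inspect_kv_config.pypython
import json
from collections import Counter

# Entries written by `optiq kv-cache` (format shown in step 1).
# Assumption: one entry per transformer layer.
with open("./kv/qwen35_9b/kv_config.json") as f:
    layers = json.load(f)

bits = [entry["bits"] for entry in layers]
print(f"{len(layers)} layers, bit histogram: {dict(Counter(bits))}")
print(f"average KV bits ≈ {sum(bits) / len(bits):.2f}  (compare to --target-bits)")
print("8-bit (protected) layers:", [e["layer_idx"] for e in layers if e["bits"] == 8])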

LoRA adapter at serve time

The CLI accepts one adapter per server process via --adapter. The argument is either a HuggingFace repo id (auto-downloaded into the OptIQ adapter cache) or a local directory:

terminalbash
# Local path
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter ./my_adapter

# HF repo id (downloaded on first use)
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter your-org/my-adapter

OptIQ-trained adapters surface their optiq_lora_config.json sidecar (rank, rank distribution, scaling mode) in the startup log.
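
To peek at the same information without starting the server, read the sidecar directly. A rough sketch; the exact JSON key names are assumptions (only rank, rank distribution, and scaling mode are documented above):

inspect_adapter.pypython
import json
from pathlib import Path

# Hypothetical key names; only rank, rank distribution, and scaling mode
# are documented as sidecar contents.
sidecar = Path("./my_adapter/optiq_lora_config.json")
if sidecar.exists():
    cfg = json.loads(sidecar.read_text())
    print("rank:             ", cfg.get("rank"))
    print("rank distribution:", cfg.get("rank_distribution"))
    print("scaling mode:     ", cfg.get("scaling_mode"))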

Multi-adapter, hot-swap at the Python layer

The mounted-LoRA primitive supports multiple adapters resident at once, switched per call via a ContextVar. This is a programmatic API, not a CLI flag — embed it in your own server or notebook to serve N adapters from one base:

mount.pypython
from mlx_lm import load, generate
from optiq.adapters.mount import (
    prepare_model_for_mounted_lora,
    mount_adapter_on_model,
    AdapterActivation,
)

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prepare_model_for_mounted_lora(model)
mount_adapter_on_model(model, "agent-A", "./adapter_a")
mount_adapter_on_model(model, "agent-B", "./adapter_b")

with AdapterActivation("agent-A"):
    out_a = generate(model, tok, prompt=p, max_tokens=100)

with AdapterActivation("agent-B"):
    out_b = generate(model, tok, prompt=p, max_tokens=100)

Mounted adapters stay separate from the base (unlike mlx-lm's load_adapters, which merges weights). The ContextVar means concurrent asyncio tasks or threads with different active adapters don't step on each other.
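
A minimal sketch of that pattern (illustrative, not part of the OptIQ API): each task activates its own adapter, and asyncio.to_thread copies the current contextvars context into the worker thread, so the activation travels with the generate call.

concurrent_mount.pypython
import asyncio
from mlx_lm import generate
from optiq.adapters.mount import AdapterActivation

# model, tok, and the two mounted adapters come from the mount.py block above.
async def answer(adapter_name: str, prompt: str) -> str:
    with AdapterActivation(adapter_name):
        # asyncio.to_thread copies the contextvars context, so the adapter
        # chosen here stays active inside the worker thread.
        return await asyncio.to_thread(
            generate, model, tok, prompt=prompt, max_tokens=100
        )

async def main():
    out_a, out_b = await asyncio.gather(
        answer("agent-A", "Summarize this repo."),
        answer("agent-B", "Write a haiku about KV caches."),
    )
    print(out_a, out_b, sep="\n---\n")

asyncio.run(main())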

Memory math

One Qwen3.5-9B-OptiQ-4bit base is ~5.6 GB. Each LoRA adapter is ~50 MB. 10 adapters co-resident ≈ 6.1 GB, vs ~56 GB if you spun up one full model copy per adapter. Switching is free. No weight reload, no GPU re-upload.
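
The same arithmetic as a quick check (the sizes are the approximate figures above):

memory_math.pypython
base_gb, adapter_gb, n_adapters = 5.6, 0.05, 10   # ~5.6 GB base, ~50 MB per adapter

mounted  = base_gb + n_adapters * adapter_gb       # one base, N adapters resident
separate = n_adapters * base_gb                    # one full model copy per adapter
print(f"mounted: ~{mounted:.1f} GB   separate copies: ~{separate:.0f} GB")
# mounted: ~6.1 GB   separate copies: ~56 GB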

Embedding the server in your own process

If you want to install the OptIQ KV-cache hooks and Anthropic endpoint into a self-managed mlx_lm.server launch:

embed_serve.pypython
from optiq.serve import _load_kv_config, install_mixed_kv
from optiq.anthropic_server import install_anthropic_endpoint
import sys

# Per-layer mixed-precision KV cache
configs = _load_kv_config("./kv/qwen35_9b/kv_config.json")
install_mixed_kv(kv_configs=configs, quantized_kv_start=0)

# Anthropic /v1/messages alongside OpenAI /v1/chat/completions
install_anthropic_endpoint()

# Hand off to mlx_lm.server
sys.argv = ["mlx_lm.server",
            "--model", "mlx-community/Qwen3.5-9B-OptiQ-4bit",
            "--port", "8080"]
from mlx_lm.server import main
main()

OpenAI client compatibility

Use the official openai Python client by pointing it at your local server:

openai_client.pypython
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",  # local server, but key is required
)

resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

Anthropic API: point Claude Code at your local quant

The same server simultaneously serves Anthropic's /v1/messages endpoint, returning the exact response shape Claude clients expect. This means you can drive a local mlx-optiq quant from any tool that speaks the Anthropic API: Claude Code, the official anthropic Python SDK, or your own integrations.

terminalbash
# Same optiq serve invocation. No extra flag needed.
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --port 8080

Anthropic SDK against your local quant:

anthropic_client.pypython
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8080",
    api_key="not-used",
)
resp = client.messages.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.content[0].text)
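
Streaming goes through the same client: the shim translates streaming events (see "What's translated" below), so the Anthropic SDK's standard stream helper should work as usual:

anthropic_stream.pypython
# Continues from anthropic_client.py above (same `client`).
with client.messages.stream(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)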

Claude Code via environment variables:

claude_code.shbash
$ export ANTHROPIC_BASE_URL="http://localhost:8080"
$ export ANTHROPIC_API_KEY="not-used"
$ claude    # now driven by your local quant

What's translated

The shim accepts Anthropic-shaped requests, translates them into the OpenAI request the underlying mlx-lm engine wants, runs generation, and translates the response back into Anthropic shape, including streaming events. system, messages, max_tokens, stream, temperature, and top_p all work. Tool-use parameters are accepted but route through the same generation path (the underlying model does what it does; there's no server-side function-calling router).
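
For intuition, the request-side mapping looks roughly like this. A sketch of the shapes only, not the server's code; content-block flattening and streaming-event translation are elided:

translate_sketch.pypython
# Rough shape mapping only; not the server's implementation.
def anthropic_to_openai(req: dict) -> dict:
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI-style chat expects it as the first message.
    if req.get("system"):
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req["messages"])

    out = {
        "model": req["model"],
        "messages": messages,
        "max_tokens": req["max_tokens"],
        "stream": req.get("stream", False),
    }
    for key in ("temperature", "top_p"):   # passed through when present
        if key in req:
            out[key] = req[key]
    return out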

Production tips

  • Bind to 127.0.0.1 for local-only use, or put the server behind a reverse proxy. Don't expose a raw 0.0.0.0 binding to the public internet; there's no auth.
  • Set max-concurrency. The MLX runtime is single-process; concurrent generations share the GPU and degrade tail latency.
  • Tune --max-tokens conservatively. Each in-flight request keeps a KV cache resident; long contexts dominate memory.
  • Pre-load adapters. Loading a new adapter mid-flight stalls all in-flight requests. Mount everything you need at startup.