mlx-optiq

KV-quant serving

optiq serve is a drop-in replacement for mlx_lm.server. It exposes both the OpenAI /v1/chat/completions endpoint and the Anthropic /v1/messages endpoint from the same process, so you can point Claude Code, the OpenAI SDK, the Anthropic SDK, or plain curl at the same local URL. On top of that, it adds a sensitivity-aware quantized KV cache for long-context throughput, structured JSON/regex output, tool-call healing, per-request model switching, and mounted LoRA adapters (one per server process from the CLI, switchable per call through the Python API).

Quickstart

terminal (bash)
# Stock fp16 KV serving. Works for any mlx-optiq quant.
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --port 8080

Then call it like any OpenAI endpoint:

curl_chat.sh (bash)
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
      "messages": [{"role": "user", "content": "What is RoPE?"}],
      "max_tokens": 300,
      "stream": false
    }'

Mixed-precision KV cache

By default, optiq serve keeps the KV cache in fp16. Pass --kv-bits 4 for uniform 4-bit KV, or --kv-config kv_config.json for OptiQ mixed-precision KV. Both options automatically install the streaming-converter and FlashAttention-2 patches that prevent stock mlx-lm's OOM at long context: on a 24 GB Mac at 32k context, peak memory drops from 16.35 GB (stock u4) to 7.60 GB (ours), which is 34% below the fp16 KV path. Speed is within ±2% of fp16 on Qwen3.5-9B-class models. On hash-hop with hops=3 on Qwen3.5-9B, OptiQ mixed-precision KV scores 32%, against 24% for uniform 4-bit and 36% for fp16: a 33% relative improvement over uniform 4-bit. For the engineering write-up, see Why u4 KV cache OOMs harder than fp16; the underlying research on per-layer KV sensitivity is in Not All Layers Are Equal.
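The percentages in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the fp16 peak is derived from the "34% below" statement rather than taken from a measurement.

```python
# Figures quoted in the paragraph above.
stock_u4_peak_gb = 16.35   # stock mlx-lm, uniform 4-bit KV, 32k context
optiq_peak_gb = 7.60       # patched quantized-KV path, same setup
fp16_score, mixed_score, u4_score = 36.0, 32.0, 24.0  # hash-hop hops=3, in percent

# Peak-memory reduction relative to stock uniform 4-bit.
reduction_vs_stock = 1 - optiq_peak_gb / stock_u4_peak_gb

# "34% below fp16" implies this fp16 peak (derived, not measured).
implied_fp16_peak_gb = optiq_peak_gb / (1 - 0.34)

# Relative gain of mixed precision over uniform 4-bit, and retention versus fp16.
relative_gain = mixed_score / u4_score - 1
mixed_retention = mixed_score / fp16_score
u4_retention = u4_score / fp16_score

print(f"peak reduction vs stock u4: {reduction_vs_stock:.1%}")
print(f"implied fp16 peak: {implied_fp16_peak_gb:.1f} GB")
print(f"relative gain over u4: {relative_gain:.0%}")
print(f"retention of fp16 score: mixed {mixed_retention:.0%}, u4 {u4_retention:.0%}")
```

Read this way, mixed precision keeps most of the fp16 score while uniform 4-bit keeps about two thirds of it.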

Step 1: measure

terminal (bash)
# 1-2 min. Once per model.
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 5.0 \
    --candidate-bits 4,8 \
    -o ./kv/qwen35_9b

# writes ./kv/qwen35_9b/kv_config.json
# [{"layer_idx": 3, "bits": 8, "group_size": 64}, ...]
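Before serving, it can help to sanity-check what the measurement step wrote. The sketch below assumes the file is a JSON list of entries with the layer_idx, bits, and group_size keys shown in the comment above; the helper names and the eight-layer example are hypothetical.

```python
import json

def load_kv_config(path):
    """Read a kv_config.json file (assumed to be a JSON list of per-layer entries)."""
    with open(path) as f:
        return json.load(f)

def summarize_kv_config(entries):
    """Return (average_bits, {bits: [layer_idx, ...]}) for a list of entries."""
    by_bits = {}
    for entry in entries:
        by_bits.setdefault(entry["bits"], []).append(entry["layer_idx"])
    average_bits = sum(entry["bits"] for entry in entries) / len(entries)
    return average_bits, by_bits

# Made-up eight-layer config in the same shape; real files will differ.
example = [
    {"layer_idx": i, "bits": 8 if i in (0, 3) else 4, "group_size": 64}
    for i in range(8)
]
average_bits, layers_by_bits = summarize_kv_config(example)
print(average_bits, layers_by_bits)
```

For a real file, pass the result of load_kv_config("./kv/qwen35_9b/kv_config.json") to summarize_kv_config and compare the average against the --target-bits value you requested.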

Step 2: serve

terminal (bash)
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/qwen35_9b/kv_config.json \
    --max-tokens 32768 --temp 0.6 --top-p 0.95

Why this works: Layer 0's KV is often 56× more sensitive than the layer average, so uniform 4-bit KV is catastrophic for quality. Keeping a single layer at 8 bits (the most KV-sensitive one, often layer 3 in Qwen3.5's hybrid attention) protects quality while every other layer runs at 4 bits. Apple Silicon's mx.quantized_matmul also handles the 8-bit fast path more efficiently than 4-bit, so protecting that one layer also moves it onto a faster kernel. Quality and speed point the same way.
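One way to turn per-layer sensitivities into a bit assignment is a greedy pass: start every layer at the low bit width and promote the most sensitive layers while the average stays within the target. This is a minimal sketch under that assumption, not necessarily the procedure optiq kv-cache uses; the function name and the sensitivity values are made up.

```python
def allocate_kv_bits(sensitivity, target_bits, low=4, high=8):
    """Greedy sketch: promote the most sensitive layers to `high` bits
    while the average bit width stays at or below `target_bits`."""
    n = len(sensitivity)
    bits = [low] * n
    # Visit layers from most to least sensitive.
    order = sorted(range(n), key=lambda i: sensitivity[i], reverse=True)
    for i in order:
        new_average = (sum(bits) + (high - low)) / n
        if new_average > target_bits:
            break  # the next promotion would exceed the budget
        bits[i] = high
    return bits

# Made-up sensitivities: one layer dominates, as in the note above.
sens = [56.0, 1.2, 0.8, 3.0, 0.9, 1.1, 0.7, 1.0]
print(allocate_kv_bits(sens, target_bits=4.5))
print(allocate_kv_bits(sens, target_bits=5.0))
```

With a tight budget only the dominant layer is promoted; loosening the budget promotes the next most sensitive layer as well.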

LoRA adapter at serve time

The CLI accepts one adapter per server process via --adapter. The argument is either a HuggingFace repo id (auto-downloaded into the OptiQ adapter cache) or a local directory:

terminal (bash)
# Local path
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter ./my_adapter

# HF repo id (downloaded on first use)
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter your-org/my-adapter

OptiQ-trained adapters surface their optiq_lora_config.json sidecar (rank, rank distribution, scaling mode) in the startup log.

Multi-adapter, hot-swap at the Python layer

The mounted-LoRA primitive supports multiple adapters resident at once, switched per call via a ContextVar. This is a programmatic API, not a CLI flag — embed it in your own server or notebook to serve N adapters from one base:

mount.py (python)
from mlx_lm import load, generate
from optiq.adapters.mount import (
    prepare_model_for_mounted_lora,
    mount_adapter_on_model,
    AdapterActivation,
)

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prepare_model_for_mounted_lora(model)
mount_adapter_on_model(model, "agent-A", "./adapter_a")
mount_adapter_on_model(model, "agent-B", "./adapter_b")

p = "Summarize this change in one sentence."  # example prompt

with AdapterActivation("agent-A"):
    out_a = generate(model, tok, prompt=p, max_tokens=100)

with AdapterActivation("agent-B"):
    out_b = generate(model, tok, prompt=p, max_tokens=100)

Mounted adapters stay separate from the base (unlike mlx-lm's load_adapters, which merges weights). The ContextVar means concurrent asyncio tasks or threads with different active adapters don't step on each other.
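The isolation property comes from how Python's contextvars module works: each asyncio task runs in its own copy of the context, so a value set in one task is invisible to the others. The standalone sketch below shows that mechanism with a hypothetical variable name; it does not use OptiQ's own ContextVar or load any model.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical stand-in for an "active adapter" variable.
active_adapter = ContextVar("active_adapter", default=None)

async def handle_request(adapter_name):
    active_adapter.set(adapter_name)  # visible only inside this task's context
    await asyncio.sleep(0)            # yield so the other task runs in between
    return active_adapter.get()       # still this task's own value

async def main():
    # gather() wraps each coroutine in a task with its own context copy.
    return await asyncio.gather(
        handle_request("agent-A"),
        handle_request("agent-B"),
    )

results = asyncio.run(main())
print(results)               # ['agent-A', 'agent-B']
print(active_adapter.get())  # None: the tasks' values never leaked out
```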

Memory math: One Qwen3.5-9B-OptiQ-4bit base is ~5.6 GB and each LoRA adapter is ~50 MB. Ten co-resident adapters come to ≈ 6.1 GB, versus ~56 GB if you spun up one full model copy per adapter. Switching is free: no weight reload, no GPU re-upload.
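The memory comparison is simple arithmetic on the two sizes quoted above; the sketch below reproduces it so you can plug in your own adapter count. Variable names are illustrative.

```python
base_gb = 5.6       # Qwen3.5-9B-OptiQ-4bit base, from the text
adapter_gb = 0.05   # ~50 MB per LoRA adapter, from the text
n_adapters = 10

# One shared base with every adapter mounted alongside it.
mounted_gb = base_gb + n_adapters * adapter_gb
# One full model copy per adapter (adapter weights ignored, as in the text).
copies_gb = n_adapters * base_gb

print(f"{mounted_gb:.1f} GB vs {copies_gb:.0f} GB")  # 6.1 GB vs 56 GB
```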

Embedding the server in your own process

To install the OptiQ KV-cache hooks and the Anthropic endpoint into a self-managed mlx_lm.server launch, apply the patches first and then hand off to mlx_lm.server:

embed_serve.py (python)
from optiq.serve import _load_kv_config, install_mixed_kv
from optiq.anthropic_server import install_anthropic_endpoint
import sys

# Per-layer mixed-precision KV cache
configs = _load_kv_config("./kv/qwen35_9b/kv_config.json")
install_mixed_kv(kv_configs=configs, quantized_kv_start=0)

# Anthropic /v1/messages alongside OpenAI /v1/chat/completions
install_anthropic_endpoint()

# Hand off to mlx_lm.server
sys.argv = ["mlx_lm.server",
            "--model", "mlx-community/Qwen3.5-9B-OptiQ-4bit",
            "--port", "8080"]
from mlx_lm.server import main
main()

OpenAI client compatibility

Use the official openai Python client by pointing it at your local server:

openai_client.py (python)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",  # local server, but key is required
)

resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

Structured / JSON output

Pass an OpenAI response_format and the server constrains generation so the model can only emit tokens that keep the output valid. Both json_object (any valid JSON) and json_schema (a specific shape) are supported, plus the vLLM-style guided_regex and guided_choice extensions. The JSON it returns parses, the enum answers stay in range, the regex always matches. No retries, no parsing failures.

curl_json_schema.sh (bash)
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [{"role": "user", "content": "A person: Alice, 30."}],
      "response_format": {
        "type": "json_schema",
        "json_schema": {"schema": {
          "type": "object",
          "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
          "required": ["name", "age"]
        }}
      }
    }'
# -> {"name": "Alice", "age": 30}

The vLLM extensions take a raw regex or a fixed choice set:

guided.sh (bash)
# constrain to a regex
$ curl ... -d '{"messages": [...], "guided_regex": "[0-9]{3}-[0-9]{3}-[0-9]{4}"}'
# -> 123-456-7890

# constrain to a fixed set of answers
$ curl ... -d '{"messages": [...], "guided_choice": ["yes", "no"]}'
# -> yes
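The same constraints can be sent from the OpenAI Python SDK. The sketch below only builds the keyword arguments, so it runs without a server; the helper names are hypothetical. It relies on the SDK's extra_body parameter, which merges extra fields into the request JSON, and assumes the server reads guided_choice from the top level of the body as in the curl example.

```python
def json_schema_kwargs(model, prompt, schema):
    """Keyword arguments for chat.completions.create with a JSON Schema constraint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Mirrors the response_format body from the curl example above.
        "response_format": {"type": "json_schema", "json_schema": {"schema": schema}},
    }

def guided_choice_kwargs(model, prompt, choices):
    """Non-standard fields go in extra_body so the SDK forwards them untouched."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"guided_choice": list(choices)},
    }

person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
schema_kwargs = json_schema_kwargs(
    "mlx-community/Qwen3.5-9B-OptiQ-4bit", "A person: Alice, 30.", person_schema
)
choice_kwargs = guided_choice_kwargs(
    "mlx-community/Qwen3.5-9B-OptiQ-4bit", "Is water wet?", ["yes", "no"]
)
```

With a client configured as in the OpenAI client section, the call would be client.chat.completions.create(**choice_kwargs).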

In the OptiQ Lab

The Lab Chat surface exposes the same constraint without curl. In Model & params, set JSON mode to any valid JSON or match schema (paste a JSON Schema in the box that appears), then send a message. The reply is guaranteed-valid JSON, constrained the same way as in the curl examples above. Tools are turned off while JSON mode is on, since a single constrained response and a tool-call loop are mutually exclusive. The Lab's server installs the constraint automatically, so any model you load supports it.

No PyTorch: Structured output is built on lm-format-enforcer (pure Python: pydantic + interegular), not xgrammar, because xgrammar hard-requires PyTorch, which would break OptiQ's MLX-native, no-torch runtime. The constraining logit mask adds about 1 ms/token; free-form requests (no response_format) are untouched. For reasoning models, thinking is auto-disabled when a spec is present so the constrained output lands in content, not reasoning.
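To make "constraining logit mask" concrete, here is a toy version of the idea for a guided_choice-style constraint: at each step, every token that could not extend the output toward an allowed answer gets a logit of negative infinity. This is an illustration only; it does not use lm-format-enforcer's API, and the vocabulary and logits are made up.

```python
import math

def allowed_tokens(prefix, choices, vocab):
    """Tokens that keep prefix + token a prefix of at least one choice."""
    return {t for t in vocab if t and any(c.startswith(prefix + t) for c in choices)}

def mask_logits(logits, allowed):
    """Set every disallowed token's logit to -inf so it can never be picked."""
    return {t: (v if t in allowed else -math.inf) for t, v in logits.items()}

def constrained_greedy(logits, choices):
    """Greedy decode under the mask; `logits` is a fixed toy distribution."""
    out = ""
    while out not in choices:
        allowed = allowed_tokens(out, choices, logits)
        if not allowed:
            raise ValueError("vocabulary cannot spell any remaining choice")
        masked = mask_logits(logits, allowed)
        out += max(masked, key=masked.get)
    return out

# The unconstrained model would prefer "may" + "be"; the mask rules that out.
toy_logits = {"may": 5.0, "be": 4.0, "y": 1.0, "es": 0.5, "n": 0.9, "o": 0.2}
print(constrained_greedy(toy_logits, ["yes", "no"]))  # yes
```

A real enforcer tracks a parser state for JSON Schema or a regex instead of a list of strings, but the masking step has the same shape.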

Tool-call healing

Quantized open-weight models often emit a malformed tool call instead of the clean structured form: a Hermes <tool_call> tag, a fenced JSON block, a bare object, trailing commas, fancy quotes, or function-call syntax. mlx-lm's parser only recognizes the canonical format, so a malformed call leaks into the message content as raw text and the client never gets a tool_calls array.

OptiQ heals these server-side. On a non-streaming completion that carried tools, it scans the output for the six common malformed shapes, recovers them into proper OpenAI tool_calls, strips the leftover "I'll call X" preamble from content, and sets finish_reason to tool_calls. Unknown tool names are rejected so the model cannot invent a tool. Always on; requests without tools are untouched. It is the same healer the OptiQ Lab uses, lifted to the server layer so any client (an agent, or Claude Code through the OpenAI endpoint) gets clean calls, not just the Lab UI.
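The healing pass described above can be pictured as a small recover-and-validate function. The sketch below is a simplified illustration, not OptiQ's healer: it handles only three of the shapes (a Hermes tag, a fenced JSON block, a bare object) plus trailing commas and curly quotes, and the function names and the call id are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # built this way so the example contains no literal code fence

def _loads_lenient(text):
    """Parse JSON after normalizing curly double quotes and trailing commas."""
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)

def heal_tool_call(content, known_tools):
    """Return an OpenAI-style tool_calls entry recovered from raw text, or None."""
    patterns = [
        r"<tool_call>(.*?)</tool_call>",              # Hermes tag
        FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE,  # fenced JSON block
        r"(\{.*\})",                                  # bare object
    ]
    for pattern in patterns:
        match = re.search(pattern, content, re.S)
        if not match:
            continue
        try:
            call = _loads_lenient(match.group(1))
        except json.JSONDecodeError:
            continue
        name = call.get("name") if isinstance(call, dict) else None
        if name not in known_tools:
            return None  # unknown tool: reject rather than let the model invent one
        return {
            "id": "call_0",  # placeholder id for the sketch
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(call.get("arguments", {}))},
        }
    return None

raw = 'I will call get_weather.\n<tool_call>{"name": "get_weather", "arguments": {"city": "Paris",}}</tool_call>'
healed = heal_tool_call(raw, {"get_weather"})
print(healed["function"])
```

A server applying this would attach the result to the message's tool_calls array and set finish_reason to tool_calls, as described above.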

Prevent vs repair Two complementary routes to reliable tool calls: structured output above constrains the model so it can only emit valid calls (prevention), and healing repairs malformed ones after the fact (cure). Healing runs on non-streaming completions, where the full output is in hand.

Listing & switching models

The server hot-swaps to whatever model a request asks for: set the model field and it loads (and caches) that model, evicting the previous one. One running server can therefore back a coding harness that switches models per request, with no restart. /v1/models lists every MLX model already in your HuggingFace cache plus the served one.

switch.sh (bash)
# what's available to switch to
$ curl http://localhost:8080/v1/models

# switch by naming a different model in the request
$ curl http://localhost:8080/v1/chat/completions -d '{
      "model": "mlx-community/Qwen3.6-27B-OptiQ-4bit",
      "messages": [{"role": "user", "content": "hi"}]
    }'
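The load-on-demand, evict-the-previous behaviour amounts to a cache with a single slot. The sketch below illustrates that pattern with a fake loader; the class name is hypothetical and this is not the server's actual code.

```python
class SingleModelSlot:
    """Keep at most one model resident; load on demand, evict the previous one."""

    def __init__(self, loader):
        self._loader = loader  # callable: model name -> loaded model
        self._name = None
        self._model = None

    def get(self, name):
        if name != self._name:
            # Drop the old model first so both are never resident at once.
            self._name, self._model = None, None
            self._model = self._loader(name)
            self._name = name
        return self._model

loads = []

def fake_loader(name):
    loads.append(name)  # record each real load
    return f"<model {name}>"

slot = SingleModelSlot(fake_loader)
slot.get("model-a")
slot.get("model-a")  # cache hit, no reload
slot.get("model-b")  # switch: evicts model-a
print(loads)         # ['model-a', 'model-b']
```

Repeated requests for the same model pay nothing; alternating between two models on every request would reload each time, which is worth keeping in mind for harnesses that interleave models.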

Locally-built quants that were never pushed to the hub won't show up in the cache scan. Point --models-dir at the directory holding them and they're advertised in /v1/models too, switchable by passing their path as the request model:

terminal (bash)
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --models-dir ./optiq_output --port 8080

Anthropic API: point Claude Code at your local quant

The same server simultaneously answers Anthropic's /v1/messages endpoint with the exact response shape Claude clients expect. This means you can drive a local mlx-optiq quant from any tool that speaks the Anthropic API: Claude Code, the official anthropic Python SDK, or your own integrations.

terminal (bash)
# Same optiq serve invocation. No extra flag needed.
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --port 8080

Anthropic SDK against your local quant:

anthropic_client.py (python)
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8080",
    api_key="not-used",
)
resp = client.messages.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.content[0].text)

Claude Code via environment variables:

claude_code.sh (bash)
$ export ANTHROPIC_BASE_URL="http://localhost:8080"
$ export ANTHROPIC_API_KEY="not-used"
$ claude    # now driven by your local quant

What's translated: The shim accepts Anthropic-shaped requests, translates them into the OpenAI request the underlying mlx-lm engine wants, runs generation, and translates the response back into Anthropic shape, including streaming events. system, messages, max_tokens, stream, temperature, and top_p all work. Tool-use parameters are accepted but route through the same generation path (the underlying model does what it does; there's no server-side function-calling router).
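The request and response mapping can be sketched in a few lines. This is a simplified illustration of the translation idea for non-streaming text messages, not the shim's actual code; the function names are hypothetical, and streaming events, tool-use blocks, ids, and usage counts are left out.

```python
def _text(content):
    """Anthropic content is a string or a list of blocks; keep the text blocks."""
    if isinstance(content, str):
        return content
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")

def anthropic_to_openai(req):
    """Map an Anthropic /v1/messages body onto an OpenAI chat-completions body."""
    messages = []
    if req.get("system"):
        # Anthropic carries the system prompt as a top-level field.
        messages.append({"role": "system", "content": _text(req["system"])})
    for m in req["messages"]:
        messages.append({"role": m["role"], "content": _text(m["content"])})
    out = {"model": req["model"], "messages": messages, "max_tokens": req["max_tokens"]}
    for key in ("stream", "temperature", "top_p"):
        if key in req:
            out[key] = req[key]
    return out

def openai_to_anthropic(resp):
    """Map a non-streaming OpenAI chat completion back to Anthropic message shape."""
    choice = resp["choices"][0]
    stop_reason = {"stop": "end_turn", "length": "max_tokens"}.get(
        choice.get("finish_reason"), "end_turn"
    )
    return {
        "type": "message",
        "role": "assistant",
        "model": resp.get("model"),
        "content": [{"type": "text", "text": choice["message"]["content"]}],
        "stop_reason": stop_reason,
    }

request = {
    "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    "max_tokens": 300,
    "system": [{"type": "text", "text": "Be brief."}],
    "messages": [{"role": "user", "content": [{"type": "text", "text": "hi"}]}],
    "temperature": 0.6,
}
translated = anthropic_to_openai(request)
print(translated["messages"])
```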

Production tips

  • Bind to 127.0.0.1 for local-only use, or put the server behind a reverse proxy. Don't expose a raw 0.0.0.0 binding to the public internet; there's no auth.
  • Set max-concurrency. The MLX runtime is single-process; concurrent generations share the GPU and degrade tail latency.
  • Tune --max-tokens conservatively. Each in-flight request keeps a KV cache resident; long contexts dominate memory.
  • Pre-load adapters. Loading a new adapter mid-flight stalls all in-flight requests. Mount everything you need at startup.
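To see why long contexts dominate memory, a back-of-the-envelope estimate of the KV cache per in-flight request helps. The sketch below assumes a standard transformer KV layout and, for quantized caches, an fp16 scale and bias per group; the model dimensions are hypothetical placeholders, not Qwen3.5-9B's real configuration.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bits, group_size=64):
    """Approximate KV cache size for one request, in GB (1e9 bytes)."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len  # keys + values
    if bits < 16:
        # Assumption: each group stores an fp16 scale and an fp16 bias (32 extra bits).
        bits_per_element = bits + 32 / group_size
    else:
        bits_per_element = bits
    return elements * bits_per_element / 8 / 1e9

# Hypothetical dimensions, for illustration only.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32768)
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit KV: {kv_cache_gb(bits=bits, **dims):.2f} GB per request")
```

Multiply the per-request figure by the number of concurrent generations you allow to get a rough upper bound on KV memory, then size the concurrency cap and --max-tokens against what is left after the model weights.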