KV-quant serving
optiq serve is a drop-in replacement for mlx_lm.server. It exposes both the OpenAI /v1/chat/completions endpoint and the Anthropic /v1/messages endpoint from the same process. Point Claude Code, the OpenAI SDK, the Anthropic SDK, or plain curl at the same local URL. On top of that: sensitivity-aware quantized KV cache for long-context throughput, and mounted LoRA adapters that swap per request.
Quickstart
# Stock fp16 KV serving. Works for any mlx-optiq quant.
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --port 8080
Then call it like any OpenAI endpoint:
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
      "messages": [{"role": "user", "content": "What is RoPE?"}],
      "max_tokens": 300,
      "stream": false
    }'
Mixed-precision KV cache
By default, optiq serve uses fp16 KV. For long contexts (16k+) on Qwen3.5 / 3.6, a sensitivity-driven quantized KV cache delivers a 30-60% decode speedup with no quality loss. Why mixed precision matters for KV (and why uniform 4-bit KV is catastrophic on layer 0) is covered in Not All Layers Are Equal; for the alternative we built but didn't ship, see the TurboQuant postmortem.
Step 1: measure
# 1-2 min. Once per model.
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 5.0 \
    --candidate-bits 4,8 \
    -o ./kv/qwen35_9b

# writes ./kv/qwen35_9b/kv_config.json
# [{"layer_idx": 3, "bits": 8, "group_size": 64}, ...]
Step 2: serve
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/qwen35_9b/kv_config.json \
    --max-tokens 32768 --temp 0.6 --top-p 0.95
On the speed side, mx.quantized_matmul handles the 8-bit path more efficiently than the 4-bit one, so protecting that one sensitive layer also flips it onto a faster kernel. Quality and speed point the same way.
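To sanity-check the kernel half of that claim on your own hardware, here is a minimal, unofficial micro-benchmark against mx.quantized_matmul. It is a sketch, not the OptIQ serving path: the shapes, group size, and iteration count are arbitrary assumptions.

import time
import mlx.core as mx

def bench(bits, n=4096, k=4096, iters=100):
    # Quantize a random (n, k) weight and time x @ w.T at the given bit width.
    w = mx.random.normal((n, k))
    x = mx.random.normal((1, k))
    w_q, scales, biases = mx.quantize(w, group_size=64, bits=bits)
    mx.eval(w_q, scales, biases, x)
    # Warm-up so dispatch overhead isn't counted.
    mx.eval(mx.quantized_matmul(x, w_q, scales, biases,
                                transpose=True, group_size=64, bits=bits))
    start = time.perf_counter()
    for _ in range(iters):
        y = mx.quantized_matmul(x, w_q, scales, biases,
                                transpose=True, group_size=64, bits=bits)
        mx.eval(y)
    return (time.perf_counter() - start) / iters

for bits in (4, 8):
    print(f"{bits}-bit: {bench(bits) * 1e6:.1f} us per matmul")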
LoRA adapter at serve time
The CLI accepts one adapter per server process via --adapter. The argument is either a HuggingFace repo id (auto-downloaded into the OptIQ adapter cache) or a local directory:
# Local path
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter ./my_adapter

# HF repo id (downloaded on first use)
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --adapter your-org/my-adapter
OptIQ-trained adapters surface their optiq_lora_config.json sidecar (rank, rank distribution, scaling mode) in the startup log.
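The same metadata can be inspected without starting the server, since the sidecar is plain JSON. A small sketch; the exact keys depend on how the adapter was trained, so it just dumps whatever is there:

import json
from pathlib import Path

# Hypothetical inspection helper: print the sidecar's fields (rank, rank
# distribution, scaling mode, ...) without hard-coding any key names.
sidecar = Path("./my_adapter") / "optiq_lora_config.json"
if sidecar.exists():
    for key, value in json.loads(sidecar.read_text()).items():
        print(f"{key}: {value}")
else:
    print("No optiq_lora_config.json sidecar; not an OptIQ-trained adapter.")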
Multi-adapter, hot-swap at the Python layer
The mounted-LoRA primitive supports multiple adapters resident at once, switched per call via a ContextVar. This is a programmatic API, not a CLI flag — embed it in your own server or notebook to serve N adapters from one base:
from mlx_lm import load, generate
from optiq.adapters.mount import (
    prepare_model_for_mounted_lora,
    mount_adapter_on_model,
    AdapterActivation,
)

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
prepare_model_for_mounted_lora(model)
mount_adapter_on_model(model, "agent-A", "./adapter_a")
mount_adapter_on_model(model, "agent-B", "./adapter_b")

p = "What is RoPE?"

with AdapterActivation("agent-A"):
    out_a = generate(model, tok, prompt=p, max_tokens=100)

with AdapterActivation("agent-B"):
    out_b = generate(model, tok, prompt=p, max_tokens=100)
Mounted adapters stay separate from the base (unlike mlx-lm's load_adapters, which merges weights). The ContextVar means concurrent asyncio tasks or threads with different active adapters don't step on each other.
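As a sketch of that isolation, two asyncio tasks can each pin their own adapter; asyncio.to_thread copies the calling task's context, so the ContextVar set by AdapterActivation travels with each generation. This assumes the model and mounted adapters from the snippet above, and it does not make decoding parallel: both generations still share the one GPU.

import asyncio

async def run_with(adapter_name: str, prompt: str) -> str:
    with AdapterActivation(adapter_name):  # sets the ContextVar for this task
        # asyncio.to_thread propagates the current context into the worker
        # thread, so generate() sees this task's active adapter.
        return await asyncio.to_thread(
            generate, model, tok, prompt=prompt, max_tokens=100
        )

async def main():
    out_a, out_b = await asyncio.gather(
        run_with("agent-A", "Summarize the ticket."),
        run_with("agent-B", "Draft the reply."),
    )
    print(out_a, out_b, sep="\n---\n")

asyncio.run(main())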
Embedding the server in your own process
If you want to install the OptIQ KV-cache hooks and Anthropic endpoint into a self-managed mlx_lm.server launch:
from optiq.serve import _load_kv_config, install_mixed_kv
from optiq.anthropic_server import install_anthropic_endpoint
import sys

# Per-layer mixed-precision KV cache
configs = _load_kv_config("./kv/qwen35_9b/kv_config.json")
install_mixed_kv(kv_configs=configs, quantized_kv_start=0)

# Anthropic /v1/messages alongside OpenAI /v1/chat/completions
install_anthropic_endpoint()

# Hand off to mlx_lm.server
sys.argv = ["mlx_lm.server",
            "--model", "mlx-community/Qwen3.5-9B-OptiQ-4bit",
            "--port", "8080"]
from mlx_lm.server import main
main()
OpenAI client compatibility
Use the official openai Python client by pointing it at your local server:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",  # local server, but a key is required
)

resp = client.chat.completions.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")
Anthropic API: point Claude Code at your local quant
The same server simultaneously answers Anthropic's /v1/messages endpoint with the exact response shape Claude clients expect. This means you can drive a local mlx-optiq quant from any tool that speaks the Anthropic API: Claude Code, the official anthropic Python SDK, or your own integrations.
# Same optiq serve invocation. No extra flag needed.
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --port 8080
Anthropic SDK against your local quant:
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8080",
    api_key="not-used",
)

resp = client.messages.create(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.content[0].text)
Claude Code via env vars:
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="not-used"

$ claude   # now driven by your local quant
system, messages, max_tokens, stream, temperature, and top_p all work. Tool-use parameters are accepted but route through the same generation path (the underlying model does what it does; there's no server-side function-calling router).
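For example, a streaming request that exercises system, temperature, and top_p through the official SDK. This is a sketch that assumes the local endpoint emits the same streaming events the SDK expects, which the stream support above implies:

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080", api_key="not-used")

# Streams tokens from the local /v1/messages endpoint with a system prompt
# and explicit sampling parameters.
with client.messages.stream(
    model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
    max_tokens=300,
    temperature=0.6,
    top_p=0.95,
    system="Answer in one short paragraph.",
    messages=[{"role": "user", "content": "What is RoPE?"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)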
Production tips
- Bind to 127.0.0.1 for local-only use, or put the server behind a reverse proxy. Don't expose the raw 0.0.0.0 binding to the public internet; there's no auth.
- Set max-concurrency. The MLX runtime is single-process; concurrent generations share the GPU and degrade tail latency. (A client-side throttle sketch follows this list.)
- Tune --max-tokens conservatively. Each in-flight request keeps a KV cache resident; long contexts dominate memory.
- Pre-load adapters. Loading a new adapter mid-flight stalls all in-flight requests. Mount everything you need at startup.
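One way to cap concurrency is on the client side. A sketch with the async OpenAI client and a semaphore; MAX_IN_FLIGHT is an arbitrary value for illustration, not an optiq setting:

import asyncio
from openai import AsyncOpenAI

MAX_IN_FLIGHT = 2  # assumption: tune for your machine and context lengths
sem = asyncio.Semaphore(MAX_IN_FLIGHT)
client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-used")

async def ask(prompt: str) -> str:
    async with sem:  # queue excess requests on the client side
        resp = await client.chat.completions.create(
            model="mlx-community/Qwen3.5-9B-OptiQ-4bit",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
        )
        return resp.choices[0].message.content

async def main():
    prompts = ["What is RoPE?", "What is GQA?", "What is YaRN?", "What is MLA?"]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print("\n---\n".join(answers))

asyncio.run(main())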