mlx-optiq

CLI reference

The optiq command-line interface drives every mlx-optiq workflow on Apple Silicon: quantization, KV-cache configuration, LoRA fine-tuning, OpenAI- and Anthropic-compatible serving, evaluation, benchmarking, and latency prediction. Seven top-level commands cover the full workflow: convert, kv-cache, lora, serve, eval, benchmark, and latency.

optiq convert

Quantize a Hugging Face model with mixed-precision sensitivity-driven bit allocation.

convert.shbash
$ optiq convert MODEL [OPTIONS]

Required

  • MODEL: HF repo ID (e.g. Qwen/Qwen3.5-9B) or local path to a bf16 checkpoint.

Options

  • --method [optiq|static]: Bit-allocation method. Default: optiq (exact calibration-driven KL sensitivity). static allocates by structural rules — embedding/head, first/last block, attention and the MoE router get the high bits — at the requested candidate bits and target BPW, with no calibration or forward passes. It matches optiq on typical architectures, loads lazily, and is the fast path for large bases where exact KL is impractical. See the methods comparison.
  • --target-bpw FLOAT: Average bits per weight. Default: 5.0.
  • --candidate-bits TEXT: Comma-separated bit-widths to choose from. Default: 4,8.
  • --reference [auto|bf16|uniform_4bit]: Reference precision for sensitivity probes (optiq method only). Default: auto (bf16 if it fits in RAM, else uniform_4bit).
  • --calibration-mix TEXT: Calibration set. Default: optiq (the bundled 6-domain mix). Pass a path to override with a custom JSONL file.
  • --group-size INTEGER: Quantization group size. Default: 64.
  • --n-calibration INTEGER: Calibration sequences for sensitivity (drawn from the 40-sample mix). Default: 24 (roughly 4 samples from each of the 6 domains). Higher = more stable per-layer KL ranking, linearly slower convert.
  • --n-floor-per-block INTEGER: Block-aware floor: the minimum number of components per transformer block kept above the lowest bit-width. Prevents lowest-bit assignments from concentrating in the middle blocks. Default: 2; 0 disables.
  • --skip-baselines: Don't also build the uniform-4-bit comparison artifact.
  • -o, --output TEXT: Output directory. Default: optiq_output/<model_basename>.
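
To make --target-bpw and --candidate-bits concrete, here is a minimal arithmetic sketch. It assumes exactly two candidate widths and ignores per-group scale/bias overhead; this reference does not say whether optiq counts that overhead toward the target. The function names are illustrative and are not part of optiq.

```python
def high_bit_fraction(target_bpw: float, low_bits: int, high_bits: int) -> float:
    """Fraction of weights that must sit at high_bits to average target_bpw.

    Assumption: only two candidate widths, no per-group scale/bias overhead.
    """
    if low_bits >= high_bits:
        raise ValueError("low_bits must be smaller than high_bits")
    if not low_bits <= target_bpw <= high_bits:
        raise ValueError("target must lie between the two candidate widths")
    return (target_bpw - low_bits) / (high_bits - low_bits)


def weight_bytes(n_params: float, bpw: float) -> float:
    """Approximate weight payload in bytes at an average bits-per-weight."""
    return n_params * bpw / 8


# Defaults: --target-bpw 5.0 with --candidate-bits 4,8
default_mix = high_bit_fraction(5.0, 4, 8)
# A 2/4-bit candidate set at 2.5 BPW
low_bit_mix = high_bit_fraction(2.5, 2, 4)
# Hypothetical model with 9e9 parameters at 5.0 BPW
size_9b = weight_bytes(9e9, 5.0)

print(f"{default_mix:.0%} of weights at 8-bit, the rest at 4-bit")
print(f"{low_bit_mix:.0%} of weights at 4-bit, the rest at 2-bit")
print(f"about {size_9b / 1e9:.2f} GB of weights")
```

Under these assumptions both settings put a quarter of the weights at the higher width; which layers receive them is what the allocation method decides.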

Examples

terminalbash
# Standard 5.0 BPW mix on a 9B (auto-routes to bf16 reference)
$ optiq convert Qwen/Qwen3.5-9B --target-bpw 5.0

# 27B on a 36 GB Mac: pin the uniform_4bit reference explicitly (auto selects it too when bf16 does not fit in RAM)
$ optiq convert Qwen/Qwen3.5-27B --reference uniform_4bit

# Custom bit set (3/4/6/8-bit candidates at a 4.0 BPW average)
$ optiq convert Qwen/Qwen3.5-4B \
    --target-bpw 4.0 --candidate-bits 3,4,6,8

# Fast structural allocation for a large MoE, mixed 2/4-bit
$ optiq convert mlx-community/Qwen3.5-122B-A10B-bf16 \
    --method static --candidate-bits 2,4 --target-bpw 2.5
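
The --reference auto behaviour in the first two examples can be approximated with a rough size check. This is a sketch only: it assumes 2 bytes per bf16 parameter and a hypothetical headroom factor, and optiq's real threshold is not documented here.

```python
BYTES_PER_BF16_PARAM = 2


def pick_reference(n_params: float, ram_bytes: float, headroom: float = 1.0) -> str:
    """Return "bf16" if a bf16 copy of the weights fits in RAM, else "uniform_4bit".

    headroom is a hypothetical safety multiplier, not an optiq setting.
    """
    bf16_bytes = n_params * BYTES_PER_BF16_PARAM * headroom
    return "bf16" if bf16_bytes <= ram_bytes else "uniform_4bit"


ram_36gb = 36e9
choice_9b = pick_reference(9e9, ram_36gb)    # 18 GB of bf16 weights
choice_27b = pick_reference(27e9, ram_36gb)  # 54 GB of bf16 weights
print(choice_9b, choice_27b)
```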

optiq kv-cache

Measure per-layer KV-cache sensitivity and write a per-layer KV bit-width config that optiq serve --kv-config consumes.

kv-cache.shbash
$ optiq kv-cache MODEL [OPTIONS]

Options

  • --target-bits FLOAT: Average KV bits across full-attention layers. Default: 5.0.
  • --candidate-bits TEXT: Comma-separated KV bit-widths to choose from. Default: 4,8.
  • --n-samples INTEGER: Calibration samples for KV sensitivity. Default: 5.
  • --seq-len INTEGER: Calibration sequence length. Default: 512.
  • --group-size INTEGER: KV quantization group size. Default: 64.
  • -o, --output TEXT: Where to write kv_config.json and kv_sensitivity.json. Default: optiq_output/kv_cache.

Gemma-4 support

Upstream mlx-lm raises NotImplementedError: RotatingKVCache Quantization NYI on sliding-window models. optiq kv-cache installs optiq.runtime.kv.RotatingQuantizedKVCache at probe time, so sensitivity analysis (and serve) work on all sliding-window models: Gemma-4, Cohere R2, OLMo 3, Phi-3/4, EXAONE, Ministral 3, etc.
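
To see what an average of --target-bits 5.0 buys relative to fp16 KV, the sketch below works through the memory arithmetic. The layer count, KV-head count, and head dimension are hypothetical placeholders, every layer is treated as full attention, and per-group scales and biases are ignored, so real savings will be somewhat smaller.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bits: float) -> float:
    """Bytes of K and V stored per token across all layers at `bits` per element."""
    elements = 2 * n_layers * n_kv_heads * head_dim  # 2 = one K and one V vector
    return elements * bits / 8


# Hypothetical model shape, for illustration only
layers, kv_heads, head_dim = 32, 8, 128

fp16_bytes = kv_bytes_per_token(layers, kv_heads, head_dim, 16)
mixed_bytes = kv_bytes_per_token(layers, kv_heads, head_dim, 5.0)

context = 32_768  # tokens
print(f"fp16 KV:  {fp16_bytes * context / 2**30:.2f} GiB")
print(f"5.0-bit:  {mixed_bytes * context / 2**30:.2f} GiB")
print(f"ratio:    {mixed_bytes / fp16_bytes:.4f}")
```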

optiq lora

Train and inspect sensitivity-aware LoRA adapters.

optiq lora train

lora-train.shbash
$ optiq lora train MODEL [OPTIONS]

Options

  • --data PATH: Directory containing train.jsonl (and optional valid.jsonl). Required.
  • --rank INTEGER: Base LoRA rank. Default: 8.
  • --scale FLOAT: LoRA alpha scaling (alpha = rank * scale). Default: 20.0.
  • --dropout FLOAT: LoRA dropout. Default: 0.0.
  • --rank-scaling [constant|by_bits|by_kl]: How to scale rank across layers. by_bits uses the per-layer bit assignments OptiQ recorded at convert time; by_kl scales by raw KL sensitivity; constant is uniform. Default: by_bits.
  • --num-layers INTEGER: Last N transformer blocks to adapt; -1 for all. Default: 16.
  • --target-modules TEXT: Comma-separated module suffixes to adapt. Default: q_proj,v_proj.
  • --use-dora: Use DoRA in place of LoRA (currently raises NotImplementedError on mlx-lm 0.31.x).
  • --max-seq-length INTEGER: Tokens per training sample. Default: 1600 (M3 Max 36 GB safe ceiling for 9B). See the fine-tuning training-ceiling map for per-model recipes.
  • --iters INTEGER: Training iterations. Default: 1000.
  • --learning-rate / --lr FLOAT: Default: 1e-4.
  • --batch-size INTEGER: Default: 1 (Mac UMA constraint).
  • --grad-accumulation-steps INTEGER: Default: 1.
  • --grad-checkpoint / --no-grad-checkpoint: Enable activation checkpointing. Default: on.
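  • --val-batches INTEGER: Validation batches per evaluation. Default: 25.
  • --steps-per-report INTEGER: Iterations between training-loss reports. Default: 10.
  • --steps-per-eval INTEGER: Iterations between validation runs. Default: 200.
  • --steps-per-save INTEGER: Iterations between adapter checkpoints. Default: 100.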
  • -o, --output PATH: Adapter output directory. Default: ./optiq_lora_adapters.
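
The defaults above imply a concrete training budget, sketched below. The TrainPlan class is illustrative and not part of optiq. It assumes each iteration processes one batch and that saves and evaluations fire only on multiples of their interval; the trainer may also evaluate at the first or last step.

```python
from dataclasses import dataclass


@dataclass
class TrainPlan:
    """Documented optiq lora train defaults; the derived values are estimates."""
    rank: int = 8
    scale: float = 20.0
    iters: int = 1000
    batch_size: int = 1
    grad_accumulation_steps: int = 1
    max_seq_length: int = 1600
    steps_per_eval: int = 200
    steps_per_save: int = 100

    @property
    def alpha(self) -> float:
        # Documented relation: alpha = rank * scale
        return self.rank * self.scale

    @property
    def sequences_per_update(self) -> int:
        return self.batch_size * self.grad_accumulation_steps

    @property
    def max_tokens_seen(self) -> int:
        # Upper bound: every sample is assumed to fill max_seq_length
        return self.iters * self.batch_size * self.max_seq_length

    @property
    def checkpoints(self) -> int:
        return self.iters // self.steps_per_save

    @property
    def evaluations(self) -> int:
        return self.iters // self.steps_per_eval


plan = TrainPlan()
print(f"alpha={plan.alpha}, sequences per update={plan.sequences_per_update}")
print(f"at most {plan.max_tokens_seen:,} tokens, "
      f"{plan.checkpoints} checkpoints, {plan.evaluations} evaluations")
```

Raising --grad-accumulation-steps increases sequences per update without raising peak memory, which is the usual lever when --batch-size has to stay at 1.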

optiq lora info

lora-info.shbash
$ optiq lora info ADAPTER_DIR

Prints the per-layer rank distribution and trainable parameter count.

optiq serve

Dual-protocol inference server. Wraps mlx_lm.server and exposes both the OpenAI /v1/chat/completions endpoint and the Anthropic /v1/messages endpoint from the same process. Drive it with the OpenAI SDK, the Anthropic SDK, plain curl, or Claude Code (via ANTHROPIC_BASE_URL). Adds OptiQ-aware mixed-precision KV cache and a mounted LoRA adapter on top.

serve.shbash
$ optiq serve [OPTIONS] -- [mlx_lm.server flags]

OptiQ-specific options

  • --kv-config PATH: JSON config from optiq kv-cache. Per-layer mixed-precision KV. Overrides --kv-bits.
  • --kv-bits INTEGER: Uniform KV quantization (4 or 8). Omit both --kv-config and --kv-bits for fp16 KV.
  • --kv-group-size INTEGER: KV quantization group size. Default: 64.
  • --quantized-kv-start INTEGER: Token offset at which KV quantization kicks in. Default: 0.
  • --adapter PATH-OR-REPO: Apply a LoRA adapter at startup. Accepts a Hugging Face repo ID (auto-downloaded) or a local directory. OptiQ-trained adapters surface their optiq_lora_config.json sidecar in startup logs.
  • --models-dir DIR: Advertise locally-built quants under DIR in /v1/models (alongside HF-cache models and the served one), switchable per request by path. The server already hot-swaps to any requested model.
  • --anthropic / --no-anthropic: Expose the Anthropic /v1/messages endpoint alongside OpenAI's. Default: on. Pass --no-anthropic to disable.

Forwarded to mlx_lm.server

All other flags pass through unchanged. Common ones include --model, --host, --port, --max-tokens, --temp, --top-p, --top-k. See python -m mlx_lm.server --help for the full upstream list and current defaults.

Examples

terminalbash
# 1. Generate a per-layer KV config (once per model)
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit --target-bits 5.0

# 2. Serve with mixed-precision KV (the Anthropic endpoint is on by default)
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config optiq_output/kv_cache/kv_config.json \
    --port 8080

# 3. Drive Claude Code through it
$ export ANTHROPIC_BASE_URL="http://localhost:8080"
$ export ANTHROPIC_API_KEY="not-used"
$ claude

optiq eval

Two-stage evaluation harness. --task smoketest runs a fast triage (KL + GSM8K-50, ~5 min on 27B). --task all runs the full benchmark suite (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop, ~2 h on 27B). Add --score for the unweighted-mean Capability Score. Individual tasks are addressable on their own. The methodology is in the eval-framework write-up.

eval.shbash
$ optiq eval MODEL_PATH [--task TASK] [OPTIONS]

Tasks

  • smoketest: KL on 64 prompts × 256 tokens, plus GSM8K on 50 samples. Fast triage.
  • all: full benchmark suite below, in one run.
  • kl: KL divergence vs reference (default 64 prompts).
  • mmlu: 5-shot MMLU on 1000 samples.
  • gsm8k: GSM8K on 1000 samples (default), CoT 3-shot. Default task if none specified.
  • gsm8k-50: GSM8K on 50 samples (smoketest tier).
  • ifeval: full 540-prompt IFEval (instruction following).
  • bfcl: BFCL-V3 simple, 200 questions (function calling).
  • humaneval: full 164 problems, code execution sandboxed (apple/container → sandbox-exec → subprocess + rlimit).
  • hashhop: HashHop long-context evaluation (multi-hop key→value retrieval). 25 instances for each of 4 hop counts (100 total) at roughly 12k tokens of context.
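
For readers unfamiliar with the metric behind the kl task, the following is a minimal sketch of per-token KL divergence between a reference model's next-token distribution and a quantized model's. It is the textbook definition in pure Python. The direction KL(reference || quantized) and the plain mean over positions are assumptions; how optiq aggregates over the 64 prompts is not described here.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits: list[float], test_logits: list[float]) -> float:
    """KL(ref || test) in nats for a single token position."""
    ref_logp = log_softmax(ref_logits)
    test_logp = log_softmax(test_logits)
    return sum(math.exp(rp) * (rp - tp) for rp, tp in zip(ref_logp, test_logp))


def mean_kl(ref_rows: list[list[float]], test_rows: list[list[float]]) -> float:
    """Average per-position KL over a sequence of logit rows."""
    values = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(values) / len(values)


# Toy two-token vocabulary: reference is uniform, quantized model prefers token 0
ref = [[0.0, 0.0], [0.0, 0.0]]
quant = [[math.log(3.0), 0.0], [0.0, 0.0]]
print(f"mean KL = {mean_kl(ref, quant):.4f} nats")
```

A value of zero means the quantized model reproduces the reference distribution exactly at every position; larger values mean more drift.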

Options

  • --task TEXT: One of the tasks above. Default: gsm8k.
  • --n-samples INTEGER: Override the per-task default sample count.
  • --baseline PATH: Side-by-side comparison against another model (gsm8k task only).
  • --reference-model PATH: Explicit reference for the KL eval. If unset, auto-resolved.
  • --reference-mode [auto|bf16|uniform_4bit]: KL reference selection strategy. Default: auto, which uses bf16 if it fits in RAM (judged from the per-shard sizes reported by HfApi.model_info) and otherwise falls back to the mlx-community uniform-4-bit baseline.
  • --score: With --task all, also compute the Capability Score (unweighted mean of MMLU, GSM8K, IFEval, BFCL, HumanEval, HashHop).
  • --reasoning: Score an always-on reasoning model fairly. Lets the model emit its <think> block (no suppression), gives the generation tasks a large per-question budget so the trace completes, strips the think block before extracting the answer, and scores MMLU generatively (parse the answer letter) instead of by first-letter logit argmax — which collapses to chance for a model trained to reason before answering. Opt-in; default eval behavior is unchanged.
  • --reasoning-max-tokens INTEGER: Per-question budget in --reasoning mode. Default: 3072 (a cap, not a fixed length — greedy decoding stops at EOS, so short traces cost nothing). Raise it for unusually long reasoners.
  • --kv-bits INTEGER: Uniform KV quantization (4 or 8) during generation-based benchmarks. KL and MMLU are forward-pass only, so this has no effect there.
  • --kv-group-size INTEGER: KV quantization group size. Default: 64.
  • --kv-config PATH: Per-layer mixed-precision KV from optiq kv-cache. Overrides --kv-bits.
  • --output-json PATH: Dump the structured eval record (every metric + Capability Score) to JSON.
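
The Capability Score that --score reports is described as an unweighted mean of the six suite benchmarks. A minimal sketch of that calculation follows. The score values are made-up placeholders, and the dictionary keys and the 0-to-1 scale are assumptions; the actual layout written by --output-json is not documented in this reference.

```python
from statistics import fmean

SUITE = ("mmlu", "gsm8k", "ifeval", "bfcl", "humaneval", "hashhop")


def capability_score(scores: dict[str, float]) -> float:
    """Unweighted mean over the six suite benchmarks; fails if any is missing."""
    missing = [task for task in SUITE if task not in scores]
    if missing:
        raise KeyError(f"missing benchmark scores: {missing}")
    return fmean([scores[task] for task in SUITE])


# Placeholder accuracies, not measured results
example_scores = {
    "mmlu": 0.70,
    "gsm8k": 0.80,
    "ifeval": 0.75,
    "bfcl": 0.85,
    "humaneval": 0.60,
    "hashhop": 0.50,
}
score = capability_score(example_scores)
print(f"Capability Score: {score:.3f}")
```

Because the mean is unweighted, a large drop on one benchmark (long-context retrieval, for instance) moves the score as much as the same drop on any other.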

Examples

terminalbash
# Fast triage
$ optiq eval ./optiq_output/Qwen3.5-9B/optiq_mixed --task smoketest

# Full benchmark suite + Capability Score, dump to JSON
$ optiq eval ./optiq_mixed --task all --score \
    --output-json ./bench.json

# Single benchmark
$ optiq eval ./optiq_mixed --task humaneval

# GSM8K head-to-head against the uniform-4-bit baseline
$ optiq eval ./optiq_mixed --task gsm8k \
    --baseline ./optiq_output/Qwen3.5-9B/uniform_4bit

# Eval the full suite with mixed-precision KV active during generation
$ optiq eval ./optiq_mixed --task all \
    --kv-config ./kv/qwen35_9b/kv_config.json

optiq benchmark

Quick-and-dirty perplexity + throughput on a converted model, with optional baseline side-by-side. For headline accuracy numbers, prefer optiq eval above.

benchmark.shbash
$ optiq benchmark MODEL_PATH [--baseline UNIFORM_PATH] [--n-samples N]

Options

  • --baseline PATH: Side-by-side comparison against a uniform-4-bit baseline (or any other model path).
  • --n-samples INTEGER: Perplexity samples. Default: 50.

optiq latency

Predicts decode tok/s for a quantized model using the Apple Silicon roofline model: latency ≈ model_bytes / memory_bandwidth + per_layer_overhead. Detects your hardware tier and bandwidth automatically.

latency.shbash
# Bare prediction (linear-layer-only roofline; optimistic)
$ optiq latency ./optiq_mixed

# Calibrated: load + run the model once, fit the overhead constant
$ optiq latency ./optiq_mixed --calibrate

Options

  • --calibrate: Run 8 warmup + 15 measured generations to fit the framework-overhead constant for this hardware. The bare prediction only counts weight-loading time and is optimistic; on M3 Max ~83% of decode latency is overhead (attention, norms, KV cache, framework), so calibration produces meaningfully better predictions for the model you measured and for similarly-sized models on the same hardware.
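
The roofline estimate and the effect of calibration can be reproduced by hand. The sketch below treats overhead as a single additive per-token term and fits it as the gap between the mean measured latency and the weight-streaming time. That is a simplification for illustration, not optiq's fitting procedure, and the bandwidth figure and measured latencies are hypothetical inputs.

```python
from statistics import fmean


def weight_time_s(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to stream all weights once, the bare roofline term."""
    return model_bytes / bandwidth_bytes_per_s


def predict_tok_per_s(model_bytes: float, bandwidth_bytes_per_s: float,
                      overhead_s: float = 0.0) -> float:
    """Decode throughput from latency = model_bytes / bandwidth + overhead."""
    return 1.0 / (weight_time_s(model_bytes, bandwidth_bytes_per_s) + overhead_s)


def fit_overhead_s(measured_latencies_s: list[float], model_bytes: float,
                   bandwidth_bytes_per_s: float) -> float:
    """Overhead estimate: mean measured per-token latency minus weight time."""
    gap = fmean(measured_latencies_s) - weight_time_s(model_bytes, bandwidth_bytes_per_s)
    return max(0.0, gap)


model_bytes = 9e9 * 5.0 / 8          # hypothetical 9e9 params at 5.0 BPW
bandwidth = 300e9                    # assumed 300 GB/s memory bandwidth
measured = [0.030, 0.031, 0.029]     # hypothetical per-token latencies in seconds

bare = predict_tok_per_s(model_bytes, bandwidth)
overhead = fit_overhead_s(measured, model_bytes, bandwidth)
calibrated = predict_tok_per_s(model_bytes, bandwidth, overhead)
share = overhead / (weight_time_s(model_bytes, bandwidth) + overhead)

print(f"bare: {bare:.1f} tok/s, calibrated: {calibrated:.1f} tok/s")
print(f"overhead share of latency: {share:.0%}")
```

By the same formula, an overhead share of 83% means the bare estimate overstates throughput by a factor of 1 / (1 - 0.83), roughly 5.9, which is why the calibrated figure is the one to trust.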

Top-level options

  • optiq --version: Print the installed version.
  • optiq --help: Show top-level help.
  • optiq COMMAND --help: Per-command help.

Environment variables

  • HF_HOME: Hugging Face cache root. Default: ~/.cache/huggingface.
  • HF_HUB_ENABLE_HF_TRANSFER=1: Enable accelerated HF downloads (requires pip install hf_transfer).
  • OPTIQ_ADAPTER_CACHE: Where optiq serve --adapter <hf-repo> downloads remote adapters. Default: under the HF cache.
  • ANTHROPIC_BASE_URL / ANTHROPIC_API_KEY: Set these to point Claude Code (and any Anthropic-SDK consumer) at optiq serve.