mlx-optiq
Reference · CLI

CLI reference

The optiq CLI is built with Click. Seven top-level commands cover the entire mlx-optiq workflow:

optiq convert

Quantize a Hugging Face model with mixed-precision sensitivity-driven bit allocation.

$ optiq convert MODEL [OPTIONS]

Required

  • MODEL: HF repo ID (e.g. Qwen/Qwen3.5-9B) or local path to a bf16 checkpoint.

Options

  • --target-bpw FLOAT: Average bits per weight. Default: 4.5.
  • --candidate-bits TEXT: Comma-separated bit-widths to choose from. Default: 4,8.
  • --reference [auto|bf16|uniform_4bit]: Reference precision for sensitivity probes. Default: auto (bf16 if it fits in RAM, else uniform_4bit).
  • --calibration-mix TEXT: Calibration set. Default: optiq (the bundled 6-domain mix). Pass a path to override with a custom JSONL file.
  • --group-size INTEGER: Quantization group size. Default: 64.
  • --n-calibration INTEGER: Calibration sequences for sensitivity (drawn from the 40-sample mix). Default: 24 (roughly 4 samples from each of the 6 domains). Higher = more stable per-layer KL ranking, linearly slower convert.
  • --n-floor-per-block INTEGER: Block-aware floor: minimum components per transformer block kept above the lowest bit. Prevents lowest-bit assignments from concentrating in the middle. Default: 2; 0 disables.
  • --skip-baselines: Don't also build the uniform-4-bit comparison artifact.
  • -o, --output TEXT: Output directory. Default: optiq_output/<model_basename>.

Examples

# Standard 4.5 BPW mix on a 9B (auto-routes to bf16 reference)
$ optiq convert Qwen/Qwen3.5-9B --target-bpw 4.5

# 27B+ on a 36 GB Mac (auto-routes to uniform_4bit reference)
$ optiq convert Qwen/Qwen3.5-27B --reference uniform_4bit

# Custom bit set (3-bit / 6-bit mix at 4 BPW average)
$ optiq convert Qwen/Qwen3.5-4B \
    --target-bpw 4.0 --candidate-bits 3,4,6,8
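The bit-allocation idea behind --target-bpw and --candidate-bits can be illustrated with a toy greedy allocator. This is a sketch only — the sensitivities and layer names are made up, and the real OptIQ allocator additionally applies the block-aware floor and weights layers by size:

```python
# Toy sketch of sensitivity-driven mixed-precision bit allocation:
# start every layer at the lowest candidate bit-width, then promote
# the most KL-sensitive layers until the average BPW reaches the target.
# Assumes equal-sized layers; the real allocator is more involved.

def allocate_bits(sensitivities, candidate_bits=(4, 8), target_bpw=4.5):
    low, high = min(candidate_bits), max(candidate_bits)
    bits = {layer: low for layer in sensitivities}
    # Promote most KL-sensitive layers first
    for layer in sorted(sensitivities, key=sensitivities.get, reverse=True):
        avg = sum(bits.values()) / len(bits)
        if avg >= target_bpw:
            break
        bits[layer] = high
    return bits

sens = {"blocks.0": 0.9, "blocks.1": 0.2, "blocks.2": 0.7, "blocks.3": 0.1}
assignment = allocate_bits(sens, target_bpw=5.0)
avg_bpw = sum(assignment.values()) / len(assignment)
```

With a 5.0 BPW target over a 4/8 candidate set, only the single most sensitive layer gets promoted before the average is met.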

optiq kv-cache

Measure per-layer KV-cache sensitivity and write a per-layer KV bit-width config that optiq serve --kv-config consumes.

$ optiq kv-cache MODEL [OPTIONS]

Options

  • --target-bits FLOAT: Average KV bits across full-attention layers. Default: 5.0.
  • --candidate-bits TEXT: Default: 4,8.
  • --n-samples INTEGER: Calibration samples for KV sensitivity. Default: 5.
  • --seq-len INTEGER: Calibration sequence length. Default: 512.
  • --group-size INTEGER: Default: 64.
  • -o, --output TEXT: Where to write kv_config.json and kv_sensitivity.json. Default: optiq_output/kv_cache.

Gemma-4 caveat: Gemma-4 uses shared-KV attention; the current mlx-lm KV-cache implementation doesn't yet support mixed-precision quantization on this layout. optiq kv-cache exits cleanly with an explanation when run on Gemma-4. Use stock fp16 KV serving on Gemma instead.
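For orientation, the generated per-layer config has roughly this shape. The field names below are hypothetical, written only to illustrate the per-layer bit-map idea — inspect the kv_config.json that optiq kv-cache actually writes for the real schema:

```json
{
  "group_size": 64,
  "default_bits": 8,
  "layer_bits": {"0": 8, "1": 4, "2": 4, "3": 8}
}
```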

optiq lora

Train and inspect sensitivity-aware LoRA adapters.

optiq lora train

$ optiq lora train MODEL [OPTIONS]

Options

  • --data PATH: Directory containing train.jsonl (and optional valid.jsonl). Required.
  • --rank INTEGER: Base LoRA rank. Default: 8.
  • --scale FLOAT: LoRA alpha scaling (alpha = rank * scale). Default: 20.0.
  • --dropout FLOAT: LoRA dropout. Default: 0.0.
  • --rank-scaling [constant|by_bits|by_kl]: How to scale rank across layers. by_bits uses the per-layer bit assignments OptIQ recorded at convert time; by_kl scales by raw KL sensitivity; constant is uniform. Default: by_bits.
  • --num-layers INTEGER: Last N transformer blocks to adapt; -1 for all. Default: 16.
  • --target-modules TEXT: Comma-separated module suffixes to adapt. Default: q_proj,v_proj.
  • --use-dora: Use DoRA in place of LoRA (currently raises NotImplementedError on mlx-lm 0.31.x).
  • --max-seq-length INTEGER: Tokens per training sample. Default: 1600 (M3 Max 36 GB safe ceiling for 9B). See the fine-tuning training-ceiling map for per-model recipes.
  • --iters INTEGER: Training iterations. Default: 1000.
  • --learning-rate / --lr FLOAT: Default: 1e-4.
  • --batch-size INTEGER: Default: 1 (Mac UMA constraint).
  • --grad-accumulation-steps INTEGER: Default: 1.
  • --grad-checkpoint / --no-grad-checkpoint: Enable activation checkpointing. Default: on.
  • --val-batches INTEGER: Default: 25.
  • --steps-per-report INTEGER: Default: 10.
  • --steps-per-eval INTEGER: Default: 200.
  • --steps-per-save INTEGER: Default: 100.
  • -o, --output PATH: Adapter output directory. Default: ./optiq_lora_adapters.
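To make --rank-scaling by_bits concrete, here is one plausible scaling rule sketched in code, under the assumption that more aggressively quantized (lower-bit) layers receive a larger rank to compensate. The actual rule OptIQ applies may differ, and the bit assignments below are made up:

```python
# Hypothetical by_bits rank scaling: layers held at the lowest
# bit-width get a boosted LoRA rank; higher-bit layers keep the base.
# Illustration only -- not OptIQ's exact formula.

def scaled_ranks(layer_bits, base_rank=8):
    lo, hi = min(layer_bits.values()), max(layer_bits.values())
    ranks = {}
    for layer, bits in layer_bits.items():
        if hi == lo:
            ranks[layer] = base_rank  # uniform quantization: constant rank
        else:
            # Linear interpolation: lowest-bit layers get 2x the base rank
            frac = (hi - bits) / (hi - lo)
            ranks[layer] = int(base_rank * (1 + frac))
    return ranks

bits = {"blocks.0": 4, "blocks.1": 8, "blocks.2": 4}
ranks = scaled_ranks(bits)
```

optiq lora info (below) prints the rank distribution the trainer actually used.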

optiq lora info

$ optiq lora info ADAPTER_DIR

Prints the per-layer rank distribution and trainable parameter count.

optiq serve

Dual-protocol inference server. Wraps mlx_lm.server and exposes both the OpenAI /v1/chat/completions endpoint and the Anthropic /v1/messages endpoint from the same process. Drive it with the OpenAI SDK, the Anthropic SDK, plain curl, or Claude Code (via ANTHROPIC_BASE_URL). On top of the upstream server it adds OptIQ-aware mixed-precision KV caching and optional LoRA adapter mounting.

$ optiq serve [OPTIONS] -- [mlx_lm.server flags]

OptIQ-specific options

  • --kv-config PATH: JSON config from optiq kv-cache. Per-layer mixed-precision KV. Overrides --kv-bits.
  • --kv-bits INTEGER: Uniform KV quantization (4 or 8). Omit both --kv-config and --kv-bits for fp16 KV.
  • --kv-group-size INTEGER: KV quantization group size. Default: 64.
  • --quantized-kv-start INTEGER: Token offset at which KV quantization kicks in. Default: 0.
  • --adapter PATH-OR-REPO: Apply a LoRA adapter at startup. Accepts a Hugging Face repo ID (auto-downloaded) or a local directory. OptIQ-trained adapters surface their optiq_lora_config.json sidecar in startup logs.
  • --anthropic / --no-anthropic: Expose the Anthropic /v1/messages endpoint alongside OpenAI's. Default: on. Pass --no-anthropic to disable.

Forwarded to mlx_lm.server

All other flags pass through unchanged. Common ones include --model, --host, --port, --max-tokens, --temp, --top-p, --top-k. See python -m mlx_lm.server --help for the full upstream list and current defaults.

Examples

# 1. Generate a per-layer KV config (once per model)
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit --target-bits 5.0

# 2. Serve with mixed-precision KV + Anthropic on by default
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config optiq_output/kv_cache/kv_config.json \
    --port 8080

# 3. Drive Claude Code through it
$ export ANTHROPIC_BASE_URL="http://localhost:8080"
$ export ANTHROPIC_API_KEY="not-used"
$ claude
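Both endpoints take their protocols' standard request shapes; nothing OptIQ-specific is needed on the client side. A minimal stdlib-only sketch that builds the two request bodies (the model name and port are the ones from the example above — send the payloads with any HTTP client once the server is up):

```python
# Build OpenAI- and Anthropic-style request bodies for optiq serve.
# Standard endpoint shapes; nothing here is OptIQ-specific.
import json

BASE = "http://localhost:8080"
messages = [{"role": "user", "content": "Say hi in five words."}]

# OpenAI /v1/chat/completions payload
openai_req = {
    "url": f"{BASE}/v1/chat/completions",
    "body": json.dumps({
        "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
        "messages": messages,
        "max_tokens": 64,
    }),
}

# Anthropic /v1/messages payload (max_tokens is required by this protocol)
anthropic_req = {
    "url": f"{BASE}/v1/messages",
    "body": json.dumps({
        "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
        "messages": messages,
        "max_tokens": 64,
    }),
}
```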

optiq eval

Two-stage evaluation harness. --task smoketest runs a fast triage pass (KL + GSM8K-50, ~5 min on a 27B). --task all runs the full benchmark suite (MMLU + GSM8K + IFEval + BFCL + HumanEval, ~1.5 h on a 27B). Add --score for the unweighted-mean Capability Score. Individual tasks can also be run on their own. The methodology, and what the suite has caught, are covered in the eval-framework write-up.

$ optiq eval MODEL_PATH --task [TASK] [OPTIONS]

Tasks

  • smoketest: KL on 64 prompts × 256 tokens, plus GSM8K on 50 samples. Fast triage.
  • all: full benchmark suite below, in one run.
  • kl: KL divergence vs reference (default 64 prompts).
  • mmlu: 5-shot MMLU on 1000 samples.
  • gsm8k: GSM8K on 1000 samples (default), CoT 3-shot. Default task if none specified.
  • gsm8k-50: GSM8K on 50 samples (smoketest tier).
  • ifeval: full 540-prompt IFEval (instruction following).
  • bfcl: BFCL-V3 simple, 200 questions (function calling).
  • humaneval: full 164 problems, code execution sandboxed (apple/container → sandbox-exec → subprocess + rlimit).
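The kl task's core metric is the token-level KL divergence between the quantized model's next-token distribution and the reference's. A self-contained sketch of that computation — toy 4-token logits, not real model outputs, and the divergence direction here (reference || quantized) is the conventional choice, stated as an assumption:

```python
# Token-level KL divergence KL(p_ref || q_quant) from raw logits.
# Real evals average this over many prompts and token positions.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(ref_logits, quant_logits):
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

identical = kl_divergence([1.0, 2.0, 3.0, 0.5], [1.0, 2.0, 3.0, 0.5])
shifted = kl_divergence([1.0, 2.0, 3.0, 0.5], [1.0, 2.0, 2.0, 0.5])
```

Identical distributions give a KL of exactly zero; any quantization-induced shift in the distribution pushes it above zero.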

Options

  • --task TEXT: One of the tasks above. Default: gsm8k.
  • --n-samples INTEGER: Override the per-task default sample count.
  • --baseline PATH: Side-by-side comparison against another model (gsm8k task only).
  • --reference-model PATH: Explicit reference for the KL eval. If unset, auto-resolved.
  • --reference-mode [auto|bf16|uniform_4bit]: KL reference selection strategy. Default: auto (bf16 if it fits in RAM via HfApi.model_info per-shard sizes, else mlx-community uniform-4-bit baseline).
  • --score: With --task all, also compute the Capability Score (unweighted mean of MMLU, GSM8K, IFEval, BFCL, HumanEval).
  • --kv-bits INTEGER: Uniform KV quantization (4 or 8) during generation-based benchmarks. KL and MMLU are forward-pass only, so this has no effect there.
  • --kv-group-size INTEGER: Default: 64.
  • --kv-config PATH: Per-layer mixed-precision KV from optiq kv-cache. Overrides --kv-bits.
  • --output-json PATH: Dump the structured eval record (every metric + Capability Score) to JSON.

Examples

# Fast triage
$ optiq eval ./optiq_output/Qwen3.5-9B/optiq_mixed --task smoketest

# Full benchmark suite + Capability Score, dump to JSON
$ optiq eval ./optiq_mixed --task all --score \
    --output-json ./bench.json

# Single benchmark
$ optiq eval ./optiq_mixed --task humaneval

# GSM8K head-to-head against the uniform-4-bit baseline
$ optiq eval ./optiq_mixed --task gsm8k \
    --baseline ./optiq_output/Qwen3.5-9B/uniform_4bit

# Eval the full suite with mixed-precision KV active during generation
$ optiq eval ./optiq_mixed --task all \
    --kv-config ./kv/qwen35_9b/kv_config.json
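The Capability Score produced by --score is just the unweighted mean over the five benchmarks; a trivial sketch (the numbers below are placeholders, not real results):

```python
# Capability Score: unweighted mean of the five benchmark scores.
# Placeholder accuracies -- not real eval output.
BENCHMARKS = ("mmlu", "gsm8k", "ifeval", "bfcl", "humaneval")

def capability_score(results):
    return sum(results[b] for b in BENCHMARKS) / len(BENCHMARKS)

results = {"mmlu": 0.70, "gsm8k": 0.80, "ifeval": 0.75,
           "bfcl": 0.65, "humaneval": 0.60}
score = capability_score(results)
```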

optiq benchmark

Quick-and-dirty perplexity + throughput on a converted model, with optional baseline side-by-side. For headline accuracy numbers, prefer optiq eval above.

$ optiq benchmark MODEL_PATH [--baseline UNIFORM_PATH] [--n-samples N]

Options

  • --baseline PATH: Side-by-side comparison against a uniform-4-bit baseline (or any other model path).
  • --n-samples INTEGER: Perplexity samples. Default: 50.

optiq latency

Predicts decode tok/s for a quantized model using the Apple Silicon roofline model: latency ≈ model_bytes / memory_bandwidth + per_layer_overhead. Detects your hardware tier and bandwidth automatically.

# Bare prediction (linear-layer-only roofline; optimistic)
$ optiq latency ./optiq_mixed

# Calibrated: load + run the model once, fit the overhead constant
$ optiq latency ./optiq_mixed --calibrate

Options

  • --calibrate: Run 8 warmup + 15 measured generations to fit the framework-overhead constant for this hardware. The bare prediction counts only weight-loading time and is optimistic; on M3 Max roughly 83% of decode latency is overhead (attention, norms, KV cache, framework), so calibration produces meaningfully better predictions for the measured model and for similarly sized models on the same hardware.
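The roofline formula above is simple enough to write down directly. A sketch with illustrative numbers — the bandwidth and overhead constants below are placeholders, not calibrated values for any real machine:

```python
# Roofline decode prediction: latency ~= model_bytes / memory_bandwidth
# + per-layer overhead. All constants below are illustrative only.

def predict_decode_tps(model_bytes, bandwidth_gbs, n_layers,
                       overhead_us_per_layer=0.0):
    weight_s = model_bytes / (bandwidth_gbs * 1e9)        # stream weights once per token
    overhead_s = n_layers * overhead_us_per_layer * 1e-6  # fitted by --calibrate
    return 1.0 / (weight_s + overhead_s)

# ~9B params at 4.5 BPW ~= 5.06 GB of weights; 400 GB/s bandwidth assumed
model_bytes = int(9e9 * 4.5 / 8)
bare = predict_decode_tps(model_bytes, 400, n_layers=48)
calibrated = predict_decode_tps(model_bytes, 400, n_layers=48,
                                overhead_us_per_layer=100.0)
```

The gap between the bare and calibrated predictions is exactly why the bare number is described as optimistic.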

Top-level options

  • optiq --version: Print the installed version.
  • optiq --help: Show top-level help.
  • optiq COMMAND --help: Per-command help.

Environment variables

  • HF_HOME: Hugging Face cache root. Default: ~/.cache/huggingface.
  • HF_HUB_ENABLE_HF_TRANSFER=1: Enable accelerated HF downloads (requires pip install hf_transfer).
  • OPTIQ_ADAPTER_CACHE: Where optiq serve --adapter <hf-repo> downloads remote adapters. Default: under the HF cache.
  • ANTHROPIC_BASE_URL / ANTHROPIC_API_KEY: Set these to point Claude Code (and any Anthropic-SDK consumer) at optiq serve.