Reference · CLI

CLI reference

The optiq CLI is built with Click. Seven top-level commands cover the entire mlx-optiq workflow:

optiq convert: quantize an HF model with mixed-precision sensitivity
optiq kv-cache: measure per-layer KV sensitivity, write a serving config
optiq lora train / info: sensitivity-aware LoRA training
optiq serve: dual-protocol inference server (OpenAI + Anthropic)
optiq eval: smoketest + full benchmark suite (MMLU, GSM8K, IFEval, BFCL, HumanEval)
optiq benchmark: perplexity + throughput on a converted model
optiq latency: Apple Silicon roofline throughput predictor

`optiq convert`

Quantize a Hugging Face model with mixed-precision sensitivity-driven bit allocation.

convert.shbash

$ optiq convert MODEL [OPTIONS]

Required

MODEL: HF repo ID (e.g. Qwen/Qwen3.5-9B) or local path to a bf16 checkpoint.

Options

--target-bpw FLOAT: Average bits per weight. Default: 4.5.
--candidate-bits TEXT: Comma-separated bit-widths to choose from. Default: 4,8.
--reference [auto|bf16|uniform_4bit]: Reference precision for sensitivity probes. Default: auto (bf16 if it fits in RAM, else uniform_4bit).
--calibration-mix TEXT: Calibration set. Default: optiq (the bundled 6-domain mix). Pass a path to override with a custom JSONL file.
--group-size INTEGER: Quantization group size. Default: 64.
--n-calibration INTEGER: Calibration sequences for sensitivity (drawn from the 40-sample mix). Default: 24 (roughly 4 samples from each of the 6 domains). Higher = more stable per-layer KL ranking, linearly slower convert.
--n-floor-per-block INTEGER: Block-aware floor: minimum components per transformer block kept above the lowest bit. Prevents lowest-bit assignments from concentrating in the middle. Default: 2; 0 disables.
--skip-baselines: Don't also build the uniform-4-bit comparison artifact.
-o, --output TEXT: Output directory. Default: optiq_output/<model_basename>.

Examples

terminalbash

# Standard 4.5 BPW mix on a 9B (auto-routes to bf16 reference)
$ optiq convert Qwen/Qwen3.5-9B --target-bpw 4.5

# 27B+ on a 36 GB Mac (auto-routes to uniform_4bit reference)
$ optiq convert Qwen/Qwen3.5-27B --reference uniform_4bit

# Custom bit set (3-bit / 6-bit mix at 4 BPW average)
$ optiq convert Qwen/Qwen3.5-4B \
    --target-bpw 4.0 --candidate-bits 3,4,6,8

`optiq kv-cache`

Measure per-layer KV-cache sensitivity and write a per-layer KV bit-width config that optiq serve --kv-config consumes.

kv-cache.shbash

$ optiq kv-cache MODEL [OPTIONS]

Options

--target-bits FLOAT: Average KV bits across full-attention layers. Default: 5.0.
--candidate-bits TEXT: Default: 4,8.
--n-samples INTEGER: Calibration samples for KV sensitivity. Default: 5.
--seq-len INTEGER: Calibration sequence length. Default: 512.
--group-size INTEGER: Default: 64.
-o, --output TEXT: Where to write kv_config.json and kv_sensitivity.json. Default: optiq_output/kv_cache.

Gemma-4 caveat Gemma-4 uses shared-KV attention; the current mlx-lm KV-cache implementation doesn't yet support mixed-precision quantization on this layout. optiq kv-cache exits cleanly with an explanation when run on Gemma-4. Use stock fp16 KV serving on Gemma instead.

`optiq lora`

Train and inspect sensitivity-aware LoRA adapters.

`optiq lora train`

lora-train.shbash

$ optiq lora train MODEL [OPTIONS]

--data PATH: Directory containing train.jsonl (and optional valid.jsonl). Required.
--rank INTEGER: Base LoRA rank. Default: 8.
--scale FLOAT: LoRA alpha scaling (alpha = rank * scale). Default: 20.0.
--dropout FLOAT: LoRA dropout. Default: 0.0.
--rank-scaling [constant|by_bits|by_kl]: How to scale rank across layers. by_bits uses the per-layer bit assignments OptIQ recorded at convert time; by_kl scales by raw KL sensitivity; constant is uniform. Default: by_bits.
--num-layers INTEGER: Last N transformer blocks to adapt; -1 for all. Default: 16.
--target-modules TEXT: Comma-separated module suffixes to adapt. Default: q_proj,v_proj.
--use-dora: Use DoRA in place of LoRA (currently NotImplementedError on mlx-lm 0.31.x).
--max-seq-length INTEGER: Tokens per training sample. Default: 1600 (M3 Max 36 GB safe ceiling for 9B). See the fine-tuning training-ceiling map for per-model recipes.
--iters INTEGER: Training iterations. Default: 1000.
--learning-rate / --lr FLOAT: Default: 1e-4.
--batch-size INTEGER: Default: 1 (Mac UMA constraint).
--grad-accumulation-steps INTEGER: Default: 1.
--grad-checkpoint / --no-grad-checkpoint: Enable activation checkpointing. Default: on.
--val-batches INTEGER: Default: 25.
--steps-per-report INTEGER: Default: 10.
--steps-per-eval INTEGER: Default: 200.
--steps-per-save INTEGER: Default: 100.
-o, --output PATH: Adapter output directory. Default: ./optiq_lora_adapters.

`optiq lora info`

lora-info.shbash

$ optiq lora info ADAPTER_DIR

Prints the per-layer rank distribution and trainable parameter count.

`optiq serve`

Dual-protocol inference server. Wraps mlx_lm.server and exposes both the OpenAI /v1/chat/completions endpoint and the Anthropic /v1/messages endpoint from the same process. Drive it with the OpenAI SDK, the Anthropic SDK, plain curl, or Claude Code (via ANTHROPIC_BASE_URL). Adds OptIQ-aware mixed-precision KV cache and a mounted LoRA adapter on top.

serve.shbash

$ optiq serve [OPTIONS] -- [mlx_lm.server flags]

OptIQ-specific options

--kv-config PATH: JSON config from optiq kv-cache. Per-layer mixed-precision KV. Overrides --kv-bits.
--kv-bits INTEGER: Uniform KV quantization (4 or 8). Omit both --kv-config and --kv-bits for fp16 KV.
--kv-group-size INTEGER: KV quantization group size. Default: 64.
--quantized-kv-start INTEGER: Token offset at which KV quantization kicks in. Default: 0.
--adapter PATH-OR-REPO: Apply a LoRA adapter at startup. Accepts a HuggingFace repo id (auto-downloaded) or a local directory. OptIQ-trained adapters surface their optiq_lora_config.json sidecar in startup logs.
--anthropic / --no-anthropic: Expose the Anthropic /v1/messages endpoint alongside OpenAI's. Default: on. Pass --no-anthropic to disable.

Forwarded to `mlx_lm.server`

All other flags pass through unchanged. Common ones include --model, --host, --port, --max-tokens, --temp, --top-p, --top-k. See python -m mlx_lm.server --help for the full upstream list and current defaults.

Examples

terminalbash

# 1. Generate a per-layer KV config (once per model)
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit --target-bits 5.0

# 2. Serve with mixed-precision KV + Anthropic on by default
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config optiq_output/kv_cache/kv_config.json \
    --port 8080

# 3. Drive Claude Code through it
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="not-used"
$ claude

`optiq eval`

Two-stage evaluation harness. --task smoketest runs a fast triage (KL + GSM8K-50, ~5 min on 27B). --task all runs the full benchmark suite (MMLU + GSM8K + IFEval + BFCL + HumanEval, ~1.5 h on 27B). Add --score for the unweighted-mean Capability Score. Individual tasks are addressable on their own. The methodology and what the suite caught is in the eval-framework write-up.

eval.shbash

$ optiq eval MODEL_PATH --task [TASK] [OPTIONS]

Tasks

smoketest: KL on 64 prompts × 256 tokens, plus GSM8K on 50 samples. Fast triage.
all: full benchmark suite below, in one run.
kl: KL divergence vs reference (default 64 prompts).
mmlu: 5-shot MMLU on 1000 samples.
gsm8k: GSM8K on 1000 samples (default), CoT 3-shot. Default task if none specified.
gsm8k-50: GSM8K on 50 samples (smoketest tier).
ifeval: full 540-prompt IFEval (instruction following).
bfcl: BFCL-V3 simple, 200 questions (function calling).
humaneval: full 164 problems, code execution sandboxed (apple/container → sandbox-exec → subprocess + rlimit).

Options

--task TEXT: One of the tasks above. Default: gsm8k.
--n-samples INTEGER: Override the per-task default sample count.
--baseline PATH: Side-by-side comparison against another model (gsm8k task only).
--reference-model PATH: Explicit reference for the KL eval. If unset, auto-resolved.
--reference-mode [auto|bf16|uniform_4bit]: KL reference selection strategy. Default: auto (bf16 if it fits in RAM via HfApi.model_info per-shard sizes, else mlx-community uniform-4-bit baseline).
--score: With --task all, also compute the Capability Score (unweighted mean of MMLU, GSM8K, IFEval, BFCL, HumanEval).
--kv-bits INTEGER: Uniform KV quantization (4 or 8) during generation-based benchmarks. KL and MMLU are forward-pass only, so this has no effect there.
--kv-group-size INTEGER: Default: 64.
--kv-config PATH: Per-layer mixed-precision KV from optiq kv-cache. Overrides --kv-bits.
--output-json PATH: Dump the structured eval record (every metric + Capability Score) to JSON.

Examples

terminalbash

# Fast triage
$ optiq eval ./optiq_output/Qwen3.5-9B/optiq_mixed --task smoketest

# Full benchmark suite + Capability Score, dump to JSON
$ optiq eval ./optiq_mixed --task all --score \
    --output-json ./bench.json

# Single benchmark
$ optiq eval ./optiq_mixed --task humaneval

# GSM8K head-to-head against the uniform-4-bit baseline
$ optiq eval ./optiq_mixed --task gsm8k \
    --baseline ./optiq_output/Qwen3.5-9B/uniform_4bit

# Eval the full suite with mixed-precision KV active during generation
$ optiq eval ./optiq_mixed --task all \
    --kv-config ./kv/qwen35_9b/kv_config.json

`optiq benchmark`

Quick-and-dirty perplexity + throughput on a converted model, with optional baseline side-by-side. For headline accuracy numbers, prefer optiq eval above.

benchmark.shbash

$ optiq benchmark MODEL_PATH [--baseline UNIFORM_PATH] [--n-samples N]

--baseline PATH: Side-by-side comparison against a uniform-4-bit baseline (or any other model path).
--n-samples INTEGER: Perplexity samples. Default: 50.

`optiq latency`

Predicts decode tok/s for a quantized model using the Apple Silicon roofline model: latency ≈ model_bytes / memory_bandwidth + per_layer_overhead. Detects your hardware tier and bandwidth automatically.

latency.shbash

# Bare prediction (linear-layer-only roofline; optimistic)
$ optiq latency ./optiq_mixed

# Calibrated: load + run the model once, fit the overhead constant
$ optiq latency ./optiq_mixed --calibrate

--calibrate: Run 8 warmup + 15 measured generations to fit the framework-overhead constant for this hardware. The bare prediction only counts weight-loading time and is optimistic; on M3 Max ~83% of decode latency is overhead (attention, norms, KV cache, framework), so calibration produces meaningfully better predictions for the model you measured and for similarly-sized models on the same hardware.

Top-level options

optiq --version: Print the installed version.
optiq --help: Show top-level help.
optiq COMMAND --help: Per-command help.

Environment variables

HF_HOME: Hugging Face cache root. Default: ~/.cache/huggingface.
HF_HUB_ENABLE_HF_TRANSFER=1: Enable accelerated HF downloads (requires pip install hf_transfer).
OPTIQ_ADAPTER_CACHE: Where optiq serve --adapter <hf-repo> downloads remote adapters. Default: under the HF cache.
ANTHROPIC_BASE_URL / ANTHROPIC_API_KEY: Set these to point Claude Code (and any Anthropic-SDK consumer) at optiq serve.

CLI reference

optiq convert

Required

Options

Examples

optiq kv-cache

Options

optiq lora

optiq lora train

optiq lora info

optiq serve

OptIQ-specific options

Forwarded to mlx_lm.server

Examples

optiq eval

Tasks

Options

Examples

optiq benchmark

optiq latency

Top-level options

Environment variables

`optiq convert`

`optiq kv-cache`

`optiq lora`

`optiq lora train`

`optiq lora info`

`optiq serve`

Forwarded to `mlx_lm.server`

`optiq eval`

`optiq benchmark`

`optiq latency`