CLI reference
The optiq CLI is built with Click. Seven top-level commands cover the entire mlx-optiq workflow:
- `optiq convert`: quantize an HF model with mixed-precision sensitivity
- `optiq kv-cache`: measure per-layer KV sensitivity, write a serving config
- `optiq lora train` / `info`: sensitivity-aware LoRA training
- `optiq serve`: dual-protocol inference server (OpenAI + Anthropic)
- `optiq eval`: smoketest + full benchmark suite (MMLU, GSM8K, IFEval, BFCL, HumanEval)
- `optiq benchmark`: perplexity + throughput on a converted model
- `optiq latency`: Apple Silicon roofline throughput predictor
optiq convert
Quantize a Hugging Face model with sensitivity-driven mixed-precision bit allocation.
```
$ optiq convert MODEL [OPTIONS]
```
Required
- `MODEL`: HF repo ID (e.g. `Qwen/Qwen3.5-9B`) or local path to a bf16 checkpoint.
Options
- `--target-bpw FLOAT`: Average bits per weight. Default: `4.5`.
- `--candidate-bits TEXT`: Comma-separated bit-widths to choose from. Default: `4,8`.
- `--reference [auto|bf16|uniform_4bit]`: Reference precision for sensitivity probes. Default: `auto` (bf16 if it fits in RAM, else uniform_4bit).
- `--calibration-mix TEXT`: Calibration set. Default: `optiq` (the bundled 6-domain mix). Pass a path to override with a custom JSONL file.
- `--group-size INTEGER`: Quantization group size. Default: `64`.
- `--n-calibration INTEGER`: Calibration sequences for sensitivity (drawn from the 40-sample mix). Default: `24` (roughly 4 samples from each of the 6 domains). Higher = more stable per-layer KL ranking, linearly slower convert.
- `--n-floor-per-block INTEGER`: Block-aware floor: minimum components per transformer block kept above the lowest bit. Prevents lowest-bit assignments from concentrating in the middle. Default: `2`; `0` disables.
- `--skip-baselines`: Don't also build the uniform-4-bit comparison artifact.
- `-o, --output TEXT`: Output directory. Default: `optiq_output/<model_basename>`.
Examples
```
# Standard 4.5 BPW mix on a 9B (auto-routes to bf16 reference)
$ optiq convert Qwen/Qwen3.5-9B --target-bpw 4.5

# 27B+ on a 36 GB Mac (auto-routes to uniform_4bit reference)
$ optiq convert Qwen/Qwen3.5-27B --reference uniform_4bit

# Custom bit set (3-bit / 6-bit mix at 4 BPW average)
$ optiq convert Qwen/Qwen3.5-4B \
    --target-bpw 4.0 --candidate-bits 3,4,6,8
```
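The intuition behind `--target-bpw` can be sketched as a greedy upgrade loop: every layer starts at the lowest candidate bit-width, and the most KL-sensitive layers are promoted to a higher width until the average hits the budget. The sketch below is illustrative only (equal layer sizes, made-up sensitivities, a hypothetical `allocate_bits` helper), not OptIQ's actual allocator:

```python
def allocate_bits(sensitivities, target_bpw, low=4, high=8):
    """Greedy mixed-precision sketch: all layers start at `low` bits;
    layers are upgraded to `high` in descending sensitivity order while
    the average bits-per-weight stays within `target_bpw`.
    Assumes all layers have equal parameter counts (a simplification)."""
    n = len(sensitivities)
    bits = [low] * n
    for i in sorted(range(n), key=lambda i: -sensitivities[i]):
        # Upgrade only if the budget still allows it.
        if (sum(bits) - bits[i] + high) / n <= target_bpw:
            bits[i] = high
    return bits

# Toy example: 8 layers, 4.5 BPW budget -> exactly one layer fits at 8-bit.
bits = allocate_bits([0.9, 0.1, 0.2, 0.05, 0.3, 0.15, 0.4, 0.08], target_bpw=4.5)
```

With `4,8` candidates and a `4.5` target, roughly one layer in eight can be held at 8-bit, which is why the most sensitive layers matter so much to the ranking.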
optiq kv-cache
Measure per-layer KV-cache sensitivity and write a per-layer KV bit-width config that optiq serve --kv-config consumes.
```
$ optiq kv-cache MODEL [OPTIONS]
```
Options
- `--target-bits FLOAT`: Average KV bits across full-attention layers. Default: `5.0`.
- `--candidate-bits TEXT`: Default: `4,8`.
- `--n-samples INTEGER`: Calibration samples for KV sensitivity. Default: `5`.
- `--seq-len INTEGER`: Calibration sequence length. Default: `512`.
- `--group-size INTEGER`: Default: `64`.
- `-o, --output TEXT`: Where to write `kv_config.json` and `kv_sensitivity.json`. Default: `optiq_output/kv_cache`.
Note: the mlx-lm KV-cache implementation doesn't yet support mixed-precision quantization on Gemma's cache layout, so `optiq kv-cache` exits cleanly with an explanation when run on Gemma-4. Use stock fp16 KV serving on Gemma instead.
optiq lora
Train and inspect sensitivity-aware LoRA adapters.
optiq lora train
```
$ optiq lora train MODEL [OPTIONS]
```
- `--data PATH`: Directory containing `train.jsonl` (and optional `valid.jsonl`). Required.
- `--rank INTEGER`: Base LoRA rank. Default: `8`.
- `--scale FLOAT`: LoRA alpha scaling (alpha = rank * scale). Default: `20.0`.
- `--dropout FLOAT`: LoRA dropout. Default: `0.0`.
- `--rank-scaling [constant|by_bits|by_kl]`: How to scale rank across layers. `by_bits` uses the per-layer bit assignments OptIQ recorded at convert time; `by_kl` scales by raw KL sensitivity; `constant` is uniform. Default: `by_bits`.
- `--num-layers INTEGER`: Last N transformer blocks to adapt; `-1` for all. Default: `16`.
- `--target-modules TEXT`: Comma-separated module suffixes to adapt. Default: `q_proj,v_proj`.
- `--use-dora`: Use DoRA in place of LoRA (currently `NotImplementedError` on mlx-lm 0.31.x).
- `--max-seq-length INTEGER`: Tokens per training sample. Default: `1600` (M3 Max 36 GB safe ceiling for 9B). See the fine-tuning training-ceiling map for per-model recipes.
- `--iters INTEGER`: Training iterations. Default: `1000`.
- `--learning-rate` / `--lr FLOAT`: Default: `1e-4`.
- `--batch-size INTEGER`: Default: `1` (Mac UMA constraint).
- `--grad-accumulation-steps INTEGER`: Default: `1`.
- `--grad-checkpoint / --no-grad-checkpoint`: Enable activation checkpointing. Default: on.
- `--val-batches INTEGER`: Default: `25`.
- `--steps-per-report INTEGER`: Default: `10`.
- `--steps-per-eval INTEGER`: Default: `200`.
- `--steps-per-save INTEGER`: Default: `100`.
- `-o, --output PATH`: Adapter output directory. Default: `./optiq_lora_adapters`.
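One way to picture `--rank-scaling by_bits`: layers that were quantized to fewer bits lost more precision, so they get a proportionally larger LoRA rank to recover it. The mapping below is a hypothetical sketch of that idea (the `scale_ranks` helper and the inverse-proportional rule are illustrative assumptions, not OptIQ's exact formula):

```python
def scale_ranks(layer_bits, base_rank=8, max_bits=8):
    """Illustrative by_bits-style rank scaling: a layer held at fewer bits
    gets a proportionally larger rank (e.g. a 4-bit layer gets 2x the base
    rank of an 8-bit layer). Sketch only; not OptIQ's actual rule."""
    return [base_rank * max_bits // b for b in layer_bits]

# Four adapted layers with convert-time bit assignments [4, 8, 4, 8]:
ranks = scale_ranks([4, 8, 4, 8])
```

Under this rule the 4-bit layers get rank 16 and the 8-bit layers keep the base rank of 8, concentrating adapter capacity where quantization error is largest.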
optiq lora info
```
$ optiq lora info ADAPTER_DIR
```
Prints the per-layer rank distribution and trainable parameter count.
optiq serve
Dual-protocol inference server. Wraps `mlx_lm.server` and exposes both the OpenAI `/v1/chat/completions` endpoint and the Anthropic `/v1/messages` endpoint from the same process. Drive it with the OpenAI SDK, the Anthropic SDK, plain curl, or Claude Code (via `ANTHROPIC_BASE_URL`). On top of the upstream server it adds an OptIQ-aware mixed-precision KV cache and a mounted LoRA adapter.
```
$ optiq serve [OPTIONS] -- [mlx_lm.server flags]
```
OptIQ-specific options
- `--kv-config PATH`: JSON config from `optiq kv-cache`. Per-layer mixed-precision KV. Overrides `--kv-bits`.
- `--kv-bits INTEGER`: Uniform KV quantization (4 or 8). Omit both `--kv-config` and `--kv-bits` for fp16 KV.
- `--kv-group-size INTEGER`: KV quantization group size. Default: `64`.
- `--quantized-kv-start INTEGER`: Token offset at which KV quantization kicks in. Default: `0`.
- `--adapter PATH-OR-REPO`: Apply a LoRA adapter at startup. Accepts a Hugging Face repo ID (auto-downloaded) or a local directory. OptIQ-trained adapters surface their `optiq_lora_config.json` sidecar in startup logs.
- `--anthropic / --no-anthropic`: Expose the Anthropic `/v1/messages` endpoint alongside OpenAI's. Default: on. Pass `--no-anthropic` to disable.
Forwarded to mlx_lm.server
All other flags pass through unchanged. Common ones include `--model`, `--host`, `--port`, `--max-tokens`, `--temp`, `--top-p`, `--top-k`. See `python -m mlx_lm.server --help` for the full upstream list and current defaults.
Examples
```
# 1. Generate a per-layer KV config (once per model)
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit --target-bits 5.0

# 2. Serve with mixed-precision KV + Anthropic on by default
$ optiq serve \
    --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config optiq_output/kv_cache/kv_config.json \
    --port 8080

# 3. Drive Claude Code through it
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="not-used"
$ claude
```
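For driving the Anthropic-protocol endpoint directly, this sketch builds the standard Messages-API request body the server accepts (the payload is constructed but not sent here; the model name and port are illustrative):

```python
import json

# Illustrative /v1/messages request body for the Anthropic-protocol side
# of `optiq serve`. POST this as JSON to http://localhost:8080/v1/messages
# (port and model name are assumptions from the serve example above).
payload = {
    "model": "mlx-community/Qwen3.5-9B-OptiQ-4bit",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Explain KV-cache quantization briefly."},
    ],
}
body = json.dumps(payload)
```

The same process also answers OpenAI-style `/v1/chat/completions` requests, so either SDK can point at the one server.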
optiq eval
Two-stage evaluation harness. `--task smoketest` runs a fast triage (KL + GSM8K-50, ~5 min on 27B). `--task all` runs the full benchmark suite (MMLU + GSM8K + IFEval + BFCL + HumanEval, ~1.5 h on 27B). Add `--score` for the unweighted-mean Capability Score. Individual tasks are also addressable on their own. The methodology, and what the suite caught, are covered in the eval-framework write-up.
```
$ optiq eval MODEL_PATH --task [TASK] [OPTIONS]
```
Tasks
- `smoketest`: KL on 64 prompts × 256 tokens, plus GSM8K on 50 samples. Fast triage.
- `all`: full benchmark suite below, in one run.
- `kl`: KL divergence vs. reference (default 64 prompts).
- `mmlu`: 5-shot MMLU on 1000 samples.
- `gsm8k`: GSM8K on 1000 samples (default), CoT 3-shot. Default task if none specified.
- `gsm8k-50`: GSM8K on 50 samples (smoketest tier).
- `ifeval`: full 540-prompt IFEval (instruction following).
- `bfcl`: BFCL-V3 simple, 200 questions (function calling).
- `humaneval`: full 164 problems, code execution sandboxed (apple/container → sandbox-exec → subprocess + rlimit).
Options
- `--task TEXT`: One of the tasks above. Default: `gsm8k`.
- `--n-samples INTEGER`: Override the per-task default sample count.
- `--baseline PATH`: Side-by-side comparison against another model (gsm8k task only).
- `--reference-model PATH`: Explicit reference for the KL eval. If unset, auto-resolved.
- `--reference-mode [auto|bf16|uniform_4bit]`: KL reference selection strategy. Default: `auto` (bf16 if it fits in RAM via `HfApi.model_info` per-shard sizes, else the mlx-community uniform-4-bit baseline).
- `--score`: With `--task all`, also compute the Capability Score (unweighted mean of MMLU, GSM8K, IFEval, BFCL, HumanEval).
- `--kv-bits INTEGER`: Uniform KV quantization (4 or 8) during generation-based benchmarks. KL and MMLU are forward-pass only, so this has no effect there.
- `--kv-group-size INTEGER`: Default: `64`.
- `--kv-config PATH`: Per-layer mixed-precision KV from `optiq kv-cache`. Overrides `--kv-bits`.
- `--output-json PATH`: Dump the structured eval record (every metric + Capability Score) to JSON.
Examples
```
# Fast triage
$ optiq eval ./optiq_output/Qwen3.5-9B/optiq_mixed --task smoketest

# Full benchmark suite + Capability Score, dump to JSON
$ optiq eval ./optiq_mixed --task all --score \
    --output-json ./bench.json

# Single benchmark
$ optiq eval ./optiq_mixed --task humaneval

# GSM8K head-to-head against the uniform-4-bit baseline
$ optiq eval ./optiq_mixed --task gsm8k \
    --baseline ./optiq_output/Qwen3.5-9B/uniform_4bit

# Eval the full suite with mixed-precision KV active during generation
$ optiq eval ./optiq_mixed --task all \
    --kv-config ./kv/qwen35_9b/kv_config.json
```
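The `--score` aggregation is just the unweighted mean over the five full-suite tasks. A minimal sketch (the accuracy values below are made-up for illustration):

```python
def capability_score(results):
    """Unweighted mean of the five full-suite benchmark accuracies,
    mirroring what `optiq eval --task all --score` reports.
    `results` maps task name -> accuracy in [0, 1]."""
    tasks = ("mmlu", "gsm8k", "ifeval", "bfcl", "humaneval")
    return sum(results[t] for t in tasks) / len(tasks)

# Made-up per-task accuracies, just to show the aggregation:
score = capability_score(
    {"mmlu": 0.62, "gsm8k": 0.71, "ifeval": 0.55, "bfcl": 0.60, "humaneval": 0.42}
)
```

Because the mean is unweighted, a regression on any one task moves the score by a fifth of its size, which is what makes the single number useful for quick quantization comparisons.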
optiq benchmark
Quick-and-dirty perplexity + throughput on a converted model, with an optional baseline side-by-side. For headline accuracy numbers, prefer `optiq eval` above.
```
$ optiq benchmark MODEL_PATH [--baseline UNIFORM_PATH] [--n-samples N]
```
- `--baseline PATH`: Side-by-side comparison against a uniform-4-bit baseline (or any other model path).
- `--n-samples INTEGER`: Perplexity samples. Default: `50`.
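As a quick reference, perplexity is the exponentiated mean per-token negative log-likelihood over the sampled sequences (the standard definition; this is not OptIQ-specific code):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).
    `token_nlls` is a flat list of per-token NLL values in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that is perfectly certain of every token (NLL 0) has perplexity 1;
# uniform 1-of-2 uncertainty (NLL = ln 2) gives perplexity 2.
```

Lower is better, and small perplexity deltas vs. the baseline are the first sanity signal that a quantization didn't damage the model.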
optiq latency
Predicts decode tok/s for a quantized model using the Apple Silicon roofline model: `latency ≈ model_bytes / memory_bandwidth + per_layer_overhead`. Detects your hardware tier and memory bandwidth automatically.
```
# Bare prediction (linear-layer-only roofline; optimistic)
$ optiq latency ./optiq_mixed

# Calibrated: load + run the model once, fit the overhead constant
$ optiq latency ./optiq_mixed --calibrate
```
- `--calibrate`: Runs 8 warmup + 15 measured generations to fit the framework-overhead constant for this hardware. The bare prediction counts only weight-loading time and is optimistic; on M3 Max, roughly 83% of decode latency is overhead (attention, norms, KV cache, framework), so calibration produces meaningfully better predictions for the model you measured and for similarly sized models on the same hardware.
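The roofline relation can be evaluated by hand. A sketch with illustrative inputs (the 9B parameter count, 4.5 BPW, and the ~400 GB/s bandwidth figure are assumptions for the example, not values the CLI reports):

```python
def predict_decode_tps(n_params, bits_per_weight, bandwidth_bytes_per_s,
                       overhead_s_per_token=0.0):
    """Roofline decode estimate: each generated token streams every weight
    from memory once, so per-token latency is model_bytes / bandwidth plus
    a fitted per-token overhead (the constant `--calibrate` estimates)."""
    model_bytes = n_params * bits_per_weight / 8
    latency_s = model_bytes / bandwidth_bytes_per_s + overhead_s_per_token
    return 1.0 / latency_s

# ~9B params at 4.5 BPW over an assumed ~400 GB/s: the bare, optimistic
# prediction (no overhead term), roughly 79 tok/s.
bare_tps = predict_decode_tps(9e9, 4.5, 400e9)
```

Adding a nonzero `overhead_s_per_token` pulls the estimate down sharply, which matches the point above that most measured decode latency is overhead rather than weight streaming.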
Top-level options
- `optiq --version`: Print the installed version.
- `optiq --help`: Show top-level help.
- `optiq COMMAND --help`: Per-command help.
Environment variables
- `HF_HOME`: Hugging Face cache root. Default: `~/.cache/huggingface`.
- `HF_HUB_ENABLE_HF_TRANSFER=1`: Enable accelerated HF downloads (requires `pip install hf_transfer`).
- `OPTIQ_ADAPTER_CACHE`: Where `optiq serve --adapter <hf-repo>` downloads remote adapters. Default: under the HF cache.
- `ANTHROPIC_BASE_URL` / `ANTHROPIC_API_KEY`: Set these to point Claude Code (and any Anthropic-SDK consumer) at `optiq serve`.