mlx-optiq
Family guide · Gemma-4

Gemma-4 on Apple Silicon

Google's Gemma-4 instruct series spans five sizes: two compact dense models (e2b, e4b), a dense 12B (the unified text+vision model), one sparse mixture-of-experts (26B-A4B), and one frontier-class dense model (31B). All five ship as mlx-optiq 4-bit quants on Hugging Face, and all five take image input. They share the Gemma chat template and a distinctive shared-KV attention design.

Image input: Gemma-4 takes image input. When a quant carries an optiq_vision sidecar, optiq serve and the Lab answer image+text prompts with no mlx-vlm dependency. See vision (image input).

Available quants

| Model | Size | Capability | vs uniform-4 | Best for |
| --- | --- | --- | --- | --- |
| gemma-4-31B-it-OptiQ-4bit | 18.1 GB | 79.69 | +3.47 | Strongest dense quant |
| gemma-4-26B-A4B-it-OptiQ-4bit | 14.9 GB | 72.68 | +3.06 | Sparse MoE, 4 B active |
| gemma-4-12B-it-OptiQ-4bit | 8.3 GB | 68.23 | +6.40 | Unified text+vision, image input |
| gemma-4-e4b-it-OptiQ-4bit | 6.0 GB | 65.84 | +13.57 | Best small-model recovery |
| gemma-4-e2b-it-OptiQ-4bit | 4.0 GB | 53.21 | +2.12 | Compact daily-driver |

Headline result: gemma-4-e4b-it at uniform 4-bit collapses to 23.5% on GSM8K. mlx-optiq recovers it to 55.5%, a +32-point jump at the same 6 GB on disk. This is one of the cleanest examples of why uniform-bit quantization wastes potential. Full lineup story in the Gemma-4 launch blog.
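
As a quick sanity check on the table above, the uniform 4-bit score implied by each row can be recovered by subtracting the delta from the Capability score. This is a minimal sketch that assumes the "vs uniform-4" column is an absolute point difference on the same Capability scale, which the table does not state explicitly. The names QUANTS and implied_uniform4 are illustrative.

```python
# (capability, delta vs uniform-4), copied from the table above.
QUANTS = {
    "gemma-4-31B-it-OptiQ-4bit": (79.69, 3.47),
    "gemma-4-26B-A4B-it-OptiQ-4bit": (72.68, 3.06),
    "gemma-4-12B-it-OptiQ-4bit": (68.23, 6.40),
    "gemma-4-e4b-it-OptiQ-4bit": (65.84, 13.57),
    "gemma-4-e2b-it-OptiQ-4bit": (53.21, 2.12),
}


def implied_uniform4(capability, delta):
    # Assumption: delta = OptiQ score minus uniform-4 score, in Capability points.
    return round(capability - delta, 2)


baselines = {name: implied_uniform4(c, d) for name, (c, d) in QUANTS.items()}
for name, base in baselines.items():
    print(f"{name}: implied uniform-4 score {base:.2f}")

# GSM8K headline for e4b: 23.5% at uniform 4-bit, 55.5% with mlx-optiq.
gsm8k_gain = round(55.5 - 23.5, 1)
print(f"e4b GSM8K gain: +{gsm8k_gain} points")
```

Under that assumption, the e4b row stands out because its implied uniform-4 score sits far below its neighbours, which matches the GSM8K collapse described in the headline.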

QAT variants

Google also ships QAT-unquantized Gemma-4 checkpoints: bf16 weights trained so that 4-bit quantization is near-lossless. OptiQ runs its sensitivity pass on those bases and still beats a uniform 4-bit quant of the same QAT base on the six-metric Capability Score. The scores below compare each OptiQ-QAT quant against a genuine uniform-4 quant of its own QAT base. Holding the base fixed means the delta is purely what the per-layer allocation adds.

| Model (QAT base) | Size | Capability | Δ vs uniform-4 |
| --- | --- | --- | --- |
| gemma-4-e2b-it-qat-OptiQ-4bit | 4.3 GB | 54.23 | +2.09 |
| gemma-4-e4b-it-qat-OptiQ-4bit | 6.5 GB | 65.75 | +1.19 |
| gemma-4-12B-it-qat-OptiQ-4bit | 8.9 GB | 69.64 | +1.37 |
| gemma-4-26B-A4B-it-qat-OptiQ-4bit | 20.7 GB | 71.32 | +0.19 |
| gemma-4-31B-it-qat-OptiQ-4bit | 22.4 GB | 78.93 | +1.65 |

The 31B flagship gains +1.65 overall, including a +9-point jump on long-context HashHop retrieval; in the table above, only the e2b's +2.09 is larger. The 26B-A4B sparse MoE is the one case where 4-bit QAT already saturates the routed experts: improving on them takes a higher average bit-width, so its OptiQ quant ships larger and wins by a smaller margin. Load any of these exactly like the non-QAT quants below, using the same import optiq + mlx_lm.load path.
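
To put "ships larger" in numbers, the sizes from the two tables can be compared directly. This is a small arithmetic sketch using only the published sizes; the dictionary and variable names are illustrative.

```python
# model: (non-QAT OptiQ size in GB, QAT OptiQ size in GB), from the two tables.
SIZES_GB = {
    "e2b": (4.0, 4.3),
    "e4b": (6.0, 6.5),
    "12B": (8.3, 8.9),
    "26B-A4B": (14.9, 20.7),
    "31B": (18.1, 22.4),
}

# Relative on-disk growth of the QAT quant over the non-QAT quant, in percent.
overhead_pct = {
    model: round(100 * (qat / base - 1), 1)
    for model, (base, qat) in SIZES_GB.items()
}
largest = max(overhead_pct, key=overhead_pct.get)

for model, pct in overhead_pct.items():
    print(f"{model}: QAT quant is {pct}% larger")
print(f"largest growth: {largest}")
```

The 26B-A4B quant grows the most on disk, consistent with the higher average bit-width described above.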

Hello world

hello.py (python)
from mlx_lm import load, generate

model, tok = load("mlx-community/gemma-4-31B-it-OptiQ-4bit")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What's the difference between TF-IDF and BM25?"}],
    tokenize=False, add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=400))

Recommended sampling

Gemma-4-it variants prefer a slightly higher temperature than Qwen3.x:

sampling.py (python)
from mlx_lm.sample_utils import make_sampler

# Default chat (Google's recommended settings)
chat_sampler = make_sampler(temp=1.0, top_p=0.95, top_k=64)

# Reasoning / math: slightly tighter
reasoning_sampler = make_sampler(temp=0.7, top_p=0.9)

# Pass one to generation, e.g. generate(model, tok, prompt=prompt, sampler=chat_sampler)
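
For intuition about what temp, top_k and top_p do to a next-token distribution, here is a pure-Python sketch over a toy set of logits. It is an illustration of the general technique, not mlx-lm's implementation: the function name filter_probs and the order of operations (temperature, then top-k, then top-p) are assumptions made for this example.

```python
import math


def filter_probs(logits, temp=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: divide logits before the softmax (lower temp = sharper).
    scaled = [value / temp for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
chat = filter_probs(toy_logits, temp=1.0, top_k=64, top_p=0.95)
reasoning = filter_probs(toy_logits, temp=0.7, top_p=0.9)
print("chat:", chat)
print("reasoning:", reasoning)
```

On this toy input the tighter reasoning settings keep fewer candidate tokens and put more mass on the top one, which is the intended effect of lowering the temperature and top_p.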

Mixed-precision KV cache

Now supported on Gemma-4: Gemma-4 uses shared-KV attention plus a sliding-window RotatingKVCache for SWA layers. Upstream mlx-lm raises NotImplementedError: RotatingKVCache Quantization NYI on the rotating path, which blocks mixed-precision KV for those layers. mlx-optiq ships optiq.runtime.kv.RotatingQuantizedKVCache, a drop-in rotating cache with quantized (packed, scales, biases) storage, plus a small SDPA dispatch patch that handles Gemma-4's KV-sharing edge case. The patch is auto-installed by optiq serve and optiq kv-cache whenever quantized KV is requested.
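
To make the idea of a rotating cache with quantized storage concrete, here is a toy pure-Python sketch. It is not optiq.runtime.kv.RotatingQuantizedKVCache and does not use MLX: the class ToyRotatingQuantizedCache, the per-vector affine quantizer, and the list-based storage are all assumptions chosen to keep the example self-contained. Real implementations pack codes into integers and quantize in groups.

```python
def quantize(vec, bits=4):
    """Affine-quantize one vector to (codes, scale, bias)."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant vectors
    codes = [round((v - lo) / scale) for v in vec]
    return codes, scale, lo


def dequantize(codes, scale, bias):
    """Invert quantize() up to rounding error: x is approximately code * scale + bias."""
    return [c * scale + bias for c in codes]


class ToyRotatingQuantizedCache:
    """Fixed-size window that overwrites its oldest entry once full."""

    def __init__(self, max_size, bits=4):
        self.max_size = max_size
        self.bits = bits
        self.slots = []      # each slot holds (codes, scale, bias)
        self.next_slot = 0   # index that the next append writes to

    def append(self, vec):
        entry = quantize(vec, self.bits)
        if len(self.slots) < self.max_size:
            self.slots.append(entry)
        else:
            self.slots[self.next_slot] = entry  # evict the oldest entry
        self.next_slot = (self.next_slot + 1) % self.max_size

    def window(self):
        """Return the dequantized entries ordered oldest to newest."""
        if len(self.slots) < self.max_size:
            ordered = self.slots
        else:
            ordered = self.slots[self.next_slot:] + self.slots[:self.next_slot]
        return [dequantize(*entry) for entry in ordered]


cache = ToyRotatingQuantizedCache(max_size=3, bits=4)
for step in range(5):
    cache.append([float(step), step + 0.5, step + 1.0])
print(cache.window())
```

After five appends into a window of three, only the entries for steps 2, 3 and 4 remain, each reconstructed to within half a quantization step.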

Each Gemma-4 OptiQ-4bit repo on Hugging Face now bundles a recommended kv_config.json produced by a real sensitivity-analysis pass. Pass it in with --kv-config:

gemma-4 long-context, mixed-precision KV (bash)
$ optiq serve \
    --model mlx-community/gemma-4-31B-it-OptiQ-4bit \
    --kv-config kv_config.json
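
The command above expects kv_config.json as a local file. One possible way to get it is to download just that file from the model repo with huggingface_hub and then build the same command line. This is a sketch under assumptions: the helper names fetch_kv_config and serve_command are hypothetical, and it takes the filename kv_config.json from the paragraph above without inspecting the file's schema.

```python
import shlex

REPO = "mlx-community/gemma-4-31B-it-OptiQ-4bit"


def fetch_kv_config(repo_id=REPO, filename="kv_config.json"):
    """Download the bundled KV config and return its local cache path."""
    # Imported lazily so the command builder works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


def serve_command(kv_config_path, repo_id=REPO):
    """Build the optiq serve invocation shown above as an argv list."""
    return ["optiq", "serve", "--model", repo_id, "--kv-config", kv_config_path]


# Usage (needs network access and mlx-optiq installed):
#   path = fetch_kv_config()
#   print(shlex.join(serve_command(path)))
print(shlex.join(serve_command("kv_config.json")))
```

Building the argv as a list keeps paths with spaces intact if the command is later passed to subprocess.run.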

The 26B-A4B MoE

Gemma-4-26B-A4B is a sparse mixture-of-experts: 26B total parameters, 4B active per token. Unlike Qwen's MoE layout, Gemma uses switch_glu with a fused gate-and-up projection. mlx-optiq's MoE walker handles both layouts, and the per-expert sensitivity rolls up into a single switch_glu tensor for allocation purposes.
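
The difference between a fused and a separate gate/up projection can be shown with a toy gated-linear unit for a single expert. This sketch is illustrative only: it uses plain Python lists rather than MLX, and it assumes the fused weight stacks the gate rows first and the up rows second, which may not match the actual checkpoint layout.

```python
import math


def matvec(weight, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def silu(value):
    return value / (1.0 + math.exp(-value))


def glu_separate(w_gate, w_up, x):
    """Two projections, then silu(gate) * up."""
    gate, up = matvec(w_gate, x), matvec(w_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]


def glu_fused(w_gate_up, x):
    """One projection whose output is split into gate and up halves."""
    hidden = matvec(w_gate_up, x)
    half = len(hidden) // 2
    gate, up = hidden[:half], hidden[half:]  # assumed order: gate rows, then up rows
    return [silu(g) * u for g, u in zip(gate, up)]


w_gate = [[1.0, 0.0], [0.0, 1.0]]
w_up = [[2.0, 0.0], [0.0, 2.0]]
x = [1.0, -1.0]

separate = glu_separate(w_gate, w_up, x)
fused = glu_fused(w_gate + w_up, x)  # stack the rows into one matrix
print(separate, fused)
```

Both paths compute the same activation; the fused form simply stores the two projections as one tensor, which is why a quantizer has to know how to walk each layout.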

Long-context

fp16 KV is still the simplest path; mixed-precision KV (above) gives faster decode at long contexts. The stock fp16 invocation needs no extra flags:

serve.sh (bash)
# Stock fp16 KV serving via mlx-optiq (no --kv-config)
$ optiq serve --model mlx-community/gemma-4-31B-it-OptiQ-4bit \
    --max-tokens 8192 --temp 1.0 --top-p 0.95

Fine-tuning recipes

Empirical training-ceiling map with iogpu.wired_limit_mb=0 on a 36 GB Mac:

| Model | Max seq len | Peak mem | Tokens/sec |
| --- | --- | --- | --- |
| gemma-4-e2b-it | 2,400 | 22 GB | ~28 |
| gemma-4-e4b-it | 1,800 | 24 GB | ~22 |
| gemma-4-26B-A4B-it | 512 | 27.6 GB | 22.2 |
| gemma-4-31B-it | 32 | 21.4 GB | 30.9 |

The 31B-dense is unusually memory-tight at long context due to its larger embedding+vocab footprint. The 26B-A4B MoE actually allows a longer sequence at the same RAM budget. Sparse activation pays off here.

finetune.sh (bash)
$ optiq lora train mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit \
    --data ./my_data \
    --max-seq-length 512 \
    --rank 8 --rank-scaling by_bits \
    --num-layers 16 --iters 2000 \
    -o ./gemma_adapter
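
The command above points --data at a directory. The sketch below prepares one, assuming optiq lora train reads the same layout that mlx-lm's LoRA tooling uses: a folder containing train.jsonl and valid.jsonl, with one chat-format record per line. That layout is an assumption and is not confirmed here for mlx-optiq, so check the LoRA fine-tuning guide before relying on it. The helper names are hypothetical.

```python
import json
from pathlib import Path


def write_split(path, examples):
    """Write (user, assistant) pairs as one chat-format JSON record per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for user, assistant in examples:
            record = {
                "messages": [
                    {"role": "user", "content": user},
                    {"role": "assistant", "content": assistant},
                ]
            }
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def build_dataset(root, train, valid):
    """Create root/train.jsonl and root/valid.jsonl (assumed mlx-lm layout)."""
    root = Path(root)
    write_split(root / "train.jsonl", train)
    write_split(root / "valid.jsonl", valid)
    return root


# Usage:
#   build_dataset(
#       "./my_data",
#       train=[("What is BM25?", "A ranking function based on term frequency.")],
#       valid=[("What is TF-IDF?", "A term-weighting scheme.")],
#   )
```

Keep each example short enough to fit the --max-seq-length chosen for the model, since the ceilings in the table above are tight on the larger checkpoints.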

Next: read about how sensitivity works or the LoRA fine-tuning guide.