Model roster benchmark
A head-to-head benchmark of every chat and coding model in the hal0 registry,
measured under uniform conditions on a single AMD Strix Halo
box: same context length, same profile, exclusive GPU access, one model at a
time. The numbers below are throughput as served by hal0’s llama-server
provider — i.e. what a slot on this hardware actually delivers, not vendor
marketing figures.
Hardware
- SoC
- AMD Ryzen AI Max+ 395 — Strix Halo (Zen 5, 16C/28T)
- iGPU
- Radeon 8060S — RDNA 3.5, gfx1151 · Vulkan-capable
- NPU
- AMD XDNA — amdxdna driver
- Memory
- 128 GB unified LPDDR5X · ~96 GB GPU-addressable (GTT) · UMA
- Host
- Proxmox LXC · Ubuntu 24.04 · kernel 7.0.6
- Model store
- /mnt/ai-models · ZFS
Binary & runtime
- hal0
- v0.5.0a1 · llama-server provider (OpenAI /v1/*)
- Container
- ghcr.io/hal0ai/amd-strix-halo-toolboxes:rocm-7.2.4-rocmfp4-server
- llama.cpp
- build b9219-1faa48eef · rocmfp4 fork (draft-mtp speculative)
- ROCm
- 7.2.4
| Registry ID | HuggingFace | Caps | Params | Size GB | KV | Spec | Decode t/s | Prefill t/s | MTP acc% | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| chadrock-35b-ace-saber | jcbtc/chadrock-35b-ace-saber-rocmfp4-mtp | 35B-A3B | 19 | f16 | draft-mtp | 100.5 | 902.9 | 83.1 | vision; | |
| qwen3.6-35b-a3b-crown-halo-mtp-dynamic | jcbtc/qwen3.6-35b-a3b-crown-halo-mtp-dynamic | 35B-A3B | 22.6 | f16 | draft-mtp | 84.4 | 872.5 | 90.9 | vision; | |
| chadrock3.6-27b-pi-agent-rocmfp4-mtp | jcbtc/chadrock3.6-27b-pi-agent-rocmfp4-mtp | 27B dense | 14.8 | q8 | draft-mtp | 35.5 | 301.4 | 79.8 | ||
| qwopus3-6-27b-v2-mtp-bf16-to-rocmfp4-strix-lean | local / auto-scan | 27B dense | 14.8 | q4 | draft-mtp | 31.6 | 310.3 | 70.3 | ||
| gemma-4-12B-agentic-fable5 | yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF | 12B dense | 7.4 | q4 | none | 22.7 | 688.5 | |||
| qwen3-coder-next-q4kxl | unsloth/Qwen3-Coder-Next-GGUF | — | 49.6 | q4 | none | 37.8 | 716 | |||
| qwen3-coder-next-reap-40b-a3b-q4kxl | lovedheart/Qwen3-Coder-Next-REAP-40B-A3B-GGUF | 40B MoE | 28.5 | q4 | none | 26.8 | 755.5 | |||
| Qwopus3.6-27B-Coder-MTP | Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF | 27B | 22.4 | q4 | draft-mtp | 19.8 | 224.5 | 60.5 | ||
| qwen3.6-35b-a3b-q4kxl | unsloth/Qwen3.6-35B-A3B-GGUF | — | 35B MoE | 22.4 | q4 | none | 46.1 | 1299.8 | ||
| qwen3.6-27b | unsloth/Qwen3.6-27B-GGUF | 27B | 20 | q4 | none | 10.1 | 299.6 | |||
| chadrock3-6-35b-uncensored-mtp-strix-lean | local / auto-scan | 35B MoE | 19 | q4 | draft-mtp | 102.1 | 889.6 | 86 | ||
| qwen3-coder-reap-25b-a3b-q5km | bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF | 25B MoE | 17.7 | q4 | none | 54.7 | 1367.6 | |||
| gemma-4-26b-a4b-it-q4kxl | unsloth/gemma-4-26B-A4B-it-GGUF | — | 26B MoE | 17.1 | q4 | none | 40.9 | 1335.7 | ||
| qwen3.6-27b-heretic-q4km | DavidAU/Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF | 27B | 16.9 | q4 | none | 11.6 | 294.7 | vision; | ||
| chadrock3-6-27b-pi-agent-mtp-rocmfp4-strix-lean | local / auto-scan | 27B | 14.8 | q4 | draft-mtp | 34.9 | 307.7 | 79.8 | ||
| hermes-4-14b-q5km | bartowski/NousResearch_Hermes-4-14B-GGUF | 14B | 10.5 | q4 | none | 20.3 | 613.6 | |||
| qwen3.5-9b-deepseek-v4-flash-mtp | Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash-MTP-GGUF | 9B | 7.6 | q4 | draft-mtp | 51.3 | 613.4 | 69.7 | ||
| qwopus3-5-9b-coder-mtp-q6-k | local / auto-scan | 9B | 7.6 | q4 | draft-mtp | 44.6 | 617.5 | 56 | ||
| gemma-4-12b-it | unsloth/gemma-4-12b-it-GGUF | — | 12B | 7.4 | q4 | none | 22.7 | 686.6 | ||
| gemma4-v2-q4-k-m | local / auto-scan | — | — | 7.4 | q4 | none | 22.6 | 679.8 | ||
| qwen3.5-9b-q4kxl | unsloth/Qwen3.5-9B-GGUF | — | 9B | 6 | q4 | none | 33 | 1044.7 | ||
| qwopus3-5-4b-coder-mtp-q6-k | local / auto-scan | 4B | 3.6 | q4 | draft-mtp | 85 | 889.2 | 76.7 | ||
| qwen3.5-4b-q4kxl | unsloth/Qwen3.5-4B-GGUF | — | 4B | 2.9 | q4 | none | 52.7 | 1695.2 | ||
| qwen3-4b-q4-k-m | local / auto-scan | — | 4B | 2.5 | q4 | none | 61.9 | 1849.4 | ||
| qwen3-zero-coder-v2-0.8b-f16 | DavidAU/Qwen3-Zero-Coder-Reasoning-V2-0.8B-NEO-EX-GGUF | 0.8B | 1.6 | q4 | none | 76.3 | 4827.2 | |||
| qwen3.5-0.8b | unsloth/Qwen3.5-0.8B-GGUF | — | 0.8B | 0.6 | q4 | none | 169.8 | 6248.1 |
≥60 t/s 25–60 <25 · MTP speculativeVisionTool-callingCodingReasoning · Click any header to sort.
How to read it
Section titled “How to read it”- Decode t/s — single-stream token generation throughput (greedy, 256-token completion, median of 3 runs). This is the number you feel during a chat.
- Prefill t/s — prompt-ingestion throughput on a ~2k-token prompt.
- MTP acc% — for multi-token-prediction models, the share of speculatively-drafted tokens the target model accepted. Higher acceptance = more of the speedup is real.
- Caps — capability tags that drive routing and slot features (MTP, vision, tool-calling, coding, reasoning).
- Spec / KV / Size — the speculative-decode mode, KV-cache quantization, and on-disk GGUF size for the measured configuration.
What the data shows
Section titled “What the data shows”- MTP A3B models dominate. The sparse 35B-A3B models with self-speculative
draft-mtp lead the board —
chadrock-35b-ace-sabertops it at ~100 t/s decode. - Acceptance matters more than raw size. A model with weak MTP acceptance (drafted tokens getting rejected) gives back most of the speculative speedup — watch the acc% column, not just the decode figure.
- The curated default is the slowest real model. Dense
qwen3.6-27blands at ~10 t/s — an A3B-MTP model is several times faster for the same class of work.