Model roster benchmark

A head-to-head benchmark of every chat and coding model in the hal0 registry, measured under uniform conditions on a single AMD Strix Halo box: same context length, same profile, exclusive GPU access, one model at a time. The numbers below are throughput as served by hal0’s llama-server provider — i.e. what a slot on this hardware actually delivers, not vendor marketing figures.

Sequential, exclusive-GPU sweep on a single Strix Halo box · 26/26 models measured · 2026-06-19

Hardware

SoC: AMD Ryzen AI Max+ 395, Strix Halo (Zen 5, 16C/32T)
iGPU: Radeon 8060S, RDNA 3.5, gfx1151 · Vulkan-capable
NPU: AMD XDNA, amdxdna driver
Memory: 128 GB unified LPDDR5X · ~96 GB GPU-addressable (GTT) · UMA
Host: Proxmox LXC · Ubuntu 24.04 · kernel 7.0.6
Model store: /mnt/ai-models · ZFS

Binary & runtime

hal0: v0.5.0a1 · llama-server provider (OpenAI /v1/*)
Container: ghcr.io/hal0ai/amd-strix-halo-toolboxes:rocm-7.2.4-rocmfp4-server
llama.cpp: build b9219-1faa48eef · rocmfp4 fork (draft-mtp speculative)
ROCm: 7.2.4
| Registry ID | HuggingFace | Caps | Params | Size GB | KV | Spec | Decode t/s | Prefill t/s | MTP acc% | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| chadrock-35b-ace-saber | jcbtc/chadrock-35b-ace-saber-rocmfp4-mtp | | 35B-A3B | 19 | f16 | draft-mtp | 100.5 | 902.9 | 83.1 | vision |
| qwen3.6-35b-a3b-crown-halo-mtp-dynamic | jcbtc/qwen3.6-35b-a3b-crown-halo-mtp-dynamic | | 35B-A3B | 22.6 | f16 | draft-mtp | 84.4 | 872.5 | 90.9 | vision |
| chadrock3.6-27b-pi-agent-rocmfp4-mtp | jcbtc/chadrock3.6-27b-pi-agent-rocmfp4-mtp | | 27B dense | 14.8 | q8 | draft-mtp | 35.5 | 301.4 | 79.8 | |
| qwopus3-6-27b-v2-mtp-bf16-to-rocmfp4-strix-lean | local / auto-scan | | 27B dense | 14.8 | q4 | draft-mtp | 31.6 | 310.3 | 70.3 | |
| gemma-4-12B-agentic-fable5 | yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF | | 12B dense | 7.4 | q4 | none | 22.7 | 688.5 | | |
| qwen3-coder-next-q4kxl | unsloth/Qwen3-Coder-Next-GGUF | | | 49.6 | q4 | none | 37.8 | 716 | | |
| qwen3-coder-next-reap-40b-a3b-q4kxl | lovedheart/Qwen3-Coder-Next-REAP-40B-A3B-GGUF | | 40B MoE | 28.5 | q4 | none | 26.8 | 755.5 | | |
| Qwopus3.6-27B-Coder-MTP | Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF | | 27B | 22.4 | q4 | draft-mtp | 19.8 | 224.5 | 60.5 | |
| qwen3.6-35b-a3b-q4kxl | unsloth/Qwen3.6-35B-A3B-GGUF | | 35B MoE | 22.4 | q4 | none | 46.1 | 1299.8 | | |
| qwen3.6-27b | unsloth/Qwen3.6-27B-GGUF | | 27B | 20 | q4 | none | 10.1 | 299.6 | | |
| chadrock3-6-35b-uncensored-mtp-strix-lean | local / auto-scan | | 35B MoE | 19 | q4 | draft-mtp | 102.1 | 889.6 | 86 | |
| qwen3-coder-reap-25b-a3b-q5km | bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF | | 25B MoE | 17.7 | q4 | none | 54.7 | 1367.6 | | |
| gemma-4-26b-a4b-it-q4kxl | unsloth/gemma-4-26B-A4B-it-GGUF | | 26B MoE | 17.1 | q4 | none | 40.9 | 1335.7 | | |
| qwen3.6-27b-heretic-q4km | DavidAU/Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF | | 27B | 16.9 | q4 | none | 11.6 | 294.7 | | vision |
| chadrock3-6-27b-pi-agent-mtp-rocmfp4-strix-lean | local / auto-scan | | 27B | 14.8 | q4 | draft-mtp | 34.9 | 307.7 | 79.8 | |
| hermes-4-14b-q5km | bartowski/NousResearch_Hermes-4-14B-GGUF | | 14B | 10.5 | q4 | none | 20.3 | 613.6 | | |
| qwen3.5-9b-deepseek-v4-flash-mtp | Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash-MTP-GGUF | | 9B | 7.6 | q4 | draft-mtp | 51.3 | 613.4 | 69.7 | |
| qwopus3-5-9b-coder-mtp-q6-k | local / auto-scan | | 9B | 7.6 | q4 | draft-mtp | 44.6 | 617.5 | 56 | |
| gemma-4-12b-it | unsloth/gemma-4-12b-it-GGUF | | 12B | 7.4 | q4 | none | 22.7 | 686.6 | | |
| gemma4-v2-q4-k-m | local / auto-scan | | | 7.4 | q4 | none | 22.6 | 679.8 | | |
| qwen3.5-9b-q4kxl | unsloth/Qwen3.5-9B-GGUF | | 9B | 6 | q4 | none | 33 | 1044.7 | | |
| qwopus3-5-4b-coder-mtp-q6-k | local / auto-scan | | 4B | 3.6 | q4 | draft-mtp | 85 | 889.2 | 76.7 | |
| qwen3.5-4b-q4kxl | unsloth/Qwen3.5-4B-GGUF | | 4B | 2.9 | q4 | none | 52.7 | 1695.2 | | |
| qwen3-4b-q4-k-m | local / auto-scan | | 4B | 2.5 | q4 | none | 61.9 | 1849.4 | | |
| qwen3-zero-coder-v2-0.8b-f16 | DavidAU/Qwen3-Zero-Coder-Reasoning-V2-0.8B-NEO-EX-GGUF | | 0.8B | 1.6 | q4 | none | 76.3 | 4827.2 | | |
| qwen3.5-0.8b | unsloth/Qwen3.5-0.8B-GGUF | | 0.8B | 0.6 | q4 | none | 169.8 | 6248.1 | | |

Legend: decode speed bands ≥60 t/s, 25–60 t/s, <25 t/s · capability tags: MTP speculative, Vision, Tool-calling, Coding, Reasoning
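The three decode speed bands in the legend can be expressed as a small classifier. This is only an illustrative sketch: the function name `decode_band` is hypothetical, and placing a value of exactly 25 t/s in the middle band is an assumption, since the legend does not say which side that boundary falls on.

```python
def decode_band(tokens_per_second: float) -> str:
    """Map a decode throughput (t/s) to one of the legend's three bands."""
    if tokens_per_second >= 60:
        return ">=60 t/s"
    if tokens_per_second >= 25:  # assumption: exactly 25 counts as mid band
        return "25-60 t/s"
    return "<25 t/s"


# Decode figures copied from the benchmark table.
samples = {
    "chadrock-35b-ace-saber": 100.5,
    "qwen3.6-35b-a3b-q4kxl": 46.1,
    "qwen3.6-27b": 10.1,
}

for name, tps in samples.items():
    print(f"{name}: {tps} t/s -> {decode_band(tps)}")
```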

  • Decode t/s — single-stream token generation throughput (greedy, 256-token completion, median of 3 runs). This is the number you feel during a chat.
  • Prefill t/s — prompt-ingestion throughput on a ~2k-token prompt.
  • MTP acc% — for multi-token-prediction models, the share of speculatively-drafted tokens the target model accepted. Higher acceptance = more of the speedup is real.
  • Caps — capability tags that drive routing and slot features (MTP, vision, tool-calling, coding, reasoning).
  • Spec / KV / Size — the speculative-decode mode, KV-cache quantization, and on-disk GGUF size for the measured configuration.
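  • MTP A3B models dominate. The sparse 35B models with self-speculative draft-mtp lead the board: chadrock-35b-ace-saber (100.5 t/s) and chadrock3-6-35b-uncensored-mtp-strix-lean (102.1 t/s) both decode at about 100 t/s, faster than everything except the 0.8B qwen3.5-0.8b (169.8 t/s).
  • Acceptance matters more than raw size. A model with weak MTP acceptance (drafted tokens getting rejected) gives back most of the speculative speedup. Among the 27B MTP models, Qwopus3.6-27B-Coder-MTP accepts 60.5% of drafts and decodes at 19.8 t/s, while chadrock3.6-27b-pi-agent-rocmfp4-mtp accepts 79.8% and reaches 35.5 t/s (the two also differ in size and KV quantization). Watch the acc% column, not just the decode figure.
  • The curated default is the slowest model on the board. Dense qwen3.6-27b lands at ~10 t/s (10.1), while the 35B-A3B MTP models, at 84 to 102 t/s, are roughly 8 to 10 times faster for the same class of work.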
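The column definitions above reduce to a few lines of arithmetic. The sketch below is not hal0's benchmark harness. The helper names are hypothetical, and the run timings and draft counters are made-up values chosen only to show the calculation. It takes the median decode throughput over three runs, computes MTP acceptance as accepted over drafted tokens, and compares two decode figures from the table.

```python
from statistics import median


def decode_tps(completion_tokens: int, seconds: float) -> float:
    """Tokens generated per second for a single run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return completion_tokens / seconds


def median_decode_tps(runs: list[tuple[int, float]]) -> float:
    """Median throughput over a list of (completion_tokens, seconds) runs."""
    return median(decode_tps(tokens, secs) for tokens, secs in runs)


def acceptance_pct(accepted: int, drafted: int) -> float:
    """Share of speculatively drafted tokens the target model accepted."""
    if drafted <= 0:
        raise ValueError("drafted must be positive")
    return 100.0 * accepted / drafted


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio between two decode throughputs."""
    return fast_tps / slow_tps


# Hypothetical timings for three 256-token completions (not measured data).
runs = [(256, 2.50), (256, 2.56), (256, 2.60)]
print("median decode t/s:", round(median_decode_tps(runs), 1))

# Hypothetical draft counters (not measured data).
print("MTP acc%:", round(acceptance_pct(accepted=831, drafted=1000), 1))

# Decode figures from the table: chadrock-35b-ace-saber vs. dense qwen3.6-27b.
print("speedup:", round(speedup(100.5, 10.1), 1))
```

Using the median rather than the mean keeps a single slow run, for example one that hits a cold cache, from skewing the reported figure.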