Model roster benchmark

A head-to-head benchmark of every chat and coding model in the hal0 registry, measured under uniform conditions on a single AMD Strix Halo box: same context length, same profile, exclusive GPU access, one model at a time. The numbers below are throughput as served by hal0’s llama-server provider — i.e. what a slot on this hardware actually delivers, not vendor marketing figures.

Sequential, exclusive-GPU sweep on a single Strix Halo box · 26/26 models measured · 2026-06-19

Hardware

SoC: AMD Ryzen AI Max+ 395 · Strix Halo (Zen 5, 16C/32T)
iGPU: Radeon 8060S — RDNA 3.5, gfx1151 · Vulkan-capable
NPU: AMD XDNA — amdxdna driver
Memory: 128 GB unified LPDDR5X · ~96 GB GPU-addressable (GTT) · UMA
Host: Proxmox LXC · Ubuntu 24.04 · kernel 7.0.6
Model store: /mnt/ai-models · ZFS

Binary & runtime

hal0: v0.5.0a1 · llama-server provider (OpenAI /v1/*)
Container: ghcr.io/hal0ai/amd-strix-halo-toolboxes:rocm-7.2.4-rocmfp4-server
llama.cpp: build b9219-1faa48eef · rocmfp4 fork (draft-mtp speculative)
ROCm: 7.2.4

Registry ID	HuggingFace	Caps	Params	Size GB	KV	Spec	Decode t/s	Prefill t/s	MTP acc%	Notes
chadrock-35b-ace-saber	jcbtc/chadrock-35b-ace-saber-rocmfp4-mtp		35B-A3B	19	f16	draft-mtp	100.5	902.9	83.1	vision
qwen3.6-35b-a3b-crown-halo-mtp-dynamic	jcbtc/qwen3.6-35b-a3b-crown-halo-mtp-dynamic		35B-A3B	22.6	f16	draft-mtp	84.4	872.5	90.9	vision
chadrock3.6-27b-pi-agent-rocmfp4-mtp	jcbtc/chadrock3.6-27b-pi-agent-rocmfp4-mtp		27B dense	14.8	q8	draft-mtp	35.5	301.4	79.8
qwopus3-6-27b-v2-mtp-bf16-to-rocmfp4-strix-lean	local / auto-scan		27B dense	14.8	q4	draft-mtp	31.6	310.3	70.3
gemma-4-12B-agentic-fable5	yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF		12B dense	7.4	q4	none	22.7	688.5
qwen3-coder-next-q4kxl	unsloth/Qwen3-Coder-Next-GGUF		—	49.6	q4	none	37.8	716
qwen3-coder-next-reap-40b-a3b-q4kxl	lovedheart/Qwen3-Coder-Next-REAP-40B-A3B-GGUF		40B MoE	28.5	q4	none	26.8	755.5
Qwopus3.6-27B-Coder-MTP	Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF		27B	22.4	q4	draft-mtp	19.8	224.5	60.5
qwen3.6-35b-a3b-q4kxl	unsloth/Qwen3.6-35B-A3B-GGUF	—	35B MoE	22.4	q4	none	46.1	1299.8
qwen3.6-27b	unsloth/Qwen3.6-27B-GGUF		27B	20	q4	none	10.1	299.6
chadrock3-6-35b-uncensored-mtp-strix-lean	local / auto-scan		35B MoE	19	q4	draft-mtp	102.1	889.6	86
qwen3-coder-reap-25b-a3b-q5km	bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF		25B MoE	17.7	q4	none	54.7	1367.6
gemma-4-26b-a4b-it-q4kxl	unsloth/gemma-4-26B-A4B-it-GGUF	—	26B MoE	17.1	q4	none	40.9	1335.7
qwen3.6-27b-heretic-q4km	DavidAU/Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF		27B	16.9	q4	none	11.6	294.7		vision
chadrock3-6-27b-pi-agent-mtp-rocmfp4-strix-lean	local / auto-scan		27B	14.8	q4	draft-mtp	34.9	307.7	79.8
hermes-4-14b-q5km	bartowski/NousResearch_Hermes-4-14B-GGUF		14B	10.5	q4	none	20.3	613.6
qwen3.5-9b-deepseek-v4-flash-mtp	Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash-MTP-GGUF		9B	7.6	q4	draft-mtp	51.3	613.4	69.7
qwopus3-5-9b-coder-mtp-q6-k	local / auto-scan		9B	7.6	q4	draft-mtp	44.6	617.5	56
gemma-4-12b-it	unsloth/gemma-4-12b-it-GGUF	—	12B	7.4	q4	none	22.7	686.6
gemma4-v2-q4-k-m	local / auto-scan	—	—	7.4	q4	none	22.6	679.8
qwen3.5-9b-q4kxl	unsloth/Qwen3.5-9B-GGUF	—	9B	6	q4	none	33	1044.7
qwopus3-5-4b-coder-mtp-q6-k	local / auto-scan		4B	3.6	q4	draft-mtp	85	889.2	76.7
qwen3.5-4b-q4kxl	unsloth/Qwen3.5-4B-GGUF	—	4B	2.9	q4	none	52.7	1695.2
qwen3-4b-q4-k-m	local / auto-scan	—	4B	2.5	q4	none	61.9	1849.4
qwen3-zero-coder-v2-0.8b-f16	DavidAU/Qwen3-Zero-Coder-Reasoning-V2-0.8B-NEO-EX-GGUF		0.8B	1.6	q4	none	76.3	4827.2
qwen3.5-0.8b	unsloth/Qwen3.5-0.8B-GGUF	—	0.8B	0.6	q4	none	169.8	6248.1

Legend: decode speed bands are ≥60 t/s, 25–60 t/s, and <25 t/s · Capability tags: MTP speculative, Vision, Tool-calling, Coding, Reasoning.
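Outside the interactive page, a column sort can be reproduced from the tab-separated table. The sketch below is a minimal example, assuming the table has been saved as TSV text with the header row shown above. `load_rows` and `sort_by` are hypothetical helper names, and the embedded sample holds three rows copied from the table, reduced to four columns.

```python
import csv
import io

# Three rows copied from the table, reduced to four columns for brevity.
# Rows without an MTP acc% value simply end early, as in the table itself.
SAMPLE_TSV = (
    "Registry ID\tDecode t/s\tPrefill t/s\tMTP acc%\n"
    "chadrock-35b-ace-saber\t100.5\t902.9\t83.1\n"
    "qwen3.6-27b\t10.1\t299.6\n"
    "qwen3.5-0.8b\t169.8\t6248.1\n"
)


def load_rows(tsv_text):
    """Parse tab-separated text into dicts; short rows get None for missing cells."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def sort_by(rows, column, descending=True):
    """Sort rows numerically on one column; rows with no value go last.

    Assumes the chosen column holds plain numbers or is empty, which is
    true of the Decode, Prefill and MTP acc% columns in the table.
    """
    def number(row):
        value = row.get(column)
        return float(value) if value not in (None, "") else None

    present = [r for r in rows if number(r) is not None]
    missing = [r for r in rows if number(r) is None]
    present.sort(key=number, reverse=descending)
    return present + missing


if __name__ == "__main__":
    for row in sort_by(load_rows(SAMPLE_TSV), "Decode t/s"):
        print(row["Registry ID"], row["Decode t/s"])
```

Keeping rows with an empty cell at the end, rather than treating them as zero, avoids ranking non-MTP models as if they had 0% acceptance.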

How to read it

Decode t/s — single-stream token generation throughput (greedy, 256-token completion, median of 3 runs). This is the number you feel during a chat.
Prefill t/s — prompt-ingestion throughput on a ~2k-token prompt.
MTP acc% — for multi-token-prediction models, the share of speculatively-drafted tokens the target model accepted. Higher acceptance = more of the speedup is real.
Caps — capability tags that drive routing and slot features (MTP, vision, tool-calling, coding, reasoning).
Spec / KV / Size — the speculative-decode mode, KV-cache quantization, and on-disk GGUF size for the measured configuration.
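The definitions above reduce to simple arithmetic. The following is a minimal sketch of that arithmetic, not hal0's actual measurement code: the function names are hypothetical, and the timing and token counts in the usage lines are made-up inputs chosen only to illustrate the formulas.

```python
from statistics import median


def tokens_per_second(n_tokens, seconds):
    """Throughput for one run: tokens processed divided by wall-clock time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_tokens / seconds


def median_decode_tps(run_seconds, n_tokens=256):
    """Median throughput over several runs of a fixed-length completion.

    The 256-token default mirrors the completion length stated above;
    run_seconds holds the generation time of each run.
    """
    return median(tokens_per_second(n_tokens, s) for s in run_seconds)


def mtp_acceptance_pct(accepted, drafted):
    """Share of speculatively drafted tokens that the target model accepted."""
    if drafted == 0:
        return 0.0
    return 100.0 * accepted / drafted


if __name__ == "__main__":
    # Illustrative inputs only, not measurements from the table.
    print(median_decode_tps([2.0, 4.0, 8.0]))
    print(mtp_acceptance_pct(accepted=831, drafted=1000))
```

Taking the median of three runs rather than the mean keeps a single slow or fast outlier from shifting the reported figure.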

What the data shows

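MTP A3B models dominate. The sparse 35B-A3B models with self-speculative draft-mtp lead every model larger than 4B: chadrock3-6-35b-uncensored-mtp-strix-lean (102.1 t/s) and chadrock-35b-ace-saber (100.5 t/s) both decode at ~100 t/s. Only the 0.8B qwen3.5-0.8b is faster, at 169.8 t/s.
Acceptance matters more than raw size. A model with weak MTP acceptance (drafted tokens getting rejected) gives back most of the speculative speedup, so watch the acc% column, not just the decode figure. For example, the two 9B MTP models share the same 7.6 GB footprint, yet qwen3.5-9b-deepseek-v4-flash-mtp (69.7% acceptance) decodes at 51.3 t/s against 44.6 t/s for qwopus3-5-9b-coder-mtp-q6-k (56%).
The curated default is the slowest real model. Dense qwen3.6-27b lands at 10.1 t/s, the lowest decode figure in the table; the 35B-A3B MTP models decode roughly 8 to 10 times faster for the same class of work.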
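The gap between the dense default and the A3B-MTP models can be checked directly from the Decode t/s column. The sketch below copies four decode figures from the table and computes each model's speedup over a chosen baseline; `DECODE_TPS` and `speedup_over` are hypothetical names introduced for illustration.

```python
# Decode t/s figures copied from the table above.
DECODE_TPS = {
    "chadrock3-6-35b-uncensored-mtp-strix-lean": 102.1,
    "chadrock-35b-ace-saber": 100.5,
    "qwen3.6-35b-a3b-crown-halo-mtp-dynamic": 84.4,
    "qwen3.6-27b": 10.1,
}


def speedup_over(baseline, decode_tps=DECODE_TPS):
    """Return each model's decode throughput as a multiple of the baseline's."""
    base = decode_tps[baseline]
    return {
        name: tps / base
        for name, tps in decode_tps.items()
        if name != baseline
    }


if __name__ == "__main__":
    ratios = speedup_over("qwen3.6-27b")
    # Print fastest first, one decimal place.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {ratio:.1f}x")
```

These ratios compare single-stream decode only. Prefill throughput and output quality are separate questions that the ratio does not capture.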