# Recommended loadouts
Curated starting points, not gospel. Mix, match, tweak — the slot system lets you load a different model into any slot whenever you change your mind. Sizes are published GGUF Q4_K_M / Q8_0 file sizes (verified on Hugging Face, May 2026); no tok/s numbers here. Picks reflect the latest open-weight releases as of 2026-05-15 — each entry keeps a previous fallback in parens for users on existing setups.
## Headline target: the 128 GB Strix Halo SKU
Ryzen AI Max+ 395 with 128 GB LPDDR5X-8000 unified memory. The iGPU carve-out is BIOS-tunable up to ~96 GB, and some configs report ~110 GB usable for the GPU when paged through GTT. A Q4 70B fits with massive headroom; a Q4 100B+ MoE on a 17B–22B active path becomes feasible; mid-class loadouts leave 100+ GB free for context, KV cache, embed + audio slots, and multi-tab usage.
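To sanity-check a loadout before downloading, the arithmetic is just model file size plus KV cache. A rough sketch in Python, assuming the Llama-3-family 70B shape (80 layers, 8 KV heads via GQA, head dim 128) and fp16 KV entries; the 110 GB budget is the optimistic GTT figure above:

```python
def kv_cache_gib(n_ctx, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per=2):
    """Rough fp16 KV-cache size: K and V, per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per * n_ctx / 2**30

model_gib = 42.5                      # Hermes-4-70B Q4_K_M file size
budget_gib = 110                      # optimistic GTT-paged GPU pool on 128 GB
for ctx in (8_192, 32_768, 131_072):
    total = model_gib + kv_cache_gib(ctx)
    print(f"{ctx:>7} ctx: {total:5.1f} GiB of {budget_gib} GiB")
# ->   8192 ctx:  45.0 GiB; 32768 ctx: 52.5 GiB; 131072 ctx: 82.5 GiB
```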
64 GB Strix Halo SKUs (Ryzen AI Max 385 / 390) are still well-served by every small + mid tier below, plus a tight Q4 70B / Llama-4 Scout with shorter context windows.
## Strix Halo loadouts
### Coding — small / mid / large
| Tier | Size | primary | embed | Notes |
|---|---|---|---|---|
| Small | ~5 GB | Qwen2.5-Coder-7B-Instruct-Q4_K_M | — | No Qwen3-Coder small variant has shipped yet; 7B Qwen2.5 stays the best small dedicated coder. |
| Mid | ~19 GB | Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (~18.6 GB, MoE 3B active — runs near 3B speeds, reasons like 30B). | nomic-embed-text-v2-moe-Q4_K_M (~140 MB) for repo-aware search. | fallback: Qwen2.5-Coder-32B-Instruct-Q4_K_M (~20 GB). |
| Large | ~42 GB | Hermes-4-70B-Q4_K_M (~42.5 GB) for hybrid reasoning + tool-friendly coding. | — | Alt: Llama-4-Scout-17B-16E-Instruct-Q4_K_M (~50 GB, MoE 17B active, 10M context). |
No dedicated 70B+ coder exists in GGUF, so the convention is to fall back on a top-tier general / reasoning model for hard problems. 128 GB of headroom keeps both the 30B-A3B coder and a 70B reasoning model hot in separate slots. Large fallback: Llama-3.3-70B-Instruct-Q4_K_M.
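Keeping both hot means you can route per request. A minimal sketch, assuming hal0 exposes the usual OpenAI-compatible /v1/chat/completions and selects the slot by the model field; the port and the slot-addressing scheme are assumptions, not confirmed hal0 API:

```python
import requests

BASE = "http://localhost:8080/v1"   # assumed hal0 endpoint; adjust to your install

def ask(model: str, prompt: str) -> str:
    # Standard OpenAI-style chat completion; `model` picks the hot slot.
    r = requests.post(f"{BASE}/chat/completions", json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# Fast MoE coder for the inner loop, 70B reasoning model for hard problems.
draft = ask("Qwen3-Coder-30B-A3B-Instruct-Q4_K_M", "Write a rate limiter in Go.")
review = ask("Hermes-4-70B-Q4_K_M", f"Review this implementation:\n{draft}")
```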
### General chat — small / mid / large
| Tier | Size | primary | Notes |
|---|---|---|---|
| Small | ~2.5 GB | Qwen3-4B-Instruct-2507-Q4_K_M (Aug 2025 release, 1M-token context). | Snappy on any modern box. fallback: Llama-3.2-3B-Instruct-Q4_K_M. |
| Mid | ~19 GB | Qwen3-30B-A3B-Instruct-2507-Q4_K_M (~18.6 GB, MoE 3B active). | Smaller-RAM alt: gemma-3-12b-it-Q4_K_M (~6.6 GB). fallback: Meta-Llama-3.1-8B-Instruct-Q4_K_M. |
| Large | ~50 GB | Llama-4-Scout-17B-16E-Instruct-Q4_K_M (~50 GB, MoE 17B active, 10M context). | On 128 GB you also get the embed slot hot at the same time and headroom for STT/TTS. |
Large fallback: Llama-3.3-70B-Instruct-Q4_K_M (~42 GB) or Qwen2.5-72B-Instruct-Q4_K_M (~47 GB). 64 GB SKUs don’t comfortably run a large primary + embed + audio simultaneously.
### Voice mode (~3 GB total)
| Slot | Pick | Notes |
|---|---|---|
| primary | Qwen3-4B-Instruct-2507-Q4_K_M (~2.5 GB) — low-latency reply. | fallback: Llama-3.2-3B-Instruct-Q4_K_M. |
| stt | Moonshine base (~190 MB) via the moonshine toolbox — built for edge real-time. | Higher-accuracy alt: whisper-large-v3-turbo (~1.6 GB). 2025 SOTA: Canary-Qwen-2.5B (Open ASR leaderboard, 5.63% WER). |
| tts | Kokoro-82M v1.0 (~330 MB, 8 languages / 54 voices, Jan 2025) via the kokoro toolbox. | Voice-cloning alt: F5-TTS. |
128 GB leaves the entire rest of the budget free for a large embed or a second chat model warm in another slot.
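Wired together, the loop is record, transcribe, reply, speak. A sketch assuming the stt and tts slots surface the standard OpenAI-style audio routes (/v1/audio/transcriptions, /v1/audio/speech); that routing, the port, and the voice id are assumptions, not documented hal0 behavior:

```python
import requests

BASE = "http://localhost:8080/v1"   # assumed hal0 endpoint

def voice_turn(wav_path: str) -> bytes:
    # stt slot: speech to text (OpenAI-style multipart upload).
    with open(wav_path, "rb") as f:
        text = requests.post(f"{BASE}/audio/transcriptions",
                             files={"file": f}).json()["text"]
    # primary slot: low-latency text reply.
    reply = requests.post(f"{BASE}/chat/completions", json={
        "model": "Qwen3-4B-Instruct-2507-Q4_K_M",
        "messages": [{"role": "user", "content": text}],
    }).json()["choices"][0]["message"]["content"]
    # tts slot: text back to audio ("af_heart" is a Kokoro voice id).
    return requests.post(f"{BASE}/audio/speech",
                         json={"input": reply, "voice": "af_heart"}).content
```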
### Creative / fun writing (~42 GB)
- primary: `Hermes-4-70B-Q4_K_M` (~42.5 GB, Aug 2025 — hybrid-mode reasoning + creative strength).
- Lighter alt: `Hermes-4-14B-Q4_K_M` (~9 GB, Qwen-3-14B base).
- fallback: `Mistral-Small-24B-Instruct-2501-Q4_K_M` (~14 GB).
### Privacy-first / minimal footprint (<1 GB)
- primary: `gemma-3-1b-it-Q4_K_M` (~0.7 GB) — text-only, March 2025.
- fallback: `Phi-3-mini-4k-instruct-q4.gguf` (~2.4 GB, the curated default); or `Qwen2.5-0.5B-Instruct-Q4_K_M` (~400 MB, the CI smoke model).
- embed: `nomic-embed-text-v2-moe-Q4_K_M` (~140 MB, multilingual MoE — 137M params).
Runs comfortably on CPU-only fallback boxes; smallest viable hal0 install.
### RAG / knowledge-base (~19 GB)
- primary: `Qwen3-30B-A3B-Instruct-2507-Q4_K_M` (~18.6 GB) for synthesis. fallback: `Qwen2.5-14B-Instruct-Q4_K_M` (~9 GB).
- embed: `bge-m3` (~600 MB Q8 — multilingual, multi-vector, 8192-token context, top retrieval R@1 in 2026 benchmarks). Lower-footprint alt: `nomic-embed-text-v2-moe` (~140 MB). fallback: `bge-large-en-v1.5-Q8_0` (~670 MB).
The embed slot also serves rerank via /v1/rerankings. On 128 GB, the extra headroom buys huge room for KV cache → long-context retrieval (64k+) without paging.
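A sketch of the rerank call: the route name comes from above, but the request/response shape here follows the common rerank-API convention (query plus documents in, scored indices back) and is an assumption about hal0's exact schema:

```python
import requests

BASE = "http://localhost:8080/v1"   # assumed hal0 endpoint

docs = ["Strix Halo has 128 GB unified memory.",
        "Kokoro is an 82M-parameter TTS model.",
        "bge-m3 supports 8192-token inputs."]

# Conventional rerank request: score each document against the query.
r = requests.post(f"{BASE}/rerankings", json={
    "model": "bge-m3",
    "query": "How much memory does Strix Halo have?",
    "documents": docs,
})
for item in sorted(r.json()["results"], key=lambda x: -x["relevance_score"]):
    print(f'{item["relevance_score"]:.3f}  {docs[item["index"]]}')
```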
### Agentic tool-use (~42 GB)
- primary: `Hermes-4-70B-Q4_K_M` (~42.5 GB) — Nous’s hybrid-reasoning model is explicitly tuned for tool-call faithfulness and format adherence.
- Lighter alt: `Hermes-4-14B-Q4_K_M` (~9 GB).
- fallback: `Qwen2.5-32B-Instruct-Q4_K_M` (~20 GB).
- embed: `bge-m3` (~600 MB) or `nomic-embed-text-v2-moe` (~140 MB) for retrieval-augmented routing.
Lines up with the v0.2 agents / MCP roadmap.
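What format adherence buys in practice is reliably structured tool calls. A minimal OpenAI-style tools declaration; the get_weather function is a made-up example, and the sketch assumes hal0 passes the tools array through to the chat template:

```python
import json
import requests

BASE = "http://localhost:8080/v1"   # assumed hal0 endpoint

tools = [{                           # hypothetical example tool
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

r = requests.post(f"{BASE}/chat/completions", json={
    "model": "Hermes-4-70B-Q4_K_M",
    "messages": [{"role": "user", "content": "Weather in Oslo?"}],
    "tools": tools,
})
call = r.json()["choices"][0]["message"]["tool_calls"][0]["function"]
print(call["name"], json.loads(call["arguments"]))  # -> get_weather {'city': 'Oslo'}
```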
### Maxed-out single model (~50–75 GB)
The biggest realistic single-model loadout that still fits a 128 GB Strix Halo with room to breathe. Pick one:

- primary: `Llama-4-Scout-17B-16E-Instruct-Q4_K_M` (~50 GB) — 10M context, MoE 17B active. The current best balance of size and capability.
- primary: `Hermes-4-70B-Q8_0` (~75 GB) — 70B at Q8 instead of Q4, trading size for quant headroom.
- primary: `Mistral-Large-Instruct-2411-Q4_K_M` (123B, ~73 GB) — older but still excellent for raw single-model quality.
## Discrete GPU & CPU loadouts
For NVIDIA the path is CUDA-backed llama.cpp; for AMD discrete it’s the ROCm toolbox image. Both go through the same slot lifecycle as Strix Halo — what changes is dedicated VRAM vs the unified pool.
### NVIDIA RTX 5090 (32 GB VRAM)
- primary: `Qwen3-Coder-30B-A3B-Instruct-Q4_K_M` (~18.6 GB) or any Q4 ~30B chat — comfortable with a 16–32k context.
- embed: `nomic-embed-text-v2-moe-Q4_K_M` (~140 MB) co-resident.
- Q4 70B (`Hermes-4-70B` / `Llama-3.3-70B`) is feasible but tight with partial CPU offload; expect lower tok/s than VRAM-resident inference.
- Trade vs Strix Halo: no headroom for a hot STT/TTS slot alongside a 30B primary.
### NVIDIA RTX 4090 / 3090 (24 GB VRAM)
- primary: `Qwen3-30B-A3B-Instruct-2507-Q4_K_M` (~18.6 GB) fits with shorter context, or `gemma-3-12b-it-Q4_K_M` (~6.6 GB) for a longer window.
- embed: small Q4 embed only (`nomic-embed-text-v2-moe` ~140 MB).
- Q4 70B requires partial CPU offload — works, but drops well below VRAM-resident speeds (see the sketch after this list).
- Trade vs 5090: tighter context budgets at the same model size.
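Partial offload in llama.cpp terms means keeping only some transformer layers on the GPU. A sketch via the llama-cpp-python bindings; the layer count is illustrative and needs tuning per card:

```python
from llama_cpp import Llama

# Put roughly half of a 70B's 80 layers in 24 GB of VRAM; the rest run on CPU.
llm = Llama(
    model_path="Hermes-4-70B-Q4_K_M.gguf",
    n_gpu_layers=40,      # tune: more layers = faster, until VRAM runs out
    n_ctx=8192,           # shorter context keeps the KV cache affordable
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the tradeoff of partial offload."}]
)
print(out["choices"][0]["message"]["content"])
```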
### NVIDIA RTX 4080 / 4080 Super (16 GB VRAM)
- primary: `gemma-3-12b-it-Q4_K_M` (~6.6 GB) or `Hermes-4-14B-Q4_K_M` (~9 GB).
- embed: `nomic-embed-text-v2-moe-Q4_K_M` (~140 MB) leaves several GB for a ~16k context.
- Q4 32B class (Qwen3-30B-A3B) is offload-only here — workable occasionally, not as a daily driver.
- Trade vs 24 GB cards: keep the primary at ~13B class for a smooth experience.
### NVIDIA RTX 3080 / AMD RX 7900 XT / XTX (10–24 GB VRAM)
- primary: a 4–14B Q4 — `Hermes-4-14B-Q4_K_M`, `gemma-3-12b-it-Q4_K_M`, or `Qwen3-4B-Instruct-2507-Q4_K_M` (~2.5 GB) for low latency.
- embed: small Q4 embed if the card has 16 GB+; skip on 10–12 GB cards.
- AMD route is `hal0-toolbox-rocm`; NVIDIA stays on the CUDA llama.cpp build.
- Trade: one slot at a time is the norm — no simultaneous primary + embed + audio.
### CPU-only (32–64 GB system RAM, no GPU)
- primary: `gemma-3-1b-it-Q4_K_M` (~0.7 GB) or `Qwen3-4B-Instruct-2507-Q4_K_M` (~2.5 GB) for a snappier feel. fallback: `Phi-3-mini-4k-instruct-q4.gguf` (~2.4 GB, the curated default).
- embed: `nomic-embed-text-v2-moe-Q4_K_M` (~140 MB) — runs fine on CPU.
- No `stt` / `tts` slots — Moonshine and Kokoro are technically CPU-capable, but streaming audio at usable latency wants at least an iGPU.
- Expect: a few tok/s on chat — fine for occasional Q&A and dev smoke tests, not the streaming experience.
Strix Halo’s unified pool is what unlocks the 70B Q4 and the large / agentic tiers; on discrete cards you trade that ceiling for raw tok/s on smaller models. hal0 picks the right provider automatically based on the hardware probe (hal0/hardware/probe.py → /etc/hal0/hardware.json → slot defaults).
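The probe chain is file-based, so the selection logic stays inspectable. A sketch of what the last hop could look like; the JSON keys (unified, vendor) are hypothetical illustrations, only the file path comes from the pipeline above:

```python
import json
from pathlib import Path

HW = Path("/etc/hal0/hardware.json")  # written by hal0/hardware/probe.py

def pick_provider() -> str:
    # Keys below are hypothetical examples of probe output, not hal0's schema.
    hw = json.loads(HW.read_text())
    if hw.get("unified"):                 # Strix Halo-style unified memory
        return "llama.cpp (unified pool)"
    if hw.get("vendor") == "nvidia":
        return "llama.cpp (CUDA)"
    if hw.get("vendor") == "amd":
        return "hal0-toolbox-rocm"
    return "llama.cpp (CPU)"

print(pick_provider())
```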
Loadouts are starting points. Every real install ends up tweaked.