NVIDIA discrete GPU

hal0 supports NVIDIA discrete GPUs through a llama.cpp toolbox. NVIDIA is on the v1 supported list and the path works today, with one caveat: the dedicated CUDA toolbox image hasn’t been published to ghcr.io/hal0ai/ yet, so NVIDIA users in v1 fall back to the Vulkan path.

NVIDIA discrete is a supported target, not the reference platform. For dedicated VRAM with mature drivers, NVIDIA is the obvious choice; for big-model headroom in one box, Strix Halo remains the headline target. The v1 card list:

  • RTX 5090 (32 GB GDDR7)
  • RTX 4090 / 3090 (24 GB GDDR6X)
  • RTX 4080 / 4080 Super (16 GB GDDR6X)
  • RTX 3080 / 3080 Ti (10–12 GB GDDR6X)
  • Older 20-series and below: technically supported by llama.cpp; not a v1 focus.

Cards below 10 GB run very small models only (Q4 4B and under).

Out of the box in v1, NVIDIA users typically run on the Vulkan toolbox — the same image Strix Halo and AMD discrete use. It works, and it’s the path the installer picks when it detects an NVIDIA GPU without a published CUDA toolbox.

The CUDA toolbox is on the build list. When it lands, expect a non-trivial throughput improvement on chat workloads and substantially better large-context handling. We’ll publish before-and-after numbers on this page once the CUDA toolbox ships.

The recommended loadouts below mirror the discrete-card section of the Strix Halo loadouts, grouped by VRAM tier.

32 GB (RTX 5090)

  • primary: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (~18.6 GB) or any Q4 ~30B chat — comfortable with a 16–32k context (see the KV-cache sketch after this list).
  • embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB) co-resident.
  • Q4 70B (Hermes-4-70B / Llama-3.3-70B) is feasible but tight with partial CPU offload; expect lower tok/s than VRAM-resident inference.
  • Trade vs Strix Halo: no headroom for a hot STT/TTS slot alongside a 30B primary.

24 GB (RTX 4090 / 3090)

  • primary: Qwen3-30B-A3B-Instruct-2507-Q4_K_M (~18.6 GB) fits with shorter context, or gemma-3-12b-it-Q4_K_M (~6.6 GB) for a longer window.
  • embed: small Q4 embed only (nomic-embed-text-v2-moe ~140 MB).
  • Q4 70B requires partial CPU offload — works, but drops well below VRAM-resident speeds.
  • Trade vs 5090: tighter context budgets at the same model size.

16 GB (RTX 4080 / 4080 Super)

  • primary: gemma-3-12b-it-Q4_K_M (~6.6 GB) or Hermes-4-14B-Q4_K_M (~9 GB).
  • embed: nomic-embed-text-v2-moe-Q4_K_M (~140 MB) leaves several GB for a ~16k context.
  • Q4 32B class (Qwen3-30B-A3B) is offload-only here — workable occasionally, not as a daily driver.
  • Trade vs 24 GB cards: keep the primary at ~13B class for a smooth experience.

10–12 GB (RTX 3080 / 3080 Ti)

  • primary: a 4–14B Q4 — Hermes-4-14B-Q4_K_M, gemma-3-12b-it-Q4_K_M, or Qwen3-4B-Instruct-2507-Q4_K_M (~2.5 GB) for low-latency work.
  • embed: skip on 10–12 GB cards.
  • One slot at a time is the norm.
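
How these context budgets pencil out: the KV cache grows linearly with context length. A back-of-the-envelope sketch, assuming an FP16 KV cache and illustrative dimensions for a ~30B-class model (layer count, KV-head count, and head size vary per model, so check the model card for real figures):

# FP16 KV cache bytes ≈ 2 (K+V) × n_layers × n_kv_heads × head_dim × 2 bytes × n_ctx
# The dimensions below are illustrative assumptions, not real Qwen3 numbers.
echo "$(( 2 * 48 * 8 * 128 * 2 * 32768 / 1024**3 )) GiB"   # prints 6, i.e. ~6 GiB at a 32k context

Halve the context and the cache roughly halves, which is why the same ~18.6 GB primary that runs a 16–32k window on 32 GB drops to shorter windows on 24 GB.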

The standard one-liner from the install page handles the Vulkan path on NVIDIA:

curl -fsSL https://hal0.dev/install | bash
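
If you prefer to read the script before it runs, download it first (the local filename here is arbitrary):

curl -fsSL https://hal0.dev/install -o hal0-install.sh
less hal0-install.sh
bash hal0-install.sh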

You’ll want the following in place (quick checks follow the list):

  • Recent NVIDIA proprietary drivers installed via your distro (575+ series recommended for newest cards).
  • Vulkan runtime present (vulkaninfo --summary returns devices).
  • The service user with /dev/nvidia* access — usually handled by the driver install.
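
A quick sanity check for all three, using standard driver and Vulkan tools rather than anything hal0-specific:

nvidia-smi                   # driver sees the card
vulkaninfo --summary         # Vulkan runtime enumerates it
ls -l /dev/nvidia*           # device nodes exist and are accessible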

The hardware probe detects the GPU and writes VRAM size to /etc/hal0/hardware.json so slot fit warnings are accurate.
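
To see what the probe recorded, read the file directly. The shape shown in the comment is an illustrative guess, not a schema guarantee; inspect the real file for the actual keys:

cat /etc/hal0/hardware.json
# hypothetical shape: { "gpu": "NVIDIA GeForce RTX 4090", "vram_mb": 24564, ... }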

Troubleshooting

Probe doesn’t list the GPU. Run nvidia-smi to confirm the driver sees the card. If that’s empty, fix the host driver install first.

Slot fails to start with CUDA / library errors. v1 doesn’t bundle a CUDA toolbox — make sure the slot is on the Vulkan provider:

hal0 slot list --json | grep provider
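
If you have jq installed, and assuming the output is an array of slot objects with a provider field (an assumption about the JSON shape, so adjust the filter to what you actually see), this gives a cleaner view:

hal0 slot list --json | jq -r '.[].provider'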

If a slot is set to a CUDA provider, swap it back to llama-cpp until the CUDA toolbox publishes:

hal0 slot swap primary --provider llama-cpp

Lower-than-expected throughput. Confirm the card isn’t power-limited or PCIe-mode-limited. NVIDIA’s Vulkan implementation is solid, but it won’t match a native CUDA llama.cpp build until the dedicated toolbox ships.
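
Power limit and live PCIe link state are both visible through standard nvidia-smi query fields; note that the link can train down at idle, so check it under load:

nvidia-smi --query-gpu=name,power.limit,pcie.link.gen.current,pcie.link.width.current --format=csv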