Slots
A slot is hal0’s unit of served inference: one named place that runs one model. Behind every slot is a single podman container, and around every slot is a lifecycle the platform manages for you. Slots are what the dispatcher routes to, what the GPU arbiter coordinates, and what capabilities and profiles configure — so they are the central concept in hal0.
One slot, one container
Each slot is backed by a systemd template unit named
hal0-slot@<name>.service whose ExecStart runs podman run. That launches
exactly one container — hal0-slot-<name> — serving one model on a loopback
port. The container runtime defaults to podman; the unit uses --replace so a
restart never trips over a stale container name, and publishes its port on
127.0.0.1 only.
The SlotManager owns this. It renders the unit, starts and stops it,
tracks state, and registers the running container as an internal upstream so the
dispatcher can forward to it. A slot is configured by a small TOML file that
supplies the model, context size, and port; the container image and tuned flags
come from the slot’s profile.
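As a rough illustration of what such a unit's ExecStart might contain, the sketch below assembles a `podman run` command line for one slot. Only the container name pattern, `--replace`, and the loopback-only publish come from the description above; the function name, the image reference, the server arguments, and the choice to reuse the host port inside the container are assumptions made for the example.

```python
import shlex


def podman_run_argv(name: str, port: int, image: str, server_args: list) -> list:
    """Assemble a `podman run` command line for one slot (illustrative only)."""
    return [
        "podman", "run",
        "--replace",                              # replace a stale container with the same name
        "--name", f"hal0-slot-{name}",            # exactly one container per slot
        "--publish", f"127.0.0.1:{port}:{port}",  # loopback only; same container port is an assumption
        image,                                    # hypothetical image; the real one comes from the profile
        *server_args,                             # hypothetical tuned flags from the profile
    ]


argv = podman_run_argv(
    "chat", 8101, "example.invalid/llm-server:latest", ["--ctx-size", "8192"]
)
print(shlex.join(argv))
```

The real unit is rendered by the SlotManager, so treat this only as a way to read the three guarantees (one container, replaceable name, loopback port) off a concrete command line.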
The lifecycle state machine
A slot moves through an explicit set of states. The SlotState values are also
the wire representation streamed to the dashboard, so what you see is a real
transition, not a systemd snapshot:
Figure: the slot lifecycle transitions.
| State | Meaning |
|---|---|
| offline | Not running; no active unit. |
| pulling | Model files are downloading or being verified. |
| starting | The unit has started; waiting for the container to come up. |
| warming | Container is up; the model is loading into the GPU/GTT pool. |
| ready | Health probe passed; ready to serve. |
| serving | A request is actively in flight. |
| idle | Up but not serving: either warm but quiet, or up with no loadable model. |
| unloading | Graceful shutdown in progress. |
| error | Failed; details in state.json and the journal. |
Transitions are constrained — only legal edges are allowed — and persisted
atomically so a dashboard reader never observes a half-written state. The
dispatchable “ready set” is ready, serving, and idle: those are the
states a request can be routed to. Any other state means the slot is
mid-lifecycle, and the dispatcher returns a structured slot.loading 503 with a
Retry-After hint rather than failing with a raw connection error.
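The ready-set check and the atomic persistence can be sketched as follows. This is a minimal sketch, not hal0's code: the enum member names, the shape of the 503 body, the `retry_after_s` default, and the contents of state.json are all assumptions, and the table of legal edges is not reproduced because the text above does not list it. The atomic write uses the common write-to-temp-then-rename pattern.

```python
import json
import os
import tempfile
from enum import Enum


class SlotState(str, Enum):
    OFFLINE = "offline"
    PULLING = "pulling"
    STARTING = "starting"
    WARMING = "warming"
    READY = "ready"
    SERVING = "serving"
    IDLE = "idle"
    UNLOADING = "unloading"
    ERROR = "error"


# States a request can be routed to.
READY_SET = frozenset({SlotState.READY, SlotState.SERVING, SlotState.IDLE})


def check_dispatchable(slot: str, state: SlotState, retry_after_s: int = 5):
    """Return None if the slot can take traffic, else a structured 503 (hypothetical shape)."""
    if state in READY_SET:
        return None
    return {
        "status": 503,
        "headers": {"Retry-After": str(retry_after_s)},
        "body": {"error": {"code": "slot.loading", "slot": slot, "state": state.value}},
    }


def persist_state(path: str, state: SlotState) -> None:
    """Write the state to a temp file, then rename it over the target in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"state": state.value}, f)  # hypothetical file contents
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Because the rename happens within one directory, a reader that opens the file at any moment gets a complete document.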
Serving and idling
When the dispatcher forwards a request to a slot, it wraps the call in the
slot’s serving state: the first concurrent request moves the slot
ready/idle → serving, and the last one to finish moves it back to ready.
For a streamed response, the slot stays serving until the stream drains. A
background idle monitor demotes a quiet ready slot to idle after its idle
timeout (300 seconds by default), so the dashboard can distinguish “warm and
working” from “warm but quiet.”
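One way to picture the serving wrapper and the idle monitor is a reference count guarded by a lock. The sketch below assumes a hypothetical `ServingTracker` class with an injectable clock; none of these names come from hal0, and the real monitor runs in the background rather than being called by hand.

```python
import threading
import time
from contextlib import contextmanager


class ServingTracker:
    """Tracks in-flight requests for one slot (illustrative, not hal0's implementation)."""

    def __init__(self, idle_timeout_s: float = 300.0, clock=time.monotonic):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._clock = clock
        self._quiet_since = clock()
        self.idle_timeout_s = idle_timeout_s
        self.state = "ready"

    @contextmanager
    def serving(self):
        with self._lock:
            self._in_flight += 1
            if self._in_flight == 1:      # first concurrent request: ready/idle -> serving
                self.state = "serving"
        try:
            yield                          # for a stream, hold this open until it drains
        finally:
            with self._lock:
                self._in_flight -= 1
                if self._in_flight == 0:   # last request out: serving -> ready
                    self.state = "ready"
                    self._quiet_since = self._clock()

    def demote_if_quiet(self) -> None:
        """What a periodic idle monitor might call: ready -> idle after the timeout."""
        with self._lock:
            quiet_for = self._clock() - self._quiet_since
            if self.state == "ready" and quiet_for >= self.idle_timeout_s:
                self.state = "idle"
```

Wrapping each forwarded call in `with tracker.serving():` keeps the slot in serving for as long as any request overlaps, which matches the first-in, last-out behaviour described above.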
Single-flight dispatch
Two safety mechanisms keep concurrent traffic from colliding:
- Per-slot lock. Load, unload, and restart for a given slot serialize behind one lock, so two requests can never race to start or swap the same slot’s container.
- Coalesced prefetch. When the dispatcher needs to fetch a cold upstream’s model list, identical concurrent fetches share a single execution — a hundred simultaneous requests for the same uncached model trigger one upstream call, not a hundred. They all receive the same result (or the same error).
The effect is that a burst of requests for a model that isn’t loaded yet produces exactly one load, and everyone waits on that load rather than starting their own.
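The coalescing behaviour can be sketched with asyncio: the first caller for a key does the work, and every concurrent caller for the same key awaits the same future. The `SingleFlight` class and the `fetch_models` coroutine are hypothetical names for illustration, and cancellation of the leading caller is not handled in this sketch.

```python
import asyncio


class SingleFlight:
    """Share one in-progress call among all concurrent callers with the same key."""

    def __init__(self):
        self._in_flight = {}  # key -> asyncio.Future for the call in progress

    async def do(self, key, fn):
        existing = self._in_flight.get(key)
        if existing is not None:
            return await existing          # follower: wait for the leader's result or error
        fut = asyncio.get_running_loop().create_future()
        self._in_flight[key] = fut
        try:
            result = await fn()            # leader: the only caller that runs fn
        except Exception as exc:
            fut.set_exception(exc)         # followers receive the same error
            fut.exception()                # mark as retrieved in case there are no followers
            raise
        else:
            fut.set_result(result)
            return result
        finally:
            del self._in_flight[key]       # later callers start a fresh fetch


async def demo():
    calls = 0

    async def fetch_models():              # stand-in for the upstream model-list call
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return ["model-a"]

    sf = SingleFlight()
    results = await asyncio.gather(
        *(sf.do("upstream-x", fetch_models) for _ in range(100))
    )
    return calls, results


calls, results = asyncio.run(demo())
print(calls, len(results))
```

The per-slot lock is the simpler half: an `asyncio.Lock` (or equivalent) per slot name around load, unload, and restart would give the same serialization.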
The GPU arbiter
The reference hardware has a single iGPU sharing one unified memory pool, so two GPU workloads can’t both hold their weights at once. The GPU arbiter makes that exclusivity explicit. It sorts GPU-backed container slots into two groups, an llm group (GPU chat/embedding slots) and an img group (the ComfyUI image-generation slot), and only one group may hold the GPU’s memory at a time. NPU and CPU slots are never arbitrated; they don’t contend for the iGPU.
When the GPU is in image mode, the arbiter refuses to dispatch to an llm-group
slot: that request gets a structured gpu.image_mode 503 (with a Retry-After)
instead of silently failing. The image container itself stays resident — what’s
exclusive is the GPU memory, not the container — so switching modes is fast.
After the image slot has had no jobs for its configured idle window
(idle_restore_minutes, 60 by default; 0 means manual-only), the arbiter
frees the diffusion models and restores the LLM slots it had stopped.
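The arbitration rules above can be summarized in a small sketch. The class, method names, response shape, and `retry_after_s` default are hypothetical; the sketch also assumes that every GPU-backed slot whose type is not image falls into the llm group, which the text implies but does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    type: str    # e.g. "llm", "embedding", "image"
    device: str  # e.g. "gpu-rocm", "gpu-vulkan", "cpu", "npu"


def gpu_group(slot: Slot):
    """Return "llm" or "img" for GPU-backed slots, None for NPU/CPU slots."""
    if not slot.device.startswith("gpu-"):
        return None                      # never arbitrated
    return "img" if slot.type == "image" else "llm"   # assumption: non-image GPU slots are llm


class GpuArbiter:
    def __init__(self, idle_restore_minutes: int = 60):
        self.mode = "llm"                # which group currently holds GPU memory
        self.idle_restore_minutes = idle_restore_minutes

    def check(self, slot: Slot, retry_after_s: int = 30):
        """Return None if dispatch is allowed, else a structured 503 (hypothetical shape)."""
        if self.mode == "img" and gpu_group(slot) == "llm":
            return {
                "status": 503,
                "headers": {"Retry-After": str(retry_after_s)},
                "body": {"error": {"code": "gpu.image_mode", "slot": slot.name}},
            }
        return None

    def should_restore(self, image_idle_minutes: float) -> bool:
        """True when the image slot has been idle long enough to hand the GPU back."""
        if self.idle_restore_minutes == 0:   # 0 means manual-only
            return False
        return self.mode == "img" and image_idle_minutes >= self.idle_restore_minutes
```

In this reading, a CPU or NPU slot passes `check` in either mode, and only llm-group slots are refused while the GPU is in image mode.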
Seeded slots, roles, and aliases
Every install starts with a set of seeded slots so the common capabilities have a home before you create anything:
chat, embed, rerank, stt, tts, img, vision, and agent.
When the NPU runtime (FLM) is present, two more shadow slots are seeded:
stt-npu and embed-npu. Seeded slots are reserved — you can’t create a
conflicting name or delete them.
Each slot carries a type (llm, embedding, reranking, transcription,
tts, or image) and a device preference (gpu-rocm, gpu-vulkan, cpu,
or npu). hal0 also keeps a couple of back-compat aliases so older callers
keep working: primary resolves to chat, and agent-hermes resolves to
agent. Aliases are never written to disk and never appear in slot listings —
they’re a transparent translation at dispatch time.
Addressing a slot by its name (e.g. model: "chat") is the stable way to pin a
co-resident model: the name doesn’t change when you swap the underlying model
file.
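Alias handling and the reserved-name rule amount to a lookup and a membership test. The sketch below is illustrative only: the function names are hypothetical, while the alias pairs and seeded names are the ones listed above.

```python
# Back-compat aliases: translated at dispatch time, never stored or listed.
ALIASES = {"primary": "chat", "agent-hermes": "agent"}

SEEDED = ("chat", "embed", "rerank", "stt", "tts", "img", "vision", "agent")
SEEDED_WITH_NPU = SEEDED + ("stt-npu", "embed-npu")  # only when the NPU runtime is present


def resolve_slot(requested: str) -> str:
    """Map an alias to its slot name; real slot names pass through unchanged."""
    return ALIASES.get(requested, requested)


def is_reserved(name: str, npu_present: bool = False) -> bool:
    """Seeded names cannot be reused for a new slot or deleted."""
    return name in (SEEDED_WITH_NPU if npu_present else SEEDED)


print(resolve_slot("primary"))       # chat
print(resolve_slot("agent-hermes"))  # agent
print(resolve_slot("embed"))         # embed
```

A request body with `model: "primary"` would therefore reach the same slot as `model: "chat"`, and a slot listing built from the configured names alone never shows either alias.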