# OpenAI-compatible API

hal0 exposes an OpenAI-compatible API at `http://localhost:8080/v1`. Any client written against the OpenAI SDK — the Python `openai` package, `openai-node`, LangChain, OpenWebUI, LiteLLM, Aider, Cursor with a custom base URL — works against hal0 unmodified.

The API binds `0.0.0.0:8080` by default (override with `HAL0_PORT`). Routes are implemented in `src/hal0/api/routes/v1.py`.
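
For example, a minimal connection check with the Python `openai` SDK. The placeholder API key is an assumption that hal0 does not enforce authentication; adjust it if your deployment does.

```python
from openai import OpenAI

# Point the stock OpenAI client at hal0.
# The key is a placeholder: this assumes hal0 ignores authentication.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # match HAL0_PORT if you override it
    api_key="not-needed",
)

resp = client.chat.completions.create(
    model="primary",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```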

## Endpoints

| Method | Path | Purpose |
| ------ | ---- | ------- |
| GET | `/v1/models` | List loaded models + slot aliases. |
| GET | `/v1/models/{model_id}` | Detail for one model. |
| POST | `/v1/chat/completions` | Chat with a model. Supports streaming. |
| POST | `/v1/completions` | Plain completion (no chat template). |
| POST | `/v1/embeddings` | Embed text into vectors. |
| POST | `/v1/rerankings` | Rerank candidates against a query. |
| POST | `/v1/audio/transcriptions` | Speech-to-text (Moonshine). |
| POST | `/v1/audio/speech` | Text-to-speech (Kokoro). |
| POST | `/v1/images/generations` | Image generation (ComfyUI on ROCm). |
## List models

```sh
curl http://localhost:8080/v1/models
```

The response includes one entry per registry model plus one entry per loaded slot name, so you can address a model directly (`qwen2.5-0.5b-instruct-q4_k_m`) or by slot (`primary`). See Slot as model.
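
With the client above, listing both kinds of entries is a short sketch:

```python
# Each registry model and each loaded slot alias appears as its own entry.
for model in client.models.list():
    print(model.id)  # e.g. "qwen2.5-0.5b-instruct-q4_k_m" or "primary"
```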

## Chat completions

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

For streaming, add `"stream": true` — see Streaming.
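
A minimal streaming sketch with the Python SDK, reusing the client from above:

```python
# Request a stream and print deltas as they arrive.
stream = client.chat.completions.create(
    model="primary",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    # Guard against chunks that carry no choices or no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```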

## Completions

```sh
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "prompt": "Once upon a time",
    "max_tokens": 64
  }'
```
## Embeddings

```sh
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embed",
    "input": ["hal0 runs locally", "OpenAI-compatible"]
  }'
```
## Rerankings

```sh
curl http://localhost:8080/v1/rerankings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embed",
    "query": "atomic config writes",
    "documents": [
      "TOML config is written via NamedTemporaryFile + os.replace.",
      "Slots bind 127.0.0.1 in the 8081-8099 range."
    ]
  }'
```

Reranking piggybacks on the embed slot because both are served by the same backend process.
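
The OpenAI SDK has no rerank method, so a sketch with plain `requests` calls the endpoint directly; the response is printed raw, with no assumption about its exact fields:

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/rerankings",
    json={
        "model": "embed",
        "query": "atomic config writes",
        "documents": [
            "TOML config is written via NamedTemporaryFile + os.replace.",
            "Slots bind 127.0.0.1 in the 8081-8099 range.",
        ],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # shape depends on hal0's rerank response
```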

## Transcriptions

```sh
# -F makes curl send multipart/form-data with the boundary set automatically;
# do not set the Content-Type header by hand, or the boundary is lost.
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@hello.wav \
  -F model=stt
```

See Audio for the full Moonshine + Kokoro story.
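
The same upload through the Python SDK, reusing the client from above; the SDK assembles the multipart body for you:

```python
# Send a local WAV file to the stt slot and print the transcript text.
with open("hello.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="stt", file=f)
print(transcript.text)
```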

## Speech

```sh
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts",
    "input": "Hello from hal0.",
    "voice": "af_bella"
  }' --output speech.wav
```
## Image generation

```sh
curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sdxl-turbo",
    "prompt": "a cat in a hat, studio lighting",
    "size": "1024x1024",
    "response_format": "url"
  }'
```

Curated models: `sdxl-turbo` (SAI Non-Commercial Research), `sd-1.5-pruned-emaonly` (CreativeML Open RAIL-M), `flux-schnell` (Apache-2.0). See Image generation for the full request shape, response shape, slot configuration, and hardware requirements.
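
The same request through the Python SDK's `images.generate`, assuming the `url` response format shown above:

```python
# Generate one image and print the URL hal0 returns.
img = client.images.generate(
    model="sdxl-turbo",
    prompt="a cat in a hat, studio lighting",
    size="1024x1024",
    response_format="url",
)
print(img.data[0].url)
```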

## Errors

Every failure response carries a structured envelope:

```json
{
  "error": {
    "code": "slot.not_ready",
    "message": "primary is still warming",
    "details": {
      "slot": "primary",
      "state": "warming"
    }
  }
}
```

Codes are namespaced — `slot.*`, `model.*`, `dispatch.*`, `config.*`, `system.*`. The dashboard surfaces them inline, and the CLI prints the same codes, so error reports from users and developers stay anchored to the same identifiers.
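
A sketch of reading the envelope from Python, assuming the `openai` SDK's usual behavior of exposing the server's parsed error body on `APIStatusError`:

```python
import openai

try:
    client.chat.completions.create(
        model="primary",
        messages=[{"role": "user", "content": "Hello!"}],
    )
except openai.APIStatusError as e:
    # e.body should carry the envelope's "error" object, including the
    # namespaced "code" and the "details" fields, when the server returns JSON.
    print(e.status_code, e.body)
```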

## Remote providers

The same `/v1/*` surface fronts external OpenAI-compatible providers when configured — OpenRouter, Anthropic, OpenAI, custom endpoints. You can mix local and remote models in a single config; the dispatcher picks the backend based on the request's `model` field.
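
A sketch of that per-model routing; the remote alias name here is hypothetical and stands in for whatever you map to a provider in your config:

```python
# Same client, same call shape; only the model field differs.
local = client.chat.completions.create(
    model="primary",  # local slot
    messages=[{"role": "user", "content": "Summarize this repo."}],
)
remote = client.chat.completions.create(
    model="remote-sonnet",  # hypothetical alias mapped to a remote provider
    messages=[{"role": "user", "content": "Summarize this repo."}],
)
```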