# OpenAI-compatible API

hal0 exposes an OpenAI-compatible API at `http://localhost:8080/v1`. Any client written against the OpenAI SDK — the Python `openai` package, `openai-node`, LangChain, OpenWebUI, LiteLLM, Aider, Cursor with a custom base URL — works against hal0 unmodified.

The API binds `0.0.0.0:8080` by default (override with `HAL0_PORT`). Routes are implemented in `src/hal0/api/routes/v1.py`.
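
For example, a minimal connection check with the Python `openai` SDK. The placeholder API key is an assumption that hal0 does not enforce authentication; adjust it if your deployment does.

```python
from openai import OpenAI

# Point the stock OpenAI client at hal0.
# The key is a placeholder: this assumes hal0 ignores authentication.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # match HAL0_PORT if you override it
    api_key="not-needed",
)

resp = client.chat.completions.create(
    model="primary",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```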

## Endpoints

| Method | Path | Purpose |
| ------ | ---- | ------- |
| GET | `/v1/models` | List loaded models + slot aliases. |
| GET | `/v1/models/{model_id}` | Detail for one model. |
| POST | `/v1/chat/completions` | Chat with a model. Supports streaming. |
| POST | `/v1/completions` | Plain completion (no chat template). |
| POST | `/v1/embeddings` | Embed text into vectors. |
| POST | `/v1/rerankings` | Rerank candidates against a query. |
| POST | `/v1/audio/transcriptions` | Speech-to-text (Moonshine). |
| POST | `/v1/audio/speech` | Text-to-speech (Kokoro). |
| POST | `/v1/images/generations` | Image generation (ComfyUI on ROCm). |
## List models

```sh
curl http://localhost:8080/v1/models
```

The response includes one entry per registry model plus one entry per loaded slot name, so you can address a model directly (`qwen2.5-0.5b-instruct-q4_k_m`) or by slot (`primary`). See Slot as model.
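
With the client above, listing both kinds of entries is a short sketch:

```python
# Each registry model and each loaded slot alias appears as its own entry.
for model in client.models.list():
    print(model.id)  # e.g. "qwen2.5-0.5b-instruct-q4_k_m" or "primary"
```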

## Chat completions

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

For streaming, add `"stream": true` — see Streaming.
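
A minimal streaming sketch with the Python SDK, reusing the client from above:

```python
# Request a stream and print deltas as they arrive.
stream = client.chat.completions.create(
    model="primary",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    # Guard against chunks that carry no choices or no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```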

## Completions

```sh
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "prompt": "Once upon a time",
    "max_tokens": 64
  }'
```
## Embeddings

```sh
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embed",
    "input": ["hal0 runs locally", "OpenAI-compatible"]
  }'
```
## Rerankings

```sh
curl http://localhost:8080/v1/rerankings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embed",
    "query": "atomic config writes",
    "documents": [
      "TOML config is written via NamedTemporaryFile + os.replace.",
      "Slots bind 127.0.0.1 in the 8081-8099 range."
    ]
  }'
```

Reranking piggybacks on the embed slot because both are served by the same backend process.
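
The OpenAI SDK has no rerank method, so a sketch with plain `requests` calls the endpoint directly; the response is printed raw, with no assumption about its exact fields:

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/rerankings",
    json={
        "model": "embed",
        "query": "atomic config writes",
        "documents": [
            "TOML config is written via NamedTemporaryFile + os.replace.",
            "Slots bind 127.0.0.1 in the 8081-8099 range.",
        ],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # shape depends on hal0's rerank response
```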

## Transcriptions

```sh
# -F makes curl send multipart/form-data with the boundary set automatically;
# do not set the Content-Type header by hand, or the boundary is lost.
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@hello.wav \
  -F model=stt
```

See Audio for the full Moonshine + Kokoro story.
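
The same upload through the Python SDK, reusing the client from above; the SDK assembles the multipart body for you:

```python
# Send a local WAV file to the stt slot and print the transcript text.
with open("hello.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="stt", file=f)
print(transcript.text)
```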

## Speech

```sh
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts",
    "input": "Hello from hal0.",
    "voice": "af_bella"
  }' --output speech.wav
```
## Image generation

```sh
curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sdxl-turbo",
    "prompt": "a cat in a hat, studio lighting",
    "size": "1024x1024",
    "response_format": "url"
  }'
```

Curated models: `sdxl-turbo` (SAI Non-Commercial Research), `sd-1.5-pruned-emaonly` (CreativeML Open RAIL-M), `flux-schnell` (Apache-2.0). See Image generation for the full request shape, response shape, slot configuration, and hardware requirements.
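
The same request through the Python SDK's `images.generate`, assuming the `url` response format shown above:

```python
# Generate one image and print the URL hal0 returns.
img = client.images.generate(
    model="sdxl-turbo",
    prompt="a cat in a hat, studio lighting",
    size="1024x1024",
    response_format="url",
)
print(img.data[0].url)
```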

## Errors

Every failure response carries a structured envelope:

```json
{
  "error": {
    "code": "slot.not_ready",
    "message": "primary is still warming",
    "details": {
      "slot": "primary",
      "state": "warming"
    }
  }
}
```

Codes are namespaced — `slot.*`, `model.*`, `dispatch.*`, `config.*`, `system.*`. The dashboard surfaces them inline, and the CLI prints the same codes, so error reports from users and developers stay anchored to the same identifiers.
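
A sketch of reading the envelope from Python, assuming the `openai` SDK's usual behavior of exposing the server's parsed error body on `APIStatusError`:

```python
import openai

try:
    client.chat.completions.create(
        model="primary",
        messages=[{"role": "user", "content": "Hello!"}],
    )
except openai.APIStatusError as e:
    # e.body should carry the envelope's "error" object, including the
    # namespaced "code" and the "details" fields, when the server returns JSON.
    print(e.status_code, e.body)
```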

## Remote providers

The same `/v1/*` surface fronts external OpenAI-compatible providers when configured — OpenRouter, Anthropic, OpenAI, custom endpoints. You can mix local and remote models in a single config; the dispatcher picks the backend based on the request's `model` field.
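
A sketch of that per-model routing; the remote alias name here is hypothetical and stands in for whatever you map to a provider in your config:

```python
# Same client, same call shape; only the model field differs.
local = client.chat.completions.create(
    model="primary",  # local slot
    messages=[{"role": "user", "content": "Summarize this repo."}],
)
remote = client.chat.completions.create(
    model="remote-sonnet",  # hypothetical alias mapped to a remote provider
    messages=[{"role": "user", "content": "Summarize this repo."}],
)
```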