Speech to text and text to speech
hal0 exposes two voice directions through OpenAI-compatible endpoints:
- Speech-to-text (POST /v1/audio/transcriptions): whisper-v3:turbo served by FastFlowLM (FLM) on the XDNA NPU, co-loaded with the chat model in one FLM process (the “NPU trio”: chat + STT + embed).
- Text-to-speech (POST /v1/audio/speech): Kokoro-82M ONNX container on CPU, 54 voices, MP3 by default (default voice: af_heart).
Both are children of the voice capability: voice.stt maps to the
stt slot and voice.tts maps to the tts slot.
Enable the voice capability
Voice is opt-in. Enable each direction from the dashboard’s capability
picker, or by POSTing a partial selection to the capability API. The body
accepts any subset of { backend, provider, model, enabled }:

    curl -X POST http://localhost:8080/api/capabilities/voice/stt \
      -H 'Content-Type: application/json' \
      -d '{"enabled": true, "backend": "npu"}'

    curl -X POST http://localhost:8080/api/capabilities/voice/tts \
      -H 'Content-Type: application/json' \
      -d '{"enabled": true, "backend": "cpu", "provider": "kokoro"}'

The response is { "ok": true, "selection": { ...current selection... } }.
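
For scripted setups, the same partial-selection POST can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper names `build_selection_request` and `send_selection` are hypothetical, and the client-side key check simply mirrors the four keys listed above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # same host and port as the curl examples

# Keys the capability API is documented to accept in a partial selection.
ALLOWED_KEYS = {"backend", "provider", "model", "enabled"}


def build_selection_request(direction, **selection):
    """Build (but do not send) the POST for voice.stt or voice.tts.

    Hypothetical helper: it only assembles the request shown in the curl
    examples and rejects keys outside the documented subset.
    """
    if direction not in ("stt", "tts"):
        raise ValueError(f"unknown voice direction: {direction!r}")
    unknown = set(selection) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unsupported selection keys: {sorted(unknown)}")
    return urllib.request.Request(
        f"{BASE_URL}/api/capabilities/voice/{direction}",
        data=json.dumps(selection).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_selection(request):
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `send_selection(build_selection_request("stt", enabled=True, backend="npu"))` against a running instance should return the `{ "ok": ..., "selection": ... }` object described above.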
How the NPU STT path works
Enabling voice.stt with backend=npu does not spawn a standalone
process. It drives the FLM trio — one flm serve anchor process serving
chat, transcription (whisper-v3:turbo), and embeddings together. The
orchestrator toggles the anchor’s --asr 1 flag and writes a
type=transcription slot record for dispatch gating; the live anchor is
reloaded so the change takes effect immediately. NPU transcription requires the
FLM chat anchor to be loaded.
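
The sequence above can be pictured with a toy state model. This is purely illustrative and is not hal0's orchestrator code: the classes `FlmAnchor` and `Orchestrator`, their fields, and the `"stt"` slot key are all hypothetical stand-ins for the behavior described in this section.

```python
from dataclasses import dataclass, field


@dataclass
class FlmAnchor:
    """Hypothetical stand-in for the single `flm serve` anchor process."""
    loaded: bool = False        # is the FLM chat anchor running?
    asr_enabled: bool = False   # models the anchor's `--asr 1` flag
    reloads: int = 0            # how many times the live anchor was reloaded


@dataclass
class Orchestrator:
    """Toy model of the enable path; no separate STT process is created."""
    anchor: FlmAnchor = field(default_factory=FlmAnchor)
    slots: dict = field(default_factory=dict)

    def enable_npu_stt(self):
        # 1. Toggle the ASR flag on the shared anchor.
        self.anchor.asr_enabled = True
        # 2. Write the slot record used for dispatch gating.
        self.slots["stt"] = {"type": "transcription"}
        # 3. Reload the live anchor so the change takes effect immediately.
        if self.anchor.loaded:
            self.anchor.reloads += 1

    def can_transcribe(self):
        # Transcription needs the chat anchor loaded, ASR on, and the record.
        record = self.slots.get("stt")
        return (
            self.anchor.loaded
            and self.anchor.asr_enabled
            and record is not None
            and record["type"] == "transcription"
        )
```

The point of the model is the last method: in this sketch, enabling STT while the chat anchor is not loaded leaves transcription unavailable, which matches the stated requirement.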
Transcribe audio (speech to text)
Send a multipart upload with the audio file and a required model field.
The raw multipart bytes are forwarded unchanged to the upstream:

    curl -X POST http://localhost:8080/v1/audio/transcriptions \
      -F 'file=@recording.wav' \
      -F 'model=<your-stt-model>'

The model form field is required; omitting it returns 400
(request.missing_model) rather than a misleading 404.
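
The same upload can be sketched in Python without third-party packages by assembling the multipart body by hand. Treat this as a sketch under assumptions: `build_multipart` and `transcribe` are hypothetical helper names, the file part is labeled `application/octet-stream` for simplicity, and the response is returned as raw text because its exact shape is not described here.

```python
import os
import uuid
import urllib.error
import urllib.request


def build_multipart(fields, file_field, filename, file_bytes, boundary=None):
    """Return (body, content_type) for a multipart/form-data upload.

    Simplification: field names and the filename are assumed to contain no
    quotes or line breaks, so no escaping is performed.
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    parts.append(file_header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, model, base_url="http://localhost:8080"):
    """POST an audio file to /v1/audio/transcriptions; return the raw body."""
    if not model:
        # Mirrors the server-side rule: a missing model is a 400.
        raise ValueError("model is required (server returns request.missing_model)")
    with open(path, "rb") as audio_file:
        audio = audio_file.read()
    body, content_type = build_multipart(
        {"model": model}, "file", os.path.basename(path), audio
    )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        detail = err.read().decode("utf-8", "replace")
        raise RuntimeError(f"transcription failed with HTTP {err.code}: {detail}") from err
```

Checking for an empty model before sending saves a round trip; the server remains the authority and still returns 400 if the field is missing.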
Synthesize speech (text to speech)
Send a JSON body with model, input, and voice. The response is the
audio stream:

    curl -X POST http://localhost:8080/v1/audio/speech \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "<your-tts-model>",
        "input": "Hello from hal0.",
        "voice": "<voice-id>"
      }' \
      --output speech.mp3

As with transcription, model is required: a missing or empty model
returns 400 (request.missing_model).
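
A Python equivalent of the speech request might look like the following. It is a minimal sketch: `build_speech_request` and `synthesize_to_file` are hypothetical helper names, and the voice defaults to af_heart only because that is the documented default voice; the field is still sent explicitly.

```python
import json
import shutil
import urllib.request


def build_speech_request(text, model, voice="af_heart",
                         base_url="http://localhost:8080"):
    """Build (but do not send) the JSON POST for /v1/audio/speech."""
    if not model:
        # Mirrors the server-side rule: a missing or empty model is a 400.
        raise ValueError("model is required (server returns request.missing_model)")
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(request, out_path):
    """Stream the audio response to disk without buffering it all in memory."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

For example, `synthesize_to_file(build_speech_request("Hello from hal0.", "<your-tts-model>"), "speech.mp3")` writes the returned audio stream to a file; the `.mp3` extension matches the default output format.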
See also
- Capabilities and profiles: the capability overlay and the FLM NPU trio.
- Manage slots: the stt and tts slots.
- OpenAI-compatible API: the full /v1 surface.