Speech to text and text to speech

hal0 exposes two voice directions through OpenAI-compatible endpoints:

  • Speech-to-text (POST /v1/audio/transcriptions) — whisper-v3:turbo served by FastFlowLM (FLM) on the XDNA NPU, co-loaded with the chat model in one FLM process (the “NPU trio”: chat + STT + embed).
  • Text-to-speech (POST /v1/audio/speech) — Kokoro-82M ONNX container on CPU, 54 voices, MP3 by default (default voice: af_heart).

Both are children of the voice capability: voice.stt maps to the stt slot and voice.tts maps to the tts slot.

Voice is opt-in. Enable each direction from the dashboard’s capability picker, or by POSTing a partial selection to the capability API. The body accepts any subset of { backend, provider, model, enabled }:

curl -X POST http://localhost:8080/api/capabilities/voice/stt \
-H 'Content-Type: application/json' \
-d '{"enabled": true, "backend": "npu"}'

The response is { "ok": true, "selection": { ...current selection... } }.
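The same call can be scripted. The following is a minimal Python sketch using only the standard library, not an official client. The helper names (build_selection_request, apply_selection) and the BASE_URL constant are illustrative, and the /api/capabilities/voice/tts path is assumed by analogy with the stt path shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed: same host and port as the curl example
ALLOWED_KEYS = {"backend", "provider", "model", "enabled"}


def build_selection_request(direction, **selection):
    """Build a POST carrying a partial selection for voice.<direction>.

    `direction` is "stt" (shown in the docs) or "tts" (path assumed by analogy).
    """
    unknown = set(selection) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unsupported selection keys: {sorted(unknown)}")
    return urllib.request.Request(
        f"{BASE_URL}/api/capabilities/voice/{direction}",
        data=json.dumps(selection).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def apply_selection(direction, **selection):
    """Send the request and return the "selection" object from the response."""
    request = build_selection_request(direction, **selection)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    if not payload.get("ok"):
        raise RuntimeError(f"capability update was not acknowledged: {payload}")
    return payload["selection"]


# Example (requires a running hal0 instance):
#   apply_selection("stt", enabled=True, backend="npu")
```

Splitting request construction from sending keeps the payload easy to inspect before anything touches the live anchor.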

Enabling voice.stt with backend=npu does not spawn a standalone process. It drives the FLM trio — one flm serve anchor process serving chat, transcription (whisper-v3:turbo), and embeddings together. The orchestrator toggles the anchor’s --asr 1 flag and writes a type=transcription slot record for dispatch gating; the live anchor is reloaded so the change takes effect immediately. NPU transcription requires the FLM chat anchor to be loaded.

To transcribe audio, send a multipart upload containing the audio file and the required model form field. hal0 forwards the raw multipart bytes unchanged to the upstream:

curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F 'file=@recording.wav' \
-F 'model=<your-stt-model>'

The model form field is required; omitting it returns 400 (request.missing_model) rather than a misleading 404.
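For callers that prefer not to shell out to curl, the upload can be sketched in Python with the standard library alone. This is an illustrative sketch under stated assumptions: the function names and BASE_URL are hypothetical, the multipart body is assembled by hand, and the response is returned as raw text because its shape is not described here.

```python
import os
import uuid
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed: same host and port as the curl example


def build_transcription_request(filename, audio_bytes, model):
    """Build a multipart/form-data POST with a `model` field and a `file` part."""
    if not model:
        # Mirrors the server-side rule: a missing model yields 400 request.missing_model.
        raise ValueError("the model form field is required")
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path, model):
    """Upload the file at `path` and return the raw response body as text."""
    with open(path, "rb") as audio_file:
        audio_bytes = audio_file.read()
    request = build_transcription_request(os.path.basename(path), audio_bytes, model)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx, including the 400 above.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Example (requires a running hal0 instance and an enabled voice.stt):
#   print(transcribe("recording.wav", "<your-stt-model>"))
```

Checking for an empty model on the client side avoids a round trip that would only come back as 400.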

To synthesize speech, send a JSON body with model, input, and voice. The response body is the audio stream:

curl -X POST http://localhost:8080/v1/audio/speech \
-H 'Content-Type: application/json' \
-d '{
"model": "<your-tts-model>",
"input": "Hello from hal0.",
"voice": "<voice-id>"
}' \
--output speech.mp3

As with transcription, model is required — a missing or empty model returns 400 (request.missing_model).
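A Python equivalent of the speech request might look like the sketch below. It is illustrative only: the function names and BASE_URL are hypothetical, and the output is written to an .mp3 file on the assumption that the default MP3 format is in effect.

```python
import json
import shutil
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed: same host and port as the curl example


def build_speech_request(model, text, voice):
    """Build the JSON POST for /v1/audio/speech."""
    if not model:
        # Mirrors the server-side rule: a missing or empty model yields 400.
        raise ValueError("model is required")
    body = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(model, text, voice, out_path):
    """Stream the returned audio into `out_path` without buffering it all in memory."""
    request = build_speech_request(model, text, voice)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)


# Example (requires a running hal0 instance and an enabled voice.tts):
#   synthesize("<your-tts-model>", "Hello from hal0.", "af_heart", "speech.mp3")
```

Streaming with shutil.copyfileobj keeps memory use flat for long inputs, since the audio is copied to disk in chunks as it arrives.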