Speech to text and text to speech

hal0 exposes two voice directions through OpenAI-compatible endpoints:

Speech-to-text (POST /v1/audio/transcriptions) — whisper-v3:turbo served by FastFlowLM (FLM) on the XDNA NPU, co-loaded with the chat model in one FLM process (the “NPU trio”: chat + STT + embed).
Text-to-speech (POST /v1/audio/speech) — Kokoro-82M ONNX container on CPU, 54 voices, MP3 by default (default voice: af_heart).

Both are children of the voice capability: voice.stt maps to the stt slot and voice.tts maps to the tts slot.

Enable the voice capability

Voice is opt-in. Enable each direction from the dashboard’s capability picker, or by POSTing a partial selection to the capability API. The body accepts any subset of { backend, provider, model, enabled }:

Speech-to-text (NPU, FLM + whisper-v3:turbo):

curl -X POST http://localhost:8080/api/capabilities/voice/stt \
  -H 'Content-Type: application/json' \
  -d '{"enabled": true, "backend": "npu"}'

Text-to-speech (CPU, Kokoro):

curl -X POST http://localhost:8080/api/capabilities/voice/tts \
  -H 'Content-Type: application/json' \
  -d '{"enabled": true, "backend": "cpu", "provider": "kokoro"}'

The response is { "ok": true, "selection": { ...current selection... } }.
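For scripted setups, the partial-selection body can be assembled from key=value pairs instead of hand-written JSON. The following is a minimal sketch, not part of hal0: selection_body and set_capability are hypothetical helper names, HAL0_URL is an assumed environment variable defaulting to the host used in the examples above, and values are assumed to contain no quotes or backslashes.

```shell
# Base URL from the examples above; HAL0_URL is an assumed override variable.
HAL0_URL="${HAL0_URL:-http://localhost:8080}"

# selection_body KEY=VALUE...   (hypothetical helper)
# Builds a JSON object from any subset of backend, provider, model, enabled.
# Values are inserted verbatim, so they must not contain quotes or backslashes.
selection_body() {
  local body="" pair key value
  for pair in "$@"; do
    key="${pair%%=*}"
    value="${pair#*=}"
    case "$key" in
      enabled)
        # Boolean field: emitted unquoted, only true/false accepted.
        [ "$value" = true ] || [ "$value" = false ] || {
          echo "enabled must be true or false" >&2; return 1; }
        ;;
      backend|provider|model)
        value="\"$value\""
        ;;
      *)
        echo "unknown field: $key" >&2; return 1
        ;;
    esac
    body="${body:+$body, }\"$key\": $value"
  done
  printf '{%s}\n' "$body"
}

# set_capability CHILD KEY=VALUE...   (hypothetical helper; CHILD is stt or tts)
# POSTs the partial selection to the capability API shown above.
set_capability() {
  local child="$1" body
  shift
  body="$(selection_body "$@")" || return 1
  curl -sS -X POST "$HAL0_URL/api/capabilities/voice/$child" \
    -H 'Content-Type: application/json' \
    -d "$body"
}
```

With these definitions, set_capability stt enabled=true backend=npu sends the same request as the first curl example.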

How the NPU STT path works

Enabling voice.stt with backend=npu does not spawn a standalone process. It drives the FLM trio — one flm serve anchor process serving chat, transcription (whisper-v3:turbo), and embeddings together. The orchestrator toggles the anchor’s --asr 1 flag and writes a type=transcription slot record for dispatch gating; the live anchor is reloaded so the change takes effect immediately. NPU transcription requires the FLM chat anchor to be loaded.

Transcribe audio (speech to text)

Send a multipart upload with the audio file and a required model field. The raw multipart bytes are forwarded unchanged to the upstream:

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F 'file=@recording.wav' \
  -F 'model=<your-stt-model>'

The model form field is required; omitting it returns 400 (request.missing_model) rather than a misleading 404.
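Because a missing model is rejected by the server, a wrapper script can catch the mistake before uploading anything. The sketch below is illustrative only: transcribe is a hypothetical helper name, HAL0_URL is an assumed environment variable, and the exit status 2 for bad arguments is a local convention rather than anything hal0 defines.

```shell
# Base URL from the examples above; HAL0_URL is an assumed override variable.
HAL0_URL="${HAL0_URL:-http://localhost:8080}"

# transcribe FILE MODEL   (hypothetical helper)
# Validates its arguments locally, then uploads the file as multipart form data.
transcribe() {
  local file="$1" model="$2"
  if [ -z "$model" ]; then
    # Mirrors the server-side 400 (request.missing_model) without a round trip.
    echo "transcribe: model is required" >&2
    return 2
  fi
  if [ ! -r "$file" ]; then
    echo "transcribe: cannot read $file" >&2
    return 2
  fi
  # --fail makes curl exit non-zero on HTTP errors such as 400.
  curl -sS --fail -X POST "$HAL0_URL/v1/audio/transcriptions" \
    -F "file=@$file" \
    -F "model=$model"
}
```

A call such as transcribe recording.wav '<your-stt-model>' then issues the same request as the curl example above.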

Synthesize speech (text to speech)

Send a JSON body with model, input, and voice. The response body is the audio stream, MP3 by default:

curl -X POST http://localhost:8080/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<your-tts-model>",
    "input": "Hello from hal0.",
    "voice": "<voice-id>"
  }' \
  --output speech.mp3

As with transcription, model is required — a missing or empty model returns 400 (request.missing_model).
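When the input text comes from a variable, quotes inside it will break a hand-written JSON body. The following sketch shows one way to build the request safely in shell. json_escape, speech_body, and speak are hypothetical helper names, HAL0_URL is an assumed environment variable, and the fallback to af_heart simply reuses the default voice named at the top of this page. The escaping covers only backslashes and double quotes; text containing newlines or other control characters would need a real JSON encoder.

```shell
# Base URL from the examples above; HAL0_URL is an assumed override variable.
HAL0_URL="${HAL0_URL:-http://localhost:8080}"

# json_escape STRING   (hypothetical helper)
# Escapes backslashes and double quotes only; control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# speech_body MODEL INPUT [VOICE]   (hypothetical helper)
# Builds the JSON request body; VOICE falls back to af_heart when omitted.
speech_body() {
  local model="$1" input="$2" voice="${3:-af_heart}"
  if [ -z "$model" ]; then
    # Mirrors the server-side 400 (request.missing_model).
    echo "speech_body: model is required" >&2
    return 2
  fi
  printf '{"model": "%s", "input": "%s", "voice": "%s"}\n' \
    "$(json_escape "$model")" "$(json_escape "$input")" "$(json_escape "$voice")"
}

# speak OUTFILE MODEL INPUT [VOICE]   (hypothetical helper)
# Sends the request and writes the returned audio stream to OUTFILE.
speak() {
  local out="$1" body
  shift
  body="$(speech_body "$@")" || return $?
  curl -sS --fail -X POST "$HAL0_URL/v1/audio/speech" \
    -H 'Content-Type: application/json' \
    -d "$body" \
    --output "$out"
}
```

For example, speak speech.mp3 '<your-tts-model>' 'Hello from hal0.' writes the synthesized audio to speech.mp3 using the default voice.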