
Audio — STT & TTS

hal0 ships two audio endpoints: speech-to-text on the stt slot and text-to-speech on the tts slot. Both follow the OpenAI Audio API request and response shapes, so a client written for OpenAI’s audio endpoints should work here once it is pointed at hal0.

The stt slot defaults to Moonshine, a small, fast ASR model built for real-time use on edge devices. The toolbox image is hal0-toolbox-moonshine.

curl http://localhost:8080/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F file=@hello.wav \
  -F model=stt

Response (OpenAI-shape):

{
  "text": "Hello, world."
}
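The same transcription request can be made from code. The following is a minimal sketch using only the Python standard library. It assumes the base URL from the curl example and that the response carries a top-level "text" field as shown above. The helper names `build_transcription_request` and `transcribe` are illustrative and not part of hal0.

```python
import json
import os
import urllib.request
import uuid

# Assumed base URL, copied from the curl example; adjust for your deployment.
BASE_URL = "http://localhost:8080/v1"


def build_transcription_request(audio: bytes, filename: str, model: str = "stt",
                                base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a multipart/form-data POST with the same fields as the curl call."""
    boundary = uuid.uuid4().hex
    # Two form parts: the "model" field, then the "file" field with raw bytes.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=head + audio + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path: str) -> str:
    """Send an audio file to the stt slot and return the transcribed text."""
    with open(path, "rb") as f:
        req = build_transcription_request(f.read(), os.path.basename(path))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]
```

With a running stt slot, `transcribe("hello.wav")` should return the string from the "text" field. An existing OpenAI client library pointed at the same base URL is an equally reasonable choice.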

The stt slot can host any ASR model the Moonshine provider supports: for example, the higher-accuracy whisper-large-v3-turbo (~1.6 GB) if you have the headroom, or Canary-Qwen-2.5B (Open ASR Leaderboard leader, 5.63% WER) for state-of-the-art accuracy. Swap with:

hal0 slot swap stt --model whisper-large-v3-turbo

See Recommended loadouts → Voice mode for the picks per tier.
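For readers comparing models by the WER figure quoted above: word error rate is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a simplified illustration. It normalizes only by lowercasing and whitespace splitting, whereas published benchmarks typically apply fuller text normalization, so it will not reproduce leaderboard numbers exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, one wrong word in a four-word reference gives a WER of 0.25, so a 5.63% WER corresponds to roughly one error per 18 reference words.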

The tts slot defaults to Kokoro-82M v1.0, a small open TTS model with 54 voices across 8 languages. The toolbox image is hal0-toolbox-kokoro.

curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts",
    "input": "Hello from hal0.",
    "voice": "af_bella"
  }' --output speech.wav
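The speech request above translates directly to code. The following is a minimal standard-library sketch. It assumes the same base URL as the curl example and that the response body is the raw audio, which is what `--output speech.wav` implies. The helper names `build_speech_request` and `synthesize` are illustrative and not part of hal0.

```python
import json
import shutil
import urllib.request

# Assumed base URL, copied from the curl example; adjust for your deployment.
BASE_URL = "http://localhost:8080/v1"


def build_speech_request(text: str, voice: str = "af_bella", model: str = "tts",
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the JSON POST for /audio/speech with the fields from the curl call."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "speech.wav",
               voice: str = "af_bella") -> str:
    """Request speech for `text` and stream the audio bytes to `out_path`."""
    req = build_speech_request(text, voice=voice)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        shutil.copyfileobj(resp, out)  # copy in chunks rather than buffering
    return out_path
```

With a running tts slot, `synthesize("Hello from hal0.")` should write the audio to speech.wav in the current directory.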

For voice cloning, the Kokoro provider also supports F5-TTS. Swap with:

hal0 slot swap tts --model f5-tts

Moonshine and Kokoro are first-class providers in the v1 plan: both have working code paths and slot lifecycle integration. The toolbox container images (hal0-toolbox-moonshine, hal0-toolbox-kokoro) are not yet published to ghcr.io/hal0ai/. Phase 2 publishes them, and that’s the last gap before the v1.0 cut.

Until those images land, the stt and tts slots are visible in the UI but won’t start. The dashboard’s Slots view marks them “image pending” with a link to the relevant roadmap entry.

  • Real-time streaming TTS (chunked PCM output).
  • Speaker diarization for transcription.
  • Voice cloning UX in the dashboard.
  • WebSocket transport for full duplex voice mode.