Streaming (SSE)

When a request to /v1/chat/completions or /v1/completions sets "stream": true, hal0 returns a server-sent-events (SSE) stream. The format is the standard OpenAI streaming wire format: hal0 dispatches the request to a slot and forwards the upstream inference server’s SSE bytes through unchanged, so any OpenAI-compatible streaming client works without modification.

Each event is a line beginning with data: followed by a JSON object, with a blank line separating events. The stream is terminated by a final data: [DONE] sentinel.

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1700000000,"model":"chat","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1700000000,"model":"chat","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1700000000,"model":"chat","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}

data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1700000000,"model":"chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

| Field | Meaning |
| --- | --- |
| object | chat.completion.chunk for chat; text_completion for /v1/completions. |
| choices[].delta | The incremental piece. The first chunk typically carries {"role":"assistant"}; subsequent chunks carry {"content":"…"}; the last carries {}. |
| choices[].finish_reason | null on every chunk except the last, which carries stop (or length, etc.). |
| [DONE] | Literal sentinel marking end of stream. It is not JSON; stop reading after it. |
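
The field rules above can be turned into a small client-side parser. The following is a minimal sketch in Python, not part of hal0: the name iter_content is hypothetical, it handles chat chunks only (choices[].delta.content), and it assumes the caller has already split the response body into text lines.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text pieces carried by a chat-completions SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # the sentinel is not JSON; stop reading here
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The role-only first chunk and the empty final delta carry no content.
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


# Abbreviated copies of the sample events (id, object, created, model omitted).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_content(sample)))  # prints: Hello world
```

Checking for [DONE] before calling json.loads matters, because the sentinel would otherwise raise a decode error.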

The Content-Type of a streaming response is text/event-stream. hal0 measures per-chunk throughput on the streaming path (counting content deltas) to populate its slot tok/s metrics, but does not alter the bytes it forwards.
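
To illustrate what counting content deltas can look like, here is a rough Python sketch. It is an assumption-laden illustration rather than hal0's actual metric code, which is not shown on this page: the class DeltaRateMeter is hypothetical, it treats one content-bearing chunk as one unit (a chunk is not necessarily one token), and it only reads the lines without changing them.

```python
import json
from typing import Optional


class DeltaRateMeter:
    """Counts content-bearing chunks and reports them per second (hypothetical helper)."""

    def __init__(self) -> None:
        self.count = 0
        self.first: Optional[float] = None
        self.last: Optional[float] = None

    def observe(self, line: str, now: float) -> None:
        """Inspect one SSE line seen at time `now` (seconds); the line is not modified."""
        line = line.strip()
        if not line.startswith("data:"):
            return
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        if any(c.get("delta", {}).get("content") for c in chunk.get("choices", [])):
            self.count += 1
            if self.first is None:
                self.first = now
            self.last = now

    def rate(self) -> float:
        """Content deltas per second between the first and last content chunk."""
        if self.count < 2 or self.last == self.first:
            return 0.0
        return (self.count - 1) / (self.last - self.first)


meter = DeltaRateMeter()
for t, text in [(0.0, "a"), (0.5, "b"), (1.0, "c")]:
    meter.observe("data: " + json.dumps({"choices": [{"delta": {"content": text}}]}), t)
print(meter.rate())  # prints: 2.0
```

In a live client, the timestamp passed to observe would typically come from time.monotonic() at the moment each line arrives.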

curl -N http://hal0.local:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "chat",
    "stream": true,
    "messages": [{"role": "user", "content": "Say hi in three words."}]
  }'

-N (--no-buffer) disables curl’s output buffering, so chunks print as they arrive.
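
The same request can be made from code. The sketch below uses only the Python standard library and mirrors the curl call above; the helper names build_request, stream_lines, and main are hypothetical, and the host and model name are simply copied from the curl example. It assumes the server answers with a text/event-stream body that can be read line by line.

```python
import json
import urllib.request
from typing import Iterator

BASE_URL = "http://hal0.local:8080"  # host taken from the curl example


def build_request(prompt: str, model: str = "chat") -> urllib.request.Request:
    """Build the streaming POST request without sending it."""
    body = {
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_lines(req: urllib.request.Request) -> Iterator[str]:
    """Send the request and yield each non-empty SSE line until the sentinel."""
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # the response object iterates line by line
            line = raw.decode("utf-8").rstrip("\r\n")
            if line:
                yield line
            if line == "data: [DONE]":
                return  # end of stream


def main() -> None:
    # Prints the raw data: lines, much like the curl command does.
    for line in stream_lines(build_request("Say hi in three words.")):
        print(line, flush=True)
```

Calling main() against a reachable hal0 instance should print the data: lines one by one; flush=True plays the role that -N plays for curl.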