Turns

The Turns WebSocket API provides turn-level speech-to-text: you stream audio and receive events at the start and end of each turn. It is useful when every millisecond of processing time counts.

For regular live transcription, use Realtime instead.

How It Works

  1. Connect — Open a WebSocket connection to wss://api.reson8.dev/v1/speech-to-text/turns with an authentication header.
  2. Configure — Use query parameters to set the audio encoding and language.
  3. Stream audio — Send audio data as binary WebSocket frames.
  4. Receive turn events — The server emits events as JSON text frames: a turn start, a candidate end (with text), and either a confirmed end or a continuation.
  5. Close — Close the WebSocket connection when done.

See the API reference for full details on messages, fields, and error codes.

Sequence Diagram

sequenceDiagram
    Client->>Server: Create Connection (auth + query params)
    activate Server
    Client->>Server: Audio (binary)
    Server->>Client: Turn Start
    Client->>Server: Audio (binary)
    Server->>Client: Turn End Candidate (text)
    Client->>Server: Audio (binary)
    Server->>Client: Turn Continuation
    Client->>Server: Audio (binary)
    Server->>Client: Turn End Candidate (text)
    Client->>Server: Audio (binary)
    Server->>Client: Turn End
    Client->>Server: Audio (binary)
    Server->>Client: Turn Start
    Client->>Server: Close Connection
    deactivate Server
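
The event flow in the diagram can be followed on the client with a small state tracker. The sketch below is a minimal illustration, not the official client: only the `turn_end_candidate` name appears in this guide, so the other event type strings, the `type` and `text` field names, and the `TurnTracker` class itself are assumptions. Check the API reference for the actual message schema.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

# Only "turn_end_candidate" is named in this guide. The other type strings and
# the "type" / "text" field names are assumptions made for illustration.
TURN_START = "turn_start"
TURN_END_CANDIDATE = "turn_end_candidate"
TURN_CONTINUATION = "turn_continuation"
TURN_END = "turn_end"


@dataclass
class TurnTracker:
    """Tracks the current turn and collects the text of confirmed turns."""

    in_turn: bool = False
    candidate: Optional[str] = None
    finished: List[str] = field(default_factory=list)

    def handle(self, frame: str) -> None:
        """Update state from one JSON text frame sent by the server."""
        event = json.loads(frame)
        kind = event.get("type")
        if kind == TURN_START:
            self.in_turn = True
            self.candidate = None
        elif kind == TURN_END_CANDIDATE:
            # Tentative end of turn: hold the text until it is confirmed.
            self.candidate = event.get("text", "")
        elif kind == TURN_CONTINUATION:
            # The speaker kept talking, so the held candidate is discarded.
            self.candidate = None
        elif kind == TURN_END:
            # The last candidate is confirmed as the final text of the turn.
            if self.candidate is not None:
                self.finished.append(self.candidate)
            self.in_turn = False
            self.candidate = None
```

Holding the candidate text until a confirmed end arrives lets a client begin speculative work on the candidate and drop it cheaply if a continuation follows.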

Audio Format Detection

When the encoding query parameter is set to auto (the default), the server automatically detects the audio format by reading the container headers at the start of the data. Most common formats (WAV, OGG, FLAC, WebM, etc.) are supported. If you are sending raw audio without container headers, set the encoding parameter explicitly (e.g., pcm_s16le for raw PCM, or mulaw / alaw for G.711 telephony audio) along with sample_rate and channels.
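
As an illustration of explicit configuration, the following sketch builds the connection URL from the query parameters named above (`encoding`, `sample_rate`, `channels`). The helper name `build_turns_url` is hypothetical, and the client-side check that raw encodings come with a sample rate and channel count is a design choice of this sketch, not documented server behavior.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "wss://api.reson8.dev/v1/speech-to-text/turns"

# Raw (headerless) encodings mentioned in this guide.
RAW_ENCODINGS = {"pcm_s16le", "mulaw", "alaw"}


def build_turns_url(
    encoding: str = "auto",
    sample_rate: Optional[int] = None,
    channels: Optional[int] = None,
) -> str:
    """Return the WebSocket URL with audio settings as query parameters."""
    params = {"encoding": encoding}
    if encoding in RAW_ENCODINGS:
        # Raw audio has no container header, so the format must be spelled out.
        if sample_rate is None or channels is None:
            raise ValueError("raw encodings need sample_rate and channels")
        params["sample_rate"] = str(sample_rate)
        params["channels"] = str(channels)
    return f"{BASE_URL}?{urlencode(params)}"
```

For example, `build_turns_url("mulaw", sample_rate=8000, channels=1)` produces a URL for G.711 mu-law telephony audio, while `build_turns_url()` leaves format detection to the server.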

For streaming, you can use a container with an indefinite length (e.g., a WAV header with the data size set to the maximum value) and continuously append audio frames.
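
A minimal sketch of such a header for 16-bit PCM, using only the standard library. The function name `streaming_wav_header` is hypothetical. Setting the outer RIFF size to the maximum value as well is a common convention for streamed WAV and an assumption here; the guide only mentions the data size.

```python
import struct


def streaming_wav_header(
    sample_rate: int = 16000, channels: int = 1, bits_per_sample: int = 16
) -> bytes:
    """Build a 44-byte PCM WAV header whose sizes are set to the maximum value."""
    max_size = 0xFFFFFFFF  # signals "length unknown" for a live stream
    block_align = channels * bits_per_sample // 8  # bytes per sample frame
    byte_rate = sample_rate * block_align  # bytes per second of audio
    fmt_chunk = struct.pack(
        "<IHHIIHH",
        16,  # size of the fmt chunk body
        1,  # audio format 1 = uncompressed PCM
        channels,
        sample_rate,
        byte_rate,
        block_align,
        bits_per_sample,
    )
    return (
        b"RIFF" + struct.pack("<I", max_size) + b"WAVE"
        + b"fmt " + fmt_chunk
        + b"data" + struct.pack("<I", max_size)
    )
```

Send the header as the first binary frame, then append raw little-endian PCM frames for as long as the stream lasts.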

Reconnecting mid-stream

If you reconnect to the WebSocket and resume sending audio from the middle of a stream while encoding is set to auto, the server cannot detect the format, because the container headers were sent only at the start of the original stream. Each new connection must start with a fresh audio stream that includes the headers. Alternatively, set the encoding parameter explicitly to bypass format detection entirely.

Language

The server auto-detects the language by default; pass the language query parameter to pin a specific one. See Languages for the supported set.

To decide which language to pin on future connections, set include_language=true to receive the detected language on each turn_end_candidate.
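
One way to use that information is to tally the detected languages across a session and pin the most frequent one next time. This is a sketch under stated assumptions: the guide does not give the name of the field that carries the detected language, so `language` is a placeholder passed as a parameter, and the `type` field and the `tally_languages` helper are likewise hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def tally_languages(frames: Iterable[str], language_field: str = "language") -> Counter:
    """Count detected languages across turn_end_candidate events.

    `language_field` and the "type" key are assumed names; consult the API
    reference for the real schema.
    """
    counts: Counter = Counter()
    for frame in frames:
        event = json.loads(frame)
        if event.get("type") != "turn_end_candidate":
            continue  # only candidate-end events carry the detected language
        detected = event.get(language_field)
        if detected:
            counts[detected] += 1
    return counts
```

Calling `tally_languages(frames).most_common(1)` on the text frames collected during a session gives the language to pass as the language query parameter on later connections.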

Ping / Pong

The server sends WebSocket ping frames at regular intervals to verify the connection is alive. This uses the built-in WebSocket ping/pong mechanism — clients must respond with a pong for every ping received. Most WebSocket libraries and browsers handle this automatically.

If a pong is not received in time, the server will close the connection.