Realtime

The WebSocket API provides real-time speech-to-text transcription. You send audio data and receive transcript results as the speech is recognized.

How It Works

1. Connect: Open a WebSocket connection to wss://api.reson8.dev/v1/speech-to-text/realtime with an authentication header.
2. Configure: Use query parameters to set the audio encoding and which fields to include in the response.
3. Stream audio: Send audio data as binary WebSocket frames.
4. Receive transcripts: The server returns transcript messages as JSON text frames. Results can be interim (partial, may change) or final (stable).
5. Close: Close the WebSocket connection when done.

See the API reference for full details on messages, fields, and error codes.
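As a rough illustration of the Connect and Configure steps, the sketch below assembles the connection URL from the query parameters named on this page (encoding, sample_rate, channels, language). The helper name build_realtime_url and the example values 16000, 1, and "en" are assumptions made for illustration; check the API reference for the accepted values and for the authentication header, which is not shown here.

```python
from typing import Optional
from urllib.parse import urlencode

REALTIME_URL = "wss://api.reson8.dev/v1/speech-to-text/realtime"


def build_realtime_url(
    encoding: str = "auto",
    sample_rate: Optional[int] = None,
    channels: Optional[int] = None,
    language: Optional[str] = None,
) -> str:
    """Return the WebSocket URL with the chosen query parameters.

    Hypothetical helper: only parameters that are actually set are
    added, so the server defaults apply to everything else.
    """
    params = {"encoding": encoding}
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    if channels is not None:
        params["channels"] = channels
    if language is not None:
        params["language"] = language
    return f"{REALTIME_URL}?{urlencode(params)}"


# Raw PCM needs the explicit encoding plus sample_rate and channels.
url = build_realtime_url("pcm_s16le", sample_rate=16000, channels=1)
```

The resulting URL can be passed to any WebSocket client together with the authentication header.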

Sequence Diagram

sequenceDiagram
    Client->>Server: Create Connection (auth + query params)
    activate Server
    Client->>Server: Audio (binary)
    Client->>Server: Audio (binary)
    Server->>Client: Transcript (Partial)
    Client->>Server: Audio (binary)
    Server->>Client: Transcript (Final)
    Client->>Server: Audio (binary)
    Client->>Server: Flush Request
    Server->>Client: Transcript (Final)
    Server-->>Client: Flush Confirmation
    Client->>Server: Close Connection
    deactivate Server
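
The interim and final results in the diagram can be folded into a running transcript on the client. The following is a minimal sketch, not the documented message schema: the JSON field names "text" and "is_final" are placeholders chosen for illustration, so substitute the fields listed in the API reference.

```python
import json
from typing import List


class TranscriptBuffer:
    """Keeps stable (final) segments and the latest interim guess."""

    def __init__(self) -> None:
        self.final_segments: List[str] = []
        self.interim = ""

    def feed(self, frame: str) -> str:
        """Apply one JSON text frame and return the transcript so far."""
        message = json.loads(frame)
        # "text" and "is_final" are hypothetical field names.
        if "text" not in message:
            # Not a transcript message (for example a flush confirmation).
            return self.current()
        if message.get("is_final"):
            # A final result is stable: keep it and drop the interim guess.
            self.final_segments.append(message["text"])
            self.interim = ""
        else:
            # An interim result replaces the previous interim result.
            self.interim = message["text"]
        return self.current()

    def current(self) -> str:
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

Replacing the interim text rather than appending it reflects the fact that interim results may change until a final result arrives.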

Audio Format Detection

When the encoding query parameter is set to auto (the default), the server automatically detects the audio format by reading the container headers at the start of the data. Most common formats (WAV, OGG, FLAC, WebM, etc.) are supported. If you are sending raw PCM audio without container headers, set the encoding parameter explicitly (e.g., pcm_s16le) along with sample_rate and channels.
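
Before relying on auto, a client can check whether its stream actually starts with a recognizable container header. The sketch below tests the well-known magic bytes of the formats listed above. It is a client-side sanity check only; the server's own detection logic is not documented here and may differ. The function name sniff_container is hypothetical.

```python
from typing import Optional


def sniff_container(head: bytes) -> Optional[str]:
    """Guess the container from the first bytes of a stream.

    Returns None when no known header is found, which suggests raw
    audio that needs an explicit encoding, sample_rate and channels.
    """
    if len(head) >= 12 and head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "wav"
    if head[:4] == b"OggS":
        return "ogg"
    if head[:4] == b"fLaC":
        return "flac"
    if head[:4] == b"\x1a\x45\xdf\xa3":
        # EBML magic number, shared by WebM and Matroska.
        return "webm"
    return None
```

If the function returns None for the first chunk, the stream is probably raw audio and the encoding parameter should be set explicitly.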

For streaming, you can use a container with an indefinite length (e.g., a WAV header whose data chunk size is set to the maximum 32-bit value, 0xFFFFFFFF) and keep appending audio frames as they are captured.
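
One way to build such an indefinite-length header is sketched below for 16-bit PCM in a WAV container. The 44-byte layout is the standard RIFF/WAVE header; setting the outer RIFF size to the maximum value as well is a common streaming convention and an assumption here, since this page only mentions the data size. The function name and default values are illustrative.

```python
import struct


def streaming_wav_header(
    sample_rate: int = 16000, channels: int = 1, bits_per_sample: int = 16
) -> bytes:
    """Build a 44-byte PCM WAV header for a stream of unknown length."""
    block_align = channels * bits_per_sample // 8  # bytes per sample frame
    byte_rate = sample_rate * block_align          # bytes per second
    unknown_size = 0xFFFFFFFF                      # maximum 32-bit value
    return b"".join([
        b"RIFF", struct.pack("<I", unknown_size), b"WAVE",
        # fmt chunk: 16-byte body, format tag 1 means integer PCM.
        b"fmt ", struct.pack("<IHHIIHH", 16, 1, channels, sample_rate,
                             byte_rate, block_align, bits_per_sample),
        # data chunk with an "indefinite" size; PCM frames follow it.
        b"data", struct.pack("<I", unknown_size),
    ])


header = streaming_wav_header()
```

Send the header as the first binary frame, then send the raw PCM chunks that follow it.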

Reconnecting mid-stream

If you reconnect to the WebSocket and resume sending audio from where the previous connection left off, automatic detection fails: the container headers were sent only at the start of the original stream, so the new connection never sees them. Each new connection must start with a fresh audio stream that includes the headers. Alternatively, set the encoding parameter explicitly to bypass format detection entirely.
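
A simple way to satisfy this on the client is to keep the container header around and replay it as the first frame of every connection. The generator below is a minimal sketch of that idea; frames_for_connection is a hypothetical name, and it assumes the container tolerates a restart at an arbitrary chunk boundary (true for PCM in WAV, not for every format).

```python
from typing import Iterable, Iterator


def frames_for_connection(
    container_header: bytes, audio_chunks: Iterable[bytes]
) -> Iterator[bytes]:
    """Yield the binary frames to send on one (re)connection.

    The header always goes first so the server can detect the format,
    followed by whatever audio is still waiting to be sent.
    """
    yield container_header
    for chunk in audio_chunks:
        if chunk:  # skip empty chunks
            yield chunk


# The same pending audio source is reused after a reconnect,
# but the header is sent again at the start.
pending = iter([b"\x01\x02", b"", b"\x03\x04"])
frames = list(frames_for_connection(b"HDR", pending))
```

In this example b"HDR" stands in for a real container header.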

Language

The server auto-detects the language by default; pass the language query parameter to pin a specific one. See Languages for the supported set.

Ping / Pong

The server sends WebSocket ping frames at regular intervals to verify the connection is alive. This uses the built-in WebSocket ping/pong mechanism — clients must respond with a pong for every ping received. Most WebSocket libraries and browsers handle this automatically.

If a pong is not received in time, the server will close the connection.
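
Clients built on a WebSocket library rarely need to touch this. For a client that speaks the protocol over a bare socket, the sketch below shows what a pong looks like on the wire according to RFC 6455: opcode 0xA, the same payload as the ping, and client-side masking. The function name build_pong_frame is hypothetical.

```python
import os
from typing import Optional

PONG_OPCODE = 0xA


def build_pong_frame(ping_payload: bytes, mask_key: Optional[bytes] = None) -> bytes:
    """Build a masked client pong frame that echoes the ping payload."""
    if len(ping_payload) > 125:
        raise ValueError("control frame payloads are limited to 125 bytes")
    if mask_key is None:
        mask_key = os.urandom(4)  # clients must mask with a fresh random key
    if len(mask_key) != 4:
        raise ValueError("mask key must be exactly 4 bytes")
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(ping_payload))
    # First byte: FIN bit plus opcode. Second byte: MASK bit plus length.
    return bytes([0x80 | PONG_OPCODE, 0x80 | len(ping_payload)]) + mask_key + masked
```

The mask_key argument exists mainly to make the function deterministic in tests; in normal use, leave it unset.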