Prerecorded
Transcribe a complete audio file.
POST https://api.reson8.dev/v1/speech-to-text/prerecorded
Request
Headers
| Header | Value |
|---|---|
| Authorization | ApiKey <api_key> or Bearer <access_token> |
| Content-Type | application/octet-stream |
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| encoding | string | auto | Audio encoding: auto for detected container formats; m4a, m4v, mp4, mov, 3gp, or 3g2 for explicit seekable media containers; or pcm_s16le for raw PCM |
| sample_rate | number | 16000 | Sample rate in Hz. Used only with certain encodings |
| channels | number | 1 | Number of audio channels. Used only with certain encodings |
| language | string | | Language to transcribe. Recommended for best quality. When omitted, the server auto-detects the language of each utterance independently. See Languages for supported codes |
| custom_model_id | string | | Optional. ID of a custom model to bias transcription. Overrides the model configured on the API client |
| include_timestamps | boolean | false | Include start_ms and duration_ms on transcripts and words |
| include_words | boolean | false | Include word-level detail on transcripts |
| include_confidence | boolean | false | Include confidence on words |
| diarize | boolean | false | Enable speaker diarization. Splits the response into per-speaker segments |
| max_speakers | number | | Optional. Maximum number of distinct speakers (1–4). When omitted, the count is determined automatically. Only used when diarize=true |
Example
curl -X POST "https://api.reson8.dev/v1/speech-to-text/prerecorded" \
-H "Authorization: ApiKey <your_api_key>" \
-H "Content-Type: application/octet-stream" \
--data-binary @recording.wav
import requests

# Stream the audio file as the raw request body.
with open("recording.wav", "rb") as f:
    response = requests.post(
        "https://api.reson8.dev/v1/speech-to-text/prerecorded",
        headers={
            "Authorization": "ApiKey <your_api_key>",
            "Content-Type": "application/octet-stream",
        },
        data=f,
    )

# Parse the JSON response body into a dict.
transcript = response.json()
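Query parameters are appended to the endpoint URL. As a minimal sketch, the helper below builds that URL from keyword arguments using only the standard library. The name `build_url` is hypothetical and not part of any official client. Sending booleans as lowercase `true`/`false` is an assumption based on how this page writes `diarize=true`, and the range check simply mirrors the documented 1–4 limit for max_speakers.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.reson8.dev/v1/speech-to-text/prerecorded"


def build_url(**params):
    """Return the endpoint URL with the given query parameters appended.

    Hypothetical helper for illustration; not part of an official client.
    """
    max_speakers = params.get("max_speakers")
    if max_speakers is not None and not 1 <= max_speakers <= 4:
        raise ValueError("max_speakers must be between 1 and 4")

    encoded = {}
    for name, value in params.items():
        if value is None:
            continue  # omitted parameters fall back to the server default
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase true/false.
            value = "true" if value else "false"
        encoded[name] = value

    query = urlencode(encoded)
    return f"{BASE_URL}?{query}" if query else BASE_URL


# Raw PCM upload with diarization limited to two speakers.
url = build_url(
    encoding="pcm_s16le",
    sample_rate=16000,
    channels=1,
    diarize=True,
    max_speakers=2,
)
print(url)
```

The resulting URL can be passed to `requests.post` in place of the bare endpoint URL, with the same headers and request body.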
Response
200 OK
Fields
| Field | Type | Included | Description |
|---|---|---|---|
| text | string | Always | Full transcript of the audio file |
| start_ms | number | When include_timestamps=true | Start time in milliseconds |
| duration_ms | number | When include_timestamps=true | Duration in milliseconds |
| words | array | When include_words=true and diarize=false | Word-level detail. When diarize=true, word detail moves into each segment instead |
| segments | array | When diarize=true | Per-speaker segments |
Each segment contains:
| Field | Type | Included | Description |
|---|---|---|---|
| text | string | Always | Text spoken in this segment |
| speaker_id | number | Always | Speaker label for this segment (integer, 0-indexed) |
| start_ms | number | When include_timestamps=true | Start time in milliseconds |
| duration_ms | number | When include_timestamps=true | Duration in milliseconds |
| words | array | When include_words=true | Word-level detail for this segment |
Each word contains:
| Field | Type | Included | Description |
|---|---|---|---|
| text | string | Always | The recognized word |
| start_ms | number | When include_timestamps=true | Start time in milliseconds |
| duration_ms | number | When include_timestamps=true | Duration in milliseconds |
| confidence | number | When include_confidence=true | Natural log-probability (≤ 0); apply exp for probability in (0, 1] |
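Because confidence is a natural log-probability, converting it to a probability is a single call to `math.exp`. The sketch below shows the conversion and a simple filter for words that fall below a cutoff. The function names and the 0.9 threshold are illustrative assumptions, not values recommended by the API.

```python
import math


def word_probability(confidence):
    """Convert a natural log-probability (<= 0) to a probability in (0, 1]."""
    return math.exp(confidence)


def low_confidence_words(words, threshold=0.9):
    """Return the text of each word whose probability is below `threshold`.

    `threshold` is an arbitrary illustrative cutoff. Words without a
    confidence field (include_confidence=false) are skipped.
    """
    return [
        word["text"]
        for word in words
        if "confidence" in word and word_probability(word["confidence"]) < threshold
    ]


words = [
    {"text": "my", "confidence": -0.010},     # exp(-0.010) is about 0.990
    {"text": "chest", "confidence": -0.041},  # exp(-0.041) is about 0.960
    {"text": "mostly", "confidence": -0.500}, # exp(-0.500) is about 0.607
]
print(low_confidence_words(words))
```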
Example
{
"text": "the patient presented with chest pain and shortness of breath"
}
With diarize=true, the response is split into per-speaker segments.
{
"text": "where does it hurt my chest mostly and for how long about two days",
"segments": [
{ "text": "where does it hurt", "speaker_id": 0 },
{ "text": "my chest mostly", "speaker_id": 1 },
{ "text": "and for how long", "speaker_id": 0 },
{ "text": "about two days", "speaker_id": 1 }
]
}
With diarize=true, include_timestamps=true, include_words=true, and include_confidence=true, each segment also carries timing and word-level detail.
{
"text": "where does it hurt my chest mostly",
"start_ms": 0,
"duration_ms": 3200,
"segments": [
{
"text": "where does it hurt",
"speaker_id": 0,
"start_ms": 0,
"duration_ms": 1500,
"words": [
{ "text": "where", "start_ms": 0, "duration_ms": 250, "confidence": -0.010 },
{ "text": "does", "start_ms": 260, "duration_ms": 200, "confidence": -0.020 },
{ "text": "it", "start_ms": 470, "duration_ms": 150, "confidence": -0.010 },
{ "text": "hurt", "start_ms": 630, "duration_ms": 870, "confidence": -0.030 }
]
},
{
"text": "my chest mostly",
"speaker_id": 1,
"start_ms": 1700,
"duration_ms": 1500,
"words": [
{ "text": "my", "start_ms": 1700, "duration_ms": 180, "confidence": -0.010 },
{ "text": "chest", "start_ms": 1890, "duration_ms": 400, "confidence": -0.041 },
{ "text": "mostly", "start_ms": 2300, "duration_ms": 900, "confidence": -0.020 }
]
}
]
}
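A diarized response can be turned into a readable dialogue by walking the segments array. The sketch below is one way to do that, assuming the response has already been parsed into a dict. The helper name `format_segments` and the output layout are hypothetical choices for illustration.

```python
def format_segments(response):
    """Return one 'speaker N: text' line per segment of a diarized response.

    Hypothetical helper; the line layout is an arbitrary choice.
    """
    lines = []
    # segments is present only when diarize=true.
    for segment in response.get("segments", []):
        line = f"speaker {segment['speaker_id']}: {segment['text']}"
        if "start_ms" in segment and "duration_ms" in segment:
            # Timestamps are present only when include_timestamps=true.
            end_ms = segment["start_ms"] + segment["duration_ms"]
            line = f"[{segment['start_ms']}-{end_ms} ms] {line}"
        lines.append(line)
    return lines


response = {
    "text": "where does it hurt my chest mostly",
    "segments": [
        {"text": "where does it hurt", "speaker_id": 0,
         "start_ms": 0, "duration_ms": 1500},
        {"text": "my chest mostly", "speaker_id": 1,
         "start_ms": 1700, "duration_ms": 1500},
    ],
}
for line in format_segments(response):
    print(line)
```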
Errors
| Status | Code | Description |
|---|---|---|
| 400 | INVALID_REQUEST | Missing or invalid parameters |
| 401 | UNAUTHORIZED | Invalid or expired access token |
| 413 | PAYLOAD_TOO_LARGE | Audio file exceeds maximum size |
| 500 | INTERNAL_ERROR | Unexpected server error |
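Client code can map these statuses to an exception before trying to read a transcript. The sketch below keys only on the HTTP status code, because the shape of the error response body is not described here. The names `TranscriptionError` and `check_status`, the `UNKNOWN` fallback label, and the choice to treat only 500 as retryable are all assumptions for illustration.

```python
# Status-to-code mapping taken from the error table.
ERROR_CODES = {
    400: "INVALID_REQUEST",
    401: "UNAUTHORIZED",
    413: "PAYLOAD_TOO_LARGE",
    500: "INTERNAL_ERROR",
}

# Assumption: only unexpected server errors are worth retrying.
RETRYABLE_STATUSES = {500}


class TranscriptionError(Exception):
    """Hypothetical exception carrying the HTTP status and error code."""

    def __init__(self, status, code):
        super().__init__(f"{status} {code}")
        self.status = status
        self.code = code
        self.retryable = status in RETRYABLE_STATUSES


def check_status(status):
    """Raise TranscriptionError for any status other than 200."""
    if status == 200:
        return
    # "UNKNOWN" is a placeholder for statuses not listed in the table.
    raise TranscriptionError(status, ERROR_CODES.get(status, "UNKNOWN"))
```

With the `requests` example from earlier, this would be called as `check_status(response.status_code)` before `response.json()`.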