Speech-to-Text API¶
Transcribe audio to text or translate audio to English. Compatible with the OpenAI Audio API.
Base URL¶
All endpoints are relative to `https://api.getkawai.com/v1`, as shown in the examples below.
Authentication¶
When authentication is enabled, include your token in the Authorization header:
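```
Authorization: Bearer <your-token>
```

(`<your-token>` is a placeholder; the curl examples below pass it from an `$API_KEY` environment variable.)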
Transcriptions¶
Transcribe audio to text in the original language.
POST /audio/transcriptions¶
Transcribes audio into the input language. Supports multiple response formats including verbose JSON with timestamps.
Authentication: Required when auth is enabled. Token must have 'audio-transcriptions' endpoint access.
Headers¶
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer token for authentication |
| Content-Type | Yes | Must be multipart/form-data |
Request Body¶
Content-Type: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Audio file (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) |
| model | string | Yes | Transcription model (e.g., 'tiny', 'base', 'small', 'medium', 'large') |
| language | string | No | Language code (ISO 639-1). Auto-detected if not provided. |
| prompt | string | No | Optional text to guide style or continue a previous segment |
| response_format | string | No | Format: json, text, srt, vtt, verbose_json (default: json) |
| temperature | number | No | Sampling temperature 0-1 (default: 0) |
Response¶
Returns transcription text. Verbose JSON includes segments, timestamps, and language detection.
Content-Type: application/json or text
Examples¶
Basic transcription:
```bash
curl -X POST https://api.getkawai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@audio.mp3" \
  -F "model=base" \
  -F "language=en"
```
Verbose JSON with timestamps:
```bash
curl -X POST https://api.getkawai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@audio.mp3" \
  -F "model=base" \
  -F "response_format=verbose_json"
```
Translations¶
Translate audio from any language to English.
POST /audio/translations¶
Translates audio into English. The source language is automatically detected.
Authentication: Required when auth is enabled. Token must have 'audio-translations' endpoint access.
Headers¶
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer token for authentication |
| Content-Type | Yes | Must be multipart/form-data |
Request Body¶
Content-Type: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Audio file (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) |
| model | string | Yes | Translation model (e.g., 'tiny', 'base', 'small', 'medium', 'large') |
| prompt | string | No | Optional text to guide style |
| response_format | string | No | Format: json, text, srt, vtt, verbose_json (default: json) |
| temperature | number | No | Sampling temperature 0-1 (default: 0) |
Response¶
Returns English translation text. Verbose JSON includes segments and timestamps.
Content-Type: application/json or text
Examples¶
Translate Spanish audio to English:
```bash
curl -X POST https://api.getkawai.com/v1/audio/translations \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@spanish-audio.mp3" \
  -F "model=base"
```
Translate with verbose output:
```bash
curl -X POST https://api.getkawai.com/v1/audio/translations \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@french-audio.mp3" \
  -F "model=base" \
  -F "response_format=verbose_json"
```
Response Formats¶
Available response formats for transcription and translation.
JSON (default)¶
Simple JSON response with only the transcribed/translated text.
Examples¶
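An illustrative response (the `text` value here is a placeholder for your transcription):

```json
{
  "text": "Hello, this is the transcribed text."
}
```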
Verbose JSON¶
Detailed JSON response with language detection, duration, segments, and word-level timestamps.
Examples¶
```json
{
  "task": "transcribe",
  "language": "en",
  "duration": 5.2,
  "text": "Hello, this is the transcribed text.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is",
      "tokens": [123, 456, 789]
    }
  ],
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5}
  ]
}
```
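Because each segment carries `start` and `end` times, a verbose_json response is easy to post-process client-side. As a sketch (the helper functions below are illustrative, not part of the API), segments can be rendered as SRT cues:

```python
# Sketch: convert "segments" from a verbose_json response into SRT cues.
# Field names ("start", "end", "text") follow the example above; the
# helpers themselves are illustrative and not part of the API.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render a list of verbose_json segments as an SRT document."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)

print(segments_to_srt([{"start": 0.0, "end": 2.5, "text": "Hello, this is"}]))
```

Alternatively, passing `response_format=srt` returns this format directly from the server.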
Supported Models¶
Whisper models available for transcription and translation.
Whisper Models¶
OpenAI Whisper models for speech recognition.
Examples¶
Available models:
| Model | Parameters | Notes |
|---|---|---|
| tiny | 39M | Fastest |
| base | 74M | Good balance |
| small | 244M | Better accuracy |
| medium | 769M | High accuracy |
| large | 1550M | Best accuracy |