Speech-to-Text API¶
Transcribe audio to text or translate audio to English. Compatible with the OpenAI Audio API.
Base URL¶
All endpoints are relative to `https://api.getkawai.com/v1`, as shown in the examples below.
Authentication¶
When authentication is enabled, include your token in the Authorization header:
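```
Authorization: Bearer <your-token>
```

(`<your-token>` is a placeholder; the curl examples below pass it from an `$API_KEY` environment variable.)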
Transcriptions¶
Transcribe audio to text in the original language.
POST /audio/transcriptions¶
Transcribes audio into the input language. Supports multiple response formats including verbose JSON with timestamps.
Authentication: Required when auth is enabled. Token must have 'audio-transcriptions' endpoint access.
Headers¶
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer token for authentication |
| Content-Type | Yes | Must be multipart/form-data |
Request Body¶
Content-Type: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Audio file (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) |
| model | string | Yes | Transcription model (e.g., 'tiny', 'base', 'small', 'medium', 'large') |
| language | string | No | Language code (ISO 639-1). Auto-detected if not provided. |
| prompt | string | No | Optional text to guide style or continue a previous segment |
| response_format | string | No | Format: json, text, srt, vtt, verbose_json (default: json) |
| temperature | number | No | Sampling temperature 0-1 (default: 0) |
Response¶
Returns transcription text. Verbose JSON includes segments, timestamps, and language detection.
Content-Type: application/json or text
Examples¶
Basic transcription:
```bash
curl -X POST https://api.getkawai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@audio.mp3" \
  -F "model=base" \
  -F "language=en"
```
Verbose JSON with timestamps:
```bash
curl -X POST https://api.getkawai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@audio.mp3" \
  -F "model=base" \
  -F "response_format=verbose_json"
```
Translations¶
Translate audio from any language to English.
POST /audio/translations¶
Translates audio into English. The source language is automatically detected.
Authentication: Required when auth is enabled. Token must have 'audio-translations' endpoint access.
Headers¶
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer token for authentication |
| Content-Type | Yes | Must be multipart/form-data |
Request Body¶
Content-Type: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Audio file (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) |
| model | string | Yes | Translation model (e.g., 'tiny', 'base', 'small', 'medium', 'large') |
| prompt | string | No | Optional text to guide style |
| response_format | string | No | Format: json, text, srt, vtt, verbose_json (default: json) |
| temperature | number | No | Sampling temperature 0-1 (default: 0) |
Response¶
Returns English translation text. Verbose JSON includes segments and timestamps.
Content-Type: application/json or text
Examples¶
Translate Spanish audio to English:
```bash
curl -X POST https://api.getkawai.com/v1/audio/translations \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@spanish-audio.mp3" \
  -F "model=base"
```
Translate with verbose output:
```bash
curl -X POST https://api.getkawai.com/v1/audio/translations \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@french-audio.mp3" \
  -F "model=base" \
  -F "response_format=verbose_json"
```
Response Formats¶
Available response formats for transcription and translation.
JSON (default)¶
Simple JSON response with only the transcribed/translated text.
Examples¶
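An illustrative response (the `text` value here is a placeholder for your transcription):

```json
{
  "text": "Hello, this is the transcribed text."
}
```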
Verbose JSON¶
Detailed JSON response with language detection, duration, segments, and word-level timestamps.
Examples¶
```json
{
  "task": "transcribe",
  "language": "en",
  "duration": 5.2,
  "text": "Hello, this is the transcribed text.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is",
      "tokens": [123, 456, 789]
    }
  ],
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5}
  ]
}
```
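Because each segment carries `start` and `end` times, a verbose_json response is easy to post-process client-side. As a sketch (the helper functions below are illustrative, not part of the API), segments can be rendered as SRT cues:

```python
# Sketch: convert "segments" from a verbose_json response into SRT cues.
# Field names ("start", "end", "text") follow the example above; the
# helpers themselves are illustrative and not part of the API.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render a list of verbose_json segments as an SRT document."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)

print(segments_to_srt([{"start": 0.0, "end": 2.5, "text": "Hello, this is"}]))
```

Alternatively, passing `response_format=srt` returns this format directly from the server.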
Supported Models¶
Whisper models available for transcription and translation.
Whisper Models¶
OpenAI Whisper models for speech recognition.
Examples¶
Available models:
| Model | Parameters | Notes |
|---|---|---|
| tiny | 39M | Fastest |
| base | 74M | Good balance |
| small | 244M | Better accuracy |
| medium | 769M | High accuracy |
| large | 1550M | Best accuracy |