Non-Live STT
Non-Live STT (Speech-To-Text) refers to transcription that is not streamed in real time: the audio is processed in its entirety and the transcript is returned as a whole.
This page walks through every option you can add to a request, from the most basic request to the most complete one.
Supported request formats
| Approach | Best For | Efficiency | Ease of Use | Notes |
|---|---|---|---|---|
| Multipart | Large files, developer flexibility | High | Moderate | Direct binary data without size inflation. |
| Base64 in JSON | Small files or devs restricted to JSON-only | Low (size inflation) | Easy | Adds ~33% overhead due to Base64 encoding. |
| File URL | Pre-hosted files (e.g., in cloud storage) | Very High | Easiest | Fastest when the file is already hosted online. |
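As a sketch of the Base64-in-JSON approach, the snippet below builds the payload and demonstrates the ~33% size inflation mentioned above. The field values (`"using": "base64"` with the encoded audio in `"content"`) are assumptions inferred from the other formats shown on this page, not confirmed API fields.

```python
import base64

# Stand-in for real audio bytes (assumption: any binary audio works here).
raw_audio = b"\x00\x01" * 48000

# Base64-encode the audio so it can travel inside a JSON string.
encoded = base64.b64encode(raw_audio).decode("ascii")

payload = {
    "provider": "openai",
    "model": "whisper-1",
    "audio_data": {
        "content": encoded,       # assumed field for inline audio
        "using": "base64",        # assumed value, by analogy with "multipart"/"url"
        "format": "auto"
    }
}

# Base64 turns every 3 bytes into 4 characters, hence the ~33% overhead.
overhead = len(encoded) / len(raw_audio)
print(f"Size inflation: {overhead:.2f}x")  # prints: Size inflation: 1.33x
```

This is why Base64-in-JSON is best reserved for small files: the inflation applies to the whole audio payload.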
Base request
The base request contains the minimum amount of information you must send for a transcription request.
POST https://api.sonarspeak.com/v1/stt
Content-Type: multipart/form-data
--boundary
Content-Disposition: form-data; name="audio_data"
Content-Type: application/octet-stream

<binary-audio-data-here>
--boundary
Content-Disposition: form-data; name="metadata"
Content-Type: application/json

{
  "provider": "openai",
  "model": "whisper-1",
  "audio_data": {
    "using": "multipart",
    "format": "auto"
  }
}
--boundary--
import requests
import json

url = "https://api.sonarspeak.com/v1/stt"
access_token = "your_access_token_here"

# Path to the audio file
audio_file_path = "path_to_your_audio_file.raw"

# Define the metadata for the transcription
metadata = {
    "provider": "openai",
    "model": "whisper-1",
    "audio_data": {
        "using": "multipart",
        "format": "auto"  # Or use a known format
    }
}

# Open the audio file in binary mode
with open(audio_file_path, "rb") as audio_file:
    # Create the multipart form-data payload; the part name "audio_data"
    # must match the field name shown in the raw request above
    files = {
        "audio_data": (audio_file_path, audio_file, "application/octet-stream"),
        "metadata": (None, json.dumps(metadata), "application/json"),
    }

    # Add the authorization header
    headers = {
        "Authorization": f"Bearer {access_token}"
    }

    # Send the request
    response = requests.post(url, files=files, headers=headers)

# Print the response
if response.ok:
    print("Response:", response.json())
else:
    print("Error:", response.status_code, response.text)
Specifying sample rate
If not specified, the sample rate defaults to 16000 Hz.
Recommended rates
Most STT models and providers recommend 16000 Hz as a standard value.
If your audio has a lower rate, it is also recommended to resample it to 16000 Hz.
Rates above 16000 Hz might perform better, but this is not always the case, and processing will be slower.
SonarSpeak will not modify sample rates unless explicitly instructed. If no value is provided, it defaults to 16000 Hz.
Developers should resample audio files to 16000 Hz if higher or unspecified rates impact transcription performance.
Although optional, the sample rate can be added in the audio_data JSON object.
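For example, declaring a 16000 Hz sample rate alongside the other audio_data fields might look like this (field names taken from the complete request example on this page):

```json
{
  "provider": "openai",
  "model": "whisper-1",
  "audio_data": {
    "using": "multipart",
    "format": "auto",
    "sample_rate_hertz": 16000
  }
}
```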
Adding options
Syntax sanity
All options MUST be encased inside an "options": {} object.
About unsupported options
If an option is not supported by the model, the request WILL NOT BE PROCESSED and you will be notified accordingly.
To avoid this behavior, you'll need to ignore unsupported options (see Ignoring unsupported options below).
Automatic Punctuation
Enable or disable punctuation generation.
Defaults to true.
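Using the automatic_punctuation field from the complete request example below, disabling it looks like this:

```json
{
  "options": {
    "automatic_punctuation": false
  }
}
```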
Language Specification
Enable automatic language detection or specify an input language.
Defaults to true (automatic detection).
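To specify an input language instead of relying on detection, use the language object shown in the complete request example below:

```json
{
  "options": {
    "language": {
      "auto_language": false,
      "language_code": "en-US"
    }
  }
}
```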
Timestamping
Enable or disable timestamping. Defaults to false.
granularity supports:
- word: Each word has its own timestamp.
- phrase: Timestamps are applied to groups of words.
per_speaker requires Speaker Diarization. Defaults to true.
{
  ...
  "options": {
    ...
    "timestamps": {
      "enable_timestamps": true,
      "granularity": "word",
      "per_speaker": true
    }
  }
}
Speaker Diarization
Enable multi-speaker detection with dialogue separation.
Defaults to false.
min_speaker_count and max_speaker_count help the model distinguish between several speakers. They might not be needed by some models, which detect or limit the number of speakers automatically.
{
  ...
  "options": {
    ...
    "speaker_diarization": {
      "enabled": true,
      "min_speaker_count": 2,
      "max_speaker_count": 4
    }
  }
}
Enabling confidence scores
Enable returning confidence scores in the response.
Defaults to true.
Alternative transcriptions
Enable returning alternative transcriptions in the response.
Defaults to false.
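Both flags sit directly in the options object, as in the complete request example below:

```json
{
  "options": {
    "enable_confidence_scores": true,
    "enable_alternative_transcriptions": true
  }
}
```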
Specific vocabulary
Tells the model which words are likely to appear so it can weight them more heavily during processing.
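The vocabulary is passed as an array of strings, as in the complete request example below:

```json
{
  "options": {
    "vocabulary": ["Linux", "sonarspeak"]
  }
}
```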
Profanity filter
Remove, mask or leave intact slurs, profanity, etc.
- intact: Text remains untouched.
- mask: Replaces profane text with symbols (e.g., asterisks ****).
- remove: Removes profanity entirely.
Defaults to intact.
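For example, masking profanity with symbols looks like this:

```json
{
  "options": {
    "profanity_filter": "mask"
  }
}
```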
Ignoring unsupported options
As explained in the Adding options section, some requests might not complete because the model does not support some of the options passed in the request.
For example, if you use a model or provider that does not support speaker diarization, you will get an error like this:
{
  "error": "Provider or Model does not support some of the options specified",
  "option_errors": [
    {
      "name": "timestamp.per_speaker",
      "level": "provider",
      "message": "timestamp.per_speaker is not supported by provider 'openai'"
    },
    {
      "name": "speaker_diarization",
      "level": "provider",
      "message": "speaker_diarization not supported by provider 'openai'"
    }
  ]
}
To avoid this, you can either know your provider's features by looking at the provider specification in this documentation, or ignore the unsupported options and fall back to default values, when applicable.
ignore_unsupported defaults to false.
Ignoring unsupported options might lead to a specific message in the response indicating which options have been omitted. Refer to Non-live Response Example to check how they are notified in the response.
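To opt in, add the ignore_unsupported flag inside the options object:

```json
{
  "options": {
    "ignore_unsupported": true
  }
}
```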
A complete request example
This is what a complete request looks like when ignoring all unsupported options.
This is a massive example
In most cases you won't need all of these settings; a basic request is usually enough.
{
  "provider": "openai",
  "model": "whisper-1",
  "audio_data": {
    "content": "example.com/path/to/audiofile",
    "using": "url",
    "format": "mp3",
    "sample_rate_hertz": 44100
  },
  "options": {
    "automatic_punctuation": true,
    "language": {
      "auto_language": false,
      "language_code": "en-US"
    },
    "timestamps": {
      "enable_timestamps": true,
      "granularity": "word",
      "per_speaker": true
    },
    "speaker_diarization": {
      "enabled": true,
      "min_speaker_count": 2,
      "max_speaker_count": 4
    },
    "enable_confidence_scores": false,
    "enable_alternative_transcriptions": true,
    "vocabulary": ["Linux", "sonarspeak"],
    "profanity_filter": "remove",
    "ignore_unsupported": true
  }
}