Non-Live STT
Non-Live STT (Speech-To-Text) refers to transcription that is not streamed in real time: the audio is processed in its entirety and the transcript is returned as a whole.
This page walks through every option you can add to a request, from the most basic request to the most complete one.
Supported request formats
| Approach | Best For | Efficiency | Ease of Use | Notes |
|---|---|---|---|---|
| Multipart | Large files, developer flexibility | High | Moderate | Direct binary data without size inflation. |
| Base64 in JSON | Small files or devs restricted to JSON-only | Low (size inflation) | Easy | Adds ~33% overhead due to Base64 encoding. |
| File URL | Pre-hosted files (e.g., in cloud storage) | Very High | Easiest | Fastest when the file is already hosted online. |
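As a sketch of the Base64-in-JSON approach, the snippet below builds the payload and demonstrates the ~33% size inflation mentioned above. The field values (`"using": "base64"` with the encoded audio in `"content"`) are assumptions inferred from the other formats shown on this page, not confirmed API fields.

```python
import base64

# Stand-in for real audio bytes (assumption: any binary audio works here).
raw_audio = b"\x00\x01" * 48000

# Base64-encode the audio so it can travel inside a JSON string.
encoded = base64.b64encode(raw_audio).decode("ascii")

payload = {
    "provider": "openai",
    "model": "whisper-1",
    "audio_data": {
        "content": encoded,       # assumed field for inline audio
        "using": "base64",        # assumed value, by analogy with "multipart"/"url"
        "format": "auto"
    }
}

# Base64 turns every 3 bytes into 4 characters, hence the ~33% overhead.
overhead = len(encoded) / len(raw_audio)
print(f"Size inflation: {overhead:.2f}x")  # prints: Size inflation: 1.33x
```

This is why Base64-in-JSON is best reserved for small files: the inflation applies to the whole audio payload.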
Base request
The base request contains the minimum amount of information you must send for a transcription request.
POST https://api.sonarspeak.com/v1/stt
Content-Type: multipart/form-data
--boundary
Content-Disposition: form-data; name="audio_data"
Content-Type: application/octet-stream

<binary-audio-data-here>
--boundary
Content-Disposition: form-data; name="metadata"
Content-Type: application/json

{
  "provider": "openai",
  "model": "whisper-1",
  "audio_data": {
    "using": "multipart",
    "format": "auto"
  }
}
--boundary--
import requests
import json

url = "https://api.sonarspeak.com/v1/stt"
access_token = "your_access_token_here"

# Path to the audio file
audio_file_path = "path_to_your_audio_file.raw"

# Define the metadata for the transcription
metadata = {
    "provider": "openai",
    "model": "whisper-1",
    "audio_data": {
        "using": "multipart",
        "format": "auto"  # Or use a known format
    }
}

# Open the audio file in binary mode
with open(audio_file_path, "rb") as audio_file:
    # Create the multipart form-data payload; the part name "audio_data"
    # must match the field name shown in the raw request above
    files = {
        "audio_data": (audio_file_path, audio_file, "application/octet-stream"),
        "metadata": (None, json.dumps(metadata), "application/json"),
    }

    # Add the authorization header
    headers = {
        "Authorization": f"Bearer {access_token}"
    }

    # Send the request
    response = requests.post(url, files=files, headers=headers)

# Print the response
if response.ok:
    print("Response:", response.json())
else:
    print("Error:", response.status_code, response.text)
Specifying sample rate
If not specified, the sample rate defaults to 16000 Hz.
Recommended rates
Most STT models and providers recommend 16000 Hz as a standard value.
If your audio has a lower rate, it is also recommended to resample it to 16000 Hz.
Rates above 16000 Hz might perform better, but this is not always the case, and processing will be slower.
SonarSpeak will not modify sample rates unless explicitly instructed. If no value is provided, it defaults to 16000 Hz.
Developers should resample audio files to 16000 Hz if higher or unspecified rates impact transcription performance.
Although optional, the sample rate can be added in the audio_data JSON object.
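For example, declaring a 16000 Hz sample rate alongside the other audio_data fields might look like this (field names taken from the complete request example on this page):

```json
{
  "provider": "openai",
  "model": "whisper-1",
  "audio_data": {
    "using": "multipart",
    "format": "auto",
    "sample_rate_hertz": 16000
  }
}
```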
Adding options
Syntax sanity
All options MUST be encased inside an "options": {} object.
About unsupported options
If an option is not supported by the model, the request WILL NOT BE PROCESSED and you will be notified accordingly.
To avoid this behavior, you'll need to ignore unsupported options (see Ignoring unsupported options below).
Automatic Punctuation
Enable or disable punctuation generation.
Defaults to true.
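Using the automatic_punctuation field from the complete request example below, disabling it looks like this:

```json
{
  "options": {
    "automatic_punctuation": false
  }
}
```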
Language Specification
Enable automatic language detection or specify an input language.
Defaults to true (automatic detection).
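To specify an input language instead of relying on detection, use the language object shown in the complete request example below:

```json
{
  "options": {
    "language": {
      "auto_language": false,
      "language_code": "en-US"
    }
  }
}
```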
Timestamping
Enable or disable timestamping. Defaults to false.
granularity supports:
- word: Each word has its own timestamp.
- phrase: Timestamps are applied to groups of words.
per_speaker requires Speaker Diarization. Defaults to true.
{
  ...
  "options": {
    ...
    "timestamps": {
      "enable_timestamps": true,
      "granularity": "word",
      "per_speaker": true
    }
  }
}
Speaker Diarization
Enable multi-speaker detection with dialogue separation.
Defaults to false.
min_speaker_count and max_speaker_count help the model distinguish between several speakers. They might not be needed by some models, which detect or limit the number of speakers automatically.
{
  ...
  "options": {
    ...
    "speaker_diarization": {
      "enabled": true,
      "min_speaker_count": 2,
      "max_speaker_count": 4
    }
  }
}
Enabling confidence scores
Enable returning confidence scores in the response.
Defaults to true.
Alternative transcriptions
Enable returning alternative transcriptions in the response.
Defaults to false.
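Both flags sit directly in the options object, as in the complete request example below:

```json
{
  "options": {
    "enable_confidence_scores": true,
    "enable_alternative_transcriptions": true
  }
}
```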
Specific vocabulary
Tells the model which words are likely to appear so it can weight them more heavily during processing.
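The vocabulary is passed as an array of strings, as in the complete request example below:

```json
{
  "options": {
    "vocabulary": ["Linux", "sonarspeak"]
  }
}
```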
Profanity filter
Remove, mask or leave intact slurs, profanity, etc.
- intact: Text remains untouched.
- mask: Replaces profane text with symbols (e.g., asterisks ****).
- remove: Removes profanity entirely.
Defaults to intact.
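For example, masking profanity with symbols looks like this:

```json
{
  "options": {
    "profanity_filter": "mask"
  }
}
```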
Ignoring unsupported options
As explained in the Adding options section, some requests might not complete because the model does not support some of the options passed in the request.
For example, if you use a model or provider that does not support speaker diarization, you will get an error like this:
{
  "error": "Provider or Model does not support some of the options specified",
  "option_errors": [
    {
      "name": "timestamp.per_speaker",
      "level": "provider",
      "message": "timestamp.per_speaker is not supported by provider 'openai'"
    },
    {
      "name": "speaker_diarization",
      "level": "provider",
      "message": "speaker_diarization not supported by provider 'openai'"
    }
  ]
}
To avoid this, you can either know your provider's features by looking at the provider specification in this documentation, or ignore the unsupported options and fall back to default values, when applicable.
ignore_unsupported defaults to false.
Ignoring unsupported options might lead to a specific message in the response indicating which options have been omitted. Refer to Non-live Response Example to check how they are notified in the response.
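To opt in, add the ignore_unsupported flag inside the options object:

```json
{
  "options": {
    "ignore_unsupported": true
  }
}
```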
A complete request example
This is what a complete request looks like when ignoring all unsupported options.
This is a massive example
In most cases you won't need all of these settings; a basic request is usually enough.
{
  "provider": "openai",
  "model": "whisper-1",
  "audio_data": {
    "content": "example.com/path/to/audiofile",
    "using": "url",
    "format": "mp3",
    "sample_rate_hertz": 44100
  },
  "options": {
    "automatic_punctuation": true,
    "language": {
      "auto_language": false,
      "language_code": "en-US"
    },
    "timestamps": {
      "enable_timestamps": true,
      "granularity": "word",
      "per_speaker": true
    },
    "speaker_diarization": {
      "enabled": true,
      "min_speaker_count": 2,
      "max_speaker_count": 4
    },
    "enable_confidence_scores": false,
    "enable_alternative_transcriptions": true,
    "vocabulary": ["Linux", "sonarspeak"],
    "profanity_filter": "remove",
    "ignore_unsupported": true
  }
}