
Voice Settings

Voice can be complicated.

Voice Settings & Troubleshooting

Media support in browsers has evolved over time, and is now further complicated by material differences between desktop and mobile browsers.

Rest assured, Sapience takes care of 99% of the complexity for you and chooses sensible default settings based on your device and browser.

If you want to tweak the voice settings we use for dictation, you can find them under Menu > Settings & Customization > Advanced > Voice Settings.

💡

Most users should not touch these settings. For the audiophiles among you - go to town.

 
The Voice Settings window, found under Menu > Settings & Customization > Advanced > Voice Settings.

Voice Recording Details

Configure how audio is recorded and processed for voice transcription.


Transcription Model

Whisper: widely used in the industry, for example by Whisper Flow and any number of AI voice startups. Performs well in noisy environments, which is where most users are (keyboard typing, background air conditioning, etc.).

GPT Transcribe: supports a wider array of languages and makes fewer errors in a pristine audio environment, but is less tolerant of background noise.

Microphone Settings

Input Sample Rate

The frequency at which audio is captured from your microphone.

| Option | Description | Recommendation |
|---|---|---|
| 16 kHz | Optimized for speech recognition | Recommended - Whisper/GPT models are trained on 16 kHz audio |
| 44.1 kHz | CD-quality audio | Unnecessary for speech; larger files |
| 48 kHz | Professional audio standard | Unnecessary for speech; larger files |

Audio Channels

Whether to record in mono (single channel) or stereo (dual channel).

| Option | Description | Recommendation |
|---|---|---|
| Mono | Single audio channel | Recommended - Speech only needs one channel; smaller files |
| Stereo | Left and right channels | Only useful for music or spatial audio |
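
To put "larger files" into numbers, here is a rough sketch that estimates the raw (uncompressed) data rate for a given sample rate and channel count. It assumes 16-bit samples, which is a common default but is not something this article specifies, and `pcmBytesPerSecond` is a hypothetical helper name used only for illustration.

```typescript
// Rough estimate of the uncompressed PCM data rate.
// Assumption: 16-bit (2-byte) samples; the bit depth is not stated in this article.
function pcmBytesPerSecond(
  sampleRateHz: number,
  channels: number,
  bytesPerSample: number = 2,
): number {
  return sampleRateHz * channels * bytesPerSample;
}

const speech = pcmBytesPerSecond(16000, 1); // 16 kHz mono
const studio = pcmBytesPerSecond(48000, 2); // 48 kHz stereo

console.log(`16 kHz mono:   ${speech} bytes/s`);
console.log(`48 kHz stereo: ${studio} bytes/s (${studio / speech}x larger)`);
```

Under that assumption, 48 kHz stereo produces six times as much raw audio as 16 kHz mono, with no benefit for speech transcription.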

Echo Cancellation

Reduces echo from speakers being picked up by the microphone.

  • Enable if you're not using headphones and speakers might create feedback
  • Recommended: ON - The transcription API does not do this processing

Noise Suppression

Reduces background noise (fans, traffic, ambient sounds).

  • Enable for noisy environments
  • Recommended: ON - The transcription API does not do this processing

Auto Gain Control

Automatically adjusts microphone volume to maintain consistent levels.

  • Enable to normalize volume if you speak at varying distances from the mic
  • Recommended: ON - Helps ensure consistent audio levels
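
For readers curious how settings like these typically reach the browser, the sketch below builds a constraints object in the shape accepted by the standard `navigator.mediaDevices.getUserMedia()` call. Whether Sapience uses exactly this call is an assumption; the `MicSettings` interface and `buildAudioConstraints` helper are hypothetical names. The field names themselves (`sampleRate`, `channelCount`, `echoCancellation`, `noiseSuppression`, `autoGainControl`) are standard media track constraints.

```typescript
// Local type mirroring the standard audio track constraint fields used here,
// so the sketch compiles without DOM typings.
interface MicSettings {
  sampleRate: number;        // Hz
  channelCount: number;      // 1 = mono, 2 = stereo
  echoCancellation: boolean;
  noiseSuppression: boolean;
  autoGainControl: boolean;
}

// The recommended microphone values from this article.
const recommendedMic: MicSettings = {
  sampleRate: 16000,
  channelCount: 1,
  echoCancellation: true,
  noiseSuppression: true,
  autoGainControl: true,
};

// Hypothetical helper: wraps a copy of the settings in the shape getUserMedia() accepts.
function buildAudioConstraints(mic: MicSettings): { audio: MicSettings; video: boolean } {
  return { audio: { ...mic }, video: false };
}

const constraints = buildAudioConstraints(recommendedMic);
console.log(JSON.stringify(constraints, null, 2));

// In a browser (not run here):
// const stream = await navigator.mediaDevices.getUserMedia(constraints);
// const applied = stream.getAudioTracks()[0].getSettings();
```

Browsers treat plain values like these as preferences rather than guarantees, so checking `getSettings()` on the resulting track is the reliable way to see what was actually applied on a given device.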

Recording Output

Output Format

The audio format used for the recorded file.

| Format | Codec | File Size | API Compatibility | Recommendation |
|---|---|---|---|---|
| MP3 | MPEG Layer 3 | Small | Best | Recommended - Most reliable with gpt-4o-transcribe |
| WebM | Opus | Smallest | Unreliable | May cause "invalid format" errors with newer models |
| WAV | PCM (uncompressed) | Large | Good | Works but creates unnecessarily large files |

Why MP3? OpenAI's gpt-4o-transcribe model has stricter format requirements than whisper-1. WebM/Opus files frequently cause "invalid file format" errors. MP3 provides the best balance of compatibility and file size.

Output Sample Rate

The sample rate of the recorded audio file.

  • Should match Input Sample Rate for best quality
  • Recommended: 16 kHz - Matches what transcription models expect

Buffer Size

Size of the audio processing buffer. Affects latency vs. stability.

| Option | Trade-off |
|---|---|
| 4096 | Lower latency, may drop audio on slower devices |
| 8192 | Balanced |
| 16384 | Recommended - Most stable, slight latency increase |
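
As a back-of-the-envelope illustration of the latency trade-off, the sketch below converts a buffer size into the amount of audio it holds. It assumes the buffer size is counted in sample frames, as is usual for Web Audio buffers; this article does not state the unit, and `bufferDurationMs` is a hypothetical helper.

```typescript
// Assumption: the buffer size is a number of sample frames.
// Duration of one buffer = frames / sample rate.
function bufferDurationMs(bufferSizeFrames: number, sampleRateHz: number): number {
  return (bufferSizeFrames * 1000) / sampleRateHz;
}

for (const size of [4096, 8192, 16384]) {
  console.log(`${size} frames at 16 kHz = ${bufferDurationMs(size, 16000)} ms per buffer`);
}
```

Under that assumption, each buffer at 16 kHz holds 256 ms, 512 ms, and 1024 ms of audio respectively. A larger buffer gives a slow device more time to process each chunk, at the cost of waiting longer for each one to fill.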

Output Bitrate

Overall bitrate for the encoded audio.

| Option | Quality | File Size |
|---|---|---|
| 64 kbps | Lower | Smallest |
| 96 kbps | Good | Small |
| 128 kbps | Recommended | Balanced |
| 192 kbps | High | Larger |
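
To see what these bitrates mean for file size, here is a small estimate. It assumes a constant bitrate, takes 1 kbps as 1,000 bits per second, and ignores container overhead, so real files will differ slightly. `encodedSizeBytes` is a hypothetical helper name.

```typescript
// Approximate size of an encoded recording at a constant bitrate.
// Assumptions: 1 kbps = 1000 bits per second; container overhead is ignored.
function encodedSizeBytes(bitrateKbps: number, durationSeconds: number): number {
  return (bitrateKbps * 1000 * durationSeconds) / 8; // 8 bits per byte
}

for (const kbps of [64, 96, 128, 192]) {
  const kilobytes = encodedSizeBytes(kbps, 60) / 1000;
  console.log(`${kbps} kbps for 60 s = about ${kilobytes} kB`);
}
```

A one-minute dictation at the recommended 128 kbps comes to roughly 960 kB, compared with about 480 kB at 64 kbps and 1,440 kB at 192 kbps.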

Encoder Bitrate

Internal encoder bitrate (primarily affects WebM/Opus encoding).

  • Recommended: 96 kbps - Good quality for speech

MP3 Encoding

MP3 Encoder Bitrate

Bitrate used when converting audio to MP3 format.

| Option | Use Case |
|---|---|
| 64 kbps | Minimize file size, acceptable quality |
| 96 kbps | Good balance |
| 128 kbps | Recommended - Clear speech quality |
| 192 kbps | High quality, larger files |

Transcription

Convert to MP3 Before Sending

Automatically converts audio to MP3 before sending to the transcription API.

  • Enable if using WebM or WAV output format and experiencing API errors
  • Not needed if Output Format is already set to MP3
  • Adds slight processing time but ensures API compatibility
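
The interaction between Output Format and this toggle can be sketched as a small decision function. This is an illustration of the rules described above, not Sapience's actual code; the `OutputFormat` type and both function names are hypothetical.

```typescript
type OutputFormat = "mp3" | "webm" | "wav";

// Conversion only does useful work when the toggle is on and the recording is not already MP3.
function shouldConvertToMp3(format: OutputFormat, convertEnabled: boolean): boolean {
  return convertEnabled && format !== "mp3";
}

// The format that would reach the transcription API under these rules.
function uploadedFormat(format: OutputFormat, convertEnabled: boolean): OutputFormat {
  return shouldConvertToMp3(format, convertEnabled) ? "mp3" : format;
}

console.log(uploadedFormat("webm", true));  // converted before sending
console.log(uploadedFormat("webm", false)); // sent as recorded
console.log(uploadedFormat("mp3", true));   // already MP3, nothing to convert
```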

Recommended Configuration

For the most reliable voice transcription experience:

Microphone:
  Input Sample Rate: 16 kHz
  Channels: Mono
  Echo Cancellation: ON
  Noise Suppression: ON
  Auto Gain Control: ON

Recording Output:
  Output Format: MP3
  Output Sample Rate: 16 kHz
  Buffer Size: 16384
  Output Bitrate: 128 kbps
  Encoder Bitrate: 96 kbps

MP3 Encoding:
  MP3 Encoder Bitrate: 128 kbps

Transcription:
  Convert to MP3 Before Sending: OFF (not needed when format is MP3)
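
If you like to think of settings as data, the sketch below expresses part of the recommended configuration as a typed object and checks a configuration against two rules from this article: the output sample rate should match the input, and WebM should be paired with MP3 conversion. The `VoiceConfig` shape and `checkConfig` function are hypothetical and do not reflect how Sapience stores its settings.

```typescript
interface VoiceConfig {
  inputSampleRateHz: number;
  outputSampleRateHz: number;
  outputFormat: "mp3" | "webm" | "wav";
  convertToMp3: boolean;
}

// Recommended values from this article.
const recommended: VoiceConfig = {
  inputSampleRateHz: 16000,
  outputSampleRateHz: 16000,
  outputFormat: "mp3",
  convertToMp3: false,
};

// Returns human-readable warnings for settings that conflict with the guidance above.
function checkConfig(cfg: VoiceConfig): string[] {
  const warnings: string[] = [];
  if (cfg.outputSampleRateHz !== cfg.inputSampleRateHz) {
    warnings.push("Output Sample Rate should match Input Sample Rate");
  }
  if (cfg.outputFormat === "webm" && !cfg.convertToMp3) {
    warnings.push('WebM may cause "invalid format" errors; use MP3 or enable conversion');
  }
  return warnings;
}

console.log(checkConfig(recommended));
console.log(checkConfig({ ...recommended, outputFormat: "webm", outputSampleRateHz: 48000 }));
```

The recommended configuration produces no warnings, while the second call reports both the sample rate mismatch and the WebM risk.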

Troubleshooting

"Invalid file format" errors

  • Change Output Format to MP3
  • Or enable Convert to MP3 Before Sending

Audio sounds choppy or has gaps

  • Increase Buffer Size to 16384
  • Close other browser tabs using the microphone

Transcription misses words or is inaccurate

  • Enable Noise Suppression to reduce background noise
  • Ensure Input Sample Rate is 16 kHz
  • Speak clearly and at a consistent distance from the microphone

Recording fails to start

  • Check browser permissions for microphone access
  • Ensure no other application is using the microphone
  • Try refreshing the page
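
For the technically inclined, the checks above map roughly onto the error names that the browser's `getUserMedia()` call can report. The sketch below is a hypothetical helper, not part of Sapience; `NotAllowedError` and `NotReadableError` are standard error names, and the mapping to each hint is a reasonable guess rather than an exact diagnosis.

```typescript
// Maps a getUserMedia() error name to the matching troubleshooting step above.
function micErrorHint(errorName: string): string {
  switch (errorName) {
    case "NotAllowedError": // permission denied by the user or the browser
      return "Check browser permissions for microphone access";
    case "NotReadableError": // often means the device is busy or unavailable
      return "Ensure no other application is using the microphone";
    default:
      return "Try refreshing the page";
  }
}

console.log(micErrorHint("NotAllowedError"));

// In a browser (not run here):
// try {
//   await navigator.mediaDevices.getUserMedia({ audio: true });
// } catch (err) {
//   console.warn(micErrorHint((err as DOMException).name));
// }
```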