API Voice Input Guidelines

Voice Input Guidelines for Analysis by Beyond Verbal’s REST API


Beyond Verbal’s REST API (API) supports both direct voice inputs, such as microphone, call and audio line-in, and the uploading of prerecorded files.

Our Emotions Analytics engine requires a minimum of 13 seconds of uninterrupted speech to produce a single analysis batch. The period needed to produce one such batch is referred to in this manual as a “Voice Section.” By employing Voice Activity Detection (VAD) methods, the engine algorithms can collect and assemble a Voice Section over a 20-30 second window of normal conversation (which typically contains alternating periods of speech and silence).
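To illustrate how a Voice Section can be assembled from ordinary conversation, here is a minimal sketch. It is not the engine's actual VAD, which is not documented here. It assumes a simple energy threshold stands in for speech detection, and the 20 ms frame length, the threshold value and the function names are all illustrative assumptions. Only the 13-second minimum and the 8 kHz, 16 bit mono format come from this guide.

```python
import math
import struct

SAMPLE_RATE = 8000          # Hz, the input format named in this guide
FRAME_MS = 20               # assumed frame length, not specified by the API
SECTION_SECONDS = 13.0      # minimum speech per Voice Section (from this guide)
ENERGY_THRESHOLD = 500.0    # assumed RMS level treated as speech; tune per setup


def frame_rms(frame: bytes) -> float:
    """Return the RMS amplitude of a frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def collect_voice_sections(frames):
    """Yield a list of voiced frames each time 13 s of speech has accumulated.

    Frames below the energy threshold (silence) are skipped, so a section
    may span a longer stretch of real conversation than 13 seconds.
    """
    needed = int(SECTION_SECONDS * 1000 / FRAME_MS)
    voiced = []
    for frame in frames:
        if frame_rms(frame) >= ENERGY_THRESHOLD:
            voiced.append(frame)
            if len(voiced) == needed:
                yield voiced
                voiced = []
```

With speech and silence alternating evenly, this sketch needs about 26 seconds of input to fill one section, which is in line with the 20-30 second window mentioned above.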


Analysis is performed during the upload process, ensuring a rapid response once enough voice data has been collected.

Minimal voice quality specifications and preferred input codec

  • The API requires WAV PCM 8 kHz, 16 bit Mono.
  • The API also accepts voice converted to WAV from other source formats, with varying degrees of performance degradation:
  • Conversion to WAV from a lossless codec such as FLAC will not degrade accuracy.
  • Conversion to WAV from a lossy codec such as AMR (commonly used in mobile phones) will result in varying degrees of acceptable performance degradation.
  • Inputs from mobile phones generally tend to produce higher-quality voice samples than inputs from laptops.


The following inputs are likely to yield sub-optimal results:

  • Conversion to WAV PCM from lossy codecs such as highly compressed (low bit rate) MP3, G729 or other over-compressed codecs. These degrade or destroy much of the vocal information that our Emotions Analytics algorithms need to run properly.
  • Saturated and/or clipped voice input files, which lose much of the data required for the algorithms to work properly.
  • Poorly recorded voice inputs containing voice deformities, excessive hiss and other vocal contaminants.

See the “Tips for recording quality files” section for more information.
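Before uploading a prerecorded file, it can help to confirm that it matches the required WAV PCM 8 kHz, 16 bit mono format. The sketch below uses Python's standard `wave` module. The function name and the wording of the messages are illustrative and are not part of the API.

```python
import wave

REQUIRED_RATE = 8000      # Hz
REQUIRED_WIDTH = 2        # bytes per sample, i.e. 16 bit
REQUIRED_CHANNELS = 1     # mono


def wav_format_problems(source):
    """Return a list of ways a WAV file deviates from 8 kHz, 16 bit, mono.

    `source` is a file path or a binary file-like object. An empty list
    means the header matches. `wave.open` raises `wave.Error` if the data
    is not a WAV file it can read.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != REQUIRED_RATE:
            problems.append(
                "sample rate is %d Hz, expected %d Hz"
                % (wav.getframerate(), REQUIRED_RATE)
            )
        if wav.getsampwidth() != REQUIRED_WIDTH:
            problems.append(
                "sample width is %d bit, expected %d bit"
                % (wav.getsampwidth() * 8, REQUIRED_WIDTH * 8)
            )
        if wav.getnchannels() != REQUIRED_CHANNELS:
            problems.append(
                "file has %d channels, expected %d"
                % (wav.getnchannels(), REQUIRED_CHANNELS)
            )
    return problems
```

This only inspects the file header. It cannot tell whether the audio was previously converted from a lossy codec, so the source of the recording still matters.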


Once sufficient voice data has been collected in the engine, the API immediately produces a mood and attitudes analysis, according to the issued license key. A new analysis batch set will appear once enough additional voice data has been gathered, and the analysis process continues in this way throughout the voice input session.


To obtain a section-based analysis, the app must first open a voice upload channel to the API and then request retrieval of an available analysis. See REST API Quick Guide for more information.



How do I record high-quality voice?

  • Although our algorithms are, to an extent, noise tolerant, the lower the background noise, the higher the accuracy of the emotional analysis result will be. Try recording in a moderately quiet room, with relatively low background noise (TV, other people speaking, fans, noisy external city sounds etc.). The same rule of thumb applies to pre-recorded voice inputs injected directly via the API.
  • The voice input should only consist of a single speaker. Having multiple speakers on the same voice input will skew the analysis.
  • Whenever possible, use your highest-quality recording equipment (a mobile phone’s microphone is good as well).
  • Position yourself about 15-50 cm (6-20 inches) from the microphone to avoid sound saturation and reduce white noise.
  • If recording on a PC, make sure that the recorder volume is set in the middle range of the scale.
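Saturation from sitting too close to the microphone, or from a recorder volume set too high, can be spotted in the samples themselves. The following is a rough sketch that reports the peak level and the share of samples at or near full scale for 16 bit mono PCM. The function name and the `clip_margin` parameter are assumptions for illustration, and the API does not publish any threshold for acceptable clipping.

```python
import struct

FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def level_report(pcm: bytes, clip_margin: int = 1):
    """Summarise peak level and clipping for 16-bit little-endian mono PCM.

    A sample counts as clipped when its magnitude is within `clip_margin`
    of full scale (an assumed heuristic, not an API rule).
    """
    count = len(pcm) // 2
    if count == 0:
        return {"peak_ratio": 0.0, "clipped_fraction": 0.0}
    samples = struct.unpack("<%dh" % count, pcm[: count * 2])
    limit = FULL_SCALE - clip_margin
    peak = max(abs(s) for s in samples)
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return {
        # Capped at 1.0 because -32768 is one step beyond +32767.
        "peak_ratio": min(peak / FULL_SCALE, 1.0),
        "clipped_fraction": clipped / count,
    }
```

A recording whose `clipped_fraction` is clearly above zero is a candidate for re-recording from further away or at a lower input volume. A very low `peak_ratio` suggests the opposite problem.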


What are some common problems and easy fixes?

  • Not enough voice to analyze. Please speak at length. 20 seconds can feel like a long time if you are speaking alone.
  • Recording in a car or similarly noisy environment (such as a plane, restaurant or train). Try again at home or in the office.
  • Sitting too close to, or too far from, the microphone. A good distance is about how far you would hold a magazine that you are reading.
  • Speaking too loudly or too softly. Speak as you would in a normal conversation: clearly, but with no need to be extremely loud.
  • More than one person is heard in the recording (this can also result from a noisy TV in the background). Wait for a quiet moment, and make sure your TV and radio are off.
  • Using halting speech. Imagine you are speaking to a friend.

How to elicit emotions: Question design guidelines (if relevant)

People enjoy speaking their minds; they just need a little encouragement to talk.

  • People tend to speak with a little more emotion when asked to begin talking about their opinions with “I believe that…”
  • Remind them to talk as if in a conversation with friends. Putting people in a social frame of mind helps their speech flow more fluently.
  • Emphasize that it’s fun and interactive.

If relevant – Don’t look at the screen while talking.