Chinese Space-Age Tech Outfit Embarks on Global Shopping Spree

By Bloomberg News

September 9, 2016 — 9:33 AM IDT

KuangChi, the Chinese technology company hoping to send tourists into space and develop flying jet-packs, could soon go on a $600 million investment spree.

KuangChi Science Ltd., whose other projects include the Wearable Spiritual Armour exoskeleton, launched a $300 million fund in May to back startups working on cutting-edge computing, space-faring technology and communications.

Virtual Reality Classrooms Another Way Chinese Kids Gain an Edge

Bloomberg Technology
August 10, 2016 — 12:00 AM IDT
By David Ramli

Genders of virtual teachers can change to suit cultural norms of the classroom.

Deep within a building shaped like the Starship Enterprise a little-known Chinese company is working on the future of education. Vast banks of servers record children at work and play, tracking touchscreen swipes, shrugs and head swivels – amassing a database that will be used to build intimate profiles of millions of kids.

This is the Fuzhou hive of NetDragon Websoft Holdings Ltd., a hack-and-slash videogame maker and unlikely candidate to transform learning via headset-mounted virtual reality teachers. It's one of a growing number of companies, from International Business Machines Corp. to Lenovo Group Ltd., studying how to use technology like VR to arrest a fickle child's attention. (And perhaps someday to make a mint from that data by showing them ads.)

China – where parents have been known to try anything to give their kids an edge and tend to be less obsessive about privacy – may be an ideal testing ground for the VR classroom of the future. As it’s envisioned, there’ll be no napping in the back row. Lessons change when software predicts a student’s mind is wandering by spotting an upward tilt of the head. Dull lectures can be immediately livened up with pop quizzes. Even the instructor’s gender can change to suit the audience, such as making the virtual educator male in cultures where teachers are typically men.

“It is the next big thing and it’s been brewing for quite some time,” said Jan-Martin Lowendahl, a research vice president with Gartner Inc. “If there’s any place it would work, it’s China, Korea, those kinds of places.”
“It’s hugely revolutionary and it’s also necessary because it’s obvious that the current educational models do not scale.”

The notion of adaptive, computer-based teaching has bounced around for more than a decade. Done right, it’s got the potential to fundamentally alter learning. Educators who’ve relied on their gut and visual cues could be replaced or augmented by digital avatars powered by algorithms, which can in turn be replicated across the planet. Advocates argue that the benefits of using machines to scrutinize children and learning to adapt to their foibles will outweigh questions of privacy because soon there won’t be enough human teachers.

“There’s no way we can deal with it without adding scalable learning technologies,” Lowendahl said.

Of course, the growing corporate involvement isn’t altruistic — there’s money to be made, and by some accounts Chinese companies are taking the lead in commercialization. NetDragon wants to become among the first to put it in practice on a larger scale. It paid 77 million pounds ($100 million) for British online education provider Promethean World Plc last year and now serves 2.2 million teachers with 40 million pupils. It’s field-testing VR lessons, handing out headsets and tablets in Chinese schools and encouraging teachers to try out tailored curricula on their kids.
Photo: A student tries out NetDragon's 101 VR Immersion Classroom headset at the China Science and Technology Museum in Beijing.

Researchers then track pupils’ activity within the VR environment; as a complement to that, tablets come with cameras that can be used to visually monitor students.

“Not only do we want to track it when they’re in the classroom, we want to track it when they’re on the go, when they’re mobile or when they’re at home so we can have a 360-view of how kids learn,” NetDragon vice chairman and former Microsoft executive Simon Leung said, adding that the technology might be ready by 2017. “Once we can monitor their likes and dislikes, for example, you can recommend different services to them, very targeted advertising to them.”

All the companies interviewed emphasized that data would only be collected with the explicit permission of legal guardians. But it remains unclear how parents will take to having computers interpret their child’s every move. And each school will need to be sold on the benefits of a technology that could eventually supplant them.

Taking a step further into the realm of science fiction, a person's behavior in digital environments offers clues as to their ability to learn, their creativity and even afflictions like Alzheimer's disease, said Andrea Stevenson Won, an associate professor who studies humans in virtual reality environments at Cornell University.

“I really don’t, as a citizen, love the idea of someone being able to track aspects of my behavior and make predictions about a medical condition I might have, and also be able to advertise directly to me,” she said.

Useful digital teachers can only be built by feeding the data of millions into computers that then find patterns and turn them into action. Other outfits like Massachusetts Institute of Technology spin-off Affectiva Inc. and Israeli start-up Beyond Verbal Communication aim to learn what users are thinking based on facial expressions and vocal patterns. Apart from NetDragon, Lenovo unit Stoneware is installing tracking technology in classrooms.

But while companies in the West have built the technology, they’re wary of using it on children. Chinese businesses are less so, which is why some are taking the lead in monetizing it through education. Henry Lau, an associate professor at the University of Hong Kong who helps run a VR lab called the “imseCAVE,” sticks (adult) subjects in an interactive simulation box and then studies them while they drive virtual cranes or subdue criminals with mace.

“The American focus is definitely on technology initially whereas a lot of Chinese companies tend to make use of cost-effective platforms to develop the content,” said Lau, who adds that coders are cheaper in China. “We’re working with companies in Beijing that are developing and booming a lot in what we call e-learning software.”

Chalapathy Neti is vice president of education innovation on IBM's Watson team, which is building profiles of students around the world. Asian parents have proven to be more willing than those in the West to let their children take part, he said. And the researcher says the boost in learning created by combining VR and interactivity could be incredible.

“In a virtual reality lesson if a student pauses and stares at stars in the sky and he’s looking at one in particular, I’ll know there’s interest in that particular field,” he said. “When my children went to the zoo they weren’t looking at the animals but were focused on the engine of the train we were in – these guys now are in the automotive industry.”


Emotions Analytics API V3

Beyond Verbal's voice-driven emotions analytics API is ready to support your boldest ideas.

Must-reads – just to get you started.

  • Voice Input Guidelines: Gives you all the information you need to record good audio which can be analyzed by our engines.
  • Quick Integration Guide: Contains all the technical guidance you need to start using our API.
  • Output Definition: Getting the data is one thing, it is another to understand it. Read our output definitions document to make sense of all our emotional outputs – it’s the sensible thing to do 😉

Additional (and useful) reads:

Sign up for your free API trial now.

What are our Emotional Features – a detailed guide

Guide to Moods and Attitudes

Understanding Beyond Verbal’s engine emotional definitions

Chapter 1 – Introduction

The Emotions Analytics engine of Beyond Verbal takes raw voice input and analyzes it for mood and attitude. Exactly what is provided depends on your license key.

  • The Emotions Analytics engine measures the speaker’s current mood. It requires at least 13 seconds of continuous voice to render an emotional analysis. For more information please refer to our recording guidelines.
  • The outputs are divided into mood and attitude outputs, as listed below in this document.

Chapter 2 – Attitude Outputs

The Emotions Analytics engine measures the speaker's emotional state during the analyzed voice section through three separate attitude outputs: Temper, Valence, and Arousal. All are measured on a scale of 0 to 100.

2.1         Temper

Temper reflects a speaker's temperament or emotional state, ranging from gloomy or depressive at the low end, through embracive and friendly in the mid-range, to confrontational or aggressive at the high end of the scale.

The temper output is divided into two distinct measurements:

  • Continuous Scale ranging from 0 to 100, representing a temperament shift from depressive at the low end to aggressive at the high end.
  • Temper groups which consist of three distinct groups: Low, Med, High
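As a sketch of how a client might reproduce this grouping locally, the following buckets a continuous Temper value into the three groups. The 34/67 cut-offs are illustrative assumptions only; the API returns the authoritative group alongside each value:

```python
def temper_group(value: float) -> str:
    """Bucket a continuous 0-100 Temper value into Low/Med/High.

    The 34/67 thresholds are illustrative assumptions; the engine
    returns the authoritative "Group" field with each result.
    """
    if not 0 <= value <= 100:
        raise ValueError("Temper value must be in the 0-100 range")
    if value < 34:
        return "Low"
    if value < 67:
        return "Med"
    return "High"

print(temper_group(81.94))  # High
```

In practice, prefer the Group field returned by the API; a local bucketing like this is only useful for offline experimentation with stored Value numbers.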

2.1.1      High Temper

High temper occurs when the speaker experiences and expresses aggressive emotions, such as active resistance, anger, hatred, hostility, aggressiveness, forceful command and/or arrogance.

Aggressive emotions may have different levels of intensity and may even be combined, to an extent, with embracive feelings (but not depressive feelings). The Temper scale presents this ambiguity as a score that increases the more intense and "pure" the aggressive emotions are.

2.1.2      Medium Temper

Medium temper occurs when the speaker experiences and expresses one of the following:

  • Embracive “positive” emotions, communicated in a warm and friendly manner, such as positivity, empathy, acceptance, friendliness, closeness, kindness, affection, love, calmness, and motivation.
  • Self-controlled “neutral” emotions communicated in a “matter-of-fact” intonation.
  • Speech in which no significant emotions are evident in the speaker's voice.

As medium emotions populate the middle of the spectrum, they may contain elements of depressive or aggressive emotions (but not both). The Temper score represents this ambiguity as the score shifts from the mid-range toward one of the ends.

2.1.3      Low Temper

Low temper occurs when the speaker experiences and expresses depressive emotions in an inhibited fashion, such as sadness, pain, suffering, insult, inferiority, self-blame, self-criticism, regret, fear, anxiety and concern (which can also be interpreted as fatigue). It is as though the speaker is waning, growing smaller or pulling back.

Depressive emotions can be expressed at different levels of intensity and may even be combined, to an extent, with embracive feelings (but not aggressive ones).

2.2         Valence


Valence is an output that measures the speaker's level of negativity / positivity.

The Valence output is divided into two distinct measurements:

  • Continuous Scale ranging from 0 to 100, representing a valence shift from a negative attitude at the lower part of the scale to a positive attitude at the higher part of the same scale.
  • Valence groups which consist of three distinct groups: Negative, Neutral and Positive.


There are three possible and distinct Valence groups:

  • Negative Valence. The speaker’s voice conveys emotional pain and weakness or aggressive and antagonistic emotions.
  • Neutral Valence. The speaker's voice conveys no preference and comes across as self-controlled or neutral.
  • Positive Valence. The speaker’s voice conveys affection, love, acceptance and openness.


2.3         Arousal

Arousal is an output that measures a speaker’s degree of energy ranging from tranquil, bored or sleepy to excited and highly energetic. Arousal can also correspond to similar concepts such as involvement and stimulation.


The Arousal output is divided into two distinct measurements:

  • Continuous Scale ranging from 0 to 100, representing a shift from tranquil at the lower part of the scale to excited at the higher part of the same scale.
  • Arousal groups which consist of three distinct groups: Low, Mid and High.

There are three possible and distinct Arousal groups:

  • Low Arousal conveys a low level of alertness and can be registered in cases of sadness, comfort, relief or sleepiness.
  • Mid Arousal conveys a medium level of alertness and can be registered in cases of normal conduct, indifference or self-control.
  • High Arousal conveys a high level of alertness, such as excitement, surprise, passionate communication, extreme happiness or anger.


Chapter 3 – Mood Groups

Mood groups are an indicator of a speaker’s emotional state during the analyzed voice section.

There are 432 combined emotions, which are grouped into eleven main mood groups. Mood groups are distinct outputs and are not measured on a scale.


3.1         Aggressive / Confrontational Mood Groups

  • Supremacy and Arrogance. This group is typified by feelings of power, superiority, ascendancy, self-importance or self-entitlement. The feelings can range from a feeling of superiority to a tendency to assert control when dealing with others.
  • Hostility and Anger. This group has negative emotions of antagonism, enmity or unfriendliness that can be directed against individuals, entities, objects or ideas. The feelings can range from aversion and offensiveness to open aggressiveness and incitement.
  • Criticism and Cynicism. This group is typified by a feeling of general distrust or skepticism. The feelings can also be described as scornful and jaded negativity.

3.2         Self-Control Mood Group

  • Self-control and practicality. This group is typified by feelings of controlled emotions, behaviors and desires. The feelings can range from self-restraint to irrelevance.

3.3         Embracive Mood Groups

  • Leadership and Charisma. This group is typified by feelings of power, vision and motivation. The feelings can range from protectiveness to the communication of ideas or ideology with an undertone of motivation.
  • Creativeness and Passion. This group is typified by a feeling of eagerness and/or desire. The feelings can range across desire, want and craving, with an undertone of action to achieve goals. These emotions are highly correlated with vivid imagination, hopes and dreams.
  • Friendliness and Warmth. This group is typified by positive feelings and pleasant accommodation. The feelings include approval, empathy and hospitality. The group can also include feelings of being approved of or wanted by others ("being part of a team") as well as being receptive to another person, idea or item.
  • Love and Happiness. This group is typified by long-term happiness, affiliation and pleasurable sensation. The group also includes feelings of strong affection for another person, idea or item, including affection arising out of kinship or personal ties.

3.4         Depressive / Gloomy Mood Groups

  • Loneliness and Unfulfillment. This group is typified by feelings of inadequacy, lack of worth, disappointment or failure.
  • Sadness and Sorrow. This group is typified by emotional pain such as unhappiness, self-pity and powerlessness.
  • Defensiveness and Anxiety. This group is typified by negative emotions of fear, worry and uneasiness. The group also includes low self-esteem and can also often be accompanied by inner turmoil and restlessness.

API Analysis Result Interpretation

Analysis Result Interpretation Guide

The UPSTREAM and ANALYSIS requests return a JSON object containing the analysis result.
The annotated example below summarizes the fields and values of the returned JSON object.

{
  "status": "success",              // Request status: "success" or "error"
  "result": {                       // Analysis results object
    "duration": 20477.25,           // Duration of voice data processed, in milliseconds
    "sessionStatus": "Processing",  // "Started" – no analysis data produced yet
                                    // "Processing" – intermediate results, more analysis can be expected
                                    // "Done" – session has ended; result covers the whole session
    "analysisSegments": [           // Array of analysis segments
      {
        "offset": 16,               // Segment offset from the beginning of the sample, in milliseconds
        "duration": 14980,          // Segment duration within the sample, in milliseconds
        "analysis": {               // Analysis values for the segment.
                                    // The content here is an example; the real fields vary by license type.
          "Temper": {
            "Value": 81.94444444,   // Temper value
            "Group": "high",        // Temper group
            "Summary": {            // Summary object with the accumulated Temper value
              "Mode": "high"
            }
          },
          "Valence": {              // Similar to the Temper object
            "Value": 38.19095477,
            "Group": "neutral",
            "Summary": { "Mode": "neutral" }
          },
          "Arousal": {              // Similar to the Temper object
            "Value": 93.57235142,
            "Group": "high",
            "Summary": { "Mode": "high" }
          },
          "AudioQuality": {         // Audio quality object
            "Value": 100,
            "Group": "good",
            "Summary": { "Mode": "good" }
          },
          "Mood": {                 // Mood object, contains Mood Group objects
            "Group11": {
              "Primary":   { "Id": 2, "Phrase": "Criticism, Cynicism" },
              "Secondary": { "Id": 3, "Phrase": "Defensiveness, Anxiety" }
            },
            "Composite": {          // Composite mood object
              "Primary":   { "Id": 135, "Phrase": "Extrovert and pressured. Potential for outburst." },
              "Secondary": { "Id": 204, "Phrase": "A loud and emotional state." }
            }
          }
        }
      }
    ]
  }
}
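As an example of consuming this structure, here is a minimal Python sketch (assuming the JSON shape above) that pulls out per-segment Temper readings:

```python
import json

def temper_by_segment(response_text):
    """Return (offset, Temper value, Temper group) for each analysis segment."""
    data = json.loads(response_text)
    if data.get("status") != "success":
        raise RuntimeError("analysis request failed")
    segments = data["result"].get("analysisSegments", [])
    return [(seg["offset"],
             seg["analysis"]["Temper"]["Value"],
             seg["analysis"]["Temper"]["Group"]) for seg in segments]

# Abbreviated stand-in for a real API response body:
sample = '''{"status": "success",
             "result": {"duration": 20477.25, "sessionStatus": "Done",
                        "analysisSegments": [{"offset": 16, "duration": 14980,
                            "analysis": {"Temper": {"Value": 81.94, "Group": "high"}}}]}}'''
print(temper_by_segment(sample))  # [(16, 81.94, 'high')]
```

Remember that the fields present under "analysis" depend on your license type, so real code should guard against missing objects.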


Table of Mood Phrases

Getting a Table of Mood Phrases in your language

The Moods table provides a mapping between a mood Id (returned in the Moods section of the analysis result) and the text of the corresponding Mood phrase.

Example of mood section with Ids

"Mood": {
  "Group21": {
    "Primary":   { "Id": 1 },   // will be mapped to "Creative, Passionate"
    "Secondary": { "Id": 7 }    // will be mapped to "Loneliness, Unfulfillment"
  }
}

There is no need to fetch this table each time a particular phrase is required. To reduce network traffic and CPU load, the application can pre-load the table into memory and consult it whenever a phrase text is needed.

Note that not all languages are supported yet. Please contact Beyond Verbal for information on how to add support for the language of your interest.

Moods Request

GET URL:{groupName}/{language}

Moods Request Parameters


Name         Location (Body \ URL \ Header)   Optional   Explanation
Group name   URL                              No         Specifies the Mood type for which the table is requested. Supported values:
language     URL                              Yes        Requested language, according to the ISO-639 and ISO-3166 standards. Default: en-us. Alternatively, you can set the required language in the standard HTTP Accept-Language header.
Auth token   Authorization Header             No         See "Authentication".




Authorization: Bearer 21G2BA4iZJavSJQbsyuppWmfSMLgLn-**gDTCfguhzGa_k8

OK (200) Response:


[
  {"Id": 1, "Phrase": "Creative, Passionate"},
  {"Id": 2, "Phrase": "Criticism, Cynicism"},
  {"Id": 3, "Phrase": "Defensiveness, Anxiety"},
  {"Id": 4, "Phrase": "Friendly, Warm"},
  {"Id": 5, "Phrase": "Hostility, Anger"},
  {"Id": 6, "Phrase": "Leadership, Charisma"},
  {"Id": 7, "Phrase": "Loneliness, Unfulfillment"},
  {"Id": 8, "Phrase": "Love, Happiness"}
]
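Since the table is static for a given language, a client can fetch it once and keep an in-memory Id-to-Phrase map, as the caching advice above suggests. A minimal sketch, where the JSON string stands in for the HTTP response body:

```python
import json

def build_phrase_map(moods_json):
    """Pre-load the moods table into a dict for fast Id -> Phrase lookups."""
    return {entry["Id"]: entry["Phrase"] for entry in json.loads(moods_json)}

# Abbreviated stand-in for the OK (200) response body:
moods_response = ('[{"Id": 1, "Phrase": "Creative, Passionate"}, '
                  '{"Id": 7, "Phrase": "Loneliness, Unfulfillment"}]')
phrases = build_phrase_map(moods_response)
print(phrases[7])  # Loneliness, Unfulfillment
```

The same map can then translate every Id found in the Moods section of analysis results without further network round-trips.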


API Voice Input Guidelines

Voice Input Guidelines for Analysis by Beyond Verbal’s REST API


Beyond Verbal's REST API (API) supports both direct voice inputs – such as microphone, call and audio line-in – and the uploading of prerecorded files.

Our Emotions Analytics engine requires a minimum of 13 seconds of uninterrupted speech to produce a single analysis batch. The period needed to produce one such batch is referred to in this manual as a "Voice Section." By employing Voice Activity Detection (VAD) methods, the engine algorithms can collect and assemble a Voice Section over a 20-30 second window of normal conversation (which typically contains alternating periods of speech and silence).


Analysis is performed concurrently with the upload process, ensuring a rapid response once enough voice data has been collected.

Minimal voice quality specifications and preferred input codec

  • The API requires WAV PCM 8 KHz, 16 bit Mono.
  • The API also accepts voice converted from other source formats, with varying degrees of acceptable performance degradation:
  • Conversion to WAV from a lossless codec such as FLAC will not degrade accuracy.
  • Conversion to WAV from a lossy codec such as AMR (commonly used in mobile phones) will result in varying degrees of acceptable performance degradation.
  • Inputs from mobile phones generally tend to produce higher-quality voice samples than laptops.

The following inputs are likely to yield sub-optimal results:

  • Conversion to WAV PCM from highly compressed (low bit rate) lossy codecs such as MP3, G729 or other over-compressed codecs, due to high degradation or destruction of the vocal information required by our Emotions Analytics algorithms.
  • Saturated and/or clipped voice input files that lose much of the data required for the algorithms to work properly.
  • Poorly recorded voice inputs containing voice deformities, excessive hiss and other vocal contaminants.
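Before uploading, a client can check a file against the required WAV PCM 8 KHz / 16-bit / mono format locally. A sketch using Python's standard wave module (the helper function is ours, not part of the API):

```python
import os
import tempfile
import wave

def check_wav_format(path):
    """Return a list of problems relative to the required WAV PCM
    8 KHz, 16-bit, mono format; an empty list means the file conforms."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append("expected mono, got %d channels" % w.getnchannels())
        if w.getframerate() != 8000:
            problems.append("expected 8000 Hz, got %d Hz" % w.getframerate())
        if w.getsampwidth() != 2:
            problems.append("expected 16-bit samples, got %d-bit" % (w.getsampwidth() * 8))
    return problems

# Demo: write a one-second conforming file and validate it.
demo = os.path.join(tempfile.mkdtemp(), "probe.wav")
with wave.open(demo, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit
    w.setframerate(8000)    # 8 KHz
    w.writeframes(b"\x00\x00" * 8000)
print(check_wav_format(demo))  # []
```

Running this check before the UPSTREAM request lets an application refuse or re-encode non-conforming files instead of wasting an upload.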

See the "How do I record high quality voice?" section below for more information.


Once sufficient voice data has been collected in the engine, the API immediately produces a mood and attitudes analysis, according to the issued license key. A new analysis batch set will appear once enough additional voice data has been gathered, and the analysis process continues in this way throughout the voice input session.


To obtain a section-based analysis, the app must first open a voice upload channel to the API and then request retrieval of an available analysis. See REST API Quick Guide for more information.



How do I record high quality voice?

  • Although our algorithms are – to an extent – noise tolerant, the lower the background noise, the higher the accuracy of the emotional analysis will be. Try recording in a moderately quiet room with relatively low background noise (TV, other people speaking, fans, noisy city sounds outside, etc.). The same rule of thumb applies to pre-recorded voice inputs injected directly via the API.
  • The voice input should contain only a single speaker. Having multiple speakers on the same voice input will skew the analysis.
  • Whenever possible use your highest-quality recording equipment (a mobile phone's microphone works well too).
  • Position yourself about 15-50 cm (6 – 20 inches) from the microphone to avoid sound saturation and reduce white noise.
  • If recording on a PC, make sure that the recorder volume is set in the middle range of the scale.


What are some common problems and easy fixes?

  • Not enough voice to analyze. Please speak at length. 20 seconds can feel like a long time if you are speaking alone.
  • Recording in a car or a similarly noisy environment (planes, restaurants, trains). Try again at home or at the office.
  • Sitting too close to, or too far from, the microphone. A good distance is about how far you would hold a magazine you are reading.
  • Speaking too loud or too soft. Speak as you would in a normal conversation: clearly, but with no need to be extremely loud.
  • More than one person is heard in the recording (this can also result from a noisy TV in the background). Wait for a quiet moment, and make sure your TV and radio are off.
  • Using halting speech. Imagine you are speaking to a friend.

How to elicit emotions: Question design guidelines (if relevant)

People enjoy speaking their minds; they just need a little encouragement to talk.

  • People tend to speak with a little more emotion when asked to begin talking about their opinions with “I believe that…”
  • Remind them to talk as if in a conversation with friends. Putting people in a social frame of mind helps their speech flow more fluently.
  • Emphasize that it’s fun and interactive.

If relevant, remind users not to look at the screen while talking.

API Metadata Guide

Using Metadata with Beyond Verbal REST API

The metadata field of the START request allows you to attach arbitrary information to each analysis session.

The most common use of the metadata field is to uniquely identify a particular user, device or group of users for later aggregated analysis.

Examples of unique identifiers: email address, mobile phone number, physical (manufacturer) device ID, Facebook ID, Twitter ID.

Set the clientId field value to your unique identifier. Example:


Authorization: Bearer 21G2BA4iZJavSJQbsyuppWmfSMLgLn-**gDTCfguhzGa_k8

"dataFormat": {"type": "WAV"},
"metadata": {"clientId": "+991199483679"}

Optionally, you can also set this identifier as an additional field of the metadata object in order to specify its origin:

email       Client email
phone       Client phone (mobile)
deviceId    Physical device ID
facebookId  Facebook ID
twitterId   Twitter ID

Example where clientId is a phone number:


Authorization: Bearer 21G2BA4iZJavSJQbsyuppWmfSMLgLn-**gDTCfguhzGa_k8

  "dataFormat": {"type": "WAV"},
  "metadata": {"clientId": "+991199483679", "phone": "+991199483679"}
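A body like the one above can also be built programmatically. This sketch (our helper, not part of the API) sets clientId and mirrors it into an origin field:

```python
import json

def start_body(client_id, origin_field=None):
    """Build the JSON body for a START request with client metadata.

    origin_field optionally names the identifier's origin, e.g.
    "phone", "email" or "deviceId", mirroring the clientId value.
    """
    metadata = {"clientId": client_id}
    if origin_field is not None:
        metadata[origin_field] = client_id
    return json.dumps({"dataFormat": {"type": "WAV"}, "metadata": metadata})

print(start_body("+991199483679", "phone"))
```

The phone number here is the placeholder value used in the examples above, not a real identifier.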

Example where the clientId is email:


Authorization: Bearer 21G2BA4iZJavSJQbsyuppWmfSMLgLn-**gDTCfguhzGa_k8

    "dataFormat": {"type": "WAV"},
    "metadata": {
        "clientId": "",
        "email": ""
    }

API Quick Integration Guide

Getting started with Beyond Verbal’s REST API


This Quick Reference Guide is designed to help you get up and running with Beyond Verbal's Emotions Analytics REST API (API).

Using the API you will:

  1. Send voice samples to Beyond Verbal's cloud-based Emotions Analytics engine for analysis.
  2. Receive the analysis back from the API.

To gain a better understanding of the data received back from the API, please take a moment to browse our Analysis Result Interpretation guide.
To better understand how to create and record good-quality, emotionally effective voice inputs, please refer to our Voice Input Guidelines for Analysis by Beyond Verbal REST API guide.


Each of the callouts below represents a separate step in the process of working with our API. These steps are described in further detail, each in a separate, dedicated chapter. Feel free to browse the guide in full or jump directly to the chapters that interest you most.



Use this request to acquire an authentication token using your API key. This token must be sent with each subsequent request to the analysis server.

The Authentication Token must be sent as an Authorization field in the HTTP request header.

Authentication Request Parameters

Name          Location (Body \ URL \ Header)   Optional   Default value           Explanation
grant_type    Body                             No         client_credentials      Client requests an access token.
apiKey        Body                             No         Your API key            The API key you received from Beyond Verbal.
Content-Type  Header                           No         x-www-form-urlencoded   Default Internet media type.


Example: Get token request



Example: Response token


You must use the received authentication token in all subsequent requests.

Example: Using the token


Authorization: Bearer 21G2BA4iZJavSJQbsyuppWmfSMLgLn-**gDTCfguhzGa_k8

This token may be reused for multiple sessions.


Use this request to initialize an analysis session.

Define the type of audio file you will send (wav or pcm). If the format is pcm, also provide the channels, sample rate, and bits per sample (in a wav file these parameters are read from the header).

Optionally, you can send metadata about the client, such as clientId or deviceId. See “Metadata Guide” for more details.

Example: Simple request body

{ "dataFormat": { "type": "WAV" }, "metadata": { "clientId": "12345" } }

Start Request

Method: POST


Start Request Parameters


Name         Location (Body \ URL \ Header)   Optional   Default value                 Explanation
dataFormat   Body                             Yes        dataFormat: {"type": "WAV"}   Information about the data stream that will be sent to the server (via recording / call).

             For a file with a WAV header:
               dataFormat: { "type": "WAV" }

             For a raw PCM file:
               dataFormat: {
                 "type": "pcm",
                 "channels": 1,
                 "sample_rate": 8000,
                 "bits_per_sample": 16 }

metadata     Body                             Yes        metadata: {}                  Metadata information describing the user / client / device / session. For more details see the "Metadata Guide".

               metadata: {
                 "clientId": "",
                 "deviceId": "121332423423",
                 "phone": "1718-555-555" }

displayLang  Body                             Yes        displayLang: "en-us"          Language used for the result. If the requested language is not supported, an error is returned. Note that "en-us" is always supported.

Auth token   Authorization Header             No                                       Bearer XXX_TOKEN_XXX

Example: Start request (including token)


Authorization: Bearer 21G2BA4iZJavSJQbsyuppWmfSMLgLn-**gDTCfguhzGa_k8

"dataFormat": {"type": "WAV"},
"metadata": {"clientId": "12345"}

Start Response

Example: OK Response (200)

{status: “success”, recordingId: “someGUID”}

The recordingId field is used to access a particular session that was created.

Example: Error Response (4xx)

{status: “failure”, reason: “This is optional text explaining the error”}


Use this request to send a voice input to the server for analysis. The response contains the analysis for the whole file.

Note: The response to this request is returned only when the whole body has been received and analyzed by the server. This may take a long time; to receive intermediate analysis results, use the ANALYSIS request.

Upstream Request

Method: POST


Upstream Request Parameters

Name          Location (Body \ URL \ Header)   Optional   Explanation
recordingId   URL                              No         Unique identifier of the recording. Provided in the recordingId field of the START response.
Auth token    Authorization Header             No         See "Authentication"
Sample Data   Body                             No         Post your audio samples as the body of this HTTP message. If the length of your sample is unknown in advance (real-time streaming), use Chunked Transfer Encoding instead of the HTTP "Content-Length" header.

Example: Upstream request


Authorization: Bearer 21G2BA4iZJavSJQbsyuppWmfSMLgLn-**gDTCfguhzGa_k8

Upstream Response

The response is a JSON object containing an array of the analysis results for the whole analysis session.

Example: OK Response (200)

{
  "status": "success",
  "recordingId": "someGUID",
  "result": {
    "duration": 295096,
    "sessionStatus": "Done",
    "analysisSegments": [
      {
        "offset": 1586,
        "duration": 23201,
        "analysis": {…}
      }
    ]
  }
}

Use this request to fetch the analysis for a particular offset (segment) of the analysis session. You can issue this request in parallel with the UPSTREAM request in order to receive intermediate analysis.

Analysis Request

Method: GET


Analysis Request Parameters

Name Location (Body \ URL \ Header) Optional Explanation
recordingId URL No Unique identifier of recording. Provided in the recordingId field of the START response.
fromMs URL Yes Filters out any analysis older than the given value.
Auth token Authorization Header No See “Authentication”

Example: Initial analysis request

This example requests an analysis from the beginning of the recording session (fromMs=0).


Authorization: Bearer 21G2BA4iZJavSJQbsyuppWmfSMLgLn-**gDTCfguhzGa_k8

Analysis Response

The response is a JSON object containing an array of all the analysis segments from the fromMs offset (relative to the beginning of the session) until the current moment in time. That moment is given by duration: the value of duration indicates the beginning of the next analysis segment.

Example: Analysis OK response (200)

{
  "status": "success",
  "recordingId": "someGUID",
  "result": {
    "duration": 295096,
    "sessionStatus": "Done",
    "analysisSegments": […]
  }
}

Subsequent Analysis Request

In the next analysis request, increase fromMs to the next offset (the duration received in the previous response) to receive only the most recent analysis.
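This polling pattern can be sketched as a loop that advances fromMs by each response's duration until sessionStatus reaches "Done". Here fetch_analysis is a hypothetical stand-in for the HTTP GET:

```python
def poll_segments(fetch_analysis):
    """Collect all analysis segments from a session.

    fetch_analysis(from_ms) is a hypothetical stand-in for the
    ANALYSIS GET request; it must return the parsed JSON response.
    """
    from_ms = 0
    segments = []
    while True:
        result = fetch_analysis(from_ms)["result"]
        segments.extend(result.get("analysisSegments", []))
        if result["sessionStatus"] == "Done":
            return segments
        from_ms = result["duration"]  # beginning of the next segment

# Simulated responses standing in for two consecutive GET requests:
responses = {
    0:      {"result": {"duration": 295096, "sessionStatus": "Processing",
                        "analysisSegments": [{"offset": 1586, "duration": 23201}]}},
    295096: {"result": {"duration": 318297, "sessionStatus": "Done",
                        "analysisSegments": [{"offset": 295096, "duration": 23201}]}},
}
print(len(poll_segments(lambda ms: responses[ms])))  # 2
```

A real implementation would add a short sleep between requests and a timeout, since intermediate responses may carry no new segments.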

Example: Subsequent analysis request

This example requests an analysis from where the previous request left off (fromMs=295096).


Authorization: Bearer 21G2BA4iZJavSJQbsyuppWmfSMLgLn-**gDTCfguhzGa_k8

Continue issuing such requests until the session is completed (sessionStatus = "Done").
Note that you can issue additional analysis requests for up to 24 hours after the session started; the analysis data is cached for 24 hours.