Sana Voice Overview

Learning to speak a new language can be difficult. Sana Voice uses state-of-the-art speech recognition technology to help learners perfect their pronunciation and sound like a native speaker. It models pronunciation independently of the learner's native language and provides instant, personalized feedback.

The Sana Voice API can be used to score a word, sentence or phrase. The endpoint returns overall, word and phoneme level scores.



Scoring A Phrase

Scores a word, sentence or phrase. The endpoint returns overall, word and phoneme level scores.

POST https://voice-api.sanalabs.com/api/v1/score


Body Parameters

Key | Mandatory | Type | Description
dialect | Yes | string | The dialect to use for scoring. en-us is supported. For en-us, the API automatically converts target_phrase to phonemes.
user_id | Yes | string | User ID of the end user who receives the pronunciation feedback. This should be anonymized.
target_phrase | Yes | string | A word, phrase or sentence to score.
audio | Yes | binary | A file containing the user audio to be scored. See the Audio section for details.
audio_format | No | string | One of wav, mp3 or webm. Defaults to wav. See the Audio section for details.
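
The parameters above can be assembled and sanity-checked on the client before a request is sent. The following is a minimal sketch: the helper name build_score_fields and its validation rules are illustrative assumptions derived from the parameter table, not part of the Sana Voice API.

```python
# Hypothetical helper: builds the non-file form fields for /api/v1/score.
SUPPORTED_FORMATS = {"wav", "mp3", "webm"}  # formats named in the parameter table


def build_score_fields(dialect, user_id, target_phrase, audio_format="wav"):
    """Return the text form fields, raising ValueError on obvious mistakes."""
    if audio_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio_format: {audio_format!r}")
    if not target_phrase.strip():
        raise ValueError("target_phrase must not be empty")
    return {
        "dialect": dialect,
        "user_id": str(user_id),  # the table lists user_id as a string
        "target_phrase": target_phrase,
        "audio_format": audio_format,
    }
```

With the requests library, the returned dictionary could be passed as data= alongside files={'audio': ...}, which produces a multipart/form-data body.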

Response Format

On success, the HTTP status code in the response header is 200 OK and the response body contains the fields listed below. On error, the header status code is an error code and the response body contains a list of Error Response objects.

Key | Type | Description
overall_score | int | A value between 0–100. The score of the overall phrase.
word_scores | Array of Word Score objects | Scores for each different word including phonemes.

Example Curl Request

curl "https://voice-api.sanalabs.com/api/v1/score" \
-H "Content-Type: multipart/form-data" \
-H "X-API-KEY: $API_KEY" \
-X "POST" \
-F "dialect=en-us" \
-F "user_id=123456" \
-F "target_phrase=good joke" \
-F "audio=@audio_file_16k.wav" \
-F "audio_format=wav"

Example Python Code

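import os
import requests

url = "https://voice-api.sanalabs.com/api/v1/score"
# Read the API key from the environment instead of hard-coding it.
head = {'X-API-KEY': os.environ['API_KEY']}

# Open the recording in a context manager so the file handle is closed.
with open('audio.wav', 'rb') as audio:
  files = {
    'audio': ('audio.wav', audio),
    'dialect': (None, 'en-us'),
    'user_id': (None, '123456'),  # user_id is a string field
    'audio_format': (None, 'wav'),
    'target_phrase': (None, 'good joke'),
  }
  r = requests.post(url, headers=head, files=files)

print(r.status_code)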

Example Response

{
  "target_phrase": "good joke",
  "overall_score": 92,
  "word_scores": [
    {
      "score": 87,
      "is_matched": true,
      "boundaries": {
        "start_ms": 430,
        "end_ms": 780
      },
      "indices": {
        "start": 0,
        "end": 3
      },
      "phoneme_scores": [
        {
          "sounds_like": "W",
          "phoneme_ipa": "ɡ",
          "sounds_like_ipa": "w",
          "phoneme": "G",
          "score": 70
        },
        {
          "sounds_like": "UH",
          "phoneme_ipa": "ʊ",
          "sounds_like_ipa": "ʊ",
          "phoneme": "UH",
          "score": 100
        },
        {
          "sounds_like": "D",
          "phoneme_ipa": "d",
          "sounds_like_ipa": "d",
          "phoneme": "D",
          "score": 100
        }
      ],
      "syllable_scores": [
        {
          "syllable_ipa": "ɡʊd",
          "syllable": "good",
          "score": 87,
          "phoneme_scores": [
            {
              "phoneme": "G",
              "phoneme_ipa": "ɡ",
              "score": 70,
              "score_grade": "D",
              "sounds_like": "#",
              "sounds_like_ipa": "#"
            },
            {
              "phoneme": "UH",
              "phoneme_ipa": "ʊ",
              "score": 100,
              "score_grade": "D",
              "sounds_like": "#",
              "sounds_like_ipa": "#"
            },
            {
              "phoneme": "D",
              "phoneme_ipa": "d",
              "score": 100,
              "score_grade": "D",
              "sounds_like": "#",
              "sounds_like_ipa": "#"
            }
          ]
        }
      ],
      "word": "good"
    },
    {
      "score": 100,
      "is_matched": true,
      "boundaries": {
        "start_ms": 860,
        "end_ms": 1370
      },
      "indices": {
        "start": 5,
        "end": 8
      },
      "phoneme_scores": [
        {
          "sounds_like": "JH",
          "phoneme_ipa": "dʒ",
          "sounds_like_ipa": "dʒ",
          "phoneme": "JH",
          "score": 100
        },
        {
          "sounds_like": "OW",
          "phoneme_ipa": "oʊ",
          "sounds_like_ipa": "oʊ",
          "phoneme": "OW",
          "score": 100
        },
        {
          "sounds_like": "K",
          "phoneme_ipa": "k",
          "sounds_like_ipa": "k",
          "phoneme": "K",
          "score": 100
        }
      ],
      "syllable_scores": [
        {
          "syllable_ipa": "dʒoʊk",
          "syllable": "jowk",
          "score": 100,
          "phoneme_scores": [
            {
              "phoneme": "JH",
              "phoneme_ipa": "dʒ",
              "score": 100,
              "score_grade": "D",
              "sounds_like": "#",
              "sounds_like_ipa": "#"
            },
            {
              "phoneme": "OW",
              "phoneme_ipa": "oʊ",
              "score": 100,
              "score_grade": "D",
              "sounds_like": "#",
              "sounds_like_ipa": "#"
            },
            {
              "phoneme": "K",
              "phoneme_ipa": "k",
              "score": 100,
              "score_grade": "D",
              "sounds_like": "#",
              "sounds_like_ipa": "#"
            }
          ]
        }
      ],
      "word": "joke"
    }
  ]
}
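
A response like the one above can be reduced to the words a learner should repeat. This is a minimal sketch: the function name words_to_practice and the threshold values are arbitrary illustrative choices, and the sample dictionary is a trimmed copy of the example response.

```python
def words_to_practice(response, threshold=80):
    """Return (word, score) pairs that were unmatched or scored below threshold."""
    weak = []
    for entry in response.get("word_scores", []):
        unmatched = not entry.get("is_matched", False)
        if unmatched or entry.get("score", 0) < threshold:
            weak.append((entry["word"], entry.get("score")))
    return weak


# Trimmed-down version of the example response.
sample = {
    "overall_score": 92,
    "word_scores": [
        {"word": "good", "score": 87, "is_matched": True},
        {"word": "joke", "score": 100, "is_matched": True},
    ],
}
print(words_to_practice(sample, threshold=90))  # [('good', 87)]
```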

Score Schema

Score | Description
[90–100] | Excellent. Native-like
[80–90) | Good and intelligible
[60–80) | It sounds okay, but there is room for improvement
[0–60) | Doesn’t sound great. Should be tried again.
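
The bands above translate directly into a lookup. The sketch below assumes integer scores in the documented 0–100 range; the function name score_band is hypothetical.

```python
def score_band(score):
    """Map a 0-100 score to its description in the Score Schema."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 90:   # [90-100]
        return "Excellent. Native-like"
    if score >= 80:   # [80-90)
        return "Good and intelligible"
    if score >= 60:   # [60-80)
        return "It sounds okay, but there is room for improvement"
    return "Doesn't sound great. Should be tried again."  # [0-60)
```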

Scoring A Multiple Choice Phrase

Scores an audio file against multiple phrases. The endpoint returns overall, word and phoneme level scores for the option with the best match from the choices given.

POST https://voice-api.sanalabs.com/api/v1/multichoice/score


Body Parameters

Same as in phrase-scoring, except that target_phrase is repeated once for each candidate phrase.


Response Format

Same as in phrase-scoring for the option with the best match from the choices given.

Example Curl Request

curl "https://voice-api.sanalabs.com/api/v1/multichoice/score" \
-H "Content-Type: multipart/form-data" \
-H "X-API-KEY: $API_KEY" \
-X "POST" \
-F "dialect=en-us" \
-F "user_id=123456" \
-F "target_phrase=good joke" \
-F "target_phrase=bad joke" \
-F "target_phrase=awesome prank" \
-F "audio=@audio_file_16k.wav" \
-F "audio_format=wav"

Example Python Code

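import os
import requests

url = "https://voice-api.sanalabs.com/api/v1/multichoice/score"
# Read the API key from the environment instead of hard-coding it.
head = {'X-API-KEY': os.environ['API_KEY']}

# Open the recording in a context manager so the file handle is closed.
with open('audio.wav', 'rb') as audio:
  # A list of pairs (rather than a dict) lets target_phrase repeat.
  files = [
    ('audio', ('audio.wav', audio)),
    ('dialect', (None, 'en-us')),
    ('user_id', (None, '123456')),  # user_id is a string field
    ('audio_format', (None, 'wav')),
    ('target_phrase', (None, 'good joke')),
    ('target_phrase', (None, 'bad joke')),
    ('target_phrase', (None, 'awesome prank')),
  ]
  r = requests.post(url, headers=head, files=files)

print(r.status_code)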
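
When the candidate phrases come from application data rather than being written out by hand, the repeated target_phrase fields can be generated. This is a sketch under stated assumptions: build_multichoice_fields is a hypothetical helper, and rejecting an empty phrase list is a client-side choice, not a documented API rule.

```python
def build_multichoice_fields(phrases, dialect="en-us", user_id="123456",
                             audio_format="wav"):
    """Return multipart text fields with one target_phrase entry per phrase."""
    if not phrases:
        raise ValueError("at least one candidate phrase is required")
    fields = [
        ("dialect", (None, dialect)),
        ("user_id", (None, str(user_id))),
        ("audio_format", (None, audio_format)),
    ]
    # A list of pairs keeps every repeated target_phrase key.
    fields.extend(("target_phrase", (None, phrase)) for phrase in phrases)
    return fields
```

Before sending, an ('audio', (filename, file_object)) pair would be appended and the list passed as files= to requests.post.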

Fluency Scoring [BETA]

Returns a fluency assessment score for a target phrase. The endpoint returns an overall assessment score as well as scores for content, fluency and pronunciation along with explanations for each part.

Contact Sana Labs to get access to this endpoint.


Body Parameters

Same as in phrase-scoring.


Response Format

On success, the HTTP status code in the response header is 200 OK and the response body contains the fields listed below. On error, the header status code is an error code and the response body contains a list of Error Response objects.

Key | Type | Description
assessment | Assessment Dictionary | Scores and explanations for the different assessment criteria.

Example Response

{
  "target_phrase": "good joke",
  .
  .
  .
  ,

  "assessment": {
    "overall_score": 86,
    "fluency": {
      "score": 85,
      "explanation": [
        {
          "keyword": "unfilled_pauses",
          "description": "Few unfilled pauses."
        },
        {
          "keyword": "hesitations",
          "description": "Few hesitations."
        },
        {
          "keyword": "needs_pronunciation_improvement",
          "description": "Pronunciation could be improved."
        },
        {
          "keyword": "slow",
          "description": "Spoke slowly."
        }
      ]
    },
    "content": {
      "score": 98,
      "explanation": [
        {
          "keyword": "additional_sounds",
          "description": "Few additional phonemes."
        }
      ]
    },
    "pronunciation": {
      "score": 88,
      "explanation": [
        {
          "keyword": "needs_pronunciation_improvement",
          "description": "1 word was pronounced incorrectly."
        }
      ]
    }
  }
}
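
Because each explanation carries a machine-readable keyword, an assessment like the one above can be condensed into keyword lists per criterion. The sketch below is illustrative: explanation_keywords is a hypothetical helper and the sample is a trimmed copy of the example response.

```python
def explanation_keywords(assessment):
    """Map each criterion to the keywords found in its explanation list."""
    result = {}
    for name in ("fluency", "content", "pronunciation"):
        section = assessment.get(name, {})
        result[name] = [item["keyword"] for item in section.get("explanation", [])]
    return result


# Trimmed-down version of the "assessment" object in the example response.
sample_assessment = {
    "overall_score": 86,
    "fluency": {"score": 85, "explanation": [
        {"keyword": "unfilled_pauses", "description": "Few unfilled pauses."},
        {"keyword": "slow", "description": "Spoke slowly."},
    ]},
    "content": {"score": 98, "explanation": [
        {"keyword": "additional_sounds", "description": "Few additional phonemes."},
    ]},
    "pronunciation": {"score": 88, "explanation": [
        {"keyword": "needs_pronunciation_improvement",
         "description": "1 word was pronounced incorrectly."},
    ]},
}
print(explanation_keywords(sample_assessment))
```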

Getting Syllabification of a Phrase

Returns the syllable structure of a target phrase. The endpoint response has a similar structure to the scoring endpoint.

POST https://voice-api.sanalabs.com/api/v1/syllables


Body Parameters

KeyMandatoryTypeDescription
target_phraseYesstringA word, phrase or sentence to syllabify.

Response Format

On success, the HTTP status code in the response header is 200 OK and the response body contains the fields listed below. On error, the header status code is an error code and the response body contains a list of Error Response objects.

Key | Type | Description
word_scores | Array of Word Score objects | Because no phrase is scored, the array contains only the entries relevant to syllabification.

Example Curl Request

curl "https://voice-api.sanalabs.com/api/v1/syllables" \
-H "Content-Type: multipart/form-data" \
-H "X-API-KEY: $API_KEY" \
-X "POST" \
-F "target_phrase=Syllable example"

Example Python Code

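import os
import requests

url = "https://voice-api.sanalabs.com/api/v1/syllables"
# Read the API key from the environment instead of hard-coding it.
head = {'X-API-KEY': os.environ['API_KEY']}

# A (None, value) tuple sends a plain multipart form field, not a file.
files = {
  'target_phrase': (None, 'Syllable example'),
}

r = requests.post(url, headers=head, files=files)
print(r.status_code)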

Example Response

{
  "target_phrase": "syllable example",
  "word_scores": [
    {
      "word": "Syllable",
      "phoneme_scores": [
        {
          "phoneme": "S",
          "phoneme_ipa": "s"
        },
        {
          "phoneme": "IH",
          "phoneme_ipa": "ɪ"
        },
        {
          "phoneme": "L",
          "phoneme_ipa": "l"
        },
        {
          "phoneme": "AH",
          "phoneme_ipa": "ʌ"
        },
        {
          "phoneme": "B",
          "phoneme_ipa": "b"
        },
        {
          "phoneme": "AH",
          "phoneme_ipa": "ʌ"
        },
        {
          "phoneme": "L",
          "phoneme_ipa": "l"
        }
      ],
      "syllable_scores": [
        {
          "syllable_ipa": "sɪ",
          "syllable": "si",
          "phoneme_scores": [
            {
              "phoneme": "S",
              "phoneme_ipa": "s"
            },
            {
              "phoneme": "IH",
              "phoneme_ipa": "ɪ"
            }
          ]
        },
        {
          "syllable_ipa": "lʌ",
          "syllable": "luh",
          "phoneme_scores": [
            {
              "phoneme": "L",
              "phoneme_ipa": "l"
            },
            {
              "phoneme": "AH",
              "phoneme_ipa": "ʌ"
            }
          ]
        },
        {
          "syllable_ipa": "bʌl",
          "syllable": "buhl",
          "phoneme_scores": [
            {
              "phoneme": "B",
              "phoneme_ipa": "b"
            },
            {
              "phoneme": "AH",
              "phoneme_ipa": "ʌ"
            },
            {
              "phoneme": "L",
              "phoneme_ipa": "l"
            }
          ]
        }
      ],
      "indices": {
        "start": 0,
        "end": 7
      }
    },
    {
      "word": "example",
      "phoneme_scores": [
        {
          "phoneme": "IH",
          "phoneme_ipa": "ɪ"
        },
        {
          "phoneme": "G",
          "phoneme_ipa": "ɡ"
        },
        {
          "phoneme": "Z",
          "phoneme_ipa": "z"
        },
        {
          "phoneme": "AE",
          "phoneme_ipa": "æ"
        },
        {
          "phoneme": "M",
          "phoneme_ipa": "m"
        },
        {
          "phoneme": "P",
          "phoneme_ipa": "p"
        },
        {
          "phoneme": "AH",
          "phoneme_ipa": "ʌ"
        },
        {
          "phoneme": "L",
          "phoneme_ipa": "l"
        }
      ],
      "syllable_scores": [
        {
          "syllable_ipa": "ɪɡ",
          "syllable": "ig",
          "phoneme_scores": [
            {
              "phoneme": "IH",
              "phoneme_ipa": "ɪ"
            },
            {
              "phoneme": "G",
              "phoneme_ipa": "ɡ"
            }
          ]
        },
        {
          "syllable_ipa": "zæm",
          "syllable": "zam",
          "phoneme_scores": [
            {
              "phoneme": "Z",
              "phoneme_ipa": "z"
            },
            {
              "phoneme": "AE",
              "phoneme_ipa": "æ"
            },
            {
              "phoneme": "M",
              "phoneme_ipa": "m"
            }
          ]
        },
        {
          "syllable_ipa": "pʌl",
          "syllable": "puhl",
          "phoneme_scores": [
            {
              "phoneme": "P",
              "phoneme_ipa": "p"
            },
            {
              "phoneme": "AH",
              "phoneme_ipa": "ʌ"
            },
            {
              "phoneme": "L",
              "phoneme_ipa": "l"
            }
          ]
        }
      ],
      "indices": {
        "start": 9,
        "end": 15
      }
    }
  ]
}
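
A response of this shape can be turned into a display string per word by joining its syllables. This is a minimal sketch: the function name syllabify and the choice of separator are assumptions, and the sample keeps only the fields the function reads.

```python
def syllabify(response, key="syllable_ipa", separator="."):
    """Return (word, joined_syllables) pairs in phrase order."""
    return [
        (entry["word"], separator.join(s[key] for s in entry["syllable_scores"]))
        for entry in response["word_scores"]
    ]


# Trimmed-down version of the example response.
sample = {
    "word_scores": [
        {"word": "Syllable", "syllable_scores": [
            {"syllable_ipa": "sɪ", "syllable": "si"},
            {"syllable_ipa": "lʌ", "syllable": "luh"},
            {"syllable_ipa": "bʌl", "syllable": "buhl"},
        ]},
        {"word": "example", "syllable_scores": [
            {"syllable_ipa": "ɪɡ", "syllable": "ig"},
            {"syllable_ipa": "zæm", "syllable": "zam"},
            {"syllable_ipa": "pʌl", "syllable": "puhl"},
        ]},
    ]
}
print(syllabify(sample, key="syllable", separator="-"))
```

Pairs are returned instead of a dictionary so that a word occurring twice in the phrase is not lost.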


Object Model

This section describes the objects that are used throughout the different endpoints.

Word Score

Key | Type | Description
word | string | The word from the target phrase that this entry describes.
is_matched | boolean | Indicates whether the API detected the word. If is_matched is false, the engine is not confident that the user attempted to pronounce this word.
boundaries | Word Boundaries dictionary | Timestamps in milliseconds for a word.
indices | Word Indices dictionary | Positional indices of a scored word, mapping back to the target_phrase.
score | int | A value between 0–100. Depicts how well the learner pronounced this specific word.
phoneme_scores | Array of Phoneme Score objects | A score for each phoneme in a word.
syllable_scores | Array of Syllable Score objects | A score for each syllable in a word. Words are divided into syllables using the Maximal Onset Principle.

Word Boundaries

Key | Type | Description
start_ms | int | Beginning timestamp in milliseconds.
end_ms | int | Ending timestamp in milliseconds.
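
The two timestamps are enough to derive how long each word lasted and how long the speaker paused before it. The sketch below is one possible use, not something the API computes; word_timings is a hypothetical helper and the sample values are taken from the scoring example response.

```python
def word_timings(word_scores):
    """Return (word, duration_ms, pause_before_ms) for each word in order."""
    timings = []
    previous_end = None
    for entry in word_scores:
        bounds = entry["boundaries"]
        duration = bounds["end_ms"] - bounds["start_ms"]
        # The first word has no preceding word, so its pause is None.
        pause = None if previous_end is None else bounds["start_ms"] - previous_end
        timings.append((entry["word"], duration, pause))
        previous_end = bounds["end_ms"]
    return timings


sample_words = [
    {"word": "good", "boundaries": {"start_ms": 430, "end_ms": 780}},
    {"word": "joke", "boundaries": {"start_ms": 860, "end_ms": 1370}},
]
print(word_timings(sample_words))  # [('good', 350, None), ('joke', 510, 80)]
```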

Word Indices

Key | Type | Description
start | int | Starting index of a word.
end | int | Ending index of a word (inclusive).
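
Because the end index is inclusive, recovering a word from target_phrase with a Python slice needs end + 1. A small sketch, assuming the indices count characters of the submitted phrase (which matches the example responses); the function name is hypothetical.

```python
def word_from_indices(target_phrase, indices):
    """Slice the word out of target_phrase using inclusive start/end indices."""
    return target_phrase[indices["start"]:indices["end"] + 1]


print(word_from_indices("good joke", {"start": 5, "end": 8}))  # joke
```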

Phoneme Score

Key | Type | Description
phoneme | string | The distinct unit of sound within the target word in ARPABET 2-letter format.
phoneme_ipa | string | The distinct unit of sound within the target word in IPA format.
sounds_like | string | The distinct unit of sound within the inferred word in ARPABET 2-letter format. # is used when no phoneme was detected.
sounds_like_ipa | string | The distinct unit of sound within the inferred word in IPA format. # is used when no phoneme was detected.
score | int | A value between 0–100.
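
Comparing phoneme with sounds_like shows what the learner actually produced. The following sketch classifies a single Phoneme Score object; the function name and the four labels it returns are illustrative assumptions, not values returned by the API.

```python
def classify_phoneme(phoneme_score):
    """Label a Phoneme Score object by comparing target and inferred sounds."""
    heard = phoneme_score.get("sounds_like")
    if heard is None:
        return "unscored"       # e.g. entries from the syllables endpoint
    if heard == "#":
        return "not_detected"   # '#' marks that no phoneme was detected
    if heard == phoneme_score["phoneme"]:
        return "match"
    return "substitution"       # a different sound was inferred


# First phoneme of "good" in the scoring example: target G, heard W.
print(classify_phoneme({"phoneme": "G", "sounds_like": "W", "score": 70}))
```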

Syllable Score

Key | Type | Description
syllable | string | Pronunciation respelling of the unit of organization for a sequence of speech sounds within the target word.
syllable_ipa | string | The unit of organization for a sequence of speech sounds within the target word in IPA format.
score | int | A value between 0–100.
phoneme_scores | Array of Phoneme Score objects | A score for each phoneme in a syllable.

Assessment Dictionary

Key | Type | Description
overall_score | int | A value between 0–100. Overall score of the assessment.
fluency | dictionary | Contains a score and a list of fluency assessment explanation dictionaries.
content | dictionary | Contains a score and a list of content assessment explanation dictionaries.
pronunciation | dictionary | Contains a score and a list of pronunciation assessment explanation dictionaries.

Fluency Assessment Explanation

Each fluency assessment is accompanied by a list of reasons. Each reason consists of a keyword entry, suitable for programmatic parsing, and a description entry that gives a human-readable explanation of that reason.

Keyword | Type | Description
success | string | Perfect fluency score!
needs_pronunciation_improvement | string | Pronunciation errors that affected fluency.
slow | string | Low articulation rate levels within the speech signal.
unfilled_pauses | string | Big spans of silence.
hesitations | string | Speech hesitations or small pauses.
uneven_articulation_rate | string | The flow of speech had speed variance. Try to speak with the same level of speed throughout.

Content Assessment Explanation

Each content assessment is accompanied by a list of reasons. Each reason consists of a keyword entry, suitable for programmatic parsing, and a description entry that gives a human-readable explanation of that reason.

Keyword | Type | Description
success | string | All words were attempted successfully!
additional_sounds | string | Noise or excess speech was detected.
missing_words | string | Some words were not detected or uttered successfully.

Pronunciation Assessment Explanation

Each pronunciation assessment is accompanied by a list of reasons. Each reason consists of a keyword entry, suitable for programmatic parsing, and a description entry that gives a human-readable explanation of that reason.

Keyword | Type | Description
success | string | Perfect pronunciation score!
needs_pronunciation_improvement | string | Pronunciation errors.


API Semantics

This section explains the semantics of our REST API. It includes common information that is valid for all the endpoints.


API Endpoints

The base URL for all our endpoints is https://voice-api.sanalabs.com. Please note that non-secure access to the API is not available. All HTTP requests will be redirected to HTTPS automatically.


Authentication

Sana API supports two types of authentication:

  • server side for calls made from backend systems which are shielded from direct access by users.
  • client side for calls made from client apps which are installed on users’ devices.

The advantage of server side communication is that it is simpler to implement. However, it adds extra latency to client apps, since every request has to go through a proxy backend before being sent to the Sana Voice API.

Server Side Authentication

Allows secure communication towards Sana API from backend systems. The recommendation is to use server side authentication during initial stages of integration with Sana Voice API.

A valid API key is needed to access the Sana Voice API. Contact Sana Labs to get your own API key. Your API keys carry privileges to access the Sana Voice API, so be sure to keep them secret. Do not share your API keys in publicly accessible places such as GitHub or client-side code.

The Sana Voice API expects the API key to be included in all API requests to the server in a header that looks like the following:

X-API-KEY: $API_KEY

If the key is omitted or is wrong, you will get a 401 Unauthorized response to your request.

To authorize, pass the X-API-KEY header

curl -H "X-API-KEY: $API_KEY" https://voice-api.sanalabs.com/api/v1/score

Make sure to replace $API_KEY with your API key.
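
In a backend, the key is best kept out of source code. The sketch below reads it from an environment variable and builds the header shown above; the variable name API_KEY mirrors the curl examples, and the helper name api_key_headers is hypothetical.

```python
import os


def api_key_headers(env=os.environ):
    """Build the X-API-KEY header from an environment mapping."""
    key = env.get("API_KEY", "").strip()
    if not key:
        # Fail early instead of sending a request that would return 401.
        raise RuntimeError("API_KEY is not set")
    return {"X-API-KEY": key}
```

The returned dictionary can be passed as headers= to requests.post.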

Client Side Authentication

Allows secure communication towards Sana API from client applications (e.g. mobile and browser based apps). The recommendation is to use client side authentication for apps with low latency requirements to make direct calls towards Sana Voice API.

For client side authentication Sana API uses JWT (JSON Web Tokens): https://jwt.io/introduction.

Follow these steps to integrate JWT into your app:

  1. Generate a private/public key pair using the RSA algorithm with a 4096-bit key size: ssh-keygen -m PEM -t rsa -b 4096 -f sana_voice_id_rsa. For more details see https://www.ssh.com/ssh/keygen.

  2. Send the public key (the file named sana_voice_id_rsa.pub) to your Sana Labs account manager via email.

  3. Before making a request to the Sana Voice API, issue a JWT token using one of the libraries listed at https://jwt.io/#libraries. The token payload should include the following fields: iss - issuer id (provided by your Sana Labs account manager); aud - user id (unique identifier for the user of the app); exp - token expiration time (check the format required by the library you use). The recommended expiration time is one week after the token is issued. Renew the token for each user before it expires.

  4. Sign the token with the private key generated in step 1, using the RS256 algorithm.

  5. Make requests to the Sana Voice API. Include the JWT token in every API request in the Authorization header, using the Bearer schema: Authorization: Bearer <token>.

If the token is omitted, wrong or expired, you will get a 401 Unauthorized response to your request.

To authorize, pass the Authorization header

curl -H "Authorization: Bearer $JWT_TOKEN" https://voice-api.sanalabs.com/api/v1/score

Make sure to replace $JWT_TOKEN with your JWT token.
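
Steps 3 and 4 can be sketched in Python. This is an illustration under stated assumptions, not the only way to issue tokens: it assumes the PyJWT library (one of those listed on jwt.io, with the cryptography package installed for RS256) and expresses exp as seconds since the Unix epoch. The function names are hypothetical.

```python
import time

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # recommended token lifetime


def build_claims(issuer_id, user_id, now=None):
    """Return the payload with the iss, aud and exp fields described above."""
    issued_at = int(time.time()) if now is None else int(now)
    return {
        "iss": issuer_id,                      # provided by Sana Labs
        "aud": user_id,                        # unique id of the app user
        "exp": issued_at + ONE_WEEK_SECONDS,   # one week after issuance
    }


def sign_token(claims, private_key_pem):
    """Sign the claims with the RSA private key from step 1 using RS256."""
    import jwt  # PyJWT, imported here so build_claims works without it
    return jwt.encode(claims, private_key_pem, algorithm="RS256")
```

The string returned by sign_token would then be sent as Authorization: Bearer <token>.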


Errors

All endpoints result in either success or an error. The API returns 200 or 201 for successful requests; for failed requests it returns the relevant HTTP status code and an Error Response object. See the Error Status Codes section for the HTTP status codes the Sana Voice API returns.


Audio

Sana Voice API supports the wav, mp3 and webm audio formats. For the best quality and performance, use a sample rate of 16 kHz and 1 channel (mono).
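
For wav files, the recommended settings can be checked locally before upload with Python's standard wave module. This is a minimal sketch: check_wav is a hypothetical helper, it only handles uncompressed wav files, and a mismatch is reported as a warning rather than treated as an API requirement.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return a list of mismatches against the recommended audio settings."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, recommended {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channels, recommended {expected_channels}")
    return problems
```

An empty list means the file already matches the recommendation.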


Error Status Codes

The Sana Voice API uses the following error codes:

Error Code | Error Text | Error Description
400 | Bad Request | Your request is invalid.
401 | Unauthorized | No API Key or your API key is wrong.
402 | Payment Required | Your API Key expired.
404 | Not Found | The specified resource could not be found.
405 | Method Not Allowed | You tried to access a resource with an invalid method.
429 | Too Many Requests | You have exceeded your rate limit.
500 | Internal Server Error | There was a problem on the server side. Please try again later.
503 | Service Unavailable | The API is temporarily offline for maintenance. Please try again later.
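
A client can use these codes to decide whether repeating a request is worthwhile. The sketch below treats 429, 500 and 503 as retryable and spaces attempts with exponential backoff; this retry policy, the delay values and the function names are assumptions, since the API does not prescribe one.

```python
# Status codes whose descriptions suggest trying again later (an assumption).
RETRYABLE_STATUS = {429, 500, 503}


def is_retryable(status_code):
    """Return True when the same request may succeed if sent again later."""
    return status_code in RETRYABLE_STATUS


def backoff_delays(attempts, base_seconds=1.0, cap_seconds=30.0):
    """Return the wait before each retry: base * 2**i, limited to the cap."""
    return [min(cap_seconds, base_seconds * 2 ** i) for i in range(attempts)]


print(backoff_delays(4))  # [1.0, 2.0, 4.0, 8.0]
```

Codes such as 400, 401 and 402 indicate a problem with the request or the account, so repeating the same request unchanged is unlikely to help.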