Confidence Scores

Our speech models are probabilistic and use Bayes' theorem to infer the most likely word given an audio signal. The probability of a word given an audio signal can be thought of as:

Confidence score = P(word | audio) ∝ P(audio | word) ⨯ P(word)

P(audio | word) is the likelihood of observing the audio signal if the word was spoken. This likelihood is learned during training by observing how each phoneme or word sounds in different contexts. P(word) is the probability of the word appearing in the given language and is derived from the language model.
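
To make the proportionality concrete, here is a minimal sketch that multiplies a likelihood by a prior for each candidate word and normalizes the products so they sum to 1. The candidate words, the numeric values, and the function name `posterior_confidences` are illustrative assumptions, and normalizing over a small candidate set is one way to read the formula, not a description of the models' internals.

```python
def posterior_confidences(likelihoods, priors):
    """Return P(word | audio) for each candidate word.

    Computes P(audio | word) * P(word) per candidate, then divides by the
    sum over all candidates so the results add up to 1.
    """
    unnormalized = {word: likelihoods[word] * priors[word] for word in likelihoods}
    total = sum(unnormalized.values())
    return {word: score / total for word, score in unnormalized.items()}


# Hypothetical values for a single audio segment (not real model output).
likelihoods = {"their": 0.30, "there": 0.30, "they're": 0.10}  # P(audio | word)
priors = {"their": 0.2, "there": 0.5, "they're": 0.3}          # P(word)

confidences = posterior_confidences(likelihoods, priors)
best_word = max(confidences, key=confidences.get)
print(best_word, round(confidences[best_word], 3))
```

In this made-up case "their" and "there" are acoustically identical, so the language model prior decides which word is reported.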

Given this probabilistic model, an audio segment can yield a non-zero confidence value for more than one candidate word, for example words that sound similar or that are hard to distinguish in a noisy recording. The transcript reports the word with the highest confidence score, but some applications can also make use of the score itself. For example, a transcription service could highlight low-confidence words for manual checking.
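
The highlighting idea can be sketched as follows. The input format (a list of `(word, confidence)` pairs), the function name `mark_low_confidence`, the bracket notation, and the default threshold of 0.6 are all assumptions made for illustration; adapt them to the actual transcript format you receive.

```python
def mark_low_confidence(words, threshold=0.6):
    """Join words into a string, flagging those below the threshold.

    `words` is a list of (word, confidence) pairs; this shape is assumed
    for the example and may differ from the real transcript output.
    """
    marked = []
    for word, confidence in words:
        if confidence < threshold:
            marked.append(f"[{word}?]")  # flag for manual checking
        else:
            marked.append(word)
    return " ".join(marked)


# Hypothetical transcript fragment.
words = [("please", 0.93), ("recognize", 0.55), ("speech", 0.81)]
print(mark_low_confidence(words))  # prints: please [recognize?] speech
```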

Poor audio quality, words that are difficult to distinguish (e.g. mumbling), words that are out of context, and out-of-vocabulary words (OOVs) all lead to lower confidence scores, because these conditions allow for more alternatives given the audio signal.

As a rule of thumb, interpret the confidence score as follows:

less than 0.6: low confidence, likely incorrect transcription

0.6 - 0.8: likely correct transcription

greater than 0.8: quite accurate transcription
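
These bands can be expressed as a small helper. This is a sketch only: the function name `interpret_confidence` and the returned labels are assumptions, and it treats both 0.6 and 0.8 as part of the middle band, following the ranges given above.

```python
def interpret_confidence(score):
    """Map a confidence score in [0, 1] to the rule-of-thumb band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score must be in [0, 1], got {score}")
    if score < 0.6:
        return "low confidence, likely incorrect"
    if score <= 0.8:
        return "likely correct"
    return "quite accurate"


for score in (0.45, 0.7, 0.95):
    print(score, "->", interpret_confidence(score))
```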
