Speech Recognition Adaptation

 

The performance of automatic speech recognition systems can drop dramatically when there is an acoustic mismatch between training and test data. The mismatch can be related to factors such as ambient noise, inter-speaker variability, and the acquisition channel.

Rather than training new models for the new condition, which would involve the expensive process of collecting and preparing new speech data, speech recognition adaptation methods are often used.

Adaptation is a method of using just a small amount of data to tailor existing models to the characteristics of, for example, a new speaker or a new environment.

Speech recognition adaptation methods can be divided into different modes. In supervised adaptation, the transcriptions of the adaptation data are known, whereas unsupervised adaptation refers to circumstances where the adaptation data is unlabelled.

In static adaptation, all the adaptation data is available in one block. Incremental adaptation, on the other hand, updates the models step by step as more data becomes available.

Maximum likelihood linear regression (MLLR)

Maximum likelihood linear regression is an adaptation method which applies linear transformations to clusters of acoustic units. The transformations are estimated from the adaptation data and are used to change the means and variances of the Gaussian mixtures, so that these have a higher likelihood of having generated the adaptation observations.

MLLR is an example of an indirect adaptation method: since the units are clustered and each cluster shares one transformation, all units are updated, even if they lack representation in the adaptation data. This makes MLLR effective for small amounts of data, but it also leads to a quick saturation in performance when the amount of data grows.
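The shared-transform idea can be illustrated with a deliberately simplified sketch. It assumes one-dimensional Gaussians that all share the same variance, a single cluster, and a known frame-to-unit alignment; under those assumptions the maximum-likelihood transform mu' = a * mu + b reduces to an ordinary least-squares fit. Variance adaptation is left out, and the function names and unit labels are hypothetical, chosen only for illustration.

```python
def estimate_mllr_transform(means, frames):
    """Estimate a shared scalar transform mu' = a * mu + b.

    means:  dict mapping unit name -> 1-D Gaussian mean
    frames: list of (unit, observation) pairs with known alignment

    Assumption: all Gaussians share one variance, so maximising the
    likelihood of the frames is the same as least-squares regression
    of the observations on the current means.
    """
    xs = [means[unit] for unit, _ in frames]
    ys = [obs for _, obs in frames]
    n = len(frames)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx  # requires at least two distinct means among observed units
    b = mean_y - a * mean_x
    return a, b


def apply_mllr(means, a, b):
    """Apply the shared transform to every unit, observed or not."""
    return {unit: a * mu + b for unit, mu in means.items()}


# Hypothetical toy data: "uw" never occurs in the adaptation frames.
means = {"aa": 1.0, "iy": 3.0, "uw": 5.0}
frames = [("aa", 2.4), ("aa", 2.6), ("iy", 4.9), ("iy", 5.1)]
a, b = estimate_mllr_transform(means, frames)
adapted = apply_mllr(means, a, b)
```

In this toy example the unseen unit "uw" moves from 5.0 to about 7.5 even though no frame was aligned to it, which is the behaviour that makes a shared transform useful when data is scarce.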

Maximum a posteriori (MAP)

Maximum a posteriori is another speech recognition adaptation method, which combines prior knowledge about the model parameters with information acquired from the adaptation data.

Contrary to MLLR, MAP is a direct adaptation method in that it updates the acoustic units individually. Components not present in the adaptation data will not be updated. This means that MAP is not an ideal method for small amounts of data. On the other hand, because every observed component receives its own detailed update, it outperforms MLLR when more data is available.

The two described techniques can be combined to improve results even further. In this case, the MLLR-adapted models can be used as prior information for MAP.
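A minimal sketch of a MAP mean update is shown below, again for one-dimensional Gaussians with a known alignment. It uses the common interpolation form mu_map = (tau * mu_prior + sum of frames) / (tau + n), where the prior weight tau is a tuning parameter assumed here for illustration, as are the function name and the toy data.

```python
def map_adapt_means(prior_means, frames, tau=10.0):
    """MAP update of 1-D Gaussian means.

    prior_means: dict mapping unit name -> prior mean
    frames:      list of (unit, observation) pairs with known alignment
    tau:         weight given to the prior (an assumed tuning parameter)
    """
    sums = {}
    counts = {}
    for unit, obs in frames:
        sums[unit] = sums.get(unit, 0.0) + obs
        counts[unit] = counts.get(unit, 0) + 1

    adapted = {}
    for unit, mu in prior_means.items():
        n = counts.get(unit, 0)
        # With n == 0 this reduces to the prior mean: unseen units stay put.
        adapted[unit] = (tau * mu + sums.get(unit, 0.0)) / (tau + n)
    return adapted


# Hypothetical toy data: only "aa" is observed, always at 2.0.
prior_means = {"aa": 1.0, "iy": 3.0}
few = map_adapt_means(prior_means, [("aa", 2.0)] * 2, tau=2.0)
many = map_adapt_means(prior_means, [("aa", 2.0)] * 18, tau=2.0)
```

With two frames the mean of "aa" moves halfway to the data (1.5); with eighteen frames it is almost at the data (1.9). The unobserved "iy" keeps its prior mean in both cases.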

Hidden Markov models (HMM)

Hidden Markov models are a powerful statistical technique for modeling speech signals, and they have long been a dominant technique in speech recognition. They are not an adaptation method in themselves; rather, they are the models whose parameters methods such as MLLR and MAP adapt.

A hidden Markov model represents a linguistic unit, for example a word or a phoneme.

It has a finite number of states, and the transitions between these are probabilistic and take place once every time unit (a model may also stay in the same state). Each state has a probabilistic output function which represents a random variable or a stochastic process. Gaussian distributions are a common choice for representing these functions, and in practice, mixtures of Gaussians with individual means, variances, and mixture weights are typically used, as these allow a wide range of distributions to be approximated.
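As a small illustration of such an output function, the sketch below evaluates a one-dimensional Gaussian mixture density. Real systems use multivariate features; the scalar case and the function names are simplifying assumptions made here.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance."""
    norm = math.sqrt(2.0 * math.pi * variance)
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / norm


def mixture_pdf(x, weights, means, variances):
    """Weighted sum of Gaussian densities; weights should sum to 1."""
    return sum(
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Hypothetical two-component mixture for one HMM state.
weights = [0.4, 0.6]
means = [-1.0, 2.0]
variances = [1.0, 0.5]
density_at_zero = mixture_pdf(0.0, weights, means, variances)
```

Each component contributes its own mean and variance, and the mixture weights decide how much each one counts, which is what lets the mixture take shapes a single Gaussian cannot.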

When presented with an observation sequence, a model can calculate the probability of having generated the observations, but since the observations do not uniquely determine a particular state sequence, it is not possible to know which states were active, and in what order. This is why the model is called hidden.

The transition probabilities and the probability distributions, together with their weights, are the parameters of an HMM.

And there you have it: two adaptation methods, MLLR and MAP, and the hidden Markov models they adapt.
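The probability of an observation sequence is usually computed with the forward algorithm, which sums over all possible state sequences instead of committing to one. The sketch below assumes discrete output symbols rather than Gaussian mixtures to keep it short; the function name and the toy two-state model are hypothetical.

```python
def forward_probability(initial, transitions, emissions, observations):
    """Probability that the HMM generated `observations`.

    initial:      list, initial[s] = P(start in state s)
    transitions:  matrix, transitions[p][s] = P(next state s | state p)
    emissions:    list of dicts, emissions[s][symbol] = P(symbol | state s)
    observations: non-empty list of symbols
    """
    n_states = len(initial)
    # alpha[s] = P(observations so far, currently in state s)
    alpha = [initial[s] * emissions[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transitions[p][s] for p in range(n_states))
            * emissions[s][obs]
            for s in range(n_states)
        ]
    # Summing over the final state marginalises out the hidden path.
    return sum(alpha)


# Hypothetical left-to-right model with two states and symbols "a" and "b".
initial = [1.0, 0.0]
transitions = [[0.6, 0.4], [0.0, 1.0]]
emissions = [{"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}]
prob_ab = forward_probability(initial, transitions, emissions, ["a", "b"])
```

For the sequence "a", "b" two different state paths contribute (staying in the first state, or moving to the second), and the result of about 0.35 is their combined probability. The observations alone do not reveal which path was taken.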

 
