Hidden Markov model (HMM)-based voice recognition
Modern general-purpose voice recognition systems are typically based on hidden Markov models (HMMs). These are statistical models that output a sequence of symbols or quantities. One reason HMMs are used in voice recognition is that a speech signal can be regarded as a piecewise stationary, or short-time stationary, signal. That is, over a short interval on the order of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic purposes.
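To make the idea of a model that outputs a sequence of symbols concrete, here is a minimal sketch of sampling from a toy HMM. The two states, the symbols "a" and "b", and every probability are made up for illustration; a real recognizer would emit acoustic feature vectors rather than letters.

```python
import random

# Hypothetical two-state model; all names and probabilities are illustrative.
STATES = ["s1", "s2"]
START = {"s1": 0.6, "s2": 0.4}
TRANS = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
EMIT = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}


def draw(dist, rng):
    """Sample one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_hmm(length, rng):
    """Return (hidden_states, observed_symbols), each of the given length."""
    states, symbols = [], []
    state = draw(START, rng)
    for _ in range(length):
        states.append(state)
        # The observer sees only the emitted symbol, not the state itself.
        symbols.append(draw(EMIT[state], rng))
        # The next state depends only on the current one (Markov property).
        state = draw(TRANS[state], rng)
    return states, symbols


hidden, observed = sample_hmm(10, random.Random(0))
print(observed)
```

Only the `observed` list would be visible to a recognizer; the `hidden` states are what it has to infer.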
The above describes the core elements of the most common HMM-based approach to voice recognition. Modern voice recognition systems use various combinations of a number of standard techniques to improve results over this basic approach. A typical large-vocabulary system would need context dependency for the phonemes (so that phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or the system might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection, followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques, which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).
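As a rough illustration of two of the steps mentioned above, the following sketch applies cepstral mean normalization and then appends delta and delta-delta coefficients. It is a simplification under stated assumptions: the function names are hypothetical, the deltas are plain central differences (production systems commonly use a regression over a wider window), and the input is a made-up list of two-dimensional frames rather than real cepstra.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def deltas(frames):
    """Central-difference deltas; the first and last frames are repeated as padding."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamic_features(frames):
    """Return frames of [static, delta, delta-delta] coefficients."""
    normalized = cepstral_mean_normalize(frames)
    d1 = deltas(normalized)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(normalized, d1, d2)]


# Toy input: three frames with two coefficients each (illustrative values).
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
features = add_dynamic_features(toy)
print(len(features[0]))  # each frame is now three times as wide
```

The delta terms approximate how each coefficient changes over time, which is the sense in which they capture speech dynamics.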
Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path. Here there is a choice between dynamically creating a combined hidden Markov model that includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach).
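The Viterbi search can be sketched on a toy discrete HMM as follows. This is a minimal illustration, not a decoder: the states "A" and "B", the observations "x", "y", "z" and all probabilities are invented, and a real system would search a far larger graph that also encodes the language model. Log-probabilities are used to avoid numerical underflow on long sequences.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_state_path, log_probability) for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model (all values are illustrative).
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.5, "y": 0.4, "z": 0.1},
          "B": {"x": 0.1, "y": 0.3, "z": 0.6}}

path, log_prob = viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p)
print(path, math.exp(log_prob))
```

For this toy model the best path is A, A, B with probability 0.3 × 0.7 × 0.4 × 0.3 × 0.6 = 0.01512.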
Dynamic time warping (DTW)-based voice recognition technology
Dynamic time warping is an approach that was historically used for voice recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected even if in one video the person was walking slowly and in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data that can be turned into a linear representation can be analyzed with DTW.
A well-known application has been automatic voice recognition, where it is used to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) under certain restrictions; that is, the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
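A minimal sketch of the classic dynamic-programming form of DTW is shown below. It assumes one-dimensional numeric sequences and an absolute-difference local cost, and it imposes no warping-window restriction; the function name and the test sequences are illustrative only.

```python
def dtw_distance(a, b):
    """Return the minimum cumulative alignment cost between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cost of aligning the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Allowed moves: advance in a only, in b only, or in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


slow = [1, 1, 2, 2, 3, 3]  # the same pattern, produced at half speed
fast = [1, 2, 3]
print(dtw_distance(slow, fast))       # 0.0: the sequences align perfectly
print(dtw_distance([1, 2, 3], [1, 2, 4]))  # 1.0: one unit of mismatch remains
```

The first call shows the point of the warping: the slow and fast versions differ in length and tempo, yet their alignment cost is zero.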