Feature extraction
Today, signal processing is carried out in the digital domain almost without exception. Before any such processing, the signal is sampled: its value is measured at evenly spaced points in time. According to the sampling theorem, a band-limited signal can be reconstructed exactly if the sampling frequency, Fs, is more than twice the highest frequency component of the signal. The signal is also quantized in amplitude, and the resulting quantization error, or noise, depends on the number of bits used per sample.
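The two steps can be illustrated with a short Python sketch. It is only a minimal example: the function names and the uniform quantizer over the range [-1, 1] are assumptions made for illustration, and the 6.02 N + 1.76 dB figure is the usual rule of thumb for a full-scale sine wave quantized uniformly with N bits.

```python
import math

def min_sampling_rate(highest_freq_hz):
    """Nyquist rate: Fs has to exceed this value for exact reconstruction."""
    return 2.0 * highest_freq_hz

def sample_sine(freq_hz, fs_hz, n_samples):
    """Sample a unit-amplitude sine at the evenly spaced times t = n / Fs."""
    return [math.sin(2.0 * math.pi * freq_hz * n / fs_hz)
            for n in range(n_samples)]

def quantize(x, n_bits):
    """Uniform quantizer for x in [-1, 1] (illustrative assumption).

    The range is split into 2**n_bits cells and x is replaced by the
    centre of its cell, so the error is at most half a step.
    """
    levels = 2 ** n_bits
    step = 2.0 / levels
    index = int(math.floor((x + 1.0) / step))
    index = min(levels - 1, max(0, index))  # keep x = 1.0 in the top cell
    return -1.0 + (index + 0.5) * step

def theoretical_sqnr_db(n_bits):
    """Rule-of-thumb signal-to-quantization-noise ratio in dB."""
    return 6.02 * n_bits + 1.76
```

For example, a signal with components up to 4 kHz needs Fs above min_sampling_rate(4000.0), i.e. above 8 kHz, and each extra bit per sample adds roughly 6 dB to the signal-to-quantization-noise ratio.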
For ASR (Automatic Speech Recognition) applications, the time-domain form of the signal is sub-optimal; a more compact and informative representation is preferred. Feature extraction, also called front-end analysis, is the process by which the audio signal is converted into a sequence of feature vectors. Several feature sets can be used for the vectors; MFCCs (Mel Frequency Cepstral Coefficients) are among the most widely used. Ideally, the features should:
• Allow an automatic system to discriminate between similar-sounding speech sounds.
• Allow models to be created without requiring an excessive amount of training data.
• Suppress characteristics of the speaker and the environment.
When the waveform is analyzed, the time-domain signal is treated as a sequence of frames, each of which is converted into a feature vector. The frames are typically 25 ms long, which is short enough for the signal within a frame to be assumed to come from a stationary process. Each frame is first multiplied by a window function, typically a Hamming window.
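The framing and windowing step can be sketched as follows. The 25 ms frame length and the Hamming window come from the text; the 10 ms frame shift and the function names are assumptions chosen for illustration. The result is a list of windowed frames, from which a feature vector such as MFCCs would then be computed.

```python
import math

def hamming(n):
    """Hamming window: w[i] = 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
            for i in range(n)]

def frame_signal(samples, fs_hz, frame_ms=25.0, shift_ms=10.0):
    """Split samples into overlapping frames and window each one.

    frame_ms follows the text; shift_ms (10 ms) is an assumed value.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(fs_hz * frame_ms / 1000.0))
    shift = int(round(fs_hz * shift_ms / 1000.0))
    window = hamming(frame_len)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
        start += shift
    return frames
```

At a sampling rate of 16 kHz, a 25 ms frame holds 400 samples and a 10 ms shift moves the frame forward by 160 samples, so consecutive frames overlap.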
Classification
Once the speech has been converted into feature vectors, the next step is to recognize what was actually spoken. There are several approaches to this problem in automatic speech recognition, and the principal ones are described briefly here. They include knowledge-based methods, template matching, stochastic methods and connectionist methods. These approaches are not mutually exclusive.
Pattern matching techniques
A pattern matching system for automatic speech recognition is built on the idea of matching input utterances against a number of pre-stored templates, i.e. example acoustic patterns. Typically each template corresponds to a word in the vocabulary. The system calculates the acoustic distance between the input utterance and each of the stored templates and chooses the template that is most similar to the input.
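A minimal sketch of such a matcher is shown below. The text does not name a distance measure; dynamic time warping (DTW) with a Euclidean frame distance is used here as one common choice, because it tolerates utterances spoken at different speeds. The function names and the dictionary of templates are illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two vector sequences.

    cost[i][j] is the cheapest alignment of the first i frames of
    seq_a with the first j frames of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat frame of seq_b
                                 cost[i][j - 1],      # repeat frame of seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(utterance, templates):
    """Return the word whose template is closest to the utterance.

    templates maps each vocabulary word to one stored feature sequence.
    """
    return min(templates,
               key=lambda word: dtw_distance(utterance, templates[word]))
```

With one template per word, recognize(utterance, templates) compares the input with every stored pattern, so the cost grows linearly with the size of the vocabulary.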
Neural networks
Neural networks are an attempt to model certain properties of the human nervous system, an approach that has found application in automatic speech recognition.
A network consists of a large number of nodes. These nodes are organized in layers and interconnected by weights of varying strength. Information is fed into an input layer, processed by the network, and then passed to a layer of output units. The response of each node is typically a non-linear function of the weighted sum of its inputs.
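The computation performed by such a network can be sketched in a few lines. The sigmoid is assumed here as the non-linear function, and the bias term and function names are likewise illustrative choices rather than anything specified in the text.

```python
import math

def sigmoid(z):
    """Assumed non-linearity; it squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, bias=0.0):
    """Non-linear function of the weighted sum of a node's inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)

def layer_output(inputs, weight_rows, biases):
    """Outputs of one layer; weight_rows holds one weight list per node."""
    return [node_output(inputs, w, b) for w, b in zip(weight_rows, biases)]

def forward(inputs, layers):
    """Pass the inputs through each (weight_rows, biases) layer in turn."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer_output(activations, weight_rows, biases)
    return activations
```

Here forward(inputs, layers) takes the values presented to the input layer and returns the activations of the output units, one per output node.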
The network's ability to classify the input correctly depends on the values of the weights, and the optimal values are determined during training. In training, some acoustic information, e.g. spectral amplitudes, is presented to the input nodes of the network, and the output value is compared with the desired value, e.g. a phoneme. The error, i.e. the difference between the desired and the actual output, is used to adjust the weights of the network. This process is repeated many times for each training utterance to increase the likelihood of a correct classification and thus the accuracy of automatic speech recognition.
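The training loop can be illustrated with a single sigmoid node trained by gradient descent on the squared error (the delta rule). This is a simplified sketch and not a full speech recognizer: the learning rate, the number of epochs, the zero initial weights and the toy two-input data set used in the usage note are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Output of one sigmoid node for the given inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train(examples, n_inputs, learning_rate=0.5, epochs=2000):
    """Adjust the weights from the error on each (inputs, target) pair.

    learning_rate and epochs are assumed values, not tuned settings.
    """
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            out = predict(weights, bias, inputs)
            error = target - out              # desired minus actual output
            delta = error * out * (1.0 - out)  # scaled by sigmoid slope
            weights = [w + learning_rate * delta * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * delta
    return weights, bias

def total_squared_error(examples, weights, bias):
    """Sum of squared errors over all training examples."""
    return sum((t - predict(weights, bias, x)) ** 2 for x, t in examples)
```

Training on a toy set such as [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)] drives the total squared error down from its starting value as the examples are presented repeatedly. In a real recognizer the inputs would be acoustic features and the targets phoneme labels, with many nodes arranged in several layers.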