The performance of a voice recognition system is usually expressed in terms of accuracy and speed. Accuracy is typically rated by word error rate (WER), whereas speed is measured with the real-time factor. Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR).
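As a rough illustration of these two metrics, the sketch below computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript, and the real-time factor as processing time divided by audio duration. The function names `word_error_rate` and `real_time_factor` are illustrative choices, not part of any particular product, and the whitespace tokenization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

For example, recognizing "open a file" when the speaker said "open the file" is one substitution in three reference words, so the WER is 1/3. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.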
Most users of voice recognition systems would agree that dictation machines can achieve very high performance under controlled conditions. There is some confusion, however, over whether the terms "speech recognition" and "dictation" are interchangeable.
Commercially available speaker-dependent dictation systems usually require only a short period of training (often called ‘enrollment’) and can capture continuous speech with a large vocabulary at a normal pace with very high accuracy. Most commercial vendors claim that recognition software can achieve between 98% and 99% accuracy when operated under optimal conditions. ‘Optimal conditions’ usually assume that users:
• have speech characteristics that match the training data,
• can achieve proper speaker adaptation, and
• work in a low-noise environment (e.g. a quiet office or laboratory space).
This explains why some users, particularly those whose speech is heavily accented, might have recognition rates much lower than expected. Speech recognition in video has become a common search technology used by several video search companies.
Restricted-vocabulary systems, which need no training, can recognize a small number of words (for example, the ten digits) as spoken by most speakers. Such systems are common for routing incoming phone calls to their destinations in large organizations.
Both acoustic modeling and language modeling are important elements of modern statistically based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling also has other applications, such as smart keyboards and document classification.
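To make the language modeling idea concrete, here is a minimal sketch of a bigram model used the way a smart keyboard might use one: it counts which word follows which in some training text and suggests the most frequent followers. The helper names `train_bigrams` and `suggest_next` and the toy sentences are assumptions made for illustration. Production systems use far larger corpora and smoothing.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def suggest_next(counts, word, k=3):
    """Return up to k of the most frequent followers of `word`."""
    followers = counts.get(word.lower(), Counter())
    return [candidate for candidate, _ in followers.most_common(k)]


# Toy training data (hypothetical); real models train on large text corpora.
corpus = [
    "call the office",
    "call the manager",
    "call the office now",
]
model = train_bigrams(corpus)
suggestions = suggest_next(model, "the")
```

In a recognizer, the same kind of counts help choose between acoustically similar candidates by favoring word sequences that occurred more often in the training text.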
One well-known technology with voice recognition support is Microsoft Agent.
Microsoft Agent is deprecated as of Windows 7 and may be unavailable in later versions of Windows. It is a set of programmable software services that supports the presentation of interactive animated characters within the Microsoft Windows® interface. Developers can use these characters as interactive assistants to introduce, guide, entertain, or otherwise enhance their Web pages or applications, in addition to the conventional use of windows, menus, and controls.
Microsoft Agent enables software developers and Web authors to incorporate a new form of user interaction, known as conversational interfaces, that draws on natural aspects of human social communication. In addition to the usual mouse and keyboard input, Microsoft Agent includes optional support for speech recognition so applications can respond to voice commands. Characters can respond using synthesized speech, recorded audio, or text in a cartoon word balloon.
The conversational interface approach facilitated by the Microsoft Agent services does not replace conventional graphical user interface (GUI) design. Instead, character interaction can be easily combined with conventional interface elements such as windows, menus, and controls to extend and enhance your application's interface.
Microsoft Agent's programming interfaces make it possible to animate a character to respond to user input. Animated characters appear in their own window, providing maximum flexibility for where they can be displayed on the screen. Microsoft Agent includes an ActiveX® control that makes its services accessible to programming languages that support ActiveX, including Web scripting languages such as Visual Basic® Scripting Edition (VBScript).