Speech Recognition Engine

Personal computers became powerful enough in the mid to late 1990s for users to talk to them and for the computers to talk back. Although speech technology is still far from supporting natural, unstructured dialogue with computers, it already delivers real benefits in practical applications. For example:

• Many large firms have added speech recognition engines to their computer-based Interactive Voice Response systems. Just by calling a number and speaking, users can buy and sell stocks through a brokerage, check flight information with an airline, or order goods from a retailer.

• Microsoft Office XP (Office XP) users in the United States, Japan, and China can dictate text into Microsoft Word or PowerPoint documents. Users can also dictate commands and operate menus by speaking. For many users, especially speakers of Japanese and Chinese, dictating is much quicker and more convenient than using a keyboard.

The two key technologies behind speech-enabled computer applications are speech recognition (SR) and speech synthesis.

Introduction to Computer Speech Recognition

Speech recognition (SR) is the process of converting spoken language into written text. Speech recognition, also called speech-to-text recognition, comprises the following steps:

1.  Obtaining and digitizing the sound waves created by a human speaker.

2.  Converting the digitized sound waves into the basic sound units of a language, called phonemes.

3.  Building words from the phonemes.

4.  Analyzing the context in which the words appear to ensure the correct spelling of words that sound similar (such as bat and pat).
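Steps 3 and 4 above can be illustrated with a minimal Python sketch. It is a toy, not how any particular engine works: the phoneme symbols, the `LEXICON` table, the `BIGRAM_SCORE` values, and all function names are hypothetical and chosen only for illustration. Steps 1 and 2 (capturing audio and extracting phonemes) are assumed to have already produced the phoneme groups.

```python
# Hypothetical pronunciation lexicon: phoneme sequence -> candidate spellings.
LEXICON = {
    ("AY",): ["I"],
    ("W", "IH", "L"): ["will"],
    ("T", "ER", "N"): ["turn"],
    ("R", "AY", "T"): ["right", "write"],  # two words that sound alike
}

# Hypothetical context scores: how plausible a word is after the previous word.
BIGRAM_SCORE = {
    ("will", "write"): 5,
    ("will", "right"): 1,
    ("turn", "right"): 5,
    ("turn", "write"): 1,
}


def words_from_phonemes(phoneme_groups):
    """Step 3: look up each phoneme group and return its candidate spellings."""
    return [LEXICON[tuple(group)] for group in phoneme_groups]


def pick_by_context(candidates_per_word):
    """Step 4: choose among sound-alike candidates using the previous word."""
    chosen = []
    for candidates in candidates_per_word:
        previous = chosen[-1] if chosen else None
        # Unknown word pairs score 0, so the first candidate wins on a tie.
        best = max(candidates, key=lambda w: BIGRAM_SCORE.get((previous, w), 0))
        chosen.append(best)
    return chosen


def recognize(phoneme_groups):
    """Run steps 3 and 4 and return the recognized text."""
    return " ".join(pick_by_context(words_from_phonemes(phoneme_groups)))


print(recognize([["AY"], ["W", "IH", "L"], ["R", "AY", "T"]]))  # I will write
print(recognize([["T", "ER", "N"], ["R", "AY", "T"]]))          # turn right
```

The same phoneme group yields "write" after "will" and "right" after "turn", which is the kind of context-based choice step 4 describes. Production engines use statistical language models over far larger vocabularies rather than a hand-written table.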

Recognizers (also called speech recognition engines) are the software drivers that convert the acoustic signal to a digital signal and deliver the recognized speech as text to an application. Most speech recognition engines support continuous speech recognition, meaning that users can speak naturally into a microphone at normal conversational speed. Isolated, or discrete, speech recognizers, which require the user to pause after each word, are being replaced by continuous speech engines.

Continuous speech recognition engines today provide two modes of speech recognition:

• Dictation, in which the user enters data by speaking directly to the computer.

• Command and control, in which the user initiates actions by speaking commands or asking questions.

In dictation mode, users can dictate memos, letters, and e-mail messages, as well as enter data. The size of the recognizer's grammar limits what can be recognized. Most recognizers that support dictation mode are speaker-dependent, meaning that accuracy varies with the user's speaking patterns and accent. To ensure the most accurate recognition, the application must create or access a speaker profile that includes information about the user's speech patterns.

In command and control mode, users speak commands that control the functions of an application. Implementing command and control mode is the simplest way for developers to add a speech interface to an existing application, because developers can restrict the recognition grammar to the available commands. This restriction has several advantages:

• It delivers better accuracy and performance than dictation, because a speech recognition engine used for dictation must cover nearly the entire vocabulary of a language.

• It reduces the processing overhead that the application requires.

• It also allows speaker-independent processing, reducing the need for speaker profiles or "training" of the recognizer.
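The effect of a restricted grammar can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `COMMANDS` list, the `match_command` function, and the 0.6 cutoff are hypothetical, and the text-similarity matching from Python's standard `difflib` module stands in for the acoustic matching a real engine performs.

```python
import difflib

# Hypothetical command grammar: the only phrases the application accepts.
COMMANDS = ["open file", "save file", "close window", "print document"]


def match_command(hypothesis, commands=COMMANDS, cutoff=0.6):
    """Map a (possibly misheard) phrase to the closest allowed command.

    Returns None when nothing in the grammar is similar enough, so the
    application can ask the user to repeat instead of guessing.
    """
    matches = difflib.get_close_matches(
        hypothesis.lower(), commands, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(match_command("safe file"))  # save file
print(match_command("xyz"))        # None
```

Because only four phrases are possible, a slightly misheard input such as "safe file" still resolves to the intended command, while input unrelated to the grammar is rejected. A dictation engine has no such short list to fall back on.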

 
