Introduction and History of Audiovisual Speech Synthesis

Speech communication depends not only on auditory but also on visual information. Facial motions such as smiling, grinning, eye blinking, head nodding, and eyebrow raising give essential additional information about the speaker's emotional state. The emotional state may even be deduced from facial expression without any sound. Fluent speech is also emphasized and punctuated by facial expressions. Combining visual information with synthesized speech can also improve intelligibility significantly, particularly when the auditory speech is degraded by noise, bandwidth filtering, or hearing impairment (Cohen et al. 1993, Beskow 1996, Le Goff et al. 1996). The visual information is particularly helpful with front phonemes whose articulation is visible, such as labiodentals and bilabials. For instance, the discrimination between /b/ and /d/ improves dramatically with visual information (Santen et al. 1997). A synthetic face also improves the intelligibility of natural speech. However, the facial gestures and the speech must be coherent; without coherence, intelligibility may even decrease. An interesting example of conflicting audio and video is the so-called McGurk effect: if an audio syllable /ba/ is dubbed onto a visual /ga/, it is perceived as /da/ (Cohen et al. 1993, Cole et al. 1995).

Human facial expression has been under study for more than one hundred years. The first computer-based face models and animations were created over 25 years ago. In 1972 Parke developed the first three-dimensional face model, and in 1974 he developed the first version of his well-known parametric three-dimensional model. Because computer capabilities have increased quickly during the last decades, the development of facial animation has also been very fast, and it is likely to remain fast as users become more comfortable conversing with machines.

Facial animation has been combined with synthetic speech for more than ten years. Most present audiovisual speech synthesizers are based on a parametric face model presented by Parke in 1982. The model consists of a mesh of about 800 polygons that approximates the surface of a human face, including expressive features such as the eyes, the eyebrows, the lips, and the teeth. The polygon surface is controlled by 50 parameters (Beskow 1996). However, current systems contain a number of additions to the Parke model that improve it and make it more suitable for synthesized speech. These are usually a set of rules for generating facial control parameter trajectories from phonetic text and a simple tongue model, neither of which was included in the original Parke model.
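To make the idea of generating control parameter trajectories from phonetic text more concrete, the following is a minimal sketch in Python. It is not the rule set of Parke, Beskow, or any other cited system. The two parameter names, the target values, and the rule itself (reach each phoneme's target at the midpoint of the phoneme and interpolate linearly between midpoints) are all assumptions chosen only for illustration. A real system would drive far more parameters and would model coarticulation.

```python
# Hypothetical control parameters; a real parametric face model has many more.
PARAMETER_NAMES = ("jaw_open", "lip_round")

# Hypothetical target values per phoneme, on an arbitrary 0..1 scale.
VISEME_TARGETS = {
    "sil": {"jaw_open": 0.0, "lip_round": 0.0},
    "b": {"jaw_open": 0.0, "lip_round": 0.0},
    "a": {"jaw_open": 1.0, "lip_round": 0.0},
    "u": {"jaw_open": 0.25, "lip_round": 1.0},
}


def keyframes(phones):
    """Place one keyframe at the temporal midpoint of each phoneme.

    phones is a list of (phoneme, duration_in_seconds) pairs with
    positive durations. Returns the keyframe list and the total duration.
    """
    keys = []
    start = 0.0
    for phoneme, duration in phones:
        keys.append((start + duration / 2.0, VISEME_TARGETS[phoneme]))
        start += duration
    return keys, start


def sample(keys, time, name):
    """Linearly interpolate one parameter between neighbouring keyframes."""
    if time <= keys[0][0]:
        return keys[0][1][name]  # hold the first target before its keyframe
    for (t0, v0), (t1, v1) in zip(keys, keys[1:]):
        if time <= t1:
            weight = (time - t0) / (t1 - t0)
            return (1.0 - weight) * v0[name] + weight * v1[name]
    return keys[-1][1][name]  # hold the last target after its keyframe


def parameter_trajectory(phones, frame_rate=25):
    """Return one dict of parameter values per video frame."""
    keys, total = keyframes(phones)
    n_frames = int(round(total * frame_rate))
    return [
        {name: sample(keys, i / frame_rate, name) for name in PARAMETER_NAMES}
        for i in range(n_frames + 1)
    ]


if __name__ == "__main__":
    # The syllable /ba/ with made-up durations.
    for frame in parameter_trajectory([("b", 0.08), ("a", 0.16)]):
        print(frame)
```

In a complete synthesizer, each frame of such a trajectory would be passed to the face model, which deforms the polygon mesh accordingly, while the same phoneme durations drive the audio synthesis so that the two stay coherent.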

Audiovisual speech synthesis may be used in many applications. The additional visual information is very useful for hearing-impaired people, and it can serve as a tool for interactive training in speech reading. A face with semi-transparent skin and a well-modeled tongue can also be used to show tongue positions in speech training for deaf children (Beskow 1996). Audiovisual synthesis may be used in information systems in public and noisy environments, such as airports, train stations, and shopping centers. If the talking head can be made to look like a particular individual, it may be used in videoconferencing or as a synthetic newsreader. Multimedia is also an important application field for talking heads. A fully synthetic storyteller, for example, needs considerably less storage capacity than movie clips.

 
