Overview of Text-To-Speech Technology

Speech has always been the primary means of communication between people. Speech synthesis, the automatic generation of speech waveforms, has been under development for several decades. Recent advances have produced a wide range of synthesizers with very high intelligibility; however, sound quality and naturalness remain major problems. That said, the quality of text-to-speech technology has reached a level sufficient for many applications, such as telecommunications and multimedia. Speech intelligibility can be increased considerably by adding audiovisual information or facial animation, i.e. a talking head. Innovation continues, and some text-to-speech techniques, for audiovisual speech in particular, have been developed only recently.

 

The text-to-speech, or speech synthesis, procedure consists of two main phases. The first is text analysis, in which the input text is converted into a phonetic or other linguistic representation. The second is the generation of speech waveforms, in which the audio output is produced from this phonetic information. These two phases are typically called high-level and low-level synthesis. The input text might be, for example, data from a word processor document, standard ASCII characters from an e-mail, a mobile SMS, or scanned text from a hard copy source. The character string is processed and analyzed into a phonetic form, typically a string of phonemes with some extra information for correct intonation, timing, and stress. The speech sound is finally generated by the low-level synthesizer, which works from the information produced by the high-level one.
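The two phases described above can be sketched as two functions. This is only a toy illustration: the lexicon, the phoneme symbols, the letter-by-letter fallback, and the placeholder tone generator are all assumptions made for the example, not part of any real synthesizer.

```python
import math
import re

# Hypothetical toy lexicon. A real system would use a full pronunciation
# dictionary plus letter-to-sound rules.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_to_phonemes(text):
    """High-level synthesis: turn raw text into a flat list of phonemes."""
    words = re.findall(r"[a-z']+", text.lower())
    phonemes = []
    for word in words:
        # Crude fallback (assumption): spell unknown words letter by letter.
        phonemes.extend(LEXICON.get(word, list(word.upper())))
    return phonemes


def phonemes_to_waveform(phonemes, sample_rate=8000, duration_ms=80):
    """Low-level synthesis stand-in: emit one short sine tone per phoneme.

    This does not produce speech. It only marks where a real waveform
    generator would consume the phonetic string.
    """
    samples_per_phoneme = sample_rate * duration_ms // 1000
    samples = []
    for index, _phoneme in enumerate(phonemes):
        freq = 200.0 + 40.0 * (index % 5)  # arbitrary placeholder pitch
        for t in range(samples_per_phoneme):
            samples.append(math.sin(2.0 * math.pi * freq * t / sample_rate))
    return samples


phonemes = text_to_phonemes("Hello, world!")
waveform = phonemes_to_waveform(phonemes)
```

Keeping the two phases behind separate functions mirrors the high-level/low-level split: the phonetic string is the only interface between them, so either side could be replaced independently.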

 

The easiest way to create synthetic speech is to concatenate recorded samples of natural speech, such as individual words or sometimes phrases. This concatenation method gives high quality and naturalness, but it is usually limited to a fixed vocabulary and a single voice. The technique is well suited to some announcement and information systems. However, building a database of all the words and common names in the world would clearly be an enormous task. It may even be inappropriate to call this speech synthesis, because it consists purely of recordings. Thus, for unrestricted speech synthesis with real text-to-speech technology, we have to work with shorter units of the speech signal, such as phonemes, syllables, diphones, or even shorter segments.
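As a rough sketch of the concatenation idea described above, the function below joins recorded units, represented here simply as lists of samples, and blends a few samples at each seam with a linear crossfade. The crossfade and the function name are assumptions for illustration; practical systems use more careful unit selection and smoothing.

```python
def crossfade_concat(units, overlap):
    """Join sample lists, blending up to `overlap` samples at each seam."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit, rising over the seam
            out[start + i] = (1.0 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


# Two dummy "recordings": a constant 1.0 segment followed by a constant 0.0 segment.
joined = crossfade_concat([[1.0] * 4, [0.0] * 4], overlap=2)
```

With an overlap of two samples, the joined signal is six samples long and steps down gradually across the seam instead of jumping from 1.0 to 0.0.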

 

Another widely used technique for creating synthetic speech is formant synthesis, which is based on the source-filter model of speech production. This method is also called terminal analogy, because it models only the voice source and the formant frequencies rather than the physical characteristics of the vocal tract. The excitation signal can be either voiced, with a fundamental frequency, or unvoiced noise. A mixed excitation of the two may also be used for voiced consonants and some aspirated sounds. The excitation is then gain-adjusted and passed through a vocal tract filter built from resonators that correspond to the formants of human speech.
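The source-filter idea can be sketched with a voiced excitation (an impulse train at the fundamental frequency) passed through a cascade of second-order digital resonators, one per formant. The formant frequencies and bandwidths below are illustrative assumptions, roughly in the range of an open vowel, and the function names are made up for this example.

```python
import math


def resonator(signal, freq, bandwidth, sample_rate):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0, n_samples, sample_rate):
    """Voiced excitation: one unit impulse per pitch period."""
    period = int(round(sample_rate / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def synthesize_vowel(f0=100.0, n_samples=1600, sample_rate=16000,
                     formants=((700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0))):
    """Filter the excitation through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0, n_samples, sample_rate)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, sample_rate)
    return signal


vowel = synthesize_vowel()
```

For an unvoiced sound, the impulse train would be replaced by random noise, and a mixed excitation would sum the two before filtering.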

 

In theory, the most accurate way to create artificial speech is to model the human speech production system directly. This method is called articulatory synthesis, and it typically involves models of the human articulators and of the vocal cords. The articulators are usually modeled with a set of area functions for small tube sections. The vocal cord model is used to generate a suitable excitation signal; it may be, for example, a two-mass model with two vibrating masses. Articulatory synthesis promises high-quality synthesized speech, but because of the complexity of the method, it has not yet delivered on that promise.
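One common way to work with an area function, assumed here for illustration, is to treat the vocal tract as a chain of short uniform tubes and compute a reflection coefficient at each junction from the neighbouring cross-sectional areas. The sign convention and the area values below are assumptions; texts differ on which direction counts as positive.

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tube sections.

    Uses r = (A_k - A_next) / (A_k + A_next). The opposite sign convention
    is also found in the literature.
    """
    if any(a <= 0 for a in areas):
        raise ValueError("cross-sectional areas must be positive")
    return [
        (a_k - a_next) / (a_k + a_next)
        for a_k, a_next in zip(areas, areas[1:])
    ]


# Hypothetical area function in cm^2, listed from one end of the tract to the other.
areas = [1.0, 3.0, 3.0, 1.5]
coefficients = reflection_coefficients(areas)
```

A junction between equal areas gives a coefficient of zero, meaning no reflection, and the magnitude of every coefficient stays below 1 as long as all areas are positive.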

 
