The PAT and OVE synthesizers sparked a debate over how the transfer function of the acoustic tube should be modeled: in parallel or in cascade. In 1972, after studying these synthesizers for a few years, John Holmes introduced his parallel formant synthesizer. He tuned the synthesized sentence "I enjoy the simple life" by hand so well that the average listener could not distinguish it from the natural original. About a year later he introduced a parallel formant synthesizer developed with the JSRU (Joint Speech Research Unit). This was an important step in the development of text-to-speech synthesis.
In 1958 George Rosen introduced the first articulatory synthesizer at the Massachusetts Institute of Technology (M.I.T.). The DAVO (Dynamic Analog of the Vocal tract) was controlled by a tape recording of hand-made control signals. The first experiments with Linear Predictive Coding (LPC) were made in the mid-1960s. Linear prediction was first used in low-cost systems, such as the TI Speak'n'Spell in 1980, and its quality was quite poor compared to present systems. However, with some refinements to the basic model, the technique has proved very useful and is used in many current systems.
In 1968 the first working text-to-speech system for English was developed at the Electrotechnical Laboratory, Japan, by Noriko Umeda and others. It was based on an articulatory model and included a syntactic analysis module with complicated heuristics. The speech was quite intelligible but monotonous, and far from the quality of current systems. Text-to-speech synthesis still required much work.
In 1979 Allen, Hunnicutt, and Klatt demonstrated MITalk, a laboratory text-to-speech system developed at M.I.T. With some modifications, it was later also used in the commercial TTS system of Telesensory Systems Inc. (TSI). Two years later Dennis Klatt introduced his famous Klattalk system, which used a new, sophisticated voicing source. The technology used in the MITalk and Klattalk systems forms the basis for many synthesis systems today, such as DECtalk and Prose-2000.
In 1976 Kurzweil introduced the first reading aid with an optical scanner. The Kurzweil Reading Machines for the Blind could read even multifont text very well. However, the system was too expensive for ordinary customers, so it was used mainly in libraries and service centers for visually impaired people. People were beginning to see the various applications of text-to-speech synthesis.
In the late 1970s and early 1980s, a significant number of text-to-speech and speech synthesis products became commercially available. The first integrated circuit for speech synthesis was probably the Votrax chip, which comprised a cascade formant synthesizer and simple low-pass smoothing circuits. In 1978 Richard Gagnon presented an inexpensive Votrax-based Type-n-Talk system. In 1980 Texas Instruments presented the Speak-n-Spell synthesizer, based on linear predictive coding (LPC) and built around a low-cost linear prediction synthesis chip (TMS-5100). It was used in an electronic reading aid for children and was quite successful. In 1982 Street Electronics introduced the Echo, a low-cost diphone synthesizer built on a newer version (TMS-5220) of the chip used in Speak-n-Spell. During the same period Speech Plus Inc. presented the Prose-2000 text-to-speech system.
The first commercial versions of the well-known DECtalk and Infovox SA-101 synthesizers were introduced a year later.
Current speech synthesis technologies rely on very sophisticated techniques and algorithms. Hidden Markov models (HMMs) are among the methods applied more recently in speech synthesis. HMMs have been used in speech recognition since the late 1970s, and they have been used in speech synthesis systems for more than three decades. The various advances over the years have made today's text-to-speech synthesis commonplace.