Internet Speech Markup Languages

Most synthesizers accept only plain text as input. However, it is hard to derive correct pronunciation and prosody from written text alone. In some cases the output speech also needs to convey speaker characteristics or an emotional state, and this is where internet speech markup languages come in handy. With some additional information in the input data, these features of speech become easy to control. For example, if the input indicates whether a sentence is a question, an imperative, or a neutral statement, controlling prosody may become significantly easier. Some commercial systems let the user place marks in the text to produce more natural-sounding speech.

In normal HTML, markup tags such as <p> ... </p> define paragraphs and help the web browser build the correct output. Similar tags may be used to help a speech synthesizer create the correct output with different kinds of pronunciation, voices, and other features. For instance, we may add the tags <happy>...</happy> to express happiness or <quest>...</quest> to mark a question. The speaker's characteristics and the language used may be managed in the same way, with tags such as <gender=female> or <lang=fin>. Some words and proper names have irregular pronunciations, which may be corrected with the same kind of tags. Local stress markers may also be used to stress a certain word in a phrase.
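To make the idea concrete, here is a minimal sketch of how a synthesizer front end might split tagged input into segments, each labelled with the styles that apply to it. The tag names come from the examples above. The function name split_by_style and the handling of unknown tags are assumptions made for illustration, not part of any particular system.

```python
import re

# Hypothetical style tags, taken from the examples in the text.
STYLE_TAGS = {"happy", "quest"}

# Matches simple opening or closing tags such as <happy> or </happy>.
TOKEN = re.compile(r"<(/?)(\w+)>")


def split_by_style(marked_up):
    """Return a list of (text, active_styles) segments for a tagged string."""
    segments = []
    active = []  # stack of currently open style tags
    pos = 0
    for match in TOKEN.finditer(marked_up):
        closing, name = match.group(1), match.group(2)
        if name not in STYLE_TAGS:
            continue  # unknown tags are left in the text untouched
        text = marked_up[pos:match.start()].strip()
        if text:
            segments.append((text, tuple(active)))
        if closing:
            # Only close a tag that is actually the innermost open one.
            if active and active[-1] == name:
                active.pop()
        else:
            active.append(name)
        pos = match.end()
    tail = marked_up[pos:].strip()
    if tail:
        segments.append((tail, tuple(active)))
    return segments


sample = "Hello. <happy>We won!</happy> <quest>Did you see it</quest>"
for text, styles in split_by_style(sample):
    print(styles, text)
```

A real front end would pass each segment and its styles on to the prosody module. The sketch only shows the bookkeeping needed to know which tags are open at any point in the text.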

The first attempt to develop a TTS markup language was SSML (Speech Synthesis Markup Language), developed at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh, Scotland, in 1995 (Taylor et al. 1997). It comprised control tags for phrase boundaries and language, and made it possible to specify the pronunciation of a specific word and to add emphasis tags within a sentence. When a pronunciation is specified, pro gives the pronunciation of the word and format identifies the lexicon format in which it is written. With the tag <phrase> it is even possible to change the meaning of the whole sentence, because the placement of phrase boundaries determines how an ambiguous sentence is read.
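The role of pro and format can be illustrated with a small sketch that collects pronunciation overrides from marked-up text. The element name define, its attribute order, the phone string, and the format label example_phones are all assumptions chosen for illustration. They are not taken from the SSML specification.

```python
import re

# Assumed syntax: <define word="..." pro="..." format="...">.
# The element and attribute layout are hypothetical, not confirmed SSML.
DEFINE = re.compile(
    r'<define\s+word="([^"]+)"\s+pro="([^"]+)"\s+format="([^"]+)"\s*>'
)


def collect_pronunciations(marked_up):
    """Map each defined word to its (pronunciation, lexicon format) pair."""
    return {
        word.lower(): (pro, fmt)
        for word, pro, fmt in DEFINE.findall(marked_up)
    }


def strip_defines(marked_up):
    """Remove the define tags so that only the text to be spoken remains."""
    return DEFINE.sub("", marked_up)


# Placeholder phone string and format name, for illustration only.
sample = (
    'I said <define word="tomato" pro="t ax m aa t ow" '
    'format="example_phones">tomato.'
)
overrides = collect_pronunciations(sample)
print(overrides["tomato"])
print(strip_defines(sample))
```

Keeping the format alongside each pronunciation lets the synthesizer decide how to interpret the phone symbols, which is the purpose the text ascribes to that attribute.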

Development of the language is currently continuing in cooperation with Bell Laboratories. The newest version is called STML (Spoken Text Markup Language). Sun Microsystems is also participating in the development process, with the aim of merging its JSML (Java Speech Markup Language) into a single widely adopted system in the near future. The set of controllable features is now much wider than in SSML.

The structure of STML is easiest to understand through its tags. The language and the default speaker of that language are set with the tags <language id> and <speaker id>. The tag <genre type> sets the type of text, such as plain prose, poetry, or lists. The tag <div type> marks a genre-specific division of the text, such as a list item. The tag <emph> specifies the degree of emphasis of the following word. The tag <phonetic> indicates that the enclosed region is a phonetic transcription in one of a predefined set of schemes. The tag <define> is used to specify the lexical pronunciation of a certain word. The tag <intonat> sets the midline and amplitude of the pitch range, either on an absolute scale in hertz or as a multiplier relative to the speaker's normal pitch. The tag <bound> specifies a boundary with a strength between 0 (weakest) and 5 (strongest). The <literal mode> tag is used for spelling mode, and the <omitted> tag marks a region that is omitted from the output speech.
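The two numeric controls described above, pitch given in hertz or as a relative multiplier and boundary strength from 0 to 5, can be sketched as follows. The string forms "120Hz" and "x1.5" and the function names are assumptions made for this example. The source does not give the actual STML attribute syntax.

```python
def resolve_pitch(setting, speaker_normal_hz):
    """Turn a pitch setting into hertz.

    Assumed notation (hypothetical, not STML syntax):
      "120Hz" -> absolute value in hertz
      "x1.5"  -> multiplier relative to the speaker's normal pitch
    """
    setting = setting.strip().lower()
    if setting.endswith("hz"):
        return float(setting[:-2])
    if setting.startswith("x"):
        return speaker_normal_hz * float(setting[1:])
    raise ValueError(f"unrecognized pitch setting: {setting!r}")


def check_boundary(strength):
    """Validate a boundary strength: 0 is the weakest, 5 the strongest."""
    if not isinstance(strength, int) or not 0 <= strength <= 5:
        raise ValueError("boundary strength must be an integer from 0 to 5")
    return strength


# Example: a speaker whose normal pitch midline is assumed to be 100 Hz.
print(resolve_pitch("120Hz", 100.0))
print(resolve_pitch("x1.5", 100.0))
print(check_boundary(3))
```

Resolving relative settings against the speaker's normal pitch means the same markup can be reused with different voices, while absolute values pin the pitch regardless of speaker.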

With internet speech markup languages you have the level of control necessary for today’s text-to-speech software.

 
