Creative Developers' Corner - Developer Digest

CREATIVE TEXTASSIST™: AN INSIGHT

How would you describe the quality of today's text-to-speech? Mechanical? Expressionless? Inhuman? Practically intolerable? As speech synthesis technology becomes more prevalent, the need for higher-quality text-to-speech continues to grow. Text-to-speech which is merely understandable or discernible is no longer good enough. Users want more -- text-to-speech which sounds natural.

Creative Technology's introduction of Creative TextAssist addresses these growing needs. Using Digital Equipment Corporation's premier DECtalk™ technology and the added horsepower provided by the Sound Blaster 16/AWE32, TextAssist provides the most realistic text-to-speech synthesis available anywhere, on any platform.

Text-to-speech Synthesis...a Primer

At first blush, one might envision the creation of a text-to-speech engine as being a fairly trivial task...just store recordings of the words on disk, concatenate the words to form sentences and play the sentences using audio hardware. Unfortunately, this approach is doomed to fail. It turns out that text-to-speech synthesis is not that easy. In fact, the ultimate goal of producing completely human-sounding speech still lies beyond our grasp. However, advances in linguistics, acoustics, perceptual psychology, mathematical modeling, structured programming, and computational horsepower are bringing us closer to this goal.

An easy way to visualize text-to-speech synthesis is in two parts:

The Language Module -> The Synthesizer Module -> Audio Hardware

Very simply, the Language Module converts incoming text to a set of control parameters used by the synthesizer. The Synthesizer Module uses these control parameters to create speech waveforms which are played back on audio hardware. As one might expect, the Language Module is language-specific and requires a fair amount of reworking (if not a complete overhaul) when developing text-to-speech for a different language. The Synthesizer Module, on the other hand, can be used for speech production in any language since it uses language-independent synthesizer control parameters to produce its sounds. In the interest of promoting a better understanding of text-to-speech systems, both modules are briefly examined below.
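
To make the division of labor concrete, the following header-style C sketch shows one way the interface between the two modules might look. The struct fields and function names are purely illustrative; they are not part of the TextAssist or DECtalk APIs.

    #include <stddef.h>

    /* Language-independent control parameters: what the Language Module
     * emits and the Synthesizer Module consumes. The fields are guesses
     * at the kind of information involved (pitch, duration, loudness). */
    typedef struct {
        int   phoneme;      /* phoneme identifier           */
        float pitch_hz;     /* fundamental-frequency target */
        float duration_ms;  /* segment duration             */
        float amplitude;    /* relative loudness            */
    } ControlParam;

    /* Language-specific front end: text in, control parameters out. */
    size_t language_module(const char *text,
                           ControlParam *params, size_t max_params);

    /* Language-independent back end: control parameters in, samples out. */
    size_t synthesizer_module(const ControlParam *params, size_t n_params,
                              short *samples, size_t max_samples);

The point of the split is that only language_module() would need reworking to support a new language; synthesizer_module() could be reused as-is.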

The Language Module

The Language Module is regarded as the most difficult part of any text-to-speech system. It comprises a large set of rules which aim to account for the many characteristics and peculiarities of a given language. Designing this module is very challenging, especially when one considers that these systems must detect and map semantic, syntactic, lexical and phonological components of a language into synthesizer values which will later be perceived as changes in pitch, duration and volume. Not only must each word be pronounced correctly, but the stream of words which makes up the sentence must be perceived as 'natural'. Following is a description of the main components:


Convert Text to Phonemes:
 1) Sentence Parser
 2) Word Parser
 3) Exception Dictionary
 4) Letter-to-Sound Module

Intonation Module:
 5) Apply Prosody Rules
 6) Apply Phonetic Rules
 7) Convert to Synthesizer Control Parameters

The resulting control parameters are passed on to the Speech Synthesizer.
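As a small illustration of steps 3 and 4, the following C sketch looks a word up in an exception dictionary and falls back to letter-to-sound rules only when the word is absent. The dictionary entries and the one-symbol-per-letter fallback are toy examples, not DECtalk's actual rules or phoneme set.

    #include <stdio.h>
    #include <string.h>

    struct DictEntry { const char *word; const char *phonemes; };

    /* Exception dictionary: words whose pronunciation violates the
     * regular spelling-to-sound rules of English. */
    static const struct DictEntry exceptions[] = {
        { "colonel", "k er n ax l" },
        { "one",     "w ah n"      },
    };

    /* Return a phoneme string for `word`, consulting the exception
     * dictionary first and falling back to letter-to-sound rules. */
    static const char *word_to_phonemes(const char *word, char *buf, size_t n)
    {
        size_t i;

        for (i = 0; i < sizeof(exceptions) / sizeof(exceptions[0]); i++)
            if (strcmp(word, exceptions[i].word) == 0)
                return exceptions[i].phonemes;

        /* Toy letter-to-sound fallback: one symbol per letter. A real
         * rule set considers letter context, morphology and stress. */
        buf[0] = '\0';
        for (i = 0; word[i] != '\0'; i++) {
            size_t len = strlen(buf);
            if (len + 3 > n)
                break;
            buf[len]     = word[i];
            buf[len + 1] = ' ';
            buf[len + 2] = '\0';
        }
        return buf;
    }

    int main(void)
    {
        char buf[64];
        printf("%s\n", word_to_phonemes("colonel", buf, sizeof buf));
        printf("%s\n", word_to_phonemes("cat", buf, sizeof buf));
        return 0;
    }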

The Synthesizer Module

Typically, a speech synthesizer uses control parameters from a Language Module to generate speech waveforms which are passed to audio hardware for speech playback. Three distinctly different approaches are taken today, but most text-to-speech synthesizers are either formant or diphone synthesizers; these two types dominate the commercial market.

Formant Synthesizers

Formant synthesizers are closely tied to an acoustic theory called the Vocal Tract Transfer Function, which views the vocal tract as a sophisticated "instrument". In this model, sound emanates from a sound source (the vibration of the vocal folds) and is modulated by the vocal tract formed by the pharynx, oral cavity, and lips. The term formant refers to peaks or resonances found in the frequency domain. Spectrograms, which give a graphic representation of the formant frequencies and bandwidths of a sound, are used to fine-tune pronunciation. The most important figure to date in the development of formant synthesizers was Dennis Klatt, whose work laid the foundation for all formant synthesizers today. DECtalk, jointly developed by Klatt and Digital, is widely regarded as the finest text-to-speech synthesizer available today.
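
The source-filter idea behind formant synthesis can be sketched in a few lines of C: a periodic source (here a bare impulse train at the fundamental frequency) is passed through a cascade of second-order resonators, one per formant. The resonator difference equation follows Klatt's published formulation; the formant frequencies and bandwidths below merely approximate a steady vowel and are not DECtalk parameters.

    /* Minimal source-filter sketch of formant synthesis.
     * Compile with: cc formant.c -lm */
    #include <math.h>
    #include <stdio.h>

    #define RATE  8000           /* sample rate in Hz      */
    #define NSAMP (RATE / 2)     /* half a second of audio */
    #define PI    3.14159265358979

    /* One second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2] */
    typedef struct { double a, b, c, y1, y2; } Resonator;

    static void resonator_init(Resonator *r, double freq, double bw)
    {
        r->c  = -exp(-2.0 * PI * bw / RATE);
        r->b  =  2.0 * exp(-PI * bw / RATE) * cos(2.0 * PI * freq / RATE);
        r->a  =  1.0 - r->b - r->c;
        r->y1 = r->y2 = 0.0;
    }

    static double resonator_run(Resonator *r, double x)
    {
        double y = r->a * x + r->b * r->y1 + r->c * r->y2;
        r->y2 = r->y1;
        r->y1 = y;
        return y;
    }

    int main(void)
    {
        /* Three formants roughly like the vowel in "father". */
        Resonator f1, f2, f3;
        double peak = 0.0;
        int n;

        resonator_init(&f1,  730.0,  60.0);
        resonator_init(&f2, 1090.0,  80.0);
        resonator_init(&f3, 2440.0, 120.0);

        for (n = 0; n < NSAMP; n++) {
            /* Impulse-train source at a 100 Hz fundamental. */
            double src = (n % (RATE / 100) == 0) ? 1.0 : 0.0;
            double s   = resonator_run(&f3,
                             resonator_run(&f2,
                                 resonator_run(&f1, src)));
            if (fabs(s) > peak)
                peak = fabs(s);
        }
        printf("synthesized %d samples, peak amplitude %.3f\n", NSAMP, peak);
        return 0;
    }

In a real formant synthesizer the source is a shaped glottal waveform mixed with noise, and all of these values are updated many times per second by the control parameters coming from the Language Module.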

Creative Technology's TextAssist...Superior, Natural Voice Quality

Perhaps the single most impressive aspect of TextAssist is its superior voice quality. In the speech industry, the DECtalk engine is widely regarded as the highest-quality and most natural text-to-speech engine available. This quality is due to the extensive set of control parameters passed to the formant (Klatt) synthesizer. Dennis Klatt worked with Digital engineers to hand-tune the algorithms for American English text-to-speech. Creative Technology has exclusively licensed these algorithms from Digital to bring this technology to the mass market.

For Developers...

The TextAssist API will give developers direct access to the TextAssist text-to-speech engine. It supports functions for controlling basic speech operations and speech quality, changing and defining new voices, playing audio files concurrently with speech, editing dictionaries, and synchronizing multimedia events during text-to-speech playback. "Talking head" animation, with syllable and word synchronization of text-to-speech and on-screen text, is also supported. For more information, please contact Creative Developer Support.
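
As a purely illustrative sketch of the kind of call sequence such an API implies, the following C fragment selects a voice, registers a word-synchronization callback, and speaks a string. Every identifier is a hypothetical placeholder, defined here as a stub so the example compiles; none of these are the actual TextAssist function names. Consult the SDK documentation from Creative Developer Support for the real API.

    #include <stdio.h>
    #include <string.h>

    typedef void (*word_cb)(const char *word);

    /* ---- hypothetical stubs standing in for the real engine ---- */
    static word_cb g_word_cb;

    static void tts_select_voice(const char *voice)      /* placeholder */
    {
        printf("[stub] voice selected: %s\n", voice);
    }

    static void tts_set_word_callback(word_cb cb)        /* placeholder */
    {
        g_word_cb = cb;
    }

    static void tts_speak(const char *text)              /* placeholder */
    {
        /* A real engine would synthesize and play audio; this stub walks
         * the words and fires the callback, mimicking word sync. */
        char buf[256];
        char *tok;

        strncpy(buf, text, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';
        for (tok = strtok(buf, " "); tok != NULL; tok = strtok(NULL, " "))
            if (g_word_cb)
                g_word_cb(tok);
    }
    /* ------------------------------------------------------------- */

    /* Application callback: highlight each word as it is spoken. */
    static void on_word(const char *word)
    {
        printf("speaking: %s\n", word);
    }

    int main(void)
    {
        tts_select_voice("Paul");
        tts_set_word_callback(on_word);
        tts_speak("Welcome to Creative TextAssist");
        return 0;
    }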


Creative Zone Developers' Corner ©1996 Creative Labs, Inc.