ABSTRACT This paper describes the technical issues and procedures involved in developing a 1000-element diphone female voice, and subsequently a 2500-element polyphone female voice, for the AT&T Network Systems concatenative-synthesis text-to-speech (TTS) system. Telephone bandpass filtering and speech coding techniques each reduce the intelligibility of natural female speech relatively more than that of male speech, and both factors were hurdles to developing a satisfactory female TTS voice for telephony applications. Despite these difficulties, a highly intelligible female voice was developed by careful selection of elements for the acoustic inventory. The approach was to refine the acoustic inventory iteratively: run intelligibility tests to identify poor acoustic elements, replace them with superior elements, and repeat the process, covering as wide a range of acoustic elements and contexts as possible. This technique was very successful: the female diphone TTS voice produced 67% more errors than the male diphone TTS voice in early 1990, and this gap was reduced to 21% by late 1990. Expansion from the diphone to the polyphone TTS system in 1991 resulted in significant listener preferences for the polyphone over the diphone system for both the male TTS voice (60%-40%) and the female TTS voice (68%-32%). Intelligibility tests of the male and female polyphone TTS voices in 1992 indicated that TTS intelligibility was significantly lower (about 90% as intelligible) than that of the human voices on which the TTS systems were based. Intelligibility scores averaged across the human and TTS conditions were significantly higher for male voices than for female voices, but there was no significant interaction between sex and condition (human or TTS), indicating that the small differences in intelligibility between male and female voices were comparable for human and synthetic voices.