A feedforward connectionist network trained by backpropagation was used to detect 15 speech features. The network was trained over 240 sentences (40 men and 40 women), and tested over 200 sentences (10 men and 10 women), all part of the MIT Ice Cream database. Network input consisted of a smoothed spectral vector at 15-ms intervals, plus two coefficients of amplitude and spectral change. The network achieves a signal detection discrimination level (a-prime) of 0.87 compared to a level of 0.76 for a ten-nearest-neighbor system. Almost identical training and test performances indicate excellent generalization to new speakers and text. Processing costs are mainly signal processing and network training; detection itself can be done in real time. Performance is much better for broad features like sonorance, which occur frequently, than for infrequent features like sibilance, partly because of their low frequency and partly because of other characteristics. [Work supported by USWest.]
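The a-prime discrimination level cited above is a nonparametric signal-detection index computed from hit and false-alarm rates. A minimal sketch, assuming the common Grier (1971) approximation (the abstract does not say which variant was used):

```python
def a_prime(hit_rate: float, fa_rate: float) -> float:
    """Nonparametric discrimination index A' (Grier, 1971).

    Returns 0.5 for chance performance and 1.0 for perfect detection.
    """
    h, f = hit_rate, fa_rate
    if h == f:
        return 0.5
    if h < f:
        # Below-chance case: reflect the above-chance formula.
        return 1.0 - a_prime(f, h)
    return 0.5 + ((h - f) * (1.0 + h - f)) / (4.0 * h * (1.0 - f))
```

For example, perfect detection (`a_prime(1.0, 0.0)`) yields 1.0, and equal hit and false-alarm rates yield the chance level of 0.5.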
2:00
3SP5. Neural networks in articulatory speech analysis/synthesis. M. G. Rahim, W. B. Kleijn, and J. Schroeter (AT&T Bell Laboratories, Murray Hill, NJ 07974)
A major difficulty in articulatory analysis/synthesis is the estimation of vocal-tract parameters from input speech. The use of neural networks to extract these parameters is more attractive than codebook look-up due to the lower computational complexity. For example, a multilayer perceptron (MLP) with two hidden layers, trained and evaluated on a small data set, was shown to perform a reasonable mapping of acoustic-to-geometric parameters. Increasing the training data, however, revealed ambiguity in the mapping that could not be resolved by a single network. This paper addresses the problem using an assembly of MLPs, each designated to a specific region in the articulatory space. Training data were generated by randomly sampling the parameters of an articulatory model of the vocal system. The resultant vocal-tract shapes were clustered into 128 regions, and an MLP with one hidden layer was assigned to each of these regions for mapping 18 cepstral coefficients into ten tract areas and a nasalization parameter. Networks were selected by dynamic programming, and were used to control a time-domain articulatory synthesizer. After training, significant perceptual and objective improvements were achieved relative to using a single MLP. Performance comparable to codebook look-up with dynamic programming was obtained. This model, however, requires only 4% of the storage needed for the codebook, and performs the mapping faster by a factor of 20.
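The region-wise structure described above can be sketched as follows. This is a toy illustration with made-up dimensions, random data, and untrained weights, not the authors' networks; it assumes k-means-style clustering of the tract shapes and one small MLP per region:

```python
import numpy as np

rng = np.random.default_rng(0)

N_REGIONS = 8    # the paper uses 128 regions
N_CEPSTRA = 18   # input: 18 cepstral coefficients
N_OUT = 11       # output: 10 tract areas + 1 nasalization parameter
HIDDEN = 16

# Toy "vocal-tract shapes", used only to define the regions.
shapes = rng.normal(size=(500, N_OUT))

# Crude k-means clustering of the shapes into regions.
centroids = shapes[rng.choice(len(shapes), N_REGIONS, replace=False)]
for _ in range(10):
    labels = np.argmin(
        ((shapes[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    for k in range(N_REGIONS):
        if (labels == k).any():
            centroids[k] = shapes[labels == k].mean(axis=0)

# One single-hidden-layer MLP per region (weights left untrained here).
def make_mlp():
    w1 = rng.normal(scale=0.1, size=(N_CEPSTRA, HIDDEN))
    w2 = rng.normal(scale=0.1, size=(HIDDEN, N_OUT))
    return lambda x: np.tanh(x @ w1) @ w2

mlps = [make_mlp() for _ in range(N_REGIONS)]

def map_frame(cepstra, region):
    """Map one 18-dim cepstral frame to 11 articulatory parameters
    using the MLP assigned to the chosen region."""
    return mlps[region](cepstra)

out = map_frame(rng.normal(size=N_CEPSTRA), region=3)
```

In the paper, the region (and hence the network) for each frame of a cepstral sequence is chosen by dynamic programming; here a region index is simply passed in.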
2:15
3SP6. Automatic speech recognition based on property detectors. T.
V. Ananthapadmanabha and H. N. Jayasimha (Voice and Speech Systems, 669, I Floor, 20th Cross, II Block, Rajajinagar, Bangalore 560 010, India)
Speaker-independent, large-vocabulary, continuous speech recognition by a machine is a challenging problem on which over a decade of research has been conducted without significant progress. In the existing systems, the same acoustic feature vector (LPC, cepstrum, filter bank, etc.) is used for all speech sounds, and they depend heavily on contextual information for their success. This paper presents some results based on a radically different approach called "property detectors." The approach of property detectors is well known in visual perception, where it has been demonstrated that specialized detectors exist on the retina that trigger only for vertical, horizontal, or inclined lines. It has only been speculated that such specialized detectors could exist for speech. Recently, acoustic properties have been discovered that uniquely characterize some phonemes like /a/, /i/, /u/, /e/, /o/, and /s/. A limited-vocabulary, speaker-independent airline schedule announcement system was developed. This system was tested in a noisy hall with a large number of speakers, including female speakers, with different linguistic backgrounds. The system, though in its early stage, gave a performance of about 85% accuracy. The approach based on property detectors appears promising.
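A property detector in this spirit might, for instance, fire for /s/-like sibilance when high-frequency energy dominates a frame. The following is a toy sketch; the band edge and threshold are illustrative assumptions, not the detectors developed by the authors:

```python
import numpy as np

def sibilance_detector(frame, sample_rate=16000, threshold=0.4):
    """Toy /s/-property detector: fires when the fraction of spectral
    energy above 4 kHz exceeds a threshold (band edge and threshold
    are illustrative assumptions, not values from the paper)."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    hf_fraction = spectrum[freqs >= 4000].sum() / (spectrum.sum() + 1e-12)
    return hf_fraction > threshold

rate = 16000
t = np.arange(512) / rate
vowel_like = np.sin(2 * np.pi * 300 * t)      # energy concentrated at 300 Hz
sibilant_like = np.sin(2 * np.pi * 6000 * t)  # energy concentrated at 6 kHz
```

Unlike a shared feature vector fed to every phoneme model, each such detector responds only to its own acoustic property, analogous to the orientation-selective visual detectors mentioned above.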
2:30-2:45
Break
2:45
3SP7. Synthesis of manner and voicing continua based on speech production models. Corine Bickley, Kenneth N. Stevens (Res. Lab. of Electron., MIT, Cambridge, MA 02139), and Rolf Carlson (MIT, Cambridge, MA 02139)
The goal of this project is to create natural-sounding synthetic consonant-vowel syllables for presentation to aphasic patients and normal controls in studies of perception of speech sounds and lexical access. Of particular interest are the manner distinctions that appear to form the basis for the processing of other phonetic dimensions by human listeners. Continua of syllabic-nonsyllabic, sonorant-obstruent, continuant-noncontinuant, and voiced-voiceless sounds were constructed using the KLSYN88 synthesizer. The endpoint stimuli were synthesized based on theoretical models of glottal and turbulence noise sources and vocal-tract filtering, with some refinements to match the characteristics of a particular speaker. Intermediate stimuli were created to form continua that represent incremental changes in the synthesizer parameters. For all stimuli, the values of synthesis parameters modeled utterances that could be produced by a human talker. Identification functions for these continua for normal listeners showed relatively sharp boundaries between phonetic categories. The acoustic characteristics of the stimuli in the vicinity of the boundaries were examined to determine the pattern of acoustic attributes responsible for the abrupt change in identification, such as rise times of amplitudes, rates of change of formants, and relative amplitudes of noise and glottal excitations. [Work supported in part by NIH grants DC00776 and DC00075.]
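Building such a continuum amounts to interpolating each synthesis parameter between the two endpoint stimuli. A minimal sketch, assuming simple linear interpolation; the parameter names and values are invented placeholders, not actual KLSYN88 settings:

```python
def make_continuum(endpoint_a, endpoint_b, n_steps):
    """Interpolate each synthesis parameter linearly between two
    endpoint stimuli, yielding n_steps parameter sets (endpoints
    included)."""
    continuum = []
    for i in range(n_steps):
        frac = i / (n_steps - 1)
        continuum.append({
            name: (1 - frac) * endpoint_a[name] + frac * endpoint_b[name]
            for name in endpoint_a
        })
    return continuum

# Hypothetical endpoints for a voiced-voiceless continuum:
voiced = {"voice_onset_ms": 0.0, "aspiration_db": 30.0}
voiceless = {"voice_onset_ms": 60.0, "aspiration_db": 55.0}
steps = make_continuum(voiced, voiceless, 7)
```

Each intermediate parameter set can then be fed to the synthesizer, giving stimuli that change in equal increments from one endpoint to the other.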
3:00
3SP8. Considerations on speaking style and speaker variability in speech synthesis. Lennart Nord and Björn Granström (Dept. of Speech Commun. & Music Acoust., Royal Inst. Tech., Box 70014, S-10044 Stockholm, Sweden)
In the exploration of speaking style and speaker variability, a multispeaker database and a speech production model are used. The structure of the database, which includes professional as well as untrained speakers, makes it possible to extract relevant information by simple search procedures. In perceptual studies, both F0 and duration have had an indisputable effect on prosody, but the role of intensity and of segmental variation has been less clear. This has resulted in an emphasis on the former attributes in current speech synthesis schemes. Intensity has a
J. Acoust. Soc. Am., Vol. 89, No. 4, Pt. 2, April 1991