SST 1998 Abstracts

Page numbers refer to nominal page numbers assigned to each paper for purposes of citation.

Dialogue Systems

Pages | Authors | Title and Abstract
1--6 Qing Gao, Fang Zheng, Jian Wu, and Wenhu Wu Non-linear Probability Estimation Method used in HMM Modeling Frame Correlation


Abstract  In this paper we present a novel method to incorporate temporal correlation into a speech recognition system based on HMMs. An obvious way to incorporate temporal correlation is to condition the probability of the current observation on the current state as well as on the previous observation and the previous state. But using this method directly leads to unreliable parameter estimates, because the number of parameters to be estimated grows too large for the limited training data. In this paper, we approximate the joint conditional pdf by a non-linear estimation method. The HMM that incorporates temporal correlation via this non-linear estimation method, which we call the FC HMM, does not need any additional parameters and adds only a small computational cost. The experimental results show that the top-1 recognition rate of the FC HMM is raised by 6 percent compared to the traditional HMM method.


7--10 Shari Kumagai Patterns of Linguopalatal Contact during Japanese Vowel Devoicing


Abstract  It is widely claimed that close vowels in Japanese are devoiced when they occur between voiceless consonants. In this paper, voiceless vowels are represented symbolically as [V-] and voiced vowels as [V+]. The patterns of linguopalatal contact during C[V-]C units and the corresponding C[V+]C units are examined using the method of electropalatography (EPG). Our results show that C[V-]C units and the corresponding C[V+]C units often differ with respect to: (1) the amount (patterns) of tongue-palate contact from C1 (the preceding consonant) to C2 (the following consonant) and (2) the articulatory time interval from C1 to C2. Generally, the amount of linguopalatal contact is significantly greater at the front part of the palate in C[V-]C units compared to the corresponding C[V+]C units. The articulatory time interval from C1 to C2 is generally shorter in C[V-]C units compared to the corresponding C[V+]C units, though this is not always the case for all consonantal types. However, the articulatory gesture of the vowel appears to exist between voiceless consonants regardless of whether the vowels are voiced or devoiced. Devoiced vowels have often been examined from the aspect of the opening gesture of the glottis, since a turbulent noise during devoiced vowels is expected to be made at the glottis. However, our study seems to suggest that a turbulent noise can also be produced in the oral cavity - as well as at the glottis - by increasing the degree of tongue-palate contact. In principle, it is expected that the larger the tongue-palate contact is, the greater the turbulent noise will become due to the increased rate of airflow. This kind of linguopalatal contact appears to be a positive effort of the speaker rather than simply a matter of a shorter articulatory time interval in C[V-]C units: both factors seem to be related to the production of vowel devoicing, which seems to suggest that aerodynamic effects are involved.


11--16 Yu Xiao, Hu Guangrui Speech Separation based on the GMM pdf Estimation


Abstract  In this paper, the speech separation task is regarded as a convolutive-mixture Blind Source Separation (BSS) problem. The Maximum Entropy (ME) algorithm, the Minimum Mutual Information (MMI) algorithm and the Maximum Likelihood (ML) algorithm are the main approaches to solving the BSS problem, and the relationship between these three algorithms is analyzed in this paper. Based on a feedback network architecture, a new speech separation algorithm is proposed using Gaussian Mixture Model (GMM) pdf estimation. The computer simulation results show that the proposed algorithm achieves a faster convergence rate and a lower output Mean Square Error than the conventional ME algorithm.
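As an illustration of the role a GMM pdf estimate can play in such separation rules, the sketch below evaluates a one-dimensional Gaussian-mixture density and its score function phi(y) = -d/dy log p(y), the nonlinearity that ME/ML-style separation updates typically apply to the network outputs. This is a generic sketch, not the paper's feedback-network algorithm; the component-list format is an assumption.

```python
import math

def gmm_pdf(y, comps):
    """p(y) for a 1-D Gaussian mixture; comps = [(weight, mean, std), ...]."""
    return sum(w * math.exp(-((y - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
               for w, m, s in comps)

def gmm_score(y, comps):
    """Score function phi(y) = -d/dy log p(y): the nonlinearity an ME/ML
    separation rule would use when the source pdf is modelled as a GMM."""
    num = sum(w * ((y - m) / (s * s)) *
              math.exp(-((y - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
              for w, m, s in comps)
    return num / gmm_pdf(y, comps)
```

For a single standard Gaussian the score function reduces to phi(y) = y, which is a quick sanity check on the implementation.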


17--22 Xiaoqiang Luo Growth Transform of A Sum of Rational Functions and Its Application in Estimating HMM Parameters


Abstract  Gopalakrishnan et al. [1] described a method called the "growth transform" to optimize rational functions over a domain, which has been found useful for discriminatively training Hidden Markov Models (HMMs) in speech recognition [5, 6, 9]. A sum of rational functions is encountered when the contributions from other HMM states are weighted in estimating the Gaussian parameters of a state, and the weights are optimized using cross-validation [8]. We will show that the growth transform of a sum of rational functions can be obtained by computing term-wise gradients and term-wise function values, as opposed to first forming a single rational function and then applying the result in [1]. This is computationally advantageous when the objective function consists of many rational terms and the dimensionality of the domain is high. We also propose a gradient-directed search algorithm to find the appropriate transform constant C.
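For reference, one standard form of the growth transform over the probability simplex, together with the term-wise gradient identity the abstract relies on, can be written as follows (the notation is ours, following the general form in [1], and is not copied from the paper):

```latex
% Growth transform for P(\theta) = \sum_k N_k(\theta)/D_k(\theta)
% over the simplex \theta_i \ge 0,\ \sum_i \theta_i = 1:
\hat{\theta}_i
  = \frac{\theta_i \left( \frac{\partial P}{\partial \theta_i}(\theta) + C \right)}
         {\sum_j \theta_j \left( \frac{\partial P}{\partial \theta_j}(\theta) + C \right)},
\qquad
\frac{\partial P}{\partial \theta_i}
  = \sum_k \frac{1}{D_k}\left( \frac{\partial N_k}{\partial \theta_i}
      - \frac{N_k}{D_k}\,\frac{\partial D_k}{\partial \theta_i} \right).
```

The right-hand identity shows the point made in the abstract: the update needs only the term-wise gradients and the term-wise values N_k/D_k, never the single combined rational function.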


23--28 Mirjam Wester, Judith M. Kessens and Helmer Strik Two Automatic Approaches for Analyzing Connected Speech Processes in Dutch


Abstract  This paper describes two automatic approaches used to study connected speech processes (CSPs) in Dutch. The first approach is from a linguistic point of view - the top-down method. This method can be used for verification of hypotheses about CSPs. The second approach - the bottom-up method - uses a constrained phone recognizer to generate phone transcriptions. An alignment was carried out between the two transcriptions and a reference transcription. A comparison between the two methods showed that 68% agreement was achieved on the CSPs. Although phone accuracy is only 63%, the bottom-up approach is useful for studying CSPs. From the data generated using the bottom-up method, indications of which CSPs are present in the material can be found. These indications can be used to generate hypotheses which can then be tested using the top-down method.


29--34 J.W. Koolwaaij & J. de Veth The use of broad phonetic class models in speaker recognition


Abstract  In this paper we investigate the use of broad phonetic class (BPC) models in a text-independent speaker recognition task. These models can be used to reduce the variability due to the intrinsic differences between phonetic classes in the speech material used for training the speaker models. Combining BPC recognition with text-independent speaker recognition moves a step in the direction of text-dependent speaker recognition, a task which is known to reach better performance. The performance of BPC modelling is compared to our baseline system using ergodic 5-state HMMs. The question of which BPC contains the most speaker-specific information is addressed. It is also investigated if and how the BPC alignment is correlated with the state alignment from the baseline system, to check the assumption that the states of an ergodic HMM can model broad phonetic classes [3].


35--38 Jorge Miquelez, Rocio Sosma, and Yolanda Blanco Analysis and Treatment of Esophageal Speech for the Enhancement of its Comprehension


Abstract  This paper presents an analysis of esophageal speech and the development of a method for improving its intelligibility through speech synthesis. Esophageal speech is characterized by a low average frequency, while the formant patterns are found to be similar to those of normal speakers. The treatment differs for voiced and unvoiced frames of the signal. While the unvoiced frames are kept as in the original speech, the voiced frames are re-synthesized using linear prediction. Various models of the vocal source have been tested, and the results were better with a polynomial model. The fundamental frequency is raised to normal values while preserving the intonation.
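The abstract does not detail its linear-prediction analysis; a minimal sketch of the standard Levinson-Durbin recursion, the usual way to obtain LPC coefficients from frame autocorrelations (shown here as a generic illustration, not the authors' implementation), is:

```python
def levinson_durbin(r, order):
    """Levinson-Durbin recursion: LPC polynomial a[0..order] (a[0] = 1)
    and final prediction error from autocorrelations r[0..order]."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        k = -sum(a[j] * r[i - j] for j in range(i)) / err
        a_new = a[:]
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)
    return a, err
```

For an AR(1)-like autocorrelation sequence [1.0, 0.5, 0.25] the first-order fit recovers the coefficient -0.5, and the second-order coefficient comes out zero, as expected.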


39--44 Fernando Lacunza, Yolanda Blanco High quality Text-to-Speech system in Spanish for handicapped people


Abstract  This paper describes a high-quality Text-to-Speech system based on the concatenation of diphonemes with the MBR-PSOLA algorithm. Since it was designed as a substitute for natural voice for handicapped people, it must offer easy-to-hear speech with emotional and emphatic information embedded in it. This is obtained with the prosody generator, which uses phonological patterns and a grammatical database to vary three speech parameters: pitch, amplitude and duration. The system accepts plain text, which can be complemented with data about emotions and emphasis.


45--50 Corinna Ng, Ross Wilkinson, Justin Zobel Factors affecting Speech Retrieval


Abstract  Collections of speech documents can be searched using speech retrieval, in which the documents are processed by a speech recogniser to give text that can be searched by standard text retrieval techniques. Recognition is the translation of speech signals into either words or subword units such as phonemes. We investigated the use of a phoneme-based recogniser to obtain phoneme sequences. We found that phoneme recognition is worse than word recognition, because of the lack of context and the difficulty of phoneme boundary detection. Comparing the transcriptions of two different phoneme-based recognisers, we found that training on well-defined phoneme data, the lack of a language model, and the lack of a context-dependent model all affected recognition performance. Retrieval was based on n-grams. We found that trigrams performed better than quadgrams because the longer n-gram features contained too many transcription errors. Comparing the phonetic transcriptions from a word recogniser to transcriptions from a phoneme recogniser, we found that using 61 phones modelled with an algorithmic approach was better than using 40 phones modelled with a dictionary approach.
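The n-gram indexing step described above can be sketched as follows. The phone symbols and the `overlap_score` matching measure are deliberately crude illustrations of indexing phone sequences by trigrams, not the retrieval model used in the paper:

```python
def ngrams(phones, n=3):
    """Extract overlapping phone n-grams (trigrams by default) for indexing."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def overlap_score(query, doc, n=3):
    """Crude retrieval score: number of query n-grams present in the document."""
    doc_grams = set(ngrams(doc, n))
    return sum(1 for g in ngrams(query, n) if g in doc_grams)
```

A longer n makes each feature more discriminative but also more fragile: a single phone recognition error destroys every n-gram that spans it, which is the effect the abstract reports when quadgrams underperform trigrams.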


51--54 Johan Frid Perception of words with vowel reduction


Abstract  This study deals with listeners' ability to identify linguistic units from linguistically incomplete stimuli and relates this to the potentiality of vowel reduction in a word. Synthetic speech was used to produce stimuli that were similar to real words, but where the vowel in the pre-stress syllable was excluded. Listeners then performed a lexical decision test, where they had to decide whether a stimulus sounded like a word or not. The effects of the identity of the removed vowel and of features of the consonants adjacent to the removed vowel were then examined, as well as syllabic features. For type of vowel, lower word rates were found for words with the vowels /a/ and /o/, whereas words with nasals after the reduced vowel tended to result in higher word rates. Furthermore, words that still conformed to the phonotactic structure of Swedish after reduction received lower word rates than words that violated it, possibly because the conforming words are more amenable to resyllabification, which renders them phonotactically legal nonsense words rather than real words.


55--60 Ingrid Ahmer, Robin W. King Automated Captioning of Television Programs: Development and Analysis of a Soundtrack Corpus


Abstract  The purpose of this research is to investigate methods for applying speech recognition techniques to improve the productivity of off-line captioning for television. We posit that existing corpora for training continuous speech recognisers are unrepresentative of the acoustic conditions of television soundtracks. To evaluate the use of application specific models to this task we have developed a soundtrack corpus (representing a single genre of television programming) for acoustic analysis and a text corpus (from the same genre) for language modelling. These corpora are built from existing data derived from the manual captioning process. Captions were used to automatically segment and label the acoustic soundtrack data at sentence level, with manual post-processing to classify and verify the data. The text corpus was derived using automatic processing from approximately 1 million words of caption text. The results confirm the acoustic profile of the task to be characteristically different to that of most other speech recognition tasks (with the soundtrack corpus being almost devoid of clean speech). The text corpus indicates that application specific language modelling will be effective for the chosen genre, although a lexicon providing complete lexical coverage is unattainable. There is a high correspondence between captions and soundtrack speech for the chosen genre, confirming the value of closed-captions for generating labelled acoustic data. The corpora provide a potentially valuable resource to support further research into automated speech recognition techniques.


61--66 Fabrice Lefèvre, Claude Montacié and Marie-José Caraty On the Influence of the Delta Coefficients in a HMM-based Speech Recognition System


Abstract  Delta coefficients are a conventional method for including temporal information in speech recognition systems. In particular, they are widely used in Gaussian HMM-based systems. Some attempts were made to introduce the delta coefficients into the K-Nearest Neighbours (K-NN) HMM-based system that we recently developed. Introducing the delta coefficients directly in the representation space is shown not to be suitable for the K-NN probability density function (pdf) estimator. So, we investigate whether the delta coefficients could be used to improve the K-NN HMM-based system in other ways. For this purpose, an analysis of the delta coefficients in Gaussian HMM-based systems is proposed. It leads to the conclusion that the delta coefficients also influence the recognition process.
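For reference, the regression formula conventionally used to compute delta coefficients can be sketched as below. This is the standard textbook definition, not this paper's analysis; the window half-width N=2 and edge-frame replication are assumed defaults:

```python
def delta(frames, N=2):
    """Regression-based delta coefficients:
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    with edge frames replicated.  frames: list of per-frame feature vectors."""
    T, D = len(frames), len(frames[0])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    def f(t):
        # Clamp the index into range (replicate first/last frame at the edges).
        return frames[min(max(t, 0), T - 1)]
    return [[sum(n * (f(t + n)[k] - f(t - n)[k]) for n in range(1, N + 1)) / denom
             for k in range(D)]
            for t in range(T)]
```

On a linearly increasing feature the interior deltas come out as exactly the slope, which is a convenient sanity check.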


67--72 Raymond Low and Roberto Togneri Speech Recognition using the Probabilistic Neural Network


Abstract  A novel technique for speaker-independent automated speech recognition is proposed. We take a segment-model approach to Automated Speech Recognition (ASR), considering the trajectory of an utterance in vector space, and then classify using a modified Probabilistic Neural Network (PNN) and the maximum likelihood rule. The system compares favourably with established techniques, achieving in excess of 94% accuracy on isolated digit recognition, 88% on isolated alphabetic letters, and 83% on the confusable /e/ set. A favourable compromise between recognition accuracy on the one hand and computer memory and speed on the other can also be reached by performing clustering on the training data for the PNN.
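A minimal sketch of the standard PNN decision rule: each class score is a Parzen-window density estimate over that class's training vectors with a shared Gaussian kernel width. The segment-model modification described in the paper is not reproduced here, and the `sigma` value is an assumption:

```python
import math

def pnn_classify(x, train, sigma=0.5):
    """Classify x with a Probabilistic Neural Network: score each class by a
    Gaussian-kernel density estimate over its training vectors, pick the max.
    train: dict mapping class label -> list of training vectors."""
    best_label, best_score = None, -1.0
    for label, vectors in train.items():
        score = sum(
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, v)) / (2 * sigma ** 2))
            for v in vectors
        ) / len(vectors)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because every training vector becomes a kernel centre, memory grows with the training set; clustering the training data, as the abstract suggests, trades a little accuracy for fewer centres.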


73--78 Imed Zitouni A Language Modeling Based on a Hierarchical Approach: Mvn


Abstract  In this work, we introduce the concept of the hierarchical Mvn language model and compare it to the class-based multigram and interpolated class n-gram models. The originality of our approach is its capability to parse a string of classes/tags into variable-length dependent sequences. A few experimental tests were carried out on a class corpus extracted from the automatically labelled French "Le Monde" word corpus. In our experiments, Mvn outperforms the class-based multigram and interpolated class bigram models but is comparable to the interpolated class trigram model.


79--84 Michiko Watanabe Temporal variables in lectures in the Japanese language


Abstract  In second language input studies, speaking speed is regarded as one of the most influential factors in comprehension. However, research in this area has mainly been conducted on written texts read aloud. The present study investigated temporal variables, such as articulation rate and the ratio and frequency of fillers and silent pauses, in three university lectures given in Japanese. It was found that the total duration ratio of fillers was as great as that of silent pauses. It also became clear that, for individual speakers, articulation rate and frequency of fillers are relatively constant, while the frequency of silent pauses varies depending on the discourse section. Of total pause ratio, pause frequency and articulation rate, the last correlated best with listener ratings of speech speed. The findings suggest that spontaneous speech requires methods of speech speed measurement different from those for read speech.


85--90 Matthew Aylett Building a Statistical Model of the Vowel Space for Phoneticians


Abstract  Vowel space data (a two-dimensional F1/F2 plot) is of interest to phoneticians for the purpose of comparing different accents, languages, speaker styles and individual speakers. Current automatic methods used by speech technologists do not generally produce traditional vowel space models (see [6] for an overview); instead they tend to produce hyper-dimensional codebooks covering the entire speaker's speech stream. This makes it difficult to relate results generated by these methods to observations in laboratory phonetics. In order to address these problems, a model was developed based on a Gaussian mixture density function fitted using expectation maximisation on F1/F2 data, producing a probability distribution in F1/F2 space. Speech was pre-processed using voicing to automatically excerpt vowel data without any need for segmentation, and a parametric fit algorithm [7] was applied to calculate likely vowel targets. The result was a clear visualisation of a speaker's vowel space requiring no segmented or labelled speech.
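The expectation-maximisation fit can be illustrated with a single EM iteration for a one-dimensional Gaussian mixture (the 2-D F1/F2 case with diagonal covariances works per dimension in the same way). This is a generic textbook sketch, not the authors' implementation:

```python
import math

def em_step(data, comps):
    """One EM iteration for a 1-D Gaussian mixture.
    comps: list of (weight, mean, variance) tuples."""
    def pdf(x, m, v):
        return math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
    # E-step: posterior responsibility of each component for each point.
    resp = []
    for x in data:
        ps = [w * pdf(x, m, v) for w, m, v in comps]
        tot = sum(ps)
        resp.append([p / tot for p in ps])
    # M-step: re-estimate weights, means and variances from responsibilities.
    new = []
    for k in range(len(comps)):
        nk = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
        new.append((nk / len(data), mean, var))
    return new
```

Iterating this step to convergence yields the mixture density over the vowel space; each component mean then serves as a candidate vowel target.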


91--96 Michelle Minnick Fox Computer-Mediated Input and the Acquisition of L2 Vowels


Abstract  Programs for testing and training of difficult vowel distinctions in American English were created for subjects to access via the Internet using a web browser. The testing and training data include many likely vowel confusions for speakers of different L1s. The training program focuses on one distinction at a time, and adjusts to concentrate on particular contexts or exemplars that are difficult for the individual subject. In the current study, 52 subjects participated in testing and 2 subjects participated in training. In the testing portion, results indicate that the L1 and the fluency level in English, as well as individual variability, have an effect on perceptual ability. In the training portion, subjects showed significant improvement on the contrasts on which they trained. Because these programs make extensive data collection over large populations and large distances easy, this method of research will facilitate further investigation of questions regarding second language acquisition.


97--102 Najam Malik and W. Harvey Holmes Speech Analysis by Subspace Methods of Spectral Line Estimation


Abstract  Over frames of short time duration, filtered speech may be described as a finite linear combination of sinusoidal components. In the case of a frame of voiced speech the frequencies are considered to be harmonics of a fundamental frequency. It can be assumed further that the speech samples are observed in additive white noise of zero mean, resulting in a standard signal-plus-noise model. This model has a nonlinear dependence on the frequencies of the sinusoids but is linear in their coefficients. We use subspace line spectral estimation methods of Pisarenko and Prony type to estimate the frequencies and use the results in voiced-unvoiced classification and pitch estimation, followed by analysis of the speech waveform into its sinusoidal components.
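The flavour of these estimators can be conveyed with the closed-form single-sinusoid Pisarenko-type estimate from sample autocorrelations at lags 1 and 2. The general subspace methods used in the paper handle multiple harmonics; this one-sinusoid special case is only an illustration:

```python
import math

def autocorr(x, k):
    """Biased sample autocorrelation of x at lag k."""
    return sum(x[n] * x[n + k] for n in range(len(x) - k)) / len(x)

def pisarenko_freq(x):
    """Estimate the frequency (rad/sample) of a single real sinusoid
    in white noise using the closed-form Pisarenko estimator:
    cos(w) = (r2 + sqrt(r2^2 + 8*r1^2)) / (4*r1)."""
    r1, r2 = autocorr(x, 1), autocorr(x, 2)
    c = (r2 + math.sqrt(r2 * r2 + 8.0 * r1 * r1)) / (4.0 * r1)
    return math.acos(max(-1.0, min(1.0, c)))
```

On a clean sinusoid the estimate is exact up to the finite-window bias of the sample autocorrelations, which shrinks as the frame grows.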


103--108 Petra Hansson Pausing in Swedish Spontaneous Speech


Abstract  Pauses in spontaneous speech have a less restricted distribution than pauses in read discourse; however, they are not distributed in a haphazard way. The majority of the perceived pauses in the examined Swedish spontaneous speech material, 73%, occurred in one of the following positions: between sentences, after discourse markers and conjunctions, and before accented content words. There is a range of acoustic correlates of perceived pauses in spontaneous speech, such as silent intervals, hesitation sounds, prepausal lengthening, glottalization and specific F0 patterns. The acoustic manifestation of a pause, e.g. the duration of the pause and the F0 pattern associated with the pause, is to some extent dependent on the pause's position and function.


109--114 Elisabeth Zetterholm Prosody and voice quality in the expression of emotions


Abstract  Terms for voice quality or phonation types for use in normal speech often come from studies of pathological speech (laryngeal settings), and it is hard to describe voice quality, especially the variations of a normal voice. In normal speech we use different voice qualities: for linguistic distinctions in some languages, prosodically as a boundary signal, socially depending on social and regional variants, and paralinguistically in attitudes and emotions. This paper presents some reference types of voice qualities, recorded by a trained phonetician, and their acoustic correlates. In a pilot study a male actor recorded four attitudinally neutral sentences using five different emotions, which were compared to his neutral voice. It is evident that voice quality, as well as rhythm and intonation, plays an important role in giving the impression of different emotions.


115--120 J. Lunn, A. A. Wrench, J. Mackenzie Beck Acoustic Analysis of /i/ in Glossectomees


Abstract  The production of /i/ is examined for pre- and post-operative patients who have undergone surgery in three distinct areas (anterior, posterior or lateral tongue) followed by radiotherapy and reconstruction. Results show F1 and F2 to be raised after surgery in all cases. Normalised measures of tongue height (F1-F0) and extension (F2-F1) revealed no significant change after surgery to the side of the tongue but in the other two categories, results indicated a change normally associated with both raising and fronting of the tongue. The paper compares these results with findings from other studies and considers possible mechanisms for the observed changes.



Copyright © ASSTA