Publications catalogue - books



Text, Speech and Dialogue: 8th International Conference, TSD 2005, Karlovy Vary, Czech Republic, September 12-15, 2005, Proceedings

Václav Matoušek ; Pavel Mautner ; Tomáš Pavelka (eds.)

Conference: 8th International Conference on Text, Speech and Dialogue (TSD), Karlovy Vary, Czech Republic, September 12-15, 2005

Abstract/Description - provided by the publisher

Not available.

Keywords - provided by the publisher

Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Information Storage and Retrieval; Information Systems Applications (incl. Internet)

Availability

No subscribing institution detected. Year of publication: 2005. Available on SpringerLink.

Information

Resource type:

books

Print ISBN

978-3-540-28789-6

Electronic ISBN

978-3-540-31817-0

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Table of contents

Sinusoidal Modeling Using Wavelet Packet Transform Applied to the Analysis and Synthesis of Speech Signals

Kihong Kim; Jinkeun Hong; Jongin Lim

The sinusoidal model has proven useful for the representation and modification of speech and audio signals. One drawback, however, is that the model is typically derived using a fixed analysis frame size, so it cannot guarantee optimal spectral resolution for each sinusoidal parameter. In this paper, we propose a sinusoidal model based on wavelet packet analysis that obtains better frequency resolution at low frequencies and better time resolution at high frequencies, allowing the sinusoidal parameters to be estimated more accurately. Experiments show that the proposed model achieves better performance than the conventional model.

- Speech | Pp. 241-248
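The wavelet packet idea in the abstract above can be illustrated with a minimal Haar decomposition. This is only a sketch under assumed choices: the authors' wavelet, tree depth, and time-frequency tiling are not given in this catalogue entry.

```python
import numpy as np

def haar_step(x):
    # One orthonormal Haar analysis step: low-pass (approximation)
    # and high-pass (detail) outputs at half the input rate.
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def wavelet_packet(x, depth):
    # Full wavelet packet tree: unlike the plain wavelet transform,
    # BOTH the low- and high-pass branches are split at every level,
    # giving 2**depth subbands whose time/frequency trade-off can
    # then be chosen per band.
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(depth):
        nodes = [half for node in nodes for half in haar_step(node)]
    return nodes

signal = np.sin(2 * np.pi * np.arange(64) / 16.0)
bands = wavelet_packet(signal, 3)   # 8 subbands of 8 coefficients each
```

Because the transform is orthonormal, the subband coefficients preserve the signal energy, which is what makes per-band sinusoidal parameter estimation meaningful.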

Speaker Identification Based on Subtractive Clustering Algorithm with Estimating Number of Clusters

Younjeong Lee; Ki Yong Lee; Jaeyeol Rheem

In this paper, we propose a new clustering algorithm that clusters the feature vectors for speaker identification. Unlike typical clustering approaches, the proposed method performs the clustering without initial guesses of the cluster center locations and without a priori information about the number of clusters. Cluster centers are obtained incrementally, adding one center at a time through the subtractive clustering algorithm. The number of clusters is obtained by investigating the mutual relationship between clusters. The experimental results show the effectiveness of the proposed algorithm compared with conventional methods.

- Speech | Pp. 249-256
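The incremental center selection described above follows the general shape of Chiu's subtractive clustering, sketched below. The radius, decay, and stopping parameters are assumptions; the paper's own cluster-count criterion based on inter-cluster relationships is not reproduced here.

```python
import numpy as np

def subtractive_clustering(X, ra=0.5, stop_ratio=0.15):
    # Chiu-style subtractive clustering: every point gets a "potential"
    # (a kernel density estimate); centers are picked greedily at
    # potential peaks, and each pick suppresses the potential nearby,
    # so centers are added one at a time without initial guesses.
    alpha = 4.0 / ra ** 2
    beta = 4.0 / (1.5 * ra) ** 2
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-alpha * d2).sum(axis=1)
    p_first = P.max()
    centers = []
    while True:
        k = int(P.argmax())
        if P[k] < stop_ratio * p_first:   # stopping rule (an assumption)
            break
        centers.append(X[k].copy())
        # Subtract the new center's influence from all potentials.
        P = P - P[k] * np.exp(-beta * ((X - X[k]) ** 2).sum(-1))
    return np.array(centers)
```

On two well-separated groups of points this returns one center per group without being told the cluster count in advance.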

On Modelling Glottal Stop in Czech Text-to-Speech Synthesis

Jindřich Matoušek; Jiří Kala

This paper deals with the modelling of the glottal stop for the purposes of Czech text-to-speech synthesis. The phonetic features of the glottal stop are discussed, and a phonetic transcription rule for inserting the glottal stop into sequences of Czech phones is proposed. Two approaches to glottal stop modelling are introduced. The first uses the glottal stop as a stand-alone phone; the second models it as an allophone of a vowel. Both approaches are evaluated with respect to both the automatic segmentation of speech and the quality of the resulting synthetic speech. Better results are obtained when the glottal stop is modelled as a stand-alone phone.

- Speech | Pp. 257-264
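The kind of transcription rule the abstract mentions can be sketched as a toy example. The rule shown, inserting a glottal stop before a word-initial vowel, is a simplification assumed for illustration; the paper's actual Czech rules are more detailed.

```python
CZECH_VOWELS = set("aeiouyáéěíóúůý")

def insert_glottal_stops(words):
    # Toy rule: prefix a glottal stop (written '?' here) to any word
    # beginning with a vowel.  Input is a list of words in a
    # simplified phone spelling.
    return ["?" + w if w and w[0] in CZECH_VOWELS else w for w in words]

print(insert_glottal_stops(["okno", "a", "pes"]))   # ['?okno', '?a', 'pes']
```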

Analysis of the Suitability of Common Corpora for Emotional Speech Modeling in Standard Basque

Eva Navas; Inmaculada Hernáez; Iker Luengo; Jon Sánchez; Ibon Saratxaga

This paper presents an analysis made to assess the suitability of neutral semantic corpora for studying emotional speech. Two corpora have been used: one with neutral texts common to all emotions, and the other with texts related to each emotion. Subjective and objective analyses have been performed. In the subjective test, the common corpus achieved good recognition rates, although worse than those obtained with the specific texts. In the objective analysis, differences among emotions are larger for the common texts than for the specific texts, indicating that in the common corpus the expression of emotions was more exaggerated. This is convenient for emotional speech synthesis, but not for emotion recognition. Thus, the common corpus is suitable for the prosodic modeling of emotions for use in speech synthesis, whereas for emotion recognition the specific texts are more convenient.

- Speech | Pp. 265-272

Discrete and Fluent Voice Dictation in Czech Language

Jan Nouza

This paper describes two prototypes of voice dictation systems developed for the Czech language. The first has been designed for discrete dictation with a lexicon that includes up to 1 million of the most frequent Czech words and word-forms. The other is capable of processing fluent speech and can work with a 100,000-word lexicon in real time on recent high-end PCs. The former has been successfully tested by handicapped persons who cannot enter and edit texts with standard input devices (keyboard and mouse).

- Speech | Pp. 273-280

Unit Selection for Speech Synthesis Based on Acoustic Criteria

Soufiane Rouibia; Olivier Rosec; Thierry Moudenc

This paper presents a new approach to unit selection for corpus-based speech synthesis in which the units are selected according to acoustic criteria. In a training stage, acoustic clustering is carried out using context-dependent HMMs. In the synthesis stage, an acoustic target is generated and divided into segments corresponding to the required unit sequence. Then, the acoustic unit sequence that best matches the target is selected. Tests are carried out that show the relevance of the proposed method.

- Speech | Pp. 281-287
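The "select the unit sequence that best matches the target" step is classically solved by dynamic programming over target and concatenation costs, as in the sketch below. The Euclidean costs and the data layout are assumptions; the paper's HMM-based target generation is not reproduced.

```python
import numpy as np

def select_units(targets, candidates):
    # Dynamic-programming unit selection: choose one candidate unit per
    # target segment, minimizing a target cost (distance of the unit to
    # the acoustic target) plus a concatenation cost (mismatch between
    # consecutive units).  Both costs are plain Euclidean distances here.
    n = len(targets)
    cost = [np.array([np.linalg.norm(u - targets[0]) for u in candidates[0]])]
    back = [None]
    for t in range(1, n):
        row, arg = [], []
        for u in candidates[t]:
            joins = cost[t - 1] + np.array(
                [np.linalg.norm(u - v) for v in candidates[t - 1]])
            i = int(joins.argmin())
            row.append(joins[i] + np.linalg.norm(u - targets[t]))
            arg.append(i)
        cost.append(np.array(row))
        back.append(arg)
    # Trace back the cheapest path.
    path = [int(cost[-1].argmin())]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

The returned list gives, for each target segment, the index of the chosen candidate unit.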

Generative Model for Decoding a Phoneme Recognizer Output

Mykola Sazhok

The paper presents a step towards the implementation of a multi-level automatic speech understanding system. Two levels are considered. On the first level, free (or relatively free) grammar phoneme recognition is applied; at the second level, the output of the phonemic recognizer is automatically interpreted in a reasonable way. A model based on the Generative Model approach is proposed for decoding the phoneme recognizer output, and an experimental system is described.

- Speech | Pp. 288-293
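One simple way to picture the second level, interpreting a noisy phoneme string, is nearest-lexicon-entry decoding by edit distance, sketched below. This is only an illustrative stand-in: the paper's generative model is more elaborate than a plain Levenshtein lookup.

```python
def levenshtein(a, b):
    # Standard edit distance between two phoneme strings.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def decode(phoneme_output, lexicon):
    # Toy second-level decoding: map the recognizer's phoneme string to
    # the lexicon entry with the smallest edit distance.
    return min(lexicon, key=lambda w: levenshtein(phoneme_output, lexicon[w]))

lexicon = {"kot": "kot", "pes": "pes"}   # word -> phone string (assumed)
print(decode("kod", lexicon))            # prints "kot"
```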

Diction Based Prosody Modeling in Table-to-Speech Synthesis

Dimitris Spiliotopoulos; Gerasimos Xydas; Georgios Kouroupetroglou

Transferring a structure from the visual modality to the aural one presents a difficult challenge. In this work we experiment with prosody modeling for the synthesized speech representation of tabulated structures. This is achieved by analyzing naturally spoken descriptions of data tables, followed by feedback from blind and sighted users. The derived prosodic phrase accent and pause break placements and values are examined in terms of how successfully they convey semantically important visual information through prosody control in Table-to-Speech synthesis. Finally, the quality of the information provision of synthesized tables using the proposed prosody specification is studied against plain synthesis.

- Speech | Pp. 294-301

Phoneme Based Acoustics Keyword Spotting in Informal Continuous Speech

Igor Szöke; Petr Schwarz; Pavel Matějka; Lukáš Burget; Martin Karafiát; Jan Černocký

This paper describes several approaches to acoustic keyword spotting (KWS), based on hidden Markov models (HMMs) with Gaussian mixture model (GMM) output densities and on phoneme posterior probabilities from FeatureNet. Context-independent and context-dependent phoneme models are used in the GMM/HMM system. The systems were trained and evaluated on informal continuous speech. We used different complexities of the KWS recognition network and different types of phoneme models, studied the impact of these parameters on accuracy and computational complexity, and conclude that phoneme posteriors outperform the conventional GMM/HMM system.

- Speech | Pp. 302-309
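A minimal posterior-based keyword scorer, in the spirit of the FeatureNet branch above, can be sketched as follows. The alignment scheme and the background normalization are assumptions for illustration, not the paper's system.

```python
import numpy as np

def keyword_score(log_post, keyword):
    # log_post: (frames x phonemes) log-posteriors from a phoneme
    # classifier; keyword: sequence of phoneme indices.  Finds the best
    # monotone alignment of the keyword to the frames and normalizes by
    # the unconstrained best path (a crude background model), so a
    # score near 0 means a near-perfect match; thresholding the score
    # gives detections.
    T = log_post.shape[0]
    K = len(keyword)
    dp = np.full((K, T), -np.inf)
    dp[0, 0] = log_post[0, keyword[0]]
    for t in range(1, T):
        dp[0, t] = dp[0, t - 1] + log_post[t, keyword[0]]
        for k in range(1, K):
            dp[k, t] = max(dp[k, t - 1], dp[k - 1, t - 1]) + log_post[t, keyword[k]]
    background = log_post.max(axis=1).sum()
    return (dp[K - 1, T - 1] - background) / T
```

Sliding this scorer over an utterance and keeping scores above a tuned threshold yields candidate keyword hits.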

Explicit Duration Modelling in HMM/ANN Hybrids

László Tóth; András Kocsor

In some languages, such as Finnish or Hungarian, phone duration is a very important distinctive acoustic cue. The conventional HMM speech recognition framework, however, is known to model duration information poorly. In this paper we compare different duration models within the framework of HMM/ANN hybrids. The tests are performed with two different hybrid models: the conventional one and the recently proposed "averaging hybrid". Independently of the model configuration, we find that the usual exponential duration model has no detectable advantage over using no duration model at all. Similarly, applying the same fixed value for all state transition probabilities, as is usual in HMM/ANN systems, is found to have no influence on performance. However, the practical trick of imposing a minimum duration on the phones turns out to be very useful. The key part of the paper is the introduction of the gamma-distribution duration model, which proves clearly superior to the exponential one, yielding a 12-20% relative improvement in the word error rate and thus justifying the use of sophisticated duration models in speech recognition.

- Speech | Pp. 310-317
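The contrast between the self-loop (exponential/geometric) duration model and the gamma model can be made concrete with the log-probabilities below. The parameter values are illustrative only; the paper's estimation procedure is not shown.

```python
import math

def exp_dur_logprob(d, mean):
    # Duration distribution implied by an HMM self-loop with the given
    # mean: geometric, so it always peaks at the shortest duration,
    # regardless of how long the phone typically is.
    p_stay = 1.0 - 1.0 / mean
    return (d - 1) * math.log(p_stay) + math.log(1.0 - p_stay)

def gamma_dur_logprob(d, shape, scale):
    # Gamma duration model: for shape > 1 it peaks at the mode
    # (shape - 1) * scale, i.e. near the typical phone length.
    return ((shape - 1.0) * math.log(d) - d / scale
            - math.lgamma(shape) - shape * math.log(scale))
```

With a mean duration of 8 frames, the geometric model still assigns its highest probability to a 1-frame phone, while a gamma with shape 4 and scale 2 (same mean) peaks near 6 frames, which is why it models phone durations so much better.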