Publications catalogue - books



Text, Speech and Dialogue: 10th International Conference, TSD 2007, Pilsen, Czech Republic, September 3-7, 2007. Proceedings

Václav Matoušek; Pavel Mautner (eds.)

Conference: 10th International Conference on Text, Speech and Dialogue (TSD). Pilsen, Czech Republic. September 3-7, 2007

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Data Mining and Knowledge Discovery; Information Storage and Retrieval; Information Systems Applications (incl. Internet)

Availability
Detected institution: Not detected
Year of publication: 2007
Browse: SpringerLink

Information

Resource type:

Books

Print ISBN

978-3-540-74627-0

Electronic ISBN

978-3-540-74628-7

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2007

Publication rights information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

User Modeling to Support the Development of an Auditory Help System

Flaithrí Neff; Aidan Kehoe; Ian Pitt

The implementations of online help in most commercial computing applications deployed today have a number of well-documented limitations. Speech technology can be used to complement traditional online help systems and mitigate some of these problems. This paper describes a model used to guide the design and implementation of an experimental auditory help system, and presents results from a pilot test of that system.

- Speech | Pp. 390-397

Fast Discriminant Training of Semi-continuous HMM

G. Linarès; C. Lévy

In this paper, we introduce a fast estimation algorithm for discriminant training of semi-continuous HMMs (Hidden Markov Models).

We first present the (FD) method proposed in [1] for weight re-estimation. Then, the weight update equation is formulated in the specific framework of semi-continuous models. Finally, we propose an approximate update function which requires very few computational resources.

The first experiments validate this method by comparing our fast discriminant weighting (FDW) to the original one. We observe that, on a digit recognition task, FDW and the FD estimate obtain similar results, while our method significantly decreases the computation time.

A second experiment evaluates FDW on a Large Vocabulary Continuous Speech Recognition (LVCSR) task. We incorporate semi-continuous FDW models in a Broadcast News (BN) transcription system. Experiments are carried out in the framework of the ESTER evaluation campaign [12]. Results show that, in the particular context of very compact acoustic models, discriminant weights improve system performance compared to both a baseline continuous system and an SCHMM trained by the MLE algorithm.

- Speech | Pp. 398-405
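
The abstract above assumes familiarity with the structure of semi-continuous HMMs. As background, the following is a minimal sketch, assuming toy sizes and random parameters, of how a semi-continuous model shares a single Gaussian codebook across all states so that only the state-specific mixture weights differ; it is not the FDW weight update proposed in the paper.

```python
# Minimal semi-continuous HMM emission sketch (background only, not the
# paper's FDW algorithm).  All states share one Gaussian codebook and
# differ only in their mixture weights:
#   b_j(x) = sum_m w_jm * N(x; mu_m, Sigma_m)
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
dim, n_codebook, n_states = 13, 8, 4          # toy sizes, chosen arbitrarily

# Shared codebook: means and diagonal covariances used by every state.
means = rng.normal(size=(n_codebook, dim))
covs = np.ones((n_codebook, dim))

# State-specific mixture weights (each row sums to one).
weights = rng.random((n_states, n_codebook))
weights /= weights.sum(axis=1, keepdims=True)

def emission_likelihoods(x):
    """Likelihood of observation x under every state of the toy SCHMM."""
    # Codebook densities are computed once and reused by all states;
    # this sharing is what makes semi-continuous models so compact.
    dens = np.array([
        multivariate_normal.pdf(x, mean=means[m], cov=np.diag(covs[m]))
        for m in range(n_codebook)
    ])
    return weights @ dens                      # one likelihood per state

print(emission_likelihoods(rng.normal(size=dim)))
```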

Speech/Music Discrimination Using Mel-Cepstrum Modulation Energy

Bong-Wan Kim; Dae-Lim Choi; Yong-Ju Lee

In this paper, we propose Mel-cepstrum modulation energy (MCME), an extension of modulation energy (ME), as a feature for discriminating between speech and music data. MCME is extracted from the time trajectory of Mel-frequency cepstral coefficients (MFCC), while ME is based on the spectrum. As cepstral coefficients are mutually uncorrelated, we expect MCME to perform better than ME. To find the best modulation frequency for MCME, we carry out experiments with modulation frequencies from 4 Hz to 20 Hz and compare the results with those obtained from ME and the MFCC-based cepstral flux. In the experiments, 8 Hz MCME shows the best discrimination performance, yielding a discrimination error reduction rate of 71% compared with 4 Hz ME. Compared with the cepstral flux (CF), it shows an error reduction rate of 53%.

- Speech | Pp. 406-414
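
The sketch below is a rough, hedged interpretation of the feature described above, not the paper's exact MCME definition: take the time trajectory of each MFCC coefficient, Fourier-transform it, and read off the energy near a chosen modulation frequency (8 Hz gave the best discrimination in the paper's experiments). The function name, the use of librosa for MFCC extraction, and the synthetic test signal are assumptions.

```python
# Rough interpretation of Mel-cepstrum modulation energy (MCME); the
# paper's exact definition may differ.
import numpy as np
import librosa

def mel_cepstrum_modulation_energy(y, sr, mod_freq=8.0, n_mfcc=13, hop_length=512):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    frame_rate = sr / hop_length                      # MFCC frames per second
    traj = mfcc - mfcc.mean(axis=1, keepdims=True)    # remove DC per coefficient
    spec = np.abs(np.fft.rfft(traj, axis=1)) ** 2     # modulation spectrum per coefficient
    freqs = np.fft.rfftfreq(traj.shape[1], d=1.0 / frame_rate)
    bin_idx = int(np.argmin(np.abs(freqs - mod_freq)))
    return spec[:, bin_idx].sum()                     # energy near mod_freq, summed over coefficients

# Usage on a synthetic signal with strong 8 Hz amplitude modulation,
# roughly the syllable rate that distinguishes speech from music.
sr = 16000
t = np.linspace(0.0, 2.0, 2 * sr, endpoint=False)
y = np.sin(2 * np.pi * 220 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 8 * t))
print(mel_cepstrum_modulation_energy(y, sr))
```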

Parameterization of the Input in Training the HVS Semantic Parser

Jan Švec; Filip Jurčíček; Luděk Müller

The aim of this paper is to present an extension of the hidden vector state semantic parser. First, we describe the statistical semantic parsing and its decomposition into the semantic and the lexical model. Subsequently, we present the original hidden vector state parser. Then, we modify its lexical model so that it supports the use of the input sequence of feature vectors instead of the sequence of words. We compose the feature vector from the automatically generated linguistic features (lemma form and morphological tag of the original word). We also examine the effect of including the original word into the feature vector. Finally, we evaluate the modified semantic parser on the Czech Human-Human train timetable corpus. We found that the performance of the semantic parser improved significantly compared with the baseline hidden vector state parser.

- Speech | Pp. 415-422
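
To make the modified input representation concrete, here is a minimal sketch of what a feature-vector input to the lexical model could look like, with each token carrying its surface form, lemma and morphological tag. The TokenFeatures class and the Czech example values (including the simplified tags) are illustrative assumptions, not taken from the paper's corpus.

```python
# Illustrative input representation: each position in the utterance is a
# feature vector of automatically generated linguistic features rather
# than a bare word.  The example values below are made up for illustration.
from dataclasses import dataclass

@dataclass
class TokenFeatures:
    word: str    # original surface form (optionally included, per the paper's experiment)
    lemma: str   # lemma from a morphological analyser
    tag: str     # morphological tag (simplified here)

utterance = [
    TokenFeatures(word="chtěl", lemma="chtít", tag="Vp"),
    TokenFeatures(word="bych",  lemma="být",   tag="Vc"),
    TokenFeatures(word="jet",   lemma="jet",   tag="Vf"),
    TokenFeatures(word="do",    lemma="do",    tag="RR"),
    TokenFeatures(word="Prahy", lemma="Praha", tag="NN"),
]

# The lexical model would then condition on (lemma, tag), and optionally
# the original word, instead of on the word alone.
features = [(t.lemma, t.tag) for t in utterance]
print(features)
```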

A Comparison Using Different Speech Parameters in the Automatic Emotion Recognition Using Feature Subset Selection Based on Evolutionary Algorithms

Aitor Álvarez; Idoia Cearreta; Juan Miguel López; Andoni Arruti; Elena Lazkano; Basilio Sierra; Nestor Garay

The study of emotions in human-computer interaction is a growing research area. Focusing on automatic emotion recognition, work is being performed in order to achieve good results, particularly in speech and facial gesture recognition. This paper presents a study in which, using a wide range of speech parameters, the improvement in emotion recognition rates is analyzed. Using an emotional multimodal bilingual database for Spanish and Basque, emotion recognition rates in speech have improved significantly for both languages compared with previous studies. In this particular case, as in previous studies, machine learning techniques based on evolutionary algorithms (EDA) have proven to be the best optimizers of emotion recognition rates.

- Speech | Pp. 423-430
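
As an illustration of the feature subset selection idea, the sketch below runs a toy estimation-of-distribution algorithm (UMDA-style) over binary feature-inclusion masks. The synthetic data, the k-NN scorer, and all hyperparameters are assumptions standing in for the emotional speech corpus and classifiers used in the paper.

```python
# Toy UMDA-style EDA for feature subset selection: keep a per-feature
# inclusion probability, sample candidate masks, and re-estimate the
# probabilities from the best-scoring half of each generation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           n_redundant=4, random_state=42)
n_features = X.shape[1]

def score(mask):
    """Cross-validated accuracy of a k-NN classifier on the selected features."""
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

probs = np.full(n_features, 0.5)              # initial inclusion probabilities
best_mask, best_score = None, -1.0
for generation in range(15):
    population = rng.random((40, n_features)) < probs
    scores = np.array([score(mask) for mask in population])
    elite = population[np.argsort(scores)[-20:]]
    probs = 0.5 * probs + 0.5 * elite.mean(axis=0)    # smoothed probability update
    if scores.max() > best_score:
        best_score = float(scores.max())
        best_mask = population[int(np.argmax(scores))]

print("selected features:", np.flatnonzero(best_mask))
print("cross-validated accuracy:", round(best_score, 3))
```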

Benefit of Maximum Likelihood Linear Transform (MLLT) Used at Different Levels of Covariance Matrices Clustering in ASR Systems

Josef V. Psutka

The paper discusses the benefit of a Maximum Likelihood Linear Transform (MLLT) applied on selected groups of covariance matrices. The matrices were chosen and clustered using phonetic knowledge. Results of experiments are compared with outcomes obtained for diagonal and full covariance matrices of a baseline system and also for widely used transforms based on Linear Discriminant Analysis (LDA), Heteroscedastic LDA (HLDA) and Smoothed HLDA (SHLDA).

- Speech | Pp. 431-438
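
For readers unfamiliar with MLLT, the sketch below evaluates the transform criterion under one common formulation (maximizing the likelihood of diagonally modelled, linearly transformed covariances, weighted by occupation counts); the toy matrices and counts are assumptions, and the optimization of the transform itself is omitted.

```python
# One common MLLT-style criterion for a square transform A over a group
# of clustered covariance matrices Sigma_j with occupation counts gamma_j:
#   F(A) = sum_j gamma_j * ( log|det A| - 0.5 * sum_d log (A Sigma_j A^T)_dd )
# Only the objective is evaluated here; the row-by-row or gradient-based
# optimization used in practice is not shown.
import numpy as np

rng = np.random.default_rng(1)
dim, n_clusters = 6, 5

# Toy "clustered" full covariance matrices and their occupation counts.
sigmas = []
for _ in range(n_clusters):
    B = rng.normal(size=(dim, dim))
    sigmas.append(B @ B.T + dim * np.eye(dim))        # symmetric positive definite
gammas = rng.integers(100, 1000, size=n_clusters).astype(float)

def mllt_objective(A, sigmas, gammas):
    """Diagonal-likelihood objective of transform A over the cluster group."""
    logdet = np.linalg.slogdet(A)[1]
    total = 0.0
    for gamma, sigma in zip(gammas, sigmas):
        diag = np.diag(A @ sigma @ A.T)
        total += gamma * (logdet - 0.5 * np.sum(np.log(diag)))
    return total

print("objective at A = I:", mllt_objective(np.eye(dim), sigmas, gammas))
```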

Information Retrieval Test Collection for Searching Spontaneous Czech Speech

Pavel Ircing; Pavel Pecina; Douglas W. Oard; Jianqiang Wang; Ryen W. White; Jan Hoidekr

This paper describes the design of the first large-scale IR test collection built for the Czech language. The creation of this collection also happens to be very challenging, as it is based on a continuous text stream from automatic transcription of spontaneous speech and thus lacks clearly defined document boundaries. All aspects of the collection building are presented, together with some general findings of initial experiments.

- Speech | Pp. 439-446
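
Because the transcribed stream has no document boundaries, one common workaround (not necessarily the approach taken for this collection) is to cut it into overlapping fixed-length passages and index them as pseudo-documents; the window sizes and toy transcript below are assumptions.

```python
# Cut a continuous word stream into overlapping fixed-length passages
# that a standard IR engine can index as pseudo-documents.
def passages(words, length=100, overlap=50):
    """Yield overlapping word windows from a continuous transcript."""
    step = length - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield " ".join(words[start:start + length])

# Toy transcript standing in for an automatic transcription stream.
transcript = "toto je souvislý proud slov z automatického přepisu".split() * 40
for i, passage in enumerate(passages(transcript, length=60, overlap=30)):
    print(f"doc_{i:03d}", passage[:40], "...")
```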

Inter-speaker Synchronization in Audiovisual Database for Lip-Readable Speech to Animation Conversion

Gergely Feldhoffer; Balázs Oroszi; György Takács; Attila Tihanyi; Tamás Bárdi

The present study proposes an inter-speaker audiovisual synchronization method to decrease the speaker dependency of our direct speech-to-animation conversion system. Our aim is to convert an everyday speaker’s voice into lip-readable facial animation for hearing-impaired users. This conversion needs mixed training data: acoustic features from normal speakers coupled with visual features from professional lip-speakers. Audio and video data of normal and professional speakers were synchronized with the Dynamic Time Warping method. The quality and usefulness of the synchronization were investigated in a subjective test measuring noticeable conflicts between the audio and visual parts of speech stimuli. An objective test was also carried out by training a neural network on the synchronized audiovisual data with an increasing number of speakers.

- Speech | Pp. 447-454
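
The synchronization described above relies on dynamic time warping. The sketch below is the textbook DTW alignment of two feature sequences, returning the accumulated cost and the warping path, which could then be used to pair one speaker's frames with another's; the random toy features and their dimensions are assumptions, not the paper's configuration.

```python
# Textbook dynamic time warping (DTW) between two feature sequences.
import numpy as np

def dtw_path(a, b):
    """Align sequences a (n, d) and b (m, d); return total cost and warping path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])    # local Euclidean distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda p: cost[p])
    return cost[n, m], path[::-1]

rng = np.random.default_rng(3)
normal_speaker = rng.normal(size=(50, 12))    # toy acoustic frames of a normal speaker
lip_speaker = rng.normal(size=(40, 12))       # toy acoustic frames of a professional lip-speaker
total_cost, path = dtw_path(normal_speaker, lip_speaker)
print("alignment cost:", round(float(total_cost), 2), "path length:", len(path))
```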

Constructing Empirical Models for Automatic Dialog Parameterization

Mikhail Alexandrov; Xavier Blanco; Natalia Ponomareva; Paolo Rosso

Automatic classification of dialogues between clients and a service center needs a preliminary dialogue parameterization. Such a parameterization usually faces essential difficulties when we deal with politeness, competence, satisfaction, and other similar characteristics of clients. In the paper, we show how to avoid these difficulties using empirical formulae based on lexical-grammatical properties of a text. Such formulae are trained on a given set of examples, which are evaluated manually by an expert (or experts), and the best formula is selected by the Ivakhnenko Method of Model Self-Organization. We test the suggested methodology on a real set of dialogues from the Barcelona railway directory inquiries for estimating passenger politeness.

- Speech | Pp. 455-463
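
The selection step can be illustrated with a minimal GMDH-style sketch, assuming synthetic data: candidate empirical formulae (here, quadratic polynomials of two features) are fitted on a training split and the one with the lowest validation error is kept, in the spirit of the Ivakhnenko Method of Model Self-Organization. The feature matrix and target below stand in for the expert-evaluated politeness scores.

```python
# Minimal GMDH-style selection of an empirical formula: fit a quadratic
# two-input polynomial for every feature pair on a training split and
# keep the pair with the lowest validation error.
import itertools
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((120, 5))                          # 5 toy lexical-grammatical features
y = 2 * X[:, 0] * X[:, 1] + 0.3 * X[:, 3] + 0.05 * rng.normal(size=120)
X_train, X_val = X[:80], X[80:]
y_train, y_val = y[:80], y[80:]

def design(xi, xj):
    """Quadratic two-input polynomial basis used by classic GMDH units."""
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])

best = None
for i, j in itertools.combinations(range(X.shape[1]), 2):
    coef, *_ = np.linalg.lstsq(design(X_train[:, i], X_train[:, j]), y_train, rcond=None)
    val_error = np.mean((design(X_val[:, i], X_val[:, j]) @ coef - y_val) ** 2)
    if best is None or val_error < best[0]:
        best = (val_error, (i, j), coef)

print("best feature pair:", best[1], "validation MSE:", round(best[0], 4))
```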

The Effect of Lexicon Composition in Pronunciation by Analogy

Tasanawan Soonklang; R. I. Damper; Yannick Marchand

Pronunciation by analogy (PbA) is a data-driven approach to phonetic transcription that generates pronunciations for unknown words by exploiting the phonological knowledge implicit in the dictionary that provides the primary source of pronunciations. Unknown words typically include low-frequency ‘common’ words, proper names or neologisms that have not yet been listed in the lexicon. It is received wisdom in the field that knowledge of the class of a word (common versus proper name) is necessary for correct transcription, but in a practical text-to-speech system we do not know the class of the unknown word. So if we have a dictionary of common words and another of proper names, we do not know which one to use for analogy unless we attempt to infer the class of unknown words. Such inference is likely to be error-prone. Hence it is of interest to know the cost of such errors (if we are using separate dictionaries) and/or the cost of simply using a single, undivided dictionary, effectively ignoring the problem. Here, we investigate the effect of lexicon composition: common words only, proper names only, or a mixture. Results suggest that high transcription accuracy may be achievable without prior classification.

- Speech | Pp. 464-471