Publications catalog - books


Text, Speech and Dialogue: 10th International Conference, TSD 2007, Pilsen, Czech Republic, September 3-7, 2007. Proceedings

Václav Matoušek ; Pavel Mautner (eds.)

In conference: 10th International Conference on Text, Speech and Dialogue (TSD). Pilsen, Czech Republic. September 3, 2007 - September 7, 2007

Abstract/Description - provided by the publisher

Not available.

Keywords - provided by the publisher

Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Data Mining and Knowledge Discovery; Information Storage and Retrieval; Information Systems Applications (incl. Internet)

Availability
Detected institution; Publication year; Browse; Download; Request
Not detected; 2007; SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-74627-0

Electronic ISBN

978-3-540-74628-7

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

Logic-Based Rhetorical Structuring for Natural Language Generation in Human-Computer Dialogue

Vladimir Popescu; Jean Caelen; Corneliu Burileanu

Rhetorical structuring is a field approached mostly by research in natural language (pragmatic) interpretation. However, in natural language generation (NLG) the rhetorical structure plays an important part, in monologues and dialogues alike. Hence, several approaches in this direction exist. In most of these, the rhetorical structure is calculated and built in the framework of Rhetorical Structure Theory (RST) or Centering Theory [7], [5]. In language interpretation, a more recent formal account of rhetorical structuring has emerged, namely Segmented Discourse Representation Theory (SDRT), which alleviates some of the issues and weaknesses inherent in previous theories [1]. Research has been initiated in rhetorical structuring for NLG using SDRT, mostly concerning monologues [3]. Most approaches to using and/or approximating SDRT in computer implementations lean on dynamic semantics, derived from Discourse Representation Theory (DRT), in order to compute rhetorical relations [9]. Some efforts exist in approximating SDRT using less expressive (and less expensive) logics, such as First Order Logic (FOL) or Dynamic Predicate Logic (DPL), but these efforts concern language interpretation [10]. This paper describes a rhetorical structuring component of a natural language generator for human-computer dialogue, using SDRT approximated in FOL and complemented by a domain-independent discourse ontology. The paper is structured as follows: the first section situates the research in context and motivates the approach; the second section describes the discourse ontology; the third section describes the approximations made to vanilla SDRT so that it can be used for language generation; the fourth section describes an algorithm for updating the discourse structure of the current dialogue; the fifth section provides a detailed example of rhetorical relation computation. The sixth section concludes the paper and gives pointers to future research and improvements.

- Speech | Pp. 309-317

Text-Independent Speaker Identification Using Temporal Patterns

Tobias Bocklet; Andreas Maier; Elmar Nöth

In this work we present an approach for text-independent speaker recognition. As features we used Mel Frequency Cepstrum Coefficients (MFCCs) and Temporal Patterns (TRAPs). For each speaker we trained Gaussian Mixture Models (GMMs) with different numbers of densities. The database used comprised 36 speakers with very noisy close-talking recordings. For training, a Universal Background Model (UBM) is built with the EM algorithm from all available training data. This UBM is then used to create speaker-dependent models for each speaker. This can be done in two ways: taking the UBM as an initial model for EM training, or Maximum-A-Posteriori (MAP) adaptation. For the 36-speaker database, the use of TRAPs instead of MFCCs leads to a frame-wise recognition improvement of 12.0%. The adaptation with MAP enhanced the recognition rate by another 14.2%.

- Speech | Pp. 318-325

Recording and Annotation of Speech Corpus for Czech Unit Selection Speech Synthesis

Jindřich Matoušek; Jan Romportl

The paper gives a brief summary of the preparation and recording of a phonetically and prosodically rich speech corpus for Czech unit selection text-to-speech synthesis. Special attention is paid to the process of two-phase orthographic annotation of the recorded sentences with regard to their coherence.

- Speech | Pp. 326-333

Sk-ToBI Scheme for Phonological Prosody Annotation in Slovak

Milan Rusko; Róbert Sabo; Martin Dzúr

Research and development in speech synthesis and recognition call for a phonological intonation annotation scheme for the particular language. Inspired by the successful ToBI (Tones and Break Indices) for American English [1] and GToBI [2] for German, this paper introduces a new intonation annotation scheme for Slovak, Sk-ToBI. Although Slovak prosodic rules differ from those of English or German, we decided to follow the main principles of ToBI and to define a special Slovak version of the Tones and Break Indices annotation scheme. The speech material belonging to different styles, which was used for the preliminary study of accents in Slovak, is briefly described and the conventions of Sk-ToBI annotation are presented.

- Speech | Pp. 334-341

Towards Automatic Transcription of Large Spoken Archives in Agglutinating Languages – Hungarian ASR for the MALACH Project

Péter Mihajlik; Tibor Fegyó; Bottyán Németh; Zoltán Tüske; Viktor Trón

The paper describes automatic speech recognition experiments and results on the spontaneous Hungarian MALACH speech corpus. A novel morph-based lexical modeling approach is compared to the traditional word-based one and to another, previously best-performing morph-based one in terms of word and letter error rates. The applied language and acoustic modeling techniques are also detailed. Using unsupervised speaker adaptation along with morph-based lexical models, absolute word error rate reductions of 14.4%-8.1% have been achieved on a 2-speaker, 2-hour test set as compared to the speaker-independent baseline results.

- Speech | Pp. 342-349

Non-uniform Speech/Audio Coding Exploiting Predictability of Temporal Evolution of Spectral Envelopes

Petr Motlicek; Hynek Hermansky; Sriram Ganapathy; Harinath Garudadri

We describe a novel speech/audio coding technique designed to operate at medium bit-rates. Unlike classical state-of-the-art coders that are based on short-term spectra, our approach uses relatively long temporal segments of the audio signal in critical-band-sized sub-bands. We apply an auto-regressive model to approximate Hilbert envelopes in frequency sub-bands. Residual signals (Hilbert carriers) are demodulated and thresholding functions are applied in the spectral domain. The Hilbert envelopes and carriers are quantized and transmitted to the decoder. Our experiments focused on designing a speech/audio coder to provide broadcast radio-like audio quality at around 15-25 kbps. Objective quality measures, obtained on standard speech recordings, were compared to those of the state-of-the-art 3GPP-AMR speech coding system.

- Speech | Pp. 350-357

Filled Pauses in Speech Synthesis: Towards Conversational Speech

Jordi Adell; Antonio Bonafonte; David Escudero

Speech synthesis techniques have already reached a high level of naturalness. However, they are often evaluated on text reading tasks. New applications will call for conversational speech instead, and disfluencies are crucial in such a style. This paper presents a system to predict filled pauses and synthesise them. Objective results show that they can be inserted with 96% precision and 58% recall. Perceptual results even show that their insertion increases the naturalness of synthetic speech.

- Speech | Pp. 358-365

Exploratory Analysis of Word Use and Sentence Length in the Spoken Dutch Corpus

Pascal Wiggers; Leon J. M. Rothkrantz

We present an analysis of word use and sentence length in different types of Dutch speech, ranging from conversations through discussions and formal speech to read speech. We find that the distributions of sentence length and personal pronouns are characteristic of the type of speech. In addition, we analyzed differences in word use between male and female speakers and between speakers with high and low education levels. We find that male speakers use more fillers, while women use more pronouns and adverbs. Furthermore, gender-specific differences turn out to be stronger than differences in language use between groups with different education levels.

- Speech | Pp. 366-373

Design of Tandem Architecture Using Segmental Trend Features

Young-Sun Yun; Yunkeun Lee

This paper investigates the tandem architecture (TA) based on segmental features. Previous studies have reported that recognition systems based on segmental features show better results than systems based on conventional features. In this paper we merge the segmental feature with the tandem architecture, which uses both hidden Markov models and neural networks. In general, segmental features can be separated into the trend and the location. Since the trend represents the variation of segmental features and occupies a large portion of them, the trend information was used as an independent or additional feature for the speech recognition system. We applied the trend information of segmental features to TA and used posterior probabilities, which are the output of the neural network, as inputs to the recognition system. Experiments were performed on the Aurora2 database to examine the potential of the trend feature based TA. The results of our experiments verified that the proposed system outperforms the conventional system in very low SNR environments. These findings led us to conclude that the trend information on TA can be used in addition to the traditional MFCC features.

- Speech | Pp. 374-381

An Automatic Retraining Method for Speaker Independent Hidden Markov Models

András Bánhalmi; Róbert Busa-Fekete; András Kocsor

When training speaker-independent HMM-based acoustic models, a lot of manually transcribed acoustic training data must be available from a good many different speakers. These training databases show great variation in the speakers' pitch, articulation and speed of talking. In practice, the speaker-independent models are used for bootstrapping the speaker-dependent models built by speaker adaptation methods. Thus the performance of the adaptation methods is strongly influenced by the performance of the speaker-independent model and by the accuracy of the automatic segmentation, which also depends on the base model. In practice, the performance of the speaker-independent models can vary a great deal across test speakers. Here our goal is to reduce this performance variability by increasing the performance for the speakers with low values, at the price of allowing a small drop in the highest performance values. For this purpose we propose a new method for the automatic retraining of speaker-independent HMMs.

- Speech | Pp. 382-389