Publications catalog - books



Chinese Spoken Language Processing: 5th International Symposium, ISCSLP 2006, Singapore, December 13-16, 2006, Proceedings

Qiang Huo ; Bin Ma ; Eng-Siong Chng ; Haizhou Li (eds.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Mathematical Logic and Formal Languages; Data Mining and Knowledge Discovery; Algorithm Analysis and Problem Complexity; Document Preparation and Text Processing

Availability
Institution detected | Year of publication | Browse
Not detected | 2006 | SpringerLink

Información

Resource type:

books

Print ISBN

978-3-540-49665-6

Electronic ISBN

978-3-540-49666-3

Publisher

Springer Nature

Country of publication

Germany

Publication date

2006

Publication rights information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

Prosodic Word Prediction Using a Maximum Entropy Approach

Honghui Dong; Jianhua Tao; Bo Xu

As the basic prosodic unit, the prosodic word greatly influences the naturalness and intelligibility of synthesized speech. Although research shows that lexicon words differ considerably from prosodic words, lexicon words still provide important cues for prosodic word formation. Rhythm constraints are another important factor in prosodic word prediction: certain lexicon word length patterns tend to be combined. Based on the mapping relationship and the differences between lexicon words and prosodic words, prosodic word prediction is divided into two subtasks: grouping lexicon words into a prosodic word, and splitting a lexicon word into prosodic words. This paper proposes a maximum entropy method to model each subtask. Experimental results show that the maximum entropy model is well suited to prosodic word prediction. In the word grouping model, a feature selection algorithm induces more efficient features, which not only greatly reduces the number of features but also improves model performance. The splitting model can correctly detect prosodic word boundaries within lexicon words. The f-score of prosodic word boundary prediction reaches 95.55%.

- Speech Synthesis and Generation | Pp. 169-178
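
The abstract's core device, a maximum entropy classifier over juncture features, can be illustrated with scikit-learn's LogisticRegression, which fits the same log-linear model family. A minimal sketch follows; the feature template and toy data are illustrative assumptions, not the authors' actual feature set.

```python
# Minimal sketch: maximum-entropy prosodic word boundary prediction.
# The feature template below is illustrative, not the paper's actual set.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: one example per lexicon-word juncture.
# Label 1 = prosodic word boundary at this juncture, 0 = words group together.
train_features = [
    {"left_len": 2, "right_len": 1, "left_pos": "n", "right_pos": "u"},
    {"left_len": 1, "right_len": 2, "left_pos": "d", "right_pos": "v"},
    {"left_len": 2, "right_len": 2, "left_pos": "v", "right_pos": "n"},
    {"left_len": 1, "right_len": 1, "left_pos": "u", "right_pos": "n"},
]
train_labels = [1, 0, 1, 0]

vec = DictVectorizer()
X = vec.fit_transform(train_features)

# Logistic regression over indicator features is equivalent to a maxent model.
model = LogisticRegression(max_iter=1000)
model.fit(X, train_labels)

# Predict whether a new juncture is a prosodic word boundary.
test = vec.transform([{"left_len": 2, "right_len": 1,
                       "left_pos": "n", "right_pos": "u"}])
print(model.predict(test), model.predict_proba(test))
```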

Predicting Prosody from Text

Keh-Jiann Chen; Chiu-yu Tseng; Chia-hung Tai

To improve unrestricted TTS, a framework that organizes multiple perceived units into discourse was proposed in [1]. To build an unrestricted TTS system, the original text must be transformed into text annotated with the corresponding boundary breaks, so this paper describes how prosody is predicted from text. We use corpora annotated with boundary breaks that follow the prosody framework, and then use lexical and syntactic information to predict prosody from text. The results show that the weighted precision of our model is better than that of some human speakers, demonstrating that the model can predict reasonable prosody from text.

- Speech Synthesis and Generation | Pp. 179-188
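
A rough illustration of scoring predicted boundary breaks against a reference annotation. The break inventory and the reading of "weighted precision" below (per-class precision weighted by reference frequency) are assumptions; the abstract does not define the metric exactly.

```python
# Sketch: scoring predicted boundary breaks against a reference annotation.
# Break labels per juncture (B0 = no break ... B3 = major break) are
# illustrative; the paper's actual break inventory follows its framework [1].
from collections import Counter

reference = ["B0", "B2", "B0", "B1", "B3", "B0", "B2", "B1"]
predicted = ["B0", "B2", "B1", "B1", "B3", "B0", "B0", "B1"]

def per_class_precision(ref, pred):
    correct, total = Counter(), Counter()
    for r, p in zip(ref, pred):
        total[p] += 1
        if r == p:
            correct[p] += 1
    return {c: correct[c] / total[c] for c in total}

# One plausible "weighted precision": per-class precision averaged with
# weights proportional to each break type's frequency in the reference.
prec = per_class_precision(reference, predicted)
weights = Counter(reference)
n = len(reference)
weighted = sum(prec.get(c, 0.0) * weights[c] / n for c in weights)
print(prec, round(weighted, 3))
```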

Nonlinear Emotional Prosody Generation and Annotation

Jianhua Tao; Jian Yu; Yongguo Kang

Emotion is an important element in expressive speech synthesis. This paper briefly analyzes prosodic parameters, stress, rhythm and paralinguistic information in different emotional speech, and labels the speech with rich annotation information in multiple layers. A CART model is then used for emotional prosody generation. Unlike the traditional linear modification method, which directly modifies F0 contours and syllable durations from the acoustic distributions of emotional speech (F0 topline, F0 baseline, durations and intensities), the CART model maps the subtle prosody distributions between neutral and emotional speech given various context information. Experiments show that with the CART model the traditional context information is able to generate good emotional prosody output, but the results improve when richer information, such as stress, break and jitter information, is integrated into the context.

- Speech Synthesis and Generation | Pp. 189-199
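
The CART mapping described above can be sketched with scikit-learn's DecisionTreeRegressor, which implements the CART algorithm. The features, targets and data below are illustrative assumptions, not the paper's actual parameterization.

```python
# Sketch: CART-style mapping from neutral prosody + context to emotional
# prosody, in the spirit described above. All features and values are toys.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Per-syllable inputs: [neutral mean F0 (Hz), neutral duration (ms),
# stress flag, break index]; targets: [emotional F0 ratio, duration ratio].
X = rng.uniform([150, 120, 0, 0], [300, 320, 1, 3], size=(200, 4))
X[:, 2] = (X[:, 2] > 0.5)          # make the stress feature a binary flag
X[:, 3] = np.round(X[:, 3])        # make the break index an integer level
y = np.column_stack([
    1.0 + 0.3 * X[:, 2] + 0.02 * X[:, 3],   # F0 raised on stress/breaks
    1.0 + 0.1 * X[:, 3],                     # lengthening near breaks
]) + rng.normal(0, 0.02, (200, 2))

cart = DecisionTreeRegressor(max_depth=5).fit(X, y)

# Apply the learned mapping to one syllable of a neutral utterance.
neutral = np.array([[220.0, 180.0, 1, 2]])
f0_ratio, dur_ratio = cart.predict(neutral)[0]
print(f"emotional F0 = {220.0 * f0_ratio:.1f} Hz, "
      f"duration = {180.0 * dur_ratio:.1f} ms")
```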

A Unified Framework for Text Analysis in Chinese TTS

Guohong Fu; Min Zhang; GuoDong Zhou; Kang-Kuong Luke

This paper presents a robust text analysis system for Chinese text-to-speech synthesis. In this study, a lexicon word or a run of non-hanzi characters of the same category (e.g. a digit string) is defined as a morpheme, the basic unit forming a Chinese word. Based on this definition, the three key issues in interpreting real Chinese text, namely lexical disambiguation, unknown word resolution and non-standard word (NSW) normalization, can be unified in a single framework and reformulated as a two-pass tagging task on a sequence of morphemes. Our system consists of four main components: (1) a pre-segmenter for sentence segmentation and morpheme segmentation; (2) a lexicalized HMM-based chunker for identifying unknown words and guessing their part-of-speech categories; (3) an HMM-based tagger for converting orthographic morphemes to their Chinese phonetic representation (viz. pinyin), given their word-formation patterns and part-of-speech information; and (4) a post-processor for interpreting phonetic tags and fine-tuning the pronunciation of some special NSWs where necessary. Evaluation on a pinyin-annotated corpus built from the Peking University corpus shows that our system achieves a correct interpretation for most words.

- Speech Synthesis and Generation | Pp. 200-210
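
Components (2) and (3) are HMM-based sequence taggers, whose decoding core is the Viterbi algorithm. A minimal sketch under toy probabilities follows; the states, observation indices and the polyphone example are illustrative, not taken from the paper.

```python
# Sketch: the Viterbi core of an HMM tagger of the kind used in components
# (2) and (3); states would be chunk tags or pinyin tags. Toy probabilities.
import numpy as np

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for an integer observation sequence."""
    T, N = len(obs), len(states)
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (prev, next) scores
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                    # backtrace
        path.append(int(back[t][path[-1]]))
    return [states[s] for s in reversed(path)]

# Toy example: disambiguating the pinyin of the polyphone 行 (xing2 vs.
# hang2); the observation indices stand in for morpheme contexts.
states = ["xing2", "hang2"]
log_start = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_emit = np.log([[0.8, 0.2], [0.3, 0.7]])          # P(context | tag)
print(viterbi([0, 1, 0], states, log_start, log_trans, log_emit))
```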

Speech Synthesis Based on a Physiological Articulatory Model

Qiang Fang; Jianwu Dang

In this paper, a framework for speech synthesis based on a physiological articulatory model is proposed to replicate the human speech production process. The framework begins with given articulatory targets; muscle activation patterns are then estimated from the targets by accounting for both equilibrium characteristics and muscle dynamics, and the articulatory model is driven, by contracting the corresponding muscles, to generate a time-varying vocal tract shape matching the targets. A transmission line model is then applied to the time-varying vocal tract to produce the speech sound. Finally, a preliminary experiment synthesizes the single vowels and diphthongs of Chinese with the physiological-articulatory-model-based synthesizer. The results show that the spectra of the synthesized single vowels are consistent with those of real speech, and proper acoustic characteristics are obtained in most cases for diphthongs.

- Speech Synthesis and Generation | Pp. 211-222
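
The transmission line stage can be sketched with a classic Kelly-Lochbaum scattering lattice, a standard simplified form of such vocal tract models; the paper's physiological model is far more detailed. The area function, reflection conventions and glottal source below are illustrative assumptions.

```python
# Sketch: a Kelly-Lochbaum-style transmission line, a classic simplified
# stand-in for the vocal tract model mentioned above. The area function is
# an illustrative static /a/-like shape, not the paper's physiological model.
import numpy as np

areas = np.array([2.6, 1.8, 1.0, 0.8, 1.2, 2.5, 4.0, 5.0])  # cm^2, glottis->lips
# Reflection coefficients at the tube junctions.
k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])

def tract_filter(source, k, lip_reflection=-0.85):
    """Scattering-junction simulation with forward/backward traveling waves."""
    n_sec = len(k) + 1
    fwd = np.zeros(n_sec)    # waves traveling toward the lips
    bwd = np.zeros(n_sec)    # waves traveling toward the glottis
    out = np.zeros(len(source))
    for t, s in enumerate(source):
        fwd_new = np.empty(n_sec)
        bwd_new = np.empty(n_sec)
        fwd_new[0] = s + 0.9 * bwd[0]            # glottal source + reflection
        bwd_new[-1] = lip_reflection * fwd[-1]   # reflection at the lips
        for j in range(len(k)):                  # scattering at each junction
            fwd_new[j + 1] = (1 + k[j]) * fwd[j] - k[j] * bwd[j + 1]
            bwd_new[j] = k[j] * fwd[j] + (1 - k[j]) * bwd[j + 1]
        fwd, bwd = fwd_new, bwd_new
        out[t] = fwd[-1] * (1 + lip_reflection)  # radiated pressure at lips
    return out

# Drive the tract with a simple impulse-train glottal source at ~120 Hz.
fs = 8000
source = np.zeros(fs // 4)
source[:: fs // 120] = 1.0
speech = tract_filter(source, k)
print(speech[:10])
```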

An HMM-Based Mandarin Chinese Text-To-Speech System

Yao Qian; Frank Soong; Yining Chen; Min Chu

In this paper we present our Hidden Markov Model (HMM)-based Mandarin Chinese Text-to-Speech (TTS) system. Mandarin Chinese, or Putonghua ("the common spoken language"), is a tone language in which each of the 400-plus base syllables can carry up to 5 different lexical tone patterns. Segmental and supra-segmental information is modeled by 3 corresponding HMM streams: (1) spectral envelope and gain; (2) voiced/unvoiced decision and fundamental frequency; and (3) segment duration. The HMMs are trained on a read-speech database of 1,000 sentences recorded by a female speaker. The spectral information is derived from short-time LPC analysis. Among all LPC-derived parameters, the Line Spectrum Pair (LSP) has the closest relevance to the natural resonances, or "formants", of a speech sound, and it is selected to parameterize the spectral information. Furthermore, the property that LSPs cluster around spectral peaks justifies augmenting LSPs with their dynamic counterparts, in both time and frequency, in HMM modeling and in parameter trajectory synthesis. One hundred sentences synthesized by 4 LSP-based systems were subjectively evaluated with an AB comparison test. The listening test results show that LSP and its dynamic counterparts, in both time and frequency, are preferred for their higher synthesized speech quality.

- Speech Synthesis and Generation | Pp. 223-232
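
The LSP parameterization chosen above is derived from the LPC polynomial A(z) through its symmetric and antisymmetric extensions P(z) = A(z) + z^-(p+1)A(z^-1) and Q(z) = A(z) - z^-(p+1)A(z^-1), whose unit-circle root angles are the LSP frequencies. A sketch with toy LPC coefficients, not the system's trained parameters:

```python
# Sketch: deriving Line Spectrum Pairs (LSPs) from an LPC polynomial A(z).
# In the system the coefficients come from short-time LPC analysis; here
# they are built from two toy resonances so the filter is stable.
import numpy as np

def lpc_to_lsp(a):
    """a = [1, a1, ..., ap]; returns LSP frequencies in radians."""
    a = np.asarray(a, dtype=float)
    # Symmetric and antisymmetric polynomials:
    #   P(z) = A(z) + z^-(p+1) A(z^-1),  Q(z) = A(z) - z^-(p+1) A(z^-1)
    a_pad = np.concatenate([a, [0.0]])          # A(z), padded to order p+1
    a_rev = np.concatenate([[0.0], a[::-1]])    # z^-(p+1) A(z^-1)
    P = a_pad + a_rev
    Q = a_pad - a_rev
    # For stable A(z), all roots of P and Q lie on the unit circle and their
    # angles interleave; keep one angle per conjugate pair, dropping z = +/-1.
    angles = []
    for poly in (P, Q):
        for r in np.roots(poly):
            ang = np.angle(r)
            if 1e-6 < ang < np.pi - 1e-6:
                angles.append(ang)
    return np.sort(angles)

# Toy stable LPC polynomial: two resonances (order 4).
a = np.convolve([1.0, -1.2728, 0.81], [1.0, 0.8, 0.64])
print(np.degrees(lpc_to_lsp(a)))
```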

HMM-Based Emotional Speech Synthesis Using Average Emotion Model

Long Qin; Zhen-Hua Ling; Yi-Jian Wu; Bu-Fan Zhang; Ren-Hua Wang

This paper presents a technique for synthesizing emotional speech based on an emotion-independent model called the "average emotion" model. The average emotion model is trained on a multi-emotion speech database. Applying an MLLR-based model adaptation method, we can transform the average emotion model to represent a target emotion that is not included in the training data. A multi-emotion speech database covering four emotions, "neutral", "happiness", "sadness" and "anger", is used in our experiment. The results of subjective tests show that the average emotion model can effectively synthesize neutral speech and can be adapted to a target emotion model using very limited training data.

- Speech Synthesis and Generation | Pp. 233-240
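
The MLLR step can be sketched as an affine transform of Gaussian means estimated by least squares. The sketch below assumes a single global transform, hard frame-to-Gaussian alignment and shared identity covariances; real MLLR uses occupation posteriors and per-Gaussian covariances.

```python
# Sketch: a global MLLR mean transform of the kind used above to adapt an
# "average emotion" model toward a target emotion. Simplified: hard
# alignment, shared identity covariances, simulated adaptation data.
import numpy as np

rng = np.random.default_rng(1)
dim, n_gauss, n_frames = 3, 4, 500

means = rng.normal(0, 1, (n_gauss, dim))     # average-model Gaussian means
align = rng.integers(0, n_gauss, n_frames)   # frame -> Gaussian alignment

# Simulated target-emotion data: an affine shift of the average model.
A_true, b_true = np.eye(dim) * 1.2, np.array([0.5, -0.3, 0.1])
obs = means[align] @ A_true.T + b_true + rng.normal(0, 0.05, (n_frames, dim))

# Extended mean vectors xi = [1, mu]; solve W minimizing sum ||o_t - W xi_t||^2.
xi = np.hstack([np.ones((n_frames, 1)), means[align]])
W, *_ = np.linalg.lstsq(xi, obs, rcond=None)  # shape (dim+1, dim)
W = W.T                                        # adapted mean = W @ [1, mu]

adapted = np.hstack([np.ones((n_gauss, 1)), means]) @ W.T
print(np.abs(adapted - (means @ A_true.T + b_true)).max())  # small residual
```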

A Hakka Text-To-Speech System

Hsiu-Min Yu; Hsin-Te Hwang; Dong-Yi Lin; Sin-Horng Chen

In this paper, the implementation of a Hakka text-to-speech (TTS) system is presented. The system follows the same design principles as the Mandarin and Min-Nan TTS systems proposed previously. It takes 671 base syllables as basic synthesis units and uses a recurrent neural network (RNN)-based prosody generator to produce proper prosodic parameters for synthesizing natural output speech. The whole system is implemented in software and runs in real time on a PC. An informal subjective listening test confirmed that the system performs well: the synthesized speech sounded good for well-tokenized texts and fair for texts with automatic tokenization.

- Speech Synthesis and Generation | Pp. 241-247
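
A minimal sketch of the shape of such an RNN prosody generator: per-syllable linguistic features in, prosodic parameters out, with the recurrent state carrying context across syllables. Weights are random and training is omitted; the feature and output layouts are illustrative assumptions, not the paper's design.

```python
# Sketch: an Elman-style RNN prosody generator -- per-syllable linguistic
# features in, prosodic parameters out. Untrained; layouts are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n_feat, n_hidden, n_out = 12, 16, 4   # out: [log F0, duration, energy, pause]

W_in = rng.normal(0, 0.3, (n_hidden, n_feat))
W_rec = rng.normal(0, 0.3, (n_hidden, n_hidden))
W_out = rng.normal(0, 0.3, (n_out, n_hidden))

def generate_prosody(features):
    """The hidden state carries prosodic context from syllable to syllable."""
    h = np.zeros(n_hidden)
    outputs = []
    for x in features:                # one feature vector per syllable
        h = np.tanh(W_in @ x + W_rec @ h)
        outputs.append(W_out @ h)
    return np.array(outputs)

# A toy 5-syllable utterance: tone, POS, position-in-phrase features, etc.
utterance = rng.normal(0, 1, (5, n_feat))
print(generate_prosody(utterance))    # (5, 4) prosodic parameters
```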

Adaptive Null-Forming Algorithm with Auditory Sub-bands

Heng Zhang; Qiang Fu; Yonghong Yan

This paper presents a modified noise reduction algorithm for speech enhancement based on a null-forming scheme. A fixed infinite impulse response (IIR) filter is designed to calibrate the mismatch of the microphone pair. To reduce the performance degradation caused by the narrow-band effect, the signal is decomposed into several sub-bands that follow auditory characteristics. This increases the signal-to-noise ratio (SNR) considerably while preserving the auditory quality. Experiments demonstrate the effectiveness of these processes.

- Speech Enhancement | Pp. 248-257
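
The null-forming core can be sketched in the classic Griffiths-Jim style: a fixed sum beam, a blocking branch that cancels the target, and an NLMS filter that adapts the null toward the interference. The paper's sub-band decomposition and IIR calibration stages are omitted here, and all signals are synthetic.

```python
# Sketch: an adaptive null-forming stage for a two-microphone pair
# (Griffiths-Jim style: sum beam + blocked reference + NLMS). Synthetic data.
import numpy as np

rng = np.random.default_rng(3)
n = 4000
target = np.sin(2 * np.pi * 0.01 * np.arange(n))   # arrives at both mics
noise = rng.normal(0, 1.0, n)                      # lateral interference

x1 = target + noise
x2 = target + np.roll(noise, 3)    # interference arrives with a delay

fixed = 0.5 * (x1 + x2)            # fixed beamformer: passes the target
blocked = x1 - x2                  # blocking branch: target cancels out

# NLMS adaptive filter: subtract whatever in `fixed` correlates with the
# noise-only reference, steering a null toward the interference.
L, mu, eps = 16, 0.1, 1e-6
w = np.zeros(L)
out = np.zeros(n)
for t in range(L, n):
    ref = blocked[t - L:t][::-1]
    y = w @ ref
    out[t] = fixed[t] - y
    w += mu * out[t] * ref / (ref @ ref + eps)

def snr(sig, clean):
    err = sig - clean
    return 10 * np.log10((clean**2).sum() / (err**2).sum())

print(f"input SNR  {snr(x1, target):5.1f} dB")
print(f"output SNR {snr(out[L:], target[L:]):5.1f} dB")
```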

Multi-channel Noise Reduction in Noisy Environments

Junfeng Li; Masato Akagi; Yôiti Suzuki

Multi-channel noise reduction has been widely researched as a way to reduce acoustic noise and to improve the performance of many speech applications in noisy environments. In this paper, we first introduce state-of-the-art multi-channel noise reduction methods, especially beamforming-based methods, and discuss their performance limitations. We then present a multi-channel noise reduction system we are developing, which consists of localized noise suppression by a microphone array and non-localized noise suppression by post-filtering. Experimental results show the benefits of our noise reduction system over traditional algorithms in terms of speech recognition rate. Finally, some suggestions for further research are presented.

- Speech Enhancement | Pp. 258-269
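
The two-stage structure described above can be sketched as a delay-and-sum beamformer followed by a Zelinski-style post-filter, which estimates a Wiener-like gain from the microphones' auto- and cross-spectra (cross-spectra retain the correlated target but average out diffuse noise). Signals are synthetic, the array is assumed time-aligned for the target, and the authors' actual post-filter may differ.

```python
# Sketch: delay-and-sum beamforming for localized noise, then a
# Zelinski-style post-filter for non-localized (diffuse) noise.
import numpy as np

rng = np.random.default_rng(4)
n, n_mics, nfft = 8192, 4, 256

target = rng.normal(0, 1, n)                     # aligned across all mics
mics = [target + rng.normal(0, 1, n) for _ in range(n_mics)]  # + diffuse noise

beam = np.mean(mics, axis=0)                     # delay-and-sum output

# Frame-averaged auto- and cross-power spectra: the correlated target
# survives in the cross-spectra, diffuse noise averages toward zero.
frames = range(0, n - nfft, nfft)
X = [np.array([np.fft.rfft(m[f:f + nfft]) for f in frames]) for m in mics]

cross = np.zeros(nfft // 2 + 1)
auto = np.zeros(nfft // 2 + 1)
pairs = 0
for i in range(n_mics):
    auto += np.mean(np.abs(X[i])**2, axis=0)
    for j in range(i + 1, n_mics):
        cross += np.mean((X[i] * X[j].conj()).real, axis=0)
        pairs += 1

H = np.clip((cross / pairs) / (auto / n_mics), 0.0, 1.0)  # Wiener-like gain

# Apply the post-filter to the beamformer output frame by frame.
out = np.zeros(n)
for f in frames:
    spec = np.fft.rfft(beam[f:f + nfft])
    out[f:f + nfft] = np.fft.irfft(H * spec, nfft)
print("post-filter gain range:", H.min().round(2), H.max().round(2))
```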