Publications catalog - books



Chinese Spoken Language Processing: 5th International Symposium, ISCSLP 2006, Singapore, December 13-16, 2006, Proceedings

Qiang Huo ; Bin Ma ; Eng-Siong Chng ; Haizhou Li (eds.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Mathematical Logic and Formal Languages; Data Mining and Knowledge Discovery; Algorithm Analysis and Problem Complexity; Document Preparation and Text Processing

Availability
Detected institution | Publication year | Browse | Download | Request
Not detected | 2006 | SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-49665-6

Electronic ISBN

978-3-540-49666-3

Publisher

Springer Nature

Country of publication

China

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

Focus, Lexical Stress and Boundary Tone: Interaction of Three Prosodic Features

Lu Zhang; Yi-Qing Zu; Run-Qiang Yan

This paper studies how focus, lexical stress and rising boundary tone act on the F0 of the last preboundary word. We find that when the word is not focused, the rising boundary tone takes control almost from the beginning of the word and flattens the F0 peak of the lexical stress. When the word is focused, the rising boundary tone becomes dominant only after the F0 peak of the lexical stress is formed. This peak is even higher than the F0 height required by the rising boundary tone at the end of the word. Furthermore, the location of the lexical stress constrains the heights that the F0 peak and the high end can reach. The interaction of these three factors on a single word leads to F0 competition due to limited articulatory dimensions. The study helps to build a prosodic model for high-quality speech synthesis.

- Topics in Speech Science | Pp. 67-75

A Robust Voice Activity Detection Based on Noise Eigenspace Projection

Dongwen Ying; Yu Shi; Frank Soong; Jianwu Dang; Xugang Lu

A robust voice activity detector (VAD) is expected to increase the accuracy of ASR in noisy environments. This study focuses on how to extract robust information for designing a robust VAD. To do so, we construct a noise eigenspace by principal component analysis of the noise covariance matrix. Projecting noisy speech onto the eigenspace, we find that available information with higher SNR is generally located in the channels with smaller eigenvalues. Based on this finding, the available components of the speech are obtained by sorting the noise eigenspace. Using the extracted high-SNR components, we propose a robust voice activity detector. The threshold for deciding the available channels is determined using a histogram method. A probability-weighted speech presence is used to increase the reliability of the VAD. The proposed VAD is evaluated using the TIMIT database mixed with a number of noises. Experiments show that our algorithm performs better than traditional VAD algorithms.

- Speech Analysis | Pp. 76-86
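
As a rough illustration of the projection idea in the abstract above, the following Python sketch builds a noise eigenspace for a toy two-channel feature and keeps the channel with the smallest noise eigenvalue. It is not the authors' implementation: the two-channel setup, the function names, the sample frames, and the fixed threshold are all assumptions made for illustration (the abstract mentions a histogram method and a probability-weighted speech presence, which are not reproduced here).

```python
import math

def covariance_2d(frames):
    """Covariance entries (a, b, c) of [[a, b], [b, c]] for 2-D frames."""
    n = len(frames)
    mx = sum(f[0] for f in frames) / n
    my = sum(f[1] for f in frames) / n
    a = sum((f[0] - mx) ** 2 for f in frames) / n
    b = sum((f[0] - mx) * (f[1] - my) for f in frames) / n
    c = sum((f[1] - my) ** 2 for f in frames) / n
    return a, b, c

def eigen_sym_2x2(a, b, c):
    """Eigenpairs of [[a, b], [b, c]], sorted by ascending eigenvalue."""
    if abs(b) < 1e-12:  # matrix is already diagonal
        return sorted([(a, (1.0, 0.0)), (c, (0.0, 1.0))])
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    pairs = []
    for lam in (mid - rad, mid + rad):
        vx, vy = b, lam - a  # solves (A - lam*I) v = 0 when b != 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

def low_noise_projection(frame, eigpairs):
    """Project a frame onto the eigenvector with the smallest noise eigenvalue."""
    _, (vx, vy) = eigpairs[0]
    return frame[0] * vx + frame[1] * vy

def is_speech(frame, eigpairs, threshold=0.5):
    """Hypothetical fixed-threshold decision on the low-noise channel."""
    return abs(low_noise_projection(frame, eigpairs)) > threshold

# Toy noise-only frames (assumed data): noise lies mostly along (1, 1).
noise_frames = [(1.1, 0.9), (-0.9, -1.1), (1.9, 2.1), (-2.1, -1.9)]
eigpairs = eigen_sym_2x2(*covariance_2d(noise_frames))
```

In this toy setup the noise energy concentrates in the large-eigenvalue direction, so a frame that also carries speech stands out in the small-eigenvalue channel even though its raw energy is dominated by noise.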

Pitch Mean Based Frequency Warping

Jian Liu; Thomas Fang Zheng; Wenhu Wu

In this paper, a novel pitch mean based frequency warping (PMFW) method is proposed to reduce the pitch variability in speech signals at the front-end of speech recognition. The warp factors used in this process are calculated based on the average pitch of a speech segment. Two functions describing the relation between the frequency warping factor and the pitch mean are defined and compared. We use a simple method to perform frequency warping in the Mel-filter bank frequencies based on different warping factors. To solve the problem of bandwidth mismatch between the original and the warped spectra, a Mel-filter selection strategy is proposed. Finally, the PMFW mel-frequency cepstral coefficient (MFCC) is extracted based on the regular MFCC with several modifications. Experimental results show that the new PMFW MFCCs are more distinctive than the regular MFCCs.

- Speech Analysis | Pp. 87-94
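
The abstract above does not give the warping functions, so the following Python sketch only illustrates the general shape of the idea: derive a warp factor from a segment's mean pitch, scale the Mel filter center frequencies by it, and drop filters that fall outside the analysis band. The linear `warp_factor` mapping, its `ref_pitch` and `strength` parameters, the direction of warping, and the drop-out-of-band selection rule are all hypothetical stand-ins, not the functions or strategy defined in the paper. The Hz-to-Mel conversion uses the common 2595 log10(1 + f/700) formula.

```python
import math

def hz_to_mel(f):
    """Common Mel-scale formula."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centers(n_filters, f_low, f_high):
    """Center frequencies (Hz) of n_filters equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]

def warp_factor(pitch_mean, ref_pitch=150.0, strength=0.1):
    """Hypothetical linear mapping from pitch mean (Hz) to a warp factor."""
    return 1.0 + strength * (pitch_mean - ref_pitch) / ref_pitch

def warped_centers(centers, alpha, f_high):
    """Scale each center by alpha; keep only filters still inside the band."""
    return [c * alpha for c in centers if c * alpha <= f_high]
```

For example, `warped_centers(mel_centers(24, 0.0, 4000.0), warp_factor(300.0), 4000.0)` returns fewer than 24 centers, because the highest filters are pushed past the 4 kHz band edge.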

A Study of Knowledge-Based Features for Obstruent Detection and Classification in Continuous Mandarin Speech

Kuang-Ting Sung; Hsiao-Chuan Wang

A study of acoustic-phonetic features for obstruent detection and classification based on knowledge of Mandarin speech is presented. The Seneff auditory model is used as the front-end processor for extracting acoustic-phonetic features. These features, which are rich in information content, are used in a hierarchical decision process to detect and classify the Mandarin obstruents. Preliminary experiments showed that the accuracy of obstruent detection is about 84%. An algorithm based on the information of feature distribution is applied to further classify the obstruents into stops, fricatives, and affricates. The average accuracy of obstruent classification is about 80%. The proposed approach based on the feature distribution is simple and effective. It could be a very promising method for improving phone detection in continuous speech recognition.

- Speech Analysis | Pp. 95-105

Speaker-and-Environment Change Detection in Broadcast News Using Maximum Divergence Common Component GMM

Yih-Ru Wang

In this paper, the supervised maximum-divergence common component GMM (MD-CCGMM) model is applied to speaker-and-environment change detection in broadcast news signals. To discriminate speaker-and-environment changes in broadcast news, the MD-CCGMM signal model simultaneously maximizes the likelihood of the CCGMM signal modeling and the divergence measure between different audio signal segments. The performance of the MD-CCGMM model was examined using a four-hour TV broadcast news database. An Equal Error Rate (EER) of 16.0% was achieved using the divergence measure of the CCGMM model. With the supervised MD-CCGMM model, an EER of 14.6% was achieved.

- Speech Analysis | Pp. 106-115
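
For readers unfamiliar with the Equal Error Rate reported in the abstract above, the Python sketch below shows one simple way to estimate it from detection scores: sweep a threshold and pick the point where the miss rate and the false-alarm rate are closest. The function name and the score lists are illustrative assumptions; the abstract does not describe the paper's scoring procedure.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) where miss and false-alarm rates are closest.

    target_scores: scores at true change points (higher means more change-like).
    nontarget_scores: scores at points with no change.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # A point is declared a change when its score is >= t.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best is None or gap < best[0]:
            best = (gap, (miss + false_alarm) / 2.0, t)
    return best[1], best[2]
```

With a finite set of scores the two error rates rarely coincide exactly, so this sketch reports their average at the closest threshold.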

UBM Based Speaker Segmentation and Clustering for 2-Speaker Detection

Jing Deng; Thomas Fang Zheng; Wenhu Wu

In this paper, a speaker segmentation method based on the log-likelihood ratio score (LLRS) over a universal background model (UBM) and a speaker clustering method based on the difference of log-likelihood scores between two speaker models are proposed. During the segmentation process, the LLRS between two adjacent speech segments over the UBM is used as a distance measure, while during the clustering process, the difference of log-likelihood scores between two speaker models is used as a speaker classification criterion. A complete system for the NIST 2002 2-speaker task is presented using these methods. Experimental results on the NIST 2002 Switchboard Cellular speaker segmentation corpus, 1-speaker evaluation corpus and 2-speaker evaluation corpus show the potential of the proposed algorithms.

- Speech Analysis | Pp. 116-125
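
The two criteria named in the abstract above can be sketched in Python under strong simplifying assumptions. Here each model, including the UBM, is a single one-dimensional Gaussian rather than a Gaussian mixture, and the way the scores are combined is one plausible reading of the abstract, not the authors' exact formulation. All function names are hypothetical.

```python
import math

def fit_gaussian(samples):
    """Fit a 1-D Gaussian (mean, variance) as a stand-in for a GMM."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, max(var, 1e-6)  # variance floor avoids division by zero

def avg_loglik(samples, model):
    """Average per-sample log-likelihood under a 1-D Gaussian model."""
    mean, var = model
    total = sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (s - mean) ** 2 / var)
        for s in samples
    )
    return total / len(samples)

def llr_score(segment, speaker_model, ubm):
    """Log-likelihood ratio of a segment: speaker model versus UBM."""
    return avg_loglik(segment, speaker_model) - avg_loglik(segment, ubm)

def assign_speaker(segment, model_a, model_b):
    """Cluster by the sign of the log-likelihood difference between two models."""
    diff = avg_loglik(segment, model_a) - avg_loglik(segment, model_b)
    return "A" if diff >= 0 else "B"
```

In this reading, a model fitted on one segment scores the adjacent segment against the UBM: a low `llr_score` suggests a speaker change between the two segments, while `assign_speaker` labels a segment once two speaker models exist.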

Design of Cubic Spline Wavelet for Open Set Speaker Classification in Marathi

Hemant A. Patil; T. K. Basu

In this paper, a new method of feature extraction based on the design of a cubic spline wavelet is described. Dialectal zone based speaker classification in the Marathi language is attempted in the open set mode using a polynomial classifier. The method consists of dividing the speech signal into nonuniform subbands in approximate Mel-scale using an admissible wavelet packet filterbank and modeling each dialectal zone with the 2nd and 3rd order polynomial expansions of the feature vector. Confusion matrices are also shown for different dialectal zones.

- Speech Analysis | Pp. 126-137
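
A polynomial classifier of the kind mentioned in the abstract above operates on a polynomial expansion of each feature vector. The Python sketch below shows a generic expansion into all monomials up to a given total degree; the function name and the ordering of terms are assumptions, and the wavelet feature extraction and classifier training are not shown.

```python
from itertools import combinations_with_replacement

def polynomial_expansion(x, order):
    """All monomials of the entries of x up to the given total degree."""
    terms = [1.0]  # degree-0 (constant) term
    for degree in range(1, order + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            prod = 1.0
            for i in idx:
                prod *= x[i]
            terms.append(prod)
    return terms
```

For a two-dimensional vector `[x1, x2]`, the second-order expansion has the six terms 1, x1, x2, x1², x1·x2, x2², and the third-order expansion adds four cubic terms.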

Rhythmic Organization of Mandarin Utterances — A Two-Stage Process

Min Chu; Yunjia Wang

This paper investigates the rhythmic organization of Mandarin utterances through both corpus analyses and experimental studies. We propose to add a new prosodic unit, the principle prosodic unit (PPU), into the prosodic hierarchy of Mandarin utterances. The key characteristic of the PPU is that inner-unit words normally have to be spoken closely together, while inter-unit grouping is rather flexible. Because of this, we further suggest that the rhythmic organization of Mandarin utterances is a two-stage process. In the first stage, syllables are grouped into prosodic words, and then into PPUs. The forming of PPUs is restricted by the local syntactic constraint and the length constraint. In the second stage, though the rhythmic constraint still has an influence, the grouping of PPUs into phrases is rather flexible. Normally, multiple equally good solutions exist for a sentence in this stage.

- Speech Synthesis and Generation | Pp. 138-148

Prosodic Boundary Prediction Based on Maximum Entropy Model with Error-Driven Modification

Xiaonan Zhang; Jun Xu; Lianhong Cai

Prosodic boundary prediction is the key to improving the intelligibility and naturalness of synthetic speech for a TTS system. This paper investigates the problem of automatic segmentation of prosodic words and prosodic phrases, which are two fundamental layers in the hierarchical prosodic structure of Mandarin Chinese. A Maximum Entropy (ME) model was used at the front end for both prosodic word and prosodic phrase prediction, but with different feature selection schemes. A multi-pass prediction approach was adopted. In addition, an error-driven rule-based modification module was introduced into the back end to amend the initial prediction. Experiments showed that this combined approach outperformed many other methods such as C4.5 and TBL.

- Speech Synthesis and Generation | Pp. 149-160

Prosodic Words Prediction from Lexicon Words with CRF and TBL Joint Method

Heng Kang; Wenju Liu

Predicting prosodic word boundaries directly influences the naturalness of synthetic speech, because the prosodic word is at the lowest level of the prosody hierarchy. In this paper, a Chinese prosodic phrasing method based on CRF and TBL models is proposed. First, a CRF model is trained to predict the prosodic word boundaries from lexicon words. After that, we apply a TBL-based error-driven learning approach to refine the results. The experiments show that this joint method performs much better than HMM.

- Speech Synthesis and Generation | Pp. 161-168