Publications catalog - books

Chinese Spoken Language Processing: 5th International Symposium, ISCSLP 2006, Singapore, December 13-16, 2006, Proceedings

Qiang Huo; Bin Ma; Eng-Siong Chng; Haizhou Li (eds.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Mathematical Logic and Formal Languages; Data Mining and Knowledge Discovery; Algorithm Analysis and Problem Complexity; Document Preparation and Text Processing

Availability

Detected institution: Not detected
Year of publication: 2006
Browse: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-49665-6

Electronic ISBN

978-3-540-49666-3

Publisher

Springer Nature

Country of publication

Germany

Publication date

2006

Publication rights information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

Minimum Phone Error (MPE) Model and Feature Training on Mandarin Broadcast News Task

Jia-Yu Chen; Chia-Yu Wan; Yi Chen; Berlin Chen; Lin-shan Lee

The Minimum Phone Error (MPE) criterion for discriminative training has been shown to yield acoustic models with significantly improved performance. The concept was later extended to Feature-space Minimum Phone Error (fMPE) and offset fMPE for training feature parameters as well. This paper reviews the concept of MPE and reports experiments with MPE, fMPE and offset fMPE on a Mandarin Broadcast News task; the significant improvements obtained are in line with results reported by other sites for other languages and tasks. In addition, a new concept of dimension-weighted offset fMPE is proposed, which achieves even better performance than offset fMPE. (A toy sketch of the MPE objective follows this entry.)

- Acoustic Modeling for Automatic Speech Recognition | Pp. 270-281
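
The catalog entry carries no code, but the core of the MPE criterion, the expected phone accuracy under the model's hypothesis posteriors, is easy to illustrate. A minimal sketch, assuming an N-best list in place of the lattices used in real MPE training; all names, scores and accuracies below are invented for illustration:

```python
import numpy as np

# Toy MPE objective: expected raw phone accuracy over an N-best list
# (a stand-in for the word lattice used in real MPE training).
def mpe_objective(log_scores, phone_accuracies, kappa=0.1):
    """log_scores: combined acoustic+LM log scores, one per hypothesis.
    phone_accuracies: raw phone accuracy of each hypothesis vs. the reference.
    kappa: probability scale commonly used in discriminative training."""
    scaled = kappa * np.asarray(log_scores, dtype=float)
    # Hypothesis posteriors via a numerically stable softmax
    posteriors = np.exp(scaled - np.logaddexp.reduce(scaled))
    return float(posteriors @ np.asarray(phone_accuracies, dtype=float))

# Illustrative numbers only
print(mpe_objective(log_scores=[-100.0, -102.0, -105.0],
                    phone_accuracies=[9.0, 7.5, 6.0]))
```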

State-Dependent Phoneme-Based Model Merging for Dialectal Chinese Speech Recognition

Linquan Liu; Thomas Fang Zheng; Wenhu Wu

Aiming at building a dialectal Chinese speech recognizer from a standard Chinese speech recognizer and a small amount of dialectal Chinese speech, a novel, simple but effective acoustic modeling method, named state-dependent phoneme-based model merging (SDPBMM), is proposed and evaluated, in which a tied state of standard triphones is merged with the state of the dialectal monophone identical to the central phoneme of those triphones. The proposed method performs well; however, it introduces a Gaussian mixture expansion problem. To deal with it, an acoustic model distance measure based on the difference between Gaussian mixture models is proposed and used to downsize the model with almost no performance degradation for dialectal speech. With only 40 minutes of Shanghai-dialect Chinese speech, the proposed SDPBMM achieves a significant absolute syllable error rate (SER) reduction of 5.9% for dialectal Chinese and almost no performance degradation for standard Chinese. In combination with an existing adaptation method, a further absolute SER reduction of 1.9% can be achieved. (A toy sketch of the state-merging step follows this entry.)

- Acoustic Modeling for Automatic Speech Recognition | Pp. 282-293
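
A minimal sketch of the state-merging step, assuming (as the abstract suggests) that merging pools the Gaussian components of a standard tied state and the matching dialectal monophone state under an interpolation weight; function names, alpha and the toy numbers are hypothetical. The component growth visible in the output is exactly the mixture expansion problem the paper then prunes away:

```python
import numpy as np

# Toy SDPBMM-style merge: pool the Gaussian components of a standard
# tied triphone state with those of the matching dialectal monophone state.
def merge_states(w_std, mu_std, w_dia, mu_dia, alpha=0.5):
    """w_*: mixture weights (each summing to 1); mu_*: (n_components, dim) means.
    alpha: interpolation weight given to the standard-speech state."""
    w = np.concatenate([alpha * np.asarray(w_std),
                        (1.0 - alpha) * np.asarray(w_dia)])
    mu = np.vstack([mu_std, mu_dia])
    return w, mu  # merged state has len(w_std) + len(w_dia) components

w, mu = merge_states(w_std=[0.6, 0.4], mu_std=np.zeros((2, 3)),
                     w_dia=[1.0], mu_dia=np.ones((1, 3)))
print(w, mu.shape)  # weights still sum to 1; component count has grown
```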

Non-uniform Kernel Allocation Based Parsimonious HMM

Peng Liu; Jian-Lai Zhou; Frank Soong

In a conventional Gaussian-mixture-based Hidden Markov Model (HMM), all states are usually modeled with a uniform, fixed number of Gaussian kernels. In this paper, we propose to allocate kernels non-uniformly to construct a more parsimonious HMM. Different numbers of Gaussian kernels are allocated to states so as to optimize the Minimum Description Length (MDL) criterion, a combination of data likelihood and a model complexity penalty. Using the likelihoods obtained in Baum-Welch training, we develop an efficient backward kernel pruning algorithm, which is shown to be optimal under two mild assumptions. Two databases, Resource Management and Microsoft Mandarin Speech Toolbox, are used to test the proposed parsimonious modeling algorithm. The new parsimonious models reduce the baseline word recognition error rate by 11.1% and 5.7%, relative; alternatively, at the same performance level, a 35-50% model compression can be obtained. (A toy MDL selection sketch follows this entry.)

- Acoustic Modeling for Automatic Speech Recognition | Pp. 294-302
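
A minimal sketch of MDL-based kernel allocation for a single state: choose the kernel count that minimizes negative log-likelihood plus a complexity penalty. The penalty form (0.5 * params * log T) is a common MDL choice and the likelihood curve below is synthetic; the paper's backward pruning instead reuses likelihoods from Baum-Welch training rather than retraining per count:

```python
import numpy as np

# Toy MDL model selection for one HMM state: choose the kernel count m
# minimizing  -loglik(m) + 0.5 * rho * m * params_per_kernel * log(T).
def mdl_best_kernel_count(loglik_by_count, params_per_kernel, n_frames, rho=1.0):
    best_m, best_mdl = None, np.inf
    for m, loglik in sorted(loglik_by_count.items()):
        mdl = -loglik + 0.5 * rho * m * params_per_kernel * np.log(n_frames)
        if mdl < best_mdl:
            best_m, best_mdl = m, mdl
    return best_m

# Synthetic diminishing-returns likelihoods for kernel counts 1..8;
# 78 = means + variances of a 39-dim diagonal-covariance Gaussian.
lls = {m: -1000.0 + 40.0 * np.log(m + 1) for m in range(1, 9)}
print(mdl_best_kernel_count(lls, params_per_kernel=78, n_frames=5000))
```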

Consistent Modeling of the Static and Time-Derivative Cepstrums for Speech Recognition Using HSPTM

Yiu-Pong Lai; Man-Hung Siu

Most speech models represent the static and derivative cepstral features with separate models that can be inconsistent with each other. In our previous work, we proposed the hidden spectral peak trajectory model (HSPTM), in which the static cepstral trajectories are derived from a set of hidden trajectories of the spectral peaks (captured as spectral poles) in the time-frequency domain. In this work, the HSPTM is generalized so that both the static and derivative features are derived from a single set of hidden pole trajectories, using the well-known relationship between spectral poles and cepstral coefficients. As the pole trajectories represent resonance frequencies across time, they can be interpreted as formant tracks in voiced speech, which have been shown to contain important cues for phonemic identification. To preserve the common recognition framework, the likelihood functions are still defined in the cepstral domain, with the acoustic models defined by the static and derivative cepstral trajectories. However, these trajectories are no longer separately estimated but jointly derived, and are thus guaranteed to be consistent with each other. Vowel classification experiments on the TIMIT corpus, using low-complexity (2-mixture) models, showed a 3% absolute classification error reduction compared to a standard HMM of the same complexity. (A toy pole-to-cepstrum sketch follows this entry.)

- Acoustic Modeling for Automatic Speech Recognition | Pp. 303-314
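
A minimal sketch of the well-known pole-to-cepstrum relationship the abstract invokes: for a minimum-phase all-pole spectrum with conjugate pole pairs r_k * exp(j*theta_k), the cepstral coefficients are c_n = (2/n) * sum_k r_k^n * cos(n*theta_k), so static cepstra and their time derivatives can both be derived from one set of pole (formant) trajectories. The toy trajectory below is invented for illustration:

```python
import numpy as np

def poles_to_cepstrum(radii, angles, n_ceps=13):
    """Cepstrum of an all-pole model: c_n = (2/n) * sum_k r_k^n cos(n*theta_k)."""
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps):
        c[n] = (2.0 / n) * np.sum(radii ** n * np.cos(n * angles))
    return c

# A toy rising "formant" trajectory (pole angles in radians)
frames = [poles_to_cepstrum(np.array([0.95, 0.90]),
                            np.array([0.3 + 0.01 * t, 1.2])) for t in range(5)]
static = np.stack(frames)
delta = np.gradient(static, axis=0)  # derivative cepstra from the SAME poles
print(static.shape, delta.shape)     # consistent by construction
```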

Vector Autoregressive Model for Missing Feature Reconstruction

Xiong Xiao; Haizhou Li; Eng Siong Chng

This paper proposes the Vector Autoregressive (VAR) model as a new technique for missing feature reconstruction in ASR. We model the spectral features using multiple VAR models; a VAR model predicts missing features as a linear function of a block of feature frames. We also propose two schemes for VAR training and testing. Experiments on the AURORA-2 database have validated the modeling methodology and shown that the proposed schemes are especially effective for low-SNR speech signals. The best setting achieved a recognition accuracy of 88.2% at -5 dB SNR on the subway noise task when an oracle data mask is used. (A toy VAR prediction sketch follows this entry.)

- Robust Speech Recognition | Pp. 315-324
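
A minimal sketch of the VAR idea, assuming a single VAR(p) model fit by least squares on clean frames and used to predict a masked frame from the preceding block; the paper's full method uses multiple VAR models and dedicated training/testing schemes, none of which are reproduced here:

```python
import numpy as np

def fit_var(X, p=2):
    """Least-squares VAR(p): predict frame t from frames t-p..t-1.
    X: (T, D) feature sequence; returns coefficients A of shape (p*D, D)."""
    T, D = X.shape
    Z = np.stack([X[t - p:t].ravel() for t in range(p, T)])  # (T-p, p*D)
    Y = X[p:]                                                # (T-p, D)
    A, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return A

rng = np.random.default_rng(0)
clean = rng.standard_normal((200, 5)).cumsum(axis=0)  # smooth toy features
A = fit_var(clean, p=2)
reconstructed = clean[8 - 2:8].ravel() @ A            # predict "missing" frame 8
print(np.abs(reconstructed - clean[8]).mean())        # reconstruction error
```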

Auditory Contrast Spectrum for Robust Speech Recognition

Xugang Lu; Jianwu Dang

Traditional speech representations are based on the power spectrum, obtained by integrating energy over many frequency bands. Such representations are sensitive to noise, since noise energy distributed over a wide frequency band can deteriorate them. Inspired by the contrast-sensitive mechanism of auditory neural processing, we propose an auditory contrast spectrum extraction algorithm, a relative representation of the auditory temporal and frequency spectrum. In this algorithm, speech is first subjected to temporal contrast processing, which enhances the temporal modulation envelopes in each auditory filter band and suppresses steady, low-contrast envelopes. The contrast-enhanced speech is then integrated to form a spectrum we call the temporal contrast spectrum, which is analyzed in spectral scale spaces. Since speech and noise spectral profiles differ, we apply a lateral inhibition function to choose a spectral profile subspace in which the noise component is reduced while the speech component is not deteriorated; the temporal contrast spectrum is projected onto this optimal scale space, where a cepstral feature is extracted. In robust speech recognition experiments on the AURORA-2J corpus, this feature yields relative performance improvements of 61.12% for clean training and 27.45% for multi-condition training. (A toy contrast-processing sketch follows this entry.)

- Robust Speech Recognition | Pp. 325-334
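
A minimal sketch of the two ingredients described above: temporal contrast enhancement (log envelope minus its moving average, which suppresses steady envelopes) and a lateral inhibition kernel applied along the band axis. The window length and kernel shape are assumptions, not the paper's actual parameters:

```python
import numpy as np

def temporal_contrast(envelopes, win=9):
    """envelopes: (bands, frames) filterbank power envelopes.
    Returns log-envelopes minus their moving average, so steady
    (low-contrast) parts are suppressed and modulations are enhanced."""
    log_e = np.log(envelopes + 1e-10)
    k = np.ones(win) / win
    smooth = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, log_e)
    return log_e - smooth

def lateral_inhibition(spec, kernel=(-0.5, 1.0, -0.5)):
    """Difference-of-neighbors kernel along the band (frequency) axis."""
    k = np.asarray(kernel)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, spec)

env = np.abs(np.random.default_rng(1).standard_normal((24, 100))) + 1.0
print(lateral_inhibition(temporal_contrast(env)).shape)  # (24, 100)
```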

Signal Trajectory Based Noise Compensation for Robust Speech Recognition

Zhi-Jie Yan; Jian-Lai Zhou; Frank Soong; Ren-Hua Wang

This paper presents a novel signal trajectory based noise compensation algorithm for robust speech recognition, evaluated on the Aurora 2 database. The algorithm consists of two processing stages: 1) the noise spectrum is estimated using trajectory auto-segmentation and clustering, so that spectral subtraction can be performed to roughly estimate the clean speech trajectories; 2) these trajectories are regenerated using trajectory HMMs, where the constraint between static and dynamic spectral information is imposed to refine the noise-subtracted trajectories in both “level” and “shape”. Experimental results show that recognition performance improves after spectral subtraction with or without trajectory regeneration, but the HMM-regenerated trajectories yield the best improvement. After spectral subtraction alone, the average relative error rate reductions for clean and multi-condition training are 23.21% and 5.58%, respectively; the proposed trajectory regeneration algorithm further improves them to 42.59% and 15.80%. (A toy spectral subtraction sketch follows this entry.)

- Robust Speech Recognition | Pp. 335-345
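
A minimal sketch of the first stage, with a crude low-energy-frame noise estimate standing in for the paper's trajectory auto-segmentation and clustering, followed by floored spectral subtraction; the alpha, beta and percentile values are assumptions:

```python
import numpy as np

def estimate_noise(power_spec, frac=0.1):
    """Average the lowest-energy frames as a crude noise estimate.
    power_spec: (frames, bins). The paper instead derives this from
    trajectory auto-segmentation and clustering."""
    energy = power_spec.sum(axis=1)
    n = max(1, int(frac * len(energy)))
    idx = np.argsort(energy)[:n]
    return power_spec[idx].mean(axis=0)

def spectral_subtraction(power_spec, noise, alpha=1.0, beta=0.05):
    """Floored subtraction: max(P - alpha*N, beta*P) avoids negative power."""
    return np.maximum(power_spec - alpha * noise, beta * power_spec)

P = np.abs(np.random.default_rng(2).standard_normal((50, 129))) ** 2
print(spectral_subtraction(P, estimate_noise(P)).shape)  # (50, 129)
```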

An HMM Compensation Approach Using Unscented Transformation for Noisy Speech Recognition

Yu Hu; Qiang Huo

The performance of current HMM-based automatic speech recognition (ASR) systems degrades significantly in real-world applications, where mismatches between training and testing conditions arise from factors such as mismatched signal capture and transmission channels and additive environmental noise. Among the many approaches previously proposed to cope with this robust ASR problem, two notable HMM compensation approaches are Parallel Model Combination (PMC) and Vector Taylor Series (VTS). In this paper, we introduce a new HMM compensation approach using a technique called the Unscented Transformation (UT). As a first step, we study three implementations of the UT approach with different computational complexities for noisy speech recognition and evaluate their performance on the Aurora2 connected digits database. The UT approaches achieve significant improvements in recognition accuracy over log-normal-approximation-based PMC and first-order-approximation-based VTS. (A toy unscented transform sketch follows this entry.)

- Robust Speech Recognition | Pp. 346-357
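
A minimal sketch of the unscented transformation itself: propagate 2d+1 sigma points of a Gaussian through a nonlinearity and re-estimate the mean and covariance. The nonlinearity used here is the common additive-noise log-spectral mismatch y = log(exp(x) + exp(n)); the weights follow the standard UT formulation and need not match the paper's exact implementations:

```python
import numpy as np

def unscented_transform(mu, Sigma, f, kappa=1.0):
    """Propagate N(mu, Sigma) through f using 2d+1 sigma points."""
    d = len(mu)
    L = np.linalg.cholesky((d + kappa) * Sigma)
    pts = [mu] + [mu + L[:, i] for i in range(d)] + [mu - L[:, i] for i in range(d)]
    w = np.full(2 * d + 1, 1.0 / (2.0 * (d + kappa)))
    w[0] = kappa / (d + kappa)
    Y = np.array([f(p) for p in pts])
    mean = w @ Y
    diff = Y - mean
    return mean, diff.T @ (diff * w[:, None])  # transformed mean, covariance

# Log-spectral mismatch y = log(exp(x) + exp(n)) for stacked z = [x, n]
def mismatch(z):
    half = len(z) // 2
    return np.logaddexp(z[:half], z[half:])

mu = np.array([1.0, 2.0, -1.0, -0.5])  # 2 speech + 2 noise log-spectral bins
Sigma = np.eye(4) * 0.1
print(unscented_transform(mu, Sigma, mismatch))
```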

Noisy Speech Recognition Performance of Discriminative HMMs

Jun Du; Peng Liu; Frank Soong; Jian-Lai Zhou; Ren-Hua Wang

Discriminatively trained HMMs are investigated in both clean and noisy environments in this study. First, recognition error is defined at different levels: string, word, phone and acoustics. A high-resolution error measure based on minimum divergence (MD) is specifically proposed and investigated alongside other error measures. Using two speaker-independent continuous digit databases, Aurora2 (English) and CNDigits (Mandarin Chinese), the performance of recognizers trained with different error measures and training modes is evaluated under various noise and SNR conditions. Experimental results show that the discriminatively trained models outperform the maximum likelihood baseline systems. Specifically, MD-trained systems obtained relative error reductions of 17.62% and 18.52% with multi-condition training on Aurora2 and CNDigits, respectively. (A toy divergence sketch follows this entry.)

- Robust Speech Recognition | Pp. 358-369
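
A minimal sketch of the kind of acoustics-level divergence an MD-style error measure builds on: the closed-form KL divergence between diagonal-covariance Gaussians, which distinguishes hypotheses even when their symbolic labels agree. How such a divergence enters the training objective is not reproduced here:

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ), closed form."""
    mu0, var0 = np.asarray(mu0, float), np.asarray(var0, float)
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1
                        - 1.0)

# Two hypothetical state distributions carrying the "same" phone label
print(kl_diag_gauss([0.0, 0.0], [1.0, 1.0], [0.5, -0.2], [1.5, 0.8]))
```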

Distributed Speech Recognition of Mandarin Digits String

Yih-Ru Wang; Bo-Xuan Lu; Yuan-Fu Liao; Sin-Horng Chen

In this paper, the performance of the pitch detection algorithm in the ETSI ES 202 212 XAFE standard is evaluated on a Mandarin digit string recognition task. Experimental results show that the pitch detection algorithm degrades seriously when the SNR of the speech signal falls below 10 dB, making a recognizer that uses pitch information perform worse in low-SNR environments than the original recognizer without pitch information. A modification of the pitch detection algorithm is therefore proposed to improve pitch detection in low-SNR conditions. Recognition performance can be improved at most SNR levels by integrating the recognizers with and without pitch information. Overall recognition rates of 82.1% and 86.8% were achieved for the clean and multi-condition training cases, respectively. (A toy autocorrelation pitch sketch follows this entry.)

- Robust Speech Recognition | Pp. 370-379
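
A minimal sketch of a classical autocorrelation pitch detector with a voicing-confidence score, only a toy stand-in for the far more elaborate XAFE algorithm; the sampling rate, search range and noise level are assumptions chosen to show how the estimate degrades as noise rises:

```python
import numpy as np

def autocorr_pitch(frame, fs=8000, f_min=60.0, f_max=400.0):
    """Return (f0_estimate_hz, voicing_confidence) for one frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                # normalize so ac[0] == 1
    lo, hi = int(fs / f_max), int(fs / f_min)
    lag = lo + int(np.argmax(ac[lo:hi]))     # strongest peak in the F0 range
    return fs / lag, float(ac[lag])          # low confidence => likely unvoiced

fs, f0 = 8000, 120.0
t = np.arange(400) / fs
noisy = np.sin(2 * np.pi * f0 * t) + 0.8 * np.random.default_rng(3).standard_normal(400)
print(autocorr_pitch(noisy, fs))  # estimate and confidence degrade with SNR
```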