Publications catalog - books
Chinese Spoken Language Processing: 5th International Symposium, ISCSLP 2006, Singapore, December 13-16, 2006, Proceedings
Qiang Huo; Bin Ma; Eng-Siong Chng; Haizhou Li (eds.)
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Mathematical Logic and Formal Languages; Data Mining and Knowledge Discovery; Algorithm Analysis and Problem Complexity; Document Preparation and Text Processing
Availability
Detected institution | Publication year | Browse | Download | Request |
---|---|---|---|---|
Not detected | 2006 | SpringerLink | | |
Information
Resource type:
books
Print ISBN
978-3-540-49665-6
Electronic ISBN
978-3-540-49666-3
Publisher
Springer Nature
Country of publication
China
Publication date
2006
Publication rights information
© Springer-Verlag Berlin Heidelberg 2006
Table of contents
doi: 10.1007/11939993_31
Minimum Phone Error (MPE) Model and Feature Training on Mandarin Broadcast News Task
Jia-Yu Chen; Chia-Yu Wan; Yi Chen; Berlin Chen; Lin-shan Lee
The Minimum Phone Error (MPE) criterion for discriminative training has been shown to yield acoustic models with significantly improved performance. The concept was later extended to Feature-space Minimum Phone Error (fMPE) and offset fMPE for training feature parameters as well. This paper reviews the concept of MPE and reports experiments and results for MPE, fMPE, and offset fMPE on the Mandarin Broadcast News task; significant improvements were obtained, similar to the results reported for other languages and tasks by other sites. In addition, a new concept of dimension-weighted offset fMPE is proposed, which achieved even better performance than offset fMPE.
- Acoustic Modeling for Automatic Speech Recognition | Pp. 270-281
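To illustrate the criterion discussed in the abstract above, here is a minimal, non-authoritative Python sketch of the MPE objective evaluated on a toy N-best list rather than a full lattice; the hypotheses, the crude accuracy function, and the likelihood scale are all illustrative assumptions, and the Extended Baum-Welch update that MPE training actually uses is not reproduced.

```python
import math

# Toy MPE objective: expected phone accuracy over an N-best list.
# Real MPE takes this expectation over a lattice and differentiates it
# to drive Extended Baum-Welch updates; here we only evaluate F_MPE.

def raw_phone_accuracy(hyp_phones, ref_phones):
    """Crude A(s): +1 per matching phone, penalized for length mismatch."""
    matches = sum(h == r for h, r in zip(hyp_phones, ref_phones))
    return matches - abs(len(hyp_phones) - len(ref_phones))

def mpe_objective(nbest, ref_phones, scale=1.0):
    """F_MPE = sum_s P(s|O) * A(s, ref), P from scaled log-likelihoods."""
    logs = [scale * h["loglik"] for h in nbest]
    m = max(logs)                                  # for numerical stability
    posts = [math.exp(l - m) for l in logs]
    z = sum(posts)
    return sum(p / z * raw_phone_accuracy(h["phones"], ref_phones)
               for p, h in zip(posts, nbest))

nbest = [
    {"phones": ["b", "a", "n"], "loglik": -10.0},   # hypothetical hypotheses
    {"phones": ["b", "a", "ng"], "loglik": -11.5},
]
print(mpe_objective(nbest, ref_phones=["b", "a", "n"]))
```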
doi: 10.1007/11939993_32
State-Dependent Phoneme-Based Model Merging for Dialectal Chinese Speech Recognition
Linquan Liu; Thomas Fang Zheng; Wenhu Wu
Aiming at building a dialectal Chinese speech recognizer from a standard Chinese speech recognizer with a small amount of dialectal Chinese speech, a novel, simple, but effective acoustic modeling method, named state-dependent phoneme-based model merging (SDPBMM), is proposed and evaluated, in which a tied state of standard triphones is merged with the state of the dialectal monophone identical to the central phoneme of the triphones. The proposed method performs well, but it introduces a Gaussian mixture expansion problem. To deal with it, an acoustic model distance measure based on the difference between Gaussian mixture models is proposed and implemented to downsize the model with almost no performance degradation for dialectal speech. With only 40 minutes of Shanghai-dialectal Chinese speech, the proposed SDPBMM achieves a significant absolute syllable error rate (SER) reduction of 5.9% for dialectal Chinese with almost no performance degradation for standard Chinese. In combination with an existing adaptation method, a further absolute SER reduction of 1.9% can be achieved.
- Acoustic Modeling for Automatic Speech Recognition | Pp. 282-293
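As a rough illustration of the state-merging idea described above, the following sketch pools the Gaussians of a standard tied state and a dialectal monophone state into one mixture; the interpolation weight `alpha` and the toy parameters are assumptions, not the paper's settings. Note how the merged state has more components than either input, which is exactly the mixture-expansion problem the abstract mentions.

```python
# Sketch: merge two HMM states modeled by Gaussian mixtures by pooling
# their components. Each state is a list of (weight, mean, var) tuples;
# `alpha` (hypothetical) balances the standard-Chinese tied state
# against the dialectal monophone state.

def merge_states(std_state, dia_state, alpha=0.5):
    merged = [(alpha * w, m, v) for (w, m, v) in std_state]
    merged += [((1 - alpha) * w, m, v) for (w, m, v) in dia_state]
    return merged  # mixture weights still sum to 1

std_state = [(0.6, [0.0, 1.0], [1.0, 1.0]), (0.4, [2.0, -1.0], [0.5, 2.0])]
dia_state = [(1.0, [0.5, 0.8], [1.2, 0.9])]
print(len(merge_states(std_state, dia_state)))  # 3 Gaussians after merging
```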
doi: 10.1007/11939993_33
Non-uniform Kernel Allocation Based Parsimonious HMM
Peng Liu; Jian-Lai Zhou; Frank Soong
In a conventional Gaussian-mixture-based Hidden Markov Model (HMM), all states are usually modeled with a uniform, fixed number of Gaussian kernels. In this paper, we propose to allocate kernels non-uniformly to construct a more parsimonious HMM. Different numbers of Gaussian kernels are allocated to states in a non-uniform and parsimonious way so as to optimize the Minimum Description Length (MDL) criterion, which combines data likelihood with a model complexity penalty. Using the likelihoods obtained in Baum-Welch training, we develop an efficient backward kernel pruning algorithm, which is shown to be optimal under two mild assumptions. Two databases, Resource Management and Microsoft Mandarin Speech Toolbox, are used to test the proposed parsimonious modeling algorithm. The new parsimonious models reduce the baseline word recognition error rate by 11.1% and 5.7%, relative; alternatively, at the same performance level, a 35-50% model compression can be obtained.
- Acoustic Modeling for Automatic Speech Recognition | Pp. 294-302
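The MDL trade-off the abstract describes can be sketched concretely: pick, per state, the kernel count whose training likelihood minus a complexity penalty is best. This minimal sketch uses a generic MDL form with hypothetical per-count log-likelihoods; the paper's actual penalty weighting and backward pruning over Baum-Welch statistics are not reproduced.

```python
import math

# Sketch: choose a per-state number of Gaussian kernels by minimizing
# an MDL-style score: -loglik + 0.5 * rho * n_params * log(N).

def mdl_score(loglik, n_kernels, dim, n_frames, rho=1.0):
    n_params = n_kernels * (2 * dim + 1)     # mean, diagonal var, weight
    return -loglik + 0.5 * rho * n_params * math.log(n_frames)

def best_kernel_count(logliks_by_count, dim, n_frames):
    """logliks_by_count[k] = training log-likelihood with k+1 kernels."""
    scores = [mdl_score(ll, k + 1, dim, n_frames)
              for k, ll in enumerate(logliks_by_count)]
    return scores.index(min(scores)) + 1

# Hypothetical likelihoods showing diminishing returns per extra kernel:
logliks = [-5200.0, -5050.0, -4990.0, -4975.0]
print(best_kernel_count(logliks, dim=39, n_frames=1000))
```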
doi: 10.1007/11939993_34
Consistent Modeling of the Static and Time-Derivative Cepstrums for Speech Recognition Using HSPTM
Yiu-Pong Lai; Man-Hung Siu
Most speech models represent the static and derivative cepstral features with separate models that can be inconsistent with each other. In our previous work, we proposed the hidden spectral peak trajectory model (HSPTM), in which the static cepstral trajectories are derived from a set of hidden trajectories of the spectral peaks (captured as spectral poles) in the time-frequency domain. In this work, the HSPTM is generalized such that both the static and derivative features are derived from a single set of hidden pole trajectories, using the well-known relationship between spectral poles and cepstral coefficients. As the pole trajectories represent the resonance frequencies across time, they can be interpreted as formant tracks in voiced speech, which have been shown to contain important cues for phonemic identification. To preserve the common recognition framework, the likelihood functions are still defined in the cepstral domain, with the acoustic models defined by the static and derivative cepstral trajectories. However, these trajectories are no longer separately estimated but jointly derived, and thus are ensured to be consistent with each other. Vowel classification experiments were performed on the TIMIT corpus using low-complexity (2-mixture) models; they showed a 3% (absolute) classification error reduction compared to a standard HMM of the same complexity.
- Acoustic Modeling for Automatic Speech Recognition | Pp. 303-314
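The "well-known relationship" the abstract invokes is the pole-to-cepstrum mapping for an all-pole spectrum: for conjugate pole pairs r_k e^{±jθ_k}, the cepstral coefficients are c_n = (2/n) Σ_k r_k^n cos(nθ_k). A minimal sketch follows; the pole values are hypothetical formant-like resonances, not parameters from the paper.

```python
import math

# Sketch: derive cepstral coefficients from spectral poles, the mapping
# HSPTM uses so that static and derivative cepstra stay consistent.

def poles_to_cepstrum(poles, n_coeffs):
    """poles: list of (radius, angle_rad), one pole per conjugate pair."""
    return [(2.0 / n) * sum(r**n * math.cos(n * th) for r, th in poles)
            for n in range(1, n_coeffs + 1)]

# Two hypothetical resonances near 500 Hz and 1500 Hz at 16 kHz sampling:
fs = 16000.0
poles = [(0.97, 2 * math.pi * 500 / fs), (0.95, 2 * math.pi * 1500 / fs)]
print(poles_to_cepstrum(poles, n_coeffs=5))
```

Since the cepstra are a smooth function of the pole trajectories, derivative cepstral features follow by differentiating the same trajectories over time, which is what keeps the static and dynamic streams consistent.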
doi: 10.1007/11939993_35
Vector Autoregressive Model for Missing Feature Reconstruction
Xiong Xiao; Haizhou Li; Eng Siong Chng
This paper proposes the Vector Autoregressive (VAR) model as a new technique for missing feature reconstruction in ASR. We model the spectral features using multiple VAR models; a VAR model predicts missing features as a linear function of a block of feature frames. We also propose two schemes for VAR training and testing. Experiments on the AURORA-2 database validate the modeling methodology and show that the proposed schemes are especially effective for low-SNR speech signals. The best setting achieved a recognition accuracy of 88.2% at -5 dB SNR on the subway noise task when the oracle data mask was used.
- Robust Speech Recognition | Pp. 315-324
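A minimal sketch of the underlying prediction idea: fit a VAR model by least squares on reliable frames, then predict a missing frame from the preceding block. The order p, the synthetic data, and the single-model setup are assumptions; the paper uses multiple VAR models and specific training/testing schemes not reproduced here.

```python
import numpy as np

# Sketch: least-squares VAR(p) fit and one-step prediction of a
# hypothetical missing feature frame.

def fit_var(X, p):
    """X: (T, d) feature matrix. Returns stacked coefficients A (p*d, d)."""
    T, d = X.shape
    Y = X[p:]                                                 # targets
    Z = np.hstack([X[p - i - 1:T - i - 1] for i in range(p)])  # lags 1..p
    A, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return A

def predict_next(A, context):
    """context: (p, d) most recent frames, newest last."""
    p = context.shape[0]
    z = np.hstack([context[p - i - 1] for i in range(p)])      # newest first
    return z @ A

rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(size=(100, 3)), axis=0)  # smooth synthetic features
A = fit_var(X, p=2)
print(predict_next(A, X[48:50]))  # reconstruct hypothetical missing frame 50
```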
doi: 10.1007/11939993_36
Auditory Contrast Spectrum for Robust Speech Recognition
Xugang Lu; Jianwu Dang
Traditional speech representations are based on the power spectrum, which is obtained by integrating energy over many frequency bands. Such representations are sensitive to noise, since noise energy distributed over a wide frequency band can deteriorate the speech representation. Inspired by the contrast-sensitive mechanism in auditory neural processing, we propose an auditory contrast spectrum extraction algorithm, a relative representation of the auditory temporal and frequency spectrum. In this algorithm, speech is first processed with a temporal contrast stage, which enhances the temporal modulation envelopes in each auditory filter band and suppresses steady, low-contrast envelopes. The temporal-contrast-enhanced speech is then integrated to form a speech spectrum, named the temporal contrast spectrum, which is analyzed in spectral scale spaces. Since speech and noise spectral profiles differ, we apply a lateral inhibition function to choose a spectral profile subspace in which the noise component is reduced while the speech component is not deteriorated. We project the temporal contrast spectrum onto this optimal scale space, in which cepstral features are extracted. We apply these features in robust speech recognition experiments on the AURORA-2J corpus. The recognition results show a 61.12% relative performance improvement for clean training and a 27.45% relative improvement for multi-condition training.
- Robust Speech Recognition | Pp. 325-334
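One piece of this pipeline lends itself to a tiny sketch: a center-surround ("lateral inhibition") filter along the frequency axis, which sharpens spectral peaks and suppresses broadband, noise-like energy. The kernel shape and the synthetic spectrum below are assumptions; the paper's temporal contrast stage and scale-space selection are not reproduced.

```python
import numpy as np

# Sketch: lateral inhibition along frequency as a center-surround filter.
# Positive center, negative surround; half-wave rectified output keeps
# only the enhanced peaks.

def lateral_inhibition(spectrum, kernel=(-0.25, -0.25, 1.0, -0.25, -0.25)):
    out = np.convolve(spectrum, np.array(kernel), mode="same")
    return np.maximum(out, 0.0)

freqs = np.linspace(0, 1, 64)
spectrum = np.exp(-((freqs - 0.3) / 0.02) ** 2) + 0.2  # one peak + flat floor
print(lateral_inhibition(spectrum).max())  # peak survives, floor is removed
```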
doi: 10.1007/11939993_37
Signal Trajectory Based Noise Compensation for Robust Speech Recognition
Zhi-Jie Yan; Jian-Lai Zhou; Frank Soong; Ren-Hua Wang
This paper presents a novel signal trajectory based noise compensation algorithm for robust speech recognition, evaluated on the Aurora 2 database. The algorithm consists of two processing stages: 1) the noise spectrum is estimated using trajectory auto-segmentation and clustering, so that spectral subtraction can be performed to roughly estimate the clean speech trajectories; 2) these trajectories are regenerated using trajectory HMMs, where the constraint between static and dynamic spectral information is imposed to refine the noise-subtracted trajectories in both “level” and “shape”. Experimental results show that recognition performance improves after spectral subtraction with or without trajectory regeneration, but the HMM-regenerated trajectories yield the best improvement. After spectral subtraction, the average relative error rate reductions for clean and multi-condition training are 23.21% and 5.58%, respectively; the proposed trajectory regeneration algorithm further improves them to 42.59% and 15.80%.
- Robust Speech Recognition | Pp. 335-345
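The first stage rests on classic magnitude spectral subtraction, which a short sketch can make concrete; the over-subtraction factor and spectral floor below are hypothetical, and the second stage (trajectory regeneration with trajectory HMMs) is not reproduced.

```python
import numpy as np

# Sketch: magnitude spectral subtraction with a spectral floor, the rough
# clean-trajectory estimate that stage 2 of the paper then refines.

def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, beta=0.02):
    """noisy_mag, noise_mag: (T, F) magnitude spectra."""
    clean = noisy_mag - alpha * noise_mag        # subtract noise estimate
    return np.maximum(clean, beta * noisy_mag)   # floor avoids negative bins

rng = np.random.default_rng(1)
noisy = np.abs(rng.normal(1.0, 0.3, size=(10, 5)))
noise = np.full((10, 5), 0.4)                    # hypothetical noise estimate
print(spectral_subtract(noisy, noise).min() >= 0)  # True
```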
doi: 10.1007/11939993_38
An HMM Compensation Approach Using Unscented Transformation for Noisy Speech Recognition
Yu Hu; Qiang Huo
The performance of current HMM-based automatic speech recognition (ASR) systems degrades significantly in real-world applications where there are mismatches between training and testing conditions, caused by factors such as mismatched signal capture and transmission channels and additive environmental noise. Among the many approaches proposed to cope with this robust ASR problem, two notable HMM compensation approaches are the so-called Parallel Model Combination (PMC) and Vector Taylor Series (VTS) approaches. In this paper, we introduce a new HMM compensation approach using a technique called the Unscented Transformation (UT). As a first step, we study three implementations of the UT approach with different computational complexities for noisy speech recognition and evaluate their performance on the Aurora2 connected digits database. The UT approaches achieve significant improvements in recognition accuracy compared to the log-normal-approximation-based PMC and first-order-approximation-based VTS approaches.
- Robust Speech Recognition | Pp. 346-357
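A minimal sketch of the unscented transformation itself, propagating a Gaussian through the log-add mismatch y = log(e^x + e^n) that relates clean-speech and noise log-spectra to the noisy observation. The dimensions, means, covariances, and the kappa parameter are hypothetical; the paper's three implementation variants are not distinguished here.

```python
import numpy as np

# Sketch: plain unscented transformation (UT) with 2d+1 sigma points.

def unscented_transform(mu, cov, f, kappa=1.0):
    d = len(mu)
    S = np.linalg.cholesky((d + kappa) * cov)
    sigma = [mu] + [mu + S[:, i] for i in range(d)] \
                 + [mu - S[:, i] for i in range(d)]
    w = np.full(2 * d + 1, 1.0 / (2 * (d + kappa)))
    w[0] = kappa / (d + kappa)
    Y = np.array([f(s) for s in sigma])          # push points through f
    mean = w @ Y
    diff = Y - mean
    return mean, (w[:, None] * diff).T @ diff    # transformed mean, cov

# Joint state z = [x, n]: one log-spectral bin of speech and of noise.
mu = np.array([2.0, 0.5])
cov = np.diag([0.3, 0.1])
f = lambda z: np.array([np.logaddexp(z[0], z[1])])  # y = log(e^x + e^n)
m, c = unscented_transform(mu, cov, f)
print(m, c)  # compensated mean and variance of the noisy log-spectrum
```

Compared to PMC's log-normal approximation or VTS's first-order expansion, the UT captures the nonlinearity through deterministic sampling, which is what motivates the accuracy gains the abstract reports.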
doi: 10.1007/11939993_39
Noisy Speech Recognition Performance of Discriminative HMMs
Jun Du; Peng Liu; Frank Soong; Jian-Lai Zhou; Ren-Hua Wang
Discriminatively trained HMMs are investigated in both clean and noisy environments in this study. First, recognition errors are defined at different levels, including string, word, phone, and acoustic levels. A high-resolution error measure in terms of minimum divergence (MD) is specifically proposed and investigated along with the other error measures. Using two speaker-independent continuous digit databases, Aurora2 (English) and CNDigits (Mandarin Chinese), the performance of recognizers trained with different error measures and training modes is evaluated under different noise and SNR conditions. Experimental results show that the discriminatively trained models perform better than the maximum likelihood baseline systems. Specifically, for MD-trained systems, relative error reductions of 17.62% and 18.52% were obtained with multi-condition training on Aurora2 and CNDigits, respectively.
- Robust Speech Recognition | Pp. 358-369
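An acoustic-level divergence of the kind an MD error measure can build on is the closed-form KL divergence between Gaussians; a minimal sketch for the diagonal-covariance case follows. The parameter values are hypothetical, and the paper defines its measure over model state sequences rather than single Gaussians.

```python
import math

# Sketch: KL(N1 || N2) for diagonal-covariance Gaussians, summed per
# dimension: 0.5 * [log(v2/v1) + (v1 + (m1-m2)^2)/v2 - 1].

def kl_diag_gaussians(mu1, var1, mu2, var2):
    return 0.5 * sum(
        math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0
        for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2)
    )

print(kl_diag_gaussians([0.0, 1.0], [1.0, 0.5], [0.2, 0.8], [1.2, 0.6]))
```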
doi: 10.1007/11939993_40
Distributed Speech Recognition of Mandarin Digits String
Yih-Ru Wang; Bo-Xuan Lu; Yuan-Fu Liao; Sin-Horng Chen
In this paper, the performance of the pitch detection algorithm in the ETSI ES 202 212 XAFE standard is evaluated on a Mandarin digit string recognition task. Experimental results show that the pitch detection algorithm degrades seriously when the SNR of the speech signal is below 10 dB. This makes a recognizer using pitch information perform worse in low-SNR environments than the original recognizer that does not use pitch information. A modification of the pitch detection algorithm is therefore proposed to improve pitch detection in low-SNR environments. Recognition performance can be improved at most SNR levels by integrating the recognizers with and without pitch information. Overall recognition rates of 82.1% and 86.8% were achieved for the clean and multi-condition training cases, respectively.
- Robust Speech Recognition | Pp. 370-379
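For context, here is a minimal sketch of a plain autocorrelation pitch detector, the kind of front-end component whose voiced/unvoiced decisions become unreliable at low SNR; it is not the XAFE algorithm or the paper's modification, and the frame size, search range, and voicing threshold are assumptions.

```python
import numpy as np

# Sketch: autocorrelation pitch detection on one frame; returns 0.0 for
# frames judged unvoiced (peak below a hypothetical threshold).

def detect_pitch(frame, fs=8000, fmin=60.0, fmax=400.0, threshold=0.3):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                 # normalize by lag-0 energy
    lo, hi = int(fs / fmax), int(fs / fmin)   # plausible lag search range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag if ac[lag] > threshold else 0.0

fs = 8000
t = np.arange(int(0.032 * fs)) / fs
frame = np.sin(2 * np.pi * 200 * t)           # 200 Hz synthetic voiced frame
print(detect_pitch(frame, fs))                # ~200.0
```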