Publications catalog - books
Chinese Spoken Language Processing: 5th International Symposium, ISCSLP 2006, Singapore, December 13-16, 2006, Proceedings
Qiang Huo; Bin Ma; Eng-Siong Chng; Haizhou Li (eds.)
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Mathematical Logic and Formal Languages; Data Mining and Knowledge Discovery; Algorithm Analysis and Problem Complexity; Document Preparation and Text Processing
Availability
Detected institution | Publication year | Browse | Download | Request
---|---|---|---|---
Not detected | 2006 | SpringerLink | |
Information
Resource type:
books
Print ISBN
978-3-540-49665-6
Electronic ISBN
978-3-540-49666-3
Publisher
Springer Nature
Country of publication
China
Publication date
2006
Publication rights information
© Springer-Verlag Berlin Heidelberg 2006
Table of contents
doi: 10.1007/11939993_41
Unsupervised Speaker Adaptation Using Reference Speaker Weighting
Tsz-Chung Lai; Brian Mak
Recently, we revisited the fast adaptation method called reference speaker weighting (RSW) and suggested a few modifications. We then showed that this algorithmically simplest technique actually outperformed conventional adaptation techniques like MAP and MLLR for 5- or 10-second supervised adaptation on the Wall Street Journal 5K task. In this paper, we further investigate the performance of RSW in unsupervised adaptation mode, which is the more natural way of doing adaptation in practice. Moreover, various analyses were carried out on the reference speakers computed by the method.
- Speech Adaptation/Normalization | Pp. 380-389
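The abstract does not spell out the weight estimation. As a minimal sketch of the RSW idea only, with least squares standing in for the paper's likelihood-based estimation under the HMM, the adapted model's mean supervector is expressed as a weighted combination of the reference speakers' supervectors:

```python
import numpy as np

def rsw_adapt(ref_supervectors, target_stats):
    """Estimate reference-speaker weights (here by least squares, an
    assumption of this sketch) and return the adapted mean supervector.

    ref_supervectors: (R, D) matrix, one mean supervector per reference speaker.
    target_stats:     (D,) vector of mean statistics from the adaptation data.
    """
    A = np.asarray(ref_supervectors, dtype=float).T   # (D, R)
    b = np.asarray(target_stats, dtype=float)         # (D,)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)         # weights over references
    return A @ w, w

# toy example: the target speaker lies in the span of two reference speakers
refs = [[1.0, 0.0], [0.0, 1.0]]
adapted, w = rsw_adapt(refs, [0.3, 0.7])
```

With only a handful of free weights (one per reference speaker), very little adaptation data suffices, which is why RSW suits 5- to 10-second adaptation.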
doi: 10.1007/11939993_42
Automatic Construction of Regression Class Tree for MLLR Via Model-Based Hierarchical Clustering
Shih-Sian Cheng; Yeong-Yuh Xu; Hsin-Min Wang; Hsin-Chia Fu
In this paper, we propose a model-based hierarchical clustering algorithm that automatically builds a regression class tree for the well-known speaker adaptation technique, Maximum Likelihood Linear Regression (MLLR). When building a regression class tree, the mean vectors of the Gaussian components of the model set of a speaker-independent CDHMM-based speech recognition system are collected as the input data for clustering. The proposed algorithm comprises two stages. First, the input data (i.e., all the Gaussian mean vectors of the CDHMMs) is iteratively partitioned by a divisive hierarchical clustering strategy, and the Bayesian Information Criterion (BIC) is applied to determine the number of clusters (i.e., the base classes of the regression class tree). Then, the regression class tree is built by iteratively merging these base clusters using an agglomerative hierarchical clustering strategy, which also uses BIC as the merging criterion. We evaluated the proposed regression class tree construction algorithm on a Mandarin Chinese continuous speech recognition task. Compared to the regression class tree implementation in HTK, the proposed algorithm is more effective in building the regression class tree and can determine the number of regression classes automatically.
- Speech Adaptation/Normalization | Pp. 390-398
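The split/merge decisions in such a clustering hinge on a BIC comparison between modeling two clusters separately versus merged. A minimal sketch of that test for full-covariance Gaussians (the exact penalty weighting in the paper may differ):

```python
import numpy as np

def gaussian_loglik(X):
    """Log-likelihood of data X under a single ML-fitted full-covariance Gaussian."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)  # ML covariance
    sign, logdet = np.linalg.slogdet(cov)
    # for the ML fit the Mahalanobis terms sum to n*d
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def delta_bic(X, Y, lam=1.0):
    """BIC difference: two separate Gaussians vs. one merged Gaussian.
    A positive value favors keeping the two clusters apart."""
    d = X.shape[1]
    n = len(X) + len(Y)
    p = d + d * (d + 1) / 2                    # parameters of one extra Gaussian
    sep = gaussian_loglik(X) + gaussian_loglik(Y)
    merged = gaussian_loglik(np.vstack([X, Y]))
    return sep - merged - 0.5 * lam * p * np.log(n)

# toy data: two well-separated clusters vs. two draws from the same cluster
rng = np.random.default_rng(0)
far_a, far_b = rng.normal(0, 1, (200, 2)), rng.normal(10, 1, (200, 2))
same_a, same_b = rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2))
```

Because the criterion is data-driven, the number of base classes falls out of the clustering itself rather than being fixed in advance, which is the advantage the abstract claims over the HTK tree.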
doi: 10.1007/11939993_43
A Minimum Boundary Error Framework for Automatic Phonetic Segmentation
Jen-Wei Kuo; Hsin-Min Wang
This paper presents a novel framework for HMM-based automatic phonetic segmentation that improves the accuracy of placing phone boundaries. In the framework, both training and segmentation approaches are proposed according to the minimum boundary error (MBE) criterion, which tries to minimize the expected boundary errors over a set of possible phonetic alignments. This framework is inspired by the recently proposed minimum phone error (MPE) training approach and the minimum Bayes risk decoding algorithm for automatic speech recognition. To evaluate the proposed MBE framework, we conduct automatic phonetic segmentation experiments on the TIMIT acoustic-phonetic continuous speech corpus. MBE segmentation with MBE-trained models can identify 80.53% of human-labeled phone boundaries within a tolerance of 10 ms, compared to 71.10% identified by conventional ML segmentation with ML-trained models. Moreover, by using the MBE framework, only 7.15% of automatically labeled phone boundaries have errors larger than 20 ms.
- General Topics in Speech Recognition | Pp. 399-409
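The decoding side of the MBE criterion can be sketched in a few lines: among candidate alignments, pick the one whose posterior-weighted boundary distance to all other candidates is smallest. This toy version (absolute frame displacement as the boundary error, a small explicit candidate list instead of a lattice) is an illustration, not the paper's implementation:

```python
def expected_boundary_error(candidate, alignments, posteriors):
    """Expected boundary error of `candidate` under the posterior distribution
    over alignments; each alignment is a tuple of boundary frame indices."""
    return sum(
        p * sum(abs(a, ) if False else abs(a - b) for a, b in zip(candidate, ali))
        for ali, p in zip(alignments, posteriors)
    )

def mbe_decode(alignments, posteriors):
    """Minimum Bayes risk pick: the candidate with minimum expected boundary error."""
    return min(alignments,
               key=lambda c: expected_boundary_error(c, alignments, posteriors))

# toy set of three candidate phone alignments with posteriors
cands = [(10, 20, 30), (12, 20, 30), (11, 20, 31)]
post = [0.5, 0.3, 0.2]
best = mbe_decode(cands, post)
```

This contrasts with conventional ML segmentation, which keeps the single most likely alignment and ignores how far its boundaries sit from other plausible alignments.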
doi: 10.1007/11939993_44
Advances in Mandarin Broadcast Speech Transcription at IBM Under the DARPA GALE Program
Yong Qin; Qin Shi; Yi Y. Liu; Hagai Aronowitz; Stephen M. Chu; Hong-Kwang Kuo; Geoffrey Zweig
This paper describes the technical and system building advances in the automatic transcription of Mandarin broadcast speech made at IBM in the first year of the DARPA GALE program. In particular, we discuss the application of minimum phone error (MPE) discriminative training and a new topic-adaptive language modeling technique. We present results on both the RT04 evaluation data and two larger community-defined test sets designed to cover both the broadcast news and the broadcast conversation domains. It is shown that with the described advances, the new transcription system achieves a 26.3% relative reduction in character error rate over our previous best-performing system, and is competitive with published numbers on these datasets. The results are further analyzed to give a comprehensive account of the relationship between the errors and the properties of the test data.
- Large Vocabulary Continuous Speech Recognition | Pp. 410-421
doi: 10.1007/11939993_45
Improved Large Vocabulary Continuous Chinese Speech Recognition by Character-Based Consensus Networks
Yi-Sheng Fu; Yi-Cheng Pan; Lin-shan Lee
Word-based consensus networks have been verified to be very useful in minimizing word error rates (WER) in large vocabulary continuous speech recognition for Western languages. By considering the special structure of the Chinese language, this paper points out that character-based rather than word-based consensus networks should work better for Chinese. This is verified by extensive experimental results also reported in the paper.
- Large Vocabulary Continuous Speech Recognition | Pp. 422-434
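The intuition is that Chinese words segment into characters, so competing word hypotheses can be aligned and voted on character by character. A deliberately simplified sketch (it assumes all hypotheses segment into the same number of characters, so each position forms one confusion bin; real consensus networks align a full lattice):

```python
from collections import defaultdict

def character_consensus(hyps):
    """Position-by-position character voting over recognizer hypotheses.

    hyps: list of (character_sequence, posterior) pairs, all of equal length
    in this sketch.
    """
    n = len(hyps[0][0])
    bins = [defaultdict(float) for _ in range(n)]
    for chars, post in hyps:
        for i, ch in enumerate(chars):
            bins[i][ch] += post           # accumulate posterior mass per bin
    return "".join(max(b, key=b.get) for b in bins)

# toy example: three hypotheses over a 3-character utterance
hyps = [("ABD", 0.4), ("ABC", 0.35), ("XBC", 0.25)]
best = character_consensus(hyps)
```

Note that the consensus output "ABC" differs from the 1-best hypothesis "ABD": the last position's mass (0.35 + 0.25 for C vs. 0.4 for D) flips the decision, which is exactly how consensus decoding reduces error rate below 1-best decoding.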
doi: 10.1007/11939993_46
All-Path Decoding Algorithm for Segmental Based Speech Recognition
Yun Tang; Wenju Liu; Bo Xu
In conventional speech processing, researchers adopt a dividable assumption: a speech utterance can be divided into non-overlapping feature sequences, with each segment representing an acoustic event or a label. The probability of a label sequence given an utterance is then approximated by the probability of the best segmentation of the utterance for that label sequence. In reality, however, the feature sequences of acoustic events may partially overlap, especially for neighboring phonemes within a syllable, and the best-segmentation approximation further reinforces the distortion introduced by the dividable assumption. In this paper, we propose an all-path decoding algorithm that fuses the information obtained from different segmentations (or paths) without obvious extra computation, so that the weakness of the dividable assumption can be alleviated. Our experiments show that the new decoding algorithm effectively improves system performance on tasks with heavy insertion and deletion errors.
- Large Vocabulary Continuous Speech Recognition | Pp. 435-444
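The core contrast can be made concrete with a small dynamic program: scoring a label sequence by summing over every segmentation of the utterance (log-sum-exp) instead of keeping only the best one (max). This is a generic illustration of the all-path idea under a user-supplied segment scorer, not the paper's decoder:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sequence_logprob(seg_logprob, n_frames, labels, all_path=True):
    """Log-score of `labels` over an utterance of n_frames frames.

    seg_logprob(label, start, end) scores one segment covering [start, end).
    all_path=True sums over every segmentation; all_path=False keeps only
    the best segmentation (the conventional approximation).
    """
    combine = logsumexp if all_path else max
    NEG = float("-inf")
    # dp[k][t]: score of the first k labels covering frames [0, t)
    dp = [[NEG] * (n_frames + 1) for _ in range(len(labels) + 1)]
    dp[0][0] = 0.0
    for k, lab in enumerate(labels, 1):
        for t in range(k, n_frames + 1):
            scores = [dp[k - 1][s] + seg_logprob(lab, s, t)
                      for s in range(k - 1, t) if dp[k - 1][s] > NEG]
            if scores:
                dp[k][t] = combine(scores)
    return dp[len(labels)][n_frames]
```

With a uniform per-frame score, two labels over three frames admit two segmentations; the all-path score aggregates both, so it always dominates the best-path score while reusing the same dynamic-programming table.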
doi: 10.1007/11939993_47
Improved Mandarin Speech Recognition by Lattice Rescoring with Enhanced Tone Models
Huanliang Wang; Yao Qian; Frank Soong; Jian-Lai Zhou; Jiqing Han
Tone plays an important lexical role in spoken tonal languages like Mandarin Chinese. In this paper we propose a two-pass search strategy for improving tonal syllable recognition performance. In the first pass, instantaneous F0 information is employed along with corresponding cepstral information in a 2-stream HMM based decoding. The F0 stream, which incorporates both discrete voiced/unvoiced information and the continuous F0 contour, is modeled with a multi-space distribution. With the first-pass decoding alone, we recently reported a 24% relative reduction in tonal syllable recognition errors on a Mandarin Chinese database [5]. In the second pass, F0 information over a longer, horizontal time span is used to build explicit tone models for rescoring the lattice generated in the first pass. Experimental results on the same Mandarin database show that an additional 8% relative reduction in tonal syllable recognition errors is obtained by the second-pass search, i.e., lattice rescoring with enhanced tone models.
- Large Vocabulary Continuous Speech Recognition | Pp. 445-453
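The second pass reduces, in essence, to adding a weighted tone-model score to each first-pass lattice hypothesis and re-ranking. A toy sketch (hypotheses listed explicitly instead of a lattice; the weight and score shapes are assumptions of this sketch):

```python
def rescore(hypotheses, tone_score, weight=0.3):
    """Second-pass rescoring: add a weighted tone-model log score to each
    first-pass hypothesis and return the best re-ranked syllable sequence.

    hypotheses: list of (tonal_syllable_sequence, first_pass_logscore).
    tone_score: maps a syllable sequence to a tone-model log score.
    """
    rescored = [(syls, s + weight * tone_score(syls)) for syls, s in hypotheses]
    return max(rescored, key=lambda x: x[1])[0]

# toy lattice: two hypotheses differing only in tone; the first pass slightly
# prefers the wrong tone, and the explicit tone model flips the decision
hyps = [(("ma1",), -1.0), (("ma3",), -1.2)]
tone_lm = {("ma1",): -2.0, ("ma3",): 0.0}
best = rescore(hyps, tone_lm.get)
```

Keeping tone modeling in a separate rescoring pass lets the tone models look at longer F0 spans than the frame-synchronous first pass can.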
doi: 10.1007/11939993_48
On Using Entropy Information to Improve Posterior Probability-Based Confidence Measures
Tzan-Hwei Chen; Berlin Chen; Hsin-Min Wang
In this paper, we propose a novel approach that reduces the confidence error rate of traditional posterior probability-based confidence measures in large vocabulary continuous speech recognition systems. The method enhances the discriminability of confidence measures by applying entropy information to the posterior probability-based confidence measures of word hypotheses. The experiments conducted on the Chinese Mandarin broadcast news database MATBN show that entropy-based confidence measures outperform traditional posterior probability-based confidence measures. The relative reductions in the confidence error rate are 14.11% and 9.17% for experiments conducted on field reporter speech and interviewee speech, respectively.
- Large Vocabulary Continuous Speech Recognition | Pp. 454-463
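One simple way to fold entropy into a posterior-based confidence measure, shown here purely as an illustration (the paper's exact combination may differ): keep the top posterior when the distribution over competing hypotheses is peaked, and discount it when the distribution is flat.

```python
import math

def entropy_confidence(posteriors):
    """Scale the top posterior by one minus the normalized entropy of the
    hypothesis distribution: peaked distributions keep their confidence,
    flat (ambiguous) ones are penalized."""
    top = max(posteriors)
    h = -sum(p * math.log(p) for p in posteriors if p > 0)
    h_max = math.log(len(posteriors)) if len(posteriors) > 1 else 1.0
    return top * (1.0 - h / h_max)

peaked = entropy_confidence([0.97, 0.01, 0.01, 0.01])  # clear winner
flat = entropy_confidence([0.4, 0.3, 0.2, 0.1])        # ambiguous slot
```

A raw posterior of 0.4 looks moderately confident on its own; the entropy term exposes that the competing hypotheses carry almost as much mass, which is the discriminability gain the abstract describes.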
doi: 10.1007/11939993_49
Vietnamese Automatic Speech Recognition: The FLaVoR Approach
Quan Vu; Kris Demuynck; Dirk Van Compernolle
Automatic speech recognition for languages in Southeast Asia, including Chinese, Thai and Vietnamese, typically models both acoustics and language at the syllable level. This paper presents a new approach for recognizing those languages by exploiting information at the word level. The new approach, adapted from our FLaVoR architecture [1], consists of two layers. In the first layer, a pure acoustic-phonemic search generates a dense phoneme network enriched with metadata. In the second layer, word decoding is performed on the composition of a series of finite-state transducers (FSTs), combining various knowledge sources across sub-lexical, word-lexical and word-based language models. Experimental results on the Vietnamese Broadcast News corpus showed that our approach is both effective and flexible.
- Large Vocabulary Continuous Speech Recognition | Pp. 464-474
doi: 10.1007/11939993_50
Language Identification by Using Syllable-Based Duration Classification on Code-Switching Speech
Dau-cheng Lyu; Ren-yuan Lyu; Yuang-chin Chiang; Chun-nan Hsu
Many approaches to automatic spoken language identification (LID) are successful on monolingual speech, but LID on code-switching speech, which requires identifying at least two languages within one acoustic utterance, challenges these approaches. In [6], we successfully used a one-pass approach to recognize Chinese characters in Mandarin-Taiwanese code-switching speech. In this paper, we introduce a classification method (named syllable-based duration classification) based on three clues: the recognized common tonal syllable, its corresponding duration, and the speech signal, to identify the specific language in code-switching speech. Experimental results show that the performance of the proposed LID approach on code-switching speech is close to that of a parallel tonal syllable recognition LID system on monolingual speech.
- Multilingual Recognition and Identification | Pp. 475-484