Publications catalog - books


Chinese Spoken Language Processing: 5th International Symposium, ISCSLP 2006, Singapore, December 13-16, 2006, Proceedings

Qiang Huo ; Bin Ma ; Eng-Siong Chng ; Haizhou Li (eds.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Mathematical Logic and Formal Languages; Data Mining and Knowledge Discovery; Algorithm Analysis and Problem Complexity; Document Preparation and Text Processing

Availability
Detected institution | Year of publication | Browse | Download | Request
Not detected | 2006 | SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-49665-6

Electronic ISBN

978-3-540-49666-3

Publisher

Springer Nature

Country of publication

China

Publication date

Publication rights information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

Some Improvements in Phrase-Based Statistical Machine Translation

Zhendong Yang; Wei Pang; Jinhua Du; Wei Wei; Bo Xu

In statistical machine translation, many of the top-performing systems are phrase-based. This paper describes a phrase-based translation system and some improvements to it. We use more information to compute the translation probability. The scaling factors of the log-linear model are estimated by minimum error rate training, which uses an evaluation criterion that balances BLEU and NIST scores. We extract phrase templates from the initial phrases to deal with the data sparseness and distortion problems during decoding. The system produces its final output by re-ranking the n-best list of translations generated in the first pass. Experiments show that all of these refinements improve the results.

- Machine Translation of Speech | Pp. 704-711
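
The log-linear scoring and n-best re-ranking mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature names, weights and candidate values are hypothetical, and in the system described the scaling factors would come from minimum error rate training rather than being set by hand.

```python
from typing import Dict, List, Tuple

# Hypothetical scaling factors for illustration only; a real system
# would tune these, for example with minimum error rate training.
WEIGHTS: Dict[str, float] = {
    "translation_logprob": 1.0,
    "language_model_logprob": 0.5,
    "word_penalty": -0.1,
}

Candidate = Tuple[str, Dict[str, float]]


def loglinear_score(features: Dict[str, float],
                    weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of feature values; unknown features get weight 0."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


def rerank(nbest: List[Candidate],
           weights: Dict[str, float] = WEIGHTS) -> List[Candidate]:
    """Return the n-best list sorted from highest to lowest score."""
    return sorted(nbest,
                  key=lambda item: loglinear_score(item[1], weights),
                  reverse=True)


# Hypothetical two-entry n-best list (placeholder strings and scores).
NBEST: List[Candidate] = [
    ("candidate a", {"translation_logprob": -2.0,
                     "language_model_logprob": -6.0,
                     "word_penalty": 5.0}),
    ("candidate b", {"translation_logprob": -3.0,
                     "language_model_logprob": -2.0,
                     "word_penalty": 4.0}),
]

if __name__ == "__main__":
    for text, feats in rerank(NBEST):
        print(f"{loglinear_score(feats):.2f}  {text}")
```

With these made-up numbers, the second candidate ranks first because its better language-model score outweighs its lower translation score.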

Automatic Spoken Language Translation Template Acquisition Based on Boosting Structure Extraction and Alignment

Rile Hu; Xia Wang

In this paper, we propose a new approach for acquiring translation templates automatically from unannotated bilingual spoken language corpora. Two basic algorithms are adopted: a grammar induction algorithm, and an alignment algorithm using Bracketing Transduction Grammar. The approach is unsupervised, statistical, data-driven, and employs no parsing procedure. The acquisition procedure consists of two steps. First, semantic groups and phrase structure groups are extracted from both the source language and the target language through a boosting procedure, in which a synonym dictionary is used to generate the seed groups of the semantic groups. Second, an alignment algorithm based on Bracketing Transduction Grammar aligns the phrase structure groups. The aligned phrase structure groups are post-processed, yielding translation templates. Preliminary experimental results show that the algorithm is effective.

- Machine Translation of Speech | Pp. 712-723

HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus

Yi Liu; Pascale Fung; Yongsheng Yang; Christopher Cieri; Shudong Huang; David Graff

The paper describes the design, collection, transcription and analysis of the HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), 200 hours of speech from over 2100 Mandarin speakers in mainland China, collected under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data comprise 1206 ten-minute natural Mandarin conversations between either strangers or friends. Each conversation focuses on a single topic. All calls are recorded over public telephone networks and manually annotated with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks, such as topic detection, information retrieval, keyword spotting and speaker recognition. In a 2004 evaluation by NIST, the corpus was found to improve system performance significantly.

- Spoken Language Resources and Annotation | Pp. 724-735

The Paradigm for Creating Multi-lingual Text-To-Speech Voice Databases

Min Chu; Yong Zhao; Yining Chen; Lijuan Wang; Frank Soong

The voice database is one of the most important parts of a TTS system. However, creating a high-quality new TTS voice is not an easy task, even for a professional team. The whole process is rather complicated and involves many details that must be handled carefully. In fact, many stages require human intervention such as manual checking or labeling. In multi-lingual situations, it is even more challenging to find qualified people to do this work. That is why most state-of-the-art TTS systems can provide only a few voices. In this paper, we outline a uniform paradigm for creating multi-lingual TTS voice databases. It focuses on technologies that can either improve the scalability of data collection or reduce human intervention such as manual checking or labeling. With this paradigm, we decrease the complexity and workload of the task.

- Spoken Language Resources and Annotation | Pp. 736-747

Multilingual Speech Corpora for TTS System Development

Hsi-Chun Hsiao; Hsiu-Min Yu; Yih-Ru Wang; Sin-Horng Chen

In this paper, four speech corpora collected in the Speech Lab of NCTU in recent years are discussed. They include a Mandarin tree-bank speech corpus, a Min-Nan speech corpus, a Hakka speech corpus, and a Chinese-English mixed speech corpus. Currently, they are used separately to develop a corpus-based Mandarin TTS system, a Min-Nan TTS system, a Hakka TTS system, and a Chinese-English bilingual TTS system. These systems will be integrated in the future to construct a multilingual TTS system covering the four primary languages used in Taiwan.

- Spoken Language Resources and Annotation | Pp. 748-759

Construct Trilingual Parallel Corpus on Demand

Muyun Yang; Hongfei Jiang; Tiejun Zhao; Sheng Li

This paper describes the effort of constructing the Olympic Oriented Trilingual Corpus for the development of NLP applications for Beijing 2008. Designed to support real NLP applications rather than pure research, the corpus has to meet multilingual, multi-domain and multi-system requirements in its construction. The key issue, however, lies in determining the proper corpus scale in relation to the time and cost allowed. To solve this problem, this paper proposes treating better system performance on a sub-domain than on the whole corpus as the signal that the minimum necessary corpus size has been reached. The hypothesis is that the multi-domain corpus should at least be sufficient to reveal the domain features. So far, a Chinese-English-Japanese trilingual corpus totaling 2.4 million words has been completed as the first-stage result, in which information on the domains, locations and topics of the language materials has been annotated in XML.

- Spoken Language Resources and Annotation | Pp. 760-767

The Contribution of Lexical Resources to Natural Language Processing of CJK Languages

Jack Halpern

The role of lexical resources is often understated in NLP research. The complexity of Chinese, Japanese and Korean (CJK) poses special challenges to developers of NLP tools, especially in the areas of word segmentation (WS), information retrieval (IR), named entity extraction (NER), and machine translation (MT). These difficulties are exacerbated by the lack of comprehensive lexical resources, especially for proper nouns, and the lack of a standardized orthography, especially in Japanese. This paper summarizes some of the major linguistic issues in the development of NLP applications that depend on lexical resources, and discusses the central role such resources should play in enhancing the accuracy of NLP tools.

- Spoken Language Resources and Annotation | Pp. 768-780

Multilingual Spoken Language Corpus Development for Communication Research

Toshiyuki Takezawa

A multilingual spoken language corpus is indispensable for spoken language communication research such as speech-to-speech translation. To promote multilingual spoken language research and development, unified structure and annotation, such as tagging, are indispensable for both speech and natural language processing. We describe our experience with multilingual spoken language corpus development at our research institution, focusing in particular on speech recognition and natural language processing for speech translation of travel conversations.

- Spoken Language Resources and Annotation | Pp. 781-791

Development of Multi-lingual Spoken Corpora of Indian Languages

K. Samudravijaya

This paper describes a recently initiated effort to collect and transcribe read as well as spontaneous speech data in four Indian languages. The completed preparatory work includes the design of phonetically rich sentences, a data acquisition setup for recording speech data over a telephone channel, and a Wizard of Oz setup for acquiring speech data from a spoken dialogue between a caller and the machine in the context of a remote information retrieval task. The paper gives an account of the care taken to collect speech data that is as close to real-world conditions as possible. It also presents the current status of the programme and the set of actions planned to achieve the goal.

- Spoken Language Resources and Annotation | Pp. 792-801