Publications catalog - books

Text, Speech and Dialogue: 8th International Conference, TSD 2005, Karlovy Vary, Czech Republic, September 12-15, 2005, Proceedings

Václav Matoušek ; Pavel Mautner ; Tomáš Pavelka (eds.)

Conference: 8th International Conference on Text, Speech and Dialogue (TSD). Karlovy Vary, Czech Republic. September 12-15, 2005

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Information Storage and Retrieval; Information Systems Applications (incl. Internet)

Availability

Detected institution	Publication year	Browse	Download	Request
None detected	2005	SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-28789-6

Electronic ISBN

978-3-540-31817-0

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2005

Publication rights information

Subject coverage

Computer and information sciences

Electrical, electronic and information engineering

Languages and literature

Table of contents

Check that your institution gives you access to download or request the full book or any of its chapters.

doi: 10.1007/11551874_11

New Meta-grammar Constructs in Czech Language Parser

Aleš Horák; Vladimír Kadlec

In this paper, we present and summarize the latest development of the Czech sentence parsing system. The presented system uses the meta-grammar formalism, which makes it possible to define the grammar with a maintainable number of meta-rules. At the same time, these meta-rules are translated into rules for efficient and fast head-driven chart parsing, supplemented with the evaluation of additional contextual constraints. The paper includes a comprehensive description of the meta-grammar constructs as well as actual running times of the system tested on corpus data.

- Text | Pp. 85-92

doi: 10.1007/11551874_12

Anaphora in Czech: Large Data and Experiments with Automatic Anaphora Resolution

Lucie Kučová; Zdeněk Žabokrtský

The aim of this paper is two-fold. First, we present the part of the annotation scheme of the Prague Dependency Treebank 2.0 that covers coreference on the tectogrammatical layer of sentence representation (more than 45,000 textual and grammatical coreference links in almost 50,000 manually annotated Czech sentences). Second, we report a new pronoun resolution system developed and tested using the treebank data, which achieves a success rate of 60.4%.

- Text | Pp. 93-98

doi: 10.1007/11551874_13

Valency Lexicon of Czech Verbs VALLEX: Recent Experiments with Frame Disambiguation

Markéta Lopatková; Ondřej Bojar; Jiří Semecký; Václava Benešová; Zdeněk Žabokrtský

VALLEX is a linguistically annotated lexicon that aims to describe syntactic information expected to be useful for NLP. The lexicon contains roughly 2500 manually annotated Czech verbs with over 6000 valency frames (summer 2005). In this paper we introduce VALLEX and describe an experiment in which VALLEX frames were assigned to 10,000 corpus instances of 100 Czech verbs; the pairwise inter-annotator agreement reaches 75%. The part of the data on which three human annotators agreed was used for an automatic word sense disambiguation task, in which we achieved a precision of 78.5%.

- Text | Pp. 99-106
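The two quantities mentioned in the abstract above, pairwise inter-annotator agreement and the subset of instances on which all annotators agree, can be illustrated with a minimal sketch. The function names and the frame labels are hypothetical, and the abstract does not say how the authors computed their figure, so this shows only the plain proportion-of-agreeing-pairs reading.

```python
from itertools import combinations
from typing import List, Sequence


def pairwise_agreement(annotations: Sequence[Sequence[str]]) -> float:
    """Share of annotator pairs that chose the same label, over all instances.

    `annotations` holds one tuple of labels per corpus instance,
    with one label per annotator (hypothetical data layout).
    """
    agree = 0
    total = 0
    for labels in annotations:
        for a, b in combinations(labels, 2):
            total += 1
            agree += int(a == b)
    return agree / total if total else 0.0


def unanimous(annotations: Sequence[Sequence[str]]) -> List[int]:
    """Indices of instances on which every annotator chose the same label."""
    return [i for i, labels in enumerate(annotations) if len(set(labels)) == 1]
```

Under this reading, the unanimous instances would be the ones kept as training and test data for a disambiguation experiment.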

doi: 10.1007/11551874_14

AARLISS – An Algorithm for Anaphora Resolution in Long-Distance Inter Sentential Scenarios

Miroslav Martinovic; Anthony Curley; John Gaskins

We present a novel approach for boosting the performance of pronominal anaphora resolution algorithms when the search for antecedents has to span a multi-sentence text passage. The approach is based on identifying the sentences that are “most semantically related” to the sentence containing the anaphor. The level of context sharing between each possible referent sentence and the anaphoric sentence is established using an open-domain external knowledge base. Sentences with scores above a threshold are considered the “most semantically related” and are ranked accordingly. The qualified sentences, together with their context sharing scores, form a new, smaller, and more semantically focused search space. Their respective scores are used as separate preference factors in the final phase of the resolution process – the antecedent selection. We pioneer three implementations of the algorithm and give their corresponding evaluation data.

- Text | Pp. 107-114
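The search-space reduction described in the abstract above (score each candidate sentence against the anaphoric sentence, keep those above a threshold, rank by score) can be sketched as follows. The scoring function `word_overlap` is only a stand-in: the paper uses an open-domain external knowledge base, which the abstract does not name, so a simple word-overlap measure is assumed here. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def word_overlap(a: str, b: str) -> float:
    """Stand-in relatedness score (Jaccard overlap of word sets).

    This is an assumption for illustration, not the knowledge-base
    score used in the paper.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def related_sentences(
    anaphoric: str,
    candidates: Sequence[str],
    threshold: float,
    score: Callable[[str, str], float] = word_overlap,
) -> List[Tuple[float, str]]:
    """Keep candidates scoring above `threshold`, highest score first."""
    scored = [(score(s, anaphoric), s) for s in candidates]
    kept = [pair for pair in scored if pair[0] > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept
```

The returned scores could then be passed on as preference factors to whatever antecedent selection step follows.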

doi: 10.1007/11551874_15

Detection and Correction of Malapropisms in Spanish by Means of Internet Search

Igor A. Bolshakov; Sofia N. Galicia-Haro; Alexander Gelbukh

Malapropisms are real-word errors that lead to syntactically correct but semantically implausible text. We report an experiment on the detection and correction of Spanish malapropisms. Malapropos words semantically destroy the collocations (syntactically connected word pairs) they occur in. Thus we detect possible malapropisms as words that do not form semantically plausible collocations with neighboring words. As correction candidates, we select words that are similar to the suspected one but form plausible collocations with neighboring words. To judge the semantic plausibility of a collocation, we use Google statistics on the occurrences of the word combination and of the two words taken separately. Since collocation components can be separated by other words in a sentence, the Google statistics are gathered for the most probable distance between them. The statistics are converted into a specially defined Semantic Compatibility Index (SCI). Heuristic rules are proposed to signal malapropisms when SCI values are lower than a predetermined threshold and to retain a few correction candidates with the highest SCI ranks. Our experiments gave promising results.

- Text | Pp. 115-122
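The detect-then-correct procedure in the abstract above can be sketched under clearly stated assumptions. The abstract does not give the SCI formula, so `compatibility` below is a stand-in (the log of the pair count normalized by the geometric mean of the single-word counts), not the paper's definition. The counts are invented toy numbers rather than real search statistics, and all names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def compatibility(n_pair: int, n_a: int, n_b: int) -> float:
    """Stand-in compatibility score; NOT the paper's SCI definition."""
    if n_pair <= 0 or n_a <= 0 or n_b <= 0:
        return float("-inf")
    return math.log(n_pair / math.sqrt(n_a * n_b))


def suggest(
    neighbor: str,
    suspect: str,
    similar_words: Sequence[str],
    word_counts: Dict[str, int],
    pair_counts: Dict[Tuple[str, str], int],
    threshold: float,
    keep: int = 3,
) -> List[str]:
    """Return up to `keep` replacements if `suspect` scores below `threshold`."""

    def score(word: str) -> float:
        return compatibility(
            pair_counts.get((neighbor, word), 0),
            word_counts.get(neighbor, 0),
            word_counts.get(word, 0),
        )

    if score(suspect) >= threshold:
        return []  # the collocation looks plausible; nothing to correct
    plausible = [w for w in similar_words if score(w) >= threshold]
    return sorted(plausible, key=score, reverse=True)[:keep]


# Invented toy counts for illustration only.
WORD_COUNTS = {"mañana": 1_000_000, "soldada": 20_000, "soleada": 50_000, "saldada": 30_000}
PAIR_COUNTS = {("mañana", "soldada"): 2, ("mañana", "soleada"): 5_000}

print(suggest("mañana", "soldada", ["soleada", "saldada"], WORD_COUNTS, PAIR_COUNTS, -6.0))
```

With these toy counts the suspect word falls below the threshold and only one similar word forms a plausible pair, so a single candidate is returned.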

doi: 10.1007/11551874_16

The Szeged Treebank

Dóra Csendes; János Csirik; Tibor Gyimóthy; András Kocsor

The major aim of the Szeged Treebank project was to create a high-quality database of syntactic structures for Hungarian that can serve as a gold standard for further research in linguistics and computational language processing. The treebank currently contains full syntactic parses of about 82,000 sentences, produced by accurate manual annotation. This paper describes the linguistic theory as well as the actual method used in the annotation process. In addition, the use of the treebank for training automated syntactic parsers is presented.

- Text | Pp. 123-131

doi: 10.1007/11551874_17

Automatic Lemmatizer Construction with Focus on OOV Words Lemmatization

Jakub Kanis; Luděk Müller

This paper deals with the automatic construction of a lemmatizer from a Full Form – Lemma (FFL) training dictionary and with the lemmatization of new words that are unseen in the FFL dictionary, i.e. out-of-vocabulary (OOV) words. Three methods for lemmatizing three kinds of OOV words (missing full forms, unknown words, and compound words) are introduced. These methods were tested on Czech test data. The best result (recall: 99.3% and precision: 75.1%) was achieved by a combination of these methods. A lexicon-free lemmatizer based on the method for lemmatizing unknown words (the lemmatization patterns method) is also introduced.

- Text | Pp. 132-139
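One way to picture learning a lemmatizer from full form and lemma pairs is as suffix-rewrite rules. The abstract above does not describe the lemmatization patterns method in detail, so the rule format below (strip the longest common prefix, keep the differing endings, prefer the longest and most frequent matching rule) is an assumption for illustration. The function names and the tiny training list are hypothetical.

```python
from collections import Counter
from os.path import commonprefix
from typing import Iterable, Optional, Tuple

Rule = Tuple[str, str]  # (ending of the full form, ending of the lemma)


def learn_rules(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count suffix-rewrite rules derived from (full form, lemma) pairs."""
    rules: Counter = Counter()
    for form, lemma in pairs:
        stem = commonprefix([form, lemma])  # character-wise common prefix
        rules[(form[len(stem):], lemma[len(stem):])] += 1
    return rules


def lemmatize(word: str, rules: Counter) -> Optional[str]:
    """Apply the longest matching rule; break ties by rule frequency."""
    best: Optional[str] = None
    best_key = (-1, -1)
    for (form_end, lemma_end), count in rules.items():
        if len(word) > len(form_end) and word.endswith(form_end):
            key = (len(form_end), count)
            if key > best_key:
                best_key = key
                best = word[: len(word) - len(form_end)] + lemma_end
    return best  # None when no rule applies


# Tiny hypothetical training dictionary (Czech full form, lemma).
TRAINING = [("ženy", "žena"), ("ženu", "žena"), ("školy", "škola"), ("městem", "město")]
RULES = learn_rules(TRAINING)
```

Calling `lemmatize("knihy", RULES)` applies the rule learned from `ženy` and `školy`, although neither `knihy` nor its lemma appears in the training pairs.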

doi: 10.1007/11551874_18

Modeling Syntax of Free Word-Order Languages: Dependency Analysis by Reduction

Markéta Lopatková; Martin Plátek; Vladislav Kuboň

This paper explains the principles of dependency analysis by reduction and its correspondence to the notions of dependency and dependency tree. The explanation is illustrated by examples from Czech, a language with a relatively high degree of word-order freedom. The paper sums up the basic features of methods of dependency syntax. The method serves as a basis for the verification (and explanation) of the adequacy of formal and computational models of those methods.

- Text | Pp. 140-147

doi: 10.1007/11551874_19

Morphological Meanings in the Prague Dependency Treebank 2.0

Magda Razímová; Zdeněk Žabokrtský

In this paper we report our work on the system of grammatemes (mostly semantically oriented counterparts of morphological categories such as number, degree of comparison, or tense), a concept that was introduced in Functional Generative Description and is now further elaborated in the context of the Prague Dependency Treebank 2.0. We also present a new hierarchical typology of tectogrammatical nodes.

- Text | Pp. 148-155

doi: 10.1007/11551874_20

Automatic Acquisition of a Slovak Lexicon from a Raw Corpus

Benoît Sagot

This paper presents an automatic methodology we used in an experiment to acquire a morphological lexicon for the Slovak language, and the lexicon we obtained. This methodology extends and refines approaches which have proven efficient, e.g., for the acquisition of French verbs or Croatian and Russian nouns, adjectives and verbs. It relies only on a raw corpus and on a morphological description of the language. The underlying idea is to build all possible lemmas that can explain the words found in the corpus, according to the morphological description, and to rank these hypothetical lemmas according to their likelihood given the corpus. Of course, hand-validation and iteration of the whole process are needed to achieve a high-quality lexicon, but the human involvement required is orders of magnitude lower than the cost of fully manual development of such a resource. Moreover, this technique can easily be applied to other languages with a rich morphology that lack large-coverage lexical resources.

- Text | Pp. 156-163
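The underlying idea in the abstract above (generate every lemma that could explain a corpus word under the morphological description, then rank the hypotheses) can be sketched as below. The ranking score used here, the fraction of a paradigm's forms attested in the corpus, is a simple stand-in for the likelihood used in the paper. `PARADIGMS` is a heavily simplified toy description, not an accurate account of Slovak morphology, and all names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Toy morphological description: name -> (lemma ending, endings of inflected forms).
PARADIGMS: Dict[str, Tuple[str, List[str]]] = {
    "fem_a": ("a", ["a", "y", "e", "u", "ou"]),
    "masc_0": ("", ["", "a", "u", "om"]),
}


def rank_lemmas(
    corpus_words: Iterable[str],
    paradigms: Dict[str, Tuple[str, List[str]]],
) -> List[Tuple[Tuple[str, str], float]]:
    """Return ((lemma, paradigm), score) pairs, best-supported hypotheses first."""
    words = set(corpus_words)
    scores: Dict[Tuple[str, str], float] = {}
    for word in words:
        for name, (lemma_ending, endings) in paradigms.items():
            for ending in endings:
                if len(word) <= len(ending) or not word.endswith(ending):
                    continue
                stem = word[: len(word) - len(ending)]
                key = (stem + lemma_ending, name)
                if key in scores:
                    continue
                # Stand-in score: share of the paradigm's forms seen in the corpus.
                attested = sum(1 for e in endings if stem + e in words)
                scores[key] = attested / len(endings)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

In practice the top of such a ranking is what a human validator would review first, which is where the reduction in manual effort would come from.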