Publications catalog - books
Text, Speech and Dialogue: 8th International Conference, TSD 2005, Karlovy Vary, Czech Republic, September 12-15, 2005, Proceedings
Václav Matoušek ; Pavel Mautner ; Tomáš Pavelka (eds.)
Conference: 8th International Conference on Text, Speech and Dialogue (TSD). Karlovy Vary, Czech Republic. September 12, 2005 - September 15, 2005
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Language Translation and Linguistics; Artificial Intelligence (incl. Robotics); Information Storage and Retrieval; Information Systems Applications (incl. Internet)
Availability
Detected institution | Publication year | Browse | Download | Request |
---|---|---|---|---|
Not detected | 2005 | SpringerLink |
Information
Resource type:
books
Print ISBN
978-3-540-28789-6
Electronic ISBN
978-3-540-31817-0
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2005
Copyright information
© Springer-Verlag Berlin Heidelberg 2005
Table of contents
doi: 10.1007/11551874_11
New Meta-grammar Constructs in Czech Language Parser
Aleš Horák; Vladimír Kadlec
In this paper, we present and summarize the latest development of the Czech sentence parsing system. The presented system uses the meta-grammar formalism, which makes it possible to define the grammar with a maintainable number of meta-rules. At the same time, these meta-rules are translated into rules for efficient and fast head-driven chart parsing supplemented with evaluation of additional contextual constraints. The paper includes a comprehensive description of the meta-grammar constructs as well as actual running times of the system tested on corpus data.
- Text | Pp. 85-92
doi: 10.1007/11551874_12
Anaphora in Czech: Large Data and Experiments with Automatic Anaphora Resolution
Lucie Kučová; Zdeněk Žabokrtský
The aim of this paper is two-fold. First, we want to present a part of the annotation scheme of the Prague Dependency Treebank 2.0 related to the annotation of coreference on the tectogrammatical layer of sentence representation (more than 45,000 textual and grammatical coreference links in almost 50,000 manually annotated Czech sentences). Second, we report a new pronoun resolution system developed and tested using the treebank data, the success rate of which is 60.4 %.
- Text | Pp. 93-98
doi: 10.1007/11551874_13
Valency Lexicon of Czech Verbs VALLEX: Recent Experiments with Frame Disambiguation
Markéta Lopatková; Ondřej Bojar; Jiří Semecký; Václava Benešová; Zdeněk Žabokrtský
VALLEX is a linguistically annotated lexicon that aims to describe syntactic information expected to be useful for NLP. The lexicon contains roughly 2500 manually annotated Czech verbs with over 6000 valency frames (summer 2005). In this paper we introduce VALLEX and describe an experiment where VALLEX frames were assigned to 10,000 corpus instances of 100 Czech verbs; the pairwise inter-annotator agreement reaches 75%. The part of the data where three human annotators agreed was used for an automatic word sense disambiguation task, in which we achieved a precision of 78.5%.
- Text | Pp. 99-106
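The pairwise inter-annotator agreement reported in the abstract above can be illustrated as the share of instances on which two annotators chose the same frame, averaged over all annotator pairs. The following is a minimal sketch under that assumption; the function names, the toy labels, and the averaging choice are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_agreement(annotations):
    """Average raw agreement over all annotator pairs.

    annotations: dict mapping annotator name -> list of frame labels,
    one label per corpus instance (all lists have the same length).
    """
    names = sorted(annotations)
    scores = []
    for a, b in combinations(names, 2):
        labels_a, labels_b = annotations[a], annotations[b]
        same = sum(1 for x, y in zip(labels_a, labels_b) if x == y)
        scores.append(same / len(labels_a))
    return sum(scores) / len(scores)


def unanimous_instances(annotations):
    """Indices of instances on which every annotator chose the same label."""
    columns = zip(*annotations.values())
    return [i for i, labels in enumerate(columns) if len(set(labels)) == 1]


# Toy data (hypothetical): three annotators, four corpus instances.
toy = {
    "A": ["f1", "f2", "f1", "f3"],
    "B": ["f1", "f2", "f2", "f3"],
    "C": ["f1", "f1", "f2", "f3"],
}
agreement = pairwise_agreement(toy)
agreed = unanimous_instances(toy)
```

The instances returned by `unanimous_instances` correspond to the kind of subset the abstract describes as the data on which three annotators agreed.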
doi: 10.1007/11551874_14
AARLISS – An Algorithm for Anaphora Resolution in Long-Distance Inter Sentential Scenarios
Miroslav Martinovic; Anthony Curley; John Gaskins
We present a novel approach for boosting the performance of pronominal anaphora resolution algorithms when the search for antecedents has to span a multi-sentence text passage. The approach is based on the identification of sentences which are “most semantically related” to the sentence with anaphora. The context sharing level between each possible referent sentence and the anaphoric sentence is established using an open-domain external knowledge base. Sentences with scores higher than a threshold level are considered the “most semantically related” and ranked accordingly. The qualified sentences, together with their context sharing scores, represent a new, smaller, and more semantically focused search space. Their respective scores are used as separate preference factors in the final phase of the resolution process, the antecedent selection. We introduce three implementations of the algorithm with their corresponding evaluation data.
- Text | Pp. 107-114
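The threshold-and-rank step described in the abstract above can be sketched as follows. This is an illustration only: the paper derives the context sharing level from an open-domain external knowledge base, whereas the `context_sharing` function here is a hypothetical stand-in based on plain word overlap, and all names and the threshold value are assumptions.

```python
def context_sharing(sentence, anaphoric_sentence):
    """Stand-in score (assumption): Jaccard overlap of lowercase word sets."""
    a = set(sentence.lower().split())
    b = set(anaphoric_sentence.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def reduced_search_space(candidates, anaphoric_sentence, threshold):
    """Keep sentences scoring above the threshold, best score first."""
    scored = [(context_sharing(s, anaphoric_sentence), s) for s in candidates]
    kept = [(score, s) for score, s in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Toy passage (hypothetical): the last sentence contains the anaphor "it".
candidates = [
    "the dog chased the ball",
    "stocks fell sharply today",
    "the ball rolled away",
]
ranked = reduced_search_space(candidates, "it rolled under the ball", 0.2)
```

The scores kept alongside each sentence could then be passed on as preference factors to whatever antecedent selection step follows.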
doi: 10.1007/11551874_15
Detection and Correction of Malapropisms in Spanish by Means of Internet Search
Igor A. Bolshakov; Sofia N. Galicia-Haro; Alexander Gelbukh
Malapropisms are real-word errors that lead to syntactically correct but semantically implausible text. We report an experiment on detection and correction of Spanish malapropisms. Malapropos words semantically destroy the collocations (syntactically connected word pairs) they are in. Thus we detect possible malapropisms as words that do not form semantically plausible collocations with neighboring words. As correction candidates, we select words similar to the suspected one but forming plausible collocations with neighboring words. To judge the semantic plausibility of a collocation, we use Google statistics of occurrences of the word combination and of the two words taken apart. Since collocation components can be separated by other words in a sentence, Google statistics are gathered for the most probable distance between them. The statistics are converted into a specially defined Semantic Compatibility Index (SCI). Heuristic rules are proposed to signal malapropisms when SCI values are lower than a predetermined threshold and to retain a few highly SCI-ranked correction candidates. Our experiments gave promising results.
- Text | Pp. 115-122
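The detect-then-correct logic of the abstract above can be sketched roughly as below. The abstract does not give the definition of the SCI, so the `compatibility` function is a stand-in (log of the pair count over the geometric mean of the single-word counts), not the paper's index. The counts are a hypothetical in-memory table rather than live search-engine queries, and the English toy words only illustrate the mechanism.

```python
import math


def compatibility(pair_count, count_a, count_b):
    """Stand-in score (NOT the paper's SCI): log of the pair count divided by
    the geometric mean of the two single-word counts; -inf for unseen pairs."""
    if pair_count == 0 or count_a == 0 or count_b == 0:
        return float("-inf")
    return math.log(pair_count / math.sqrt(count_a * count_b))


def best_corrections(suspect, neighbor, similar_words,
                     pair_counts, word_counts, threshold, keep=2):
    """Return up to `keep` replacement candidates if `suspect` forms an
    implausible collocation with `neighbor`; otherwise return []."""
    def score(word):
        return compatibility(pair_counts.get((word, neighbor), 0),
                             word_counts.get(word, 0),
                             word_counts.get(neighbor, 0))

    if score(suspect) >= threshold:
        return []  # collocation looks plausible, nothing to correct
    plausible = [w for w in similar_words if score(w) >= threshold]
    return sorted(plausible, key=score, reverse=True)[:keep]


# Hypothetical occurrence counts, invented for illustration.
word_counts = {"center": 1_000_000, "hysteric": 10_000, "historic": 400_000}
pair_counts = {("hysteric", "center"): 1, ("historic", "center"): 20_000}

fixes = best_corrections("hysteric", "center", ["historic", "hysterics"],
                         pair_counts, word_counts, threshold=-6.0)
no_fix = best_corrections("historic", "center", ["hysteric"],
                          pair_counts, word_counts, threshold=-6.0)
```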
doi: 10.1007/11551874_16
The Szeged Treebank
Dóra Csendes; János Csirik; Tibor Gyimóthy; András Kocsor
The major aim of the Szeged Treebank project was to create a high-quality database of syntactic structures for Hungarian that can serve as a gold standard for further research in linguistics and computational language processing. The treebank currently contains full syntactic parses of about 82,000 sentences, which are the result of accurate manual annotation. This paper describes the linguistic theory as well as the actual method used in the annotation process. In addition, the application of the treebank to the training of automated syntactic parsers is presented.
- Text | Pp. 123-131
doi: 10.1007/11551874_17
Automatic Lemmatizer Construction with Focus on OOV Words Lemmatization
Jakub Kanis; Luděk Müller
This paper deals with the automatic construction of a lemmatizer from a Full Form – Lemma (FFL) training dictionary and with the lemmatization of new words unseen in the FFL dictionary, i.e. out-of-vocabulary (OOV) words. Three methods of lemmatization of three kinds of OOV words (missing full forms, unknown words, and compound words) are introduced. These methods were tested on Czech test data. The best result (recall: 99.3 % and precision: 75.1 %) has been achieved by a combination of these methods. A lexicon-free lemmatizer based on the method of lemmatization of unknown words (lemmatization patterns method) is also introduced.
- Text | Pp. 132-139
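One simple way to picture lemmatization patterns learned from a full form – lemma dictionary is as suffix-replacement rules applied to unseen words. The sketch below assumes exactly that and nothing more: the rule extraction, the longest-suffix preference, and all function names are hypothetical, and the paper's actual pattern method may differ.

```python
import os
from collections import Counter


def suffix_rule(form, lemma):
    """Strip the longest common prefix; what remains is (old, new) suffix."""
    prefix_len = len(os.path.commonprefix([form, lemma]))
    return form[prefix_len:], lemma[prefix_len:]


def learn_rules(ffl_pairs):
    """Count how often each suffix rule occurs in the training dictionary."""
    return Counter(suffix_rule(form, lemma) for form, lemma in ffl_pairs)


def lemmatize(word, rules):
    """Apply the matching rule with the longest suffix (ties: most frequent).
    Words that match no rule are returned unchanged."""
    candidates = [(len(old), count, old, new)
                  for (old, new), count in rules.items()
                  if old and word.endswith(old)]
    if not candidates:
        return word
    _, _, old, new = max(candidates)
    return word[: len(word) - len(old)] + new


# Tiny Czech training dictionary of (full form, lemma) pairs.
ffl = [("ženy", "žena"), ("ženou", "žena"),
       ("městem", "město"), ("městu", "město")]
rules = learn_rules(ffl)
oov_lemma = lemmatize("školou", rules)  # "školou" is not in the dictionary
```

Preferring the longest matching suffix is a design choice made here for simplicity; it keeps the more specific rule ("ou" to "a") from being overridden by a shorter one ("u" to "o").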
doi: 10.1007/11551874_18
Modeling Syntax of Free Word-Order Languages: Dependency Analysis by Reduction
Markéta Lopatková; Martin Plátek; Vladislav Kuboň
This paper explains the principles of dependency analysis by reduction and its correspondence to the notions of dependency and dependency tree. The explanation is illustrated by examples from Czech, a language with a relatively high degree of word-order freedom. The paper sums up the basic features of methods of dependency syntax. The method serves as a basis for the verification (and explanation) of the adequacy of formal and computational models of those methods.
- Text | Pp. 140-147
doi: 10.1007/11551874_19
Morphological Meanings in the Prague Dependency Treebank 2.0
Magda Razímová; Zdeněk Žabokrtský
In this paper we report our work on the system of grammatemes (mostly semantically-oriented counterparts of morphological categories such as number, degree of comparison, or tense), the concept of which was introduced in Functional Generative Description and is now further elaborated in the context of Prague Dependency Treebank 2.0. We also present a new hierarchical typology of tectogrammatical nodes.
- Text | Pp. 148-155
doi: 10.1007/11551874_20
Automatic Acquisition of a Slovak Lexicon from a Raw Corpus
Benoît Sagot
This paper presents an automatic methodology we used in an experiment to acquire a morphological lexicon for the Slovak language, and the lexicon we obtained. This methodology extends and refines approaches which have proven efficient, e.g., for the acquisition of French verbs or Croatian and Russian nouns, adjectives and verbs. It relies only on a raw corpus and on a morphological description of the language. The underlying idea is to build all possible lemmas that can explain all words found in the corpus, according to the morphological description, and to rank these hypothetical lemmas according to their likelihood given the corpus. Of course, hand-validation and iteration of the whole process are needed to achieve a high-quality lexicon, but the human involvement required is orders of magnitude lower than the cost of the fully manual development of such a resource. Moreover, this technique can be easily applied to other languages with a rich morphology that lack large-coverage lexical resources.
- Text | Pp. 156-163
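The idea of building every lemma that could explain the corpus words and then ranking the hypotheses can be sketched as follows. Everything here is an assumption made for illustration: the two toy paradigms are a drastically simplified, hypothetical morphological description, and the ranking uses the number of distinct corpus forms a hypothesis explains as a crude stand-in for the likelihood mentioned in the abstract.

```python
from collections import defaultdict

# Hypothetical toy description: paradigm name -> endings, with the
# lemma (citation form) ending listed first.
PARADIGMS = {
    "fem_a": ["a", "y", "e", "u", "ou"],
    "neut_o": ["o", "a", "u", "om"],
}


def hypothesize(corpus_words, paradigms):
    """Map each (lemma, paradigm) hypothesis to the corpus forms it explains."""
    explained = defaultdict(set)
    for word in set(corpus_words):
        for name, endings in paradigms.items():
            for ending in endings:
                if word.endswith(ending) and len(word) > len(ending):
                    stem = word[: len(word) - len(ending)]
                    explained[(stem + endings[0], name)].add(word)
    return explained


def rank(explained):
    """Order hypotheses by number of explained forms, most first."""
    return sorted(explained.items(), key=lambda item: (-len(item[1]), item[0]))


corpus = ["žena", "ženy", "ženu", "ženou", "mesto", "mestom"]
ranking = rank(hypothesize(corpus, PARADIGMS))
top_hypothesis, top_forms = ranking[0]
```

Spurious hypotheses (for example a neuter lemma built from "žena") still appear in the ranking with fewer supporting forms, which is where the hand-validation step described in the abstract would come in.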