Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Vienna, Austria, 21-23 September, 2005, Revised Selected Papers


Year of publication: 2006. Available on SpringerLink.

Table of contents

What Happened in CLEF 2005

Carol Peters

The organization of the CLEF 2005 evaluation campaign is described and details are provided concerning the tracks, test collections, evaluation infrastructure and participation.

- What Happened in CLEF 2005 | Pp. 1-10

CLEF 2005: Ad Hoc Track Overview

Giorgio M. Di Nunzio; Nicola Ferro; Gareth J. F. Jones; Carol Peters

We describe the objectives and organization of the CLEF 2005 ad hoc track and discuss the main characteristics of the tasks offered to test monolingual, bilingual, and multilingual textual document retrieval. The performance achieved for each task is presented and a statistical analysis of results is given. The mono- and bilingual tasks followed the pattern of previous years but included target collections for two new-to-CLEF languages: Bulgarian and Hungarian. The multilingual tasks concentrated on exploring the reuse of existing test collections from an earlier CLEF campaign. The objectives were to attempt to measure progress in multilingual information retrieval by comparing the results for CLEF 2005 submissions with those of participants in earlier workshops, and also to encourage participants to explore multilingual list merging techniques.

- Part I. Multilingual Textual Document Retrieval (Ad Hoc) | Pp. 11-36

Ad-Hoc Mono- and Bilingual Retrieval Experiments at the University of Hildesheim

René Hackl; Thomas Mandl; Christa Womser-Hacker

This paper reports information retrieval experiments carried out within the CLEF 2005 ad-hoc multi-lingual track. The experiments focus on the two new languages Bulgarian and Hungarian. No relevance assessments are available for these collections yet. Optimization was mainly based on French data from CLEF 2004. Based on experience from last year, one of our main objectives was to improve and refine the n-gram-based indexing and retrieval algorithms within our system.

- Cross-Language and More | Pp. 37-43

MIRACLE at Ad-Hoc CLEF 2005: Merging and Combining Without Using a Single Approach

José M. Goñi-Menoyo; José C. González-Cristóbal; Julio Villena-Román

This paper presents the MIRACLE team's 2005 approach to the Ad-Hoc Information Retrieval tasks. The goal for the experiments this year was twofold: to continue testing the effect of combination approaches on information retrieval tasks, and to improve our basic processing and indexing tools, adapting them to new languages with strange encoding schemes. The starting point was a set of basic components: stemming, transforming, filtering, proper noun extraction, paragraph extraction, and pseudo-relevance feedback. Some of these basic components were used in different combinations and orders of application for document indexing and for query processing. Second-order combinations were also tested, by averaging or selective combination of the documents retrieved by different approaches for a particular query. In the multilingual track, we concentrated our work on the merging process of the results of monolingual runs to get the overall multilingual result, relying on available translations. In both cross-lingual tracks, we have used available translation resources, and in some cases we have used a combination approach.

- Cross-Language and More | Pp. 44-53

The XLDB Group at the CLEF 2005 Ad-Hoc Task

Nuno Cardoso; Leonardo Andrade; Alberto Simões; Mário J. Silva

This paper presents the participation of the XLDB Group in the CLEF 2005 ad-hoc monolingual and bilingual subtasks for Portuguese. We participated with an improved and extended configuration of the tumba! search engine software. We detail the new features and evaluate their performance.

- Cross-Language and More | Pp. 54-60

Thomson Legal and Regulatory Experiments at CLEF-2005

Isabelle Moulinier; Ken Williams

For the 2005 Cross-Language Evaluation Forum, Thomson Legal and Regulatory participated in the Hungarian, French, and Portuguese monolingual search tasks as well as French-to-Portuguese bilingual retrieval. Our Hungarian participation focused on comparing the effectiveness of different approaches toward morphological stemming. Our French and Portuguese monolingual efforts focused on different approaches to Pseudo-Relevance Feedback (PRF), in particular the evaluation of a scheme for selectively applying PRF only in the cases most likely to produce positive results. Our French-to-Portuguese bilingual effort applies our previous work in query translation to a new pair of languages and uses corpus-based language modeling to support term-by-term translation. We compare our approach to an off-the-shelf machine translation system that translates the query as a whole and find the latter approach to be more performant. All experiments were performed using our proprietary search engine. We remain encouraged by the overall success of our efforts, with our main submissions for each of the four tasks performing above the overall CLEF median. However, none of the specific enhancement techniques we attempted in this year’s forum showed significant improvements over our initial result.

- Cross-Language and More | Pp. 61-68

Using the X-IOTA System in Mono- and Bilingual Experiments at CLEF 2005

Loïc Maisonnasse; Gilles Sérasset; Jean-Pierre Chevallet

This document describes the CLIPS experiments in the CLEF 2005 campaign. We used a surface-syntactic parser in order to extract new indexing terms. These terms are considered syntactic dependencies. Our goal was to evaluate their relevance for an information retrieval task. We used them in different forms in different information retrieval models, in particular in a language model. For the bilingual task, we tried two simple tests of Spanish and German to French retrieval; for the translation we used a lemmatizer and a dictionary.

- Cross-Language and More | Pp. 69-78

Bilingual and Multilingual Experiments with the IR-n System

Elisa Noguera; Fernando Llopis; Rafael Muñoz; Rafael M. Terol; Miguel A. García-Cumbreras; Fernando Martínez-Santiago; Arturo Montejo-Raez

Our paper describes the participation of the IR-n system at CLEF-2005. This year, we participated in the bilingual task (English-French and English-Portuguese) and the multilingual task (English, French, Italian, German, Dutch, Finnish and Swedish). We introduced the method of combined passages for the bilingual task. Furthermore, we have applied the method of logic forms in the same task. For the multilingual task we had a joint participation with the University of Alicante and University of Jaén. We want to emphasize the good score achieved in the bilingual task, an improvement of around 45% in terms of average precision.

- Cross-Language and More | Pp. 79-82

Dictionary-Based Amharic-French Information Retrieval

Atelach Alemu Argaw; Lars Asker; Rickard Cöster; Jussi Karlgren; Magnus Sahlgren

We present four approaches to the Amharic-French bilingual track at CLEF 2005. All experiments use a dictionary-based approach to translate the Amharic queries into French bags-of-words, but while one approach uses word sense discrimination on the translated side of the queries, the other one includes all senses of a translated word in the query for searching. We used two search engines, the SICS experimental engine and Lucene, hence four runs with the two approaches. Non-content-bearing words were removed both before and after the dictionary lookup. TF/IDF values supplemented by a heuristic function were used to remove the stop words from the Amharic queries, and two French stopword lists were used to remove them from the French translations. In our experiments, we found that the SICS search engine performs better than Lucene and that using the word-sense-discriminated keywords produces a slightly better result than the full set of non-discriminated keywords.

- Cross-Language and More | Pp. 83-92

A Hybrid Approach to Query and Document Translation Using a Pivot Language for Cross-Language Information Retrieval

Kazuaki Kishida; Noriko Kando

This paper reports experimental results for cross-language information retrieval (CLIR) from German to French, in which a hybrid approach to query and document translation was attempted, i.e., combining the results of query translation (German to French) and of document translation (French to German). In order to reduce the complexity of computation when translating a large amount of text, we performed pseudo-translation, i.e., a simple replacement of terms by a bilingual dictionary (for query translation, a machine translation system was used). In particular, since English was used as an intermediary language for both translation directions between German and French, English translations at the middle stage were employed as document representations in order to reduce the number of translation steps. By omitting a translation step (English to German), the performance was improved. Unfortunately, our hybrid approach did not show better performance than a simple query translation. This may be due to the low performance of document translation, which was carried out by a simple replacement of terms using a bilingual dictionary with no term disambiguation.

- Cross-Language and More | Pp. 93-101

Conceptual Indexing for Multilingual Information Retrieval

Jacques Guyot; Saïd Radhouani; Gilles Falquet

We present a translation-free technique for multilingual information retrieval. This technique is based on an ontological representation of documents and queries. For each language, we use a dictionary (set of lexical reference for concepts) to map a term to its corresponding concept. The same mapping is applied to each document and each query. Then, we use a classic vector space model based on concept for indexing and querying the document corpus. The main advantages of our approach are: no merging phase is required; no dependency on automatic translators between all pairs of languages; and adding a new language only requires a new mapping dictionary to be added into the multilingual ontology. Experimental results on the CLEF 2005 multi8 collection show that this approach is efficient, even with relatively small and low fidelity dictionaries and without word sense disambiguation.

- Cross-Language and More | Pp. 102-112
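
The concept-mapping idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the `CONCEPTS` dictionaries, the concept identifiers, and the raw-count weighting are all hypothetical choices made only to show how queries and documents in different languages can meet in one concept space without translation or merging.

```python
import math
from collections import Counter

# Hypothetical toy dictionaries (term -> concept identifier), one per language.
CONCEPTS = {
    "en": {"dog": "C_DOG", "house": "C_HOUSE", "car": "C_CAR"},
    "fr": {"chien": "C_DOG", "maison": "C_HOUSE", "voiture": "C_CAR"},
}

def to_concept_vector(text, lang):
    """Map every known term to its concept and count concept occurrences."""
    mapping = CONCEPTS[lang]
    return Counter(mapping[t] for t in text.lower().split() if t in mapping)

def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def rank(query, query_lang, docs):
    """docs is a list of (doc_id, lang, text); returns matching ids, best first."""
    q = to_concept_vector(query, query_lang)
    scored = [
        (cosine(q, to_concept_vector(text, lang)), doc_id)
        for doc_id, lang, text in docs
    ]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]
```

Because every language is indexed into the same concept space, supporting a further language in this sketch only means adding one more entry to `CONCEPTS`.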

SINAI at CLEF 2005: Multi-8 Two-Years-on and Multi-8 Merging-Only Tasks

Fernando Martínez-Santiago; Miguel A. García-Cumbreras; L. A. Ureña-López

This year, we participated in the Multi-8 Two-Years-On and Multi-8 Merging-Only CLEF tasks. Our main interest has been to test several standard CLIR techniques and investigate how they affect the final performance of the multilingual system. Specifically, we have evaluated the information retrieval (IR) model used to obtain each monolingual result, the merging algorithm, the translation approach and the application of query expansion techniques. The results obtained show that improving merging algorithms and translation resources yields better results than improving other CLIR modules such as IR engines or query expansion.

- Cross-Language and More | Pp. 113-120

CLEF 2005: Multilingual Retrieval by Combining Multiple Multilingual Ranked Lists

Luo Si; Jamie Callan

We participated in two tasks: Multi-8 two-years-on retrieval and Multi-8 results merging. For the multi-8 two-years-on retrieval work, algorithms are proposed to combine simple multilingual ranked lists into a more accurate ranked list. Empirical study shows that the approach of combining multilingual retrieval results can substantially improve the accuracies over single multilingual ranked lists. The Multi-8 results merging task is viewed as similar to the results merging task of federated search. Query-specific and language-specific models are proposed to calculate comparable document scores for a small amount of documents and estimate logistic models by using information of these documents. The logistic models are used to estimate comparable scores for all documents and thus the documents can be sorted into a final ranked list. Experimental results demonstrate the advantage of the query-specific and language-specific models against several other alternatives.

- Cross-Language and More | Pp. 121-130

Monolingual, Bilingual, and GIRT Information Retrieval at CLEF-2005

Jacques Savoy; Pierre-Yves Berger

For our fifth participation in the CLEF evaluation campaigns, our first objective was to propose an effective and general stopword list as well as a light stemming procedure for the Hungarian, Bulgarian and Portuguese (Brazilian) languages. Our second objective was to obtain a better picture of the relative merit of various search engines when processing documents in those languages. To do so we evaluated our scheme using two probabilistic models and five vector-processing approaches. In the bilingual track, we evaluated both the machine translation and bilingual dictionary approaches applied to automatically translate a query submitted in English into various target languages. Finally, using the GIRT corpora (available in English, German and Russian), we investigated the variations in retrieval effectiveness that resulted when we included or excluded manually assigned keywords attached to the bibliographic records (mainly comprising a title and an abstract).

- Cross-Language and More | Pp. 131-140

Socio-Political Thesaurus in Concept-Based Information Retrieval

Mikhail Ageev; Boris Dobrov; Natalia Loukachevitch

In the CLEF 2005 experiments we used a bilingual Russian-English Socio-Political Thesaurus that we developed over more than 10 years as a tool for automatic text processing in information retrieval tasks. The same resource and the same algorithms were used for the ad-hoc and domain-specific tasks.

- Cross-Language and More | Pp. 141-150

The Performance of a Machine Translation-Based English-Indonesian CLIR System

Mirna Adriani; Ihsan Wahyu

We describe our participation in the Indonesian-English bilingual task of the 2005 Cross-Language Evaluation Forum (CLEF). We translated an Indonesian query set into English using a commercial machine translation tool and attempted to improve retrieval effectiveness using a query expansion technique. However, since our initial retrieval effectiveness was low, the query expansion technique had a negative impact on performance.

- Cross-Language and More | Pp. 151-154

Exploring New Languages with HAIRCUT at CLEF 2005

Paul McNamee

JHU/APL has long espoused the use of language-neutral methods for cross-language information retrieval. This year we participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. We undertook our first investigations in the Bulgarian and Hungarian languages. In our bilingual experiments we used several non-traditional CLEF query languages such as Greek, Hungarian, and Indonesian, in addition to several western European languages. We found that character n-grams remain an attractive option for representing documents and queries in these new languages. In our monolingual tests n-grams were more effective than unnormalized words for retrieval in Bulgarian (+30%) and Hungarian (+63%). Our bilingual runs made use of statistical translation of character n-grams using aligned corpora when parallel data were available, and web-based machine translation when no suitable data could be found.

- Cross-Language and More | Pp. 155-164

Dublin City University at CLEF 2005: Multi-8 Two-Years-On Merging Experiments

Adenike M. Lam-Adesina; Gareth J. F. Jones

This year Dublin City University participated in the CLEF 2005 Multi-8 Two-Years-On multilingual merging task. The objective of our experiments was to test a range of standard techniques for merging ranked lists of retrieved documents to see if consistent trends emerge for lists generated using different information retrieval systems. Our results show that the success of merging techniques can depend on the retrieval system used, and in consequence the best merging techniques to adopt cannot be recommended without knowing the retrieval system to be used.

- Cross-Language and More | Pp. 165-169

Applying Light Natural Language Processing to Ad-Hoc Cross Language Information Retrieval

Christina Lioma; Craig Macdonald; Ben He; Vassilis Plachouras; Iadh Ounis

In the CLEF 2005 Ad-Hoc Track we addressed the problem of retrieving information in morphologically rich languages, by experimenting with language-specific morphosyntactic processing and light Natural Language Processing (NLP). The diversity of the languages processed, namely Bulgarian, French, Italian, English, and Greek, allowed us to measure the effect of system-specific features upon the retrieval of these languages, and to juxtapose that effect to the role of language resources in Cross Language Information Retrieval (CLIR) in general.

- Cross-Language and More | Pp. 170-178

Four Stemmers and a Funeral: Stemming in Hungarian at CLEF 2005

Anna Tordai; Maarten de Rijke

We developed algorithmic stemmers for Hungarian and used them for the ad-hoc monolingual task for CLEF 2005. Our goal was to determine what degree of stemming is the most effective. Although on average the stemmers did not perform as well as the best n-gram, we found that stemming over a broad range of suffixes, especially on nouns, is highly useful.

- Monolingual Experiments | Pp. 179-186

ENSM-SE at CLEF 2005: Using a Fuzzy Proximity Matching Function

Annabelle Mercier; Amélie Imafouo; Michel Beigbeder

Starting from the idea that the closer the query terms in a document are to each other the more relevant the document, we propose an information retrieval method that uses the degree of fuzzy proximity of key terms in a document to compute the relevance of the document to the query. Our model handles Boolean queries but, contrary to the traditional extensions of the basic Boolean information retrieval model, does not use a proximity operator explicitly. A single parameter makes it possible to control the proximity degree required. We explain how we construct the queries and report the results of our experiments in the ad-hoc monolingual French task of the CLEF 2005 evaluation campaign.

- Monolingual Experiments | Pp. 187-193
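
The fuzzy-proximity idea in the abstract above can be sketched as follows. This is one plausible reading rather than the authors' exact model: the triangular influence function, the use of `min` for a conjunction of terms, and the names `term_proximity` and `score_and` are assumptions. The parameter `k` plays the role of the single parameter controlling how close the terms must be.

```python
def term_proximity(positions, doc_len, k):
    """Fuzzy proximity to a term at every document position.

    The value is 1.0 where the term occurs and decreases linearly to 0.0
    at distance k from the nearest occurrence.
    """
    return [
        max((max(0.0, (k - abs(x - p)) / k) for p in positions), default=0.0)
        for x in range(doc_len)
    ]

def score_and(doc_tokens, query_terms, k):
    """Score a document for a conjunctive (AND) query.

    At each position the conjunction takes the minimum of the per-term
    proximities; the document score is the sum over all positions.
    """
    n = len(doc_tokens)
    per_term = []
    for term in query_terms:
        positions = [i for i, tok in enumerate(doc_tokens) if tok == term]
        per_term.append(term_proximity(positions, n, k))
    return sum(min(values) for values in zip(*per_term))
```

With `k = 3`, a document in which the two query terms are adjacent scores higher than one in which they are five tokens apart, and a document missing either term scores zero, as in a strict Boolean AND.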

Bulgarian and Hungarian Experiments with Hummingbird SearchServer at CLEF 2005

Stephen Tomlinson

Hummingbird participated in the Bulgarian and Hungarian monolingual information retrieval tasks of the Ad-Hoc Track of the Cross-Language Evaluation Forum (CLEF) 2005. In the ad hoc retrieval tasks, the system was given 50 natural language queries, and the goal was to find all of the relevant documents (with high precision) in a particular document set. We conducted diagnostic experiments with different techniques for matching word variations and handling stopwords. We found that the experimental stemmers significantly increased mean average precision for both languages. Analysis of individual topics found that the algorithmic Bulgarian and Hungarian stemmers encountered some unanticipated stopword collisions. A comparison to an experimental 4-gram technique suggested that Hungarian stemming would further benefit from decompounding.

- Monolingual Experiments | Pp. 194-203

Combining Passages in the Monolingual Task with the IR-n System

Fernando Llopis; Elisa Noguera

The paper describes our participation in the monolingual tasks at CLEF 2005. We submitted results for the following languages: French, Portuguese, Bulgarian and Hungarian, using a passage retrieval system. We focused on a version of this system that combines passages of different size to improve retrieval performance. After an analysis of our experiments and of the official results at CLEF, we find that our passage retrieval combination model achieves considerably improved scores.

- Monolingual Experiments | Pp. 204-207

Weighting Query Terms Based on Distributional Statistics

Jussi Karlgren; Magnus Sahlgren; Rickard Cöster

This year, the SICS team has concentrated on query processing and on the internal topical structure of the query, specifically compound translation. Compound translation is non-trivial due to dependencies between compound elements. This year, we have investigated topical dependencies between query terms: if a query term happens to be non-topical or noise, it should be discarded or given a low weight when ranking retrieved documents; if a query term shows high topicality its weight should be boosted. The two experiments described here are based on the analysis of the distributional character of query terms: one using similarity of occurrence context between query terms globally across the entire collection; the other using the likelihood of individual terms to appear topically in individual texts. Both – complementary – boosting schemes tested delivered improved results.

- Monolingual Experiments | Pp. 208-211

Domain-Specific Track CLEF 2005: Overview of Results and Approaches, Remarks on the Assessment Analysis

Michael Kluck; Maximilian Stempfhuber

The challenge of the CLEF domain-specific track is to map user queries in one language to documents in different languages adapting the systems used to the vocabulary and wording of the social science domain. In addition to a general overview of this track and its tasks, some details on the approaches of the participating groups and their results are reported. One of the outcomes is the considerable improvement in results if the retrieval systems make use of the thesauri provided or the intellectually assigned descriptors. Other findings for IR in a domain-specific context are also given. Finally, considerations on the topic creation and assessment processes are made on the basis of empirical data mainly from the GIRT corpus.

- Part II. Domain-Specific Information Retrieval (Domain-Specific) | Pp. 212-221

A Baseline for NLP in Domain-Specific IR

Johannes Leveling

The information retrieval (IR) methods employed for the third participation of the University of Hagen in the domain-specific task of the Cross Language Evaluation Campaign (CLEF 2005) provide a better baseline for experiments with natural language processing (NLP) methods in domain-specific IR than the methods employed in our previous participations. The baseline consists of a combination of state-of-the-art IR methods with NLP methods for document and query processing.

Our monolingual experiments with German documents combine several methods to achieve better performance, including an entry vocabulary module (EVM), query expansion with semantically related concepts, and a blind feedback technique. The monolingual experiments focus on comparing two techniques for constructing database queries: creating a and creating a semantic network by means of deep linguistic analysis of the query.

For the bilingual experiments, the English topics are translated into German queries with several machine translation (MT) services publicly available. Each set of translated topics is processed separately with the same techniques as in the monolingual experiments. Evaluation results for official experiments with a staged logistic regression and additional experiments with BM25 are presented.

- Part II. Domain-Specific Information Retrieval (Domain-Specific) | Pp. 222-225

Domain-Specific CLIR of English, German and Russian Using Fusion and Subject Metadata for Query Expansion

Vivien Petras; Fredric Gey; Ray R. Larson

This paper describes the combined submissions of the Berkeley group for the domain-specific track at CLEF 2005. The data fusion technique being tested is the fusion of multiple probabilistic searches against different XML components using both Logistic Regression (LR) algorithms and a version of the Okapi BM-25 algorithm. We also combine multiple translations of queries in cross-language searching. The second technique analyzed is query enhancement with domain-specific metadata (thesaurus terms). We describe our technique of Entry Vocabulary Modules, which associates query words with thesaurus terms and suggest its use for monolingual as well as bilingual retrieval. Different weighting and merging schemes for adding keywords to queries as well as translation techniques are described.

- Part II. Domain-Specific Information Retrieval (Domain-Specific) | Pp. 226-237

Evaluating a Conceptual Indexing Method by Utilizing WordNet

Mustapha Baziz; Mohand Boughanem; Nathalie Aussenac-Gilles

This paper describes our participation in the English GIRT task of the CLEF 2005 campaign. A method for conceptual indexing based on WordNet is used. Both documents and queries are mapped onto WordNet. Identified concepts belonging to WordNet synsets are extracted from documents and queries, and those having a single sense are expanded. All runs are carried out using a conceptual indexing approach. Results show that queries built from the title field of the topics perform best and that stemming gives a slight gain over the non-stemming runs.

- Part II. Domain-Specific Information Retrieval (Domain-Specific) | Pp. 238-246

Domain Specific Mono- and Bilingual English to German Retrieval Experiments with a Social Science Document Corpus

René Hackl; Thomas Mandl

This paper reports experiments in CLEF 2005’s domain-specific retrieval track carried out at the University of Hildesheim. The experiments were based on previous experiences with the GIRT document corpus and were run in parallel to the multi-lingual experiments for CLEF 2005. We optimized the parameters of the system with one corpus from 2004 and applied these settings to the domain-specific task. In that manner, the robustness of our approach over different document collections was assessed.

- Part II. Domain-Specific Information Retrieval (Domain-Specific) | Pp. 247-250

Overview of the CLEF 2005 Interactive Track

Julio Gonzalo; Paul Clough; Alessandro Vallin

The CLEF Interactive Track (iCLEF) is devoted to the comparative study of user-inclusive cross-language search strategies. In 2005, we studied two cross-language search tasks: retrieval of answers and retrieval of annotated images. In both tasks, no further translation or post-processing is needed after performing the tasks to fulfill the information need.

In the interactive Question Answering task, users are asked to find the answer to a number of questions in a foreign-language document collection, and write the answers in their own native language. In the interactive image retrieval task, a picture is shown to the user, and then the user is asked to find the picture in the collection.

This paper summarizes the task design, experimental methodology, and the results obtained by the research groups participating in the track.

- Part III. Interactive Cross-Language Information Retrieval (iCLEF) | Pp. 251-262

Use of Free On-Line Machine Translation for Interactive Cross-Language Question Answering

Angel Zazo; Carlos G. Figuerola; José Luis A. Berrocal; Viviana Fernández Marcial

Free on-line machine translation systems are employed more and more by Internet users. In this paper we have explored the use of these systems for Cross-Language Question Answering, in two aspects: in the formulation of queries and in the presentation of information. Two topic-document language pairs were used, Spanish-English and Spanish-French. For each of these, two groups of users were created, depending on the level of reading skills in document language. When machine translation of the queries was used directly in the search, the number of correct answers was quite high. Users only corrected 8% of the translations proposed. As regards the possibility of using machine translation to translate into Spanish the text passages shown to the user, we expected the search of the users with little knowledge of the target language to improve notably, but we found that this possibility was of little help in finding the correct answers for the questions posed in the experiment.

- Part III. Interactive Cross-Language Information Retrieval (iCLEF) | Pp. 263-272

“How Much Context Do You Need?”: An Experiment About the Context Size in Interactive Cross-Language Question Answering

Borja Navarro; Lorenza Moreno-Monteagudo; Elisa Noguera; Sonia Vázquez; Fernando Llopis; Andrés Montoyo

The main topic of this paper is the context size needed for an efficient Interactive Cross-language Question Answering system. We compare two approaches: the first one (baseline system) shows the user whole passages (maximum context: 10 sentences). The second one (experimental system) shows only a clause (minimum context). Since this is a cross-language system, the main problem is that the language of the question (Spanish) and the language of the answer context (English) are different. The results show that a large context is better. However, there are specific relations between the context size and the knowledge about the language of the answer: users with a poor level of English prefer contexts with few words.

- Part III. Interactive Cross-Language Information Retrieval (iCLEF) | Pp. 273-282

UNED at iCLEF 2005: Automatic Highlighting of Potential Answers

Víctor Peinado; Fernando López-Ostenero; Julio Gonzalo; Felisa Verdejo

In this paper, we describe UNED’s participation in the iCLEF 2005 track. We have compared two strategies for finding an answer using an interactive question answering system: i) a search system over full documents and ii) a search system over passages (document paragraphs). We have added a feature to both systems in order to facilitate reading: the possibility to enable or disable the highlighting of named entities such as proper nouns, temporal references and numbers likely to contain the right answer.

Our Document Searcher obtained better overall accuracy (0.53 vs. 0.45) but our subjects found browsing passages simpler and faster. However, most of them showed similar search behavior (regarding time consumption, confidence in their answers and query refinements) with both systems. All our users found the highlighting of named entities helpful, and they all made extensive use of it as a quick way of discriminating between relevant and non-relevant documents and finding a valid answer.

- Part III. Interactive Cross-Language Information Retrieval (iCLEF) | Pp. 283-292

Effect of Connective Functions in Interactive Image Retrieval

Julio Villena-Román; Raquel M. Crespo-García; José Carlos González Cristóbal

This paper presents the participation of the MIRACLE team at the ImageCLEF 2005 interactive search task, in which we compare the efficiency of AND monolingual queries (which have to be precise and use the exact vocabulary, which may be difficult in a specialised search task) versus relevance-guided OR bilingual queries (a fuzzier and noisier search, but one which doesn’t require precise vocabulary and exact translations). User preferences and strategies in the context of cross-lingual interactive image retrieval are also analysed.

- Part III. Interactive Cross-Language Information Retrieval (iCLEF) | Pp. 293-296

Using Concept Hierarchies in Text-Based Image Retrieval: A User Evaluation

Daniela Petrelli; Paul Clough

This paper describes our results from the image retrieval task of iCLEF 2005 based on a comparative user evaluation of two interfaces: one displaying search results as a list; the other organising retrieved images into a hierarchy of concepts displayed on the interface as an interactive menu. Based on a known-item retrieval task, data was analysed with respect to effectiveness, efficiency and user satisfaction. Effectiveness and efficiency were calculated at both the set cut-off time of 5 minutes, and the time after finding the target image (final time). Results showed the list was marginally more effective than the menu at 5 minutes, but the two were equal at final time indicating the menu requires more time to be used effectively. The list was more efficient at both 5 minutes and final time (difference not statistically significant) and users preferred using the menu indicating this could be a potentially interesting and engaging feature for image retrieval.

- Part III. Interactive Cross-Language Information Retrieval (iCLEF) | Pp. 297-306

Overview of the CLEF 2005 Multilingual Question Answering Track

Alessandro Vallin; Bernardo Magnini; Danilo Giampiccolo; Lili Aunimo; Christelle Ayache; Petya Osenova; Anselmo Peñas; Maarten de Rijke; Bogdan Sacaleanu; Diana Santos; Richard Sutcliffe

The general aim of the third CLEF Multilingual Question Answering Track was to set up a common and replicable evaluation framework to test both monolingual and cross-language Question Answering (QA) systems that process queries and documents in several European languages. Nine target languages and ten source languages were exploited to enact 8 monolingual and 73 cross-language tasks. Twenty-four groups participated in the exercise. Overall results showed a general increase in performance in comparison to last year. The best performing monolingual system irrespective of target language answered 64.5% of the questions correctly (in the monolingual Portuguese task), while the average of the best performances for each target language was 42.6%. The cross-language step instead entailed a considerable drop in performance. In addition to accuracy, the organisers also measured the relation between the correctness of an answer and a system’s stated confidence in it, showing that the best systems did not always provide the most reliable confidence score. We provide an overview of the 2005 QA track, detail the procedure followed to build the test sets and present a general analysis of the results.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 307-331

A Fast Forward Approach to Cross-Lingual Question Answering for English and German

Robert Strötgen; Thomas Mandl; René Schneider

This paper describes the development of a question answering system for mono-lingual and cross-lingual tasks for English and German. We developed the question answering system from a document and retrieval perspective. The system consists of question and answering taxonomies, named entity recognition, term expansion modules, a multi-lingual search engine based on Lucene and a passage extraction and ranking component. The overall architecture and heuristics applied during development are described. We discuss the results at CLEF 2005 and show potential future work.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 332-336

The ŒDipe System at CLEF-QA 2005

Romaric Besançon; Mehdi Embarek; Olivier Ferret

This article presents Œdipe, the question answering system that was used by the LIC2M for its participation in the CLEF-QA 2005 evaluation. The LIC2M participates more precisely in the monolingual track dedicated to the French language. The main characteristic of Œdipe is its simplicity: it mainly relies on the association of a linguistic pre-processor that normalizes words and recognizes named entities and the principles of the Vector Space model.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 337-346

An XML-Based System for Spanish Question Answering

David Tomás; José L. Vicedo; Maximiliano Saiz; Rubén Izquierdo

As Question Answering is a major research topic at the University of Alicante, this year two separate groups participated in the QA@CLEF track using different approaches. This paper describes the work of one of these groups. Thinking of future developments, we have designed a modular framework based on XML that will easily let us integrate, combine and test system components based on different approaches. In this context, several modifications have been introduced, such as a new machine learning based question classification module. We took part in the monolingual Spanish task.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 347-350

A Logic Programming Based Approach to QA@CLEF05 Track

Paulo Quaresma; Irene Rodrigues

In this paper the methodology followed to build a question-answering system for the Portuguese language is described. The system modules are built using computational linguistic tools such as: a Portuguese parser based on constraint grammars for the syntactic analysis of the documents' sentences and the user questions; a semantic interpreter that rewrites the syntactic analysis of sentences into discourse representation structures in order to obtain a semantic representation of the corpus documents and user questions; and finally, a semantic/pragmatic interpreter that builds a knowledge base with facts extracted from the documents using ontologies (general and domain specific) and logic inference. This article includes the system evaluation under the CLEF’05 question answering track.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 351-360

Extending Knowledge and Deepening Linguistic Processing for the Question Answering System InSicht

Sven Hartrumpf

The German question answering (QA) system InSicht participated in QA@CLEF for the second time. It relies on complete sentence parsing, inferences, and semantic representation matching. This year, the system was improved in two main directions. First, the background knowledge was extended by large semantic networks and large rule sets. Second, linguistic processing was deepened by treating a phenomenon that appears prominently on the level of text semantics: coreference resolution. A new source of lexico-semantic relations and equivalence rules has been established based on compound analyses from document parses. These analyses were used in three ways: to project lexico-semantic relations from compound parts to compounds, to establish a subordination hierarchy for compounds, and to derive equivalence rules between nominal compounds and their analytic counterparts. The lack of coreference resolution in InSicht was one major source of missing answers in QA@CLEF 2004. Therefore the coreference resolution module CORUDIS was integrated into the parsing during document processing. The central step in the QA system InSicht, matching semantic networks derived from the question parse (one by one) with document sentence networks, was generalized. Now, a question network can be split at certain semantic relations (e.g. relations for local or temporal specifications). To evaluate the different extensions, the QA system was run on all 400 German questions from QA@CLEF 2004 and 2005 with varying setups. Some extensions showed positive effects, but currently they are minor and not statistically significant. The paper ends with a discussion of why the improvements are not yet larger.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 361-369

Question Answering for Dutch Using Dependency Relations

Gosse Bouma; Jori Mur; Gertjan van Noord; Lonneke van der Plas; Jörg Tiedemann

Joost is a question answering system for Dutch which makes extensive use of dependency relations. It answers questions either by table look-up, or by searching for answers in paragraphs returned by IR. Syntactic similarity is used to identify and rank potential answers. Tables were constructed by mining the CLEF corpus, which has been syntactically analyzed in full.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 370-379

Term Translation Validation by Retrieving Bi-terms

Brigitte Grau; Anne-Laure Ligozat; Isabelle Robba; Anne Vilnat

For our second participation in the Question Answering task of CLEF, we kept last year’s system named MUSCLEF, which uses two different translation strategies implemented in two modules. The multilingual module MUSQAT analyzes the French questions, translates “interesting parts”, and then uses these translated terms to search the reference collection. The second strategy consists of translating the question into English and applying QALC, our existing English module. Our purpose in this paper is to analyze term translations and propose a mechanism for selecting correct ones. The manual evaluation of bi-term translations leads us to the conclusion that the bi-term translations found in the corpus can confirm the mono-term translations.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 380-389

Exploiting Linguistic Indices and Syntactic Structures for Multilingual Question Answering: ITC-irst at CLEF 2005

Hristo Tanev; Milen Kouylekov; Bernardo Magnini; Matteo Negri; Kiril Simov

We participated in four Question Answering tasks at CLEF 2005: the Italian monolingual, Italian-English, Bulgarian monolingual, and Bulgarian-English bilingual tasks. While we did not change the approach in the Italian task, we experimented with several new approaches based on linguistic structures and statistics in the other three tasks.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 390-399

The TALP-QA System for Spanish at CLEF 2005

Daniel Ferrés; Samir Kanaan; Alicia Ageno; Edgar González; Horacio Rodríguez; Jordi Turmo

This paper describes the TALP-QA system in the context of the CLEF 2005 Spanish Monolingual Question Answering (QA) evaluation task. TALP-QA is a multilingual open-domain QA system that processes both factoid (normal and temporally restricted) and definition questions. The approach to factoid questions is based on in-depth NLP tools and resources to create semantic information representation. Answers to definition questions are selected from the phrases that match a pattern from a manually constructed set of definitional patterns.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 400-409

Priberam’s Question Answering System for Portuguese

Carlos Amaral; Helena Figueira; André Martins; Afonso Mendes; Pedro Mendes; Cláudia Pinto

This paper describes the work done by Priberam in the development of a question answering (QA) system for Portuguese. The system was built using the company’s natural language processing (NLP) workbench and information retrieval technology. Special focus is given to question analysis, document and sentence retrieval, as well as answer extraction stages. The paper discusses the system’s performance in the context of the QA@CLEF 2005 evaluation.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 410-419

A Full Data-Driven System for Multiple Language Question Answering

Manuel Montes-y-Gómez; Luis Villaseñor-Pineda; Manuel Pérez-Coutiño; José Manuel Gómez-Soriano; Emilio Sanchís-Arnal; Paolo Rosso

This paper describes a full data-driven system for question answering. The system uses pattern matching and statistical techniques to identify the relevant passages as well as the candidate answers for factoid and definition questions. Since it does not consider any sophisticated linguistic analysis of questions and answers, it can be applied to different languages without requiring major adaptation changes. Experimental results on Spanish, Italian and French demonstrate that the proposed approach can be a convenient strategy for monolingual and multilingual question answering.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 420-428

Experiments on Cross–Linguality and Question–Type Driven Strategy Selection for Open–Domain QA

Günter Neumann; Bogdan Sacaleanu

We describe the extensions made to our 2004 QA@CLEF German/English QA-system, toward a fully German-English/English-German cross-language system with answer validation through web usage. Details concerning the processing of factoid, definition and temporal questions are given and the results obtained in the monolingual German, bilingual English-German and German-English tasks are briefly presented and discussed.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 429-438

QUASAR: The Question Answering System of the Universidad Politécnica de Valencia

José Manuel Gómez-Soriano; Davide Buscaldi; Empar Bisbal Asensi; Paolo Rosso; Emilio Sanchis Arnal

This paper describes the QUASAR Question Answering Information System developed by the RFIA group at the Departamento de Sistemas Informáticos y Computación of the Universidad Politécnica of Valencia for the 2005 edition of the CLEF Question Answering exercise. We participated in three monolingual tasks: Spanish, Italian and French, and in two cross-language tasks: Spanish to English and English to Spanish. Since this was our first participation, we focused our work on the passage-based search engine while using simple pattern matching rules for the Answer Extraction phase. As regards the cross-language tasks, we had to resort to the most common web translation tools.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 439-448

Towards an Offline XML-Based Strategy for Answering Questions

David Ahn; Valentin Jijkoun; Karin Müller; Maarten de Rijke; Erik Tjong Kim Sang

The University of Amsterdam participated in the Question Answering (QA) Track of CLEF 2005 with two runs. In comparison with previous years, our focus this year was adding to our multi-stream architecture a new stream that uses offline XML annotation of the corpus. We describe the new work on our QA system, present the results of our official runs, and note areas for improvement based on an error analysis.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 449-456

AliQAn, Spanish QA System at CLEF-2005

S. Roger; S. Ferrández; A. Ferrández; J. Peral; F. Llopis; A. Aguilar; D. Tomás

Question Answering is a major research topic at the University of Alicante. For this reason, this year two groups participated in the QA@CLEF track using different approaches. In this paper we describe the work of one of these groups: AliQAn, a monolingual open-domain Question Answering (QA) system developed in the Department of Language Processing and Information Systems at the University of Alicante for the CLEF-2005 Spanish monolingual QA evaluation task. Our approach is based fundamentally on the use of syntactic pattern recognition in order to identify possible answers. Besides this, Word Sense Disambiguation (WSD) is applied to improve the system. The results achieved (overall accuracy of 33%) are shown and discussed in the paper.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 457-466

20th Century Esfinge (Sphinx) Solving the Riddles at CLEF 2005

Luís Costa

Esfinge is a general domain Portuguese question answering system. It tries to take advantage of the steadily growing and constantly updated information freely available on the World Wide Web in its question answering tasks. The system participated last year for the first time in the monolingual QA track. However, the results were compromised by several basic errors, which were corrected shortly after. This year, Esfinge's participation was expected to yield better results and allow experimentation with a Named Entity Recognition System, as well as a first attempt at a multilingual QA track. This paper describes how the system works, presents the results obtained by the official runs in considerable detail, and reports experiments measuring the importance of different parts of the system through the decrease in performance when the system is executed without some of its components/features.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 467-476

Question Answering Experiments for Finnish and French

Lili Aunimo; Reeta Kuuskoski

This paper presents a question answering (QA) system whose approach to QA is based on question classification, semantic annotation and answer extraction pattern matching. Its performance is evaluated by conducting experiments in the following tasks: monolingual Finnish and French and bilingual Finnish-English QA. It is the first system ever reported to perform monolingual textual QA in the Finnish language. This is also the task in which its performance is best: 23% of all questions are answered correctly. Its performance in the monolingual French task is a little inferior to its performance in the monolingual Finnish task, and when compared to the other systems evaluated with the same data in the same task, its performance is near the average. In the bilingual Finnish-English task, it was the only participating system, and, as expected, its performance was inferior to that attained in the monolingual tasks.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 477-487

MIRACLE’s Cross-Lingual Question Answering Experiments with Spanish as a Target Language

César de Pablo-Sánchez; Ana González-Ledesma; José Luis Martínez-Fernández; José María Guirao; Paloma Martínez; Antonio Moreno

Our second participation in CLEF-QA consisted of six runs with Spanish as a target language. The source languages were Spanish, English and Italian. miraQA uses a simple representation of the question that is enriched with semantic information like typed Named Entities. Runs used different strategies for answer extraction and selection, achieving at best 25.5% accuracy. The analysis of the errors suggests that improvements in answer selection are the most critical.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 488-491

The Role of Lexical Features in Question Answering for Spanish

Manuel Pérez-Coutiño; Manuel Montes-y-Gómez; Aurelio López-López; Luis Villaseñor-Pineda

This paper describes the prototype developed in the Language Technologies Laboratory at INAOE for the Spanish monolingual QA evaluation task at CLEF 2005. The proposed approach handles the QA task according to the type of question to solve (factoid or definition). To identify possible answers to factoid questions, the system applies a methodology centered on the use of lexical features. To identify answers to definition questions, on the other hand, the system relies on a pattern recognition method. The paper shows the methods applied at different stages of the system, with special emphasis on those used for answering factoid questions. Then the results achieved with this approach are discussed.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 492-501

Cross-Language French-English Question Answering Using the DLT System at CLEF 2005

Richard F. E. Sutcliffe; Michael Mulcahy; Igal Gabbay; Aoife O’Gorman; Darina Slattery

This paper describes the main components of the system built by the DLT Group at Limerick for participation in the QA Task at CLEF. The document indexing we used was again sentence-by-sentence but this year the Lucene Engine was adopted. We also experimented with retrieval query expansion using Local Context Analysis. Results were broadly similar to last year.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 502-509

Finding Answers to Indonesian Questions from English Documents

Mirna Adriani; Rinawati

We present a report on our participation in the Indonesian-English question-answering task of the 2005 Cross-Language Evaluation Forum (CLEF). In this work we translated an Indonesian query set into English using a commercial machine translation tool. We used linguistic tools to find the answer to a question. The answer is extracted from a relevant passage and is identified as having the same relevant tagging as the query.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 510-516

BulQA: Bulgarian–Bulgarian Question Answering at CLEF 2005

Kiril Simov; Petya Osenova

This paper describes the architecture of a Bulgarian–Bulgarian question answering system, BulQA. The system relies on a partially parsed corpus for answer extraction. The questions are also analyzed partially. Then on the basis of the analysis some queries to the corpus are created. After the retrieval of the documents that potentially contain the answer, each of them is further processed with one of several additional grammars. The grammar depends on the question analysis and the type of the question. At present these grammars can be viewed as patterns for the type of questions, but our goal is to develop them further into a deeper parsing system for Bulgarian.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 517-526

The Query Answering System PRODICOS

Laura Monceaux; Christine Jacquin; Emmanuel Desmontils

In this paper, we present the PRODICOS query answering system which was developed by the TALN team from the LINA institute. We present the various modules constituting our system and, for each of them, an evaluation that is used to justify the results obtained. Then, we present the main improvement based on the use of semantic data.

- Part IV. Multiple Language Question Answering (QA@CLEF) | Pp. 527-534

The CLEF 2005 Cross–Language Image Retrieval Track

Paul Clough; Henning Müller; Thomas Deselaers; Michael Grubinger; Thomas M. Lehmann; Jeffery Jensen; William Hersh

This paper outlines efforts from the 2005 CLEF cross–language image retrieval campaign (ImageCLEF). The aim of this CLEF track is to explore the use of both text and content–based retrieval methods for cross–language image retrieval. Four tasks were offered in ImageCLEF: ad–hoc retrieval from an historic photographic collection, ad–hoc retrieval from a medical collection, an automatic image annotation task, and a user–centered (interactive) evaluation task. 24 research groups from a variety of backgrounds and nationalities (14 countries) participated in ImageCLEF. This paper presents the ImageCLEF tasks, submissions from participating groups and a summary of the main findings.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 535-557

Linguistic Estimation of Topic Difficulty in Cross-Language Image Retrieval

Michael Grubinger; Clement Leung; Paul Clough

Selecting suitable topics in order to assess system effectiveness is a crucial part of any benchmark, particularly those for retrieval systems. This includes establishing a range of example search requests (or topics) in order to test various aspects of the retrieval systems under evaluation. In order to assist with selecting topics, we present a measure of topic difficulty for cross-language image retrieval. This measure has enabled us to ground the topic generation process within a methodical and reliable framework for ImageCLEF 2005. This document describes such a measure for topic difficulty, providing concrete examples for every aspect of topic complexity and an analysis of topics used in the ImageCLEF 2003, 2004 and 2005 ad-hoc task.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 558-566

Dublin City University at CLEF 2005: Experiments with the ImageCLEF St Andrew’s Collection

Gareth J. F. Jones; Kieran McDonald

The aim of the Dublin City University’s participation in the CLEF 2005 ImageCLEF St Andrew’s Collection task was to explore an alternative approach to exploiting text annotation and content-based retrieval in a novel combined way for pseudo relevance feedback (PRF). This method combines evidence from retrieved lists generated using text-based and content-based retrieval to determine which documents will be assumed relevant for the PRF process. Unfortunately the experimental results show that while standard text-based PRF improves upon a no feedback text-only baseline, at present our new approach to combining evidence from text-based and content-based retrieval does not give further improvement.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 567-573

A Probabilistic, Text and Knowledge-Based Image Retrieval System

Rubén Izquierdo-Beviá; David Tomás; Maximiliano Saiz-Noeda; José Luis Vicedo

This paper describes the development of an image retrieval system that combines probabilistic and ontological information. The process is divided into two different stages: indexing and retrieval. Three information flows have been created, each carrying a different kind of information: word forms, stems and stemmed bigrams. The final result combines the results obtained in the three streams. Knowledge is added to the system by means of an ontology created automatically from the St. Andrews Corpus. The system has been evaluated in the CLEF05 image retrieval task.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 574-577

UNED at ImageCLEF 2005: Automatically Structured Queries with Named Entities over Metadata

Víctor Peinado; Fernando López-Ostenero; Julio Gonzalo; Felisa Verdejo

In this paper, we present our participation in the ImageCLEF 2005 ad-hoc task. After a pool of preliminary tests in which we evaluated the impact of different-size dictionaries using three distinct approaches, we proved that the biggest differences were obtained by recognizing named entities and launching structured queries over the metadata. Thus, we decided to refine our named entities recognizer and repeat the three approaches with the 2005 topics, achieving the best result among all cross-language European Spanish to English runs.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 578-581

Easing Erroneous Translations in Cross-Language Image Retrieval Using Word Associations

Masashi Inoue

When short queries and short image annotations are used in text-based cross-language image retrieval, small changes in word usage due to translation errors may decrease the retrieval performance because of an increase in lexical mismatches. In the ImageCLEF2005 ad-hoc task, we investigated the use of learned word association models that represent how pairs of words are related to absorb such mismatches. We compared a precision-oriented simple word-matching retrieval model and a recall-oriented word association retrieval model. We also investigated combinations of these by introducing a new ranking function that generated comparable output values from both models. Experimental results on English and German topics were discouraging, as the use of word association models degraded the performance. On the other hand, word association models helped retrieval for Japanese topics whose translation quality was low.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 582-591

A Corpus-Based Relevance Feedback Approach to Cross-Language Image Retrieval

Yih-Chen Chang; Wen-Cheng Lin; Hsin-Hsi Chen

This paper regards images with captions as a cross-media parallel corpus, and presents a corpus-based relevance feedback approach to combine the results of visual and textual runs. Experimental results show that this approach performs well. Compared with the initial visual retrieval, the mean average precision (MAP) is increased from 8.29% to 34.25% after relevance feedback from the cross-media parallel corpus. The MAP of cross-lingual image retrieval is increased from 23.99% to 39.77% when the results of the textual run and the visual run are combined with relevance feedback. The monolingual experiments show consistent effects of this approach as well. The MAP of monolingual retrieval is improved from 39.52% to 50.53% when merging the results of the text and image queries.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 592-601
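
The mean average precision (MAP) figures quoted in the abstract above follow the standard definition of the measure. As a minimal sketch of that definition only (not the authors' evaluation code; the function names and the toy document ids are hypothetical), MAP could be computed like this:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list (assumed free of duplicates)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical toy data: two queries with known relevant documents.
    toy_runs = [
        (["a", "b", "c", "d"], {"a", "c"}),
        (["x", "y"], {"y"}),
    ]
    print(round(mean_average_precision(toy_runs), 4))
```

Dividing by the total number of relevant items, rather than by the number retrieved, is what penalises a run for relevant images it never returns.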

CUHK at ImageCLEF 2005: Cross-Language and Cross-Media Image Retrieval

Steven C. H. Hoi; Jianke Zhu; Michael R. Lyu

In this paper, we describe our studies of cross-language and cross-media image retrieval at ImageCLEF 2005. This is the first participation of our CUHK (The Chinese University of Hong Kong) group at ImageCLEF. The task in which we participated is the “bilingual ad hoc retrieval” task. There are three major focuses and contributions in our participation. The first is the empirical evaluation of language models and smoothing strategies for cross-language image retrieval. The second is the evaluation of cross-media image retrieval, i.e., combining text and visual contents for image retrieval. The last is the evaluation of bilingual image retrieval between English and Chinese. We provide an empirical analysis of our experimental results, in which our approach achieves the best mean average precision result in the monolingual query task in the campaign. Finally we summarize our empirical experience and discuss future improvements to our work.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 602-611

The University of Jaén at ImageCLEF 2005: Adhoc and Medical Tasks

M. T. Martín-Valdivia; M. A. García-Cumbreras; M. C. Díaz-Galiano; L. A. Ureña-López; A. Montejo-Raez

In this paper, we describe our first participation in the ImageCLEF campaign. The SINAI research group participated in both the ad hoc task and the medical task. For the first task, we have used several translation schemes as well as experiments with and without Pseudo Relevance Feedback (PRF). For the ad hoc task, a voting-based system has also been developed that joins three different systems from participating universities. For the medical task, we have also submitted runs with and without PRF, and experiments using textual queries only and textual queries mixed with visual queries.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 612-621

Data Fusion of Retrieval Results from Different Media: Experiments at ImageCLEF 2005

Romaric Besançon; Christophe Millet

The CEA-LIST/LIC2M develops both multilingual text retrieval systems and content-based image indexing and retrieval systems. These systems are developed independently. The merging of the results of the two systems is one of the important research interests in our lab. We tested several simple merging techniques in the ImageCLEF 2005 campaign. The analysis of our results shows that improved performance can be obtained by appropriately merging the two media. However, an a-priori tuning of the merging parameters is difficult because the performance of each system highly depends on the corpus and queries.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 622-631
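
The abstract above does not say which merging techniques were tested. One simple technique of this general kind is a weighted linear combination of normalised scores, sketched below purely as an illustration: the function names, the `text_weight` parameter and the sample scores are assumptions, not taken from the paper.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally strong.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_runs(text_scores, image_scores, text_weight=0.7):
    """Weighted linear fusion of a text run and an image run.

    text_weight is a hypothetical tuning parameter; documents missing
    from one run are given a score of 0.0 for that run.
    """
    text_norm = min_max_normalize(text_scores)
    image_norm = min_max_normalize(image_scores)
    merged = {}
    for doc in set(text_norm) | set(image_norm):
        merged[doc] = (text_weight * text_norm.get(doc, 0.0)
                       + (1.0 - text_weight) * image_norm.get(doc, 0.0))
    # Highest fused score first; ties broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical scores from two independent systems.
    text = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
    image = {"d2": 0.9, "d4": 0.1}
    for doc, score in merge_runs(text, image, text_weight=0.5):
        print(doc, round(score, 3))
```

The weight is exactly the kind of merging parameter the abstract describes as hard to tune in advance, since the better setting depends on how well each system performs on a given corpus and query set.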

Combining Visual Features for Medical Image Retrieval and Annotation

Wei Xiong; Bo Qiu; Qi Tian; Changsheng Xu; S. H. Ong; Kelvin Foong

In this paper we report our work using visual feature fusion for the tasks of medical image retrieval and annotation in the benchmark of ImageCLEF 2005. In the retrieval task, we use visual features without text information and without relevance feedback. Both local and global features, of both structural and statistical nature, are captured. We first identify visually similar images manually and form templates for each query topic. A pre-filtering process is utilized for a coarse retrieval. In the fine retrieval, two similarity measuring channels with different visual features are used in parallel and then combined at the decision level to produce a final score for image ranking. Our approach is evaluated over all 25 query topics, each containing example image(s) and topic textual statements. Over 50,000 images we achieved a mean average precision of 14.6%, one of the best-performing runs. In the annotation task, visual features are fused at an early stage by concatenation with normalization. We use support vector machines (SVM) with RBF kernels for the classification. Our approach is trained on a 9,000-image training set and tested on the given test set of 1,000 images in 57 classes, with a correct classification rate of about 80%.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 632-641

A Structured Visual Learning Approach Mixed with Ontology Dimensions for Medical Queries

Jean-Pierre Chevallet; Joo-Hwee Lim; Saïd Radhouani

Precise image and text indexing requires domain knowledge and a learning process. In this paper, we present the use of an ontology to filter medical documents and of visual concepts to describe and index associated images. These visual concepts are meaningful medical terms with associated visual appearance from image samples that are manually designed and learned from examples. Text and image indexing processes are performed in parallel and merged to answer mixed-mode queries. We show that fusion of these two methods is of great benefit and that external knowledge stored in an ontology is mandatory to solve precise queries and provide the overall best results.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 642-651

FIRE in ImageCLEF 2005: Combining Content-Based Image Retrieval with Textual Information Retrieval

Thomas Deselaers; Tobias Weyand; Daniel Keysers; Wolfgang Macherey; Hermann Ney

In this paper the methods we used in the 2005 ImageCLEF content-based image retrieval evaluation are described. For the medical retrieval task, we combined several low-level image features with textual information retrieval. Combining these two information sources yields clear improvements over using either source alone.

Additionally we participated in the automatic annotation task, where our content-based image retrieval system, FIRE, was used as well as a second, subimage-based method for object classification. The results we achieved are very convincing. Our submissions ranked first and third in the automatic annotation task out of a total of 44 submissions from 12 groups.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 652-661

A Clustered Retrieval Approach for Categorizing and Annotating Images

Lisa Ballesteros; Desislava Petkova

Images are difficult to classify and annotate but the availability of digital image databases creates a constant demand for tools that automatically analyze image content and describe it with either a category or set of words. We develop two cluster-based cross-media relevance models that effectively categorize and annotate images by adapting a cross-lingual retrieval technique to choose the terms most likely associated with the visual features of an image.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 662-672

Manual Query Modification and Data Fusion for Medical Image Retrieval

Jeffery R. Jensen; William R. Hersh

Image retrieval has great potential for a variety of tasks in medicine but is currently underdeveloped. For the ImageCLEF 2005 medical task, we used a text retrieval system as the foundation of our experiments to assess retrieval of images from the test collection. We conducted experiments using automatic queries, manual queries, and manual queries augmented with results from visual queries. The best performance was obtained from manual modification of queries. Combining manual and visual retrieval results gave lower performance based on mean average precision but higher precision within the top 30 results. Further research is needed not only to sort out the relative benefit of textual and visual methods in image retrieval but also to determine which performance measures are most relevant to the operational setting.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 673-679

Combining Textual and Visual Features for Image Retrieval

J. L. Martínez-Fernández; Julio Villena Román; Ana M. García-Serrano; José Carlos González-Cristóbal

This paper presents the approaches to image retrieval used by the MIRACLE team at ImageCLEF 2005. Text-based and content-based techniques have been tested, along with combinations of both types of methods to improve image retrieval. The text-based experiments defined this year try to use semantic information sources, such as a thesaurus with semantic data or text structure. On the other hand, content-based techniques are not part of the main expertise of the MIRACLE team, but multidisciplinary participation in all aspects of information retrieval has been pursued. We rely on a publicly available image retrieval system (GIFT 4) when needed.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 680-691

Supervised Machine Learning Based Medical Image Annotation and Retrieval in ImageCLEFmed 2005

Md. Mahmudur Rahman; Bipin C. Desai; Prabir Bhattacharya

This paper presents the methods and experimental results for the automatic medical image annotation and retrieval task of ImageCLEFmed 2005. A supervised machine learning approach to associate low-level image features with their high-level visual and/or semantic categories is investigated. For automatic image annotation, the input images are presented as a combined feature vector of texture, edge and shape features. A multi-class classifier based on pairwise coupling of several binary support vector machines is trained on these inputs to predict the categories of test images. For visual-only retrieval, a combined feature vector of color, texture and edge features is utilized in a low-dimensional PCA sub-space. Based on the online category prediction of query and database images by the classifier, pre-computed category-specific first and second order statistical parameters are utilized in a Bhattacharyya distance measure. Experimental results of both image annotation and retrieval are reported in this paper.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 692-701

Content-Based Retrieval of Medical Images by Combining Global Features

Mark O Güld; Christian Thies; Benedikt Fischer; Thomas M. Lehmann

A combination of several classifiers using global features for the content description of medical images is proposed. Besides well-known texture histogram features, downscaled representations of the original images are used, which preserve spatial information and utilize distance measures which are robust with regard to common variations in radiation dose, translation, and local deformation. These features were evaluated for the annotation task and the retrieval task in ImageCLEF 2005 without using additional textual information or query refinement mechanisms. For the annotation task, a categorization rate of 86.7% was obtained, which ranks second among all submissions. When applied in the retrieval task, the image content descriptors yielded a mean average precision (MAP) of 0.0751, which is rank 14 of 28 submitted runs. As the image deformation model is not fit for interactive retrieval tasks, two mechanisms are evaluated with regard to the trade-off between loss of accuracy and speed increase: hierarchical filtering and prototype selection.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 702-711

Combining Textual and Visual Features for Cross-Language Medical Image Retrieval

Pei-Cheng Cheng; Been-Chian Chien; Hao-Ren Ke; Wei-Pang Yang

In this paper we describe the technologies and experimental results for the medical retrieval task and automatic annotation task. We combine textual and content-based approaches to retrieve relevant medical images. The content-based approach containing four image features and the text-based approach using word expansion are developed to accomplish these tasks. Experimental results show that combining both the content-based and text-based approaches is better than using only one approach. In the automatic annotation task we use Support Vector Machines (SVM) to learn image feature characteristics for assisting the task of image classification. Based on the SVM model, we analyze which image feature is more promising in medical image retrieval. The results show that the spatial relationship between pixels is an important feature in medical image data because medical image data always has similar anatomic regions. Therefore, image features emphasizing spatial relationship have better results than others.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 712-723

The Use of MedGIFT and EasyIR for ImageCLEF 2005

Henning Müller; Antoine Geissbühler; Johan Marty; Christian Lovis; Patrick Ruch

This article describes the use of MedGIFT and EasyIR for three of the four ImageCLEF 2005 tasks. All results rely on two systems: the GNU Image Finding Tool (GIFT) for visual retrieval, and EasyIR for text. For ad hoc retrieval, two visual runs were submitted. No textual retrieval was attempted, resulting in lower scores than those using text retrieval. For medical retrieval, visual retrieval was performed with several configurations of Gabor filters and grey level/color quantisations as well as combinations of text and visual features. Due to a lack of resources no feedback runs were created, an area where MedGIFT performed best in 2004. For classification, a retrieval with the target image was performed and the first N = 1; 5; 10 results used to calculate scores for classes by simply adding up the scores for each class. No machine learning was performed, so results were surprisingly good and only topped by systems with optimised learning strategies.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 724-732
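
The classification step described in the abstract above (retrieve with the target image, then add up the scores of the first few results per class) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the class labels and the similarity scores are hypothetical, and the retrieval itself is taken as given.

```python
from collections import defaultdict

def classify_by_score_sum(ranked_results, n):
    """Pick the class with the largest summed score among the first n results.

    ranked_results: list of (class_label, similarity_score) pairs, sorted by
    descending score, as returned by a retrieval run with the target image.
    Raises ValueError if no results are given.
    """
    totals = defaultdict(float)
    for label, score in ranked_results[:n]:
        totals[label] += score
    # On a tie, the class that appeared first in the ranking wins.
    return max(totals, key=totals.get)

# Hypothetical retrieval output for one target image.
results = [("chest", 0.9), ("hand", 0.8), ("hand", 0.7),
           ("chest", 0.2), ("skull", 0.1)]
predictions = {n: classify_by_score_sum(results, n) for n in (1, 3, 5)}
```

With only the top result the class is "chest"; with three or five results the two "hand" images outweigh it, which shows how the cut-off changes the decision.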

Retrieving Images Using Cross-Language Text and Image Features

Mirna Adriani; Framadhan Arnely

We present a report on our participation in the English-Indonesian image ad-hoc task of the 2005 Cross-Language Evaluation Forum (CLEF). We chose to translate an Indonesian query set into English using a commercial machine translation tool. We used an approach that combines the retrieval results of the query on text and on image. We used query expansion in our effort to improve the retrieval effectiveness. However, this resulted in worse retrieval effectiveness.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 733-736

UB at CLEF 2005: Bilingual CLIR and Medical Image Retrieval Tasks

Miguel E. Ruiz; Silvia B. Southwick

This paper presents the results of the State University of New York at Buffalo in the Cross Language Evaluation Forum 2005 (CLEF 2005). We participated in monolingual Portuguese, bilingual English-Portuguese and in the medical image retrieval tasks. We used the SMART retrieval system for text retrieval in the mono and bilingual retrieval tasks on Portuguese documents. The main goal of this part was to test formally the support for Portuguese that had been added to our system. Our results show an acceptable level of performance in the monolingual task. For the retrieval of medical images with multilingual annotations our main goal was to explore the combination of Content-Based Image Retrieval (CBIR) and text retrieval to retrieve medical images that have clinical annotations in English, French and German. We used a system that combines the content-based image retrieval system GIFT and the well-known SMART system for text retrieval. Translation of English topics to French was performed by mapping the English text to UMLS concepts using MetaMap and the UMLS Metathesaurus. Our results on this task confirm that the combination of CBIR and text retrieval improves results significantly with respect to using either image or text retrieval alone.

- Part V. Cross-Language Retrieval In Image Collections (ImageCLEF) | Pp. 737-743

Overview of the CLEF-2005 Cross-Language Speech Retrieval Track

Ryen W. White; Douglas W. Oard; Gareth J. F. Jones; Dagobert Soergel; Xiaoli Huang

The task for the CLEF-2005 cross-language speech retrieval track was to identify topically coherent segments of English interviews in a known-boundary condition. Seven teams participated, performing both monolingual and cross-language searches of ASR transcripts, automatically generated metadata, and manually generated metadata. Results indicate that monolingual search technology is sufficiently accurate to be useful for some purposes (the best mean average precision was 0.13) and cross-language searching yielded results typical of those seen in other applications (with the best systems approximating monolingual mean average precision).

- Part VI. Cross-Language Speech Retrieval (CL-SR) | Pp. 744-759
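
Mean average precision, the measure quoted in the overview above, can be computed as in the following minimal sketch. The segment identifiers and relevance judgments are made up for illustration, and each ranking is assumed to contain no duplicate identifiers.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant item appears.

    Relevant items that are never retrieved contribute zero, because the sum
    is divided by the total number of relevant items.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical topics.
topic_a = (["s1", "s2", "s3", "s4"], {"s1", "s3"})  # AP = (1/1 + 2/3) / 2
topic_b = (["s5", "s6"], {"s6"})                    # AP = 1/2
map_score = mean_average_precision([topic_a, topic_b])
```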

Using Various Indexing Schemes and Multiple Translations in the CL-SR Task at CLEF 2005

Diana Inkpen; Muath Alzghool; Aminul Islam

We present the participation of the University of Ottawa in the Cross-Language Spoken Document Retrieval task at CLEF 2005. In order to translate the queries, we combined the results of several online Machine Translation tools. For the Information Retrieval component we used the SMART system [1], with several weighting schemes for indexing the documents and the queries. One scheme in particular led to better results than other combinations. We present the results of the submitted runs and of many unofficial runs. We compare the effect of several translations from each language. We present results on phonetic transcripts of the collection and queries and on the combination of text and phonetic transcripts. We also include the results when the manual summaries and keywords are indexed.

- Part VI. Cross-Language Speech Retrieval (CL-SR) | Pp. 760-768

The University of Alicante at CL-SR Track

Rafael M. Terol; Manuel Palomar; Patricio Martinez-Barco; Fernando Llopis; Rafael Muñoz; Elisa Noguera

In this paper, the new features that the IR-n system applies to topic processing for CL-SR are described. This set of features is based on applying logic forms to topics with the aim of increasing the weight of topic terms according to a set of syntactic rules.

- Part VI. Cross-Language Speech Retrieval (CL-SR) | Pp. 769-772

Pitt at CLEF05: Data Fusion for Spoken Document Retrieval

Daqing He; Jae-Wook Ahn

This paper describes an investigation of data fusion techniques for spoken document retrieval. The effectiveness of retrieval based solely on the output of automatic speech recognition (ASR) is subject to the recognition errors introduced by the ASR process. This is especially true for retrieval on the Malach test collection, whose ASR outputs have an average word error rate (WER) of 35%. To overcome the problem, in this year's CLEF experiments we explored data fusion techniques for integrating the manually generated metadata, which is provided for every Malach document, with the ASR outputs. We concentrated our effort on post-search data fusion techniques, where multiple retrieval results using automatically generated outputs or human metadata were combined. Our initial studies indicated that a simple unweighted combination method (i.e., CombMNZ) that had been shown to be useful in written text retrieval produced a significant 38% relative decrease in retrieval effectiveness (measured by Mean Average Precision) for our task, compared with a simple retrieval baseline in which all manual metadata and ASR outputs are put together. This motivated us to explore a more elaborate weighted data fusion model, where a weight is associated with each retrieval result and can be specified by the user in advance. We also explored multiple iterations of data fusion in our weighted fusion model, and obtained further improvement at the second iteration. In total, our best data fusion run obtained a significant 31% relative improvement over the simple fusion baseline, and a 4% relative improvement over the manual-only baseline, which is a significant difference.

- Part VI. Cross-Language Speech Retrieval (CL-SR) | Pp. 773-782
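
The two kinds of post-search fusion contrasted in the abstract above can be sketched as follows. CombMNZ is shown in its usual form (summed normalized scores multiplied by the number of runs that retrieved the document). The weighted variant is only a plain weighted sum used as a stand-in, since the abstract does not give the authors' actual model; the min-max normalization, the run contents and the weights are likewise assumptions.

```python
def min_max_normalize(run):
    """Rescale the scores of one run to the range [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (s - lo) / (hi - lo) for doc, s in run.items()}

def comb_mnz(runs):
    """Unweighted fusion: summed score times the number of runs retrieving the doc."""
    normalized = [min_max_normalize(r) for r in runs]
    fused = {}
    for doc in set().union(*normalized):
        scores = [r[doc] for r in normalized if doc in r]
        fused[doc] = sum(scores) * len(scores)
    return fused

def weighted_sum(runs, weights):
    """Weighted fusion: each run's normalized scores are scaled by a preset weight."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, s in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return fused

# Hypothetical retrieval results: one run on ASR text, one on manual metadata.
asr_run = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
metadata_run = {"d2": 2.0, "d4": 1.0}
mnz_scores = comb_mnz([asr_run, metadata_run])
weighted_scores = weighted_sum([asr_run, metadata_run], [0.3, 0.7])
```

In the weighted version the metadata run can be trusted more than the error-prone ASR run, which an unweighted method such as CombMNZ cannot express.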

UNED@CL-SR CLEF 2005: Mixing Different Strategies to Retrieve Automatic Speech Transcriptions

Fernando López-Ostenero; Víctor Peinado; Valentín Sama; Felisa Verdejo

In this paper we describe UNED’s participation in the CLEF CL-SR 2005 track. First, we explain how we tried several strategies to clean up the automatic transcriptions. Then, we describe how we performed 84 different runs mixing these strategies with named entity recognition and different pseudo-relevance feedback approaches, in order to study the influence of each method in the retrieval process, both in monolingual and cross-lingual environments. We noticed that the influence of named entity recognition was higher in the cross-lingual environment, where MAP scores double when we take advantage of an entity recognizer. The best pseudo-relevance feedback approach was the one using manual keywords. The effects of the different cleaning strategies were very similar, except for character 3-grams, which obtained poor scores compared with other approaches.

- Part VI. Cross-Language Speech Retrieval (CL-SR) | Pp. 783-791

Dublin City University at CLEF 2005: Cross-Language Speech Retrieval (CL-SR) Experiments

Adenike M. Lam-Adesina; Gareth J. F. Jones

The Dublin City University participation in the CLEF 2005 CL-SR task concentrated on exploring the application of our existing information retrieval methods based on the Okapi model to the conversational speech data set. This required an approach to determining approximate sentence boundaries within the free-flowing automatic transcription provided to enable us to use our summary-based pseudo relevance feedback (PRF). We also performed exploratory experiments on the use of the metadata provided with the document transcriptions for indexing and relevance feedback. Topics were translated into English using Systran V3.0 machine translation. In most cases Title field only topic statements performed better than combined Title and Description topics. PRF using our adapted method is shown to be effective, and absolute performance is improved by combining the automatic document transcriptions with additional metadata fields.

- Part VI. Cross-Language Speech Retrieval (CL-SR) | Pp. 792-799

CLEF-2005 CL-SR at Maryland: Document and Query Expansion Using Side Collections and Thesauri

Jianqiang Wang; Douglas W. Oard

This paper reports results for the University of Maryland’s participation in the CLEF-2005 Cross-Language Speech Retrieval track. Techniques that were tried include: (1) document expansion with manually created metadata (thesaurus keywords and segment summaries) from a large side collection, (2) query refinement with pseudo-relevance feedback, (3) keyword expansion with thesaurus synonyms, and (4) cross-language speech retrieval using translation knowledge obtained from the statistics of a large parallel corpus. The results show that document expansion and query expansion using blind relevance feedback were effective, although optimal parameter choices differed somewhat between the training and evaluation sets. Document expansion in which manually assigned keywords were augmented with thesaurus synonyms yielded marginal gains on the training set, but no improvement on the evaluation set. Cross-language retrieval with French queries yielded 79% of monolingual mean average precision when searching manually assigned metadata despite a substantial domain mismatch between the parallel corpus and the retrieval task. Detailed failure analysis indicates that speech recognition errors for named entities were an important factor that substantially degraded retrieval effectiveness.

- Part VI. Cross-Language Speech Retrieval (CL-SR) | Pp. 800-809

Overview of WebCLEF 2005

Börkur Sigurbjörnsson; Jaap Kamps; Maarten de Rijke

We describe WebCLEF, the multilingual web track, that was introduced at CLEF 2005. We provide details of the tasks, the topics, and the results of WebCLEF participants. The mixed monolingual task proved an interesting addition to the range of tasks in cross-language information retrieval. Although it may be too early to talk about a solved problem, effective web retrieval techniques seem to carry over to the mixed monolingual setting. The multilingual task, in contrast, is still very far from being a solved problem. Remarkably, using non-translated English queries proved more successful than using translations of the English queries.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 810-824

EuroGOV: Engineering a Multilingual Web Corpus

Börkur Sigurbjörnsson; Jaap Kamps; Maarten de Rijke

EuroGOV is a multilingual web corpus that was created to serve as the document collection for WebCLEF, the CLEF 2005 web retrieval task. EuroGOV is a collection of web pages crawled from the European Union portal, European Union member state governmental web sites, and Russian governmental web sites. The corpus contains over 3 million documents written in more than 20 different European languages. In this paper we provide a detailed description of the collection.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 825-836

Web Retrieval Experiments with the EuroGOV Corpus at the University of Hildesheim

Niels Jensen; René Hackl; Thomas Mandl; Robert Strötgen

This paper describes web retrieval experiments with the EuroGOV corpus carried out at the University of Hildesheim. For both the multi-lingual and the mixed mono-lingual task, several indexing strategies were tested, all of them based on one mixed language index. After stopword removal, word and n-gram based indexes were developed based on the full document content, part of the content and the document title. Boosting the original topic language with a higher weight in the query and punishing the English translation led to better results for most settings. A title only run gave the best results during post submission runs for the multi-lingual task.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 837-845

Danish and Greek Web Search Experiments with Hummingbird SearchServer at CLEF 2005

Stephen Tomlinson

Hummingbird participated in the WebCLEF mixed monolingual retrieval task of the Cross-Language Evaluation Forum (CLEF) 2005. In this task, the system was given 547 known-item queries from 11 languages (134 Spanish, 121 English, 59 Dutch, 59 Portuguese, 57 German, 35 Hungarian, 30 Danish, 30 Russian, 16 Greek, 5 Icelandic and 1 French). The goal was to find the desired page in the 82GB EuroGOV collection (3.4 million pages crawled from government sites of 27 European domains). Our experiments found that stopword processing was more important than anticipated, perhaps because words common in one language may tend to be overweighted by inverse document frequency in a mixed language collection. Extra weight on the document title helped significantly, and extra weight on less deep urls significantly helped home page queries. Stemming was of neutral impact on average, but it made a substantial difference for some individual queries. We analyze several Danish and Greek queries in detail.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 846-855

Combination Methods for Crosslingual Web Retrieval

Jaap Kamps; Maarten de Rijke; Börkur Sigurbjörnsson

We investigate a range of crosslingual web retrieval tasks using the test suite of the CLEF 2005 WebCLEF track, which features a stream of known-item topics in various languages. Our main findings are: (i) straightforward indexing and retrieval is effective for mixed monolingual web retrieval; (ii) standard machine translation methods are effective for bilingual web retrieval; but (iii) standard combination methods are ineffective for multilingual web retrieval; we analyze the failure and suggest an alternative Z-score normalization that leads to effective multilingual retrieval results.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 856-864
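
The abstract above suggests Z-score normalization as a way to make multilingual combination effective. A minimal sketch of the general idea follows, assuming the standard z-score (score minus the run mean, divided by the run's standard deviation); the exact variant used by the authors is not given in the abstract, and the run contents and function names are hypothetical.

```python
import statistics

def z_score_normalize(run):
    """Express each score as its distance from the run mean in standard deviations."""
    scores = list(run.values())
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return {doc: 0.0 for doc in run}
    return {doc: (s - mean) / std for doc, s in run.items()}

def merge_runs(runs):
    """Merge per-language runs into one ranking ordered by z-score.

    If a document occurs in several runs, its highest z-score is kept.
    """
    merged = {}
    for run in runs:
        for doc, z in z_score_normalize(run).items():
            merged[doc] = max(z, merged.get(doc, float("-inf")))
    return sorted(merged, key=merged.get, reverse=True)

# Hypothetical runs whose raw scores are on very different scales.
dutch_run = {"nl1": 100.0, "nl2": 80.0, "nl3": 60.0}
german_run = {"de1": 3.0, "de2": 1.0}
ranking = merge_runs([dutch_run, german_run])
```

Merging on raw scores would place every Dutch page above every German page; after normalization the two runs interleave according to how far each page stands out within its own run.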

University of Alicante at the CLEF 2005 WebCLEF Track

Trinitario Martínez; Elisa Noguera; Rafael Muñoz; Fernando Llopis

This paper presents our first experiment for the CLEF 2005 WebCLEF Track. In the present work, we have focused our main efforts on the Spanish part of the Mixed Monolingual task, but we have also participated in the tasks for several other languages and in the Bilingual English-Spanish task. A passage-based IR system is applied in the retrieval phase. A language identifier has also been created in order to build a fully automatic system that does not need to know the topic language.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 865-868

MIRACLE at WebCLEF 2005: Combining Web Specific and Linguistic Information

Ángel Martínez-González; José Luis Martínez-Fernández; César de Pablo-Sánchez; Julio Villena-Román

This paper describes the MIRACLE approach to WebCLEF. A set of independent indexes was constructed for each top level domain of the EuroGOV collection. Each index contains information extracted from the document, such as URL, title, keywords, detected named entities or HTML headers. These indexes are queried to obtain partial document rankings, which are combined with various relative weights to test the value of each index. The final aim is to identify which index (or combination of them) is more relevant for a retrieval task, avoiding the construction of a full-text index.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 869-872

BUAP-UPV TPIRS: A System for Document Indexing Reduction at WebCLEF

David Pinto; Héctor Jiménez-Salazar; Paolo Rosso; Emilio Sanchis

In this paper we present the results of the BUAP/UPV universities in WebCLEF, a particular task of CLEF 2005. Specifically, we evaluate our information retrieval system on the bilingual “English to Spanish” task. Our system uses a term reduction process based on the Transition Point technique. Our results show that it is possible to reduce the number of terms to index, thereby improving the performance of our system. We evaluate different percentages of reduction over a subset of EuroGOV, in order to determine the best one. We observed that after an 82.55% reduction of the corpus, a Mean Reciprocal Rank of 0.0844 was obtained, compared with 0.0465 for the same evaluation with full documents.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 873-879
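
Mean Reciprocal Rank, the measure reported in the abstract above, suits known-item search because each topic has a single target page. A minimal sketch with made-up rankings:

```python
def reciprocal_rank(ranked_ids, target_id):
    """Return 1/rank of the target page, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results):
    """results: list of (ranked_ids, target_id) pairs, one per topic."""
    return sum(reciprocal_rank(r, t) for r, t in results) / len(results)

# Three hypothetical topics: target found at rank 1, at rank 3, and not found.
results = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "c"),
    (["a", "b"], "z"),
]
mrr = mean_reciprocal_rank(results)  # (1 + 1/3 + 0) / 3
```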

Web Page Retrieval by Combining Evidence

Carlos G. Figuerola; José L. Alonso Berrocal; Angel F. Zazo; Emilio Rodríguez Vázquez de Aldana

The participation of the REINA Research Group in WebCLEF 2005 focused on the mixed monolingual task. Queries or topics are of two types: named pages and home pages. For both, we first perform a search by thematic contents; for the same query, we do a search in several elements of information from every page (title, some meta tags, anchor text) and then we combine the results. For queries about home pages, we try to detect them using a method based on some keywords and their patterns of use. Afterwards, a re-ranking of the results of the thematic contents retrieval is performed, based on PageRank and centrality coefficients.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 880-887

UNED at WebCLEF 2005

Javier Artiles; Víctor Peinado; Anselmo Peñas; Julio Gonzalo; Felisa Verdejo

This paper describes the experiments submitted by UNED’s NLP Group to the WebCLEF 2005 track in the bilingual English to Spanish task. We present two different runs: i) a simple search over the whole content of the documents; ii) a series of restricted searches over given fields according to their descriptiveness. Our newly developed approach for searching ordered fields performs 80% better than the baseline. We also describe an unsupervised approach to translating out-of-vocabulary words.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 888-891

Using the Web Information Structure for Retrieving Web Pages

Mirna Adriani; Rama Pandugita

We present a report on our participation in the mixed monolingual web task of the 2005 Cross-Language Evaluation Forum (CLEF). We compared the results of web page retrieval based on the page content, the page title, and a combination of page content and page title. The results show that using the combination of page content and page title gave the best retrieval performance compared to using only page content or page title. Taking into account the number of links referring to a web page and the depth of the directory path in its URL did not result in any significant improvement to the retrieval performance.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 892-897

University of Glasgow at WebCLEF 2005: Experiments in Per-Field Normalisation and Language Specific Stemming

Craig Macdonald; Vassilis Plachouras; Ben He; Christina Lioma; Iadh Ounis

We participated in the WebCLEF 2005 monolingual task. In this task, a search system aims to retrieve relevant documents from a multilingual corpus of Web documents from Web sites of European governments. Both the documents and the queries are written in a wide range of European languages. A challenge in this setting is to detect the language of documents and topics, and to process them appropriately. We develop a language specific technique for applying the correct stemming approach, as well as for removing the correct stopwords from the queries. We represent documents using three fields, namely content, title, and anchor text of incoming hyperlinks. We use a technique called per-field normalisation, which extends the Divergence From Randomness (DFR) framework, to normalise the term frequencies, and to combine them across the three fields. We also employ the length of the URL path of Web documents. The ranking is based on combinations of both the language specific stemming, if applied, and the per-field normalisation. We use our Terrier platform for all our experiments. The overall performance of our techniques is outstanding, achieving the overall top four performing runs, as well as the top performing run without metadata in the monolingual task. The best run only uses per-field normalisation, without applying stemming.

- Part VII. Multilingual Web Track (WebCLEF) | Pp. 898-907

GeoCLEF: The CLEF 2005 Cross-Language Geographic Information Retrieval Track Overview

Fredric Gey; Ray Larson; Mark Sanderson; Hideo Joho; Paul Clough; Vivien Petras

GeoCLEF was a new pilot track in CLEF 2005. Its aim was to test and evaluate cross-language geographic information retrieval (GIR) of text. Geographic information retrieval is retrieval oriented toward the geographic specification in the description of the search topic, returning documents which satisfy this geographic information need. For GeoCLEF 2005, twenty-five search topics were defined for searching against the English and German ad-hoc document collections of CLEF. Topic languages were English, German, Portuguese and Spanish. Eleven groups submitted runs and about 25,000 documents (half English and half German) in the pooled runs were judged by the organizers. The groups used a variety of approaches, including geographic bounding boxes and external knowledge bases (geographic thesauri and ontologies and gazetteers). The results were encouraging but showed that additional work needs to be done to refine the task for GeoCLEF in 2006.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 908-919

MIRACLE at GeoCLEF 2005: First Experiments in Geographical IR

Sara Lana-Serrano; José M. Goñi-Menoyo; José C. González-Cristóbal

This paper presents the 2005 MIRACLE team’s approach to Cross-Language Geographical Retrieval (GeoCLEF). The main goal of the GeoCLEF participation of the MIRACLE team was to test the effect that geographical information retrieval techniques have on information retrieval. The baseline approach is based on the development of named entity recognition and geospatial information retrieval tools and on their combination with linguistic techniques to carry out indexing and retrieval tasks.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 920-923

University of Alicante at GeoCLEF 2005

Óscar Ferrández; Zornitsa Kozareva; Antonio Toral; Elisa Noguera; Andrés Montoyo; Rafael Muñoz; Fernando Llopis

For our participation in GeoCLEF 2005 we have developed a system made up of three modules. One of them is an Information Retrieval module and the others are Named Entity Recognition modules, one based on machine learning and one based on knowledge. We have carried out several runs with different combinations of these modules for resolving the proposed tasks. The system took second position for the tasks against German collections and third position for the tasks against English collections.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 924-927

Evaluating Geographic Information Retrieval

András Kornai

The processing steps required for geographic information retrieval include many steps that are common to all forms of information retrieval, e.g. stopword filtering, stemming, vocabulary enrichment, understanding Booleans, and fluff removal. Only a few steps, in particular the detection of geographic entities and the assignment of bounding boxes to these, are specific to geographic IR. The paper presents the results of experiments designed to evaluate the geography-specificity of the GeoCLEF 2005 task, and suggests some methods to increase the sensitivity of the evaluation.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 928-938

Using the WordNet Ontology in the GeoCLEF Geographical Information Retrieval Task

Davide Buscaldi; Paolo Rosso; Emilio Sanchis Arnal

This paper describes how we managed to use the WordNet ontology for the GeoCLEF 2005 English monolingual task. Two methods are presented: a query expansion method, based on the expansion of geographical terms by means of WordNet synonyms and meronyms, and a method based on the expansion of index terms, which exploits WordNet synonyms and holonyms. The obtained results show that the query expansion method was not suitable for the GeoCLEF track, while WordNet could be used in a more effective way during the indexing phase.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 939-946

The GeoTALP-IR System at GeoCLEF 2005: Experiments Using a QA-Based IR System, Linguistic Analysis, and a Geographical Thesaurus

Daniel Ferrés; Alicia Ageno; Horacio Rodríguez

This paper describes the GeoTALP-IR system, a Geographical Information Retrieval (GIR) system. The system is described and evaluated in the context of our participation in the CLEF 2005 GeoCLEF Monolingual English task.

The GIR system is based on and uses a modified version of the Passage Retrieval module of the TALP Question Answering (QA) system presented at CLEF 2004 and TREC 2004 QA evaluation tasks. We designed a Keyword Selection algorithm based on a Linguistic and Geographical Analysis of the topics. A Geographical Thesaurus (GT) has been built using a set of publicly available Geographical Gazetteers and a Geographical Ontology. Our experiments show that the use of a Geographical Thesaurus for Geographical Indexing and Retrieval has improved the performance of our GIR system.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 947-955

CSUSM Experiments in GeoCLEF2005: Monolingual and Bilingual Tasks

Rocio Guillén

This paper presents the results of our initial experiments in the monolingual English task and the Bilingual Spanish → English task. We used the Terrier Information Retrieval Platform to run experiments for both tasks using the Inverse Document Frequency model with Laplace after-effect and normalization 2. Additional experiments were run with Indri, a retrieval engine that combines inference networks with language modelling. For the bilingual task we developed a component to first translate the topics from Spanish into English. No spatial analysis was carried out for any of the tasks. One of our goals is to have a baseline to compare further experiments with term translation of georeferences and spatial analysis. Another goal is to use ontologies for Integrated Geographic Information Systems adapted to the IR task. Our initial results show that the geographic information as provided does not significantly improve retrieval performance. We included the geographical terms appearing in all the fields. Duplication of terms might have decreased gain of information and affected the ranking.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 956-962

Berkeley at GeoCLEF: Logistic Regression and Fusion for Geographic Information Retrieval

Ray R. Larson; Fredric C. Gey; Vivien Petras

In this paper we will describe the Berkeley (groups 1 and 2 combined) submissions and approaches to the GeoCLEF task for CLEF 2005. The two Berkeley groups used different systems and approaches for GeoCLEF with some common themes. For Berkeley group 1 (Larson) the main technique used was fusion of multiple probabilistic searches against different XML components using both Logistic Regression (LR) algorithms and a version of the Okapi BM-25 algorithm. The Berkeley group 2 (Gey and Petras) employed tested CLIR methods from previous CLEF evaluations using Logistic Regression with Blind Feedback. Both groups used multiple translations of queries for cross-language searching, and the primary geographically-based approaches taken by both involved query expansion with additional place names. The Berkeley1 group used GIR indexing techniques to georeference proper nouns in the text using a gazetteer derived from the World Gazetteer (with both English and German names for each place), and automatically expanded place names in topics for regions or countries in the queries by the names of the countries or cities in those regions or countries. The Berkeley2 group used manual expansion of queries, adding additional place names.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 963-976

Using Semantic Networks for Geographic Information Retrieval

Johannes Leveling; Sven Hartrumpf; Dirk Veiel

This paper describes our work for the participation at the GeoCLEF task of CLEF 2005. We employ multilayered extended semantic networks for the representation of background knowledge, queries, and documents for geographic information retrieval (GIR). In our approach, geographic concepts from the query network are expanded with concepts which are semantically connected via topological, directional, and proximity relations. We started with an existing geographic knowledge base represented as a semantic network and expanded it with concepts automatically extracted from the GEOnet Names Server.

Several experiments for GIR on German documents have been performed: a baseline corresponding to a traditional information retrieval approach; a variant expanding thematic, temporal, and geographic descriptors from the semantic network representation of the query; and an adaptation of a question answering (QA) algorithm based on semantic networks. The second experiment is based on a representation of the natural language description of a topic as a semantic network, which is achieved by a deep linguistic analysis. The semantic network is transformed into an intermediate representation of a database query explicitly representing thematic, temporal, and local restrictions. This experiment showed the best performance with respect to mean average precision: 10.53% using the topic title and description. The third experiment, adapting a QA algorithm, uses a modified version of the QA system InSicht. The system matches deep semantic representations of queries or their equivalent or similar variants to semantic networks for document sentences.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 977-986

Experiments with Geo-Filtering Predicates for IR

Jochen L. Leidner

This paper describes a set of experiments for monolingual English retrieval at GeoCLEF 2005, evaluating a technique for spatial retrieval based on named entity tagging, toponym resolution, and re-ranking by means of geographic filtering. To this end, a series of systematic experiments in the Vector Space paradigm are presented. Plain bag-of-words versus phrasal retrieval and the potential of meronymy query expansion as a recall-enhancing device are investigated, and three alternative geo-spatial filtering techniques based on spatial clipping are compared and evaluated on 25 monolingual English queries. Preliminary results show that always choosing toponym referents based on a simple “maximum population” heuristic to approximate the salience of a referent fails to outperform TF*IDF baselines with the 2005 dataset when combined with three geo-filtering predicates. Conservative geo-filtering outperforms more aggressive predicates. The evidence further seems to suggest that query expansion with WordNet meronyms is not effective in combination with the method described. A post-hoc analysis indicates that responsible factors for the low performance include sparseness of available population data, gaps in the gazetteer that associates Minimum Bounding Rectangles with geo-terms in the query, and the composition of the 2005 dataset itself.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 987-996

The XLDB Group at GeoCLEF 2005

Nuno Cardoso; Bruno Martins; Marcirio Chaves; Leonardo Andrade; Mário J. Silva

This paper describes our participation at GeoCLEF 2005. We detail the main software components of our Geo-IR system, its adaptation for GeoCLEF and the obtained results. The software architecture includes a geographic knowledge base, a text mining tool for geo-referencing documents, and a geo-ranking component. Results show that geo-ranking is heavily dependent on the information in the knowledge base and on the ranking algorithm involved.

- Part VIII. Cross-Language Geographical Retrieval (GeoCLEF) | Pp. 997-1006

Portuguese at CLEF 2005

Diana Santos; Nuno Cardoso

In this paper, we comment on the addition of Portuguese to three new tracks in CLEF 2005, namely WebCLEF, GeoCLEF and ImageCLEF, and discuss differences and new features in the adhoc IR and the QA tracks, presenting a new Brazilian collection.

- Evaluation Issues | Pp. 1007-1010


Type: books

Print ISBN


Electronic ISBN


Publisher

Springer Nature

Country of publication

United Kingdom

Publication date