Publications catalogue - books



Advances in Information Retrieval: 27th European Conference on IR Research, ECIR 2005, Santiago de Compostela, Spain, March 21-23, 2005, Proceedings

David E. Losada ; Juan M. Fernández-Luna (eds.)

Conference: 27th European Conference on Information Retrieval (ECIR), Santiago de Compostela, Spain, March 21-23, 2005

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Information Storage and Retrieval; Artificial Intelligence (incl. Robotics); Database Management; Information Systems Applications (incl. Internet); Multimedia Information Systems; Document Preparation and Text Processing

Availability
Detected institution   Year of publication   Browse   Download   Request
Not detected           2005                  SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-25295-5

Electronic ISBN

978-3-540-31865-1

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Publication rights information

© Springer-Verlag Berlin Heidelberg 2005

Table of contents

Using Restrictive Classification and Meta Classification for Junk Elimination

Stefan Siersdorfer; Gerhard Weikum

This paper addresses the problem of performing supervised classification on document collections that also contain junk documents. By "junk documents" we mean documents that do not belong to the topic categories (classes) we are interested in. Documents of this type typically cannot be covered by the training set; nevertheless, in many real-world applications (e.g. classification of web or intranet content, focused crawling, etc.) such documents occur quite often, and a classifier has to make a decision about them. We tackle this problem by using restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate classes with low confidence. Our experiments with four different data sets show that the proposed techniques can eliminate a relatively large fraction of junk documents while dismissing only a significantly smaller fraction of potentially interesting documents.

- Text Classification and Fusion | Pp. 287-299
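A minimal sketch of the restrictive idea in the Siersdorfer and Weikum abstract above: a classifier that leaves a document unassigned when its top-class confidence falls below a threshold. The toy corpus, the 0.7 cutoff, and the use of scikit-learn are illustrative assumptions, not details taken from the paper.

```python
# Sketch of restrictive classification: withhold low-confidence predictions
# instead of forcing every document into a class. Corpus and threshold are toy
# values; the paper's restrictive and meta methods are more elaborate.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = [
    "stocks fell sharply on weak quarterly earnings",
    "the team won the championship match",
    "central bank raises interest rates again",
    "the striker scored twice in the cup final",
]
train_labels = ["finance", "sports", "finance", "sports"]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_docs), train_labels)

def restrictive_predict(doc, threshold=0.7):
    probs = clf.predict_proba(vec.transform([doc]))[0]
    best = probs.argmax()
    # Leave the document out rather than assign a low-confidence label.
    return clf.classes_[best] if probs[best] >= threshold else "left out (possible junk)"

print(restrictive_predict("quarterly earnings beat expectations"))
print(restrictive_predict("recipe for lentil soup"))  # off-topic for both classes
```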

On Compression-Based Text Classification

Yuval Marton; Ning Wu; Lisa Hellerstein

Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.

- Text Classification and Fusion | Pp. 300-314
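For readers unfamiliar with the family of methods evaluated by Marton, Wu and Hellerstein, the sketch below shows one common compression-based classifier: assign the class whose training text lets an off-the-shelf compressor encode the new document most cheaply. The toy corpora are invented, and the paper's own algorithms may differ.

```python
# One common compression-based classifier: pick the class whose training text
# makes the new document cheapest to compress with a standard compressor.
# Toy corpora; not the paper's exact algorithms.
import zlib

class_corpora = {
    "finance": "shares stocks market earnings dividend profit quarterly report",
    "sports": "match goal team season league player score championship final",
}

def compressed_size(text):
    return len(zlib.compress(text.encode("utf-8")))

def classify(doc):
    def extra_bytes(corpus):
        # Extra bytes the document costs on top of the class corpus alone.
        return compressed_size(corpus + " " + doc) - compressed_size(corpus)
    return min(class_corpora, key=lambda c: extra_bytes(class_corpora[c]))

print(classify("the player scored a late goal"))
print(classify("quarterly earnings and dividend report"))
```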

Ontology as a Search-Tool: A Study of Real Users’ Query Formulation With and Without Conceptual Support

Sari Suomela; Jaana Kekäläinen

This study examines 16 real users’ use of an ontology as a search tool. The users’ queries constructed with the help of a Concept-based Information Retrieval Interface (CIRI) were compared to queries created independently on the basis of the same search task description. The effectiveness of the CIRI queries was also compared to that of the users’ unaided queries. The simulated search task method was used to make the searching situations as close to real as possible. Due to CIRI’s query expansion feature, the number of search terms was remarkably higher in ontology queries than in Direct interface queries. The search results were evaluated with generalised precision and generalised relative recall, as well as precision based on personal assessments. The Direct interface queries performed better in all methods of comparison.

- User Studies and Evaluation | Pp. 315-329

An Analysis of Query Similarity in Collaborative Web Search

Evelyn Balfe; Barry Smyth

Web search logs provide an invaluable source of information regarding the search behaviour of users. This information can be reused to aid future searches, especially when these logs contain the searching histories of specific communities of users. To date this information is rarely exploited as most Web search techniques continue to rely on the more traditional term-based IR approaches. In contrast, the I-SPY system attempts to reuse past search behaviours as a means to re-rank result-lists according to the implied preferences of like-minded communities of users. It relies on the ability to recognise previous search sessions that are related to the current target search by looking for similarities between past and current queries. We have previously shown how a simple model of query similarity can significantly improve search performance by implementing this reuse approach. In this paper we build on previous work by evaluating alternative query similarity models.

- User Studies and Evaluation | Pp. 330-344
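The "simple model of query similarity" mentioned in the Balfe and Smyth abstract can be approximated by plain term overlap between the current query and past queries; the Jaccard formula and the 0.25 threshold below are assumptions for illustration, not necessarily what I-SPY uses.

```python
# Term-overlap (Jaccard) similarity between the current query and past
# queries, used to pick related past search sessions. Formula and threshold
# are illustrative assumptions.
def query_similarity(q1, q2):
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)

past_queries = ["jaguar car dealers", "jaguar big cat habitat", "used car prices"]
current = "jaguar car prices"

scored = [(q, query_similarity(current, q)) for q in past_queries]
related = sorted((s for s in scored if s[1] >= 0.25), key=lambda s: -s[1])
print(related)  # past sessions deemed similar enough to reuse
```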

A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation

Cyril Goutte; Eric Gaussier

We address the problems of (1) assessing the confidence of the standard point estimates of precision, recall and F-score, and (2) comparing the results, in terms of precision, recall and F-score, obtained using two different methods. To do so, we use a probabilistic setting which allows us to obtain posterior distributions on these performance indicators, rather than point estimates. This framework is applied to the case where different methods are run on different datasets from the same source, as well as the standard situation where competing results are obtained on the same data.

- User Studies and Evaluation | Pp. 345-359
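A hedged sketch of the idea behind Goutte and Gaussier's paper: with a uniform Beta(1, 1) prior over the underlying rates, the posterior on precision is Beta(TP+1, FP+1) and the posterior on recall is Beta(TP+1, FN+1), and sampling both yields an approximate posterior over the F1-score. The confusion counts are invented, and sampling precision and recall independently is a simplification of the paper's joint model.

```python
# Posterior distributions on precision, recall and F1 from observed counts,
# under a uniform Beta(1, 1) prior. Counts are invented; sampling precision
# and recall independently is a simplification.
import numpy as np

tp, fp, fn = 45, 5, 15
rng = np.random.default_rng(0)

precision = rng.beta(tp + 1, fp + 1, size=100_000)
recall = rng.beta(tp + 1, fn + 1, size=100_000)
f1 = 2 * precision * recall / (precision + recall)

print("precision: mean %.3f, 95%% interval %s"
      % (precision.mean(), np.round(np.percentile(precision, [2.5, 97.5]), 3)))
print("recall:    mean %.3f" % recall.mean())
print("F1:        mean %.3f, 95%% interval %s"
      % (f1.mean(), np.round(np.percentile(f1, [2.5, 97.5]), 3)))
```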

Exploring Cost-Effective Approaches to Human Evaluation of Search Engine Relevance

Kamal Ali; Chi-Chao Chang; Yunfang Juan

In this paper, we examine novel and less expensive methods for search engine evaluation that do not rely on document relevance judgments. These methods, described within a proposed framework, are motivated by the increasing focus on search results presentation, by the growing diversity of documents and content sources, and by the need to measure effectiveness relative to other search engines. Correlation analysis of the data obtained from actual tests using a subset of the methods in the framework suggests that these methods measure different aspects of the search engine. In practice, we argue that the selection of the test method is a tradeoff between measurement intent and cost.

- User Studies and Evaluation | Pp. 360-374

Document Identifier Reassignment Through Dimensionality Reduction

Roi Blanco; Álvaro Barreiro

Most modern retrieval systems use compressed inverted files (IF) for indexing. Recent work has demonstrated that it is possible to reduce IF sizes by reassigning the document identifiers of the original collection, as this lowers the average distance between documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total number of bits per document pointer. However, the approximations developed so far require large amounts of time or use an uncontrolled amount of memory. This paper presents an efficient solution to the reassignment problem that consists of reducing the dimensionality of the input data with an SVD transformation. We tested this approximation with the Greedy-NN TSP algorithm and a more efficient variant based on dividing the original problem into sub-problems. We present experimental tests and performance results on two TREC collections, obtaining good compression ratios with low running times. We also show experimental results on the trade-off between dimensionality reduction and compression, and on time performance.

- Information Retrieval Methods (II) | Pp. 375-387
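The compression argument in the Blanco and Barreiro abstract can be seen in miniature: variable-byte coding of docid gaps gets cheaper when the documents containing a term receive nearby identifiers. The reassignment below is hand-picked for the toy posting list; the SVD-based reduction and the Greedy-NN TSP heuristic used in the paper are not reproduced.

```python
# Variable-byte coding of docid gaps: the same posting list costs fewer bytes
# once the documents containing the term get nearby identifiers. The
# reassignment is hand-picked here, not computed via SVD + TSP as in the paper.
def vbyte(n):
    out = bytearray()
    while n >= 128:
        out.append(n % 128)
        n //= 128
    out.append(n + 128)  # high bit marks the final byte
    return bytes(out)

def encoded_size(postings):
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    return sum(len(vbyte(g)) for g in gaps)

original = [3, 180, 501, 777, 1024]   # docids scattered across the collection
reassigned = [3, 4, 5, 6, 7]          # same documents, clustered identifiers

print("original ids:  ", encoded_size(original), "bytes")
print("reassigned ids:", encoded_size(reassigned), "bytes")
```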

Scalability Influence on Retrieval Models: An Experimental Methodology

Amélie Imafouo; Michel Beigbeder

Few works in Information Retrieval (IR) have tackled questions about Information Retrieval Systems (IRS) in the context of scalability in corpus size.

We propose a general experimental methodology to study the influence of scalability on IR models. This methodology is based on the construction of a collection in which a given characteristic is the same whatever portion of the collection is selected. This new collection, called uniform, can be split into sub-collections of growing size on which given properties can be studied.

We apply our methodology to WT10G (the TREC9 collection) and take the characteristic of interest to be the distribution of relevant documents over the collection. We build a uniform WT10G, sample it into sub-collections of increasing size, and use these sub-collections to study the impact of the increase in corpus volume on standard IRS evaluation measures (recall/precision, high precision).

- Information Retrieval Methods (II) | Pp. 388-402
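A toy illustration of the "uniform collection" idea from Imafouo and Beigbeder: if relevant documents are spread evenly, any prefix of the collection shows roughly the same fraction of relevant documents, so sub-collections of growing size remain comparable. The document counts and relevance marks below are invented; the paper's construction on WT10G is more involved.

```python
# Shuffle documents so relevant ones are spread evenly, then check that
# growing sub-collections keep (roughly) the same fraction of relevant
# documents. Counts and relevance marks are invented.
import random

random.seed(0)
docs = [("d%04d" % i, i < 120) for i in range(1000)]  # 12% marked relevant
random.shuffle(docs)

for frac in (0.25, 0.5, 1.0):
    sub = docs[: int(frac * len(docs))]
    rel = sum(1 for _, is_relevant in sub if is_relevant)
    print("%3d%% of collection: %.3f fraction relevant" % (int(frac * 100), rel / len(sub)))
```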

The Role of Multi-word Units in Interactive Information Retrieval

Olga Vechtomova

The paper presents several techniques for selecting noun phrases for interactive query expansion following pseudo-relevance feedback and a new phrase search method. A combined syntactico-statistical method was used for the selection of phrases. First, noun phrases were selected using a part-of-speech tagger and a noun-phrase chunker, and secondly, different statistical measures were applied to select phrases for query expansion. Experiments were also conducted studying the effectiveness of noun phrases in document ranking. We analyse the problems of phrase weighting and suggest new ways of addressing them. A new method of phrase matching and weighting was developed, which specifically addresses the problem of weighting overlapping and non-contiguous word sequences in documents.

- Information Retrieval Methods (II) | Pp. 403-420
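As one concrete instance of the "different statistical measures" Vechtomova mentions for selecting phrases, the sketch below scores candidate two-word units by pointwise mutual information; the toy text is invented, and the paper's syntactic noun-phrase chunking step is not reproduced.

```python
# Score candidate two-word phrases by pointwise mutual information (PMI),
# one of several association measures that could drive phrase selection.
# Toy text; no part-of-speech tagging or noun-phrase chunking is done here.
import math
from collections import Counter

text = ("information retrieval systems rank documents "
        "interactive information retrieval helps users "
        "query expansion adds terms to the query").split()

unigrams = Counter(text)
bigrams = Counter(zip(text, text[1:]))
n = len(text)

def pmi(w1, w2):
    p_xy = bigrams[(w1, w2)] / (n - 1)
    return math.log2(p_xy / ((unigrams[w1] / n) * (unigrams[w2] / n)))

ranked = sorted(bigrams, key=lambda b: -pmi(*b))
print(ranked[:3])  # highest-association candidate phrases
```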

Dictionary-Based CLIR Loses Highly Relevant Documents

Raija Lehtokangas; Heikki Keskustalo; Kalervo Järvelin

Research on cross-language information retrieval (CLIR) has typically been restricted to settings using binary relevance assessments. In this paper, we present evaluation results for dictionary-based CLIR using graded relevance assessments in a best match retrieval environment. A text database containing newspaper articles and a related set of 35 search topics were used in the tests. First, monolingual baseline queries were automatically formed from the topics. Secondly, source language topics (in English, German, and Swedish) were automatically translated into the target language (Finnish), using both structured and unstructured queries. Effectiveness of the translated queries was compared to that of the monolingual queries. CLIR performance was evaluated using three relevance criteria: stringent, regular, and liberal. When regular or liberal criteria were used, a reasonable performance was achieved. Adopting stringent criteria caused a considerable loss of performance, when compared to monolingual Finnish performance.

- Information Retrieval Methods (II) | Pp. 421-432
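To make the structured/unstructured distinction in the Lehtokangas, Keskustalo and Järvelin abstract concrete, the sketch below translates a query with a toy bilingual dictionary, either grouping the alternative translations of each source word as synonyms (structured) or pouring them all into one flat bag of words (unstructured). The dictionary entries and the #syn() syntax are illustrative assumptions, not the paper's actual resources or query language.

```python
# Dictionary-based translation of a query, either structured (alternative
# translations of a source word grouped as synonyms) or unstructured (all
# translations in one flat bag of words). Dictionary and #syn() syntax are
# invented for illustration.
bilingual_dict = {
    "newspaper": ["sanomalehti", "lehti"],
    "strike": ["lakko", "isku"],
}

def translate(source_terms, structured=True):
    if structured:
        return " ".join("#syn(" + " ".join(bilingual_dict.get(t, [t])) + ")"
                        for t in source_terms)
    return " ".join(w for t in source_terms for w in bilingual_dict.get(t, [t]))

print(translate(["newspaper", "strike"], structured=True))
print(translate(["newspaper", "strike"], structured=False))
```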