Publications catalogue - books
Advances in Information Retrieval: 27th European Conference on IR Research, ECIR 2005, Santiago de Compostela, Spain, March 21-23, 2005, Proceedings
David E. Losada ; Juan M. Fernández-Luna (eds.)
Conference: 27th European Conference on Information Retrieval (ECIR), Santiago de Compostela, Spain, March 21-23, 2005
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Information Storage and Retrieval; Artificial Intelligence (incl. Robotics); Database Management; Information Systems Applications (incl. Internet); Multimedia Information Systems; Document Preparation and Text Processing
Availability
Detected institution | Publication year | Browse | Download | Request
---|---|---|---|---
Not detected | 2005 | SpringerLink | |
Information
Resource type:
books
Print ISBN
978-3-540-25295-5
Electronic ISBN
978-3-540-31865-1
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2005
Publication rights information
© Springer-Verlag Berlin Heidelberg 2005
Table of contents
Using Restrictive Classification and Meta Classification for Junk Elimination
Stefan Siersdorfer; Gerhard Weikum
This paper addresses the problem of performing supervised classification on document collections that also contain junk documents. By "junk documents" we mean documents that do not belong to the topic categories (classes) we are interested in. Such documents typically cannot be covered by the training set; nevertheless, in many real-world applications (e.g. classification of web or intranet content, focused crawling, etc.) such documents occur quite often and a classifier has to make a decision about them. We tackle this problem by using restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate classes with low confidence. Our experiments with four different data sets show that the proposed techniques can eliminate a relatively large fraction of junk documents while dismissing only a significantly smaller fraction of potentially interesting documents.
- Text Classification and Fusion | Pp. 287-299
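The abstract's core idea, abstaining on low-confidence documents and requiring agreement across an ensemble, can be sketched as follows. This is a minimal illustration under assumed inputs (per-class confidence scores and base-classifier votes); the threshold values and helper names are not from the paper.

```python
from collections import Counter

def restrictive_classify(doc_scores, threshold=0.8):
    """doc_scores: dict mapping class label -> confidence in [0, 1].
    Returns the best class, or None (abstain / treat as junk) if no
    class reaches the confidence threshold."""
    best_class = max(doc_scores, key=doc_scores.get)
    if doc_scores[best_class] < threshold:
        return None  # leave the document out rather than misclassify it
    return best_class

def meta_classify(votes, min_agreement=2):
    """votes: list of class labels (or None) from several base
    classifiers. Accept a label only if enough non-abstaining
    classifiers agree on it; otherwise abstain."""
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    label, n = counts.most_common(1)[0]
    return label if n >= min_agreement else None
```

A junk document would typically produce flat, low scores in every class, so both steps abstain instead of forcing an assignment.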
On Compression-Based Text Classification
Yuval Marton; Ning Wu; Lisa Hellerstein
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
- Text Classification and Fusion | Pp. 300-314
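A standard compression-based classifier of the kind the abstract evaluates can be sketched with a general-purpose compressor: assign a document to the class whose training text helps compress it most. This sketch uses `zlib` and tiny toy corpora; the paper compares several such methods, and this is only one plausible baseline.

```python
import zlib

def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify_by_compression(doc: str, class_corpora: dict) -> str:
    """Assign `doc` to the class minimizing the extra bytes needed to
    compress it appended to that class's training corpus:
    C(corpus + doc) - C(corpus)."""
    def extra_bytes(corpus: str) -> int:
        return compressed_size(corpus + " " + doc) - compressed_size(corpus)
    return min(class_corpora, key=lambda c: extra_bytes(class_corpora[c]))
```

Because the compressor works on characters, it can exploit punctuation, word stems, and cross-word patterns without any tokenization, which is exactly the property the paper's experiments probe.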
Ontology as a Search-Tool: A Study of Real Users’ Query Formulation With and Without Conceptual Support
Sari Suomela; Jaana Kekäläinen
This study examines 16 real users’ use of an ontology as a search tool. The users’ queries constructed with the help of a Concept-based Information Retrieval Interface (CIRI) were compared to queries created independently based on the same search task description. The effectiveness of the CIRI queries was also compared to that of the users’ unaided queries. The simulated search task method was used to make the searching situations as close to real as possible. Due to CIRI’s query expansion feature, the number of search terms was markedly higher in ontology queries than in Direct interface queries. The search results were evaluated with generalised precision and generalised relative recall, as well as precision based on personal assessments. The Direct interface queries performed better under all methods of comparison.
- User Studies and Evaluation | Pp. 315-329
An Analysis of Query Similarity in Collaborative Web Search
Evelyn Balfe; Barry Smyth
Web search logs provide an invaluable source of information regarding the search behaviour of users. This information can be reused to aid future searches, especially when these logs contain the searching histories of specific communities of users. To date, this information has rarely been exploited, as most Web search techniques continue to rely on the more traditional term-based IR approaches. In contrast, the I-SPY system attempts to reuse past search behaviours as a means to re-rank result-lists according to the implied preferences of like-minded communities of users. It relies on the ability to recognise previous search sessions that are related to the current target search by looking for similarities between past and current queries. We have previously shown how a simple model of query similarity can significantly improve search performance by implementing this reuse approach. In this paper we build on previous work by evaluating alternative query similarity models.
- User Studies and Evaluation | Pp. 330-344
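The "simple model of query similarity" the abstract refers to can be illustrated with plain term overlap, a common baseline for deciding which past sessions are related enough to reuse. The Jaccard formulation, the 0.5 threshold, and the function names below are illustrative assumptions, not the exact model evaluated in the paper.

```python
def query_similarity(q1: str, q2: str) -> float:
    """Term-overlap (Jaccard) similarity between two queries."""
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)

def related_sessions(target: str, past_queries: list, threshold: float = 0.5):
    """Return past queries similar enough to the target that their
    recorded result selections could be reused for re-ranking."""
    return [q for q in past_queries if query_similarity(target, q) >= threshold]
```

Alternative models of the kind the paper compares would replace `query_similarity` while keeping the same session-retrieval scaffolding.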
A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation
Cyril Goutte; Eric Gaussier
We address the problems of 1/ assessing the confidence of the standard point estimates, precision, recall and F-score, and 2/ comparing the results, in terms of precision, recall and F-score, obtained using two different methods. To do so, we use a probabilistic setting which allows us to obtain posterior distributions on these performance indicators, rather than point estimates. This framework is applied to the case where different methods are run on different datasets from the same source, as well as the standard situation where competing results are obtained on the same data.
- User Studies and Evaluation | Pp. 345-359
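The posterior-over-point-estimate idea can be sketched for precision: with true/false positive counts and a uniform Beta(1, 1) prior, the posterior of precision is Beta(tp + 1, fp + 1), and sampling from two such posteriors gives a Monte Carlo estimate of which method is better. The sample counts, seeds, and function names are illustrative choices, not the paper's exact setup.

```python
import random

def precision_posterior_samples(tp, fp, n=10000, seed=0):
    """Samples from the posterior of precision under a uniform prior:
    precision | data ~ Beta(tp + 1, fp + 1)."""
    rng = random.Random(seed)
    return [rng.betavariate(tp + 1, fp + 1) for _ in range(n)]

def prob_method_a_better(tp_a, fp_a, tp_b, fp_b, n=10000, seed=0):
    """Monte Carlo estimate of P(precision_A > precision_B) by pairing
    independent posterior samples from the two methods."""
    a = precision_posterior_samples(tp_a, fp_a, n, seed)
    b = precision_posterior_samples(tp_b, fp_b, n, seed + 1)
    return sum(x > y for x, y in zip(a, b)) / n
```

The same construction applies to recall (with false negatives in place of false positives), and the F-score posterior can be derived from the joint samples.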
Exploring Cost-Effective Approaches to Human Evaluation of Search Engine Relevance
Kamal Ali; Chi-Chao Chang; Yunfang Juan
In this paper, we examine novel and less expensive methods for search engine evaluation that do not rely on document relevance judgments. These methods, described within a proposed framework, are motivated by the increasing focus on search results presentation, by the growing diversity of documents and content sources, and by the need to measure effectiveness relative to other search engines. Correlation analysis of the data obtained from actual tests using a subset of the methods in the framework suggest that these methods measure different aspects of the search engine. In practice, we argue that the selection of the test method is a tradeoff between measurement intent and cost.
- User Studies and Evaluation | Pp. 360-374
Document Identifier Reassignment Through Dimensionality Reduction
Roi Blanco; Álvaro Barreiro
Most modern retrieval systems use compressed inverted files (IF) for indexing. Recent works demonstrated that it is possible to reduce IF sizes by reassigning the document identifiers of the original collection, as this lowers the average distance between documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total number of bits per document pointer. However, the approximations developed so far require great amounts of time or use an uncontrolled amount of memory. This paper presents an efficient solution to the reassignment problem, consisting of reducing the input data dimensionality with an SVD transformation. We tested this approximation with the Greedy-NN TSP algorithm and a more efficient variant based on dividing the original problem into sub-problems. We present experimental tests and performance results on two TREC collections, obtaining good compression ratios with low running times. We also show experimental results on the tradeoff between dimensionality reduction and compression, and on time performance.
- Information Retrieval Methods (II) | Pp. 375-387
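The Greedy-NN TSP step named in the abstract can be sketched directly: visit documents in nearest-neighbour order so similar documents get consecutive identifiers, shrinking the d-gaps that variable-bit codes exploit. The starting document, the dot-product similarity, and the function name below are illustrative assumptions; the paper applies this to SVD-reduced document vectors.

```python
def greedy_nn_reassign(doc_vectors):
    """Greedy nearest-neighbour ordering of documents: start from
    document 0 and repeatedly jump to the most similar unvisited
    document. order[k] is the old identifier of the document that
    receives new identifier k."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    remaining = set(range(len(doc_vectors)))
    order = [0]
    remaining.discard(0)
    while remaining:
        last = doc_vectors[order[-1]]
        # pick the unvisited document most similar to the last one placed
        nxt = max(remaining, key=lambda i: dot(last, doc_vectors[i]))
        order.append(nxt)
        remaining.discard(nxt)
    return order
```

Each greedy step scans all remaining documents, which is why the paper also explores splitting the problem into sub-problems and reducing vector dimensionality with SVD to keep running times low.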
Scalability Influence on Retrieval Models: An Experimental Methodology
Amélie Imafouo; Michel Beigbeder
Few works in Information Retrieval (IR) have tackled questions about Information Retrieval Systems (IRS) in the context of scalability in corpus size.
We propose a general experimental methodology to study the influence of scalability on IR models. This methodology is based on the construction of a collection in which a given characteristic is the same regardless of which portion of the collection is selected. This new collection, called uniform, can be split into sub-collections of growing size on which given properties can be studied.
We apply our methodology to WT10G (the TREC-9 collection) and take the characteristic to be the distribution of relevant documents over the collection. We build a uniform WT10G, sample it into sub-collections of increasing size, and use these sub-collections to study the impact of corpus volume increase on standard IRS evaluation measures (recall/precision, high precision).
- Information Retrieval Methods (II) | Pp. 388-402
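The uniform-collection construction can be sketched as an interleaving: spread the relevant documents evenly so every prefix of the ordering has roughly the collection-wide proportion of relevant documents, then take growing prefixes as sub-collections. The interleaving strategy and names below are assumptions for illustration, not the paper's exact procedure.

```python
import random

def build_uniform_collection(doc_ids, relevant_ids, seed=0):
    """Order documents so any prefix has roughly the same proportion
    of relevant documents as the whole collection."""
    rel = [d for d in doc_ids if d in relevant_ids]
    non = [d for d in doc_ids if d not in relevant_ids]
    rng = random.Random(seed)
    rng.shuffle(rel)
    rng.shuffle(non)
    if not rel:
        return non
    step = len(doc_ids) / len(rel)  # one relevant document every `step` slots
    out, ri, ni = [], 0, 0
    for k in range(len(doc_ids)):
        if ri < len(rel) and k >= ri * step:
            out.append(rel[ri]); ri += 1
        else:
            out.append(non[ni]); ni += 1
    return out

def growing_subcollections(uniform, sizes):
    """Prefixes of the uniform ordering, e.g. 25%, 50%, 75%, 100%."""
    return [uniform[:s] for s in sizes]
```

Evaluation measures computed on each prefix can then be compared across sizes without the relevant-document density acting as a confound.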
The Role of Multi-word Units in Interactive Information Retrieval
Olga Vechtomova
The paper presents several techniques for selecting noun phrases for interactive query expansion following pseudo-relevance feedback and a new phrase search method. A combined syntactico-statistical method was used for the selection of phrases. First, noun phrases were selected using a part-of-speech tagger and a noun-phrase chunker, and secondly, different statistical measures were applied to select phrases for query expansion. Experiments were also conducted studying the effectiveness of noun phrases in document ranking. We analyse the problems of phrase weighting and suggest new ways of addressing them. A new method of phrase matching and weighting was developed, which specifically addresses the problem of weighting overlapping and non-contiguous word sequences in documents.
- Information Retrieval Methods (II) | Pp. 403-420
Dictionary-Based CLIR Loses Highly Relevant Documents
Raija Lehtokangas; Heikki Keskustalo; Kalervo Järvelin
Research on cross-language information retrieval (CLIR) has typically been restricted to settings using binary relevance assessments. In this paper, we present evaluation results for dictionary-based CLIR using graded relevance assessments in a best match retrieval environment. A text database containing newspaper articles and a related set of 35 search topics were used in the tests. First, monolingual baseline queries were automatically formed from the topics. Secondly, source language topics (in English, German, and Swedish) were automatically translated into the target language (Finnish), using both structured and unstructured queries. Effectiveness of the translated queries was compared to that of the monolingual queries. CLIR performance was evaluated using three relevance criteria: stringent, regular, and liberal. When regular or liberal criteria were used, a reasonable performance was achieved. Adopting stringent criteria caused a considerable loss of performance, when compared to monolingual Finnish performance.
- Information Retrieval Methods (II) | Pp. 421-432
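The structured vs. unstructured query distinction in the abstract can be illustrated with a toy dictionary-based translation step: a structured query keeps all translation alternatives of one source term grouped (so they behave like synonyms sharing one weight), while an unstructured query simply pools every alternative. The bilingual dictionary and function name here are hypothetical examples, not from the paper.

```python
def translate_query(source_terms, bilingual_dict):
    """Dictionary-based query translation.
    Returns (structured, unstructured): structured is a list of
    synonym groups, one group per source term; unstructured is the
    flat pool of all translation alternatives."""
    structured = [bilingual_dict.get(t, [t]) for t in source_terms]
    unstructured = [w for group in structured for w in group]
    return structured, unstructured
```

Grouping alternatives matters because a source term with many dictionary translations would otherwise dominate an unstructured query, one of the known effects studied in dictionary-based CLIR.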