Publications catalogue - books
Advances in Information Retrieval: 27th European Conference on IR Research, ECIR 2005, Santiago de Compostela, Spain, March 21-23, 2005, Proceedings
David E. Losada; Juan M. Fernández-Luna (eds.)
In conference: 27th European Conference on Information Retrieval (ECIR). Santiago de Compostela, Spain. March 21, 2005 - March 23, 2005
Abstract/Description - provided by the publisher
Not available.
Keywords - provided by the publisher
Information Storage and Retrieval; Artificial Intelligence (incl. Robotics); Database Management; Information Systems Applications (incl. Internet); Multimedia Information Systems; Document Preparation and Text Processing
Availability
Detected institution | Publication year | Browse | Download | Request |
---|---|---|---|---|
Not detected | 2005 | SpringerLink | | |
Information
Resource type:
books
Print ISBN
978-3-540-25295-5
Electronic ISBN
978-3-540-31865-1
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2005
Publication rights information
© Springer-Verlag Berlin Heidelberg 2005
Table of contents
Automatic Text Summarization Based on Word-Clusters and Ranking Algorithms
Massih R. Amini; Nicolas Usunier; Patrick Gallinari
This paper investigates a new approach to Single Document Summarization (SDS) based on a machine-learning ranking algorithm. Using machine learning for this task makes it possible to adapt summaries to user needs and to corpus characteristics, desirable properties that have motivated a growing amount of work in this field over the last few years. Most approaches generate summaries by extracting text spans (sentences, in our case) and adopt the classification framework, which consists of training a classifier to discriminate between relevant and irrelevant spans of a document. A set of features is first used to produce a vector of scores for each sentence in a given document, and a classifier is trained to make a global combination of these scores. We believe that the classification criterion used to train a classifier is not well suited to SDS, and we propose an original ranking-based framework for this task. A ranking algorithm also combines the scores of different features, but its criterion tends to reduce the relative misordering of sentences within a document. The features we use are either drawn from the state of the art or built upon word clusters: groups of words that often co-occur with each other, which can serve to expand a query or to enrich the representation of the sentences of a document. We analyse the performance of our ranking algorithm on two data sets, the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC and the WIPO collection. We compare it against several non-learning baseline systems and against a reference trainable summarizer based on the classification framework. The experiments show that the learning algorithms outperform the non-learning systems, and that the ranking algorithm outperforms the classifier. The difference in performance between the two learning algorithms depends on the nature of the data sets; we explain this by the different data-separability assumptions the two learning algorithms make.
- Text Summarization | Pp. 142-156
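The ranking criterion described in the abstract above can be illustrated with a minimal pairwise learner. This is a hypothetical sketch, not the authors' algorithm: it learns a linear combination `w` of per-sentence feature scores so that, within each document, relevant sentences outrank irrelevant ones.

```python
def train_pairwise_ranker(documents, epochs=10, lr=0.1):
    """Perceptron-style pairwise ranking (illustrative sketch only).

    documents: list of documents, each a list of
               (feature_vector, is_relevant) pairs, one per sentence.
    Returns a weight vector combining the feature scores.
    """
    n_features = len(documents[0][0][0])
    w = [0.0] * n_features
    for _ in range(epochs):
        for doc in documents:
            relevant = [f for f, rel in doc if rel]
            irrelevant = [f for f, rel in doc if not rel]
            for fr in relevant:
                for fi in irrelevant:
                    # Only penalize misordered pairs *within* a document,
                    # which is the key difference from a classifier.
                    margin = sum(wi * (a - b)
                                 for wi, a, b in zip(w, fr, fi))
                    if margin <= 0:
                        for k in range(n_features):
                            w[k] += lr * (fr[k] - fi[k])
    return w
```

A summary is then produced by scoring every sentence with `w` and extracting the top-ranked ones.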
Comparing Topiary-Style Approaches to Headline Generation
Ruichao Wang; Nicola Stokes; William P. Doran; Eamonn Newman; Joe Carthy; John Dunnion
In this paper we compare a number of Topiary-style headline generation systems. The Topiary system, developed at the University of Maryland with BBN, was the top-performing headline generation system at DUC 2004. Topiary-style headlines consist of a number of general topic labels followed by a compressed version of the lead sentence of a news story. The Topiary system uses a statistical learning approach to find topic labels for headlines, while our approach, the LexTrim system, identifies key summary words by analysing the lexical cohesive structure of a text. The performance of these systems is evaluated with the ROUGE evaluation suite on the DUC 2004 news story collection. The results of these experiments show that a baseline system that identifies topic descriptors for headlines using term frequency counts outperforms both the LexTrim and Topiary systems. A manual evaluation of the headlines confirms this result.
- Text Summarization | Pp. 157-168
Improving Retrieval Effectiveness by Using Key Terms in Top Retrieved Documents
Yang Lingpeng; Ji Donghong; Zhou Guodong; Nie Yu
In this paper, we propose a method to improve the precision of top retrieved documents in Chinese information retrieval, where the query is a short description, by re-ordering the documents returned by the initial retrieval. To re-order the documents, we first identify the terms in the query and their importance weights using information derived from the top (<=30) documents of the initial retrieval; we then re-order the retrieved documents according to which of these query terms they contain. That is, we first automatically extract key terms from the top retrieved documents, then collect the key terms that occur in the query together with their document frequencies in the retrieved documents, and finally use these collected terms to re-order the initially retrieved documents. Each collected term is assigned a weight based on its length and its document frequency in the top retrieved documents, and each document is re-ranked by the sum of the weights of the collected terms it contains. In our experiments on 42 query topics of the NTCIR3 Cross Lingual Information Retrieval (CLIR) dataset, an average improvement of 17.8%-27.5% is obtained for the top 10 documents and of 6.6%-26.9% for the top 100 documents, under relaxed/rigid relevance judgments and different parameter settings.
- Information Retrieval Methods (I) | Pp. 169-184
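The re-ordering step described in the abstract above can be sketched as follows. The abstract states only that a term's weight depends on its length and its document frequency in the top retrieved documents; the product used here, and the helper names, are assumptions for illustration.

```python
def rerank(initial_docs, key_terms, doc_freq):
    """Re-order initially retrieved documents by the summed weights of
    the query key terms they contain (illustrative sketch).

    initial_docs: documents (as text) in their initial retrieval order.
    key_terms:    key terms extracted from top documents that also
                  occur in the query.
    doc_freq:     document frequency of each key term among the top
                  retrieved documents.
    """
    def term_weight(term):
        # Assumed weighting: term length times document frequency.
        return len(term) * doc_freq.get(term, 0)

    def doc_score(text):
        return sum(term_weight(t) for t in key_terms if t in text)

    # Stable sort: ties keep their initial retrieval order.
    return sorted(initial_docs, key=doc_score, reverse=True)
```

Because Python's sort is stable, documents containing none of the collected terms simply sink to the bottom while preserving their original relative order.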
Evaluating Relevance Feedback Algorithms for Searching on Small Displays
Vishwa Vinay; Ingemar J. Cox; Natasa Milic-Frayling; Ken Wood
Searching online information resources with mobile devices is constrained by displays that can show only a small fraction of the ranked documents. In this paper, we ask whether the search effort can be reduced, on average, by user feedback indicating the single most relevant document in each display. For small display sizes and limited user actions, we can construct a tree representing all possible outcomes; examining this tree lets us compute an upper limit on relevance feedback performance. Three standard feedback algorithms are considered: Rocchio, Robertson/Sparck-Jones, and a Bayesian algorithm. Two display strategies are considered, one that maximizes the immediate information gain and one that shows the most likely documents. Our results bring out the strengths and weaknesses of the algorithms, and the need for exploratory display strategies with conservative feedback algorithms.
- Information Retrieval Methods (I) | Pp. 185-199
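Of the three feedback algorithms named above, Rocchio is the simplest to state: the query vector is moved toward the centroid of documents marked relevant and away from the centroid of those marked non-relevant. A minimal sketch (the alpha/beta/gamma defaults are conventional choices, not values from the paper):

```python
def rocchio_update(query, rel_docs, nonrel_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio relevance feedback on plain list vectors.

    query:       current query vector.
    rel_docs:    vectors of documents judged relevant.
    nonrel_docs: vectors of documents judged non-relevant.
    """
    def centroid(docs, dim):
        if not docs:
            return [0.0] * dim
        return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]

    dim = len(query)
    c_rel = centroid(rel_docs, dim)
    c_non = centroid(nonrel_docs, dim)
    # q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, c_rel, c_non)]
```

In the single-feedback setting studied in the paper, `rel_docs` would hold just the one document the user selects from each display.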
Term Frequency Normalisation Tuning for BM25 and DFR Models
Ben He; Iadh Ounis
Term frequency normalisation parameter tuning is a crucial issue in information retrieval (IR), with an important impact on retrieval performance. The classical pivoted normalisation approach suffers from a collection-dependence problem: it requires relevance assessments for each given collection to obtain the optimal parameter setting. In this paper, we tackle the collection-dependence problem by proposing a new tuning method that measures the normalisation effect. The proposed method refines and extends our methodology described in [7]. In our experiments, we evaluate the proposed tuning method on various TREC collections, for both normalisation 2 of the Divergence From Randomness (DFR) models and BM25's normalisation method. Results show that for both normalisation methods, our tuning method significantly outperforms robust, empirically obtained baselines over diverse TREC collections, at a marginal computational cost.
- Information Retrieval Methods (I) | Pp. 200-214
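For context, the BM25 normalisation the abstract refers to is the standard length normalisation controlled by the parameter `b` (with `b = 0` disabling it entirely). A sketch of the classic per-term BM25 weight, using the usual k1/b parameterisation:

```python
import math

def bm25_term_weight(tf, df, n_docs, doc_len, avg_doc_len,
                     k1=1.2, b=0.75):
    """Classic BM25 weight of one term in one document.

    tf:          term frequency in the document.
    df:          number of documents containing the term.
    n_docs:      total number of documents in the collection.
    doc_len:     length of this document (in tokens).
    avg_doc_len: average document length in the collection.
    b:           the term frequency normalisation parameter being tuned.
    """
    # Length-normalised term frequency: b interpolates between no
    # normalisation (b=0) and full normalisation by doc length (b=1).
    tfn = tf / (1 - b + b * doc_len / avg_doc_len)
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return idf * (k1 + 1) * tfn / (k1 + tfn)
```

With `b > 0`, the same term frequency contributes less in a longer-than-average document; collection-dependent tuning of `b` (and of the analogous parameter in DFR's normalisation 2) is exactly what the paper's method aims to do without per-collection relevance assessments.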
Improving the Context-Based Influence Diagram Model for Structured Document Retrieval: Removing Topological Restrictions and Adding New Evaluation Methods
Luis M. de Campos; Juan M. Fernández-Luna; Juan F. Huete
In this paper we present the theoretical developments necessary to extend the existing Context-based Influence Diagram Model for Structured Documents (CID model), in order to improve its retrieval performance and expressiveness. Firstly, we make it more flexible and general by removing the original restrictions on the type of structured documents that CID represents. This extension requires the design of a new algorithm to compute the posterior probabilities of relevance. Another contribution is related to the evaluation of the influence diagram. The computation of the expected utilities in the original CID model was approximated by applying an independence criterion. We present another approximation that does not assume independence, as well as an exact evaluation method.
- Information Retrieval Models (II) | Pp. 215-229
Knowing-Aboutness: Question-Answering Using a Logic-Based Framework
Terence Clifton; William Teahan
We describe the background and motivation for a logic-based framework, based on the theory of “Knowing-Aboutness”, and its specific application to Question-Answering. We present the salient features of our system, and outline the benefits of our framework in terms of a more integrated architecture that is more easily evaluated. Favourable results are presented in the TREC 2004 Question-Answering evaluation.
- Information Retrieval Models (II) | Pp. 230-244
Modified LSI Model for Efficient Search by Metric Access Methods
Tomáš Skopal; Pavel Moravec
Text collections represented in the LSI model are hard to search efficiently (i.e. quickly), since no indexing method exists for the LSI matrices. The inverted file, often used in both the Boolean and the classic vector model, cannot be utilized effectively because query vectors in the LSI model are dense. A possible way to search LSI matrices efficiently is to use metric access methods (MAMs). Instead of the cosine measure, MAMs can use the deviation metric as an equivalent dissimilarity measure for query processing. However, the intrinsic dimensionality of collections represented by LSI matrices is often large, which degrades MAMs' search performance. In this paper we introduce -LSI, a modification of LSI in which we artificially decrease the intrinsic dimensionality of the LSI matrices. This is achieved by adjusting the singular values produced by SVD. We show that suitable adjustments can dramatically improve search efficiency with MAMs, while precision/recall remains preserved or degrades only slightly.
- Information Retrieval Models (II) | Pp. 245-259
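The singular-value adjustment described above can be sketched in a few lines. The specific adjustment here (raising each singular value to a power greater than 1, which sharpens the spectrum so a few latent dimensions dominate) is a hypothetical choice for illustration, not the paper's actual adjustment function.

```python
import numpy as np

def adjusted_lsi(term_doc, k, power=2.0):
    """Rank-k LSI document vectors with an adjusted spectrum (sketch).

    term_doc: terms-by-documents matrix.
    k:        number of latent dimensions to keep.
    power:    hypothetical adjustment; power > 1 makes large singular
              values relatively larger, lowering the intrinsic
              dimensionality that MAMs are sensitive to.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    s_adj = s[:k] ** power
    # One row per document in the adjusted latent space.
    return (np.diag(s_adj) @ Vt[:k]).T
```

Since only the singular values are rescaled, the latent axes themselves are unchanged; the trade-off the paper measures is between the efficiency gain from the flatter distance distribution and any loss in precision/recall.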
PIRE: An Extensible IR Engine Based on Probabilistic Datalog
Henrik Nottelmann
This paper introduces PIRE, a probabilistic IR engine. For both document indexing and retrieval, PIRE makes heavy use of probabilistic Datalog, a probabilistic extension of predicate Horn logic. Using such a logical framework together with probability theory allows for defining and using data types (e.g. text, names, numbers), different weighting schemes (e.g. normalised tf, tf.idf or BM25) and retrieval functions (e.g. uncertain inference, language models). Extending the system is thus reduced to adding new rules. Furthermore, this logical framework provides a powerful tool for including additional background knowledge in the retrieval process.
- Information Retrieval Models (II) | Pp. 260-274
Data Fusion with Correlation Weights
Shengli Wu; Sally McClean
This paper focuses on the effect of correlation on data fusion over multiple retrieval results. If some of the results involved in data fusion correlate more strongly than the others, their common opinion will dominate the voting process, which may degrade the effectiveness of data fusion in many cases, especially when very good results are in the minority. To solve this problem, we assign each result a weight derived from that result's correlation coefficient with the other results; the linear combination method can then be used for data fusion. We report an evaluation of the effectiveness of the proposed method on TREC 5 (ad hoc track) results. Furthermore, we explore the relationship between result correlation and data fusion through experiments, and demonstrate that such a relationship does exist.
- Text Classification and Fusion | Pp. 275-286
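The correlation-weighted fusion idea in the abstract above can be sketched concretely. The weight formula `1 - mean correlation` is one plausible instantiation chosen for illustration; the paper's exact derivation may differ.

```python
import statistics

def correlation_weights(score_lists):
    """Weight each retrieval system inversely to its mean Pearson
    correlation with the other systems, so strongly correlated systems
    do not dominate the fused vote (illustrative sketch).

    score_lists: one list of per-document scores per system, all over
                 the same documents in the same order.
    """
    def pearson(x, y):
        mx, my = statistics.mean(x), statistics.mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    n = len(score_lists)
    return [1.0 - sum(pearson(score_lists[i], score_lists[j])
                      for j in range(n) if j != i) / (n - 1)
            for i in range(n)]

def fuse(score_lists, weights):
    """Weighted linear combination of per-document scores."""
    n_docs = len(score_lists[0])
    return [sum(w * s[d] for w, s in zip(weights, score_lists))
            for d in range(n_docs)]
```

A system that disagrees with the majority gets a larger weight, which is precisely how a good-but-minority result avoids being drowned out.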