Publications catalog - books
Information Retrieval Technology: Second Asia Information Retrieval Symposium, AIRS 2005, Jeju Island, Korea, October 13-15, 2005, Proceedings
Gary Geunbae Lee ; Akio Yamada ; Helen Meng ; Sung Hyon Myaeng (eds.)
Conference: 2nd Asia Information Retrieval Symposium (AIRS). Jeju Island, South Korea. October 13-15, 2005
Abstract/Description (provided by the publisher)
Not available.
Keywords (provided by the publisher)
Information Storage and Retrieval; Library Science; Theory of Computation; Information Systems Applications (incl. Internet); Algorithm Analysis and Problem Complexity; Data Structures
Availability
Detected institution | Publication year | Browse | Download | Request |
---|---|---|---|---|
Not detected | 2005 | SpringerLink | | |
Information
Resource type:
books
Print ISBN
978-3-540-29186-2
Electronic ISBN
978-3-540-32001-2
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2005
Copyright information
© Springer-Verlag Berlin Heidelberg 2005
Subject coverage
Table of contents
doi: 10.1007/11562382_11
Filtering Contents with Bigrams and Named Entities to Improve Text Classification
François Paradis; Jian-Yun Nie
We present a new method for the classification of “noisy” documents, based on filtering contents with bigrams and named entities. The method is applied to call-for-tender documents, but we claim it would be useful for many other Web collections, which also contain non-topical contents. Different variations of the method are discussed. We obtain the best results by filtering out a window around the least relevant bigrams. We find a significant increase in the micro-F1 measure on our collection of calls for tenders, as well as on the “4-Universities” collection. Another approach, rejecting sentences based on the presence of some named entities, also shows a moderate increase. Finally, we try combining the two approaches, but have not obtained conclusive results so far.
- Session 2A: Natural Language Processing in IR | Pp. 135-146
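The window-filtering idea in the abstract above can be illustrated with a minimal sketch. It assumes that a relevance score per bigram is already available; the abstract does not say how relevance is computed, so the `relevance` dictionary, the `threshold`, and the function name are hypothetical and this is not the authors' implementation.

```python
from typing import Dict, List, Tuple


def filter_bigram_windows(
    tokens: List[str],
    relevance: Dict[Tuple[str, str], float],
    threshold: float,
    window: int = 1,
) -> List[str]:
    """Drop every token within `window` positions of a low-relevance bigram."""
    drop = set()
    for i in range(len(tokens) - 1):
        bigram = (tokens[i], tokens[i + 1])
        # Assumption: bigrams without a score are treated as neutral and kept.
        if relevance.get(bigram, threshold) < threshold:
            start = max(0, i - window)
            end = min(len(tokens), i + 2 + window)
            drop.update(range(start, end))
    return [tok for idx, tok in enumerate(tokens) if idx not in drop]


tokens = "tender for road works click here to login deadline friday".split()
scores = {("click", "here"): 0.1}  # made-up score for a non-topical bigram
print(filter_bigram_windows(tokens, scores, threshold=0.5, window=1))
# ['tender', 'for', 'road', 'login', 'deadline', 'friday']
```

A wider window removes more of the non-topical context but, as the example shows with "works", it can also remove topical words next to it.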
doi: 10.1007/11562382_12
The Spatial Indexing Method for Supporting a Circular Location Property of Object
Hwi-Joon Seon; Hong-Ki Kim
To increase retrieval performance in spatial and multimedia database systems, spatial indexing methods that take spatial locality into account are required. Spatial locality is related to the location property of objects. Most spatial indexing methods, however, do not consider the circular location property of objects. In this paper, we propose a dynamic spatial index structure, called the CR-tree. It is a new spatial index structure that supports the circular location property of objects, in which the search space is organized into circular and linear domains. We include performance test results that verify this advantage of the CR-tree and show that the CR-tree outperforms the R-tree.
- Session 2A: Natural Language Processing in IR | Pp. 147-159
doi: 10.1007/11562382_13
Supervised Categorization of JavaScript Using Program Analysis Features
Wei Lu; Min-Yen Kan
Web pages often embed scripts for a variety of purposes, including advertising and dynamic interaction. Understanding embedded scripts and their purpose can often help to interpret or provide crucial information about the web page. We have developed a functionality-based categorization of JavaScript, the most widely used web page scripting language. We then view understanding embedded scripts as a text categorization problem. We show how traditional information retrieval methods can be augmented with features distilled from domain knowledge of JavaScript and software analysis to improve classification performance. We perform experiments on the standard WT10G web page corpus, and show that our techniques eliminate over 50% of errors relative to a standard text classification baseline.
- Session 2A: Natural Language Processing in IR | Pp. 160-173
doi: 10.1007/11562382_14
Effective and Scalable Authorship Attribution Using Function Words
Ying Zhao; Justin Zobel
Techniques for identifying the author of an unattributed document can be applied to problems in information analysis and in academic scholarship. A range of methods have been proposed in the research literature, using a variety of features and machine learning approaches, but the methods have been tested on very different data and the results cannot be compared. It is not even clear whether the differences in performance are due to feature selection or other variables. In this paper we examine the use of a large publicly available collection of newswire articles as a benchmark for comparing authorship attribution methods. To demonstrate the value of having a benchmark, we experimentally compare several recent feature-based techniques for authorship attribution, and test how well these methods perform as the volume of data is increased. We show that the benchmark is able to clearly distinguish between different approaches, and that the scalability of the best methods based on function-word features is acceptable, with only moderate decline as the difficulty of the problem is increased.
- Session 2A: Natural Language Processing in IR | Pp. 174-189
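As a rough illustration of attribution from function-word features, the sketch below builds a relative-frequency profile over a tiny function-word list and picks the known author whose profile is closest by cosine similarity. The word list, the similarity measure, and all names are assumptions chosen for brevity; the paper compares several feature-based techniques that this toy example does not reproduce.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical, deliberately tiny function-word list.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]


def profile(text: str) -> List[float]:
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(text: str, known: Dict[str, str]) -> str:
    """Return the author whose sample text has the most similar profile."""
    target = profile(text)
    return max(known, key=lambda author: cosine(target, profile(known[author])))


samples = {
    "A": "the cat and the dog and the bird",
    "B": "it is that it is in a box",
}
print(attribute("the fox and the hen", samples))  # A
```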
doi: 10.1007/11562382_15
Learning to Integrate Web Taxonomies with Fine-Grained Relations: A Case Study Using Maximum Entropy Model
Chia-Wei Wu; Tzong-Han Tsai; Wen-Lian Hsu
As web taxonomy integration is an emerging issue on the Internet, many research topics, such as personalization, web searches, and electronic markets, would benefit from further development of taxonomy integration techniques. The integration task is to transfer documents from a source web taxonomy to a target web taxonomy. In most current techniques, integration performance is enhanced by referring to the relations between corresponding categories in the source and target taxonomies. However, these techniques may not be effective, since the concepts of the corresponding categories may overlap only partially. In this paper we present an effective approach for integrating taxonomies and alleviating the partial overlap problem by considering fine-grained relations using a Maximum Entropy Model. The experimental results show that the proposed approach improves the classification accuracy of taxonomies over previous approaches.
- Session 3A: Web IR | Pp. 190-205
doi: 10.1007/11562382_16
WIDIT: Fusion-Based Approach to Web Search Optimization
Kiduk Yang; Ning Yu
To facilitate both the understanding and the discovery of information, we need to utilize multiple sources of evidence, integrate a variety of methodologies, and combine human capabilities with those of the machine. The Web Information Discovery Integrated Tool (WIDIT) Laboratory at the School of Library and Information Science, Indiana University-Bloomington, houses several projects that employ this idea of multi-level fusion in the areas of information retrieval and knowledge discovery. This paper describes a Web search optimization study by the TREC research group of WIDIT, which explores a fusion-based approach to enhancing retrieval performance on the Web. In the study, we employed both static and dynamic tuning methods to optimize the fusion formula that combines multiple sources of evidence. By static tuning, we refer to the typical stepwise tuning of system parameters based on training data. “Dynamic tuning”, the key idea of which is to combine human intelligence, especially pattern recognition ability, with the computational power of the machine, involves an interactive system tuning process that facilitates fine-tuning of the system parameters based on the cognitive analysis of immediate system feedback. The rest of the paper is organized as follows. The next section discusses related work in Web information retrieval (IR). Section 3 details the WIDIT approach to Web IR, followed by the description of our experiment using the TREC .gov data in section 4 and the discussion of results in section 5.
- Session 3A: Web IR | Pp. 206-220
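The abstract does not give the fusion formula, so the following is only a sketch of the general idea: a weighted sum of scores from several evidence sources, with the weights chosen by a plain grid search on training data as a stand-in for static tuning. The function names, the weight grid, and the reciprocal-rank objective are all assumptions, not the WIDIT method.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple


def fuse(scores: Sequence[Dict[str, float]], weights: Sequence[float]) -> List[str]:
    """Rank documents by a weighted sum of their per-source scores."""
    fused: Dict[str, float] = {}
    for source, weight in zip(scores, weights):
        for doc, score in source.items():
            fused[doc] = fused.get(doc, 0.0) + weight * score
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


def tune_weights(
    scores: Sequence[Dict[str, float]],
    relevant: Set[str],
    grid: Sequence[float] = (0.0, 0.5, 1.0),
) -> Tuple[float, ...]:
    """Pick the weight combination with the best reciprocal rank on training data."""

    def reciprocal_rank(ranking: List[str]) -> float:
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                return 1.0 / rank
        return 0.0

    candidates = product(grid, repeat=len(scores))
    return max(candidates, key=lambda w: reciprocal_rank(fuse(scores, w)))


content = {"d1": 0.9, "d2": 0.4}  # made-up content-based scores
links = {"d1": 0.1, "d2": 0.8}    # made-up link-based scores
best = tune_weights([content, links], relevant={"d2"})
print(fuse([content, links], best)[0])  # d2
```

Dynamic tuning, as described in the abstract, involves a human inspecting system feedback interactively, which a batch search like this does not capture.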
doi: 10.1007/11562382_17
Transactional Query Identification in Web Search
In-Ho Kang
User queries on the Web can be classified into three types according to the user’s intention: informational, navigational and transactional queries. In this paper, a query type classification method and Service Link information for transactional queries are proposed. Web-mediated activity is usually implemented by hyperlinks. Hyperlinks can be good indicators for classifying queries and retrieving good answer pages for transactional queries. A hyperlink associated with an anchor text implies an anticipated action on the linked object. Possible actions are reading, visiting and downloading the linked object. We can assign a possible action to each anchor text. These tagged anchor texts can be used as training data for query type classification. In this way, a large-scale and dynamic training query set can be collected automatically. To assess the accuracy of the proposed classification method, various experiments were conducted. In these experiments, the classification method achieved 91% of the possible improvement for transactional queries.
- Session 3A: Web IR | Pp. 221-232
doi: 10.1007/11562382_18
Improving FAQ Retrieval Using Query Log Clustering in Latent Semantic Space
Harksoo Kim; Hyunjung Lee; Jungyun Seo
Lexical disagreement problems often occur in FAQ retrieval because FAQs, unlike general documents, consist of just one or two sentences. To resolve lexical disagreement problems, we propose a high-performance FAQ retrieval system using query log clustering. At indexing time, using latent semantic analysis techniques, the proposed system classifies and groups the logs of users’ queries into predefined FAQ categories. At retrieval time, the proposed system uses the query log clusters as a form of FAQ smoothing. In our experiment, we found that the proposed system could resolve some lexical disagreement problems between queries and FAQs.
- Session 3B: Question Answering | Pp. 233-245
doi: 10.1007/11562382_19
Phrase-Based Definitional Question Answering Using Definition Terminology
Kyoung-Soo Han; Young-In Song; Sang-Bum Kim; Hae-Chang Rim
We propose a definitional question answering method using linguistic information and definition terminology-based ranking. We introduce syntactic definition patterns, which are easily constructed and reduce the coverage problem. Phrases are extracted using the syntactic patterns, and redundancy is eliminated based on lexical overlap and semantic matching. To rank the phrases, we use several kinds of evidence, including external definitions and definition terminology. Although external definitions are useful, they obviously cannot cover all possible targets. The definition terminology score, which reflects how definition-like a phrase is, is devised to compensate for the incomplete external definitions. Experimental results show that our method is effective.
- Session 3B: Question Answering | Pp. 246-259
doi: 10.1007/11562382_20
Enhanced Question Answering with Combination of Pre-acquired Answers
Hyo-Jung Oh; Chung-Hee Lee; Hyeon-Jin Kim; Myung-Gil Jang
Recently, there has been a need for QA systems that answer various types of user questions. Among these questions, we focus on record questions and descriptive questions. For these questions, pre-acquired answers should be prepared, whereas traditional QA finds appropriate answers in real time. In this paper, we propose an enhanced QA model that combines various pre-acquired answers from an encyclopedia. We defined the pre-acquired answer types, 55 Record Types (RTs) and 10 Descriptive Answer Types (DATs), in advance. To construct answer units, we built 183 Record Answer Indexing Templates and 3,254 descriptive patterns. We discuss how our proposed model is applied to record and descriptive questions, with some experiments.
- Session 3B: Question Answering | Pp. 260-273