Publications catalog - books


Knowledge Discovery in Databases: PKDD 2005: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, October 3-7, 2005, Proceedings

Alípio Mário Jorge ; Luís Torgo ; Pavel Brazdil ; Rui Camacho ; João Gama (eds.)

Conference: 9th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD). Porto, Portugal. October 3-7, 2005

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability
Detected institution   Publication year   Browse   Download   Request
Not detected   2005   SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-29244-9

Electronic ISBN

978-3-540-31665-7

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Table of contents

Object Identification with Attribute-Mediated Dependences

Parag Singla; Pedro Domingos

Object identification is the problem of determining whether different observations correspond to the same object. It occurs in a wide variety of fields, including vision, natural language, citation matching, and information integration. Traditionally, the problem is solved separately for each pair of observations, followed by transitive closure. We propose solving it collectively, performing simultaneous inference for all candidate match pairs, and allowing information to propagate from one candidate match to another via the attributes they have in common. Our formulation is based on conditional random fields, and allows an optimal solution to be found in polynomial time using a graph cut algorithm. Parameters are learned using a voted perceptron algorithm. Experiments on real and synthetic datasets show that this approach outperforms the standard one.

Keywords: Transitive Closure; Conditional Random Field; Collective Model; Candidate Pair; Record Pair.

- Long Papers | Pp. 297-308
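
The abstract above contrasts the collective approach with the traditional one, in which pairs are matched independently and the results are merged by transitive closure. As a rough illustration of that baseline only (not the authors' conditional random field method), the closure step can be sketched with a union-find structure. The function name `cluster_matches` and the input format are assumptions made for this example.

```python
def cluster_matches(n_records, matched_pairs):
    """Group record indices into objects via the transitive closure of pairwise matches.

    n_records and matched_pairs are hypothetical inputs: records are numbered
    0..n_records-1 and each pair (a, b) is a match decided independently.
    """
    parent = list(range(n_records))

    def find(i):
        # Walk up to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    clusters = {}
    for i in range(n_records):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Records 0-1 and 1-2 were matched pairwise, so closure also links 0 and 2.
print(cluster_matches(5, [(0, 1), (1, 2), (3, 4)]))
```

Because each pair is decided in isolation, a single wrong match merges two whole groups here, which is the kind of weakness a collective formulation is meant to address.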

Weka4WS: A WSRF-Enabled Weka Toolkit for Distributed Data Mining on Grids

Domenico Talia; Paolo Trunfio; Oreste Verta

This paper presents Weka4WS, a framework that extends the Weka toolkit for supporting distributed data mining on Grid environments. Weka4WS adopts the emerging Web Services Resource Framework (WSRF) for accessing remote data mining algorithms and managing distributed computations. The Weka4WS user interface is a modified Weka Explorer environment that supports the execution of both local and remote data mining tasks. On every computing node, a WSRF-compliant Web Service is used to expose all the data mining algorithms provided by the Weka library. The paper describes the design and the implementation of Weka4WS using a first release of the WSRF library. To evaluate the efficiency of the proposed system, a performance analysis of Weka4WS for executing distributed data mining tasks in different network scenarios is presented.

Keywords: Execution Time; Data Mining; Association Rule; Computing Node; Total Execution Time.

- Long Papers | Pp. 309-320

Using Inductive Logic Programming for Predicting Protein-Protein Interactions from Multiple Genomic Data

Tuan Nam Tran; Kenji Satou; Tu Bao Ho

Protein-protein interactions play an important role in many fundamental biological processes. Computational approaches for predicting protein-protein interactions are essential to infer the functions of unknown proteins, and to validate the results obtained by experimental methods on protein-protein interactions. We have developed an approach using Inductive Logic Programming (ILP) for protein-protein interaction prediction by exploiting multiple genomic data including protein-protein interaction data, SWISS-PROT database, cell cycle expression data, Gene Ontology, and InterPro database. The proposed approach demonstrates a promising result in terms of obtaining high sensitivity/specificity and comprehensible rules that are useful for predicting novel protein-protein interactions. We have also applied our method to a number of protein-protein interaction datasets, demonstrating an improvement on the expression profile reliability (EPR) index.

Keywords: Inductive Logic Program; Domain Pair; Protein Secondary Structure Prediction; Inductive Logic Program System; Bottom Clause.

- Long Papers | Pp. 321-330

ISOLLE: Locally Linear Embedding with Geodesic Distance

Claudio Varini; Andreas Degenhard; Tim Nattkemper

Locally Linear Embedding (LLE) has recently been proposed as a method for dimensionality reduction of high-dimensional nonlinear data sets. In LLE each data point is reconstructed from a linear combination of its n nearest neighbors, which are typically found using the Euclidean distance. We propose an extension of LLE which consists in performing the search for the neighbors with respect to the geodesic distance (ISOLLE). In this study we show that the usage of this metric can lead to a more accurate preservation of the data structure. The proposed approach is validated on both real-world and synthetic data.

Keywords: Short Circuit; Geodesic Distance; Locally Linear Embedding; Linear Embedding; Source Vertex.

- Long Papers | Pp. 331-342
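
The ISOLLE abstract above describes choosing each point's neighbors by geodesic rather than Euclidean distance. The abstract does not say how the geodesic distance is computed; the sketch below assumes it is approximated by shortest paths over a small k-nearest-neighbor graph. The function names and the `graph_k` parameter are hypothetical, and the LLE reconstruction step itself is omitted.

```python
import heapq
import math


def knn_graph(points, graph_k):
    """Symmetric graph linking each point to its graph_k Euclidean nearest points."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        for j in others[:graph_k]:
            graph[i][j] = graph[j][i] = math.dist(p, points[j])
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from source to every reachable vertex."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def geodesic_neighbors(points, n_neighbors, graph_k):
    """For each point, the n_neighbors closest points by graph (geodesic) distance."""
    graph = knn_graph(points, graph_k)
    result = []
    for i in range(len(points)):
        dist = geodesic_distances(graph, i)
        ranked = sorted((j for j in dist if j != i), key=lambda j: (dist[j], j))
        result.append(ranked[:n_neighbors])
    return result


def euclidean_neighbors(points, n_neighbors):
    """Baseline: the n_neighbors closest points by straight-line distance."""
    result = []
    for i, p in enumerate(points):
        ranked = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: (math.dist(p, points[j]), j))
        result.append(ranked[:n_neighbors])
    return result


# A hairpin curve: lower arm (indices 0-4), one bend point (5), upper arm (6-10).
hairpin = ([(float(x), 0.0) for x in range(5)] + [(4.8, 1.25)]
           + [(float(x), 2.5) for x in range(4, -1, -1)])

print(euclidean_neighbors(hairpin, 4)[1])    # includes index 9 on the other arm
print(geodesic_neighbors(hairpin, 4, 2)[1])  # stays on the lower arm
```

In this toy data the Euclidean neighborhood of point 1 jumps across the gap between the two arms, while the graph-based neighborhood follows the curve.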

Active Sampling for Knowledge Discovery from Biomedical Data

Sriharsha Veeramachaneni; Francesca Demichelis; Emanuele Olivetti; Paolo Avesani

We describe work aimed at cost-constrained knowledge discovery in the biomedical domain. To improve the diagnostic/prognostic models of cancer, researchers study new biomarkers that might provide predictive information. Biological samples from monitored patients are selected and analyzed to determine the predictive power of the biomarker. During the process of biomarker evaluation, portions of the samples are consumed, limiting the number of measurements that can be performed. The biological samples obtained from carefully monitored patients, which are well annotated with pathological information, are a valuable resource that must be conserved. We present an active sampling algorithm derived from statistical first principles to incrementally choose the samples that are most informative in estimating the efficacy of the candidate biomarker. We provide empirical evidence on real biomedical data that our active sampling algorithm requires significantly fewer samples than random sampling to ascertain the efficacy of the new biomarker.

Keywords: Mean Square Error; Knowledge Discovery; Class Label; Active Sampling; Biomedical Data.

- Long Papers | Pp. 343-354

A Multi-metric Index for Euclidean and Periodic Matching

Michail Vlachos; Zografoula Vagena; Vittorio Castelli; Philip S. Yu

In many classification and data-mining applications the user does not know a priori which distance measure is the most appropriate for the task at hand without examining the produced results. Also, in several cases, different distance functions can provide diverse but equally intuitive results (according to the specific focus of each measure). In order to address the above issues, we elaborate on the construction of a hybrid index structure that supports query-by-example on shape and structural distance measures, therefore lending enhanced exploratory power to the system user. The shape distance measure that the index supports is the ubiquitous Euclidean distance, while the structural distance measure that we utilize is based on important periodic features extracted from a sequence. This new measure is phase-invariant and can provide flexible sequence characterizations, loosely resembling Dynamic Time Warping while requiring only a fraction of its computational cost. Exploiting the relationship between the Euclidean and periodic measure, the new hybrid index allows for powerful query processing, enabling the efficient answering of kNN queries on both measures in a single index scan. We envision that our system can provide a basis for fast tracking of correlated time-delayed events, with applications in data visualization, financial market analysis, machine monitoring/diagnostics and gene expression data analysis.

Keywords: Periodic Measure; Index Structure; Vantage Point; Dynamic Time Warping; Hybrid Index.

- Long Papers | Pp. 355-367

Fast Burst Correlation of Financial Data

Michail Vlachos; Kun-Lung Wu; Shyh-Kwei Chen; Philip S. Yu

We examine the problem of monitoring and identification of correlated burst patterns in multi-stream time series databases. Our methodology comprises two steps: a burst detection part, followed by a burst indexing step. The burst detection scheme imposes a variable threshold on the examined data and takes advantage of the skewed distribution that is typically encountered in many applications. The indexing step utilizes a memory-based interval index for effectively identifying the overlapping burst regions. While the focus of this work is on financial data, the proposed methods and data-structures can find applications for anomaly or novelty detection in telecommunications and network traffic, as well as in medical data. Finally, we show the real-time response of our burst indexing technique, and demonstrate the usefulness of the approach for correlating surprising volume trading events at the NY stock exchange.

Keywords: Trading Volume; Concept Drift; Burst Detection; Input Interval; Burst Interval.

- Long Papers | Pp. 368-379
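
The two-step scheme in the abstract above (detect bursts with a threshold, then find overlapping burst intervals across streams) can be illustrated with a small sketch. It rests on simplifying assumptions that are not taken from the paper: a single per-stream threshold of mean plus `num_std` standard deviations stands in for the variable threshold, and a naive pairwise scan stands in for the memory-based interval index. All names are hypothetical.

```python
from statistics import mean, pstdev


def detect_bursts(series, num_std=1.0):
    """Return (start, end) index intervals where values exceed the threshold.

    Assumption: threshold = mean + num_std * standard deviation of the series.
    """
    threshold = mean(series) + num_std * pstdev(series)
    bursts, start = [], None
    for t, value in enumerate(series):
        if value > threshold and start is None:
            start = t                      # a burst begins
        elif value <= threshold and start is not None:
            bursts.append((start, t - 1))  # the burst ended at the previous step
            start = None
    if start is not None:
        bursts.append((start, len(series) - 1))  # burst runs to the end
    return bursts


def overlapping_bursts(bursts_a, bursts_b):
    """All pairs of closed intervals, one from each stream, that share a time step."""
    return [(a, b) for a in bursts_a for b in bursts_b
            if a[0] <= b[1] and b[0] <= a[1]]


volume_a = [1, 1, 1, 9, 9, 1, 1, 1, 1, 1]
volume_b = [1, 1, 1, 1, 8, 8, 8, 1, 1, 1]
volume_c = [8, 8, 1, 1, 1, 1, 1, 1, 1, 1]

bursts_a = detect_bursts(volume_a)
bursts_b = detect_bursts(volume_b)
bursts_c = detect_bursts(volume_c)
print(bursts_a, bursts_b, bursts_c)
print(overlapping_bursts(bursts_a, bursts_b))
print(overlapping_bursts(bursts_a, bursts_c))
```

The pairwise scan costs time proportional to the product of the two burst counts; an interval index, as the abstract describes, would be the natural replacement once many streams are monitored.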

A Propositional Approach to Textual Case Indexing

Nirmalie Wiratunga; Rob Lothian; Sutanu Chakraborti; Ivan Koychev

Problem solving with experiences that are recorded in text form requires a mapping from text to structured cases, so that case comparison can provide informed feedback for reasoning. One of the challenges is to acquire an indexing vocabulary to describe cases. We explore the use of machine learning and statistical techniques to automate aspects of this acquisition task. A propositional semantic indexing tool, Psi, which forms its indexing vocabulary from new features extracted as logical combinations of existing keywords, is presented. We propose that such logical combinations correspond more closely to natural concepts and are more transparent than linear combinations. Experiments show Psi-derived case representations to have superior retrieval performance to the original keyword-based representations. Psi also has comparable performance to Latent Semantic Indexing, a popular dimensionality reduction technique for text, which unlike Psi generates linear combinations of the original features.

Keywords: Feature Selection; Feature Extraction; Association Rule; Information Gain; Logical Combination.

- Long Papers | Pp. 380-391

A Quantitative Comparison of the Subgraph Miners MoFa, gSpan, FFSM, and Gaston

Marc Wörlein; Thorsten Meinl; Ingrid Fischer; Michael Philippsen

Several new miners for frequent subgraphs have been published recently. Whereas new approaches are presented in detail, the quantitative evaluations are often of limited value: only the performance on a small set of graph databases is discussed and the new algorithm is often only compared to a single competitor based on an executable. It remains unclear how the algorithms work on bigger/other graph databases and which of their distinctive features is best suited for which database. We have re-implemented the subgraph miners MoFa, gSpan, FFSM, and Gaston within a common code base and with the same level of programming expertise and optimization effort. This paper presents the results of a comparative benchmarking that ran the algorithms on a comprehensive set of graph databases.

- Long Papers | Pp. 392-403

Efficient Classification from Multiple Heterogeneous Databases

Xiaoxin Yin; Jiawei Han

With the fast expansion of computer networks, it is inevitable to study data mining on heterogeneous databases. In this paper we propose MDBM, an accurate and efficient approach for classification on multiple heterogeneous databases. We propose a regression-based method for predicting the usefulness of inter-database links that serve as bridges for information transfer, because such links are automatically detected and may or may not be useful or even valid. Because of the high cost of inter-database communication, MDBM employs a new strategy for cross-database classification, which finds and performs actions with high benefit-to-cost ratios. The experiments show that MDBM achieves high accuracy in cross-database classification, with much higher efficiency than previous approaches.

Keywords: Class Label; Association Rule Mining; Privacy Preserve; Loan Application; Heterogeneous Database.

- Long Papers | Pp. 404-416